Written by nodejs-style
on 2021-12-20

Seoul Ttareungi Bike Usage Prediction AI Model (Bike Sharing Demand)

Lv1. Predicting Ttareungi data with a decision tree regressor

https://dacon.io/competitions/open/235576/overview

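[Data]

Hourly Ttareungi rental counts and weather conditions from April 1 to May 31, 2017

id : id for each date and hour

hour_bef_temperature : temperature one hour earlier

hour_bef_precipitation : rain indicator one hour earlier, 0 if it did not rain, 1 if it rained

hour_bef_windspeed : wind speed one hour earlier (average)

hour_bef_humidity : humidity one hour earlier

hour_bef_visibility : visibility one hour earlier (how far one can see under the given weather conditions)

hour_bef_ozone : ozone one hour earlier

hour_bef_pm10 : fine dust one hour earlier (particles about 1/5 to 1/7 the width of a human hair)

hour_bef_pm2.5 : fine dust one hour earlier (particles about 1/20 to 1/30 the width of a human hair)

count : number of Ttareungi rentals in that hour

[Goal]

For each date, predict the Ttareungi rental count one hour later from the weather conditions one hour earlier

[Evaluation metric]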

RMSE
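For reference, RMSE is the square root of the mean squared difference between predicted and actual values. A minimal sketch with NumPy; the function name `rmse` and the sample numbers are illustrative and do not come from the competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up rental counts, for illustration only
actual = [10, 20, 30]
predicted = [12, 18, 33]

# Errors are 2, -2, 3, so the mean squared error is (4 + 4 + 9) / 3
score = rmse(actual, predicted)
print(score)
```

Because the square root is monotonic, a model that lowers MSE also lowers RMSE.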

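[Process summary]

1. EDA

2. Preprocessing : interpolation or per-hour means

3. Modeling : decision trees (the code below uses RandomForestRegressor, an ensemble of decision trees). Drop columns with low feature importance one at a time while checking RMSE, then tune by adjusting hyperparameter values

1. Loading the files ~ EDA

train.head()

train.shape

train.info()

train.describe()

train.groupby('hour')['count'].mean().plot()

Heatmap, etc.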

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv(r'./data/train.csv')
test = pd.read_csv(r'./data/test.csv')
submission = pd.read_csv(r'./data/submission.csv')  # submission template

train.info()  # check for missing values

# Check correlations between variables
plt.figure(figsize=(10, 10))
sns.heatmap(train.corr(), annot=True, fmt='.2f')  # optional: annot_kws={'size': 10}, cmap='rainbow'

train.isna().sum()
train[train['hour_bef_temperature'].isna()]  # locate the missing values
# The missing rows fall at hour 0 and hour 18. Temperatures at those two hours differ a lot,
# so replacing them with the median or mean of the whole temperature column seemed inappropriate.
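The reasoning in the last comment can be checked on toy data. A minimal sketch, assuming a made-up frame `toy` that stands in for train.csv; it counts missing values, finds the hours where temperature is missing, and compares per-hour means with the overall mean.

```python
import numpy as np
import pandas as pd

# Toy data standing in for train.csv (all values are made up)
toy = pd.DataFrame({
    'hour': [0, 0, 18, 18],
    'hour_bef_temperature': [14.0, np.nan, 21.0, np.nan],
    'count': [30, 25, 200, 210],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()

# Hours at which the temperature is missing
missing_hours = toy.loc[toy['hour_bef_temperature'].isna(), 'hour'].tolist()

# Mean temperature per hour vs. the mean over the whole column
hourly_mean = toy.groupby('hour')['hour_bef_temperature'].mean()
overall_mean = toy['hour_bef_temperature'].mean()

print(missing_per_column, missing_hours, hourly_mean, overall_mean, sep='\n')
```

In this toy frame the overall mean sits between the two hourly means, so it would be too warm for hour 0 and too cold for hour 18.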

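2. Preprocessing

Two methods were used here to replace the missing values:

1) Interpolation

2) Replacing missing values with the per-hour mean. This takes longer than interpolation because the per-hour mean has to be checked for every column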

# Interpolation
train.interpolate(inplace=True)
test.interpolate(inplace=True)
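What `interpolate()` does with its default linear method can be seen on a small Series. A minimal sketch with made-up values; note that the fill follows row order and treats rows as equally spaced, so the result depends on how the rows are sorted (whether train.csv is in time order is not stated here).

```python
import numpy as np
import pandas as pd

# Made-up temperatures with two single-row gaps
s = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0])

# Default method='linear': each gap is filled on the straight line
# between its nearest non-missing neighbours in row order
filled = s.interpolate()
print(filled.tolist())
```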

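# Check the per-hour mean and use it to fill the missing values
train.groupby('hour')['hour_bef_temperature'].mean()
# Keys are the row indices of the missing values; assign the result back so the change is kept
train['hour_bef_temperature'] = train['hour_bef_temperature'].fillna({934: 14.788136, 1035: 20.926667})
# Handle the missing values in test the same way. Note that the replacement values are taken from train.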
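Looking up each per-hour mean by hand is the slow part of this method. A sketch of one way to automate it, assuming a hypothetical helper `fill_with_hourly_mean` (not from the original post) and made-up toy frames; the means are always computed from the training frame, including when filling test.

```python
import numpy as np
import pandas as pd

def fill_with_hourly_mean(train_df, target_df, column, hour_col='hour'):
    """Fill NaNs in target_df[column] with the per-hour mean computed from train_df."""
    hourly_mean = train_df.groupby(hour_col)[column].mean()
    # Map each row's hour to the matching training mean (index stays aligned with target_df)
    fill_values = target_df[hour_col].map(hourly_mean)
    out = target_df.copy()
    out[column] = out[column].fillna(fill_values)
    return out

# Made-up data standing in for train.csv and test.csv
toy_train = pd.DataFrame({
    'hour': [0, 0, 0, 18, 18, 18],
    'hour_bef_temperature': [14.0, 16.0, np.nan, 20.0, 22.0, np.nan],
})
toy_test = pd.DataFrame({
    'hour': [0, 18],
    'hour_bef_temperature': [np.nan, 25.0],
})

train_filled = fill_with_hourly_mean(toy_train, toy_train, 'hour_bef_temperature')
test_filled = fill_with_hourly_mean(toy_train, toy_test, 'hour_bef_temperature')
print(train_filled)
print(test_filled)
```

The same call can be repeated for every column that has missing values.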

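3. Modeling

from sklearn.ensemble import RandomForestRegressor

# The competition metric is RMSE, which is the square root of MSE, so MSE is used as the criterion.
# scikit-learn 1.0 renamed 'mse' to 'squared_error'; versions before 1.0 need criterion='mse'.
model = RandomForestRegressor(criterion='squared_error')

X_train = train.drop(['count'], axis=1)
Y_train = train['count']
model.fit(X_train, Y_train)  # train the model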

Check the feature importances, then remove the least important variables one at a time and test several models.

model.feature_importances_  # check the feature importances

# Model 1 = drop the id and count columns
# ('squared_error' is called 'mse' in scikit-learn versions before 1.0)
X_train_1 = train.drop(['id', 'count'], axis=1)
test_1 = test.drop(['id'], axis=1)
model_1 = RandomForestRegressor(criterion='squared_error')
model_1.fit(X_train_1, Y_train)
ypread_1 = model_1.predict(test_1)
submission_1 = pd.read_csv(r'./data/submission.csv')  # submission template
submission_1['count'] = ypread_1
submission_1.to_csv(r'./submission_1.csv', index=False)

# Model 2 = drop id, count, and hour_bef_precipitation
X_train_2 = train.drop(['id', 'count', 'hour_bef_precipitation'], axis=1)
test_2 = test.drop(['id', 'hour_bef_precipitation'], axis=1)
model_2 = RandomForestRegressor(criterion='squared_error')
model_2.fit(X_train_2, Y_train)
ypread_2 = model_2.predict(test_2)
submission_2 = pd.read_csv(r'./data/submission.csv')  # submission template
submission_2['count'] = ypread_2
submission_2.to_csv(r'./submission_2.csv', index=False)

I ran four models this way.

Model 3, which drops id, count, hour_bef_precipitation, and hour_bef_pm2.5, gave the lowest RMSE at about 46.87.
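The drop-and-compare procedure can also be run locally against a held-out split instead of one submission per model. A rough sketch on synthetic data; the column names `temp` and `noise`, the helper `validation_rmse`, and the 25% validation split are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data: 'noise' has no relation to the target
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    'hour': rng.integers(0, 24, n),
    'temp': rng.normal(15, 5, n),
    'noise': rng.normal(0, 1, n),
})
y = 5 * X['hour'] + 2 * X['temp'] + rng.normal(0, 1, n)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

def validation_rmse(columns):
    """Fit a forest on the given columns and return RMSE on the validation split."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr[columns], y_tr)
    pred = model.predict(X_val[columns])
    return float(np.sqrt(np.mean((y_val - pred) ** 2)))

# Rank features with a model fitted on all columns (least important first)
full_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
importances = pd.Series(full_model.feature_importances_, index=X.columns).sort_values()

# Drop the least important remaining feature each round and record the RMSE
columns = list(importances.index)
results = {}
while columns:
    results[tuple(columns)] = validation_rmse(columns)
    columns = columns[1:]

for kept, score in results.items():
    print(kept, round(score, 2))
```

The feature set with the lowest validation RMSE is then the candidate to submit.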

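Next, hyperparameter tuning.

The aim is to curb overfitting and improve the score.

This uses GridSearchCV from the Scikit-learn package.

GridSearch performs an exhaustive search, trying every combination in the specified grid and returning the best one (the downside is that it takes a long time).

1. n_estimators=100 : number of decision trees to train (their outputs are combined into the final prediction)

2. max_depth=None : maximum depth a tree can reach from the root node. Used to prevent overfitting.

3. n_jobs=None : number of CPU cores to use. More cores means faster training. Use -1 for all of them

4. min_samples_leaf : minimum number of samples required at a leaf node. Too small a value can cause overfitting

5. criterion : node split criterion. gini / entropy (measures of data impurity) apply to classifiers; a regressor uses squared_error and similar

6. splitter : node split strategy. random / best. This is a single decision tree parameter; RandomForestRegressor does not take it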

from sklearn.model_selection import GridSearchCV

model = RandomForestRegressor(criterion='squared_error', random_state=0)  # 'mse' before scikit-learn 1.0
parameters = {'n_estimators': [200, 300, 500],
              'max_features': [5, 6, 7, 8],
              'min_samples_leaf': [1, 3, 5]}

# No scoring argument is given, so candidates are ranked by the regressor's default score (R²)
greedy_CV = GridSearchCV(model, param_grid=parameters, cv=3, n_jobs=-1)
greedy_CV.fit(X_train, Y_train)
greedy_CV.best_params_

greedy_CV_pred = greedy_CV.predict(test)
submission['count'] = greedy_CV_pred
submission.to_csv(r'./submission_greedyCV.csv', index=False)
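To see why the exhaustive search is slow, the grid above can be expanded by hand. A small sketch using only the standard library; it enumerates the same parameter grid and counts the cross-validation fits for cv = 3.

```python
from itertools import product

parameters = {
    'n_estimators': [200, 300, 500],
    'max_features': [5, 6, 7, 8],
    'min_samples_leaf': [1, 3, 5],
}
cv = 3

# Every combination of one value per parameter: 3 * 4 * 3 candidates
combinations = [dict(zip(parameters, values)) for values in product(*parameters.values())]

# Each candidate is fitted once per cross-validation fold
n_fits = len(combinations) * cv
print(len(combinations), n_fits)  # 36 108
```

With the default refit=True, GridSearchCV then fits the best candidate once more on the full training data.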

With max_features : 8, min_samples_leaf : 3, n_estimators : 300,

the model recorded a lower RMSE than model 3 tested earlier.
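When no scoring is passed, GridSearchCV ranks regressors by R² rather than RMSE. A sketch of making the search optimize the competition metric directly, assuming synthetic data and a deliberately small grid so it runs quickly; `neg_root_mean_squared_error` is scikit-learn's built-in scorer name, negated because the search maximizes its score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=120)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [20, 40], 'min_samples_leaf': [1, 3]},
    scoring='neg_root_mean_squared_error',  # higher is better, so RMSE is negated
    cv=3,
)
search.fit(X, y)

# Flip the sign back to read the cross-validated RMSE of the best candidate
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```

This gives a local RMSE estimate for each candidate without needing a submission.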

To do: study more and develop this further!

https://www.kaggle.com/c/bike-sharing-demand/data

https://throwexception.tistory.com/1078

https://diane-space.tistory.com/67

https://dailyheumsi.tistory.com/95

https://continuous-development.tistory.com/173

https://github.com/BaekKyunShin/Kaggle/blob/master/Bike_Sharing_Demand/Bike%20Sharing%20Demand%20by%20Ensemble.ipynb

from http://hjryu09.tistory.com/15 by ccl(A) rewrite - 2021-12-20 18:00:24
