Mini Project III | likelion |
[ MiniProject III ] Monthly Dacon Credit Card User Delinquency Prediction AI Competition
🦁 Project
Likelion AI School 7th MiniProject 3
🙋‍♀️🙋 Team 7
์ฆ์
์ผ์ฐ์ผ์ฐ2ํ - ์ด์นํ, ์์ฐ์ง, ๋จํ์ค, ๊น์ค๋ชจ
🗓️ When
2022.11.02 - 2022.11.06
For this mini project, Team 7 took the Monthly Dacon Credit Card User Delinquency Prediction AI Competition, which closed in May 2021, as its topic: build a model and explore ways to achieve higher performance.
🗃️ Dataset
- index
- gender: sex
- car: car ownership
- reality: real-estate ownership
- child_num: number of children
- income_total: annual income
- income_type: income category
['Commercial associate', 'Working', 'State servant', 'Pensioner', 'Student']
- edu_type: education level
['Higher education', 'Secondary / secondary special', 'Incomplete higher', 'Lower secondary', 'Academic degree']
- family_type: marital status
['Married', 'Civil marriage', 'Separated', 'Single / not married', 'Widow']
- house_type: housing type
['Municipal apartment', 'House / apartment', 'With parents', 'Co-op apartment', 'Rented apartment', 'Office apartment']
- DAYS_BIRTH: birth date. Counted backwards from the data-collection date (0); i.e. -1 means the person was born one day before collection.
- DAYS_EMPLOYED: employment start date. Counted backwards from the collection date (0); i.e. -1 means the person started working one day before collection. Positive values mean the person is not employed.
- FLAG_MOBIL: mobile phone ownership
- work_phone: work phone ownership
- phone: phone ownership
- email: email ownership
- occyp_type: occupation type
- family_size: family size
- begin_month: credit card issue month. Counted backwards from the collection date (0); i.e. -1 means the card was issued one month before collection.
- credit: creditworthiness based on the user's credit card payment delinquency
=> lower values indicate a higher-credit card user
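The backwards day-count convention for DAYS_BIRTH and DAYS_EMPLOYED can be sanity-checked on a couple of toy rows (the values below are made up, not taken from the competition data):

```python
import pandas as pd

# Toy rows illustrating the day-offset convention (made-up values)
df = pd.DataFrame({
    "DAYS_BIRTH": [-12775, -18250],    # days before the collection date
    "DAYS_EMPLOYED": [-1200, 365243],  # positive => not employed
})

# Age in whole years: negate first, then integer-divide by 365
df["age"] = -df["DAYS_BIRTH"] // 365

# Treat positive DAYS_EMPLOYED as "not employed"
df.loc[df["DAYS_EMPLOYED"] > 0, "DAYS_EMPLOYED"] = 0

print(df)
```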
💻 Closer Look
First, submitting the provided sample_submission.csv as-is scored 0.88113 / 0.87223 (Public / Private). The baseline code uses a Random Forest Classifier.
1. Binary variables
train['gender'] = train['gender'].replace(['F','M'],[0,1])
test['gender'] = test['gender'].replace(['F','M'],[0,1])
print('gender :')
print(train['gender'].value_counts())
print('--------------')
print('Having a car or not : ')
train['car'] = train['car'].replace(['N','Y'],[0,1])
test['car'] = test['car'].replace(['N','Y'],[0,1])
print(train['car'].value_counts())
print('--------------')
print('Owning real estate (reality) or not: ')
train['reality'] = train['reality'].replace(['N','Y'],[0,1])
test['reality'] = test['reality'].replace(['N','Y'],[0,1])
print(train['reality'].value_counts())
print('--------------')
print('Having a phone or not: ')
print(train['phone'].value_counts())
print('--------------')
print('Having an email or not: ')
print(train['email'].value_counts())
print('--------------')
print('Having a work phone or not: ')
print(train['work_phone'].value_counts())
print('--------------')
2. Child_num
child_num values of 3 or more were judged to be rare and unified to 3.
train['child_num'].value_counts(sort=False).plot.bar()
train.loc[train['child_num'] >= 3,'child_num']=3
test.loc[test['child_num'] >= 3, 'child_num']=3
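The same capping can also be written with pandas' clip; a minimal sketch on a hypothetical child_num column:

```python
import pandas as pd

# Hypothetical child counts; clip(upper=3) caps everything above 3,
# equivalent to the .loc assignment used above
s = pd.Series([0, 1, 2, 3, 5, 14])
capped = s.clip(upper=3)
print(capped.tolist())  # [0, 1, 2, 3, 3, 3]
```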
3. Family_size
train['family_size'].value_counts().sort_values(ascending = False)
2.0 14106
1.0 5109
3.0 4632
4.0 2260
5.0 291
6.0 44
7.0 9
15.0 3
9.0 2
20.0 1
Name: family_size, dtype: int64
family_size values greater than 7 were judged to be rare and all unified to 7.
train.loc[train['family_size'] > 7, 'family_size'] = 7
test.loc[test['family_size'] > 7, 'family_size'] = 7
4. income
income_total is spread over a wide range. Let's scale it, then bin it into categories and visualize.
train['income_total'] = train['income_total'].astype(float)
train['income_total'] = train['income_total']/10000
test['income_total'] = test['income_total']/10000
print(train['income_total'].value_counts(bins=10, sort=False))
train['income_total'].plot(kind='hist', bins=50, density=True)
Let's also compute the average income per family member and add it as a feature.
train['income_mean'] = train['income_total'] / train['family_size']
test['income_mean'] = test['income_total'] / test['family_size']
train['income_mean']
Now, bin income_total into categories!
count, bin_dividers = np.histogram(train['income_total'], bins=7)
bin_names = ['income' + str(i) for i in range(7)]
# bin_dividers are computed from train!
train['income_total'] = pd.cut(x=train['income_total'], bins=bin_dividers, labels=bin_names, include_lowest=True)
test['income_total'] = pd.cut(x=test['income_total'], bins=bin_dividers, labels=bin_names, include_lowest=True)
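Why must bin_dividers come from train only? A toy sketch (made-up income values, not the competition data) showing that reusing train's edges keeps the categories identical across both frames:

```python
import numpy as np
import pandas as pd

# Stand-in for income_total: edges come from TRAIN only,
# then the same edges are applied to test
train_income = pd.Series([10.0, 12.0, 15.0, 20.0, 30.0, 45.0, 70.0])
test_income = pd.Series([11.0, 33.0, 69.0])

counts, bin_dividers = np.histogram(train_income, bins=7)
bin_names = ["income" + str(i) for i in range(7)]

train_binned = pd.cut(train_income, bins=bin_dividers,
                      labels=bin_names, include_lowest=True)
test_binned = pd.cut(test_income, bins=bin_dividers,
                     labels=bin_names, include_lowest=True)

# Both frames now share the exact same category set
print(train_binned.cat.categories.equals(test_binned.cat.categories))
```

Had test been binned with its own np.histogram edges, the categories would cover different value ranges and the later label encoding would no longer mean the same thing in train and test.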
5. Label Encoding
from sklearn import preprocessing
label_encoder=preprocessing.LabelEncoder()
train['income_type']=label_encoder.fit_transform(train['income_type'])
test['income_type']=label_encoder.transform(test['income_type'])
########################################################################
train['edu_type']=label_encoder.fit_transform(train['edu_type'])
test['edu_type']=label_encoder.transform(test['edu_type'])
########################################################################
train['family_type']=label_encoder.fit_transform(train['family_type'])
test['family_type']=label_encoder.transform(test['family_type'])
########################################################################
train['house_type']=label_encoder.fit_transform(train['house_type'])
test['house_type']=label_encoder.transform(test['house_type'])
########################################################################
train['income_total']=label_encoder.fit_transform(train['income_total'])
test['income_total']=label_encoder.transform(test['income_total'])  # transform, not fit_transform: reuse train's mapping
########################################################################
train['occyp_type']=label_encoder.fit_transform(train['occyp_type'])
test['occyp_type']=label_encoder.transform(test['occyp_type'])  # same: encode test with the train-fitted encoder
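The fit-on-train / transform-on-test pattern used above, sketched on a few made-up income_type values:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
train_vals = ["Working", "Pensioner", "Student", "Working"]
test_vals = ["Student", "Pensioner"]

train_enc = le.fit_transform(train_vals)  # learn the mapping on train
test_enc = le.transform(test_vals)        # reuse it on test

# LabelEncoder sorts classes alphabetically before numbering them
print(dict(zip(le.classes_, range(len(le.classes_)))))
```

Calling fit_transform on test instead would refit the encoder on test's own (possibly differently ordered) categories, silently breaking the correspondence between train and test codes.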
6. Minus continuous variables
def make_bin(variable, n):
    train[variable] = -train[variable]
    test[variable] = -test[variable]
    count, bin_dividers = np.histogram(train[variable], bins=n)  # bin edges are computed from train
    bin_names = [str(i) for i in range(n)]
    train[variable] = pd.cut(x=train[variable], bins=bin_dividers, labels=bin_names, include_lowest=True)
    test[variable] = pd.cut(x=test[variable], bins=bin_dividers, labels=bin_names, include_lowest=True)
    test[variable].fillna(str(0), inplace=True)  # test values outside train's range fall back to the lowest bin
    train[variable] = label_encoder.fit_transform(train[variable])
    test[variable] = label_encoder.transform(test[variable])
The number of bins for each variable was chosen with the following reasoning.
# DAYS_BIRTH (birth date)
# max - min = 17,447 days = 581 months = 48.46 years -> about 50 years
# DAYS_EMPLOYED (employment start date)
# max - min = 380,956 days = 12,698.53 months = 1,058 years
# positive values (not employed) -> set to 0
# after handling -> 523 months -> about 43.64 years
# begin_month (card issue month)
# max - min = 60 months = 5 years
train.loc[train['DAYS_EMPLOYED'] > 0,'DAYS_EMPLOYED'] = 0
test.loc[test['DAYS_EMPLOYED'] > 0, 'DAYS_EMPLOYED'] = 0
# DAYS_BIRTH is negative here, so negate before dividing to get a positive age
train['Age'] = -train['DAYS_BIRTH'] // 365
test['Age'] = -test['DAYS_BIRTH'] // 365
make_bin('DAYS_BIRTH', n=10)
make_bin('DAYS_EMPLOYED', n=4)
make_bin('begin_month', n=5)
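A self-contained sketch of what make_bin does, on toy DAYS_BIRTH frames (the real function mutates the global train/test; here everything is local and the values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames: negate, bin with train-derived edges, then label-encode
train = pd.DataFrame({"DAYS_BIRTH": [-7300, -10950, -14600, -18250]})
test = pd.DataFrame({"DAYS_BIRTH": [-9000, -17000]})

for df in (train, test):
    df["DAYS_BIRTH"] = -df["DAYS_BIRTH"]

counts, edges = np.histogram(train["DAYS_BIRTH"], bins=4)  # edges from train
names = [str(i) for i in range(4)]
for df in (train, test):
    df["DAYS_BIRTH"] = pd.cut(df["DAYS_BIRTH"], bins=edges,
                              labels=names, include_lowest=True)

le = LabelEncoder()
train["DAYS_BIRTH"] = le.fit_transform(train["DAYS_BIRTH"])
test["DAYS_BIRTH"] = le.transform(test["DAYS_BIRTH"])
print(train["DAYS_BIRTH"].tolist(), test["DAYS_BIRTH"].tolist())
```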
7. Correlation
plt.figure(figsize = (14, 6))
sns.heatmap(train.corr(), fmt = '.2f',
# vmin = -0.5, vmax = 0.5,
annot = True,
cmap = 'coolwarm')
8. Modeling
The model was trained with a CatBoost Classifier.
from catboost import CatBoostClassifier
from sklearn.metrics import log_loss

clf = CatBoostClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_val)
log_loss_val = log_loss(y_val['credit'], y_pred)
print(f"log_loss: {log_loss_val}")
=> 0.8091264167319345
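The metric throughout is multiclass log loss over the three credit classes, taking the true labels and an (n_samples, 3) probability matrix. A minimal example with made-up probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy 3-class example mirroring the credit target (classes 0/1/2)
y_true = [0, 1, 2, 1]
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],
])

# Mean negative log-probability assigned to the true class
score = log_loss(y_true, y_prob, labels=[0, 1, 2])
print(round(score, 3))
```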
9. Feature importance visualization
def plot_feature_importance(importance, names, model_type):
    feature_importance = np.array(importance)
    feature_names = np.array(names)
    data = {'feature_names': feature_names,
            'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)
    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
    plt.figure(figsize=(10, 8))
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    plt.title(model_type + ' Feature Importance')
    plt.xlabel('Feature Importance')
    plt.ylabel('Feature Names')

plot_feature_importance(clf.get_feature_importance(), test_x.columns, 'CATBOOST')
10. Cross Validation
from sklearn.model_selection import StratifiedKFold

def run_kfold(clf):
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=55)
    outcomes = []
    sub = np.zeros((test_x.shape[0], 3))  # accumulated class probabilities for the test set
    for n_fold, (train_index, val_index) in enumerate(folds.split(train_x, train_y)):
        X_train, X_val = train_x.iloc[train_index], train_x.iloc[val_index]
        y_train, y_val = train_y.iloc[train_index], train_y.iloc[val_index]
        clf.fit(X_train, y_train)
        predictions = clf.predict_proba(X_val)
        logloss = log_loss(y_val['credit'], predictions)
        outcomes.append(logloss)
        print(f"FOLD {n_fold} : logloss:{logloss}")
        sub += clf.predict_proba(test_x)
    mean_outcome = np.mean(outcomes)
    print("Mean:{}".format(mean_outcome))
    return sub / folds.n_splits  # average the folds' test predictions

my_submission = run_kfold(clf)
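The fold-averaging scheme above, sketched end-to-end on synthetic data, with LogisticRegression standing in for CatBoost so the snippet runs without extra dependencies:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class problem standing in for the credit data
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=55)
X_test = X[:20]  # stand-in for the real test set

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=55)
sub = np.zeros((X_test.shape[0], 3))
scores = []
for tr_idx, val_idx in folds.split(X, y):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[tr_idx], y[tr_idx])
    scores.append(log_loss(y[val_idx], clf.predict_proba(X[val_idx])))
    sub += clf.predict_proba(X_test)  # accumulate each fold's test predictions

sub /= folds.n_splits  # average the five folds' probability matrices
print(np.mean(scores), sub.shape)
```

Averaging probabilities across folds keeps each row a valid distribution (rows still sum to 1) while smoothing out individual-fold variance.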
⚡️ History and final-score comparison below
History
| Baseline | Final Score |
|---|---|
| 0.872 | 0.7836 |
# baseline => 0.88
# scores: ( Public / Private )
########## RandomForest ##########
# 1
# baseline_submission_0.983 -> 0.86583 / 0.8516
# label encoding of occupation type (occyp_type)
# 2
# baseline_submission_0.951 -> 0.86071 / 0.8465
# handled positive DAYS_EMPLOYED values
# 3
# baseline_submission_0.923 -> 0.85454
# DAYS_BIRTH / DAYS_EMPLOYED / begin_month statistics -> adjusted the number of bins
############ XGBoost ############
# 4
# baseline_submission_0.856 -> 0.8738 / 0.8654
# XGBoost
# performance actually got worse
########## CatBoost ##########
# 5
# baseline_submission_0.824 -> 0.8013 / 0.7931
# removed missing values
# removed meaningless columns: index, FLAG_MOBIL
# FLAG_MOBIL: all rows hold the same value, 1
# 6
# baseline_submission_0.819 -> 0.7978 / 0.7978
# added the Age column
# 7
# baseline_submission_0.821
# unified family_size values of 7 and above
# 8
# performance dropped -> DAYS_BIRTH brought back in # 9
# dropped DAYS_BIRTH, work_phone
# work_phone: correlation with credit is 0
# 9
# baseline_submission_0.808 -> 0.7904 / 0.7836
# dropped only the work_phone column, kept DAYS_BIRTH
# computed income_mean