Mini Project III | likelion |
🏦 Monthly Dacon Credit Card User Delinquency Prediction AI Competition

[ MiniProject III ] Monthly Dacon Credit Card User Delinquency Prediction AI Competition

🏦 Project

Likelion (멋쟁이사자처럼) AI School 7th MiniProject 3

🙆‍♀️🙆 Team 7ㅏ즈아
으쌰으쌰 Team 2 - 이승후, 상우진, 남하윤, 김준모

🗓️ When
2022.11.02 - 2022.11.06


For this mini project, team 7ㅏ즈아 took the Monthly Dacon credit card user delinquency prediction AI competition, which closed in May 2021, as its topic. The goal was to build a model and explore ways to get the best possible performance.

🗂️ Dataset

  • index
  • gender: sex of the user
  • car: whether the user owns a car
  • reality: whether the user owns real estate
  • child_num: number of children
  • income_total: annual income
  • income_type: income category
      ['Commercial associate', 'Working', 'State servant', 'Pensioner', 'Student']
    
  • edu_type: education level
      ['Higher education' ,'Secondary / secondary special', 'Incomplete higher', 
       'Lower secondary', 'Academic degree']
    
  • family_type: marital status
      ['Married', 'Civil marriage', 'Separated', 'Single / not married', 'Widow']
    
  • house_type: housing type
      ['Municipal apartment', 'House / apartment', 'With parents',
       'Co-op apartment', 'Rented apartment', 'Office apartment']
    
  • DAYS_BIRTH: date of birth
    Counted backward from the data collection date (0).
    That is, -1 means the user was born one day before the collection date.

  • DAYS_EMPLOYED: employment start date
    Counted backward from the data collection date (0).
    That is, -1 means the user started working one day before the collection date.
    A positive value means the user is not employed.

  • FLAG_MOBIL: whether the user owns a mobile phone
  • work_phone: whether the user has a work phone
  • phone: whether the user has a phone
  • email: whether the user has an email address
  • occyp_type: occupation type
  • family_size: family size
  • begin_month: month the credit card was issued
    Counted backward from the data collection date (0). That is, -1 means the card was issued one month before the collection date.

  • credit: the user's credit rating, based on delinquency in credit card payments
    => a lower value means a more creditworthy card user
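To make the backward-counted columns concrete, here is a minimal sketch that turns a day offset into elapsed years. The helper name `days_to_years`, the sample values, and the 365-day year are assumptions for illustration only and are not taken from the competition data.

```python
def days_to_years(days_before_collection: int) -> float:
    """Convert a backward-counted day offset (0 = collection date) to years.

    Assumes a 365-day year. Positive offsets (not employed, in the case of
    DAYS_EMPLOYED) are treated as 0 elapsed days.
    """
    elapsed_days = max(-days_before_collection, 0)
    return elapsed_days / 365


# Hypothetical values, for illustration only.
age_years = days_to_years(-13899)         # born 13,899 days before collection
employed_years = days_to_years(-4709)     # started work 4,709 days before
not_employed_years = days_to_years(1000)  # positive offset: not employed

print(round(age_years, 1), round(employed_years, 1), not_employed_years)
```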

💻 Closer Look

First, submitting the provided sample_submission.csv as-is gave a score of 0.88113 / 0.87223 (Public / Private). The corresponding baseline code used a Random Forest Classifier.
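The scores in this post are log loss values (the code further down computes them with log_loss), so lower is better. As a rough reference point, the sketch below computes multi-class log loss in plain Python. The function name and the sample labels are made up for illustration. With three credit classes, predicting a uniform 1/3 for every class gives ln 3, about 1.0986.

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


# Hypothetical labels and a uniform prediction over three classes.
y_true = [0, 1, 2, 2]
uniform_prob = [[1 / 3, 1 / 3, 1 / 3]] * len(y_true)

uniform_loss = multiclass_log_loss(y_true, uniform_prob)
print(uniform_loss)
```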

1. Binary variables

train['gender'] = train['gender'].replace(['F','M'],[0,1])
test['gender'] = test['gender'].replace(['F','M'],[0,1])
print('gender :')
print(train['gender'].value_counts())
print('--------------')

print('Having a car or not : ')
train['car'] = train['car'].replace(['N','Y'],[0,1])
test['car'] = test['car'].replace(['N','Y'],[0,1])
print(train['car'].value_counts())
print('--------------')

print('Having house reality or not: ')
train['reality'] = train['reality'].replace(['N','Y'],[0,1])
test['reality'] = test['reality'].replace(['N','Y'],[0,1])
print(train['reality'].value_counts())
print('--------------')
      
print('Having a phone or not: ')
print(train['phone'].value_counts())
print('--------------')
      

print('Having a email or not: ')
print(train['email'].value_counts())
print('--------------')
      

print('Having a work phone or not: ')
print(train['work_phone'].value_counts())
print('--------------')
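The three replace calls above follow the same pattern, so they could also be written as one loop over a mapping table. This is only a sketch on made-up mini-frames (`train_demo`, `test_demo` and `binary_maps` are hypothetical names). Note that `Series.map` turns any value missing from the mapping into NaN, whereas `replace` leaves it unchanged.

```python
import pandas as pd

# Hypothetical mini-frames standing in for train and test.
train_demo = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "car": ["N", "Y", "Y"],
    "reality": ["Y", "N", "Y"],
})
test_demo = pd.DataFrame({
    "gender": ["M", "F"],
    "car": ["Y", "N"],
    "reality": ["N", "N"],
})

# One mapping per binary column, applied identically to both frames.
binary_maps = {
    "gender": {"F": 0, "M": 1},
    "car": {"N": 0, "Y": 1},
    "reality": {"N": 0, "Y": 1},
}

for frame in (train_demo, test_demo):
    for col, mapping in binary_maps.items():
        frame[col] = frame[col].map(mapping)

print(train_demo)
```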

2. Child_num

child_num values greater than 3 were judged to be rare, so they were all capped at 3.

train['child_num'].value_counts(sort=False).plot.bar()

train.loc[train['child_num'] >= 3,'child_num']=3
test.loc[test['child_num'] >= 3, 'child_num']=3
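The same capping can be expressed with `Series.clip`, which avoids the boolean mask. A minimal sketch on a made-up series:

```python
import pandas as pd

# Hypothetical child counts, including a few rare large values.
child_num = pd.Series([0, 1, 2, 3, 5, 14])

# Every value above 3 becomes 3; smaller values are left untouched.
capped = child_num.clip(upper=3)
print(capped.tolist())
```

The same call with `upper=7` would cover the family_size capping in the next step.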

3. Family_size

train['family_size'].value_counts().sort_values(ascending = False)
2.0     14106
1.0      5109
3.0      4632
4.0      2260
5.0       291
6.0        44
7.0         9
15.0        3
9.0         2
20.0        1
Name: family_size, dtype: int64

family_size values greater than 7 were judged to be rare, so they were all capped at 7.

train.loc[train['family_size'] > 7, 'family_size'] = 7
test.loc[test['family_size'] > 7, 'family_size'] = 7

4. income

income is spread over a wide range. Let's rescale it, then bin it and visualize the result.

# Express income in units of 10,000 so the values are easier to read.
# (The column is left numeric so that binning and the histogram work on it.)
train['income_total'] = train['income_total']/10000 
test['income_total'] = test['income_total']/10000

print(train['income_total'].value_counts(bins=10,sort=False))
train['income_total'].plot(kind='hist',bins=50,density=True)


Let's also compute the average income per family member and add it as a feature.

train['income_mean'] = train['income_total'] / train['family_size']
test['income_mean'] = test['income_total'] / test['family_size']
train['income_mean']

Binning the total income:

count, bin_dividers =np.histogram(train['income_total'], bins=7)
bin_names=['소득'+str(i) for i in range(7) ]  # '소득' means 'income'
# bin_dividers are computed from train only and reused for test
train['income_total']=pd.cut(x=train['income_total'], bins=bin_dividers, labels=bin_names, include_lowest=True)
test['income_total']=pd.cut(x=test['income_total'], bins=bin_dividers, labels=bin_names, include_lowest=True)
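One consequence of reusing train's bin edges is that a test value outside train's range falls into no bin and becomes NaN. The sketch below shows this on made-up income values; the variable names and numbers are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes (already divided by 10,000).
train_income = pd.Series([3.0, 10.0, 20.0, 45.0, 157.5])
test_income = pd.Series([2.0, 15.0, 200.0])  # 2.0 and 200.0 lie outside train's range

# Seven equal-width bins whose edges come from train only.
_, bin_edges = np.histogram(train_income, bins=7)
labels = [f"income{i}" for i in range(7)]

train_binned = pd.cut(train_income, bins=bin_edges, labels=labels, include_lowest=True)
test_binned = pd.cut(test_income, bins=bin_edges, labels=labels, include_lowest=True)

# Out-of-range test values have no bin and show up as NaN.
print(test_binned.isna().tolist())
```

Such NaN values need to be filled or otherwise handled before the column is encoded.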

5. Label Encoding

from sklearn import preprocessing
import pandas as pd

label_encoder=preprocessing.LabelEncoder()

# Fit on train, then reuse the same mapping for test.
for col in ['income_type', 'edu_type', 'family_type', 'house_type']:
    train[col]=label_encoder.fit_transform(train[col])
    test[col]=label_encoder.transform(test[col])

# income_total (binned) and occyp_type may hold values or NaN that appear in
# only one of the two frames. Refitting the encoder on test would give test a
# different mapping, so fit once on both frames together (as strings).
for col in ['income_total', 'occyp_type']:
    label_encoder.fit(pd.concat([train[col], test[col]]).astype(str))
    train[col]=label_encoder.transform(train[col].astype(str))
    test[col]=label_encoder.transform(test[col].astype(str))
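With label encoding, the integer assigned to a category depends on which values the encoder was fitted on. The sketch below uses made-up category lists to show that an encoder fitted on train and one refitted on test can assign different codes to the same category.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values; test lacks 'Pensioner'.
train_col = ["Working", "Pensioner", "Student", "Working"]
test_col = ["Student", "Working"]

# Encoder fitted on train: Pensioner=0, Student=1, Working=2.
shared_encoder = LabelEncoder().fit(train_col)
codes_shared = list(shared_encoder.transform(test_col))

# Encoder refitted on test alone: Student=0, Working=1.
refit_encoder = LabelEncoder().fit(test_col)
codes_refit = list(refit_encoder.transform(test_col))

print(codes_shared, codes_refit)
```

A model trained on the first mapping would misread features encoded with the second one.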

6. Negative-valued continuous variables

def make_bin(variable, n):
    # The raw values are counted backward from 0, so flip the sign first.
    train[variable]=-train[variable]
    test[variable]=-test[variable]
    count, bin_dividers =np.histogram(train[variable], bins=n) # bin edges come from train
    bin_names=[str(i) for i in range(n)]
    train[variable]=pd.cut(x=train[variable], bins=bin_dividers, labels=bin_names, include_lowest=True)
    test[variable]=pd.cut(x=test[variable], bins=bin_dividers, labels=bin_names, include_lowest=True)
    # Test values outside train's range have no bin (NaN); fill them with an arbitrary bin.
    test[variable]=test[variable].fillna(str(0))
    train[variable]=label_encoder.fit_transform(train[variable])
    test[variable]=label_encoder.transform(test[variable])

The number of bins for each variable was chosen with the following ranges in mind.

# DAYS_BIRTH
# date of birth
# max - min = 17,447 (days) = 581 (months) = 48.46 (years) -> about 50 (years)

# DAYS_EMPLOYED
# employment start date
# max - min = 380,956 (days) = 12698.53 (months) = 1058 (years)
# handle positive values -> not employed -> set to 0
# after that -> 523 (months) -> about 43.64 (years)

# begin_month
# month the credit card was issued
# max - min = 60 (months) = 5 (years)
train.loc[train['DAYS_EMPLOYED'] > 0,'DAYS_EMPLOYED'] = 0
test.loc[test['DAYS_EMPLOYED'] > 0, 'DAYS_EMPLOYED'] = 0

# DAYS_BIRTH is still negative at this point, so negate it to get a positive age in years.
train['Age'] = (-train['DAYS_BIRTH']) // 365
test['Age'] = (-test['DAYS_BIRTH']) // 365
make_bin('DAYS_BIRTH', n=10)        
make_bin('DAYS_EMPLOYED', n=4)      
make_bin('begin_month', n=5)

7. Correlation

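import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize = (14, 6))
# numeric_only=True keeps corr() from failing on any remaining non-numeric columns
sns.heatmap(train.corr(numeric_only=True), fmt = '.2f',
            # vmin = -0.5, vmax = 0.5, 
            annot = True,
            cmap = 'coolwarm')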

8. Modeling

The model was trained with CatBoostClassifier.

from catboost import CatBoostClassifier
from sklearn.metrics import log_loss

clf = CatBoostClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_val)

# Validation log loss (lower is better)
log_loss_val = log_loss(y_val['credit'], y_pred)
print(f"log_loss: {log_loss_val}")
log_loss_val = np.around(log_loss_val, 3)
log_loss(y_val['credit'], y_pred)
=> 0.8091264167319345
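X_train, X_val, y_train and y_val are not defined in the snippets shown here. One way they could be produced is a stratified hold-out split. The sketch below assumes that approach, runs on synthetic data, and swaps in scikit-learn's RandomForestClassifier as a stand-in so that it runs without CatBoost. The split ratio, seeds and column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the preprocessed features and the credit label.
rng = np.random.default_rng(0)
train_x = pd.DataFrame(rng.normal(size=(300, 4)), columns=list("abcd"))
train_y = pd.DataFrame({"credit": rng.integers(0, 3, size=300)})

# Stratify on credit so both parts keep a similar class balance.
X_train, X_val, y_train, y_val = train_test_split(
    train_x, train_y, test_size=0.2, stratify=train_y["credit"], random_state=55
)

# Stand-in model; the post itself uses CatBoostClassifier here.
model = RandomForestClassifier(n_estimators=50, random_state=55)
model.fit(X_train, y_train["credit"])
y_pred = model.predict_proba(X_val)

val_loss = log_loss(y_val["credit"], y_pred, labels=[0, 1, 2])
print(f"log_loss: {val_loss}")
```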

9. Feature importance visualization

def plot_feature_importance(importance, names, model_type):

    feature_importance = np.array(importance)
    feature_names = np.array(names)

    data = {'feature_names': feature_names,
            'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)

    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)

    plt.figure(figsize=(10, 8))

    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])

    plt.title(model_type + ' Feature Importance')
    plt.xlabel('Feature Importance')
    plt.ylabel('Feature Names')
plot_feature_importance(clf.get_feature_importance(),test_x.columns,'CATBOOST')

10. Cross Validation

from sklearn.metrics import log_loss
from sklearn.model_selection import KFold, StratifiedKFold

def run_kfold(clf):
    folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=55)
    outcomes=[]
    sub=np.zeros((test_x.shape[0], 3))  
    for n_fold, (train_index, val_index) in enumerate(folds.split(train_x, train_y)):
        X_train, X_val = train_x.iloc[train_index], train_x.iloc[val_index]
        y_train, y_val = train_y.iloc[train_index], train_y.iloc[val_index]
        clf.fit(X_train, y_train)
        
        predictions=clf.predict_proba(X_val)
        
        logloss=log_loss(y_val['credit'], predictions)
        outcomes.append(logloss)
        print(f"FOLD {n_fold} : logloss:{logloss}")
        
        sub+=clf.predict_proba(test_x)
        
        
    mean_outcome=np.mean(outcomes)
    
    print("Mean:{}".format(mean_outcome))
    return sub/folds.n_splits

my_submission = run_kfold(clf)
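run_kfold returns the test-set class probabilities averaged over the folds. The sketch below reproduces only that averaging step with made-up probabilities for two folds and two test rows, to show that each averaged row is still a valid probability distribution.

```python
import numpy as np

# Hypothetical predict_proba outputs for two test rows from two folds.
fold_probs = [
    np.array([[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]),
    np.array([[0.1, 0.3, 0.6], [0.5, 0.4, 0.1]]),
]

# Accumulate the per-fold probabilities, then divide by the number of folds.
sub = np.zeros((2, 3))
for probs in fold_probs:
    sub += probs
averaged = sub / len(fold_probs)

print(averaged)
```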

➡️ Attempt history and final performance comparison

History

Baseline    Final Score
0.872       0.7836
# baseline => 0.88
# Score: ( Public / Private )

########## RandomForest ##########
# 1
# baseline_submission_0.983 -> 0.86583 / 0.8516
# label encoding of occupation type (occyp_type)

# 2
# baseline_submission_0.951 -> 0.86071 / 0.8465
# handled positive DAYS_EMPLOYED values

# 3
# baseline_submission_0.923 -> 0.85454
# DAYS_BIRTH / DAYS_EMPLOYED / begin_month statistics -> adjusted the number of bins


############ XGBoost ############
# 4
# baseline_submission_0.856 -> 0.8738 / 0.8654
# XGBoost
# performance actually got worse

########## CatBoost ##########
# 5
# baseline_submission_0.824  -> 0.8013 / 0.7931
# removed missing values
# removed meaningless columns: index, FLAG_MOBIL
# FLAG_MOBIL: every row is 1, so the column carries a single value

# 6
# baseline_submission_0.819 -> 0.7978 / 0.7978
# added the Age column

# 7
# baseline_submission_0.821
# family_size values of 7 or more unified to 7

# 8
# performance dropped -> restore DAYS_BIRTH in # 9
# DAYS_BIRTH, work_phone
# work_phone: correlation coefficient with credit is 0

# 9
# baseline_submission_0.808 -> 0.7904 / 0.7836
# excluded only the work_phone column, restored DAYS_BIRTH
# computed income_mean
 

