TIL | What is PCA (Principal Component Analysis)?

Table of Contents
  • 1. Principal Component Analysis
  • 2. How PCA Works
  • 3. Why Choose the Axis That Preserves the Most Variance?
  • 4. Applying PCA
  • 5. Checking the Results
  • 💻 Practice Example Code
  • Wrapping Up..
  • References

    ๐Ÿ‘ฉ๐Ÿปโ€๐Ÿ’ป K-MOOC ์‹ค์Šต์œผ๋กœ ๋ฐฐ์šฐ๋Š” ๋จธ์‹ ๋Ÿฌ๋‹ ๊ฐ•์˜

    📙 This post was written based on the K-MOOC lecture together with additional materials I looked up; I studied the theory and concepts, worked through a hands-on example, and then organized the content here.

    This post was written on the same day as [ AI ] Artificial Intelligence, Machine Learning, and Deep Learning.






    1. Principal Component Analysis

    PCA (Principal Component Analysis), widely used for dimensionality reduction and feature extraction, is an unsupervised learning technique built on the idea that when a dataset contains a lot of redundant information, the information inherent in the data can be described with fewer dimensions than the data originally has.

    • What is a principal component?
      • The component that best explains the variance of the entire data (the independent variables)
    • Number of variables = number of dimensions
      → As the number of dimensions grows, the space in which the data must be represented becomes more complex.

    ๋”ฐ๋ผ์„œ PCA๋Š” ์ฃผ๋กœ

    • ๋ณ€์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์•„ ๊ธฐ์กด ๋ณ€์ˆ˜๋ฅผ ์กฐํ•ฉํ•ด ์ƒˆ๋กœ์šด ๋ณ€์ˆ˜๋ฅผ ๊ฐ€์ง€๊ณ  ๋ชจ๋ธ๋ง์„ ํ•˜๋ ค๊ณ  ํ•˜๊ฑฐ๋‚˜
    • ํšŒ๊ท€ ๋ถ„์„๊ณผ ๊ฐ™์€ ์ข…์†๊ด€๊ณ„ ๋ถ„์„์„ ํ•  ๋•Œ ๋‹ค์ค‘ ๊ณต์‚ฐ์„ฑ multicollinearity์„ ์—†์• ๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•œ๋‹ค.

    2. How PCA Works

    ๋ฐ์ดํ„ฐ์˜ ์ฐจ์›์„ ์ถ•์†Œํ•  ๋•Œ ๊ธฐ์กด์˜ ์ •๋ณด๊ฐ€ ์ตœ๋Œ€ํ•œ ๋ณด์กด๋  ์ˆ˜ ์žˆ๋Š” ์ƒˆ๋กœ์šด ์ถ•์„ ์ฐพ์•„์•ผ ํ•œ๋‹ค. ์ด๋ ‡๊ฒŒ ์ฐพ์€ ์ถ•์„ Principle Component๋ผ๊ณ  ํ•˜๋ฉฐ, ์ฃผ๋กœ ์ค„์—ฌ์„œ PC๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.


    To find the PCs, we need the eigenvectors of the covariance matrix of the data; the eigenvector with the largest eigenvalue corresponds to the first PC we are looking for, since it points in the direction of greatest variance.
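
    As a minimal sketch of this idea (my own example, not from the lecture), the first PC can be found with NumPy by eigendecomposing the covariance matrix of the centered data:

    import numpy as np
    
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))            # any (n_samples, n_features) data
    
    X_centered = X - X.mean(axis=0)          # center each variable to mean 0
    cov = np.cov(X_centered, rowvar=False)   # 3 x 3 covariance matrix
    
    # eigh handles symmetric matrices; the eigenvector paired with the
    # largest eigenvalue points in the direction of greatest variance,
    # i.e. it is the first principal component
    eigvals, eigvecs = np.linalg.eigh(cov)
    first_pc = eigvecs[:, np.argmax(eigvals)]
    print(first_pc)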

    3. Why Choose the Axis That Preserves the Most Variance?


    ์œ„์˜ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ ๊ฐ„๋‹จํ•œ 2์ฐจ์› ๋ฐ์ดํ„ฐ์…‹์ด ์žˆ์„ ๋•Œ c2์˜ ์ง์„ ์„ ์„ ํƒํ•˜๋Š” ๊ฒƒ์ด ๋ถ„์‚ฐ์„ ๊ฐ€์žฅ ์ ๊ฒŒ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฐฉ๋ฒ•์ธ๋ฐ, ์ด๋ ‡๊ฒŒ ๋˜๋ฉด ๋ฐ์ดํ„ฐ๋ฅผ ์œ ์‹คํ•˜๊ธฐ๊ฐ€ ์‰ฌ์›Œ์ง„๋‹ค.

    ๋”ฐ๋ผ์„œ, ๋‹ค๋ฅธ ๋ฐฉํ–ฅ์œผ๋กœ ํˆฌ์˜ํ•˜๋Š” ๊ฒƒ ๋ณด๋‹ค ๋ถ„์‚ฐ์„ ์ตœ๋Œ€๋กœ ๋ณด์กดํ•  ์ˆ˜ ์žˆ๋Š” ์ถ•์„ ์„ ํƒํ•˜๋Š” ๊ฒƒ์ด ์ •๋ณด๋ฅผ ๊ฐ€์žฅ ์ ๊ฒŒ ์†์‹คํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค. ๋ถ„์‚ฐ์ด ์ปค์ ธ์•ผ ๋ฐ์ดํ„ฐ๋“ค์‚ฌ์ด์˜ ์ฐจ์ด์ ์ด ๋ช…ํ™•ํ•ด์ง€๊ณ , ๊ทธ๊ฒƒ์ด ๋ชจ๋ธ์„ ๋”์šฑ ์ข‹์€ ๋ฐฉํ–ฅ์œผ๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

    4. Applying PCA

    Step 1 : Normalize the data (mean of each variable = 0)
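
    A quick illustration of this step (my own snippet; note that scikit-learn's PCA also centers the data internally):

    import pandas as pd
    
    demo = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [10.0, 20.0, 40.0]})
    X_centered = demo - demo.mean()   # subtract each column's mean
    print(X_centered.mean())          # every variable now has mean 0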


    Step 2 : Define the optimization problem

    • The goal is to find a new axis that maximizes the variance of the data after projection!


    Step 3 : Derive the optimal solution


    • Use a Lagrange multiplier to add the constraint to the objective, creating a new objective function
    • Differentiate the new objective; the optimal solution occurs where the gradient is 0
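
    Written out (a standard derivation, added here because the original slide is an image): the new objective with Lagrange multiplier \lambda is

        L(w, \lambda) = w^{\top} S w - \lambda (w^{\top} w - 1)

    and setting its gradient with respect to w to zero gives

        \frac{\partial L}{\partial w} = 2 S w - 2 \lambda w = 0 \;\Rightarrow\; S w = \lambda w

    so the optimal axes are exactly the eigenvectors of the covariance matrix S, and the variance along each axis equals its eigenvalue, which is why the next step ranks the eigenvectors by eigenvalue.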

    Step 4 : Sort the eigenvectors in descending order of their eigenvalues

    • Each eigenvector becomes a new axis that is orthogonal to the others in the linearly transformed space
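
    A short NumPy sketch of this step (my own example): np.linalg.eigh returns eigenvalues in ascending order, so the order has to be reversed:

    import numpy as np
    
    S = np.array([[2.0, 0.8],
                  [0.8, 1.0]])               # example covariance matrix
    
    eigvals, eigvecs = np.linalg.eigh(S)     # ascending order by default
    order = np.argsort(eigvals)[::-1]        # indices for descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # eigvecs[:, 0] is now PC1 and eigvecs[:, 1] is PC2; the columns are
    # orthonormal, so they form a new set of mutually perpendicular axes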

    Step 5 : Transform the data by extracting the new variables


    Step 6 : Inverse-transform the data using only some of the extracted variables
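
    A minimal sketch of Steps 5 and 6 together (my own example, using the same scikit-learn PCA as the practice code below): transform projects the data onto the kept PCs, and inverse_transform reconstructs an approximation of the original data from only those components:

    import numpy as np
    from sklearn.decomposition import PCA
    
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 4))
    
    pca = PCA(n_components=2)            # keep only the first two PCs
    Z = pca.fit_transform(X)             # Step 5: project onto the PCs
    X_back = pca.inverse_transform(Z)    # Step 6: reconstruct from 2 PCs
    
    # the reconstruction error shows how much information the dropped PCs held
    print(np.mean((X - X_back) ** 2))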



    5. Checking the Results

    โžก๏ธ Scree Plot


    ์œ„์˜ ๊ทธ๋ž˜ํ”„์—์„œ ๋„ค๋ชจ์นœ ๊ณณ์ฒ˜๋Ÿผ ์ •๋ณด์˜ ๊ฐ์†Œ๋Ÿ‰์ด ํ™• ์ค„์–ด๋“œ๋Š” ๊ตฌ๊ฐ„์„ Elbow point๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค. Eigenvalue์˜ Elbowpoint๋ฅผ ํ™•์ธํ•˜๊ณ  ์ ์ ˆํ•˜๊ฒŒ ๋ช‡ ์ฐจ์›์œผ๋กœ ์ถ•์†Œํ• ์ง€ ๊ฒฐ์ •ํ•œ๋‹ค.


    โžก๏ธ Loading Plot


    This plot shows how much each variable of the original data x contributes when each principal component is formed, so it can be used for post-hoc interpretation of the variables.
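
    For reference, a hedged sketch of how the loadings behind such a plot are often computed from a fitted scikit-learn PCA (conventions vary; this one scales each eigenvector by the standard deviation of its component):

    import numpy as np
    from sklearn.decomposition import PCA
    
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    pca = PCA().fit(X)
    
    # loadings: rows = original variables, columns = PCs; plotting the first
    # column against the second gives the loading plot for PC1 vs PC2
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
    print(loadings)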



    💻 Practice Example Code

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.decomposition import PCA
    
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    df = pd.read_csv(url,
                     names = ['sepal length', 'sepal width', 'petal length',
                              'petal width', 'target'])
    


    # Use only continuous data
    data = df[df.columns[0:4]]
    
    # Create a PCA object with the number of principal components
    pca = PCA(n_components = len(df.columns) - 1)
    pca_fit = pca.fit(data)
    
    print('\n====== PCA Result Summary ======\n')
    print('Singular value : \n', pca.singular_values_)
    print('\n Singular vector : \n', pca.components_.T)
    print('\n Explained Standard deviations : \n', np.sqrt(pca.explained_variance_))
    print('\n Explained Variance Ratio : \n', pca.explained_variance_ratio_)
    print('\n Noise Variance : \n', pca.noise_variance_)
    
    ====== PCA Result Summary ======
    
    Singular value : 
     [25.08986398  6.00785254  3.42053538  1.87850234]
    
     Singular vector : 
     [[ 0.36158968  0.65653988 -0.58099728  0.31725455]
     [-0.08226889  0.72971237  0.59641809 -0.32409435]
     [ 0.85657211 -0.1757674   0.07252408 -0.47971899]
     [ 0.35884393 -0.07470647  0.54906091  0.75112056]]
    
     Explained Standard deviations : 
     [2.05544175 0.49218246 0.28022118 0.15389291]
    
     Explained Variance Ratio : 
     [0.92461621 0.05301557 0.01718514 0.00518309]
    
     Noise Variance : 
     0.0
    
    # Scree Plot
    plt.title("Scree Plot")
    plt.xlabel("Number of Components")
    plt.ylabel("Explained Variance")
    plt.plot(pca.explained_variance_, 'o-')
    


    # Get the projected values (principal component scores)
    pca_pred = pd.DataFrame(pca.fit_transform(data))
    pca_pred = pd.concat([pca_pred, df['target']], axis = 1)
    pca_pred
    


    sns.scatterplot(x = pca_pred[0], y = pca_pred[1], data = pca_pred,
                    hue = 'target', style = 'target', s = 100);
    


    Wrapping Up..

    ์ง€๋„ํ•™์Šต๋งŒ ์ฃผ๋กœ ๋‹ค๋ฃจ๋‹ค ๋ณด๋‹ˆ PCA๋Š” ๊ฐœ๋…๋งŒ ์•Œ๊ณ  ์žˆ๊ณ  ์ง์ ‘ ํ•ด๋ณผ ๊ธฐํšŒ๊ฐ€ ์—†์—ˆ๋Š”๋ฐ ์ด๋ฒˆ์— ํ•ด๋‹น ๋‚ด์šฉ์— ๋Œ€ํ•ด ์ •๋ฆฌํ•˜๋ฉด์„œ ์šฐ์—ฐํžˆ ์ฐจ์› ์ถ•์†Œ ์‹ค์Šต ์ฝ”๋“œ๋ฅผ ๋ฐœ๊ฒฌํ–ˆ๋‹ค. ์ง์ ‘ ํ•ด๋ณด๋‹ˆ ๊ฐ„๋‹จํ•˜๊ณ  ๋” ์ง๊ด€์ ์œผ๋กœ ํ•ด๋‹น ๋‚ด์šฉ์— ๋Œ€ํ•ด ์ดํ•ดํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋น„์ง€๋„ํ•™์Šต์„ ๋‹ค๋ฃจ๊ฒŒ ๋˜๋Š” ๊ทธ ์–ด๋Š ๋‚  ์˜ค๋Š˜ ๊ณต๋ถ€ํ•œ ๋‚ด์šฉ์ด ๋„์›€์ด ๋˜๊ธธ!!

    ๋‹ค์Œ ํฌ์ŠคํŠธ์—์„œ ๋งŒ๋‚˜์š” ๐Ÿ™Œ



    References

    K-MOOC, Machine Learning with Hands-on Practice

    Machine Learning - PCA (Principal Component Analysis)

    Stack Exchange - Making sense of principal component analysis, eigenvectors & eigenvalues

    [sklearn] PCA (Principal Component Analysis)

     
