Study Log
ML (0812_day7) - Practice: Boston Housing Price Prediction (Linear Regression)
Boston Housing Price Prediction
Linear Regression
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_boston
%matplotlib inline
In [2]:
boston = load_boston()
bostonDF = pd.DataFrame(boston.data , columns = boston.feature_names)
bostonDF['PRICE'] = boston.target
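Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the import above fails on recent versions. One workaround, adapted from scikit-learn's own deprecation notice, is to assemble the raw data by hand. A sketch (wrapped in a function because it needs network access to the CMU StatLib mirror):

```python
import numpy as np
import pandas as pd

def load_boston_df():
    # Workaround from scikit-learn's load_boston deprecation notice
    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
    # Each record spans two lines: 11 values, then 3 more on the next line
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]
    cols = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
    df = pd.DataFrame(data, columns=cols)
    df["PRICE"] = target
    return df
```

Calling `load_boston_df()` then yields the same (506, 14) frame as the cell above.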
In [3]:
bostonDF.shape
Out[3]:
(506, 14)
In [4]:
bostonDF.head()
Out[4]:
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
- CRIM: per-capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- NOX: nitric oxide (NOx) concentration
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built before 1940
- DIS: weighted distances to five major employment centers
- RAD: index of accessibility to radial highways
- TAX: property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: proportion of Black residents by town
- LSTAT: percentage of lower-status population
- MEDV: median value of owner-occupied homes (used as the PRICE target here)
- Goal: examine how each column influences the house price
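Before modeling, it can help to eyeball each feature against PRICE with a scatter plot plus fitted line. A minimal sketch of the plotting pattern (seaborn is already imported above); the DataFrame here is a small synthetic stand-in, so the column names match the dataset but the numbers are made up:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so this also runs headless
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic stand-in for bostonDF -- illustrative numbers only
df = pd.DataFrame({"RM": rng.normal(6, 0.7, 100),
                   "LSTAT": rng.uniform(2, 30, 100)})
df["PRICE"] = 5 * df["RM"] - 0.5 * df["LSTAT"] + rng.normal(0, 2, 100)

features = ["RM", "LSTAT"]
fig, axs = plt.subplots(ncols=len(features), figsize=(10, 4))
for ax, feat in zip(axs, features):
    sns.regplot(x=feat, y="PRICE", data=df, ax=ax)  # scatter + fitted line
plt.close(fig)
```

On the real data, RM shows a clear positive trend and LSTAT a clear negative one, consistent with the coefficient signs found below.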
In [5]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error , r2_score
y_target = bostonDF['PRICE']
X_data = bostonDF.drop(['PRICE'],axis=1,inplace=False)
X_train , X_test , y_train , y_test = train_test_split(X_data , y_target ,test_size=0.2, random_state=42)
In [6]:
lr = LinearRegression()
lr.fit(X_train ,y_train )
y_preds = lr.predict(X_test)
In [7]:
mse = mean_squared_error(y_test, y_preds)
rmse = np.sqrt(mse)
rmse
Out[7]:
4.928602182665355
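For reference, `r2_score` was imported above but never used; it reports the fraction of target variance the model explains on the held-out split and complements RMSE. A self-contained sketch on synthetic data (illustrative numbers, not the Boston split):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

rmse = np.sqrt(mean_squared_error(y_te, pred))
r2 = r2_score(y_te, pred)  # near 1.0 here because the data is nearly noiseless
```

On the real split above, `r2_score(y_test, y_preds)` would give a value well below 1, since an RMSE of ~4.9 is sizable relative to the PRICE scale.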
In [8]:
print('Intercept:', lr.intercept_)
print('Coefficients:', np.round(lr.coef_, 1))
Intercept: 30.24675099392408
Coefficients: [ -0.1   0.    0.    2.8 -17.2   4.4  -0.   -1.4   0.3  -0.   -0.9   0.   -0.5]
In [9]:
coeff = pd.Series(data=lr.coef_ , index=X_data.columns )
coeff.sort_values(ascending=False)
Out[9]:
RM          4.438835
CHAS        2.784438
RAD         0.262430
INDUS       0.040381
ZN          0.030110
B           0.012351
AGE        -0.006296
TAX        -0.010647
CRIM       -0.113056
LSTAT      -0.508571
PTRATIO    -0.915456
DIS        -1.447865
NOX       -17.202633
dtype: float64
Regularized Models
Ridge Regression
In [10]:
from sklearn.linear_model import Ridge
- As alpha increases, the coefficient values (theta) shrink toward zero
In [11]:
coeff_df = pd.DataFrame()
alphas = [0 , 0.1 , 1 , 10 , 100]
for pos, alpha in enumerate(alphas):
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_data, y_target)
    coeff = pd.Series(data=ridge.coef_, index=X_data.columns)
    colname = 'alpha:' + str(alpha)
    coeff_df[colname] = coeff
coeff_df.sort_values(by="alpha:0", ascending=False)
Out[11]:
| | alpha:0 | alpha:0.1 | alpha:1 | alpha:10 | alpha:100 |
|---|---|---|---|---|---|
| RM | 3.809865 | 3.818233 | 3.854000 | 3.702272 | 2.334536 |
| CHAS | 2.686734 | 2.670019 | 2.552393 | 1.952021 | 0.638335 |
| RAD | 0.306049 | 0.303515 | 0.290142 | 0.279596 | 0.315358 |
| ZN | 0.046420 | 0.046572 | 0.047443 | 0.049579 | 0.054496 |
| INDUS | 0.020559 | 0.015999 | -0.008805 | -0.042962 | -0.052826 |
| B | 0.009312 | 0.009368 | 0.009673 | 0.010037 | 0.009393 |
| AGE | 0.000692 | -0.000269 | -0.005415 | -0.010707 | 0.001212 |
| TAX | -0.012335 | -0.012421 | -0.012912 | -0.013993 | -0.015856 |
| CRIM | -0.108011 | -0.107474 | -0.104595 | -0.101435 | -0.102202 |
| LSTAT | -0.524758 | -0.525966 | -0.533343 | -0.559366 | -0.660764 |
| PTRATIO | -0.952747 | -0.940759 | -0.876074 | -0.797945 | -0.829218 |
| DIS | -1.475567 | -1.459626 | -1.372654 | -1.248808 | -1.153390 |
| NOX | -17.766611 | -16.684645 | -10.777015 | -2.371619 | -0.262847 |
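The loop above fits each Ridge model on the full dataset, so it only shows how coefficients shrink; it does not say which alpha predicts best. To actually choose alpha, one would typically compare cross-validated RMSE per alpha. A sketch of that pattern on synthetic stand-in data (illustrative numbers only):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem -- stand-in for (X_data, y_target)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=150)

rmses = {}
for alpha in [0.1, 1, 10, 100]:
    # cross_val_score returns negated MSE per fold; negate and root it
    neg_mse = cross_val_score(Ridge(alpha=alpha), X, y,
                              scoring="neg_mean_squared_error", cv=5)
    rmses[alpha] = np.sqrt(-neg_mse).mean()
```

Here the data has a strong linear signal, so very large alpha over-shrinks the coefficients and the cross-validated RMSE rises.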
Lasso Regression
In [14]:
from sklearn.linear_model import Lasso, ElasticNet
In [15]:
coeff_df = pd.DataFrame()
alphas = [ 0.07, 0.1, 0.5, 1, 3]
for pos, alpha in enumerate(alphas):
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_data, y_target)
    coeff = pd.Series(data=lasso.coef_, index=X_data.columns)
    colname = 'alpha:' + str(alpha)
    coeff_df[colname] = coeff
coeff_df.sort_values(by="alpha:0.07", ascending=False)
Out[15]:
| | alpha:0.07 | alpha:0.1 | alpha:0.5 | alpha:1 | alpha:3 |
|---|---|---|---|---|---|
| RM | 3.789725 | 3.703202 | 2.498212 | 0.949811 | 0.000000 |
| CHAS | 1.434343 | 0.955190 | 0.000000 | 0.000000 | 0.000000 |
| RAD | 0.270936 | 0.274707 | 0.277451 | 0.264206 | 0.061864 |
| ZN | 0.049059 | 0.049211 | 0.049544 | 0.049165 | 0.037231 |
| B | 0.010248 | 0.010249 | 0.009469 | 0.008247 | 0.006510 |
| NOX | -0.000000 | -0.000000 | -0.000000 | -0.000000 | 0.000000 |
| AGE | -0.011706 | -0.010037 | 0.003604 | 0.020910 | 0.042495 |
| TAX | -0.014290 | -0.014570 | -0.015442 | -0.015212 | -0.008602 |
| INDUS | -0.042120 | -0.036619 | -0.005253 | -0.000000 | -0.000000 |
| CRIM | -0.098193 | -0.097894 | -0.083289 | -0.063437 | -0.000000 |
| LSTAT | -0.560431 | -0.568769 | -0.656290 | -0.761115 | -0.807679 |
| PTRATIO | -0.765107 | -0.770654 | -0.758752 | -0.722966 | -0.265072 |
| DIS | -1.176583 | -1.160538 | -0.936605 | -0.668790 | -0.000000 |
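Unlike Ridge, which only shrinks coefficients, the L1 penalty in Lasso drives them exactly to zero, which is why NOX, CHAS, and eventually RM disappear from the table as alpha grows. A self-contained sketch of this sparsity effect on synthetic data (illustrative numbers only):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of eight features actually matter
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

zero_counts = {}
for alpha in [0.07, 0.5, 3]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    # Count how many coefficients Lasso set exactly to zero
    zero_counts[alpha] = int((lasso.coef_ == 0).sum())
```

As alpha grows, more coefficients hit exactly zero, so Lasso doubles as a feature-selection method.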
ElasticNet Regression
In [16]:
coeff_df = pd.DataFrame()
alphas = [0.07, 0.1, 0.5, 1, 3]
for pos, alpha in enumerate(alphas):
    elastic = ElasticNet(alpha=alpha, l1_ratio=0.7)
    elastic.fit(X_data, y_target)
    coeff = pd.Series(data=elastic.coef_, index=X_data.columns)
    colname = 'alpha:' + str(alpha)
    coeff_df[colname] = coeff
coeff_df.sort_values(by="alpha:0.07", ascending=False)
Out[16]:
| | alpha:0.07 | alpha:0.1 | alpha:0.5 | alpha:1 | alpha:3 |
|---|---|---|---|---|---|
| RM | 3.574162 | 3.414154 | 1.918419 | 0.938789 | 0.000000 |
| CHAS | 1.330724 | 0.979706 | 0.000000 | 0.000000 | 0.000000 |
| RAD | 0.278880 | 0.283443 | 0.300761 | 0.289299 | 0.146846 |
| ZN | 0.050107 | 0.050617 | 0.052878 | 0.052136 | 0.038268 |
| B | 0.010122 | 0.010067 | 0.009114 | 0.008320 | 0.007020 |
| AGE | -0.010116 | -0.008276 | 0.007760 | 0.020348 | 0.043446 |
| TAX | -0.014522 | -0.014814 | -0.016046 | -0.016218 | -0.011417 |
| INDUS | -0.044855 | -0.042719 | -0.023252 | -0.000000 | -0.000000 |
| CRIM | -0.099468 | -0.099213 | -0.089070 | -0.073577 | -0.019058 |
| NOX | -0.175072 | -0.000000 | -0.000000 | -0.000000 | -0.000000 |
| LSTAT | -0.574822 | -0.587702 | -0.693861 | -0.760457 | -0.800368 |
| PTRATIO | -0.779498 | -0.784725 | -0.790969 | -0.738672 | -0.423065 |
| DIS | -1.189438 | -1.173647 | -0.975902 | -0.725174 | -0.031208 |
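The `l1_ratio=0.7` above means 70% of the penalty is L1 (Lasso-style) and 30% is L2 (Ridge-style); at `l1_ratio=1.0` ElasticNet reduces to plain Lasso. A quick sketch confirming that equivalence on synthetic data (illustrative numbers only):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.5, 0.0]) + rng.normal(scale=0.3, size=100)

# With l1_ratio=1.0 the penalty is pure L1, so the fit matches Lasso
en = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
la = Lasso(alpha=0.1).fit(X, y)
```

Comparing `en.coef_` and `la.coef_` shows they coincide, while intermediate `l1_ratio` values trade Lasso's sparsity against Ridge's stability with correlated features.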