Study Log
ML (0809_day4) - Practice: Predicting California Housing Prices (Model Selection and Training)
California Housing Price Prediction
In [1]:
import sklearn
In [2]:
sklearn.__version__
Out[2]:
'0.24.1'
Getting the Data
In [3]:
import pandas as pd
housing = pd.read_csv('datasets/housing.csv')
A Quick Look at the Data
In [4]:
housing.head()
Out[4]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
- longitude : longitude
- latitude : latitude
- housing_median_age : median age of the houses
- total_rooms : total number of rooms
- total_bedrooms : total number of bedrooms
- population : population
- households : number of households
- median_income : median income
- median_house_value : median house value
- ocean_proximity : proximity to the ocean
In [5]:
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
In [6]:
housing['ocean_proximity'].value_counts()
Out[6]:
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
In [7]:
housing.describe()
Out[7]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
In [8]:
import matplotlib.pyplot as plt
In [9]:
housing.hist(bins=50, figsize=(20, 15))
plt.show()
Splitting Off a Test Set
In [10]:
from sklearn.model_selection import train_test_split
In [11]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
In [12]:
test_set.head()
Out[12]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
20046 | -119.01 | 36.06 | 25.0 | 1505.0 | NaN | 1392.0 | 359.0 | 1.6812 | 47700.0 | INLAND |
3024 | -119.46 | 35.14 | 30.0 | 2943.0 | NaN | 1565.0 | 584.0 | 2.5313 | 45800.0 | INLAND |
15663 | -122.44 | 37.80 | 52.0 | 3830.0 | NaN | 1310.0 | 963.0 | 3.4801 | 500001.0 | NEAR BAY |
20484 | -118.72 | 34.28 | 17.0 | 3051.0 | NaN | 1705.0 | 495.0 | 5.7376 | 218600.0 | <1H OCEAN |
9814 | -121.93 | 36.62 | 34.0 | 2351.0 | NaN | 1063.0 | 428.0 | 3.7250 | 278000.0 | NEAR OCEAN |
In [13]:
housing['median_income'].hist()
Out[13]:
<AxesSubplot:>
In [14]:
import numpy as np
housing['income_cat'] = pd.cut(housing['median_income'], bins=[0., 1.5, 3., 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
In [15]:
housing['income_cat'].value_counts()
Out[15]:
3    7236
2    6581
4    3639
5    2362
1     822
Name: income_cat, dtype: int64
In [16]:
housing['income_cat'].hist()
Out[16]:
<AxesSubplot:>
In [17]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
- Proportions of income_cat
In [18]:
housing['income_cat'].value_counts() / len(housing)
Out[18]:
3    0.350581
2    0.318847
4    0.176308
5    0.114438
1    0.039826
Name: income_cat, dtype: float64
1) Test data sampled with stratification on the strata information (median_income)
In [19]:
strat_test_set['income_cat'].value_counts() / len(strat_test_set)
Out[19]:
3    0.350533
2    0.318798
4    0.176357
5    0.114583
1    0.039729
Name: income_cat, dtype: float64
2) Randomly sampled test data
In [20]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
test_set['income_cat'].value_counts() / len(test_set)
Out[20]:
3    0.358527
2    0.324370
4    0.167393
5    0.109496
1    0.040213
Name: income_cat, dtype: float64
==> The stratified split preserves the category proportions better than random sampling
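The comparison above can be reproduced without the housing CSV. The sketch below is an illustrative example, not part of the notebook: the imbalanced `income_cat`-like column and its probabilities are made up, and it measures how far each split's category proportions drift from the overall ones.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic stand-in for housing['income_cat']:
# an imbalanced categorical column with 10,000 rows.
rng = np.random.RandomState(42)
cats = rng.choice([1, 2, 3, 4, 5], size=10_000,
                  p=[0.04, 0.32, 0.35, 0.18, 0.11])
df = pd.DataFrame({'income_cat': cats})

overall = df['income_cat'].value_counts(normalize=True).sort_index()

# Random split vs. stratified split of the same data
_, rand_test = train_test_split(df, test_size=0.2, random_state=42)
_, strat_test = train_test_split(df, test_size=0.2, random_state=42,
                                 stratify=df['income_cat'])

rand_props = rand_test['income_cat'].value_counts(normalize=True).sort_index()
strat_props = strat_test['income_cat'].value_counts(normalize=True).sort_index()

# Largest per-category deviation from the overall proportions
rand_err = (rand_props - overall).abs().max()
strat_err = (strat_props - overall).abs().max()
print(rand_err, strat_err)
```

Passing `stratify=` to `train_test_split` performs the same single stratified split that `StratifiedShuffleSplit` with `n_splits=1` does above.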
Option 2
In [21]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42, stratify=housing['income_cat'])
In [22]:
test_set['income_cat'].value_counts() / len(test_set)
Out[22]:
3    0.350533
2    0.318798
4    0.176357
5    0.114583
1    0.039729
Name: income_cat, dtype: float64
- After the proportional split, the income_cat column won't be used again, so drop it
In [23]:
strat_train_set.drop('income_cat', axis=1, inplace=True)
strat_test_set.drop('income_cat', axis=1, inplace=True)
In [24]:
strat_train_set.head()
Out[24]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
17606 | -121.89 | 37.29 | 38.0 | 1568.0 | 351.0 | 710.0 | 339.0 | 2.7042 | 286600.0 | <1H OCEAN |
18632 | -121.93 | 37.05 | 14.0 | 679.0 | 108.0 | 306.0 | 113.0 | 6.4214 | 340600.0 | <1H OCEAN |
14650 | -117.20 | 32.77 | 31.0 | 1952.0 | 471.0 | 936.0 | 462.0 | 2.8621 | 196900.0 | NEAR OCEAN |
3230 | -119.61 | 36.31 | 25.0 | 1847.0 | 371.0 | 1460.0 | 353.0 | 1.8839 | 46300.0 | INLAND |
3555 | -118.59 | 34.23 | 17.0 | 6592.0 | 1525.0 | 4459.0 | 1463.0 | 3.0347 | 254500.0 | <1H OCEAN |
Visualizing Geographical Data
In [25]:
housing.plot(kind='scatter', x='longitude', y='latitude')
Out[25]:
<AxesSubplot:xlabel='longitude', ylabel='latitude'>
In [26]:
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1)
Out[26]:
<AxesSubplot:xlabel='longitude', ylabel='latitude'>
- s : point size varies with population
- c : color varies with median_house_value
In [27]:
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
             s=housing['population']/100, label='population', figsize=(10, 7),
             c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True,
             sharex=False)
Out[27]:
<AxesSubplot:xlabel='longitude', ylabel='latitude'>
==> House prices are high near the coast and in densely populated areas
Looking for Correlations
In [28]:
corr_matrix = housing.corr()
corr_matrix
Out[28]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
longitude | 1.000000 | -0.924664 | -0.108197 | 0.044568 | 0.069608 | 0.099773 | 0.055310 | -0.015176 | -0.045967 |
latitude | -0.924664 | 1.000000 | 0.011173 | -0.036100 | -0.066983 | -0.108785 | -0.071035 | -0.079809 | -0.144160 |
housing_median_age | -0.108197 | 0.011173 | 1.000000 | -0.361262 | -0.320451 | -0.296244 | -0.302916 | -0.119034 | 0.105623 |
total_rooms | 0.044568 | -0.036100 | -0.361262 | 1.000000 | 0.930380 | 0.857126 | 0.918484 | 0.198050 | 0.134153 |
total_bedrooms | 0.069608 | -0.066983 | -0.320451 | 0.930380 | 1.000000 | 0.877747 | 0.979728 | -0.007723 | 0.049686 |
population | 0.099773 | -0.108785 | -0.296244 | 0.857126 | 0.877747 | 1.000000 | 0.907222 | 0.004834 | -0.024650 |
households | 0.055310 | -0.071035 | -0.302916 | 0.918484 | 0.979728 | 0.907222 | 1.000000 | 0.013033 | 0.065843 |
median_income | -0.015176 | -0.079809 | -0.119034 | 0.198050 | -0.007723 | 0.004834 | 0.013033 | 1.000000 | 0.688075 |
median_house_value | -0.045967 | -0.144160 | 0.105623 | 0.134153 | 0.049686 | -0.024650 | 0.065843 | 0.688075 | 1.000000 |
In [29]:
corr_matrix['median_house_value'].sort_values(ascending=False)
Out[29]:
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
==> The higher the income, the higher the house value
Pearson correlation coefficient (Wikipedia):
==> It only captures linear relationships; nonlinear dependencies can score near zero
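A minimal sketch of this caveat on toy data (not the housing set): a perfect but quadratic relationship yields a Pearson coefficient near zero, while a noisy linear one yields a coefficient near 1.

```python
import numpy as np

# Toy data: x symmetric around 0 so that cov(x, x^2) is ~0
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 10_000)

linear = 2 * x + 0.1 * rng.randn(10_000)   # strong linear relationship
quadratic = x ** 2                          # perfect but nonlinear relationship

r_linear = np.corrcoef(x, linear)[0, 1]
r_quadratic = np.corrcoef(x, quadratic)[0, 1]
print(r_linear, r_quadratic)  # r_linear close to 1, r_quadratic close to 0
```

This is why scatter plots (next cells) are worth inspecting alongside the correlation matrix.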
In [30]:
from pandas.plotting import scatter_matrix
In [31]:
attributes = ['median_house_value', 'median_income', 'total_rooms', 'housing_median_age']
scatter_matrix(housing[attributes], figsize=(12, 8))
Out[31]:
array([[<AxesSubplot:xlabel='median_house_value', ylabel='median_house_value'>, <AxesSubplot:xlabel='median_income', ylabel='median_house_value'>, <AxesSubplot:xlabel='total_rooms', ylabel='median_house_value'>, <AxesSubplot:xlabel='housing_median_age', ylabel='median_house_value'>], [<AxesSubplot:xlabel='median_house_value', ylabel='median_income'>, <AxesSubplot:xlabel='median_income', ylabel='median_income'>, <AxesSubplot:xlabel='total_rooms', ylabel='median_income'>, <AxesSubplot:xlabel='housing_median_age', ylabel='median_income'>], [<AxesSubplot:xlabel='median_house_value', ylabel='total_rooms'>, <AxesSubplot:xlabel='median_income', ylabel='total_rooms'>, <AxesSubplot:xlabel='total_rooms', ylabel='total_rooms'>, <AxesSubplot:xlabel='housing_median_age', ylabel='total_rooms'>], [<AxesSubplot:xlabel='median_house_value', ylabel='housing_median_age'>, <AxesSubplot:xlabel='median_income', ylabel='housing_median_age'>, <AxesSubplot:xlabel='total_rooms', ylabel='housing_median_age'>, <AxesSubplot:xlabel='housing_median_age', ylabel='housing_median_age'>]], dtype=object)
In [32]:
housing.plot(kind='scatter', x='median_income', y='median_house_value', alpha=0.1)
plt.axis([0, 16, 0, 550000])
Out[32]:
(0.0, 16.0, 0.0, 550000.0)
Experimenting with Attribute Combinations
In [33]:
housing.head()
Out[33]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | income_cat | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | 5 |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | 5 |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | 5 |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | 4 |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | 3 |
<Feature Extraction>
- Rooms per household
In [34]:
housing['rooms_per_household'] = housing['total_rooms'] / housing['households']
- Bedrooms per room
In [35]:
housing['bedrooms_per_room'] = housing['total_bedrooms'] / housing['total_rooms']
- Population per household
In [36]:
housing['population_per_household'] = housing['population'] / housing['households']
In [37]:
corr_matrix = housing.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)
Out[37]:
median_house_value          1.000000
median_income               0.688075
rooms_per_household         0.151948
total_rooms                 0.134153
housing_median_age          0.105623
households                  0.065843
total_bedrooms              0.049686
population_per_household   -0.023737
population                 -0.024650
longitude                  -0.045967
latitude                   -0.144160
bedrooms_per_room          -0.255880
Name: median_house_value, dtype: float64
==> rooms_per_household correlates with median_house_value more strongly than total_rooms does, and bedrooms_per_room shows the strongest negative correlation of all
Data Preprocessing
In [38]:
print(strat_test_set.size, strat_train_set.size)
41280 165120
In [39]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_label = strat_train_set['median_house_value'].copy()
In [40]:
housing.shape
Out[40]:
(16512, 9)
In [41]:
housing_label.shape
Out[41]:
(16512,)
In [42]:
housing.isnull().sum()
Out[42]:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        158
population              0
households              0
median_income           0
ocean_proximity         0
dtype: int64
In [43]:
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
In [44]:
sample_incomplete_rows
Out[44]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|
4629 | -118.30 | 34.07 | 18.0 | 3759.0 | NaN | 3296.0 | 1462.0 | 2.2708 | <1H OCEAN |
6068 | -117.86 | 34.01 | 16.0 | 4632.0 | NaN | 3038.0 | 727.0 | 5.1762 | <1H OCEAN |
17923 | -121.97 | 37.35 | 30.0 | 1955.0 | NaN | 999.0 | 386.0 | 4.6328 | <1H OCEAN |
13656 | -117.30 | 34.05 | 6.0 | 2155.0 | NaN | 1039.0 | 391.0 | 1.6675 | INLAND |
19252 | -122.79 | 38.48 | 7.0 | 6837.0 | NaN | 3468.0 | 1405.0 | 3.1662 | <1H OCEAN |
Option 1 : drop the affected samples (rows)
In [45]:
sample_incomplete_rows.dropna(subset=['total_bedrooms'])
Out[45]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
---|
Option 2 : drop the attribute itself
In [46]:
sample_incomplete_rows.drop("total_bedrooms", axis=1)
Out[46]:
longitude | latitude | housing_median_age | total_rooms | population | households | median_income | ocean_proximity | |
---|---|---|---|---|---|---|---|---|
4629 | -118.30 | 34.07 | 18.0 | 3759.0 | 3296.0 | 1462.0 | 2.2708 | <1H OCEAN |
6068 | -117.86 | 34.01 | 16.0 | 4632.0 | 3038.0 | 727.0 | 5.1762 | <1H OCEAN |
17923 | -121.97 | 37.35 | 30.0 | 1955.0 | 999.0 | 386.0 | 4.6328 | <1H OCEAN |
13656 | -117.30 | 34.05 | 6.0 | 2155.0 | 1039.0 | 391.0 | 1.6675 | INLAND |
19252 | -122.79 | 38.48 | 7.0 | 6837.0 | 3468.0 | 1405.0 | 3.1662 | <1H OCEAN |
Option 3 : fill in with a specific value
In [47]:
median = housing['total_bedrooms'].median()
median
Out[47]:
433.0
In [48]:
sample_incomplete_rows
Out[48]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|
4629 | -118.30 | 34.07 | 18.0 | 3759.0 | NaN | 3296.0 | 1462.0 | 2.2708 | <1H OCEAN |
6068 | -117.86 | 34.01 | 16.0 | 4632.0 | NaN | 3038.0 | 727.0 | 5.1762 | <1H OCEAN |
17923 | -121.97 | 37.35 | 30.0 | 1955.0 | NaN | 999.0 | 386.0 | 4.6328 | <1H OCEAN |
13656 | -117.30 | 34.05 | 6.0 | 2155.0 | NaN | 1039.0 | 391.0 | 1.6675 | INLAND |
19252 | -122.79 | 38.48 | 7.0 | 6837.0 | NaN | 3468.0 | 1405.0 | 3.1662 | <1H OCEAN |
In [49]:
sample_incomplete_rows['total_bedrooms'].fillna(median, inplace=True)
sample_incomplete_rows
Out[49]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|
4629 | -118.30 | 34.07 | 18.0 | 3759.0 | 433.0 | 3296.0 | 1462.0 | 2.2708 | <1H OCEAN |
6068 | -117.86 | 34.01 | 16.0 | 4632.0 | 433.0 | 3038.0 | 727.0 | 5.1762 | <1H OCEAN |
17923 | -121.97 | 37.35 | 30.0 | 1955.0 | 433.0 | 999.0 | 386.0 | 4.6328 | <1H OCEAN |
13656 | -117.30 | 34.05 | 6.0 | 2155.0 | 433.0 | 1039.0 | 391.0 | 1.6675 | INLAND |
19252 | -122.79 | 38.48 | 7.0 | 6837.0 | 433.0 | 3468.0 | 1405.0 | 3.1662 | <1H OCEAN |
- Transformer
In [50]:
from sklearn.impute import SimpleImputer
In [51]:
imputer = SimpleImputer(strategy='median')
- ocean_proximity is not numeric, so exclude it for now
In [52]:
housing_num = housing.drop('ocean_proximity', axis=1)
housing_num
Out[52]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | |
---|---|---|---|---|---|---|---|---|
17606 | -121.89 | 37.29 | 38.0 | 1568.0 | 351.0 | 710.0 | 339.0 | 2.7042 |
18632 | -121.93 | 37.05 | 14.0 | 679.0 | 108.0 | 306.0 | 113.0 | 6.4214 |
14650 | -117.20 | 32.77 | 31.0 | 1952.0 | 471.0 | 936.0 | 462.0 | 2.8621 |
3230 | -119.61 | 36.31 | 25.0 | 1847.0 | 371.0 | 1460.0 | 353.0 | 1.8839 |
3555 | -118.59 | 34.23 | 17.0 | 6592.0 | 1525.0 | 4459.0 | 1463.0 | 3.0347 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
6563 | -118.13 | 34.20 | 46.0 | 1271.0 | 236.0 | 573.0 | 210.0 | 4.9312 |
12053 | -117.56 | 33.88 | 40.0 | 1196.0 | 294.0 | 1052.0 | 258.0 | 2.0682 |
13908 | -116.40 | 34.09 | 9.0 | 4855.0 | 872.0 | 2098.0 | 765.0 | 3.2723 |
11159 | -118.01 | 33.82 | 31.0 | 1960.0 | 380.0 | 1356.0 | 356.0 | 4.0625 |
15775 | -122.45 | 37.77 | 52.0 | 3095.0 | 682.0 | 1269.0 | 639.0 | 3.5750 |
16512 rows × 8 columns
- Computes the median of every attribute in housing_num
In [53]:
imputer.fit(housing_num)
Out[53]:
SimpleImputer(strategy='median')
In [54]:
imputer.statistics_
Out[54]:
array([-118.51 , 34.26 , 29. , 2119.5 , 433. , 1164. , 408. , 3.5409])
In [55]:
housing_num.median()  # pandas descriptive-statistics method
Out[55]:
longitude             -118.5100
latitude                34.2600
housing_median_age      29.0000
total_rooms           2119.5000
total_bedrooms         433.0000
population            1164.0000
households             408.0000
median_income            3.5409
dtype: float64
- Replaces every missing value with the corresponding median
In [56]:
X = imputer.transform(housing_num)
In [57]:
X
Out[57]:
array([[-121.89 , 37.29 , 38. , ..., 710. , 339. , 2.7042], [-121.93 , 37.05 , 14. , ..., 306. , 113. , 6.4214], [-117.2 , 32.77 , 31. , ..., 936. , 462. , 2.8621], ..., [-116.4 , 34.09 , 9. , ..., 2098. , 765. , 3.2723], [-118.01 , 33.82 , 31. , ..., 1356. , 356. , 4.0625], [-122.45 , 37.77 , 52. , ..., 1269. , 639. , 3.575 ]])
In [58]:
housing_num.columns
Out[58]:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income'], dtype='object')
In [59]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
In [60]:
sample_incomplete_rows.index
Out[60]:
Int64Index([4629, 6068, 17923, 13656, 19252], dtype='int64')
In [61]:
housing_tr.loc[sample_incomplete_rows.index]
Out[61]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | |
---|---|---|---|---|---|---|---|---|
4629 | -118.30 | 34.07 | 18.0 | 3759.0 | 433.0 | 3296.0 | 1462.0 | 2.2708 |
6068 | -117.86 | 34.01 | 16.0 | 4632.0 | 433.0 | 3038.0 | 727.0 | 5.1762 |
17923 | -121.97 | 37.35 | 30.0 | 1955.0 | 433.0 | 999.0 | 386.0 | 4.6328 |
13656 | -117.30 | 34.05 | 6.0 | 2155.0 | 433.0 | 1039.0 | 391.0 | 1.6675 |
19252 | -122.79 | 38.48 | 7.0 | 6837.0 | 433.0 | 3468.0 | 1405.0 | 3.1662 |
In [62]:
imputer.strategy
Out[62]:
'median'
Encoding the Data
In [63]:
housing.head()
Out[63]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|
17606 | -121.89 | 37.29 | 38.0 | 1568.0 | 351.0 | 710.0 | 339.0 | 2.7042 | <1H OCEAN |
18632 | -121.93 | 37.05 | 14.0 | 679.0 | 108.0 | 306.0 | 113.0 | 6.4214 | <1H OCEAN |
14650 | -117.20 | 32.77 | 31.0 | 1952.0 | 471.0 | 936.0 | 462.0 | 2.8621 | NEAR OCEAN |
3230 | -119.61 | 36.31 | 25.0 | 1847.0 | 371.0 | 1460.0 | 353.0 | 1.8839 | INLAND |
3555 | -118.59 | 34.23 | 17.0 | 6592.0 | 1525.0 | 4459.0 | 1463.0 | 3.0347 | <1H OCEAN |
In [64]:
housing_cat = housing[['ocean_proximity']]
In [65]:
housing_cat[:10]
Out[65]:
ocean_proximity | |
---|---|
17606 | <1H OCEAN |
18632 | <1H OCEAN |
14650 | NEAR OCEAN |
3230 | INLAND |
3555 | <1H OCEAN |
19480 | INLAND |
8879 | <1H OCEAN |
13685 | INLAND |
4937 | <1H OCEAN |
4861 | <1H OCEAN |
<Label Encoding> - OrdinalEncoder
In [66]:
from sklearn.preprocessing import OrdinalEncoder
- Create the transformer
In [67]:
ordinal_encoder = OrdinalEncoder()
In [68]:
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
Out[68]:
array([[0.], [0.], [4.], [1.], [0.], [1.], [0.], [1.], [0.], [0.]])
In [69]:
ordinal_encoder.categories_
Out[69]:
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)]
<One-Hot Encoding> - OneHotEncoder
In [70]:
from sklearn.preprocessing import OneHotEncoder
In [71]:
cat_encoder = OneHotEncoder()
In [72]:
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
Out[72]:
<16512x5 sparse matrix of type '<class 'numpy.float64'>' with 16512 stored elements in Compressed Sparse Row format>
- The result is a sparse matrix, so call toarray() to see the actual values
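As a small self-contained illustration (a toy 'a'/'b'/'c' column, not the housing data), the default output is a SciPy CSR matrix that stores only one nonzero per row, and toarray() materializes it as a dense array:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with three categories
X = np.array([['a'], ['b'], ['a'], ['c']])

enc = OneHotEncoder()                 # sparse output by default
sparse_out = enc.fit_transform(X)

dense = sparse_out.toarray()          # dense view for inspection
print(dense.shape)                    # (4, 3)
print(sparse_out.nnz)                 # 4 stored elements: one per row
```

For many categories, keeping the sparse representation saves a lot of memory; convert to dense only when you actually need to look at the values.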
In [73]:
housing_cat_1hot.toarray()[:10]
Out[73]:
array([[1., 0., 0., 0., 0.], [1., 0., 0., 0., 0.], [0., 0., 0., 0., 1.], [0., 1., 0., 0., 0.], [1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.], [1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.], [1., 0., 0., 0., 0.], [1., 0., 0., 0., 0.]])
In [74]:
cat_encoder.categories_
Out[74]:
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)]
- Note
pd.factorize() : plays the same role as sklearn's OrdinalEncoder (but codes categories in order of first appearance rather than sorted order)
pd.get_dummies() : plays the same role as sklearn's OneHotEncoder
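The ordering difference is easy to see on a tiny made-up series (it matches the different category orders visible in Out[69] vs. Out[75] above):

```python
import pandas as pd

# Toy series: 'NEAR BAY' appears first, so factorize assigns it code 0
s = pd.Series(['NEAR BAY', '<1H OCEAN', 'NEAR BAY', 'INLAND'])

codes, uniques = pd.factorize(s)
print(list(codes))            # appearance-order codes
print(list(uniques))          # ['NEAR BAY', '<1H OCEAN', 'INLAND']

dummies = pd.get_dummies(s)
print(list(dummies.columns))  # sorted order, like OneHotEncoder.categories_
```

If code values must line up with sklearn's encoders, prefer OrdinalEncoder/OneHotEncoder, which sort categories.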
In [75]:
pd.factorize(housing['ocean_proximity'])
Out[75]:
(array([0, 0, 1, ..., 2, 0, 3], dtype=int64), Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object'))
In [76]:
pd.get_dummies(housing['ocean_proximity'])
Out[76]:
<1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN | |
---|---|---|---|---|---|
17606 | 1 | 0 | 0 | 0 | 0 |
18632 | 1 | 0 | 0 | 0 | 0 |
14650 | 0 | 0 | 0 | 0 | 1 |
3230 | 0 | 1 | 0 | 0 | 0 |
3555 | 1 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... |
6563 | 0 | 1 | 0 | 0 | 0 |
12053 | 0 | 1 | 0 | 0 | 0 |
13908 | 0 | 1 | 0 | 0 | 0 |
11159 | 1 | 0 | 0 | 0 | 0 |
15775 | 0 | 0 | 0 | 1 | 0 |
16512 rows × 5 columns
In [77]:
from sklearn.base import BaseEstimator, TransformerMixin
# column indices
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
In [78]:
housing_extra_attribs[0]
Out[78]:
array([-121.89, 37.29, 38.0, 1568.0, 351.0, 710.0, 339.0, 2.7042, '<1H OCEAN', 4.625368731563422, 2.094395280235988], dtype=object)
In [79]:
pd.DataFrame(housing_extra_attribs, columns= list(housing.columns) + ['rooms_per_household', 'population_per_household'])
Out[79]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | rooms_per_household | population_per_household | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | -121.89 | 37.29 | 38.0 | 1568.0 | 351.0 | 710.0 | 339.0 | 2.7042 | <1H OCEAN | 4.625369 | 2.094395 |
1 | -121.93 | 37.05 | 14.0 | 679.0 | 108.0 | 306.0 | 113.0 | 6.4214 | <1H OCEAN | 6.00885 | 2.707965 |
2 | -117.2 | 32.77 | 31.0 | 1952.0 | 471.0 | 936.0 | 462.0 | 2.8621 | NEAR OCEAN | 4.225108 | 2.025974 |
3 | -119.61 | 36.31 | 25.0 | 1847.0 | 371.0 | 1460.0 | 353.0 | 1.8839 | INLAND | 5.232295 | 4.135977 |
4 | -118.59 | 34.23 | 17.0 | 6592.0 | 1525.0 | 4459.0 | 1463.0 | 3.0347 | <1H OCEAN | 4.50581 | 3.047847 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
16507 | -118.13 | 34.2 | 46.0 | 1271.0 | 236.0 | 573.0 | 210.0 | 4.9312 | INLAND | 6.052381 | 2.728571 |
16508 | -117.56 | 33.88 | 40.0 | 1196.0 | 294.0 | 1052.0 | 258.0 | 2.0682 | INLAND | 4.635659 | 4.077519 |
16509 | -116.4 | 34.09 | 9.0 | 4855.0 | 872.0 | 2098.0 | 765.0 | 3.2723 | INLAND | 6.346405 | 2.742484 |
16510 | -118.01 | 33.82 | 31.0 | 1960.0 | 380.0 | 1356.0 | 356.0 | 4.0625 | <1H OCEAN | 5.505618 | 3.808989 |
16511 | -122.45 | 37.77 | 52.0 | 3095.0 | 682.0 | 1269.0 | 639.0 | 3.575 | NEAR BAY | 4.843505 | 1.985915 |
16512 rows × 11 columns
In [80]:
# preprocessing pipeline for numerical features
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median"))
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
In [81]:
housing.isnull().sum()
Out[81]:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        158
population              0
households              0
median_income           0
ocean_proximity         0
dtype: int64
In [82]:
housing_num_tr
Out[82]:
array([[-121.89 , 37.29 , 38. , ..., 710. , 339. , 2.7042], [-121.93 , 37.05 , 14. , ..., 306. , 113. , 6.4214], [-117.2 , 32.77 , 31. , ..., 936. , 462. , 2.8621], ..., [-116.4 , 34.09 , 9. , ..., 2098. , 765. , 3.2723], [-118.01 , 33.82 , 31. , ..., 1356. , 356. , 4.0625], [-122.45 , 37.77 , 52. , ..., 1269. , 639. , 3.575 ]])
In [83]:
pd.DataFrame(housing_num_tr, columns=housing.columns[:-1]).isnull().sum()
Out[83]:
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
dtype: int64
In [84]:
# preprocessing pipeline for numerical features
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder())
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
In [85]:
housing_num_tr.shape
Out[85]:
(16512, 11)
In [86]:
housing_num_tr[0]
Out[86]:
array([-1.21890000e+02, 3.72900000e+01, 3.80000000e+01, 1.56800000e+03, 3.51000000e+02, 7.10000000e+02, 3.39000000e+02, 2.70420000e+00, 4.62536873e+00, 2.09439528e+00, 2.23852041e-01])
In [87]:
# preprocessing pipeline for numerical features
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
In [88]:
housing_num_tr.mean()
Out[88]:
-1.9225479402820057e-16
In [89]:
housing_num_tr.std()
Out[89]:
0.9999999999999999
In [90]:
g = pd.DataFrame(housing_num_tr, columns= list(housing.columns[:-1]) + ['rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
- All values are now centered near 0
In [91]:
g.plot(figsize=(20, 10))
Out[91]:
<AxesSubplot:>
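On toy data, StandardScaler's output can be checked against the manual formula (x − mean) / std. The sketch below uses a hypothetical 4×2 matrix, not the housing features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: two columns on very different scales
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaled = StandardScaler().fit_transform(X)
# StandardScaler uses the population std (ddof=0), matching np.std's default
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and unit standard deviation, which is what the plot above shows for the housing features.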
In [92]:
housing_cat = housing[['ocean_proximity']]
In [93]:
housing_cat
Out[93]:
ocean_proximity | |
---|---|
17606 | <1H OCEAN |
18632 | <1H OCEAN |
14650 | NEAR OCEAN |
3230 | INLAND |
3555 | <1H OCEAN |
... | ... |
6563 | INLAND |
12053 | INLAND |
13908 | INLAND |
11159 | <1H OCEAN |
15775 | NEAR BAY |
16512 rows × 1 columns
In [94]:
housing_cat = housing[['ocean_proximity']]
In [95]:
# preprocessing pipeline for the categorical feature
cat_pipeline = Pipeline([
    ('cat', OneHotEncoder(sparse=False)),
])
housing_cat_tr = cat_pipeline.fit_transform(housing_cat)
In [96]:
housing_cat_tr
Out[96]:
array([[1., 0., 0., 0., 0.], [1., 0., 0., 0., 0.], [0., 0., 0., 0., 1.], ..., [0., 1., 0., 0., 0.], [1., 0., 0., 0., 0.], [0., 0., 0., 1., 0.]])
- ColumnTransformer : applies a different transformation per column
In [97]:
from sklearn.compose import ColumnTransformer
In [98]:
num_attribs = list(housing_num.columns)
cat_attribs = ['ocean_proximity']
In [99]:
num_attribs
Out[99]:
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
In [100]:
cat_attribs
Out[100]:
['ocean_proximity']
(name, pipeline/transformer, column names)
In [101]:
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', cat_pipeline, cat_attribs)
    # ('cat', OneHotEncoder(), cat_attribs)
])
In [102]:
housing_prepared = full_pipeline.fit_transform(housing)
In [103]:
housing_prepared.shape
Out[103]:
(16512, 16)
In [104]:
full_pipeline2 = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', 'drop', cat_attribs)
])
tmp = full_pipeline2.fit_transform(housing)
tmp.shape
Out[104]:
(16512, 11)
In [105]:
full_pipeline3 = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', 'passthrough', cat_attribs)
])
tmp = full_pipeline3.fit_transform(housing)
tmp.shape
Out[105]:
(16512, 12)
In [106]:
tmp[0]
Out[106]:
array([-1.1560428086829155, 0.7719496164846016, 0.7433308916510305, -0.49323393384425046, -0.4454382074687401, -0.6362114070375079, -0.4206984222235789, -0.6149374443958345, -0.31205451913809157, -0.0864987054157523, 0.15531753037148296, '<1H OCEAN'], dtype=object)
Model Selection and Training
In [107]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_label)
Out[107]:
LinearRegression()
In [108]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_label, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
Out[108]:
68628.19819848923
- Note: RMSE can also be computed directly
In [109]:
housing_predictions = lin_reg.predict(housing_prepared)
lin_rmse = mean_squared_error(housing_label, housing_predictions, squared=False)
lin_rmse
Out[109]:
68628.19819848923
In [110]:
from sklearn.metrics import mean_absolute_error
lin_mae = mean_absolute_error(housing_label, housing_predictions)
lin_mae
Out[110]:
49439.89599001897
In [111]:
lin_reg.score(housing_prepared, housing_label)
Out[111]:
0.6481624842804428
In [112]:
from sklearn.metrics import r2_score
r2_score(housing_label, housing_predictions)
Out[112]:
0.6481624842804428
- Decision tree model
In [113]:
from sklearn.tree import DecisionTreeRegressor
In [114]:
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_label)
housing_predictions = tree_reg.predict(housing_prepared)
- tree_rmse comes out as 0 (overfitting)
In [115]:
tree_rmse = mean_squared_error(housing_label, housing_predictions, squared=False)
tree_rmse
Out[115]:
0.0
In [116]:
tree_reg.score(housing_prepared, housing_label)
Out[116]:
1.0
- Evaluation using cross-validation
In [117]:
from sklearn.model_selection import cross_val_score
In [118]:
scores = cross_val_score(tree_reg, housing_prepared, housing_label, scoring='neg_mean_squared_error', cv=10)
tree_rmse_scores = np.sqrt(-scores)
tree_rmse_scores.mean()  # mean of the 10 fold scores
Out[118]:
71407.68766037929
In [119]:
scores = cross_val_score(lin_reg, housing_prepared, housing_label, scoring='neg_mean_squared_error', cv=10)
lin_rmse_scores = np.sqrt(-scores)
lin_rmse_scores.mean()  # mean of the 10 fold scores
Out[119]:
69052.46136345083
- Random forest model
In [120]:
from sklearn.ensemble import RandomForestRegressor
- n_estimators=100 (the default) : the ensemble combines 100 models (trees)
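That default is quick to verify on a tiny synthetic regression problem (make_regression here is just a stand-in dataset, not the housing data):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic problem so that fitting 100 trees is instant
X, y = make_regression(n_samples=100, n_features=4, random_state=42)

# n_estimators is not specified, so it falls back to the default of 100
forest = RandomForestRegressor(random_state=42).fit(X, y)
print(len(forest.estimators_))  # number of individual trees in the ensemble
```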
In [121]:
forest_reg = RandomForestRegressor(random_state=42)
In [122]:
%time forest_reg.fit(housing_prepared, housing_label)
Wall time: 16.1 s
Out[122]:
RandomForestRegressor(random_state=42)
In [123]:
housing_predictions = forest_reg.predict(housing_prepared)
In [124]:
forest_rmse = mean_squared_error(housing_label, housing_predictions, squared=False)
forest_rmse
Out[124]:
18603.515021376355
In [125]:
%time forest_scores = cross_val_score(forest_reg, housing_prepared, housing_label, scoring="neg_mean_squared_error", cv=10)
Wall time: 2min 21s
In [126]:
forest_rmse_scores = np.sqrt(-forest_scores)
In [127]:
forest_rmse_scores
Out[127]:
array([49519.80364233, 47461.9115823 , 50029.02762854, 52325.28068953, 49308.39426421, 53446.37892622, 48634.8036574 , 47585.73832311, 53490.10699751, 50021.5852922 ])
In [128]:
forest_rmse_scores.mean()
Out[128]:
50182.303100336096
Fine-Tuning the Model
In [129]:
from sklearn.model_selection import GridSearchCV
- n_estimators : number of estimators used in the ensemble
- max_features : maximum number of features to consider
- bootstrap : True (sampling with replacement), False (without replacement)
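The size of this search space can be checked with sklearn's ParameterGrid; this sketch only verifies the counts, not the tuning itself:

```python
from sklearn.model_selection import ParameterGrid

# Same two sub-grids as in the notebook
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

combos = list(ParameterGrid(param_grid))
print(len(combos))  # 3*4 + 2*3 = 18 candidates; with cv=5 that is 90 fits
```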
In [130]:
# hyperparameter combinations to test: (12 + 6)
param_grid = [
    {'n_estimators' : [3, 10, 30], 'max_features' : [2, 4, 6, 8]},
    {'bootstrap' : [False], 'n_estimators' : [3, 10], 'max_features' : [2, 3, 4]}
]
In [131]:
forest_reg = RandomForestRegressor(random_state=42)
In [132]:
grid_search = GridSearchCV(forest_reg, param_grid,
scoring='neg_mean_squared_error', cv=5,
return_train_score=True, n_jobs=-1) # (12+6) * 5 = 90 rounds of training and validation
In [133]:
grid_search.fit(housing_prepared, housing_label)
Out[133]:
GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), n_jobs=-1, param_grid=[{'max_features': [2, 4, 6, 8], 'n_estimators': [3, 10, 30]}, {'bootstrap': [False], 'max_features': [2, 3, 4], 'n_estimators': [3, 10]}], return_train_score=True, scoring='neg_mean_squared_error')
In [134]:
grid_search.best_params_
Out[134]:
{'max_features': 8, 'n_estimators': 30}
In [135]:
grid_search.best_estimator_
Out[135]:
RandomForestRegressor(max_features=8, n_estimators=30, random_state=42)
In [136]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)
63669.11631261028 {'max_features': 2, 'n_estimators': 3}
55627.099719926795 {'max_features': 2, 'n_estimators': 10}
53384.57275149205 {'max_features': 2, 'n_estimators': 30}
60965.950449450494 {'max_features': 4, 'n_estimators': 3}
52741.04704299915 {'max_features': 4, 'n_estimators': 10}
50377.40461678399 {'max_features': 4, 'n_estimators': 30}
58663.93866579625 {'max_features': 6, 'n_estimators': 3}
52006.19873526564 {'max_features': 6, 'n_estimators': 10}
50146.51167415009 {'max_features': 6, 'n_estimators': 30}
57869.25276169646 {'max_features': 8, 'n_estimators': 3}
51711.127883959234 {'max_features': 8, 'n_estimators': 10}
49682.273345071546 {'max_features': 8, 'n_estimators': 30}
62895.06951262424 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.176157539405 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.40652318466 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52724.9822587892 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.5691951261 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.495668875716 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
In [137]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint # discrete uniform distribution for sampling hyperparameter values
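scipy's randint is a frozen discrete-uniform distribution: RandomizedSearchCV calls its .rvs() method to draw candidate values on [low, high). A quick look at what it samples:

```python
from scipy.stats import randint

# Discrete uniform over {1, ..., 199}; note that high is exclusive
dist = randint(low=1, high=200)
samples = dist.rvs(size=5, random_state=42)
print(samples)  # five integers, each in [1, 200)
```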
In [138]:
param_distribs = {
'n_estimators' : randint(low=1, high=200),
'max_features' : randint(low=1, high=8),
}
In [139]:
forest_reg = RandomForestRegressor(random_state=42)
In [140]:
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
n_iter=10, cv=5, scoring="neg_mean_squared_error",
random_state=42, n_jobs=-1)
In [141]:
rnd_search.fit(housing_prepared, housing_label)
Out[141]:
RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), n_jobs=-1, param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001864CE64B80>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001864CE648E0>}, random_state=42, scoring='neg_mean_squared_error')
In [142]:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
print(np.sqrt(-mean_score), params)
49150.70756927707 {'max_features': 7, 'n_estimators': 180}
51389.889203389284 {'max_features': 5, 'n_estimators': 15}
50796.155224308866 {'max_features': 3, 'n_estimators': 72}
50835.13360315349 {'max_features': 5, 'n_estimators': 21}
49280.9449827171 {'max_features': 7, 'n_estimators': 122}
50774.90662363929 {'max_features': 3, 'n_estimators': 75}
50682.78888164288 {'max_features': 3, 'n_estimators': 88}
49608.99608105296 {'max_features': 5, 'n_estimators': 100}
50473.61930350219 {'max_features': 3, 'n_estimators': 150}
64429.84143294435 {'max_features': 5, 'n_estimators': 2}
- .feature_importances_ : the importance score of each feature
In [143]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
Out[143]:
array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02, 1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01, 5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02, 1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])
In [144]:
type(full_pipeline)
Out[144]:
sklearn.compose._column_transformer.ColumnTransformer
In [145]:
full_pipeline.named_transformers_
Out[145]:
{'num': Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('attribs_adder', CombinedAttributesAdder()), ('std_scaler', StandardScaler())]), 'cat': Pipeline(steps=[('cat', OneHotEncoder(sparse=False))])}
In [146]:
type(full_pipeline.named_transformers_)
Out[146]:
sklearn.utils.Bunch
In [147]:
type(full_pipeline.named_transformers_["cat"])
Out[147]:
sklearn.pipeline.Pipeline
In [148]:
type(full_pipeline.named_transformers_["cat"]["cat"])
Out[148]:
sklearn.preprocessing._encoders.OneHotEncoder
In [155]:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
#cat_encoder = cat_pipeline.named_steps["cat_encoder"] # old approach
cat_one_hot_attribs = list(full_pipeline.named_transformers_["cat"]["cat"].categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
Out[155]:
[(0.36615898061813423, 'median_income'),
 (0.16478099356159054, 'INLAND'),
 (0.10879295677551575, 'pop_per_hhold'),
 (0.07334423551601243, 'longitude'),
 (0.06290907048262032, 'latitude'),
 (0.056419179181954014, 'rooms_per_hhold'),
 (0.053351077347675815, 'bedrooms_per_room'),
 (0.04114379847872964, 'housing_median_age'),
 (0.014874280890402769, 'population'),
 (0.014672685420543239, 'total_rooms'),
 (0.014257599323407808, 'households'),
 (0.014106483453584104, 'total_bedrooms'),
 (0.010311488326303788, '<1H OCEAN'),
 (0.0028564746373201584, 'NEAR OCEAN'),
 (0.0019604155994780706, 'NEAR BAY'),
 (6.0280386727366e-05, 'ISLAND')]
In [173]:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
#cat_encoder = cat_pipeline.named_steps["cat_encoder"] # old approach
cat_encoder2 = full_pipeline.named_transformers_["cat"]["cat"]
cat_one_hot_attribs2 = list(cat_encoder2.get_feature_names(['ocean_proximity']))
attributes = num_attribs + extra_attribs + cat_one_hot_attribs2
sorted(zip(feature_importances, attributes), reverse=True)
Out[173]:
[(0.36615898061813423, 'median_income'),
 (0.16478099356159054, 'ocean_proximity_INLAND'),
 (0.10879295677551575, 'pop_per_hhold'),
 (0.07334423551601243, 'longitude'),
 (0.06290907048262032, 'latitude'),
 (0.056419179181954014, 'rooms_per_hhold'),
 (0.053351077347675815, 'bedrooms_per_room'),
 (0.04114379847872964, 'housing_median_age'),
 (0.014874280890402769, 'population'),
 (0.014672685420543239, 'total_rooms'),
 (0.014257599323407808, 'households'),
 (0.014106483453584104, 'total_bedrooms'),
 (0.010311488326303788, 'ocean_proximity_<1H OCEAN'),
 (0.0028564746373201584, 'ocean_proximity_NEAR OCEAN'),
 (0.0019604155994780706, 'ocean_proximity_NEAR BAY'),
 (6.0280386727366e-05, 'ocean_proximity_ISLAND')]
In [ ]: