ML(0809_day4) - Exercise: Predict survival on the Titanic - using scikit-learn
_JAEJAE_ 2021. 8. 9. 20:01
Taking on the Titanic dataset
- The goal is to predict whether a passenger survived, based on attributes such as age, sex, passenger class, and port of embarkation
1. Loading the data
In [1]:
import pandas as pd
train_data = pd.read_csv("datasets/titanic_train.csv")
test_data = pd.read_csv("datasets/titanic_test.csv")
2. Exploring the data
A look at train_data
In [2]:
train_data.head()
Out[2]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
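- Survived: the target. 0 means the passenger did not survive, 1 means the passenger survived
- Pclass: passenger class (1st, 2nd, or 3rd)
- Name, Sex, Age: self-explanatory
- SibSp: number of siblings and spouses aboard with the passenger
- Parch: number of children and parents aboard with the passenger
- Ticket: ticket ID
- Fare: ticket price (in pounds)
- Cabin: cabin number
- Embarked: where the passenger boarded. C (Cherbourg), Q (Queenstown), S (Southampton)
Checking for missing data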
In [3]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
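- Some values of the Age, Cabin, and Embarked attributes are null
- Cabin in particular is 77% null (only 204 of 891 values are present), so ignore Cabin for now and work with the rest
- Age has 177 nulls (about 20%), so we need to decide how to handle them. Replacing the nulls with the median age is one option
- The Name and Ticket attributes are somewhat tricky to convert to numbers, so they are ignored for now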
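The null percentages quoted above can be read off directly instead of being worked out by hand from the `info()` output. A minimal sketch, assuming a small made-up frame in place of `train_data` (the values are illustrative, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isna() marks missing cells; the column-wise mean of that boolean mask
# is the fraction of missing values per column.
missing_ratio = toy.isna().mean()
print(missing_ratio)
```

On the real data, `train_data.isna().mean()` should give the same kind of per-column summary.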
Summary statistics
In [4]:
train_data.describe()
Out[4]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
- Only 38% survived
- The mean Fare is 32.20 pounds
- The mean Age is just under 30
Check that Survived (the target, in machine learning terms) contains only 0 and 1
In [5]:
train_data['Survived'].value_counts()
Out[5]:
0    549
1    342
Name: Survived, dtype: int64
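The same check can also be written as an explicit test rather than read from the counts. A small sketch, assuming a made-up target column in place of `train_data['Survived']`:

```python
import pandas as pd

# Toy target column standing in for train_data['Survived'] (made-up values).
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

# The target is binary if every value is 0 or 1 (a missing value would fail isin).
is_binary = bool(survived.isin([0, 1]).all())

# normalize=True returns class proportions instead of raw counts.
class_share = survived.value_counts(normalize=True)
print(is_binary, class_share.to_dict())
```

With the counts shown above (549 and 342), the proportions on the real data come out to roughly 62% and 38%, which matches the survival rate seen in `describe()`.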
Checking the categorical attributes
In [6]:
train_data['Pclass'].value_counts()
Out[6]:
3    491
1    216
2    184
Name: Pclass, dtype: int64
In [7]:
train_data['Sex'].value_counts()
Out[7]:
male      577
female    314
Name: Sex, dtype: int64
In [8]:
train_data['Embarked'].value_counts()
Out[8]:
S    644
C    168
Q     77
Name: Embarked, dtype: int64
The Embarked attribute is where the passenger boarded: C=Cherbourg, Q=Queenstown, S=Southampton.
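The three Embarked counts add up to 889 rather than 891 because `value_counts()` leaves out missing values by default. A short sketch of how to make them visible, assuming a made-up column in place of `train_data['Embarked']`:

```python
import numpy as np
import pandas as pd

# Toy column standing in for train_data['Embarked'] (made-up values).
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="Embarked")

# By default value_counts() skips NaN; dropna=False adds a row for it.
counts_default = embarked.value_counts()
counts_with_nan = embarked.value_counts(dropna=False)
print(counts_with_nan)
```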
3. Preprocessing pipeline
- Separate the features from the labels
In [9]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
In [10]:
data = train_data.drop('Survived', axis=1)
label = train_data['Survived'].copy()
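One detail worth noting in the split above is that `drop()` returns a new frame and leaves the original untouched. A minimal sketch with a made-up frame (the names `toy`, `features`, and `labels` are only for illustration):

```python
import pandas as pd

# Toy frame standing in for train_data (made-up values).
toy = pd.DataFrame({"Survived": [0, 1, 1], "Age": [22.0, 38.0, 26.0]})

# drop() returns a new DataFrame; toy itself keeps the Survived column.
features = toy.drop("Survived", axis=1)
labels = toy["Survived"].copy()

print(list(features.columns), list(toy.columns))
```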
In [11]:
num_attribs = ['Age', 'SibSp', 'Parch', 'Fare']
cat_attribs = ['Pclass', 'Sex', 'Embarked']
- A custom transformer for the pipeline
In [12]:
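from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

col_names = "SibSp", "Parch"
# Column positions of SibSp and Parch within num_attribs
SibSp_ix, Parch_ix = [num_attribs.index(c) for c in col_names]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn; y=None keeps fit usable when a pipeline passes labels
        return self

    def transform(self, X):
        # SibSp + Parch + 1: relatives aboard plus the passenger themself
        RelativesOnboard = X[:, SibSp_ix] + X[:, Parch_ix] + 1
        return np.c_[X, RelativesOnboard]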
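The transformer above reads its column positions from module-level variables. As a sketch of an alternative design, the positions can be passed in as constructor parameters, which keeps the class self-contained. `FamilySizeAdder` and its parameter names are hypothetical, not part of the original notebook:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class FamilySizeAdder(BaseEstimator, TransformerMixin):
    """Hypothetical variant of the transformer above; column positions are parameters."""

    def __init__(self, sibsp_ix=1, parch_ix=2):
        self.sibsp_ix = sibsp_ix
        self.parch_ix = parch_ix

    def fit(self, X, y=None):
        # Nothing to learn; accepting y keeps the method usable with labels.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Siblings/spouses + parents/children + the passenger themself.
        family_size = X[:, self.sibsp_ix] + X[:, self.parch_ix] + 1
        return np.c_[X, family_size]


# Columns: Age, SibSp, Parch, Fare (made-up rows).
X_demo = np.array([[22.0, 1, 0, 7.25], [28.0, 1, 2, 23.45]])
out = FamilySizeAdder().fit_transform(X_demo)
print(out[:, -1])
```

Because the positions are stored as attributes named after the constructor arguments, `get_params()` and `set_params()` inherited from `BaseEstimator` can see them.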
- Building the categorical pipeline
In [ ]:
# 1. Replace missing values with the most frequent value (most_frequent)
# 2. One-hot encoding
In [13]:
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # scikit-learn 1.2+ renames this argument to sparse_output (sparse was removed in 1.4)
    ('cat', OneHotEncoder(sparse=False))
])
In [14]:
tmp = cat_pipeline.fit_transform(data[cat_attribs])
In [15]:
tmp.shape
Out[15]:
(891, 8)
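The 8 output columns are one column per category value: 3 for Pclass, 2 for Sex, and 3 for Embarked. The sketch below reproduces the same two steps on a made-up array with only Sex and Embarked, so it should yield 2 + 3 = 5 columns. It relies on the encoder's default sparse output and converts it with `toarray()`, which sidesteps the `sparse` / `sparse_output` naming difference between scikit-learn versions:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Two categorical columns (Sex, Embarked) with one missing Embarked value (made-up rows).
X_cat = np.array(
    [["male", "S"], ["female", "C"], ["female", np.nan], ["male", "S"], ["male", "Q"]],
    dtype=object,
)

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder()),  # default output is a SciPy sparse matrix
])

# toarray() converts the sparse result to a dense NumPy array for inspection.
encoded = demo_pipeline.fit_transform(X_cat).toarray()
print(encoded.shape)
```

The missing Embarked value in the third row is filled with "S", the most frequent value in that column, before encoding.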
- Building the numeric pipeline
In [ ]:
# 1. Replace missing values with the median
In [16]:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder())
])
In [17]:
tmp2 = num_pipeline.fit_transform(data[num_attribs])
In [18]:
tmp2.shape
Out[18]:
(891, 5)
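The 5 columns here are the 4 numeric attributes plus the added RelativesOnboard column. The imputation step on its own can be sketched as follows, using a made-up Age column (the variable names are only for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column (Age) with a missing entry (made-up values).
ages = np.array([[22.0], [np.nan], [26.0], [35.0]])

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(ages)

# statistics_ stores the per-column value learned during fit (here, the median).
print(imputer.statistics_, filled.ravel())
```

The median is learned in `fit` and stored, so a later `transform` on other data (for example the test set) would reuse the training median rather than recompute it.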
- Combining the categorical pipeline and the numeric pipeline
In [19]:
from sklearn.compose import ColumnTransformer
In [20]:
full_pipeline = ColumnTransformer([
    ('cat', cat_pipeline, cat_attribs),
    ('num', num_pipeline, num_attribs)
])
In [21]:
data_prepared = full_pipeline.fit_transform(data)
In [22]:
data_prepared.shape
Out[22]:
(891, 13)
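The 13 columns are the 8 categorical columns followed by the 5 numeric ones, in the order the transformers are listed. A reduced sketch of the same idea, assuming a made-up frame with one categorical and two numeric columns (`sparse_threshold=0` is set here only to force a dense result):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one categorical and two numeric columns (made-up values).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.93],
})

demo_transformer = ColumnTransformer(
    [
        ("cat", OneHotEncoder(), ["Sex"]),
        ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
    ],
    sparse_threshold=0,  # always return a dense array, whatever the encoder emits
)

prepared = demo_transformer.fit_transform(toy)
# Output columns follow the transformer order: Sex_female, Sex_male, Age, Fare.
print(prepared.shape)
```

Columns that are not listed in any transformer (PassengerId, Name, Ticket, and Cabin in the notebook) are dropped by default, which is why they do not appear in `data_prepared`.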
In [23]:
data.head(10)
Out[23]:
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
In [96]:
# For reference: collecting the raw category values by hand from categories_
# columns = []
# cat_encoder = full_pipeline.named_transformers_["cat"]["cat"]
# for i in range(len(cat_encoder.categories_)):
# columns.extend(cat_encoder.categories_[i])
# columns
- Method 1: build the column names from the fitted one-hot encoder's categories (via get_feature_names)
In [24]:
cat_encoder = full_pipeline.named_transformers_["cat"]["cat"]
# get_feature_names was removed in scikit-learn 1.2; newer versions use get_feature_names_out
columns = list(cat_encoder.get_feature_names(cat_attribs))
columns
Out[24]:
['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
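`get_feature_names` was deprecated in scikit-learn 1.0 and removed in 1.2 in favor of `get_feature_names_out`, so the cell above depends on the installed version. One way to write it so that it runs on either side of that change is sketched below, on a made-up array rather than the notebook's fitted pipeline:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit an encoder on two toy categorical columns (made-up rows).
X_cat = np.array([["male", "S"], ["female", "C"], ["male", "Q"]], dtype=object)
encoder = OneHotEncoder().fit(X_cat)

# Prefer the newer method name and fall back to the older one if it is absent.
get_names = getattr(encoder, "get_feature_names_out", None) or encoder.get_feature_names
columns = [str(name) for name in get_names(["Sex", "Embarked"])]
print(columns)
```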
In [98]:
pd.DataFrame(data_prepared, columns = columns+num_attribs+['RelativesOnboard'])
Out[98]:
| | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Age | SibSp | Parch | Fare | RelativesOnboard |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 22.0 | 1.0 | 0.0 | 7.2500 | 2.0 |
1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 38.0 | 1.0 | 0.0 | 71.2833 | 2.0 |
2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 7.9250 | 1.0 |
3 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 35.0 | 1.0 | 0.0 | 53.1000 | 2.0 |
4 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 35.0 | 0.0 | 0.0 | 8.0500 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 13.0000 | 1.0 |
887 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 19.0 | 0.0 | 0.0 | 30.0000 | 1.0 |
888 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 28.0 | 1.0 | 2.0 | 23.4500 | 4.0 |
889 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 26.0 | 0.0 | 0.0 | 30.0000 | 1.0 |
890 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 32.0 | 0.0 | 0.0 | 7.7500 | 1.0 |
891 rows × 13 columns
- Method 2: specify the column names by hand
In [35]:
columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'female', 'male', 'Embarked_C', 'Embarked_Q', 'Embarked_S'] + num_attribs + ['RelativesOnboard']
pd.DataFrame(data_prepared, columns=columns)
Out[35]:
| | Pclass_1 | Pclass_2 | Pclass_3 | female | male | Embarked_C | Embarked_Q | Embarked_S | Age | SibSp | Parch | Fare | RelativesOnboard |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 22.0 | 1.0 | 0.0 | 7.2500 | 2.0 |
1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 38.0 | 1.0 | 0.0 | 71.2833 | 2.0 |
2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 7.9250 | 1.0 |
3 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 35.0 | 1.0 | 0.0 | 53.1000 | 2.0 |
4 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 35.0 | 0.0 | 0.0 | 8.0500 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 13.0000 | 1.0 |
887 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 19.0 | 0.0 | 0.0 | 30.0000 | 1.0 |
888 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 28.0 | 1.0 | 2.0 | 23.4500 | 4.0 |
889 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 26.0 | 0.0 | 0.0 | 30.0000 | 1.0 |
890 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 32.0 | 0.0 | 0.0 | 7.7500 | 1.0 |
891 rows × 13 columns