공부 기록
Pandas(0728_day4) - 실습_타이타닉 데이터셋 본문
타이타닉 데이터셋¶
In [1]:
import seaborn as sns
import pandas as pd
# titanic 데이터셋 가져오기
df = sns.load_dataset('titanic')
df
Out[1]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S | Second | man | True | NaN | Southampton | no | True |
887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
888 | 0 | 3 | female | NaN | 1 | 2 | 23.4500 | S | Third | woman | False | NaN | Southampton | no | False |
889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |
890 | 0 | 3 | male | 32.0 | 0 | 0 | 7.7500 | Q | Third | man | True | NaN | Queenstown | no | True |
891 rows × 15 columns
- 누락 데이터 확인
In [2]:
# deck 열의 빈도수 구하기(NaN 포함)
In [3]:
df['deck'].value_counts(dropna=False)
Out[3]:
NaN 688 C 59 B 47 D 33 E 32 A 15 F 13 G 4 Name: deck, dtype: int64
In [4]:
#deck 열의 NaN 개수 계산하기
In [5]:
df['deck'].isnull().sum()
Out[5]:
688
In [6]:
# 전체 df의 누락 데이터 개수 구하기
In [7]:
df.isnull().sum()
Out[7]:
survived 0 pclass 0 sex 0 age 177 sibsp 0 parch 0 fare 0 embarked 2 class 0 who 0 adult_male 0 deck 688 embark_town 2 alive 0 alone 0 dtype: int64
- 누락 데이터 제거
In [8]:
# 500개 이상 남아있지 않은 열을 모두 삭제
In [9]:
df.dropna(thresh=500, axis = 1, inplace=True)
In [10]:
# 삭제 되었는지 전체 컬럼 명 출력
In [11]:
df.columns
Out[11]:
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive', 'alone'], dtype='object')
In [12]:
# age 열에 데이터가 없는 모든 행을 삭제 (subset 옵션 이용)
In [13]:
# 삭제 후 길이 확인
In [14]:
dfff = df.dropna(subset=['age'])
len(dfff)
Out[14]:
714
- 평균으로 누락 데이터 바꾸기
In [15]:
# age 열의 첫 10개 데이터 출력 (5 행에 NaN 값 확인하기)
In [16]:
df['age'].head(10)
Out[16]:
0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 5 NaN 6 54.0 7 2.0 8 27.0 9 14.0 Name: age, dtype: float64
In [17]:
# age 열의 평균 계산
In [18]:
df['age'].mean()
Out[18]:
29.69911764705882
In [19]:
# age 열의 NaN값을 위에서 구한 평균으로 변경하기, 원본에 적용하기
In [20]:
df['age'].fillna(df['age'].mean(), inplace=True)
In [21]:
# age 열의 첫 10개 데이터 출력 (5 행에 NaN 값이 평균으로 대체된것 확인)
In [22]:
df['age'].head(10)
Out[22]:
0 22.000000 1 38.000000 2 26.000000 3 35.000000 4 35.000000 5 29.699118 6 54.000000 7 2.000000 8 27.000000 9 14.000000 Name: age, dtype: float64
- 가장 많이 나타나는 값으로 바꾸기
In [23]:
# embark_town 열의 825행에서 829행까지 출력하기 (829행의 NaN값 확인)
In [24]:
df['embark_town'][825:830]
Out[24]:
825 Queenstown 826 Southampton 827 Cherbourg 828 Queenstown 829 NaN Name: embark_town, dtype: object
In [25]:
# embark_town 열에 대해 빈도수 확인하기
In [26]:
df['embark_town'].value_counts()
Out[26]:
Southampton 644 Cherbourg 168 Queenstown 77 Name: embark_town, dtype: int64
In [27]:
# embark_town 열에서 가장 많이 출현한 값(도시) 구하기
In [28]:
df['embark_town'].value_counts().idxmax()
Out[28]:
'Southampton'
In [29]:
df['embark_town'].mode()[0]
Out[29]:
'Southampton'
In [30]:
# embark_town 열의 NaN값을 위에서 구한 값(가장 많이 출현한 값)으로 채우기, 원본에 적용하기
In [31]:
df['embark_town'].fillna(df['embark_town'].mode()[0], inplace=True)
In [32]:
# embark_town 열의 825행에서 829행까지 출력하기 (NaN 값이 가장 많이 출현한 값으로 대체된 것 확인)
In [33]:
df['embark_town'][825:830]
Out[33]:
825 Queenstown 826 Southampton 827 Cherbourg 828 Queenstown 829 Southampton Name: embark_town, dtype: object
- 이웃하고 있는 값으로 바꾸기
In [34]:
# titanic 데이터셋 가져오기
df = sns.load_dataset('titanic')
In [35]:
# embark_town 열의 825행에서 829행까지 출력하기 (829행의 NaN값 확인)
In [36]:
df['embark_town'][825:830]
Out[36]:
825 Queenstown 826 Southampton 827 Cherbourg 828 Queenstown 829 NaN Name: embark_town, dtype: object
In [37]:
# embark_town 열의 NaN값을 바로 앞에 있는 828행의 값으로 변경하기, 원본에 적용하기
In [38]:
df['embark_town'].fillna(method='ffill', inplace=True)
In [39]:
# embark_town 열의 825행에서 829행까지 출력하기 (NaN 값이 바로 앞에 있는 값으로 대체된 것 확인)
In [40]:
df['embark_town'][825:830]
Out[40]:
825 Queenstown 826 Southampton 827 Cherbourg 828 Queenstown 829 Queenstown Name: embark_town, dtype: object
- 중복 데이터 확인
In [41]:
import pandas as pd
In [42]:
# 중복 데이터를 갖는 데이터프레임 만들기
df = pd.DataFrame({'c1':['a', 'a', 'b', 'a', 'b'],
'c2':[1, 1, 1, 2, 2],
'c3':[1, 1, 2, 2, 2]})
df
Out[42]:
c1 | c2 | c3 | |
---|---|---|---|
0 | a | 1 | 1 |
1 | a | 1 | 1 |
2 | b | 1 | 2 |
3 | a | 2 | 2 |
4 | b | 2 | 2 |
In [43]:
# 데이터프레임 전체 행 데이터 중에서 중복값 찾기
In [44]:
df.duplicated()
Out[44]:
0 False 1 True 2 False 3 False 4 False dtype: bool
In [45]:
# 데이터프레임의 c2열 데이터에서 중복값 찾기
In [46]:
df['c2'].duplicated()
Out[46]:
0 False 1 True 2 True 3 False 4 True Name: c2, dtype: bool
- 중복 데이터 제거
In [47]:
# 중복 데이터를 갖는 데이터프레임 만들기
df = pd.DataFrame({'c1':['a', 'a', 'b', 'a', 'b'],
'c2':[1, 1, 1, 2, 2],
'c3':[1, 1, 2, 2, 2]})
df
Out[47]:
c1 | c2 | c3 | |
---|---|---|---|
0 | a | 1 | 1 |
1 | a | 1 | 1 |
2 | b | 1 | 2 |
3 | a | 2 | 2 |
4 | b | 2 | 2 |
In [48]:
# 데이터프레임에서 중복 행을 제거
In [49]:
df.drop_duplicates()
Out[49]:
c1 | c2 | c3 | |
---|---|---|---|
0 | a | 1 | 1 |
2 | b | 1 | 2 |
3 | a | 2 | 2 |
4 | b | 2 | 2 |
In [50]:
# c2, c3열을 기준으로 중복 행을 제거
In [51]:
df.drop_duplicates(['c2','c3'])
Out[51]:
c1 | c2 | c3 | |
---|---|---|---|
0 | a | 1 | 1 |
2 | b | 1 | 2 |
3 | a | 2 | 2 |
- 그룹 연산
In [3]:
import seaborn as sns
import pandas as pd
# titanic 데이터셋 가져오기
titanic = sns.load_dataset('titanic')
titanic.head()
Out[3]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
In [5]:
df = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]
df.head()
Out[5]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
0 | 22.0 | male | Third | 7.2500 | 0 |
1 | 38.0 | female | First | 71.2833 | 1 |
2 | 26.0 | female | Third | 7.9250 | 1 |
3 | 35.0 | female | First | 53.1000 | 1 |
4 | 35.0 | male | Third | 8.0500 | 0 |
In [8]:
grouped = df.groupby(['class']) #iterator로 나옴
grouped
Out[8]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F8CD812070>
In [9]:
len(grouped)
Out[9]:
3
In [13]:
for key, group in grouped:
print("key : ", key)
print(group)
key : First age sex class fare survived 1 38.0 female First 71.2833 1 3 35.0 female First 53.1000 1 6 54.0 male First 51.8625 0 11 58.0 female First 26.5500 1 23 28.0 male First 35.5000 1 .. ... ... ... ... ... 871 47.0 female First 52.5542 1 872 33.0 male First 5.0000 0 879 56.0 female First 83.1583 1 887 19.0 female First 30.0000 1 889 26.0 male First 30.0000 1 [216 rows x 5 columns] key : Second age sex class fare survived 9 14.0 female Second 30.0708 1 15 55.0 female Second 16.0000 1 17 NaN male Second 13.0000 1 20 35.0 male Second 26.0000 0 21 34.0 male Second 13.0000 1 .. ... ... ... ... ... 866 27.0 female Second 13.8583 1 874 28.0 female Second 24.0000 1 880 25.0 female Second 26.0000 1 883 28.0 male Second 10.5000 0 886 27.0 male Second 13.0000 0 [184 rows x 5 columns] key : Third age sex class fare survived 0 22.0 male Third 7.2500 0 2 26.0 female Third 7.9250 1 4 35.0 male Third 8.0500 0 5 NaN male Third 8.4583 0 7 2.0 male Third 21.0750 0 .. ... ... ... ... ... 882 22.0 female Third 10.5167 0 884 25.0 male Third 7.0500 0 885 39.0 female Third 29.1250 0 888 NaN female Third 23.4500 0 890 32.0 male Third 7.7500 0 [491 rows x 5 columns]
- 숫자에 해당하는 값들만 평균 구해줌¶
In [14]:
grouped.mean()
Out[14]:
age | fare | survived | |
---|---|---|---|
class | |||
First | 38.233441 | 84.154687 | 0.629630 |
Second | 29.877630 | 20.662183 | 0.472826 |
Third | 25.140620 | 13.675550 | 0.242363 |
- 원하는 그룹의 정보만 가져오기¶
In [16]:
grouped.get_group('Third')
Out[16]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
0 | 22.0 | male | Third | 7.2500 | 0 |
2 | 26.0 | female | Third | 7.9250 | 1 |
4 | 35.0 | male | Third | 8.0500 | 0 |
5 | NaN | male | Third | 8.4583 | 0 |
7 | 2.0 | male | Third | 21.0750 | 0 |
... | ... | ... | ... | ... | ... |
882 | 22.0 | female | Third | 10.5167 | 0 |
884 | 25.0 | male | Third | 7.0500 | 0 |
885 | 39.0 | female | Third | 29.1250 | 0 |
888 | NaN | female | Third | 23.4500 | 0 |
890 | 32.0 | male | Third | 7.7500 | 0 |
491 rows × 5 columns
In [20]:
grouped2 = df.groupby(["class", "sex"])
len(grouped2)
Out[20]:
6
In [27]:
for key, group in grouped2:
print('key :', key)
print('len :', len(group))
print(group)
print("*" * 100)
key : ('First', 'female') len : 94 age sex class fare survived 1 38.0 female First 71.2833 1 3 35.0 female First 53.1000 1 11 58.0 female First 26.5500 1 31 NaN female First 146.5208 1 52 49.0 female First 76.7292 1 .. ... ... ... ... ... 856 45.0 female First 164.8667 1 862 48.0 female First 25.9292 1 871 47.0 female First 52.5542 1 879 56.0 female First 83.1583 1 887 19.0 female First 30.0000 1 [94 rows x 5 columns] **************************************************************************************************** key : ('First', 'male') len : 122 age sex class fare survived 6 54.0 male First 51.8625 0 23 28.0 male First 35.5000 1 27 19.0 male First 263.0000 0 30 40.0 male First 27.7208 0 34 28.0 male First 82.1708 0 .. ... ... ... ... ... 839 NaN male First 29.7000 1 857 51.0 male First 26.5500 1 867 31.0 male First 50.4958 0 872 33.0 male First 5.0000 0 889 26.0 male First 30.0000 1 [122 rows x 5 columns] **************************************************************************************************** key : ('Second', 'female') len : 76 age sex class fare survived 9 14.0 female Second 30.0708 1 15 55.0 female Second 16.0000 1 41 27.0 female Second 21.0000 0 43 3.0 female Second 41.5792 1 53 29.0 female Second 26.0000 1 .. ... ... ... ... ... 854 44.0 female Second 26.0000 0 865 42.0 female Second 13.0000 1 866 27.0 female Second 13.8583 1 874 28.0 female Second 24.0000 1 880 25.0 female Second 26.0000 1 [76 rows x 5 columns] **************************************************************************************************** key : ('Second', 'male') len : 108 age sex class fare survived 17 NaN male Second 13.0 1 20 35.0 male Second 26.0 0 21 34.0 male Second 13.0 1 33 66.0 male Second 10.5 0 70 32.0 male Second 10.5 0 .. ... ... ... ... ... 848 28.0 male Second 33.0 0 861 21.0 male Second 11.5 0 864 24.0 male Second 13.0 0 883 28.0 male Second 10.5 0 886 27.0 male Second 13.0 0 [108 rows x 5 columns] **************************************************************************************************** key : ('Third', 'female') len : 144 age sex class fare survived 2 26.0 female Third 7.9250 1 8 27.0 female Third 11.1333 1 10 4.0 female Third 16.7000 1 14 14.0 female Third 7.8542 0 18 31.0 female Third 18.0000 0 .. ... ... ... ... ... 863 NaN female Third 69.5500 0 875 15.0 female Third 7.2250 1 882 22.0 female Third 10.5167 0 885 39.0 female Third 29.1250 0 888 NaN female Third 23.4500 0 [144 rows x 5 columns] **************************************************************************************************** key : ('Third', 'male') len : 347 age sex class fare survived 0 22.0 male Third 7.2500 0 4 35.0 male Third 8.0500 0 5 NaN male Third 8.4583 0 7 2.0 male Third 21.0750 0 12 20.0 male Third 8.0500 0 .. ... ... ... ... ... 877 19.0 male Third 7.8958 0 878 NaN male Third 7.8958 0 881 33.0 male Third 7.8958 0 884 25.0 male Third 7.0500 0 890 32.0 male Third 7.7500 0 [347 rows x 5 columns] ****************************************************************************************************
In [28]:
grouped2.mean()
Out[28]:
age | fare | survived | ||
---|---|---|---|---|
class | sex | |||
First | female | 34.611765 | 106.125798 | 0.968085 |
male | 41.281386 | 67.226127 | 0.368852 | |
Second | female | 28.722973 | 21.970121 | 0.921053 |
male | 30.740707 | 19.741782 | 0.157407 | |
Third | female | 21.750000 | 16.118810 | 0.500000 |
male | 26.507589 | 12.661633 | 0.135447 |
In [31]:
grouped2.get_group(("Third", "female"))
Out[31]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
2 | 26.0 | female | Third | 7.9250 | 1 |
8 | 27.0 | female | Third | 11.1333 | 1 |
10 | 4.0 | female | Third | 16.7000 | 1 |
14 | 14.0 | female | Third | 7.8542 | 0 |
18 | 31.0 | female | Third | 18.0000 | 0 |
... | ... | ... | ... | ... | ... |
863 | NaN | female | Third | 69.5500 | 0 |
875 | 15.0 | female | Third | 7.2250 | 1 |
882 | 22.0 | female | Third | 10.5167 | 0 |
885 | 39.0 | female | Third | 29.1250 | 0 |
888 | NaN | female | Third | 23.4500 | 0 |
144 rows × 5 columns
- 데이터 집계
In [32]:
grouped
Out[32]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F8CD812070>
In [33]:
grouped2
Out[33]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F8CF82ED60>
- 표준편차: 평균을 기준으로 얼마나 떨어져있는지¶
In [34]:
grouped.std()
Out[34]:
age | fare | survived | |
---|---|---|---|
class | |||
First | 14.802856 | 78.380373 | 0.484026 |
Second | 14.001077 | 13.417399 | 0.500623 |
Third | 12.495398 | 11.778142 | 0.428949 |
In [35]:
grouped.fare.std()
Out[35]:
class First 78.380373 Second 13.417399 Third 11.778142 Name: fare, dtype: float64
In [36]:
def min_max(x):
return x.max()-x.min()
In [38]:
grouped.agg(min_max)
Out[38]:
age | fare | survived | |
---|---|---|---|
class | |||
First | 79.08 | 512.3292 | 1 |
Second | 69.33 | 73.5000 | 1 |
Third | 73.58 | 69.5500 | 1 |
In [39]:
grouped.agg({'age':'mean', 'fare':['min','max']})
Out[39]:
age | fare | ||
---|---|---|---|
mean | min | max | |
class | |||
First | 38.233441 | 0.0 | 512.3292 |
Second | 29.877630 | 0.0 | 73.5000 |
Third | 25.140620 | 0.0 | 69.5500 |
In [40]:
grouped.apply(lambda x: x.describe())
Out[40]:
age | fare | survived | ||
---|---|---|---|---|
class | ||||
First | count | 186.000000 | 216.000000 | 216.000000 |
mean | 38.233441 | 84.154687 | 0.629630 | |
std | 14.802856 | 78.380373 | 0.484026 | |
min | 0.920000 | 0.000000 | 0.000000 | |
25% | 27.000000 | 30.923950 | 0.000000 | |
50% | 37.000000 | 60.287500 | 1.000000 | |
75% | 49.000000 | 93.500000 | 1.000000 | |
max | 80.000000 | 512.329200 | 1.000000 | |
Second | count | 173.000000 | 184.000000 | 184.000000 |
mean | 29.877630 | 20.662183 | 0.472826 | |
std | 14.001077 | 13.417399 | 0.500623 | |
min | 0.670000 | 0.000000 | 0.000000 | |
25% | 23.000000 | 13.000000 | 0.000000 | |
50% | 29.000000 | 14.250000 | 0.000000 | |
75% | 36.000000 | 26.000000 | 1.000000 | |
max | 70.000000 | 73.500000 | 1.000000 | |
Third | count | 355.000000 | 491.000000 | 491.000000 |
mean | 25.140620 | 13.675550 | 0.242363 | |
std | 12.495398 | 11.778142 | 0.428949 | |
min | 0.420000 | 0.000000 | 0.000000 | |
25% | 18.000000 | 7.750000 | 0.000000 | |
50% | 24.000000 | 8.050000 | 0.000000 | |
75% | 32.000000 | 15.500000 | 0.000000 | |
max | 74.000000 | 69.550000 | 1.000000 |
- 피벗
In [41]:
import seaborn as sns
import pandas as pd
# titanic 데이터셋 가져오기
titanic = sns.load_dataset('titanic')
df = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]
df.head()
Out[41]:
age | sex | class | fare | survived | |
---|---|---|---|---|---|
0 | 22.0 | male | Third | 7.2500 | 0 |
1 | 38.0 | female | First | 71.2833 | 1 |
2 | 26.0 | female | Third | 7.9250 | 1 |
3 | 35.0 | female | First | 53.1000 | 1 |
4 | 35.0 | male | Third | 8.0500 | 0 |
In [43]:
pd.pivot_table(df, index="class", columns="sex", values="age", aggfunc="mean")
Out[43]:
sex | female | male |
---|---|---|
class | ||
First | 34.611765 | 41.281386 |
Second | 28.722973 | 30.740707 |
Third | 21.750000 | 26.507589 |
In [48]:
pd.pivot_table(df, index="class", columns="sex", values="survived", aggfunc="mean")
Out[48]:
sex | female | male |
---|---|---|
class | ||
First | 0.968085 | 0.368852 |
Second | 0.921053 | 0.157407 |
Third | 0.500000 | 0.135447 |
In [50]:
pd.pivot_table(df, index="class", columns="sex", values=["age", "survived"], aggfunc=["mean", "max"])
Out[50]:
mean | max | |||||||
---|---|---|---|---|---|---|---|---|
age | survived | age | survived | |||||
sex | female | male | female | male | female | male | female | male |
class | ||||||||
First | 34.611765 | 41.281386 | 0.968085 | 0.368852 | 63.0 | 80.0 | 1 | 1 |
Second | 28.722973 | 30.740707 | 0.921053 | 0.157407 | 57.0 | 70.0 | 1 | 1 |
Third | 21.750000 | 26.507589 | 0.500000 | 0.135447 | 63.0 | 74.0 | 1 | 1 |
- 참고
- agg(Series -> Scalar)
- filter(DataFrame -> Scalar)
- transform(DataFrame -> DataFrame)
- apply(DataFrame -> 모든양식)
In [ ]:
'playdata' 카테고리의 다른 글
ML(0805_day2) - 머신러닝 개요 정리 문제 (0) | 2021.08.05 |
---|---|
ML(0804_day1) - 실습_붓꽃 데이터 품종 예측하기 (0) | 2021.08.04 |
Pandas(0728_day4) - 실습_데이터 합치기 (0) | 2021.07.28 |
Pandas(0728_day4) - 실습_자동차 연비 분석 (데이터 전처리 및 시각화) (0) | 2021.07.28 |
Pandas(0728_day4) (0) | 2021.07.28 |
Comments