Function application in pandas
In [115]:
import numpy as np
import pandas as pd
apply and applymap
In [139]:
df = pd.DataFrame(np.random.randn(5,4))
df
Out[139]:
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.026993 | 1.178644 | 0.678732 | -1.715583 |
1 | -0.806836 | 1.102237 | 0.443870 | 1.269880 |
2 | -0.139036 | -1.545954 | 0.148033 | -0.002969 |
3 | -1.357422 | -2.524853 | 0.371997 | 1.332291 |
4 | -1.133313 | 0.921616 | 1.372075 | -0.469255 |
In [140]:
# 1.1 pandas objects can be passed directly to NumPy functions
np.abs(df)
Out[140]:
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.026993 | 1.178644 | 0.678732 | 1.715583 |
1 | 0.806836 | 1.102237 | 0.443870 | 1.269880 |
2 | 0.139036 | 1.545954 | 0.148033 | 0.002969 |
3 | 1.357422 | 2.524853 | 0.371997 | 1.332291 |
4 | 1.133313 | 0.921616 | 1.372075 | 0.469255 |
In [141]:
# 1.2 Use apply to run a function on each column (or each row)
f = lambda x: x.max()
df.apply(f,axis='index')
Out[141]:
0    0.026993
1    1.178644
2    1.372075
3    1.332291
dtype: float64
In [142]:
# 1.3 Use applymap to apply a function to every element
# (applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
f2 = lambda x: '%.2f' % x
df.applymap(f2)
Out[142]:
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.03 | 1.18 | 0.68 | -1.72 |
1 | -0.81 | 1.10 | 0.44 | 1.27 |
2 | -0.14 | -1.55 | 0.15 | -0.00 |
3 | -1.36 | -2.52 | 0.37 | 1.33 |
4 | -1.13 | 0.92 | 1.37 | -0.47 |
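The cells above call apply once, column by column. As a minimal sketch of the other options, the block below runs the same reduction row by row and also returns several values per column. The frame `demo` and the helper `spread` are illustrative names that do not appear in the notebook, and the values are fixed so the results are reproducible.

```python
import pandas as pd

# Small fixed frame (illustrative values, not the random data used above).
demo = pd.DataFrame([[1.0, -2.0, 3.0],
                     [-4.0, 5.0, -6.0]], columns=["x", "y", "z"])

# axis='index' (same as axis=0, the default): the function receives each column.
col_max = demo.apply(lambda col: col.max(), axis="index")

# axis='columns' (same as axis=1): the function receives each row.
row_max = demo.apply(lambda row: row.max(), axis="columns")

# Hypothetical helper: returning a Series gives one output row per label.
def spread(col):
    return pd.Series({"min": col.min(), "max": col.max()})

summary = demo.apply(spread)

print(col_max)
print(row_max)
print(summary)
```

For simple reductions such as max, calling `demo.max(axis=1)` directly is usually faster than apply; apply is mainly useful when no built-in method covers the operation.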
Sorting
Sorting by index
In [144]:
s1 = pd.Series(np.arange(4),index=list('dbca'))
s1
Out[144]:
d    0
b    1
c    2
a    3
dtype: int64
In [158]:
# sort_index sorts in ascending order by default
s1.sort_index()
Out[158]:
a    3
b    1
c    2
d    0
dtype: int64
In [148]:
# Sort in descending order
s1.sort_index(ascending=False)
Out[148]:
d    0
c    2
b    1
a    3
dtype: int64
In [151]:
pd1 = pd.DataFrame(np.arange(12).reshape(3,4),index=list('bdc'),columns=list('BDCA'))
pd1
Out[151]:
 | B | D | C | A |
---|---|---|---|---|
b | 0 | 1 | 2 | 3 |
d | 4 | 5 | 6 | 7 |
c | 8 | 9 | 10 | 11 |
In [159]:
# Sort the column labels in descending order
pd1.sort_index(axis=1, ascending=False)
Out[159]:
 | D | C | B | A |
---|---|---|---|---|
b | 1 | 2 | 0 | 3 |
d | 5 | 6 | 4 | 7 |
c | 9 | 10 | 8 | 11 |
Sorting by value
In [146]:
# Sort by value in ascending order
s1.sort_values()
Out[146]:
d    0
b    1
c    2
a    3
dtype: int64
In [ ]:
# NaN values are placed last by default
In [162]:
s1['a'] = np.nan
s1.sort_values()
Out[162]:
d    0.0
b    1.0
c    2.0
a    NaN
dtype: float64
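The placement of NaN can be changed with the `na_position` parameter of `sort_values`. The following is a small sketch on a Series rebuilt with the same values as above; the variable names `last`, `first` and `desc` are illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the Series above, rebuilt so the block is self-contained.
s = pd.Series([0.0, 1.0, 2.0, np.nan], index=list("dbca"))

last = s.sort_values()                      # default: NaN at the end
first = s.sort_values(na_position="first")  # NaN at the start
desc = s.sort_values(ascending=False)       # NaN stays last when descending

print(list(last.index))
print(list(first.index))
print(list(desc.index))
```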
In [165]:
pd1
Out[165]:
 | B | D | C | A |
---|---|---|---|---|
b | 0 | 1 | 2 | 3 |
d | 4 | 5 | 6 | 7 |
c | 8 | 9 | 10 | 11 |
In [167]:
# Sort the rows in descending order by the values in column A
pd1.sort_values(by='A',axis=0, ascending=False)
Out[167]:
 | B | D | C | A |
---|---|---|---|---|
c | 8 | 9 | 10 | 11 |
d | 4 | 5 | 6 | 7 |
b | 0 | 1 | 2 | 3 |
In [168]:
pd1.sort_values(by='A',axis=0)
Out[168]:
 | B | D | C | A |
---|---|---|---|---|
b | 0 | 1 | 2 | 3 |
d | 4 | 5 | 6 | 7 |
c | 8 | 9 | 10 | 11 |
In [170]:
pd2 = pd.DataFrame({'a':[3,7,9,0],
'b':[2,-8,9,7],
'c':[10,-1,3,7]})
pd2
Out[170]:
 | a | b | c |
---|---|---|---|
0 | 3 | 2 | 10 |
1 | 7 | -8 | -1 |
2 | 9 | 9 | 3 |
3 | 0 | 7 | 7 |
In [171]:
# Sort the rows in ascending order by column b
pd2.sort_values(by='b')
Out[171]:
 | a | b | c |
---|---|---|---|
1 | 7 | -8 | -1 |
0 | 3 | 2 | 10 |
3 | 0 | 7 | 7 |
2 | 9 | 9 | 3 |
In [172]:
# Sort by several columns: a first, then c to break ties
pd2.sort_values(by=['a','c'])
Out[172]:
 | a | b | c |
---|---|---|---|
3 | 0 | 7 | 7 |
0 | 3 | 2 | 10 |
1 | 7 | -8 | -1 |
2 | 9 | 9 | 3 |
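Column a in pd2 has no repeated values, so the second sort key never comes into play. The sketch below uses a hypothetical frame `ties` (not from the notebook) in which a contains a duplicate, and shows that `ascending` also accepts one flag per column.

```python
import pandas as pd

# Hypothetical frame: column a holds the value 3 twice, so c decides the tie.
ties = pd.DataFrame({"a": [3, 7, 3, 0],
                     "b": [2, -8, 9, 7],
                     "c": [10, -1, 3, 7]})

# Both keys ascending: within a == 3, the row with the smaller c comes first.
both_asc = ties.sort_values(by=["a", "c"])

# One flag per key: a ascending, c descending.
mixed = ties.sort_values(by=["a", "c"], ascending=[True, False])

print(both_asc)
print(mixed)
```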
Unique values and membership
In [175]:
s1 = pd.Series([2,6,8,9,3,6], index=['a','a','b','c','d','e'])
s1
Out[175]:
a    2
a    6
b    8
c    9
d    3
e    6
dtype: int64
In [177]:
# Use unique to get the distinct values
s2 = s1.unique()
s2
Out[177]:
array([2, 6, 8, 9, 3])
In [180]:
# value_counts returns how many times each value occurs
s1.value_counts()
Out[180]:
6    2
2    1
8    1
9    1
3    1
Name: count, dtype: int64
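The section title also mentions membership, which the cells above do not demonstrate. One way to test it, shown here as a sketch, is `Series.isin`, which returns a boolean mask; the names `mask` and `selected` are illustrative.

```python
import pandas as pd

# Same Series as above, rebuilt so the block is self-contained.
s1 = pd.Series([2, 6, 8, 9, 3, 6], index=["a", "a", "b", "c", "d", "e"])

# True wherever the value is one of the candidates.
mask = s1.isin([6, 8])

# The mask can be used to filter the Series.
selected = s1[mask]

print(mask)
print(selected)
```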
Handling missing data
In [184]:
df3 = pd.DataFrame([np.random.randn(3),
[1, 2, np.nan],
[np.nan, 4, np.nan],[1,2,3]])
df3
Out[184]:
 | 0 | 1 | 2 |
---|---|---|---|
0 | -0.970443 | 1.240458 | -1.106679 |
1 | 1.000000 | 2.000000 | NaN |
2 | NaN | 4.000000 | NaN |
3 | 1.000000 | 2.000000 | 3.000000 |
In [190]:
# 1. Detect missing values with isnull()
df3.isnull()
Out[190]:
 | 0 | 1 | 2 |
---|---|---|---|
0 | False | False | False |
1 | False | False | True |
2 | True | False | True |
3 | False | False | False |
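Because isnull() returns booleans, its result can be summed to count missing values. The sketch below rebuilds df3 with fixed numbers in the first row (an assumption, replacing the random values above) so the counts are reproducible.

```python
import numpy as np
import pandas as pd

# First row uses fixed illustrative values instead of np.random.randn(3).
df3 = pd.DataFrame([[0.5, 1.5, -1.0],
                    [1, 2, np.nan],
                    [np.nan, 4, np.nan],
                    [1, 2, 3]])

missing_per_col = df3.isnull().sum()        # count of NaN in each column
missing_per_row = df3.isnull().sum(axis=1)  # count of NaN in each row
has_missing = bool(df3.isnull().values.any())  # True if any NaN exists

print(missing_per_col)
print(missing_per_row)
print(has_missing)
```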
In [193]:
# 2. Drop missing data with dropna() (by default, drops every row that contains a NaN)
df3.dropna()
Out[193]:
 | 0 | 1 | 2 |
---|---|---|---|
0 | -0.970443 | 1.240458 | -1.106679 |
3 | 1.000000 | 2.000000 | 3.000000 |
In [194]:
df3.dropna(axis=1)
Out[194]:
 | 1 |
---|---|
0 | 1.240458 |
1 | 2.000000 |
2 | 4.000000 |
3 | 2.000000 |
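dropna() can be less strict than dropping every row or column that contains a NaN. The sketch below shows the `how`, `thresh` and `subset` parameters on a copy of df3 whose first row holds fixed illustrative values (an assumption, replacing the random ones).

```python
import numpy as np
import pandas as pd

# First row uses fixed illustrative values instead of np.random.randn(3).
df3 = pd.DataFrame([[0.5, 1.5, -1.0],
                    [1, 2, np.nan],
                    [np.nan, 4, np.nan],
                    [1, 2, 3]])

# Drop a row only if every value in it is NaN (no such row here).
all_missing = df3.dropna(how="all")

# Keep rows that have at least 2 non-NaN values (row 2 has only one).
at_least_two = df3.dropna(thresh=2)

# Only look at column 0 when deciding which rows to drop.
by_subset = df3.dropna(subset=[0])

print(all_missing)
print(at_least_two)
print(by_subset)
```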
In [195]:
# 3. Fill missing data with fillna()
df3.fillna(0)
Out[195]:
 | 0 | 1 | 2 |
---|---|---|---|
0 | -0.970443 | 1.240458 | -1.106679 |
1 | 1.000000 | 2.000000 | 0.000000 |
2 | 0.000000 | 4.000000 | 0.000000 |
3 | 1.000000 | 2.000000 | 3.000000 |
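A single constant is not the only fill strategy. As a sketch, the block below fills with a different value per column, with the previous row's value, and with each column's mean. The first row of df3 again uses fixed illustrative values (an assumption), and the result names are illustrative.

```python
import numpy as np
import pandas as pd

# First row uses fixed illustrative values instead of np.random.randn(3).
df3 = pd.DataFrame([[0.5, 1.5, -1.0],
                    [1, 2, np.nan],
                    [np.nan, 4, np.nan],
                    [1, 2, 3]])

# A dict maps column labels to fill values; unlisted columns are left alone.
per_col = df3.fillna({0: -1, 2: 99})

# Forward fill: propagate the last valid value down each column.
ffilled = df3.ffill()

# Fill each column with its own mean (NaN is skipped when computing the mean).
mean_filled = df3.fillna(df3.mean())

print(per_col)
print(ffilled)
print(mean_filled)
```

Which strategy is appropriate depends on the data: forward fill suits ordered series, while a mean fill keeps the column average unchanged but reduces its variance.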