Data Cleaning and Preparation with Pandas
In [325]:
import numpy as np
import pandas as pd
df = pd.read_csv('testdata/Employee Sample Data.csv', encoding='latin1')
df
Out[325]:
| | EEID | Full Name | Job Title | Department | Business Unit | Gender | Ethnicity | Age | Hire Date | Annual Salary | Bonus % | Country | City | Exit Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E02387 | Emily Davis | Sr. Manger | IT | Research & Development | Female | Black | 55 | 4/8/2016 | $141,604 | 15% | United States | Seattle | 10/16/2021 |
1 | E04105 | Theodore Dinh | Technical Architect | IT | Manufacturing | Male | Asian | 59 | 11/29/1997 | $99,975 | 0% | China | Chongqing | NaN |
2 | E02572 | Luna Sanders | Director | Finance | Speciality Products | Female | Caucasian | 50 | 10/26/2006 | $163,099 | 20% | United States | Chicago | NaN |
3 | E02832 | Penelope Jordan | Computer Systems Manager | IT | Manufacturing | Female | Caucasian | 26 | 9/27/2019 | $84,913 | 7% | United States | Chicago | NaN |
4 | E01639 | Austin Vo | Sr. Analyst | Finance | Manufacturing | Male | Asian | 55 | 11/20/1995 | $95,409 | 0% | United States | Phoenix | NaN |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
995 | E03094 | Wesley Young | Sr. Analyst | Marketing | Speciality Products | Male | Caucasian | 33 | 9/18/2016 | $98,427 | 0% | United States | Columbus | NaN |
996 | E01909 | Lillian Khan | Analyst | Finance | Speciality Products | Female | Asian | 44 | 5/31/2010 | $47,387 | 0% | China | Chengdu | 1/8/2018 |
997 | E04398 | Oliver Yang | Director | Marketing | Speciality Products | Male | Asian | 31 | 6/10/2019 | $176,710 | 15% | United States | Miami | NaN |
998 | E02521 | Lily Nguyen | Sr. Analyst | Finance | Speciality Products | Female | Asian | 33 | 1/28/2012 | $95,960 | 0% | China | Chengdu | NaN |
999 | E03545 | Sofia Cheng | Vice President | Accounting | Corporate | Female | Asian | 63 | 7/26/2020 | $216,195 | 31% | United States | Miami | NaN |
1000 rows × 14 columns
Handling Missing Data
In [326]:
# Drop every row that contains at least one NaN
df_no_na = df.dropna()
df_no_na
Out[326]:
| | EEID | Full Name | Job Title | Department | Business Unit | Gender | Ethnicity | Age | Hire Date | Annual Salary | Bonus % | Country | City | Exit Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E02387 | Emily Davis | Sr. Manger | IT | Research & Development | Female | Black | 55 | 4/8/2016 | $141,604 | 15% | United States | Seattle | 10/16/2021 |
7 | E04332 | Luke Martin | Analyst | Finance | Manufacturing | Male | Black | 25 | 5/16/2020 | $41,336 | 0% | United States | Miami | 5/20/2021 |
14 | E03496 | Robert Yang | Sr. Analyst | Accounting | Speciality Products | Male | Asian | 31 | 11/4/2017 | $97,078 | 0% | United States | Austin | 3/9/2020 |
40 | E01754 | Owen Lam | Sr. Business Partner | Human Resources | Speciality Products | Male | Asian | 30 | 5/29/2017 | $86,317 | 0% | China | Chengdu | 7/16/2017 |
61 | E00502 | Natalia Salazar | Sr. Analyst | Accounting | Manufacturing | Female | Latino | 44 | 1/2/2019 | $74,691 | 0% | Brazil | Manaus | 7/8/2020 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
963 | E03305 | Cooper Jiang | Analyst II | Accounting | Corporate | Male | Asian | 49 | 7/25/2019 | $50,883 | 0% | China | Chongqing | 3/2/2021 |
982 | E03247 | Aaliyah Mai | Vice President | IT | Speciality Products | Female | Asian | 57 | 11/11/2016 | $246,589 | 33% | United States | Phoenix | 3/26/2017 |
983 | E02703 | Austin Vang | Manager | Marketing | Speciality Products | Male | Asian | 49 | 5/20/2018 | $119,397 | 9% | China | Beijing | 3/14/2019 |
991 | E03430 | Leo Herrera | Sr. Business Partner | Human Resources | Research & Development | Male | Latino | 48 | 4/22/1998 | $85,369 | 0% | Brazil | Manaus | 11/27/2004 |
996 | E01909 | Lillian Khan | Analyst | Finance | Speciality Products | Female | Asian | 44 | 5/31/2010 | $47,387 | 0% | China | Chengdu | 1/8/2018 |
85 rows × 14 columns
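Calling `dropna()` with no arguments discards 915 of the 1000 rows here, because `Exit Date` is empty for anyone still employed. The sketch below uses a small made-up table (the `toy` frame and the `"still employed"` placeholder are illustrative assumptions, not part of the dataset) to show how to count missing values first and then drop or fill more selectively.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the employee table
toy = pd.DataFrame({
    "EEID": ["E1", "E2", "E3", "E4"],
    "Age": [55, np.nan, 50, 26],
    "Exit Date": ["10/16/2021", np.nan, np.nan, "5/20/2021"],
})

# Count missing values per column before deciding what to drop
na_counts = toy.isna().sum()

# Drop rows only when one specific column is missing
exited = toy.dropna(subset=["Exit Date"])

# Alternatively, keep all rows and fill the gap with a placeholder
filled = toy.fillna({"Exit Date": "still employed"})

print(na_counts)
print(exited)
print(filled)
```

Whether a missing `Exit Date` should be dropped or filled depends on the analysis; the point is only that `subset=` and `fillna` give finer control than a blanket `dropna()`.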
Data Transformation
In [327]:
# Check for duplicate rows
df_no_na.duplicated()
Out[327]:
0      False
7      False
14     False
40     False
61     False
       ...
963    False
982    False
983    False
991    False
996    False
Length: 85, dtype: bool
In [328]:
# Drop duplicate rows based on the EEID column (returns a new DataFrame; df_no_na itself is unchanged)
df_no_na.drop_duplicates(['EEID'])
Out[328]:
| | EEID | Full Name | Job Title | Department | Business Unit | Gender | Ethnicity | Age | Hire Date | Annual Salary | Bonus % | Country | City | Exit Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E02387 | Emily Davis | Sr. Manger | IT | Research & Development | Female | Black | 55 | 4/8/2016 | $141,604 | 15% | United States | Seattle | 10/16/2021 |
7 | E04332 | Luke Martin | Analyst | Finance | Manufacturing | Male | Black | 25 | 5/16/2020 | $41,336 | 0% | United States | Miami | 5/20/2021 |
14 | E03496 | Robert Yang | Sr. Analyst | Accounting | Speciality Products | Male | Asian | 31 | 11/4/2017 | $97,078 | 0% | United States | Austin | 3/9/2020 |
40 | E01754 | Owen Lam | Sr. Business Partner | Human Resources | Speciality Products | Male | Asian | 30 | 5/29/2017 | $86,317 | 0% | China | Chengdu | 7/16/2017 |
61 | E00502 | Natalia Salazar | Sr. Analyst | Accounting | Manufacturing | Female | Latino | 44 | 1/2/2019 | $74,691 | 0% | Brazil | Manaus | 7/8/2020 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
963 | E03305 | Cooper Jiang | Analyst II | Accounting | Corporate | Male | Asian | 49 | 7/25/2019 | $50,883 | 0% | China | Chongqing | 3/2/2021 |
982 | E03247 | Aaliyah Mai | Vice President | IT | Speciality Products | Female | Asian | 57 | 11/11/2016 | $246,589 | 33% | United States | Phoenix | 3/26/2017 |
983 | E02703 | Austin Vang | Manager | Marketing | Speciality Products | Male | Asian | 49 | 5/20/2018 | $119,397 | 9% | China | Beijing | 3/14/2019 |
991 | E03430 | Leo Herrera | Sr. Business Partner | Human Resources | Research & Development | Male | Latino | 48 | 4/22/1998 | $85,369 | 0% | Brazil | Manaus | 11/27/2004 |
996 | E01909 | Lillian Khan | Analyst | Finance | Speciality Products | Female | Asian | 44 | 5/31/2010 | $47,387 | 0% | China | Chengdu | 1/8/2018 |
85 rows × 14 columns
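The 85 rows above contain no duplicates, so the effect of `duplicated` and `drop_duplicates` is not visible. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration) shows how the `subset` and `keep` parameters change the result.

```python
import pandas as pd

# Hypothetical data with a repeated EEID
toy = pd.DataFrame({
    "EEID": ["E1", "E2", "E1", "E3"],
    "City": ["Miami", "Austin", "Seattle", "Miami"],
})

# No row is a full duplicate, so this is all False
full_dupes = toy.duplicated()

# Restricting the check to EEID flags the second "E1"
id_dupes = toy.duplicated(subset=["EEID"])

# keep="last" retains the final occurrence instead of the first
deduped = toy.drop_duplicates(subset=["EEID"], keep="last")

print(full_dupes.tolist())
print(id_dupes.tolist())
print(deduped)
```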
In [329]:
# List the distinct values in Department
df_no_na['Department'].unique()
Out[329]:
array(['IT', 'Finance', 'Accounting', 'Human Resources', 'Engineering', 'Marketing', 'Sales'], dtype=object)
In [330]:
# Transform data with a function or a mapping
department_class = {
'IT' : 'tech',
'Finance' : 'no_tech',
'Accounting' : 'no_tech',
'Human Resources' : 'no_tech',
'Engineering' : 'tech',
'Marketing' : 'no_tech',
'Sales' : 'no_tech'
}
# Work on a copy and assign the new column with .loc[row_indexer, col_indexer]
df_map = df_no_na.copy()
df_map.loc[:, 'department_class'] = df_map['Department'].map(department_class)
df_map
Out[330]:
| | EEID | Full Name | Job Title | Department | Business Unit | Gender | Ethnicity | Age | Hire Date | Annual Salary | Bonus % | Country | City | Exit Date | department_class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E02387 | Emily Davis | Sr. Manger | IT | Research & Development | Female | Black | 55 | 4/8/2016 | $141,604 | 15% | United States | Seattle | 10/16/2021 | tech |
7 | E04332 | Luke Martin | Analyst | Finance | Manufacturing | Male | Black | 25 | 5/16/2020 | $41,336 | 0% | United States | Miami | 5/20/2021 | no_tech |
14 | E03496 | Robert Yang | Sr. Analyst | Accounting | Speciality Products | Male | Asian | 31 | 11/4/2017 | $97,078 | 0% | United States | Austin | 3/9/2020 | no_tech |
40 | E01754 | Owen Lam | Sr. Business Partner | Human Resources | Speciality Products | Male | Asian | 30 | 5/29/2017 | $86,317 | 0% | China | Chengdu | 7/16/2017 | no_tech |
61 | E00502 | Natalia Salazar | Sr. Analyst | Accounting | Manufacturing | Female | Latino | 44 | 1/2/2019 | $74,691 | 0% | Brazil | Manaus | 7/8/2020 | no_tech |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
963 | E03305 | Cooper Jiang | Analyst II | Accounting | Corporate | Male | Asian | 49 | 7/25/2019 | $50,883 | 0% | China | Chongqing | 3/2/2021 | no_tech |
982 | E03247 | Aaliyah Mai | Vice President | IT | Speciality Products | Female | Asian | 57 | 11/11/2016 | $246,589 | 33% | United States | Phoenix | 3/26/2017 | tech |
983 | E02703 | Austin Vang | Manager | Marketing | Speciality Products | Male | Asian | 49 | 5/20/2018 | $119,397 | 9% | China | Beijing | 3/14/2019 | no_tech |
991 | E03430 | Leo Herrera | Sr. Business Partner | Human Resources | Research & Development | Male | Latino | 48 | 4/22/1998 | $85,369 | 0% | Brazil | Manaus | 11/27/2004 | no_tech |
996 | E01909 | Lillian Khan | Analyst | Finance | Speciality Products | Female | Asian | 44 | 5/31/2010 | $47,387 | 0% | China | Chengdu | 1/8/2018 | no_tech |
85 rows × 15 columns
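`Series.map` works cleanly above because the dictionary covers every department. As a hedged illustration of what happens otherwise, the sketch below adds a made-up `"Legal"` value that is missing from the mapping, and also shows the function form mentioned in the comment.

```python
import pandas as pd

# "Legal" is a made-up value that is absent from the mapping
dept = pd.Series(["IT", "Finance", "Legal"])
mapping = {"IT": "tech", "Finance": "no_tech"}

# Keys missing from the dict become NaN
mapped = dept.map(mapping)

# A default label can be filled in afterwards
mapped_filled = mapped.fillna("unknown")

# map also accepts a function instead of a dict
by_func = dept.map(lambda d: "tech" if d in {"IT", "Engineering"} else "no_tech")

print(mapped.tolist())
print(mapped_filled.tolist())
print(by_func.tolist())
```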
In [331]:
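# Cap Bonus % values above 30% (treated as outliers) at 30%
df_remove_except = df_map.copy()
# Strip whitespace and the percent sign from the 'Bonus %' column
df_remove_except.loc[:, 'Bonus %'] = df_remove_except['Bonus %'].str.strip('% ')
# Convert the cleaned strings to floats
# (in recent pandas versions, assigning through .loc[:, col] writes into the existing
# object-dtype column, so the dtype stays object; df[col] = ... would yield float64)
df_remove_except.loc[:, 'Bonus %'] = df_remove_except['Bonus %'].astype(float)
# Replace values greater than 30 with 30
df_remove_except.loc[df_remove_except['Bonus %'] > 30, 'Bonus %'] = 30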
df_remove_except
Out[331]:
| | EEID | Full Name | Job Title | Department | Business Unit | Gender | Ethnicity | Age | Hire Date | Annual Salary | Bonus % | Country | City | Exit Date | department_class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E02387 | Emily Davis | Sr. Manger | IT | Research & Development | Female | Black | 55 | 4/8/2016 | $141,604 | 15.0 | United States | Seattle | 10/16/2021 | tech |
7 | E04332 | Luke Martin | Analyst | Finance | Manufacturing | Male | Black | 25 | 5/16/2020 | $41,336 | 0.0 | United States | Miami | 5/20/2021 | no_tech |
14 | E03496 | Robert Yang | Sr. Analyst | Accounting | Speciality Products | Male | Asian | 31 | 11/4/2017 | $97,078 | 0.0 | United States | Austin | 3/9/2020 | no_tech |
40 | E01754 | Owen Lam | Sr. Business Partner | Human Resources | Speciality Products | Male | Asian | 30 | 5/29/2017 | $86,317 | 0.0 | China | Chengdu | 7/16/2017 | no_tech |
61 | E00502 | Natalia Salazar | Sr. Analyst | Accounting | Manufacturing | Female | Latino | 44 | 1/2/2019 | $74,691 | 0.0 | Brazil | Manaus | 7/8/2020 | no_tech |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
963 | E03305 | Cooper Jiang | Analyst II | Accounting | Corporate | Male | Asian | 49 | 7/25/2019 | $50,883 | 0.0 | China | Chongqing | 3/2/2021 | no_tech |
982 | E03247 | Aaliyah Mai | Vice President | IT | Speciality Products | Female | Asian | 57 | 11/11/2016 | $246,589 | 30 | United States | Phoenix | 3/26/2017 | tech |
983 | E02703 | Austin Vang | Manager | Marketing | Speciality Products | Male | Asian | 49 | 5/20/2018 | $119,397 | 9.0 | China | Beijing | 3/14/2019 | no_tech |
991 | E03430 | Leo Herrera | Sr. Business Partner | Human Resources | Research & Development | Male | Latino | 48 | 4/22/1998 | $85,369 | 0.0 | Brazil | Manaus | 11/27/2004 | no_tech |
996 | E01909 | Lillian Khan | Analyst | Finance | Speciality Products | Female | Asian | 44 | 5/31/2010 | $47,387 | 0.0 | China | Chengdu | 1/8/2018 | no_tech |
85 rows × 15 columns
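In the output above, capped rows display as `30` while the rest display as `15.0`, which suggests the column is still stored with object dtype. One possible alternative, sketched on made-up values (the `toy` frame is an assumption), is to replace the column by plain assignment and cap it with `Series.clip`, which leaves a proper `float64` column.

```python
import pandas as pd

# Hypothetical bonus strings, including one with trailing whitespace
toy = pd.DataFrame({"Bonus %": ["15%", "0%", "33%", "31% "]})

# Plain column assignment replaces the column, so the dtype becomes float64
toy["Bonus %"] = toy["Bonus %"].str.strip("% ").astype(float)

# clip caps every value above 30 at 30
toy["Bonus %"] = toy["Bonus %"].clip(upper=30)

print(toy)
print(toy["Bonus %"].dtype)
```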
Reordering Data
In [332]:
# Sort by Exit Date
df_sort = df_remove_except.copy()
# Convert the "Exit Date" column to datetime
df_sort['Exit Date'] = pd.to_datetime(df_sort['Exit Date'])
# Sort the DataFrame by "Exit Date", most recent first
df_sorted = df_sort.sort_values(by='Exit Date', ascending=False)
df_sorted
Out[332]:
| | EEID | Full Name | Job Title | Department | Business Unit | Gender | Ethnicity | Age | Hire Date | Annual Salary | Bonus % | Country | City | Exit Date | department_class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
730 | E03579 | Robert Zhang | Vice President | Marketing | Corporate | Male | Asian | 45 | 9/24/2015 | $202,680 | 30 | United States | Phoenix | 2022-08-17 | no_tech |
639 | E04641 | Scarlett Hill | Director | Engineering | Speciality Products | Female | Black | 45 | 4/22/2018 | $187,205 | 24.0 | United States | Columbus | 2022-06-20 | tech |
695 | E01789 | Charles Luu | Sr. Manger | Sales | Manufacturing | Male | Asian | 25 | 6/15/2021 | $142,731 | 11.0 | China | Shanghai | 2022-06-03 | no_tech |
576 | E02857 | Mason Jimenez | Sr. Manger | Finance | Speciality Products | Male | Latino | 44 | 8/8/2019 | $130,133 | 15.0 | United States | Austin | 2022-05-18 | no_tech |
750 | E02642 | Sebastian Rogers | HRIS Analyst | Human Resources | Research & Development | Male | Caucasian | 38 | 11/29/2019 | $69,647 | 0.0 | United States | Miami | 2022-04-20 | no_tech |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
478 | E01845 | Leo Fernandez | Manager | Finance | Research & Development | Male | Latino | 54 | 4/28/1998 | $108,268 | 9.0 | Brazil | Sao Paulo | 2004-05-15 | no_tech |
376 | E00181 | Genesis Hu | Sr. Analyst | Marketing | Corporate | Female | Asian | 46 | 1/15/2002 | $86,510 | 0.0 | China | Beijing | 2003-01-02 | no_tech |
350 | E03045 | Andrew Huynh | Business Partner | Human Resources | Speciality Products | Male | Asian | 57 | 4/28/1997 | $54,051 | 0.0 | United States | Miami | 1998-10-11 | no_tech |
834 | E00592 | Josephine Richardson | System Administrator | IT | Manufacturing | Female | Caucasian | 57 | 2/18/1996 | $75,354 | 0.0 | United States | Austin | 1996-12-14 | tech |
648 | E01591 | Paisley Trinh | Technical Architect | IT | Corporate | Female | Asian | 57 | 5/4/1992 | $76,202 | 0.0 | United States | Austin | 1994-12-18 | tech |
85 rows × 15 columns
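The dates in this file are written month first (for example `10/16/2021`). A small sketch, assuming that format holds for every row and using a made-up `toy` frame, shows how an explicit `format` removes any month/day ambiguity and how `na_position` controls where missing dates end up when sorting.

```python
import pandas as pd

# Hypothetical rows, one of them without an exit date
toy = pd.DataFrame({
    "EEID": ["E1", "E2", "E3"],
    "Exit Date": ["10/16/2021", None, "1/8/2018"],
})

# An explicit format avoids guessing between month-first and day-first dates
toy["Exit Date"] = pd.to_datetime(toy["Exit Date"], format="%m/%d/%Y")

# Most recent first; rows without an exit date (NaT) go last
recent_first = toy.sort_values(by="Exit Date", ascending=False, na_position="last")

print(recent_first)
```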
Discretization and Binning
In [337]:
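# Split employees into age groups
df_handle_age = df_sorted.copy()
# Define the bin edges and labels
bins = [18, 25, 35, 40, 60, 100]
labels = ['18-25', '26-35', '36-40', '41-60', '61-100']
# Note: right=False makes the bins left-closed, [18, 25), [25, 35), ...,
# so an age of exactly 25 is labeled '26-35' rather than '18-25'
df_handle_age['Age Group'] = pd.cut(df_handle_age['Age'], bins=bins, labels=labels, right=False)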
df_handle_age
Out[337]:
| | EEID | Full Name | Job Title | Department | Business Unit | Gender | Ethnicity | Age | Hire Date | Annual Salary | Bonus % | Country | City | Exit Date | department_class | Age Group |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
730 | E03579 | Robert Zhang | Vice President | Marketing | Corporate | Male | Asian | 45 | 9/24/2015 | $202,680 | 30 | United States | Phoenix | 2022-08-17 | no_tech | 41-60 |
639 | E04641 | Scarlett Hill | Director | Engineering | Speciality Products | Female | Black | 45 | 4/22/2018 | $187,205 | 24.0 | United States | Columbus | 2022-06-20 | tech | 41-60 |
695 | E01789 | Charles Luu | Sr. Manger | Sales | Manufacturing | Male | Asian | 25 | 6/15/2021 | $142,731 | 11.0 | China | Shanghai | 2022-06-03 | no_tech | 26-35 |
576 | E02857 | Mason Jimenez | Sr. Manger | Finance | Speciality Products | Male | Latino | 44 | 8/8/2019 | $130,133 | 15.0 | United States | Austin | 2022-05-18 | no_tech | 41-60 |
750 | E02642 | Sebastian Rogers | HRIS Analyst | Human Resources | Research & Development | Male | Caucasian | 38 | 11/29/2019 | $69,647 | 0.0 | United States | Miami | 2022-04-20 | no_tech | 36-40 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
478 | E01845 | Leo Fernandez | Manager | Finance | Research & Development | Male | Latino | 54 | 4/28/1998 | $108,268 | 9.0 | Brazil | Sao Paulo | 2004-05-15 | no_tech | 41-60 |
376 | E00181 | Genesis Hu | Sr. Analyst | Marketing | Corporate | Female | Asian | 46 | 1/15/2002 | $86,510 | 0.0 | China | Beijing | 2003-01-02 | no_tech | 41-60 |
350 | E03045 | Andrew Huynh | Business Partner | Human Resources | Speciality Products | Male | Asian | 57 | 4/28/1997 | $54,051 | 0.0 | United States | Miami | 1998-10-11 | no_tech | 41-60 |
834 | E00592 | Josephine Richardson | System Administrator | IT | Manufacturing | Female | Caucasian | 57 | 2/18/1996 | $75,354 | 0.0 | United States | Austin | 1996-12-14 | tech | 41-60 |
648 | E01591 | Paisley Trinh | Technical Architect | IT | Corporate | Female | Asian | 57 | 5/4/1992 | $76,202 | 0.0 | United States | Austin | 1994-12-18 | tech | 41-60 |
85 rows × 16 columns
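In the table above, Charles Luu (age 25) is labeled `26-35`. That follows from `right=False`, which makes each bin include its left edge and exclude its right edge. The sketch below compares both settings on a few boundary ages (the `ages` values are chosen for illustration); `include_lowest=True` is one way to keep age 18 inside the first right-closed bin.

```python
import pandas as pd

ages = pd.Series([18, 25, 26, 35, 60])  # boundary values picked for illustration
bins = [18, 25, 35, 40, 60, 100]
labels = ["18-25", "26-35", "36-40", "41-60", "61-100"]

# Left-closed bins: [18, 25), [25, 35), ... so 25 lands in "26-35"
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# Right-closed bins: (18, 25], (25, 35], ...; include_lowest keeps 18 in the first bin
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True, include_lowest=True)

print(pd.DataFrame({"age": ages, "right=False": left_closed, "right=True": right_closed}))
```

With these particular labels, the right-closed variant is the one whose bins match the label text.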
In [338]:
# Set City and Age Group as a hierarchical index (set_index returns a new DataFrame)
df_with_multiindex = df_handle_age.set_index(['City', 'Age Group'])
df_with_multiindex
Out[338]:
City | Age Group | EEID | Full Name | Job Title | Department | Business Unit | Gender | Ethnicity | Age | Hire Date | Annual Salary | Bonus % | Country | Exit Date | department_class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Phoenix | 41-60 | E03579 | Robert Zhang | Vice President | Marketing | Corporate | Male | Asian | 45 | 9/24/2015 | $202,680 | 30 | United States | 2022-08-17 | no_tech |
Columbus | 41-60 | E04641 | Scarlett Hill | Director | Engineering | Speciality Products | Female | Black | 45 | 4/22/2018 | $187,205 | 24.0 | United States | 2022-06-20 | tech |
Shanghai | 26-35 | E01789 | Charles Luu | Sr. Manger | Sales | Manufacturing | Male | Asian | 25 | 6/15/2021 | $142,731 | 11.0 | China | 2022-06-03 | no_tech |
Austin | 41-60 | E02857 | Mason Jimenez | Sr. Manger | Finance | Speciality Products | Male | Latino | 44 | 8/8/2019 | $130,133 | 15.0 | United States | 2022-05-18 | no_tech |
Miami | 36-40 | E02642 | Sebastian Rogers | HRIS Analyst | Human Resources | Research & Development | Male | Caucasian | 38 | 11/29/2019 | $69,647 | 0.0 | United States | 2022-04-20 | no_tech |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
Sao Paulo | 41-60 | E01845 | Leo Fernandez | Manager | Finance | Research & Development | Male | Latino | 54 | 4/28/1998 | $108,268 | 9.0 | Brazil | 2004-05-15 | no_tech |
Beijing | 41-60 | E00181 | Genesis Hu | Sr. Analyst | Marketing | Corporate | Female | Asian | 46 | 1/15/2002 | $86,510 | 0.0 | China | 2003-01-02 | no_tech |
Miami | 41-60 | E03045 | Andrew Huynh | Business Partner | Human Resources | Speciality Products | Male | Asian | 57 | 4/28/1997 | $54,051 | 0.0 | United States | 1998-10-11 | no_tech |
Austin | 41-60 | E00592 | Josephine Richardson | System Administrator | IT | Manufacturing | Female | Caucasian | 57 | 2/18/1996 | $75,354 | 0.0 | United States | 1996-12-14 | tech |
Austin | 41-60 | E01591 | Paisley Trinh | Technical Architect | IT | Corporate | Female | Asian | 57 | 5/4/1992 | $76,202 | 0.0 | United States | 1994-12-18 | tech |
85 rows × 14 columns
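Once `City` and `Age Group` form a hierarchical index, rows can be selected by either level. The following is a minimal sketch on a made-up frame (the `toy` data is an assumption), sorting the index first so that lookups behave predictably.

```python
import pandas as pd

# Hypothetical data indexed the same way as df_with_multiindex
toy = pd.DataFrame({
    "City": ["Miami", "Austin", "Miami", "Austin"],
    "Age Group": ["36-40", "41-60", "41-60", "41-60"],
    "EEID": ["E1", "E2", "E3", "E4"],
}).set_index(["City", "Age Group"]).sort_index()

# All rows for one city (outer level)
miami = toy.loc["Miami"]

# All rows for one age group (inner level), regardless of city
group_41_60 = toy.xs("41-60", level="Age Group")

# Turn the index levels back into ordinary columns
flat = toy.reset_index()

print(miami)
print(group_41_60)
print(flat)
```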
In [340]:
# Another binning example
# Generate 1000 random numbers
data = np.random.rand(1000)
# Split the data into 4 quantile-based bins (quartiles)
cats = pd.qcut(data, 4)
cats
Out[340]:
[(0.723, 1.0], (0.24, 0.485], (0.24, 0.485], (0.723, 1.0], (-0.00047599999999999997, 0.24], ..., (0.485, 0.723], (0.723, 1.0], (0.485, 0.723], (0.24, 0.485], (0.723, 1.0]]
Length: 1000
Categories (4, interval[float64, right]): [(-0.00047599999999999997, 0.24] < (0.24, 0.485] < (0.485, 0.723] < (0.723, 1.0]]
In [341]:
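# Check that every bin holds the same number of values
# (pd.value_counts is deprecated in recent pandas; Series.value_counts is equivalent)
pd.Series(cats).value_counts()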
Out[341]:
(-0.00047599999999999997, 0.24]    250
(0.24, 0.485]                      250
(0.485, 0.723]                     250
(0.723, 1.0]                       250
Name: count, dtype: int64
In [343]:
# Bin by explicit edges (pd.cut does not compute quantiles; pass the same list to pd.qcut for that)
cats = pd.cut(data, [0, 0.1, 0.5, 0.9, 1])
pd.Series(cats).value_counts()
Out[343]:
(0.1, 0.5]    397
(0.5, 0.9]    390
(0.0, 0.1]    112
(0.9, 1.0]    101
Name: count, dtype: int64
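The counts above (397, 390, 112, 101) are uneven because `pd.cut` treats the list as fixed bin edges. Passing the same list to `pd.qcut` interprets it as quantiles instead. The sketch below contrasts the two; the seed and the use of `np.random.default_rng` are arbitrary choices made here for reproducibility, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed so the example is repeatable
data = rng.random(1000)

# qcut: edges are derived from the data so each bin holds the requested share
by_quantile = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1])
quantile_counts = pd.Series(by_quantile).value_counts(sort=False)

# cut: edges are fixed values, so counts depend on how the data is distributed
by_edges = pd.cut(data, [0, 0.1, 0.5, 0.9, 1])
edge_counts = pd.Series(by_edges).value_counts(sort=False)

print(quantile_counts)
print(edge_counts)
```

For uniformly distributed data the two results are close, which is why the fixed-edge counts above land near 100/400/400/100 without matching exactly.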
Permutation and Random Sampling
In [344]:
# np.random.permutation(6) returns the integers 0-5 in random order
sam = np.random.permutation(6)
In [345]:
sam
Out[345]:
array([3, 0, 5, 2, 4, 1])
In [348]:
# Permutation: take rows 0-5 of df in the shuffled order given by sam
df.take(sam)
Out[348]:
| | EEID | Full Name | Job Title | Department | Business Unit | Gender | Ethnicity | Age | Hire Date | Annual Salary | Bonus % | Country | City | Exit Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | E02832 | Penelope Jordan | Computer Systems Manager | IT | Manufacturing | Female | Caucasian | 26 | 9/27/2019 | $84,913 | 7% | United States | Chicago | NaN |
0 | E02387 | Emily Davis | Sr. Manger | IT | Research & Development | Female | Black | 55 | 4/8/2016 | $141,604 | 15% | United States | Seattle | 10/16/2021 |
5 | E00644 | Joshua Gupta | Account Representative | Sales | Corporate | Male | Asian | 57 | 1/24/2017 | $50,994 | 0% | China | Chongqing | NaN |
2 | E02572 | Luna Sanders | Director | Finance | Speciality Products | Female | Caucasian | 50 | 10/26/2006 | $163,099 | 20% | United States | Chicago | NaN |
4 | E01639 | Austin Vo | Sr. Analyst | Finance | Manufacturing | Male | Asian | 55 | 11/20/1995 | $95,409 | 0% | United States | Phoenix | NaN |
1 | E04105 | Theodore Dinh | Technical Architect | IT | Manufacturing | Male | Asian | 59 | 11/29/1997 | $99,975 | 0% | China | Chongqing | NaN |
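Note that `np.random.permutation(6)` only shuffles positions 0 to 5, so `take` above reorders the first six rows rather than sampling from the whole table. To shuffle every row, the permutation has to cover `len(df)`. A minimal sketch on a made-up frame, with an arbitrary seed assumed for repeatability:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df
toy = pd.DataFrame({"EEID": ["E1", "E2", "E3", "E4", "E5"]})

rng = np.random.default_rng(42)    # seed is an arbitrary choice
order = rng.permutation(len(toy))  # a permutation of 0..len(toy)-1

# Every row exactly once, in random order
shuffled = toy.take(order)

# A random subset without replacement: the first 3 positions of the permutation
first_three = toy.take(order[:3])

print(shuffled)
print(first_three)
```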
In [352]:
# Randomly select 3 rows from df
df.sample(n = 3)
Out[352]:
| | EEID | Full Name | Job Title | Department | Business Unit | Gender | Ethnicity | Age | Hire Date | Annual Salary | Bonus % | Country | City | Exit Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
420 | E01762 | Maya Ngo | Manager | Sales | Speciality Products | Female | Asian | 55 | 10/20/2012 | $108,686 | 6% | United States | Columbus | NaN |
658 | E04174 | Maverick Henry | Computer Systems Manager | IT | Research & Development | Male | Caucasian | 26 | 7/10/2019 | $69,110 | 5% | United States | Chicago | NaN |
932 | E03894 | Charlotte Chang | Manager | Sales | Research & Development | Female | Asian | 50 | 5/7/2000 | $106,428 | 7% | United States | Chicago | NaN |
In [350]:
ch = pd.Series([5,4,6,8,2,1])
ch
Out[350]:
0    5
1    4
2    6
3    8
4    2
5    1
dtype: int64
In [353]:
# replace=True samples with replacement, so the same element can be drawn more than once
ch.sample(n=10,replace=True)
Out[353]:
0    5
4    2
2    6
0    5
1    4
0    5
5    1
0    5
0    5
4    2
dtype: int64
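Drawing 10 values from a 6-element Series only works because `replace=True`. The sketch below illustrates that constraint, along with two other `sample` parameters: `random_state` for repeatable draws and `frac` for sampling a proportion. The seed value is an arbitrary assumption.

```python
import pandas as pd

ch = pd.Series([5, 4, 6, 8, 2, 1])

# Without replacement, n cannot exceed the length of the Series
try:
    ch.sample(n=10)
except ValueError as err:
    message = str(err)

# random_state makes the draw repeatable
a = ch.sample(n=10, replace=True, random_state=0)
b = ch.sample(n=10, replace=True, random_state=0)

# frac draws a fraction of the rows instead of a fixed count
half = ch.sample(frac=0.5, random_state=0)

print(message)
print(a.equals(b))
print(len(half))
```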