Data Processing with Pandas

import pandas as pd
df = pd.read_excel('https://raw.githubusercontent.com/xiryg/blog_picture/main/resource/pandas120.xlsx')

df.head()

	createTime	education	salary
0	2020-03-16 11:30:18	本科	20k-35k
1	2020-03-16 10:58:48	本科	20k-40k
2	2020-03-16 10:46:39	不限	20k-35k
3	2020-03-16 10:45:44	本科	13k-20k
4	2020-03-16 10:20:41	本科	10k-20k

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   createTime  135 non-null    datetime64[ns]
 1   education   135 non-null    object        
 2   salary      135 non-null    object        
dtypes: datetime64[ns](1), object(2)
memory usage: 3.3+ KB

Convert the salary column to the average of its minimum and maximum values

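# Use the DataFrame's apply method; with axis=1 each row is passed in as a Series
def func(series):
    # print(series)
    lst = series['salary'].split('-')      # '20k-35k' -> ['20k', '35k']
    smin = int(lst[0].strip('k'))
    smax = int(lst[1].strip('k'))
    avg_salary = int((smin+smax)/2 * 1000)  # midpoint, converted from k to units
    return avg_salary
df['new_salary'] = df.apply(func,axis = 1)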

df.head()

	createTime	education	salary	new_salary
0	2020-03-16 11:30:18	本科	20k-35k	27500
1	2020-03-16 10:58:48	本科	20k-40k	30000
2	2020-03-16 10:46:39	不限	20k-35k	27500
3	2020-03-16 10:45:44	本科	13k-20k	16500
4	2020-03-16 10:20:41	本科	10k-20k	15000
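The row-wise apply above can also be written with vectorized string methods. The following is a minimal sketch on a small hand-made frame (the name `sample` is made up for illustration), assuming every salary has the exact form `<min>k-<max>k` with a lowercase `k`:

```python
import pandas as pd

# Hypothetical sample data in the same format as the salary column
sample = pd.DataFrame({'salary': ['20k-35k', '13k-20k', '3k-4k']})

# Extract the two numbers into columns 0 and 1, then convert them to integers
bounds = sample['salary'].str.extract(r'(\d+)k-(\d+)k').astype(int)

# Row-wise mean of the two bounds, scaled from "k" to units
sample['new_salary'] = (bounds.mean(axis=1) * 1000).astype(int)
print(sample)
```

If a salary does not match the pattern, `str.extract` yields missing values and the `astype(int)` step raises, so this version fails loudly on unexpected input.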

Group the data by education level and compute the average salary

print(df.groupby('education').mean(numeric_only=True))

             new_salary
education              
不限         19600.000000
大专         10000.000000
本科         19361.344538
硕士         20642.857143
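When only one column is of interest, it can be selected before aggregating, which returns a Series rather than a DataFrame. A small sketch with made-up values (`sample` is hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    'education': ['本科', '本科', '硕士'],
    'new_salary': [10000, 20000, 30000],
})

# Select the column first, then average within each education group
avg = sample.groupby('education')['new_salary'].mean()
print(avg)
```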

View summary statistics for the numeric columns

df.describe()

	new_salary
count	135.000000
mean	19159.259259
std	8661.686922
min	3500.000000
25%	14000.000000
50%	17500.000000
75%	25000.000000
max	45000.000000

Add a new column that splits the data into three groups by salary

# Use pd.cut to bin new_salary
bins = [0,5000,20000,50000]
group_names = ['低','中','高']  # low, medium, high
df['categories'] = pd.cut(df['new_salary'],bins,labels=group_names)

df

	createTime	education	salary	new_salary	categories
0	2020-03-16 11:30:18	本科	20k-35k	27500	高
1	2020-03-16 10:58:48	本科	20k-40k	30000	高
2	2020-03-16 10:46:39	不限	20k-35k	27500	高
3	2020-03-16 10:45:44	本科	13k-20k	16500	中
4	2020-03-16 10:20:41	本科	10k-20k	15000	中
...	...	...	...	...	...
130	2020-03-16 11:36:07	本科	10k-18k	14000	中
131	2020-03-16 09:54:47	硕士	25k-50k	37500	高
132	2020-03-16 10:48:32	本科	20k-40k	30000	高
133	2020-03-16 10:46:31	本科	15k-23k	19000	中
134	2020-03-16 11:19:38	本科	20k-40k	30000	高

135 rows × 5 columns
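By default `pd.cut` builds right-closed intervals, so the bins used here are (0, 5000], (5000, 20000] and (20000, 50000]. A quick sketch with hand-picked boundary values (the `values` Series is made up for illustration) shows where the edges fall:

```python
import pandas as pd

bins = [0, 5000, 20000, 50000]
group_names = ['低', '中', '高']  # low, medium, high

# Hypothetical values sitting on and just past each bin edge
values = pd.Series([5000, 5001, 20000, 20001])
labels = pd.cut(values, bins, labels=group_names)
print(labels.tolist())
```

A value equal to a bin's right edge belongs to that bin, while a value of exactly 0 would fall outside the first interval and come back as missing unless `include_lowest=True` is passed.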

Sort the data by salary (the new_salary column) in descending order

df.sort_values('new_salary',ascending=False)

	createTime	education	salary	new_salary	categories
53	2020-03-16 11:30:17	本科	30k-60k	45000	高
37	2020-03-16 11:04:00	本科	30k-50k	40000	高
101	2020-03-16 11:01:39	本科	30k-45k	37500	高
16	2020-03-16 10:36:57	本科	25k-50k	37500	高
131	2020-03-16 09:54:47	硕士	25k-50k	37500	高
...	...	...	...	...	...
123	2020-03-16 11:20:44	本科	3k-6k	4500	低
126	2020-03-16 11:12:04	本科	3k-5k	4000	低
110	2020-03-16 11:12:04	本科	3k-5k	4000	低
96	2020-03-16 10:44:23	不限	3k-4k	3500	低
113	2020-03-16 10:48:43	本科	3k-4k	3500	低

135 rows × 5 columns

df

	createTime	education	salary	new_salary	categories
0	2020-03-16 11:30:18	本科	20k-35k	27500	高
1	2020-03-16 10:58:48	本科	20k-40k	30000	高
2	2020-03-16 10:46:39	不限	20k-35k	27500	高
3	2020-03-16 10:45:44	本科	13k-20k	16500	中
4	2020-03-16 10:20:41	本科	10k-20k	15000	中
...	...	...	...	...	...
130	2020-03-16 11:36:07	本科	10k-18k	14000	中
131	2020-03-16 09:54:47	硕士	25k-50k	37500	高
132	2020-03-16 10:48:32	本科	20k-40k	30000	高
133	2020-03-16 10:46:31	本科	15k-23k	19000	中
134	2020-03-16 11:19:38	本科	20k-40k	30000	高

135 rows × 5 columns
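The df output above is still in its original order because `sort_values` returns a new DataFrame and leaves the original untouched. To keep the sorted result, assign it to a name. A minimal sketch on made-up data (`sample` and `sorted_sample` are hypothetical names):

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({'new_salary': [16500, 45000, 3500]})

# sort_values returns a sorted copy; the original index labels travel with the rows
sorted_sample = sample.sort_values('new_salary', ascending=False)

print(sample['new_salary'].tolist())         # original order is preserved
print(sorted_sample['new_salary'].tolist())  # descending order
```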

Select the 33rd row

df.loc[32]

createTime    2020-03-16 10:07:25
education                      硕士
salary                    15k-30k
new_salary                  22500
categories                      高
Name: 32, dtype: object
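`loc` looks rows up by index label, which coincides with the row position here only because df still has its default RangeIndex. Once the rows have been reordered or filtered, `iloc` is the position-based choice. A small sketch on made-up data (`sample` is hypothetical):

```python
import pandas as pd

# Hypothetical sample, sorted so that labels and positions no longer match
sample = pd.DataFrame({'new_salary': [16500, 45000, 3500]})
sample = sample.sort_values('new_salary', ascending=False)  # index is now [1, 0, 2]

by_label = sample.loc[0]      # the row whose index label is 0
by_position = sample.iloc[0]  # the first row in the current order
print(by_label['new_salary'], by_position['new_salary'])
```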

Drop the createTime column

df.drop(columns=['createTime'],inplace=True)

df.head()

	education	salary	new_salary	categories
0	本科	20k-35k	27500	高
1	本科	20k-40k	30000	高
2	不限	20k-35k	27500	高
3	本科	13k-20k	16500	中
4	本科	10k-20k	15000	中

Combine the education column and the salary (new_salary) column into a new column

df["new_c"] = df["new_salary"].astype(str) + df['education']
df

	education	salary	new_salary	categories	new_c
0	本科	20k-35k	27500	高	27500本科
1	本科	20k-40k	30000	高	30000本科
2	不限	20k-35k	27500	高	27500不限
3	本科	13k-20k	16500	中	16500本科
4	本科	10k-20k	15000	中	15000本科
...	...	...	...	...	...
130	本科	10k-18k	14000	中	14000本科
131	硕士	25k-50k	37500	高	37500硕士
132	本科	20k-40k	30000	高	30000本科
133	本科	15k-23k	19000	中	19000本科
134	本科	20k-40k	30000	高	30000本科

135 rows × 5 columns
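The numeric column has to be converted to strings before it can be joined with text. If a separator between the two parts is wanted, `Series.str.cat` is one option. A sketch on made-up data (the `sample` frame and the `-` separator are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    'education': ['本科', '硕士'],
    'new_salary': [27500, 37500],
})

# Convert the numbers to strings, then join them to education with a separator
sample['new_c'] = sample['new_salary'].astype(str).str.cat(sample['education'], sep='-')
print(sample['new_c'].tolist())
```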

Concatenate the first and last rows

pd.concat([df[:1],df[-1:]])

	education	salary	new_salary	categories	new_c
0	本科	20k-35k	27500	高	27500本科
134	本科	20k-40k	30000	高	30000本科

Check the data type of each column

df.dtypes

education       object
salary          object
new_salary       int64
categories    category
new_c           object
dtype: object

Check whether the data contains missing values, and how many

df.isnull().any()

education     False
salary        False
new_salary    False
categories    False
new_c         False
dtype: bool

df.isnull().sum()

education     0
salary        0
new_salary    0
categories    0
new_c         0
dtype: int64
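This dataset has no gaps, so both checks come back empty. To see what they report when values are missing, here is a small sketch on a made-up frame with gaps introduced by hand (`sample` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with deliberately missing values
sample = pd.DataFrame({
    'education': ['本科', None, '硕士'],
    'new_salary': [27500, np.nan, np.nan],
})

has_missing = sample.isnull().any()     # per column: is anything missing?
missing_counts = sample.isnull().sum()  # per column: how many are missing?
total_missing = int(missing_counts.sum())
print(has_missing, missing_counts, total_missing, sep='\n')
```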

Convert the new_salary column to float

df['new_salary'].astype(float)

0      27500.0
1      30000.0
2      27500.0
3      16500.0
4      15000.0
        ...   
130    14000.0
131    37500.0
132    30000.0
133    19000.0
134    30000.0
Name: new_salary, Length: 135, dtype: float64
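Note that `astype` returns a converted copy and does not change the column stored in the DataFrame. To make the conversion stick, assign the result back. A minimal sketch on made-up data (`sample` is hypothetical):

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({'new_salary': [27500, 30000]})

converted = sample['new_salary'].astype(float)  # a new Series; sample is unchanged
dtype_before = str(sample['new_salary'].dtype)

sample['new_salary'] = converted                # assign back to persist the change
dtype_after = str(sample['new_salary'].dtype)
print(dtype_before, dtype_after)
```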

Count the rows where salary (new_salary) is greater than 10000

len(df[df['new_salary'] > 10000])
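An equivalent way to count is to sum the boolean mask directly, since each True counts as 1. This avoids building the filtered DataFrame first. A sketch on made-up data (`sample` is hypothetical):

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({'new_salary': [3500, 10000, 16500, 45000]})

mask = sample['new_salary'] > 10000  # boolean Series, True where the condition holds
count_by_len = len(sample[mask])     # the approach used above
count_by_sum = int(mask.sum())       # summing the mask gives the same count
print(count_by_len, count_by_sum)
```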

Count how many times each education level appears

df['education'].value_counts()

本科    119
硕士      7
不限      5
大专      4
Name: education, dtype: int64

Check how many distinct education levels there are

df['education'].nunique()

df['education'].unique()

array(['本科', '不限', '硕士', '大专'], dtype=object)