Pandas数据处理

1
2
import pandas as pd
df = pd.read_excel('https://raw.githubusercontent.com/xiryg/blog_picture/main/resource/pandas120.xlsx')
1
df.head()
createTimeeducationsalary
02020-03-16 11:30:18本科20k-35k
12020-03-16 10:58:48本科20k-40k
22020-03-16 10:46:39不限20k-35k
32020-03-16 10:45:44本科13k-20k
42020-03-16 10:20:41本科10k-20k
1
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   createTime  135 non-null    datetime64[ns]
 1   education   135 non-null    object        
 2   salary      135 non-null    object        
dtypes: datetime64[ns](1), object(2)
memory usage: 3.3+ KB

将salary列数据转换为最大值于最小值的平均值

1
2
3
4
5
6
7
8
9
# data 的apply方法
def func(series):
# print(seires)
lst = series['salary'].split('-')
smin = int(lst[0].strip('k'))
smax = int(lst[1].strip('k'))
avg_salary = int((smin+smax)/2 * 1000)
return avg_salary
df['new_salary'] = df.apply(func,axis = 1)
1
df.head()
createTimeeducationsalarynew_salary
02020-03-16 11:30:18本科20k-35k27500
12020-03-16 10:58:48本科20k-40k30000
22020-03-16 10:46:39不限20k-35k27500
32020-03-16 10:45:44本科13k-20k16500
42020-03-16 10:20:41本科10k-20k15000

将数据根据学历进行分组并计算平均薪资

1
print(df.groupby('education').mean(numeric_only=True))
             new_salary
education              
不限         19600.000000
大专         10000.000000
本科         19361.344538
硕士         20642.857143

查看数值型列的汇总统计

1
df.describe()
new_salary
count135.000000
mean19159.259259
std8661.686922
min3500.000000
25%14000.000000
50%17500.000000
75%25000.000000
max45000.000000

新增一列根据salary将数据分为三组

1
2
3
4
# cut 方法
bins = [0,5000,20000,50000]
group_names = ['低','中','高']
df['categories'] = pd.cut(df['new_salary'],bins,labels=group_names)
1
df
createTimeeducationsalarynew_salarycategories
02020-03-16 11:30:18本科20k-35k27500
12020-03-16 10:58:48本科20k-40k30000
22020-03-16 10:46:39不限20k-35k27500
32020-03-16 10:45:44本科13k-20k16500
42020-03-16 10:20:41本科10k-20k15000
..................
1302020-03-16 11:36:07本科10k-18k14000
1312020-03-16 09:54:47硕士25k-50k37500
1322020-03-16 10:48:32本科20k-40k30000
1332020-03-16 10:46:31本科15k-23k19000
1342020-03-16 11:19:38本科20k-40k30000

135 rows × 5 columns

按照salary列对数据进行降序排列

1
df.sort_values('new_salary',ascending=False)
createTimeeducationsalarynew_salarycategories
532020-03-16 11:30:17本科30k-60k45000
372020-03-16 11:04:00本科30k-50k40000
1012020-03-16 11:01:39本科30k-45k37500
162020-03-16 10:36:57本科25k-50k37500
1312020-03-16 09:54:47硕士25k-50k37500
..................
1232020-03-16 11:20:44本科3k-6k4500
1262020-03-16 11:12:04本科3k-5k4000
1102020-03-16 11:12:04本科3k-5k4000
962020-03-16 10:44:23不限3k-4k3500
1132020-03-16 10:48:43本科3k-4k3500

135 rows × 5 columns

1
df
createTimeeducationsalarynew_salarycategories
02020-03-16 11:30:18本科20k-35k27500
12020-03-16 10:58:48本科20k-40k30000
22020-03-16 10:46:39不限20k-35k27500
32020-03-16 10:45:44本科13k-20k16500
42020-03-16 10:20:41本科10k-20k15000
..................
1302020-03-16 11:36:07本科10k-18k14000
1312020-03-16 09:54:47硕士25k-50k37500
1322020-03-16 10:48:32本科20k-40k30000
1332020-03-16 10:46:31本科15k-23k19000
1342020-03-16 11:19:38本科20k-40k30000

135 rows × 5 columns

取出第33行的数据

1
df.loc[32]
createTime    2020-03-16 10:07:25
education                      硕士
salary                    15k-30k
new_salary                  22500
categories                      高
Name: 32, dtype: object

删除列 createTime

1
df.drop(columns=['createTime'],inplace=True)
1
df.head()
educationsalarynew_salarycategories
0本科20k-35k27500
1本科20k-40k30000
2不限20k-35k27500
3本科13k-20k16500
4本科10k-20k15000

将 education列与salary列合并为新的一列

1
2
df["new_c"] = df["new_salary"].astype(str) + df['education']
df
educationsalarynew_salarycategoriesnew_c
0本科20k-35k2750027500本科
1本科20k-40k3000030000本科
2不限20k-35k2750027500不限
3本科13k-20k1650016500本科
4本科10k-20k1500015000本科
..................
130本科10k-18k1400014000本科
131硕士25k-50k3750037500硕士
132本科20k-40k3000030000本科
133本科15k-23k1900019000本科
134本科20k-40k3000030000本科

135 rows × 5 columns

将第一行与最后一行进行拼接

1
pd.concat([df[:1],df[-1:]])
educationsalarynew_salarycategoriesnew_c
0本科20k-35k2750027500本科
134本科20k-40k3000030000本科

查看每列的数据类型

1
df.dtypes
education       object
salary          object
new_salary       int64
categories    category
new_c           object
dtype: object

检查数据中是否有缺失值及其数量

1
df.isnull().any()
education     False
salary        False
new_salary    False
categories    False
new_c         False
dtype: bool
1
df.isnull().sum()
education     0
salary        0
new_salary    0
categories    0
new_c         0
dtype: int64

将 new_salary 列类型转换为浮点数

1
df['new_salary'].astype(float)
0      27500.0
1      30000.0
2      27500.0
3      16500.0
4      15000.0
        ...   
130    14000.0
131    37500.0
132    30000.0
133    19000.0
134    30000.0
Name: new_salary, Length: 135, dtype: float64

计算salary列大于10000的数量

1
len(df[df['new_salary'] > 10000])
119

查看每种学历出现的次数

1
df['education'].value_counts()
本科    119
硕士      7
不限      5
大专      4
Name: education, dtype: int64

查看共有几种学历

1
df['education'].nunique()
4
1
df['education'].unique()
array(['本科', '不限', '硕士', '大专'], dtype=object)