pandas Team Study: Task 2

I. Reading and Writing Files

1. Reading files

csv: pd.read_csv(filename)
txt: pd.read_table(filename)
excel: pd.read_excel(filename)

Example:

import pandas as pd
df_csv = pd.read_csv('my_csv.csv')

Out[6]: 
   col1 col2  col3    col4      col5
0     2    a   1.4   apple  2020/1/1
1     3    b   3.4  banana  2020/1/2
2     6    c   2.5  orange  2020/1/5
3     5    d   3.2   lemon  2020/1/7

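header=None means the first row is not treated as column names, so it is read as data and the columns are numbered from 0. For example: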

import pandas as pd
df_csv = pd.read_csv('my_csv.csv',header=None)			# 4 data rows before, 5 rows now
Out[9]: 
      0     1     2       3         4
0  col1  col2  col3    col4      col5
1     2     a   1.4   apple  2020/1/1
2     3     b   3.4  banana  2020/1/2
3     6     c   2.5  orange  2020/1/5
4     5     d   3.2   lemon  2020/1/7

usecols selects which columns to read (given here as a list of column names):

import pandas as pd
df_csv = pd.read_csv('my_csv.csv',usecols=['col1'])		# read only the first column
Out[13]: 
   col1
0     2
1     3
2     6
3     5

nrows sets how many data rows to read (an integer):

import pandas as pd
df_csv = pd.read_csv('my_csv.csv',nrows=2)			# read two rows
Out[15]: 
   col1 col2  col3    col4      col5
0     2    a   1.4   apple  2020/1/1
1     3    b   3.4  banana  2020/1/2
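
The three options above can be tried without a file on disk. The following is a minimal sketch that assumes the contents of my_csv.csv match the table printed earlier and feeds them to read_csv through io.StringIO; the variable names are chosen here for illustration.

```python
import io

import pandas as pd

# Assumed contents of my_csv.csv, reconstructed from the output shown above.
csv_text = """col1,col2,col3,col4,col5
2,a,1.4,apple,2020/1/1
3,b,3.4,banana,2020/1/2
6,c,2.5,orange,2020/1/5
5,d,3.2,lemon,2020/1/7
"""

# Default: the first line becomes the column names.
df_full = pd.read_csv(io.StringIO(csv_text))
# header=None: the first line is read as an ordinary data row.
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)
# usecols: keep only the listed columns.
df_one_col = pd.read_csv(io.StringIO(csv_text), usecols=['col1'])
# nrows: stop after the given number of data rows.
df_two_rows = pd.read_csv(io.StringIO(csv_text), nrows=2)

print(df_full.shape, df_no_header.shape, df_one_col.shape, df_two_rows.shape)
```

Comparing the shapes is a quick way to confirm what each option changed.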

2. Writing files

csv: data.to_csv(path, index=False); index=False leaves the index out of the output
excel: data.to_excel(path, index=False)
txt: data.to_csv(path, sep=' ', index=False)
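
As a rough sketch of the write calls above, the following round-trips a small table through a CSV file in a temporary directory. The table contents, the file name demo.csv and the variable names are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical demo table; not one of the files used in these notes.
data = pd.DataFrame({'col1': [2, 3], 'col2': ['a', 'b']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.csv')
    # index=False keeps the row index out of the file.
    data.to_csv(path, index=False)
    with open(path) as f:
        text = f.read()
    # Read the file back while it still exists.
    restored = pd.read_csv(path)

print(text)
print(restored)
```

Without index=False, the file would gain an extra unnamed first column holding the index.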

II. Basic Data Structures

1. Series

A Series has four parts: the data (data), the index (index), the storage type (dtype), and the name (name). For example:

s = pd.Series(data = [1,10,100],index=[1,2,3],name = 'my_series')
Out[20]: 
1      1
2     10
3    100
Name: my_series, dtype: int64

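These attributes are accessed as s.values (data), s.index (index), s.dtype (type), and s.name (name).

2. DataFrame

A DataFrame extends a Series with a column dimension, going from one dimension to two.

It is created in much the same way as a Series, with column labels added. For example: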
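
A short sketch of reading those four attributes back, using the same Series as above:

```python
import pandas as pd

s = pd.Series(data=[1, 10, 100], index=[1, 2, 3], name='my_series')

print(s.values)  # the data, as a NumPy array
print(s.index)   # the index labels
print(s.dtype)   # the storage type of the data
print(s.name)    # the name of the Series
```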

In [33]: df = pd.DataFrame(data = {'col_0': [1,2,3], 'col_1':list('abc'),
   ....:                           'col_2': [1.2, 2.2, 3.2]},
   ....:                   index = ['row_%d'%i for i in range(3)])
   ....: 

In [34]: df
Out[34]: 
       col_0 col_1  col_2
row_0      1     a    1.2
row_1      2     b    2.2
row_2      3     c    3.2

Columns can be selected by label, one at a time or several at once:

In [35]: df['col_0']				# select a single column
Out[35]: 
row_0    1
row_1    2
row_2    3
Name: col_0, dtype: int64

In [36]: df[['col_0', 'col_1']]			# select several columns
Out[36]: 
       col_0 col_1
row_0      1     a
row_1      2     b
row_2      3     c
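
One detail worth checking by hand is the type of the result: a single label gives back a Series, while a list of labels gives back a DataFrame, even when the list has one entry. A minimal sketch, rebuilding the same table (the result variable names are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame(data={'col_0': [1, 2, 3], 'col_1': list('abc'),
                        'col_2': [1.2, 2.2, 3.2]},
                  index=['row_%d' % i for i in range(3)])

one_col = df['col_0']              # single label -> Series
two_cols = df[['col_0', 'col_1']]  # list of labels -> DataFrame
one_col_frame = df[['col_0']]      # one-element list -> still a DataFrame

print(type(one_col).__name__, type(two_cols).__name__, type(one_col_frame).__name__)
```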

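III. Common Basic Functions

1. Summary functions

head returns the first n rows of the table, and tail returns the last n rows (df in the examples below is a different table, holding 200 student records):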

In [46]: df.head(2)
Out[46]: 
                          School     Grade            Name  Gender  Height  Weight Transfer
0  Shanghai Jiao Tong University  Freshman    Gaopeng Yang  Female   158.9    46.0        N
1              Peking University  Freshman  Changqiang You    Male   166.5    70.0        N

In [47]: df.tail(3)
Out[47]: 
                            School      Grade            Name  Gender  Height  Weight Transfer
197  Shanghai Jiao Tong University     Senior  Chengqiang Chu  Female   153.9    45.0        N
198  Shanghai Jiao Tong University     Senior   Chengmei Shen    Male   175.3    71.0        N
199            Tsinghua University  Sophomore     Chunpeng Lv    Male   155.7    51.0

info returns an overview of the table, and describe returns the main statistics for its numeric columns:

In [48]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   School    200 non-null    object 
 1   Grade     200 non-null    object 
 2   Name      200 non-null    object 
 3   Gender    200 non-null    object 
 4   Height    183 non-null    float64
 5   Weight    189 non-null    float64
 6   Transfer  188 non-null    object 
dtypes: float64(2), object(5)
memory usage: 11.1+ KB

In [49]: df.describe()
Out[49]: 
           Height      Weight
count  183.000000  189.000000
mean   163.218033   55.015873
std      8.608879   12.824294
min    145.400000   34.000000
25%    157.150000   46.000000
50%    161.900000   51.000000
75%    167.500000   65.000000
max    193.900000   89.000000

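2. Statistical functions

quantile: returns a quantile (df_demo in the examples below holds only the Height and Weight columns)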

In [53]: df_demo.quantile(0.75)
Out[53]: 
Height    167.5
Weight     65.0
Name: 0.75, dtype: float64

count: returns the number of non-missing values

In [54]: df_demo.count()
Out[54]: 
Height    183
Weight    189
dtype: int64

idxmax: returns the index of the maximum value

In [55]: df_demo.idxmax() # idxmin is the counterpart for the minimum
Out[55]: 
Height    193
Weight      2
dtype: int64

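3. Unique-value functions

These are mainly used to summarize the categories in a table.

unique: returns the distinct values
nunique: returns the number of distinct values
value_counts: returns how many times each value occurs

The functions above are used here on a single column. To find the distinct combinations across several columns, use drop_duplicates, which removes duplicate rows.

The keep parameter of drop_duplicates controls which duplicates survive: keep='first' keeps the first occurrence of each row, keep='last' keeps the last occurrence, and keep=False drops every row that has a duplicate.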
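
A small sketch of these functions on a made-up table (the column names kind and color and all values are hypothetical), showing how the three keep settings differ:

```python
import pandas as pd

# Hypothetical table: rows 0 and 1 share the same (kind, color) pair.
df_small = pd.DataFrame({'kind': ['a', 'a', 'b', 'a'],
                         'color': ['red', 'red', 'red', 'blue']})

kinds = df_small['kind'].unique()         # distinct values
n_kinds = df_small['kind'].nunique()      # number of distinct values
counts = df_small['kind'].value_counts()  # occurrences of each value

cols = ['kind', 'color']
keep_first = df_small.drop_duplicates(cols, keep='first')  # keeps row 0 of the pair
keep_last = df_small.drop_duplicates(cols, keep='last')    # keeps row 1 of the pair
keep_none = df_small.drop_duplicates(cols, keep=False)     # drops both rows 0 and 1

print(keep_first.index.tolist(), keep_last.index.tolist(), keep_none.index.tolist())
```

The surviving index labels make it easy to see which copy of a repeated row each setting keeps.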

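4. Replacement functions

Mapping replacement: replace
Logical replacement: where and mask. where replaces the rows where the condition is False, while mask replaces the rows where the condition is True.
Numeric replacement: round rounds values; abs takes the absolute value; clip truncates values at a lower and an upper bound.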
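
The following sketch runs each replacement function on one made-up Series (the values are chosen only for illustration). When no replacement value is passed, where and mask fill the replaced positions with NaN.

```python
import pandas as pd

s = pd.Series([-1.0, 1.2345, 100.0, -50.0])

replaced = s.replace({100.0: 0.0})  # mapping replacement: 100.0 -> 0.0
kept_positive = s.where(s > 0)      # rows where the condition is False become NaN
hidden_positive = s.mask(s > 0)     # rows where the condition is True become NaN
negatives_to_zero = s.mask(s < 0, 0.0)  # an explicit replacement value
rounded = s.round(2)                # round to 2 decimal places
absolute = s.abs()                  # absolute values
clipped = s.clip(0, 2)              # values below 0 become 0, above 2 become 2

print(clipped.tolist())
```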

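5. Sorting functions

Sorting by value: sort_values sorts by column values; its ascending parameter defaults to True (ascending), and False gives descending order.
Sorting by index: sort_index sorts by row labels; the level parameter names the index level(s) to sort by, and string labels are sorted alphabetically.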
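
A minimal sketch of both kinds of sorting on made-up tables (the column name score and the index level names outer and inner are hypothetical):

```python
import pandas as pd

df_small = pd.DataFrame({'score': [3, 1, 2]}, index=['b', 'c', 'a'])

by_value = df_small.sort_values('score')                         # ascending by default
by_value_desc = df_small.sort_values('score', ascending=False)   # descending
by_index = df_small.sort_index()                                 # alphabetical row labels

# With a multi-level index, level picks which level to sort by.
multi = pd.DataFrame(
    {'v': [1, 2, 3]},
    index=pd.MultiIndex.from_tuples([('x', 'b'), ('y', 'a'), ('x', 'c')],
                                    names=['outer', 'inner']))
by_inner = multi.sort_index(level='inner')

print(by_value.index.tolist(), by_index.index.tolist(), by_inner['v'].tolist())
```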

6. The apply method

apply is somewhat like the map function from the previous chapter: it also operates through a user-defined function.
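
As a small illustration, the sketch below applies one custom function per column and another per row. The table reuses two Height and Weight values from the head output above; the lambda functions are examples, not part of the original notes.

```python
import pandas as pd

df_small = pd.DataFrame({'Height': [158.9, 166.5], 'Weight': [46.0, 70.0]})

# Default axis=0: the function receives each column as a Series.
col_range = df_small.apply(lambda col: col.max() - col.min())
# axis=1: the function receives each row as a Series.
row_sum = df_small.apply(lambda row: row.sum(), axis=1)

print(col_range)
print(row_sum)
```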

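V. Exercises

Ex1

Approach: sum the other columns row by row, then count the rows where the sum differs from Total. A count of 0 means they all match.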

import pandas as pd
df = pd.read_csv('pokemon.csv')
print((df[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']].sum(1)!=df['Total']).sum())
out:0

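a. Approach: first use unique to find the number of distinct types, then count how often each one occurs.

On my first attempt I forgot about nunique; paging back through the earlier material reminded me that it exists.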

len(df['Type 1'].unique())

df['Type 1'].nunique()
Out[45]: 18

Then count by type and show the top three:

df['Type 1'].value_counts()[0:3]
Out[55]: 
Water     112
Normal     98
Grass      70
Name: Type 1, dtype: int64

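b. Approach: first remove the duplicates with drop_duplicates, then select the Type 1 and Type 2 columns. Note that keep=False drops every row whose combination is repeated, so the result lists only the combinations that occur exactly once; the default keep='first' would list every distinct combination.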

df.drop_duplicates(['Type 1','Type 2'],keep = False)[['Type 1','Type 2']]
Out[58]: 
       Type 1    Type 2
7        Fire    Dragon
196  Electric    Dragon
237      Fire      Rock
245     Steel    Flying
271   Psychic     Grass
275     Grass    Dragon
307       Bug     Water
316       Bug     Ghost
366    Dragon     Fairy
424    Ground      Fire
434     Grass    Ground
440     Water     Steel
445    Normal     Water
490     Ghost      Dark
501    Poison       Bug
530       Ice     Ghost
531  Electric     Ghost
532  Electric      Fire
533  Electric     Water
534  Electric       Ice
536  Electric     Grass
540     Steel    Dragon
542      Fire     Steel
553   Psychic      Fire
589    Ground     Steel
679    Ground  Electric
699     Steel  Fighting
700      Rock  Fighting
706    Dragon      Fire
707    Dragon  Electric
728    Normal    Ground
743  Fighting      Dark
760    Poison     Water
761    Poison    Dragon
771  Fighting    Flying
772  Electric     Fairy
797   Psychic     Ghost
798   Psychic      Dark
799      Fire     Water

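c. (I understood the idea but could not implement it myself, so the code is taken directly from the reference answer.)

The idea: list every possible combination with two for loops, collect the combinations that actually occur, and take the set difference of the two. The code assumes numpy has been imported as np.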

In [36]: L_full = [i+' '+j for i in df['Type 1'].unique() for j in (
   ....:           df['Type 1'].unique().tolist() + [''])]
   ....: 

In [37]: L_part = [i+' '+j for i, j in zip(df['Type 1'], df['Type 2'
   ....:          ].replace(np.nan, ''))]
   ....: 

In [38]: res = set(L_full).difference(set(L_part))

In [39]: len(res) # too many to print
Out[39]: 188

3. a. Fairly simple: just use mask.

df['Attack'].mask(df['Attack']>120, 'high'
            ).mask(df['Attack']<50, 'low').mask((50<=df['Attack'])&(df['Attack']<=120), 'mid').head()
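
For comparison, here is a sketch of the same three-way labelling with numpy.select, which is a different technique from the mask chain used in these notes. It runs on a small made-up Attack series so the boundary values 50 and 120 can be checked directly.

```python
import numpy as np
import pandas as pd

# Made-up values that sit on and around the two thresholds.
attack = pd.Series([30, 50, 120, 121], name='Attack')

# Conditions are tested in order; anything matching neither falls to the default.
labels = pd.Series(
    np.select([attack > 120, attack < 50], ['high', 'low'], default='mid'),
    index=attack.index, name='Attack')

print(labels.tolist())
```

Because every value that is neither above 120 nor below 50 gets the default, the middle range does not need its own condition.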

b. Convert Type 1 to upper case, with replace and with apply:

df['Type 1'].replace({i:str.upper(i) for i in df['Type 1'].unique()})	# replace
df['Type 1'].apply(lambda x: str.upper(x))    	# apply

c. First compute the deviation:

df['Deviation'] = df[['HP', 'Attack', 'Defense', 'Sp. Atk',
                      'Sp. Def', 'Speed']].apply(lambda x:np.max(
                      (x-x.median()).abs()), 1)					# the argument 1 (axis=1) applies the function row by row

Then sort:

df.sort_values('Deviation', ascending=False).head()