pandas

Reading data

pandas.read_csv()

Shape and summary

print(df1.describe())
print(df1.shape)
print(df1.columns)
print(df1.index)
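The reading and inspection steps above can be tried without a file on disk. This is a minimal sketch: the CSV text and its values are made up for illustration, and only the column names come from these notes.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """playerID,yearID,AB,HR
a01,1977,520,31
b02,1977,480,12
c03,1978,610,25
"""

# read_csv accepts a file path or any file-like object.
df1 = pd.read_csv(io.StringIO(csv_text))

print(df1.shape)          # (number of rows, number of columns)
print(list(df1.columns))  # column names
print(df1.describe())     # summary statistics for the numeric columns
```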

Selecting columns

df1[['playerID','nameFirst','nameLast','HR']]

Filtering rows

df1[(df1['AB']>=500)&(df1['yearID']==1977)]
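A small sketch of the row filter above, on made-up data. Each condition needs its own parentheses because `&` binds more tightly than `>=` and `==`.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the notes.
df1 = pd.DataFrame({
    "playerID": ["a01", "b02", "c03", "d04"],
    "yearID": [1977, 1977, 1978, 1977],
    "AB": [520, 480, 610, 500],
})

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (df1["AB"] >= 500) & (df1["yearID"] == 1977)
result = df1[mask]
print(result)
```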

Grouping

df1.groupby(['nameFirst','nameLast'])

Simple aggregation:

df1.groupby(['playerID']).count()

Similar methods include min(), max(), sum(), median(), mean(), quantile(), var() and std().

Aggregation with agg:

Sum aggregation (to keep other columns, add entries such as 'nameFirst':'first', 'nameLast':'first'):

df1.groupby(['playerID']).agg({'HR':'sum','nameFirst':'first','nameLast':'first'})
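A minimal sketch of the sum aggregation on made-up data. 'first' keeps the first value seen in each group, which is safe here because a player's name is the same on every row.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the notes.
df1 = pd.DataFrame({
    "playerID": ["a01", "a01", "b02"],
    "nameFirst": ["Al", "Al", "Bo"],
    "nameLast": ["Adams", "Adams", "Brown"],
    "HR": [10, 15, 7],
})

# Sum HR per player and carry the name columns along.
totals = df1.groupby(["playerID"]).agg(
    {"HR": "sum", "nameFirst": "first", "nameLast": "first"}
)
print(totals)
```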

Max aggregation (to keep the other columns, merge the result back onto the original table):

df9=df1.groupby('yearID').agg({'H':'max'})
df9=pd.merge(df9,df1,on=['yearID','H'],how='left')
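A sketch of the max-then-merge pattern on made-up data, including a tie. The reset_index() call is added here so that yearID is an ordinary column before the merge; the rows and values are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows: two players tie for the 1977 maximum.
df1 = pd.DataFrame({
    "playerID": ["a01", "b02", "c03", "d04"],
    "yearID": [1977, 1977, 1977, 1978],
    "H": [200, 200, 150, 180],
})

# Step 1: the maximum H per year (one row per year).
df9 = df1.groupby("yearID").agg({"H": "max"}).reset_index()

# Step 2: join back on (yearID, H) to recover the other columns.
# Every row of df1 that matches a yearly maximum is returned.
df9 = pd.merge(df9, df1, on=["yearID", "H"], how="left")
print(df9)
```

Because both 1977 leaders match, the merged table has three rows even though there are only two years.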

Sorting:

df1.sort_values('SB',ascending=False)

Selecting the row that holds the maximum of a column:

df1.loc[df1[['HR']].idxmax(),:]

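Note that this uses loc, because idxmax() returns an index label rather than a position. For the minimum, use idxmin().

idxmax() returns only the first occurrence of the maximum. To select every row tied for the maximum, use a join with merge().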

https://blog.csdn.net/oYeZhou/article/details/82378160
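For a single column over the whole table, a boolean mask is a simpler alternative to merge() for picking up ties. This is a sketch on made-up data, not the method from the linked post.

```python
import pandas as pd

# Hypothetical sample rows: two players share the top HR value.
df1 = pd.DataFrame({
    "playerID": ["a01", "b02", "c03"],
    "HR": [40, 25, 40],
})

# idxmax gives the label of the first maximum only.
first_max = df1.loc[[df1["HR"].idxmax()]]

# Comparing against the maximum keeps every tied row.
all_max = df1[df1["HR"] == df1["HR"].max()]

print(first_max)
print(all_max)
```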

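Left outer join (every row of the left table df9 is kept; the result has more rows than df9 when a key matches several rows in df1):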

df9=pd.merge(df9,df1,on=['yearID','H'],how='left')

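For a right outer join, use how='right'.

Converting the index into columns:

After groupby and agg, the group keys become the index of the resulting DataFrame. We usually want them back as ordinary columns: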

df2=df1.groupby(['playerID','yearID']).agg({'HR':'sum','nameFirst':'first','nameLast':'first'}).reset_index()
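A short sketch showing what reset_index() changes, on made-up data: before the call the group keys form a MultiIndex, afterwards they are regular columns.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the notes.
df1 = pd.DataFrame({
    "playerID": ["a01", "a01", "b02"],
    "yearID": [1977, 1977, 1978],
    "HR": [3, 4, 5],
})

grouped = df1.groupby(["playerID", "yearID"]).agg({"HR": "sum"})
print(list(grouped.index.names))  # group keys live in the index
print(list(grouped.columns))      # only the aggregated column remains

df2 = grouped.reset_index()
print(list(df2.columns))          # keys are columns again
```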

Counting the distinct values in a DataFrame column (by converting it to a set):

len(set(df2['playerID']))
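pandas also has a built-in for this, Series.nunique(). A minimal comparison on made-up data; note that nunique() excludes missing values by default (dropna=True), so the two counts can differ when the column contains NaN.

```python
import pandas as pd

# Hypothetical sample column with one repeated ID.
df2 = pd.DataFrame({"playerID": ["a01", "a01", "b02"]})

via_set = len(set(df2["playerID"]))       # the approach from the notes
via_nunique = df2["playerID"].nunique()   # pandas built-in

print(via_set, via_nunique)
```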

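Finding the most frequent element of a Python list and its count (the mode). The function counts runs of adjacent equal elements, so the list must be sorted before it is passed in.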

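# Mode of a Python list.
# Assumes mylist is already sorted, so that equal elements are adjacent.
def cal_mode(mylist):
    # Two special cases: empty list and single element
    if len(mylist)==0:
        return [None,0]
    elif len(mylist)==1:
        return [mylist[0],1]

    # Four working variables: the current run and the best run so far
    temp_elem=mylist[0]
    max_elem=temp_elem
    temp_num=1
    max_num=temp_num

    # Start from index 1
    for i in range(1,len(mylist)):
        if mylist[i]==temp_elem:
            temp_num+=1
        else:
            # The run ended: keep it if it is the longest so far
            if max_num<temp_num:
                max_num=temp_num
                max_elem=temp_elem
            temp_num=1
            temp_elem=mylist[i]
    # Final check for the last run
    if max_num<temp_num:
        max_num=temp_num
        max_elem=temp_elem
    return [max_elem,max_num]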
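As an alternative that does not need sorted input, the standard library's collections.Counter can do the counting. This is a different technique from the run-counting loop above; mode_with_counter is a hypothetical helper name, written to return the same [element, count] shape.

```python
from collections import Counter

def mode_with_counter(values):
    """Return [element, count] for the most frequent element, or [None, 0] if empty."""
    if not values:
        return [None, 0]
    # most_common(1) returns a list holding one (element, count) pair.
    elem, count = Counter(values).most_common(1)[0]
    return [elem, count]

print(mode_with_counter([3, 1, 3, 2, 1, 3]))
```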

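Sorting:

mylist.sort() sorts mylist in place and returns None.

sorted(mylist) leaves mylist unchanged and returns a new sorted list.

sort() is a list method, so it cannot be called on a dict. sorted() accepts any iterable; given a dict, it returns a sorted list of the keys.

Both sort() and sorted() accept a key function (key=...) to customize the ordering.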
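The differences above in a few lines. The lists and the dict are made-up examples.

```python
mylist = [3, 1, 2]

new_list = sorted(mylist)   # new list; mylist is untouched at this point
ret = mylist.sort()         # sorts in place; the return value is None

# Custom ordering through a key function: sort words by length.
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=len)

# sorted() on a dict iterates over its keys.
scores = {"b": 2, "a": 3, "c": 1}
keys_sorted = sorted(scores)

# To order by value, sort the (key, value) pairs instead.
items_by_value = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(new_list, ret, mylist)
print(by_length)
print(keys_sorted)
print(items_by_value)
```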

Reference:

https://www.cnblogs.com/JahanGu/p/7650109.html

Iterating over rows (reference: https://blog.csdn.net/ls13552912394/article/details/79349809):

for index, row in df.iterrows():
    print(row["c1"], row["c2"])
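A runnable sketch of the loop on a made-up two-column frame, next to itertuples(), which yields one namedtuple per row and is usually the faster choice for plain iteration. The column names c1 and c2 follow the snippet above; the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample frame with the column names used above.
df = pd.DataFrame({"c1": [1, 2], "c2": ["x", "y"]})

# iterrows yields (index label, row as a Series).
pairs = []
for index, row in df.iterrows():
    pairs.append((row["c1"], row["c2"]))

# itertuples yields namedtuples; columns are read as attributes.
tuples = [(t.c1, t.c2) for t in df.itertuples(index=False)]

print(pairs)
print(tuples)
```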