[Daily Learning] pandas: Pivot Table and Cross-Tabulation Functions

Daily Reflection

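[Separate the 8 hours at work from the hours outside it]
For the past month, I often carried work problems with me on the road, on the subway, before bed, and even into the weekend.
I soon found, however, that I had achieved almost nothing outside of work, and that progress at work was not satisfactory either.
Thinking it over, time outside work is for learning new things and generating new ideas. That learning supports the work itself, and it also opens up new directions. It should not be spent inefficiently turning over work solutions in my head.
So I think it is necessary to set clear goals and actions for both sides: outside work, one book a week plus daily reading of a technical book in its original language; at work, prioritize tasks properly and think the approach through before starting.
Being efficient and productive is the real goal.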

pandas.pivot_table

pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

Introduction:
A method of a pandas.core.frame.DataFrame instance. It creates a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table are stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

Usage:
pandas.pivot_table(dataframe, other parameters)
is equivalent to
dataframe.pivot_table(other parameters)
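
The equivalence of the two call forms can be checked with a minimal sketch. The DataFrame below and its column names (Name, Product, Price) are made up for illustration and are not from any real dataset.

```python
import pandas as pd

# Hypothetical sales data, invented for this illustration only.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "Product": ["CPU", "Monitor", "CPU", "CPU"],
    "Price": [100.0, 50.0, 120.0, 80.0],
})

# Function form: the DataFrame is passed as the first argument.
by_function = pd.pivot_table(df, values="Price", index="Name", aggfunc="sum")

# Method form: the DataFrame is the object the method is called on.
by_method = df.pivot_table(values="Price", index="Name", aggfunc="sum")

# Both forms should produce the same table.
print(by_function.equals(by_method))
```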

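Parameters:
Before looking at the parameters, consider the structure of a pivot table in Excel: Filters, Columns, Rows, and Values. Apart from Filters, the Columns, Rows, and Values areas work the same way as the corresponding parameters of pandas.pivot_table described below.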

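data : the DataFrame to pivot;
values : optional, the column(s) to aggregate, equivalent to "Values", e.g. values=["Price"];
index : the keys to group by on the rows, equivalent to "Rows"; for multiple levels pass a list, e.g. index=["Name","Rep","Manager"];
columns : the keys to group by on the columns, equivalent to "Columns";
aggfunc : the aggregation function(s) to apply. Pass a list to apply several functions to every value column, e.g. aggfunc=[np.mean,len]; pass a dict to use different functions for different value columns, e.g. aggfunc={"Quantity":len,"Price":[np.sum,np.mean]};
fill_value : the value used to replace NaN in the aggregated result, e.g. fill_value=0;
margins : whether to add subtotals/grand totals for all rows and columns, e.g. margins=True;
margins_name : the name of the totals row/column when margins is True; the default is "All".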
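
The difference between the list form and the dict form of aggfunc is easier to see side by side. This is a rough sketch on made-up data; the string names "mean", "sum", and "count" are used here in place of np.mean, np.sum, and len as a choice for this example, since recent pandas versions accept function names as strings.

```python
import pandas as pd

# Hypothetical order data, invented for this illustration only.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "Quantity": [1, 2, 3, 4],
    "Price": [10.0, 20.0, 30.0, 50.0],
})

# List form: every function is applied to every value column.
with_list = df.pivot_table(
    index="Name",
    values=["Price", "Quantity"],
    aggfunc=["mean", "sum"],
)

# Dict form: each value column gets its own function(s).
with_dict = df.pivot_table(
    index="Name",
    values=["Price", "Quantity"],
    aggfunc={"Quantity": "count", "Price": ["sum", "mean"]},
)

print(with_list)
print(with_dict)
```

In the list form the function name becomes the outer column level, while in the dict form the value column is the outer level and the function name the inner one.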

Example:
See help(pandas.pivot_table).
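
As a complement to the built-in help, here is a small sketch of fill_value, margins, and margins_name working together. The data and the name "Total" are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: customer B never bought a Monitor,
# so that cell would be NaN without fill_value.
df = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Product": ["CPU", "Monitor", "CPU"],
    "Price": [100.0, 50.0, 120.0],
})

table = df.pivot_table(
    values="Price",
    index="Name",          # "Rows" in Excel terms
    columns="Product",     # "Columns" in Excel terms
    aggfunc="sum",
    fill_value=0,          # replace missing combinations with 0
    margins=True,          # add a totals row and a totals column
    margins_name="Total",  # label the totals instead of the default "All"
)

print(table)
```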

pandas.crosstab

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)


    Compute a simple cross-tabulation of two (or more) factors. By default
    computes a frequency table of the factors unless an array of values and an
    aggregation function are passed.
    
    Parameters
    ----------
    index : array-like, Series, or list of arrays/Series
        Values to group by in the rows
    columns : array-like, Series, or list of arrays/Series
        Values to group by in the columns
    values : array-like, optional
        Array of values to aggregate according to the factors.
        Requires `aggfunc` be specified.
    aggfunc : function, optional
        If specified, requires `values` be specified as well
    rownames : sequence, default None
        If passed, must match number of row arrays passed
    colnames : sequence, default None
        If passed, must match number of column arrays passed
    margins : boolean, default False
        Add row/column margins (subtotals)
    margins_name : string, default 'All'
        Name of the row / column that will contain the totals
        when margins is True.
    
        .. versionadded:: 0.21.0
    
    dropna : boolean, default True
        Do not include columns whose entries are all NaN
    normalize : boolean, {'all', 'index', 'columns'}, or {0,1}, default False
        Normalize by dividing all values by the sum of values.
    
        - If passed 'all' or `True`, will normalize over all values.
        - If passed 'index' will normalize over each row.
        - If passed 'columns' will normalize over each column.
        - If margins is `True`, will also normalize margin values.
    
        .. versionadded:: 0.18.1
    
    
    Notes
    -----
    Any Series passed will have their name attributes used unless row or column
    names for the cross-tabulation are specified.
    
    Any input passed containing Categorical data will have **all** of its
    categories included in the cross-tabulation, even if the actual data does
    not contain any instances of a particular category.
    
    In the event that there aren't overlapping indexes an empty DataFrame will
    be returned.
    
    Examples
    --------
import numpy as np
import pandas as pd

a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
              "bar", "bar", "foo", "foo", "foo"], dtype=object)
b = np.array(["one", "one", "one", "two", "one", "one",
              "one", "two", "two", "two", "one"], dtype=object)
c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
              "shiny", "dull", "shiny", "shiny", "shiny"],
              dtype=object)
    
pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
    b   one        two
    c   dull shiny dull shiny
    a
    bar    1     2    1     0
    foo    2     2    1     2
    
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
pd.crosstab(foo, bar, dropna=False)  # 'c' and 'f' are not represented in the data,
                                     # but dropna=False keeps them in the output
    col_0  d  e  f
    row_0
    a      1  0  0
    b      0  1  0
    c      0  0  0
    
    Returns
    -------
    crosstab : DataFrame
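
To tie the margins, normalize, and values/aggfunc parameters from the docstring together, here is a minimal sketch on made-up data. The column names (Region, Product, Price) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data, invented for this illustration only.
df = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "West"],
    "Product": ["CPU", "Monitor", "CPU", "CPU", "Monitor"],
    "Price": [100.0, 50.0, 120.0, 80.0, 60.0],
})

# Default behavior: a frequency table, here with row/column totals added.
counts = pd.crosstab(df["Region"], df["Product"], margins=True)

# normalize="index" divides each row by its row sum, giving row shares.
row_share = pd.crosstab(df["Region"], df["Product"], normalize="index")

# values and aggfunc must be passed together: here, the sum of Price per cell.
totals = pd.crosstab(
    df["Region"], df["Product"], values=df["Price"], aggfunc="sum"
)

print(counts)
print(row_share)
print(totals)
```

Because the Series passed in carry names, the result uses Region and Product as the axis names, as the Notes section of the docstring describes.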

Without a summary, you can't master it.