pd.merge(), pd.concat()

处理数据的时候,在数据库经常用到inner join, left join, right join 等连表方式,而在python,有多种连接方式。自己也经常混淆,于是记录一下。
看了很多博客,但总感觉不清不楚。于是,还是看官方文档help()一下好了。

一.pd.merge() 数据变得更胖(主要横向发展,因为左右表的列都连接起来了;但是也因为连接方式,可能变矮变高)

语法:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Parameters(参数解释)
----------

  • left : DataFrame(二维数据表)

  • right : DataFrame or named Series (放你要连接的数据)

  • how: 指定拼接方式,有:{'left', 'right', 'outer', 'inner'}, default 'inner'。

  • on : label or list(使用默认值的话,python会自动去找两个表里列名一样的列来拼接;也可以自定义哪些列作为连接的连接键,例:on=['name', 'weight'],这是在两个表有一样的列名的时候,如果左右表列名不一样,但是是代指同一个东西的话,那就用下面给的两个参数)

  • left_on : label or list, or array-like (指定左表的连接列,如:left_on=['name','id'])

  • right_on : label or list, or array-like(指定右表的连接列,如:left_on=['姓名','学号'])

  • left_index : bool, default False(使用左表的行索引作为连接键,有可能是multiindex,那就是多个键)
    Use the index from the left DataFrame as the join key(s). If it is a
    MultiIndex, the number of keys in the other DataFrame (either the index
    or a number of columns) must match the number of levels.

  • right_index : bool, default False(使用右表的行索引作为连接键,同上)
    Use the index from the right DataFrame as the join key. Same caveats as
    left_index.

  • sort : bool, default False(对连接后的表,排序,一般不整这个,因为数据多的话,你的机子可能崩了)
    Sort the join keys lexicographically in the result DataFrame. If False,
    the order of the join keys depends on the join type (how keyword).

  • suffixes : tuple of (str, str), default ('_x', '_y')(左右表连接完,经常会出现重复的列名,那么用该参数指定同名列的后缀,输出结果你就可以看到这是哪个表的列了,不用混淆了)
    Suffix to apply to overlapping column names in the left and right
    side, respectively. To raise an exception on overlapping columns use
    (False, False).

  • copy : bool, default True (默认用的是复制表,不在原数据上操作)
    If False, avoid copy if possible.

  • indicator : bool or str, default False(布尔值,可选,可以在结果中新增一列,显示这一行数据的来源,来自左表,则显示:left;来自右表,则显示:right,还是两列都有both)
    If True, adds a column to output DataFrame called "_merge" with
    information on the source of each row.
    If string, column with information on source of each row will be added to
    output DataFrame, and column will be named value of string.
    Information column is Categorical-type and takes on a value of "left_only"
    for observations whose merge key only appears in 'left' DataFrame,
    "right_only" for observations whose merge key only appears in 'right'
    DataFrame, and "both" if the observation's merge key is found in both.

  • validate : str, optional(很少用,跳过)
    If specified, checks if merge is of specified type.

      * "one_to_one" or "1:1": check if merge keys are unique in both
        left and right datasets.
      * "one_to_many" or "1:m": check if merge keys are unique in left
        dataset.
      * "many_to_one" or "m:1": check if merge keys are unique in right
        dataset.
      * "many_to_many" or "m:m": allowed, but does not result in checks.
    
      .. versionadded:: 0.21.0
    

    Returns(返回结果)

    DataFrame(返回的是一个二维数据表对象)
    A DataFrame of the two merged objects.

二.pd.concat() 也是连接表,但是这里只有inner和outer方式。默认是outer,一般我要拼接多个店铺的运营数据的时候,就用concat。因为数据表的表头有些会不一样,但是影响不大,我直接用拼接,一切搞掂。

语法:
concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
Concatenate pandas objects along a particular axis with optional set logic
along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis,
which may be useful if the labels are the same (or overlapping) on
the passed axis number.

Parameters
----------
  • objs : 要连接的表,多表用列表[]框起来

  • axis : {0/'index', 1/'columns'}, default 0(默认数据会变高,如果指定axis=1, 数据会变变胖(横向发展))

  • join : {'inner', 'outer'}, default 'outer'(指定连接方式,默认是outer)
    How to handle indexes on other axis (or axes).

  • join_axes : list of Index objects
    .. deprecated:: 0.25.0

      Specific indexes to use for the other n - 1 axes instead of performing
      inner/outer set logic. Use .reindex() before or after concatenation
      as a replacement.
    
  • ignore_index : bool, default False(默认输出结果会有index,如果你不想要,设置True.)

  • keys : sequence, default None(给输出结果定义多出一行或者一列,给他定义一个标签,如:key=[A, B],那么在输出结果里,就会有一行或者一列标明A或者B,那么你就可以知道他是哪个表来的,数据透视可用)
    If multiple levels passed, should contain tuples. Construct
    hierarchical index using the passed keys as the outermost level.

  • levels : list of sequences, default None
    Specific levels (unique values) to use for constructing a
    MultiIndex. Otherwise they will be inferred from the keys.

  • names : list, default None
    Names for the levels in the resulting hierarchical index.

  • verify_integrity : bool, default False
    Check whether the new concatenated axis contains duplicates. This can
    be very expensive relative to the actual data concatenation.

  • sort : bool, default None(是否对表进行排序,默认不排序)

  • copy : bool, default True(是否复制表操作,在原表操作不太好,万一出错了呢)
    If False, do not copy data unnecessarily.

    Returns

    object, type of objs
    When concatenating all Series along the index (axis=0), a
    Series is returned. When objs contains at least one
    DataFrame, a DataFrame is returned. When concatenating along
    the columns (axis=1), a DataFrame is returned.

原文地址:https://www.cnblogs.com/wyy1480/p/12349109.html