直方图

直方图

  • 数值型数据可视化的一种方式。
  • 如果原始数值型数据经过类别化(分组)处理,则可以使用直方图来观察数据的分布。
  • 如果原始数据没有经过分组处理,则使用茎叶图、箱线图、小提琴图、点图、核密度图等来观察数据的分布。

作用:展示数据分布的一种常用方式,通过直方图可观察数据分布的大致形状,能看分布是否对称、快速判断数据是否近似服从正态分布。

import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False

下面来看plt.hist()的语法:

plt.hist(x, bins=None, range=None, normed=False, weights=None, cumulative=False, bottom=None, histtype='bar',
align='mid', orientation='vertical', rwidth=None, log=False, color=None, label=None, stacked=False,
hold=None, data=None, **kwargs)

  Parameters
----------
x : (n,) array or sequence of (n,) arrays  指定要绘制直方图的数据
    Input values, this takes either a single array or a sequency of
    arrays which are not required to be of the same length

bins : integer or array_like or 'auto', optional    指定直方图条形的个数,一种是给定整数,指定绘制多少个条形;一种是给出一个序列,可随意指定间距
    If an integer is given, `bins + 1` bin edges are returned,   
    consistently with :func:`numpy.histogram` for numpy version >=
    1.3.

    Unequally spaced bins are supported if `bins` is a sequence.   

    If Numpy 1.11 is installed, may also be ``'auto'``.

    Default is taken from the rcParam ``hist.bins``.

range : tuple or None, optional   指定直方图数据的上下界,默认包含绘图数据的max和min
    The lower and upper range of the bins. Lower and upper outliers
    are ignored. If not provided, `range` is (x.min(), x.max()). Range
    has no effect if `bins` is a sequence.

    If `bins` is a sequence or `range` is specified, autoscaling
    is based on the specified bin range instead of the
    range of x.

    Default is ``None``

normed : boolean, optional   是否将直方图的频数转换成频率
    If `True`, the first element of the return tuple will
    be the counts normalized to form a probability density, i.e.,
    ``n/(len(x)`dbin)``, i.e., the integral of the histogram will sum
    to 1. If *stacked* is also *True*, the sum of the histograms is
    normalized to 1.

    Default is ``False``     默认是False,频数图;频率图需要改成True

weights : (n, ) array_like or None, optional  该参数可为每一个数据设置权重
    An array of weights, of the same shape as `x`.  Each value in `x`
    only contributes its associated weight towards the bin count
    (instead of 1).  If `normed` is True, the weights are normalized,
    so that the integral of the density over the range remains 1.

    Default is ``None``

cumulative : boolean, optional   是否需要计算累计频数或频率
    If `True`, then a histogram is computed where each bin gives the
    counts in that bin plus all bins for smaller values. The last bin
    gives the total number of datapoints.  If `normed` is also `True`
    then the histogram is normalized such that the last bin equals 1.
    If `cumulative` evaluates to less than 0 (e.g., -1), the direction
    of accumulation is reversed.  In this case, if `normed` is also
    `True`, then the histogram is normalized such that the first bin
    equals 1.

    Default is ``False``  默认是False,普通的频数或频率图;True,即设置为累计频数图or累计频率图

bottom : array_like, scalar, or None    可以为直方图的每个条形添加基准线,默认0
    Location of the bottom baseline of each bin.  If a scalar,
    the base line for each bin is shifted by the same amount.
    If an array, each bin is shifted independently and the length
    of bottom must match the number of bins.  If None, defaults to 0.

    Default is ``None``

histtype : {'bar', 'barstacked', 'step',  'stepfilled'}, optional   指定直方图的类型,默认为bar,还有barstacked,step,stepfilled
    The type of histogram to draw.

    - 'bar' is a traditional bar-type histogram.  If multiple data
      are given the bars are aranged side by side.

    - 'barstacked' is a bar-type histogram where multiple
      data are stacked on top of each other.

    - 'step' generates a lineplot that is by default
      unfilled.

    - 'stepfilled' generates a lineplot that is by default
      filled.

    Default is 'bar'

align : {'left', 'mid', 'right'}, optional   设置条形边界值的对齐方式,默认mid,还有left and right
    Controls how the histogram is plotted.

        - 'left': bars are centered on the left bin edges.

        - 'mid': bars are centered between the bin edges.

        - 'right': bars are centered on the right bin edges.

    Default is 'mid'

orientation : {'horizontal', 'vertical'}, optional     设置直方图的摆放方向,默认为垂直方向
    If 'horizontal', `~matplotlib.pyplot.barh` will be used for
    bar-type histograms and the *bottom* kwarg will be the left edges.

rwidth : scalar or None, optional   设置直方图条形宽的的百分比
    The relative width of the bars as a fraction of the bin width.  If
    `None`, automatically compute the width.

    Ignored if `histtype` is 'step' or 'stepfilled'.

    Default is ``None``

log : boolean, optional  对数据是否进行log变换
    If `True`, the histogram axis will be set to a log scale. If `log`
    is `True` and `x` is a 1D array, empty bins will be filtered out
    and only the non-empty (`n`, `bins`, `patches`) will be returned.

    Default is ``False``

color : color or array_like of colors or None, optional   直方图的填充色
    Color spec or sequence of color specs, one per dataset.  Default
    (`None`) uses the standard line color sequence.

    Default is ``None``

label : string or None, optional  直方图的标签,可通过legend展示其图例
    String, or sequence of strings to match multiple datasets.  Bar
    charts yield multiple patches per dataset, but only the first gets
    the label, so that the legend command will work as expected.

    default is ``None``

stacked : boolean, optional 当有多个数据时,是否需要将直方图堆叠摆放,默认水平摆放
    If `True`, multiple data are stacked on top of each other If
    `False` multiple data are aranged side by side if histtype is
    'bar' or on top of each other if histtype is 'step'

    Default is ``False``
 
 facecolor: 直方图颜色
 alpha: 条形透明度

借用Titanic的年龄数据来做例子:

import pandas as pd
data = pd.read_csv(r"G:KaggleTitanic	rain.csv")
data.Age.isnull().sum()
177

存在177个缺失值,我们选择删除掉这些样本。

data.dropna(subset=["Age"], inplace=True)
plt.hist(data.Age)  #普通版
plt.show()

png

下面逐步美化一下

plt.style.use("ggplot")
plt.hist(data.Age, bins=20, color="#FF7373" )     
plt.show()

png

界限不清晰,再加一个边界颜色看看效果。下面先画一个指定组数的,一个指定组距的直方图,根据需要使用。

plt.hist(data.Age, bins=20, color="#FF7373",  edgecolor="K")
plt.title("乘客年龄分布直方图(指定20组)", fontsize=18)
plt.xlabel("年龄", size=15)
plt.ylabel("频数", size=15)
plt.show()

png

plt.hist(data.Age, bins=np.arange(data.Age.min(),data.Age.max(),5), color="#FF7373",  edgecolor="K")
plt.title("乘客年龄直方图(指定组距为5)",fontsize=18)
plt.xlabel("年龄", size=15)
plt.ylabel("频数", size=15)
plt.show()

png

看到年龄分布有点接近正态分布

下面绘制个累计频率直方图

import numpy as np
plt.figure(figsize=(5,6))
plt.hist(data.Age, bins=np.arange(data.Age.min(),data.Age.max(),5), normed=True, cumulative=True, color="#FF7373",  edgecolor="K")
plt.title("乘客年龄累计频率直方图",fontsize=18)
plt.xlabel("年龄", size=15)
plt.ylabel("频数", size=15)
plt.show()

png

从上图看出乘客年龄的分布情况,接近70%的乘客年龄在35岁以下,接近80%的乘客年龄在40岁以下。

下面给直方图添加两条曲线:正态分布曲线and核密度曲线。一般我们用核密度曲线,正态分布曲线与现实一般都是不一样的,只是理论上的曲线。

import matplotlib.mlab as mlab
plt.figure(figsize=(5.5,5.5))
plt.hist(data.Age, bins = np.arange(data.Age.min(),data.Age.max(), 5), normed = True, color="#FF7373", edgecolor = "k")#画出原数据的频率直方图
x1 = np.linspace(data.Age.min(), data.Age.max(), 1000)    #生成正态分布的数据
normal = mlab.normpdf(x1, data.Age.mean(), data.Age.std())  
line1, = plt.plot(x1,normal, "b--", linewidth = 3)        #画出正态分布曲线
kde = mlab.GaussianKDE(data.Age)    #生存核密度数据
x2 = np.linspace(data.Age.min(), data.Age.max(), 1000)
line2, = plt.plot(x2,kde(x2), "g-", linewidth = 3, alpha=0.6)     #画出核密度曲线
plt.legend([line1, line2],[ "正态分布曲线", "核密度曲线"],loc= "best")  #生成图例
plt.title("乘客年龄分布直方图", fontsize=18)
plt.xlabel("年龄", size=15)
plt.ylabel("频率", size=15)
plt.show()

png

乘客的年龄不服从正态分布,从核密度曲线看出年龄的实际分布情况。

原文地址:https://www.cnblogs.com/wyy1480/p/9672301.html