【Hadoop离线基础总结】hive的窗口函数

hive的窗口函数


  • 概述
    hive中一般求取TopN时就需要用到窗口函数
    专业窗口函数一般有三个
    rank() over
    dense rank() over
    row_number() over

  • 实例

    • 创建一个f_test表拿到以下数据
      +------------+--------------+-------------+--+
      | f_test.id  | f_test.name  | f_test.sal  |
      +------------+--------------+-------------+--+
      | 1          | a            | 10          |
      | 2          | a            | 12          |
      | 3          | b            | 13          |
      | 4          | b            | 12          |
      | 5          | a            | 14          |
      | 6          | a            | 15          |
      | 7          | a            | 13          |
      | 8          | b            | 11          |
      | 9          | a            | 16          |
      | 10         | b            | 17          |
      | 11         | a            | 14          |
      +------------+--------------+-------------+--+
      
    • 现在用窗口函数求取TopN
      SELECT id,name,sal,
      RANK() over(PARTITION by name ORDER BY sal DESC) rp,
      DENSE_RANK() over(PARTITION by name ORDER BY sal DESC) drp,
      ROW_NUMBER() over(PARTITION by name ORDER BY sal DESC) rmp
      FROM f_test;
      
      partition by就相当于group by,但是这里不能替换为group by,而且partition by后只能跟一个字段,group by可以跟多个字段
    • 结果
      +-----+-------+------+-----+------+------+--+
      | id  | name  | sal  | rp  | drp  | rmp  |
      +-----+-------+------+-----+------+------+--+
      | 9   | a     | 16   | 1   | 1    | 1    |
      | 6   | a     | 15   | 2   | 2    | 2    |
      | 11  | a     | 14   | 3   | 3    | 3    |
      | 5   | a     | 14   | 3   | 3    | 4    |
      | 7   | a     | 13   | 5   | 4    | 5    |
      | 2   | a     | 12   | 6   | 5    | 6    |
      | 1   | a     | 10   | 7   | 6    | 7    |
      | 10  | b     | 17   | 1   | 1    | 1    |
      | 3   | b     | 13   | 2   | 2    | 2    |
      | 4   | b     | 12   | 3   | 3    | 3    |
      | 8   | b     | 11   | 4   | 4    | 4    |
      +-----+-------+------+-----+------+------+--+
      
      可以看出在a组的sal中有两个相同值
      rank() over是给出了相同的序号3,然后直接跳到5,所以他最后的序号要比实际数据数量少
      dense rank() over同样给出了相同的序号3,但是是接着顺序给出了4,最后序号和实际数据数量相同
      row_number() over则是单纯地按照顺序进行排序,不受相同name影响
原文地址:https://www.cnblogs.com/zzzsw0412/p/12772472.html