Hive on spark和Hive on mr在处理orc格式表格时数据不一致问题探究

一、问题描述

  在使用CDH6.3.2的集群处理数据时,当创建的表格为orc格式,且表格中存在null字段时,where中的<>条件没有生效;

建表语句为:

CREATE TABLE DWD_PC_INT_ZM_StockPoolComponent(
ID bigint ,
JSID bigint ,
InsertTime TIMESTAMP ,
UpdateTime TIMESTAMP ,
CalDate TIMESTAMP ,
InnerCode int ,
SecuCode string ,--(30) ,
SecuAbbr string ,--(100) ,
SecuMarket int ,
RuleCode bigint ,
RuleDesc string ,--(1000),
UDATE string,
DELDATE string
)stored as orcfile;

插入数据为:

INSERT INTO dwd.dwd_pc_int_zm_stockpoolcomponent (id,jsid,inserttime,updatetime,caldate,innercode,secucode,secuabbr,secumarket,rulecode,ruledesc,udate,deldate) VALUES
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',152,'000405','ST 鑫 光',90,100000000100002,'退市股票:退市日期:20040319','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',91,'000047','ST 中 侨',90,100000000100002,'退市股票:退市日期:20030530','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',76,'000033','新 都 退',90,100000000100002,'退市股票:退市日期:20170707','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',61,'000024','招商地产',90,100000000100002,'退市股票:退市日期:20151230','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',51,'000018','神城A退',90,100000000100002,'退市股票:退市日期:20200107','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',44,'000015','PT中浩A',90,100000000100002,'退市股票:退市日期:20011022','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',41,'000013','*ST石化A',90,100000000100002,'退市股票:退市日期:20040920','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',23,'000007','*ST 全新',90,100000000100002,'ST、*ST','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',17,'000005','ST 星 源',90,100000000100002,'ST、*ST','20211021','99991231'),
(NULL,NULL,'2021-10-21 18:16:47.75','2021-10-21 18:16:47.75','2021-10-17 00:00:00.0',11,'000003','PT金田A',90,100000000100002,'退市股票:退市日期:20020614','20211021','99991231');

建表语句为:

CREATE TABLE tmp.TMP_DWD_PC_INT_ZM_StockPoolComponent_02(
CalDate TIMESTAMP ,
InnerCode int ,
SecuCode string ,--(30) ,
SecuAbbr string ,--(100) ,
SecuMarket int ,
RuleCode bigint ,
RuleDesc string --(1000)
)stored as orcfile;

插入数据为:

INSERT INTO tmp_dwd_pc_int_zm_stockpoolcomponent_02 (caldate,innercode,secucode,secuabbr,secumarket,rulecode,ruledesc) VALUES
('2021-10-17 00:00:00.0',23603,'000136','民生策略A',NULL,101000000100001,NULL),
('2021-10-17 00:00:00.0',309519,'000090','民生信用债A',NULL,101000000100001,NULL),
('2021-10-17 00:00:00.0',309520,'000089','民生信用债C',NULL,101000000100001,NULL),
('2021-10-17 00:00:00.0',22892,'000068','民生转债C',NULL,101000000100001,NULL),
('2021-10-17 00:00:00.0',22891,'000067','民生转债A',NULL,101000000100001,NULL),
('2021-10-17 00:00:00.0',44,'000015','PT中浩A',90,100000000100002,'退市股票:退市日期:20011022'),
('2021-10-17 00:00:00.0',41,'000013','*ST石化A',90,100000000100002,'退市股票:退市日期:20040920'),
('2021-10-17 00:00:00.0',23,'000007','*ST 全新',90,100000000100002,'ST、*ST'),
('2021-10-17 00:00:00.0',17,'000005','ST 星 源',90,100000000100002,'ST、*ST'),
('2021-10-17 00:00:00.0',11,'000003','PT金田A',90,100000000100002,'退市股票:退市日期:20020614');

执行查询语句为:

SELECT
A.CalDate,
A.InnerCode,
A.SecuCode,
A.SecuAbbr,
A.SecuMarket,
A.RuleCode,
A.RuleDesc,
2 AS DataFlag
FROM TMP_DWD_PC_INT_ZM_StockPoolComponent_02 A
JOIN DWD_PC_INT_ZM_StockPoolComponent B ON A.CalDate=B.CalDate AND A.RuleCode=B.RuleCode AND A.SecuCode=B.SecuCode AND NVL(A.SecuMarket,0)=NVL(B.SecuMarket,0)
AND B.DELDATE='99991231'
WHERE NVL(A.InnerCode,0)<>NVL(B.InnerCode,0)
OR NVL(A.SecuAbbr,'')<>NVL(B.SecuAbbr,'')
OR NVL(A.RuleDesc,'')<>NVL(B.RuleDesc,'')

经过测试发现:

1、对于CDH6.3.2的集群

  如果使用hive  on spark执行引擎,可以查询出数据;使用hive on mr 或者sparksql来执行是查询不出来数据的;将表格格式切换为txt或者parquet时,使用mr或者hive on spark也查询不出数据

2、对于TDH6.0.2的集群

使用tdh集群来查询也查询不出数据;

3、将CDH集群切换到5.16.4版本

使用hive  on spark、hive on mr引擎也查不出结果;

二、问题总结

 虽然具体原因没有定位到,但是可以确定的是hive2.1.1版本对hive on spark支持的不友好,会有bug存在。

   1、由此可见,cdh6.3.2的集群对于在表格格式为orc格式时,使用hive  on spark执行引擎会有出错的情况;CDH6.3.2的集群中的hive版本2.1.1对orc格式的表格处理不友好,会有bug存在;

   2、推荐大家在建立数据库表格时,使用parquet格式。

现在Cloudera公司在主推CDP集群,后面cdh版本不在维护,hive on mr和hive on spark引擎在CDP7.1推出后也不在支持了。后面会主推hive  on tz。

原文地址:https://www.cnblogs.com/chhyan-dream/p/15471053.html