Impala Catalog Loading

Reposted from: https://blog.csdn.net/hezh914/article/details/52810985

1. The Problem
  In a large application system, tens of billions or even hundreds of billions of records have to be loaded into Hadoop every day. Once the data reaches this scale, tasks that used to be easy become very tricky, whether in Oracle or in a Hadoop cluster that uses Impala as its interactive query engine, and you run into problems that keep you up at night. One of them is Impala's metadata synchronization, for example:

Why does a table that exists in Hive come back as "table not found" when queried in Impala, and why is it not refreshed automatically?
Why are there data files on HDFS while queries return no rows, and SHOW PARTITIONS does not list all of the table's partitions?
  Each of these strange problems can keep you up at night. Below we walk through Impala's metadata synchronization process and look at what exactly it synchronizes.

2. Analysis
  The Impala source code can be downloaded from GitHub or browsed online at impala.io.
  Everything related to the Impala Catalog lives in the package com.cloudera.impala.catalog under the fe directory. Metadata synchronization is involved in the following situations:
  1. When Impala starts up
  All tables are placed into the TableLoadingMgr.tableLoadingDeque_ queue and loaded from that queue. The queue is declared as:

private final LinkedBlockingDeque<TTableName> tableLoadingDeque_ =
    new LinkedBlockingDeque<TTableName>();

  As a result, you will often see log lines like the following in catalog.INFO:
  TableLoadingMgr.java:278] Loading next table. Remaining items in queue: 16124
  The following method of the TableLoadingMgr class shows that multiple threads are started to load table metadata:

private void startTableLoadingThreads() {
  ExecutorService loadingPool = Executors.newFixedThreadPool(numLoadingThreads_);
  try {
    for (int i = 0; i < numLoadingThreads_; ++i) {
      loadingPool.execute(new Runnable() {
        @Override
        public void run() {
          while (true) {
            try {
              loadNextTable();
            } catch (Exception e) {
              LOG.error("Error loading table: ", e);
              // Ignore exception.
            }
          }
        }
      });
    }
  } finally {
    loadingPool.shutdown();
  }
}
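
  To make the pattern concrete, here is a minimal, self-contained sketch of the same idea (this is not Impala's code; the class and variable names are invented for illustration): a fixed pool of worker threads blocks on a LinkedBlockingDeque and drains it, which matches the endless worker loop above, since loadNextTable() presumably blocks on tableLoadingDeque_ in the same way.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingDeque;

public class TableLoadingSketch {
  // Stand-in for TableLoadingMgr.tableLoadingDeque_: holds the table names to load.
  private static final LinkedBlockingDeque<String> queue = new LinkedBlockingDeque<>();

  public static void main(String[] args) throws InterruptedException {
    // Enqueue a few "tables", as the catalog does at startup.
    for (int i = 0; i < 10; i++) queue.offerLast("db.table_" + i);

    int numThreads = 4;  // analogous to numLoadingThreads_
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    for (int i = 0; i < numThreads; i++) {
      pool.execute(() -> {
        while (true) {
          try {
            // takeFirst() blocks until a table name is available,
            // which is why the worker loop never exits on its own.
            String table = queue.takeFirst();
            System.out.println(Thread.currentThread().getName()
                + " loading " + table + ", remaining in queue: " + queue.size());
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
      });
    }
    Thread.sleep(1000);  // let the workers drain the queue, then stop them
    pool.shutdownNow();
  }
}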


  2. When an INVALIDATE METADATA or REFRESH statement is executed
  Both statements call the execResetMetadata method of com.cloudera.impala.service.CatalogOpExecutor. The following code in that method shows which catalog method each statement invokes:

if (req.isIs_refresh()) {
  modifiedObjects.second = catalog_.reloadTable(req.getTable_name()); // REFRESH statement
} else {
  wasRemoved = catalog_.invalidateTable(req.getTable_name(), modifiedObjects); // INVALIDATE METADATA statement
}
// catalog_ is an instance of com.cloudera.impala.catalog.CatalogServiceCatalog

  3. When a DDL operation is executed; besides applying the DDL itself, Impala also refreshes the affected metadata
  For example, when adding a partition, the alterTableAddPartition method of com.cloudera.impala.service.CatalogOpExecutor is called; after the partition has been added to the Hive table, the same class's addHdfsPartition method is called to refresh the new partition in Impala.
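
  To make that ordering concrete, here is a purely hypothetical sketch (the MetastoreDdl and CatalogCache interfaces and every name in it are invented for illustration; this is not Impala's CatalogOpExecutor code): the DDL is applied through the Hive Metastore first, and only afterwards is the local catalog cache updated for the new partition.

// Invented interfaces, for illustration only.
interface MetastoreDdl {
  void addPartitionInHive(String db, String table, String partitionSpec) throws Exception;
}

interface CatalogCache {
  void addHdfsPartition(String db, String table, String partitionSpec) throws Exception;
}

public class AddPartitionFlowSketch {
  private final MetastoreDdl metastore;
  private final CatalogCache catalog;

  public AddPartitionFlowSketch(MetastoreDdl metastore, CatalogCache catalog) {
    this.metastore = metastore;
    this.catalog = catalog;
  }

  // Mirrors the ordering described above: change the Hive table first,
  // then refresh the new partition in the local catalog cache.
  public void alterTableAddPartition(String db, String table, String spec) throws Exception {
    metastore.addPartitionInHive(db, table, spec);
    catalog.addHdfsPartition(db, table, spec);
  }

  public static void main(String[] args) throws Exception {
    AddPartitionFlowSketch sketch = new AddPartitionFlowSketch(
        (db, table, spec) -> System.out.println("Hive Metastore: added " + spec),
        (db, table, spec) -> System.out.println("Catalog cache:  refreshed " + spec));
    sketch.alterTableAddPartition("demo_db", "demo_table", "dt=2016-10-01");
  }
}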

  Whichever of these operations triggers it, loading table metadata from the Hive Metastore eventually goes through HdfsTable's load method. This method shows exactly what metadata Impala has to load and cache:

public void load(Table cachedEntry, HiveMetaStoreClient client,
    org.apache.hadoop.hive.metastore.api.Table msTbl) throws TableLoadingException {
  numHdfsFiles_ = 0;
  totalHdfsBytes_ = 0;
  // This load log line is the one frequently seen in catalog.INFO.
  LOG.debug("load table: " + db_.getName() + "." + name_);

  // turn all exceptions into TableLoadingException
  try {
    // set nullPartitionKeyValue from the hive conf.
    nullPartitionKeyValue_ = client.getConfigValue(
        "hive.exec.default.partition.name", "__HIVE_DEFAULT_PARTITION__");

    // set NULL indicator string from table properties
    nullColumnValue_ =
        msTbl.getParameters().get(serdeConstants.SERIALIZATION_NULL_FORMAT);
    if (nullColumnValue_ == null) nullColumnValue_ = DEFAULT_NULL_COLUMN_VALUE;

    // populate with both partition keys and regular columns
    List<FieldSchema> partKeys = msTbl.getPartitionKeys();
    List<FieldSchema> tblFields = Lists.newArrayList();
    String inputFormat = msTbl.getSd().getInputFormat();
    if (HdfsFileFormat.fromJavaClassName(inputFormat) == HdfsFileFormat.AVRO) {
      // ... (the Avro-specific column loading is omitted here; it builds each
      // column's FieldSchema (type and comment) from the Avro schema and adds
      // it to tblFields)
    } else {
      tblFields.addAll(msTbl.getSd().getCols());
    }
    List<FieldSchema> fieldSchemas = new ArrayList<FieldSchema>(
        partKeys.size() + tblFields.size());
    fieldSchemas.addAll(partKeys);
    fieldSchemas.addAll(tblFields);
    // The number of clustering columns is the number of partition keys.
    numClusteringCols_ = partKeys.size();
    // Load the columns and their column statistics; this includes the
    // partition columns added just above.
    loadColumns(fieldSchemas, client);

    // Collect the list of partitions to use for the table. Partitions may be reused
    // from the existing cached table entry (if one exists), read from the metastore,
    // or a mix of both. Whether or not a partition is reused depends on whether
    // the table or partition has been modified.
    List<org.apache.hadoop.hive.metastore.api.Partition> msPartitions =
        Lists.newArrayList();
    if (cachedEntry == null || !(cachedEntry instanceof HdfsTable) ||
        cachedEntry.lastDdlTime_ != lastDdlTime_) {
      // If the table is not cached yet, or it has been modified, load all
      // partition metadata from the Hive Metastore.
      msPartitions.addAll(MetaStoreUtil.fetchAllPartitions(
          client, db_.getName(), name_, NUM_PARTITION_FETCH_RETRIES));
    } else {
      // The table was already in the metadata cache and it has not been modified.
      Preconditions.checkArgument(cachedEntry instanceof HdfsTable);
      HdfsTable cachedHdfsTableEntry = (HdfsTable) cachedEntry;
      // Set of partition names that have been modified. Partitions in this Set need to
      // be reloaded from the metastore.
      Set<String> modifiedPartitionNames = Sets.newHashSet();

      // If these are not the exact same object, look up the set of partition names in
      // the metastore. This is to support the special case of CTAS which creates a
      // "temp" table that doesn't actually exist in the metastore.
      if (cachedEntry != this) {
        // Since the table has not been modified, we might be able to reuse some of the
        // old partition metadata if the individual partitions have not been modified.
        // First get a list of all the partition names for this table from the
        // metastore, this is much faster than listing all the Partition objects.
        modifiedPartitionNames.addAll(
            client.listPartitionNames(db_.getName(), name_, (short) -1));
      }

      int totalPartitions = modifiedPartitionNames.size();
      // Get all the partitions from the cached entry that have not been modified.
      for (HdfsPartition cachedPart: cachedHdfsTableEntry.getPartitions()) {
        // Skip the default partition and any partitions that have been modified
        // (those need to be reloaded).
        if (cachedPart.isDirty() || cachedPart.getMetaStorePartition() == null ||
            cachedPart.getId() == DEFAULT_PARTITION_ID) {
          continue;
        }
        org.apache.hadoop.hive.metastore.api.Partition cachedMsPart =
            cachedPart.getMetaStorePartition();
        Preconditions.checkNotNull(cachedMsPart);

        // This is a partition we already know about and it hasn't been modified.
        // No need to reload the metadata.
        String cachedPartName = cachedPart.getPartitionName();
        if (modifiedPartitionNames.contains(cachedPartName)) {
          // Even though this partition does not need a refresh, it is still added to
          // the list; the load methods below skip most of the work for it.
          msPartitions.add(cachedMsPart);
          // Remove the partitions that do not need to be reloaded from the set.
          modifiedPartitionNames.remove(cachedPartName);
        }
      }
      // This log line reports how many partitions are being refreshed incrementally.
      LOG.info(String.format("Incrementally refreshing %d/%d partitions.",
          modifiedPartitionNames.size(), totalPartitions));

      // No need to make the metastore call if no partitions are to be updated.
      // Whatever is left in the set are the partitions that must be reloaded.
      if (modifiedPartitionNames.size() > 0) {
        // Now reload the remaining partitions.
        msPartitions.addAll(MetaStoreUtil.fetchPartitionsByName(client,
            Lists.newArrayList(modifiedPartitionNames), db_.getName(), name_));
      }
    }

    Map<String, List<FileDescriptor>> oldFileDescMap = null;
    if (cachedEntry != null && cachedEntry instanceof HdfsTable) {
      HdfsTable cachedHdfsTable = (HdfsTable) cachedEntry;
      oldFileDescMap = cachedHdfsTable.fileDescMap_;
      hostIndex_.populate(cachedHdfsTable.hostIndex_.getList());
    }
    /*
     * Load the partitions. This loads the partition metadata, the list of files
     * under each partition, and the block metadata of those files.
     * You can see log lines like the following in catalog.INFO:
     * HdfsTable.java:323] load block md for xxx_table_name file 000094_0
     */
    loadPartitions(msPartitions, msTbl, oldFileDescMap);

    // load table stats
    numRows_ = getRowCount(msTbl.getParameters());  // load the table-level statistics
    LOG.debug("table #rows=" + Long.toString(numRows_));

    // For unpartitioned tables set the numRows in its partitions
    // to the table's numRows.
    // (The original author notes that this comment does not quite match the code below.)
    if (numClusteringCols_ == 0 && !partitions_.isEmpty()) {
      // Unpartitioned tables have a 'dummy' partition and a default partition.
      // Temp tables used in CTAS statements have one partition.
      Preconditions.checkState(partitions_.size() == 2 || partitions_.size() == 1);
      // Set the row count on each partition.
      for (HdfsPartition p: partitions_) {
        p.setNumRows(numRows_);
      }
    }
  } catch (TableLoadingException e) {
    throw e;
  } catch (Exception e) {
    // This is the exception that becomes truly annoying, because you see it all the time.
    throw new TableLoadingException("Failed to load metadata for table: " + name_, e);
  }
}
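
  The incremental-refresh bookkeeping in the cached-table branch above boils down to simple set arithmetic: start from the full set of partition names reported by the metastore, remove every name whose cached copy is clean and reusable, and reload only what remains. A standalone sketch of just that bookkeeping (illustrative names and data, not Impala code):

import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class IncrementalRefreshSketch {
  public static void main(String[] args) {
    // All partition names currently known to the Hive Metastore
    // (in Impala this comes from client.listPartitionNames()).
    Set<String> namesToReload = new LinkedHashSet<>(Arrays.asList(
        "dt=2016-10-01", "dt=2016-10-02", "dt=2016-10-03"));

    // Partition names whose cached copies are clean and can be reused.
    List<String> cleanCachedPartitions = Arrays.asList("dt=2016-10-01", "dt=2016-10-02");

    // Reuse every clean cached partition and remove it from the reload set;
    // whatever remains must be fetched from the metastore again.
    Set<String> reused = new HashSet<>();
    for (String cached : cleanCachedPartitions) {
      if (namesToReload.remove(cached)) reused.add(cached);
    }

    System.out.println("Reused from cache: " + reused);
    System.out.println("Reloaded from metastore: " + namesToReload);
    // Prints: Reused from cache: [dt=2016-10-01, dt=2016-10-02]
    //         Reloaded from metastore: [dt=2016-10-03]
  }
}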

3. Conclusion
  From the source-code analysis above, it is clear that the Impala Catalog does a great deal of work and caches a great deal of information; however capable it is, sooner or later this much work will cause trouble. The cached information includes:

Table metadata
Table partition metadata
File and block metadata for tables and their partitions
Statistics for tables and partitions
  Because so much information has to be loaded and cached, the initial load after Impala starts takes a very long time when there are too many objects, and you get the situation where a query reports that a table does not exist but works fine a moment later. According to the documentation, setting load_catalog_in_background=false makes Impala refuse client connections until the catalog has finished loading, which avoids this problem at the cost of a longer startup time (in our tests on CDH 5.5, however, it did not work as well as hoped).
  When metadata synchronization for a table fails, you get the second problem raised at the beginning: queries return no data, or partitions are missing. This can be caused by the Catalog running out of memory, in which case an OutOfMemoryError shows up in catalog.ERROR, or by other issues. Besides giving the Catalog service more memory, the problem should be prevented at the source by applying the following practices when designing and running your applications:

Reduce the number of tables
Reduce the number of partitions per table; in a big-data system a single partition can hold hundreds of millions of rows and still keep queries efficient
Reduce the number of files in each table or partition, for example by periodically merging the small files in a table or partition
Delete historical data promptly

Original article: https://www.cnblogs.com/to-here/p/15507896.html