Understanding protobuf's Descriptor and Meta structures through Message-to-JSON conversion

1. Visualizing a Message

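Serializing a message into a binary structure certainly improves transmission efficiency. As with binary file formats in general, though, the price of saving space is a loss of readability. Linus said much the same about the binary files in systemd: "I dislike the binary logs, for example". A message serialized to binary is equally opaque, so a tool is needed to convert it into a text format such as JSON for reading. This is where the meta, Descriptor, and schema structures in protobuf's generated code come in. In my earlier analysis I overlooked their role in producing a readable view and only noted that they matter little when modifying field values.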

2. Deserializing a FileDescriptorProto in-memory object from a string

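To print a message as JSON, one key piece of information is the string name of each field; it is the most basic ingredient of the JSON output. So the first question is: where do these string names come from?
In the C++ code generated for a proto file there are some rather long string literals, which stand out more than anything else in the generated source. The names of the fields in each Message are visible inside them, which suggests that the names come from there. The protobuf source confirms this guess: the string is itself a protobuf Message serialized to binary. Since it is a serialized Message, there must be a proto file that defines it, and that file is descriptor.proto in the protobuf source tree. One detail to note in that file: a field's name is defined as a string, and the other members of interest are number, which holds the numeric value, and the field's type.
That is, for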
message mainmsg
{
int32 x = 10;
}
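the Descriptor holds name as the string "x", number as the value 10, and type as the enum value TYPE_INT32.
This is already very close to the string names that JSON output requires.
Put more directly, the string in the generated file is a serialized Message instance of type FileDescriptorProto. More precisely, the code generated from descriptor.proto can parse this string into an instance, and that instance fully describes the definition in the proto file. protobuf itself uses exactly this interface to obtain the in-memory object for a user's proto file from these strings.
The function below is where protobuf turns the FileDescriptorProto string embedded in generated source code into an in-memory object; it deserializes a FileDescriptorProto directly from the bytes. Everything in the original proto file definition can be queried from the FileDescriptorProto object.
protobuf-master/src/google/protobuf/descriptor_database.cc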
bool EncodedDescriptorDatabase::Add(
    const void* encoded_file_descriptor, int size) {
  FileDescriptorProto file;
  if (file.ParseFromArray(encoded_file_descriptor, size)) {
    return index_.AddFile(file, std::make_pair(encoded_file_descriptor, size));
  } else {
    GOOGLE_LOG(ERROR) << "Invalid file descriptor data passed to "
                         "EncodedDescriptorDatabase::Add().";
    return false;
  }
}
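The FileDescriptorProto data for each file is registered with protobuf through C++ global static variables before main runs. Through this object one can find every message defined in the original file, the file options, and each field of each message, which gives us the rawest level of information.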

3. From FileDescriptorProto to FileDescriptor

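FileDescriptorProto is the in-memory object deserialized directly from the string in the generated code. It holds the raw information, but it has limitations. The one that matters here is that it is only the original definition from the proto file and lacks what we care about most: memory layout, meaning the offset of each member (Field) inside a Message object. protobuf adds internal members to the generated Message classes, some outside its control, such as the virtual table pointer, and some under its control, such as the has-bits flags that record whether optional fields are present. For this reason, each raw XXXProto class has a counterpart without the suffix that is built from it: FileDescriptor for FileDescriptorProto, Descriptor for DescriptorProto, and FieldDescriptor for FieldDescriptorProto. These classes without the Proto suffix may look auto-generated, but they are not; hand-written code builds them from the corresponding Proto objects, and most of that Build code lives in protobuf-master/src/google/protobuf/descriptor.cc. For example: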
// These methods all have the same signature for the sake of the BUILD_ARRAY
// macro, below.
void BuildMessage(const DescriptorProto& proto,
                  const Descriptor* parent,
                  Descriptor* result);
void BuildField(const FieldDescriptorProto& proto,
                const Descriptor* parent,
                FieldDescriptor* result);
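Compared with the corresponding XXXProto classes, one obvious member of these Descriptor classes that interests us is the index() accessor. It gives each user-defined entity a number that is unique within its parent, and the numbers are consecutive. More precisely, the value is an array subscript, which also makes iterating in a loop convenient. This step matters a great deal, because it establishes the binding from a string name to a number.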

4. From FieldDescriptor to Field

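For JSON conversion there is an even more compact representation without the Descriptor suffix; the counterpart of FieldDescriptor is Field, which is likewise defined in a proto file. The whole representation is leaner, and its basic members are still the type, the number, and the name. With these three, the TLV (tag-length-value) content of a serialized protobuf message can already be parsed. The conversion is done in protobuf-master/src/google/protobuf/util/type_resolver_util.cc: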
void ConvertFieldDescriptor(const FieldDescriptor* descriptor, Field* field) {
  field->set_kind(static_cast<Field::Kind>(descriptor->type()));
  switch (descriptor->label()) {
    case FieldDescriptor::LABEL_OPTIONAL:
      field->set_cardinality(Field::CARDINALITY_OPTIONAL);
      break;
    case FieldDescriptor::LABEL_REPEATED:
      field->set_cardinality(Field::CARDINALITY_REPEATED);
      break;
    case FieldDescriptor::LABEL_REQUIRED:
      field->set_cardinality(Field::CARDINALITY_REQUIRED);
      break;
  }
  field->set_number(descriptor->number());
  field->set_name(descriptor->name());
  field->set_json_name(descriptor->json_name());
  ……
}

5. JSON output

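Once the Tag has been parsed out of the TLV data, the corresponding Field can be found. Its kind gives the basic type and its name gives the string name, which is everything needed to render the value.
src/google/protobuf/util/internal/protostream_objectsource.cc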
Status ProtoStreamObjectSource::RenderNonMessageField(
    const google::protobuf::Field* field, StringPiece field_name,
    ObjectWriter* ow) const {
  // Temporary buffers of different types.
  uint32 buffer32;
  uint64 buffer64;
  std::string strbuffer;
  switch (field->kind()) {
    case google::protobuf::Field_Kind_TYPE_BOOL: {
      stream_->ReadVarint64(&buffer64);
      ow->RenderBool(field_name, buffer64 != 0);
      break;
    }

6. From index to offset

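Once every field has a unique index, the index can be used as a subscript to look up memory offsets. In the generated .cc file the table is named offsets. At run time this information is kept in a ReflectionSchema object, whose key member is offsets_: given a Field's index, it yields the offset of that Field inside an object. Adding that offset to a pointer to a Message object gives the Field's location in memory.
protobuf-master/src/google/protobuf/generated_message_reflection.cc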
// Helper function to transform migration schema into reflection schema.
ReflectionSchema MigrationToReflectionSchema(
    const Message* const* default_instance, const uint32* offsets,
    MigrationSchema migration_schema) {
  ReflectionSchema result;
  result.default_instance_ = *default_instance;
  // First 6 offsets are offsets to the special fields. The following offsets
  // are the proto fields.
  result.offsets_ = offsets + migration_schema.offsets_index + 5;
  result.has_bit_indices_ = offsets + migration_schema.has_bit_indices_index;
  result.has_bits_offset_ = offsets[migration_schema.offsets_index + 0];
  result.metadata_offset_ = offsets[migration_schema.offsets_index + 1];
  result.extensions_offset_ = offsets[migration_schema.offsets_index + 2];
  result.oneof_case_offset_ = offsets[migration_schema.offsets_index + 3];
  result.object_size_ = migration_schema.object_size;
  result.weak_field_map_offset_ = offsets[migration_schema.offsets_index + 4];
  return result;
}

7. Where the offsets come from

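The offsets argument passed to MigrationToReflectionSchema is stored as a constant in the generated .cc file, just like the serialized form of the user's proto file mentioned earlier.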

8. An example

tsecer@harry :cat msgdef.proto
syntax = "proto3";

message subsubmsg
{
int32 x = 1;
};

message submsg
{
subsubmsg x = 1;
float y = 2;
};

message mainmsg
{
submsg msg = 3;
int32 x = 1;
float y = 2;
};

tsecer@harry :protoc --cpp_out=. msgdef.proto
tsecer@harry :
The serialized FileDescriptorProto for this proto file is:
const char descriptor_table_protodef_msgdef_2eproto[] =
  "\n\014msgdef.proto\"\026\n\tsubsubmsg\022\t\n\001x\030\001 \001(\005\"*"
  "\n\006submsg\022\025\n\001x\030\001 \001(\0132\n.subsubmsg\022\t\n\001y\030\002 \001"
  "(\002\"5\n\007mainmsg\022\024\n\003msg\030\003 \001(\0132\007.submsg\022\t\n\001x"
  "\030\001 \001(\005\022\t\n\001y\030\002 \001(\002b\006proto3"
  ;
This content can be deserialized directly into a FileDescriptorProto object.
const ::PROTOBUF_NAMESPACE_ID::uint32 TableStruct_msgdef_2eproto::offsets[] PROTOBUF_SECTION_VARIABLE(protodesc_cold) = {
  ~0u,  // no _has_bits_
  PROTOBUF_FIELD_OFFSET(::subsubmsg, _internal_metadata_),
  ~0u,  // no _extensions_
  ~0u,  // no _oneof_case_
  ~0u,  // no _weak_field_map_
  PROTOBUF_FIELD_OFFSET(::subsubmsg, x_),
  ~0u,  // no _has_bits_
  PROTOBUF_FIELD_OFFSET(::submsg, _internal_metadata_),
  ~0u,  // no _extensions_
  ~0u,  // no _oneof_case_
  ~0u,  // no _weak_field_map_
  PROTOBUF_FIELD_OFFSET(::submsg, x_),
  PROTOBUF_FIELD_OFFSET(::submsg, y_),
  ~0u,  // no _has_bits_
  PROTOBUF_FIELD_OFFSET(::mainmsg, _internal_metadata_),
  ~0u,  // no _extensions_
  ~0u,  // no _oneof_case_
  ~0u,  // no _weak_field_map_
  PROTOBUF_FIELD_OFFSET(::mainmsg, msg_),
  PROTOBUF_FIELD_OFFSET(::mainmsg, x_),
  PROTOBUF_FIELD_OFFSET(::mainmsg, y_),
};
static const ::PROTOBUF_NAMESPACE_ID::internal::MigrationSchema schemas[] PROTOBUF_SECTION_VARIABLE(protodesc_cold) = {
  { 0, -1, sizeof(::subsubmsg)},
  { 6, -1, sizeof(::submsg)},
  { 13, -1, sizeof(::mainmsg)},
};
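The three MigrationSchema elements above give, for each of the three messages, its starting subscript in the TableStruct_msgdef_2eproto::offsets array (along with the size of its class). Within each message's slice of TableStruct_msgdef_2eproto::offsets, the first five entries are for predefined internal members, and from the sixth on come the offsets of the individual fields within the object. These can be indexed with the index() of the FieldDescriptor described earlier.
In the function below, offsets_[field->index()] ties TableStruct_msgdef_2eproto::offsets and index together.
protobuf-master/src/google/protobuf/generated_message_reflection.h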
// Offset of a non-oneof field.  Getting a field offset is slightly more
// efficient when we know statically that it is not a oneof field.
uint32 GetFieldOffsetNonOneof(const FieldDescriptor* field) const {
  GOOGLE_DCHECK(!field->containing_oneof());
  return OffsetValue(offsets_[field->index()], field->type());
}

9. Where index comes from

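This one is fairly simple: fields are numbered in declaration order. Because the FieldDescriptor objects sit in their parent's array in that order, index() is just a pointer difference.
protobuf-master/src/google/protobuf/descriptor.h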
// To save space, index() is computed by looking at the descriptor's position
// in the parent's array of children.
inline int FieldDescriptor::index() const {
  if (!is_extension_) {
    return static_cast<int>(this - containing_type()->fields_);
  } else if (extension_scope_ != NULL) {
    return static_cast<int>(this - extension_scope_->extensions_);
  } else {
    return static_cast<int>(this - file_->extensions_);
  }
}

10. Traversing and printing a simple structure

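The program below walks a Message through its Descriptor and Reflection and prints its structure. It handles only the simplest cases: INT32, FLOAT, and nested Message fields.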

tsecer@harry :cat main.cpp
#include <stdio.h>
#include "msgdef.pb.h"

using namespace google::protobuf;

// Recursively prints a message: its type name, then each field in declaration order.
void printmsg(const Message &mmsg, int indent)
{
    const Descriptor* pstDesc = mmsg.GetDescriptor();
    const Reflection* pstRefl = mmsg.GetReflection();

    printf("\n%*c %s", indent * 10, ' ', pstDesc->name().c_str());

    for (int i = 0; i < pstDesc->field_count(); i++)
    {
        printf("%*c", indent * 10, ' ');
        const FieldDescriptor *pstFieldDesc = pstDesc->field(i);
        switch (pstFieldDesc->cpp_type())
        {
            case FieldDescriptor::CPPTYPE_INT32 : printf("%s %d ", pstFieldDesc->name().c_str(), pstRefl->GetInt32(mmsg, pstFieldDesc)); break;
            case FieldDescriptor::CPPTYPE_FLOAT : printf("%s %f ", pstFieldDesc->name().c_str(), pstRefl->GetFloat(mmsg, pstFieldDesc)); break;
            case FieldDescriptor::CPPTYPE_MESSAGE : printmsg(pstRefl->GetMessage(mmsg, pstFieldDesc), indent + 1); break;
            default: break; // other field types are not handled in this simple example
        }
    }
    printf("\n");
}

int main()
{
    mainmsg mmsg;
    mmsg.set_x(11);
    mmsg.set_y(2.2);
    mmsg.mutable_msg()->set_y(3.3);
    mmsg.mutable_msg()->mutable_x()->set_x(4); // x is an int32 field
    printmsg(mmsg, 0);
    return 0;
}
tsecer@harry :make
g++ -std=c++11 msgdef.pb.cc main.cpp -lprotobuf -g
tsecer@harry :./a.out

mainmsg
submsg
subsubmsg x 4
y 3.300000
x 11 y 2.200000
tsecer@harry :