Protocol Buffers Guide (for Python)

While learning proto I found the reference material surprisingly hard to track down, so I pulled the official Guide in from the website and am posting it here for anyone who needs it.

The tables got squeezed and some words ended up misaligned; you can copy this into a markdown editor to read it, and reading the original may give you a deeper understanding.

In addition, the following link is essentially a translated version: https://www.cnblogs.com/dkblog/archive/2012/03/27/2419010.html

============================= Preface ========================================

1. Language Guide (proto3)

This guide describes how to use the protocol buffer language to structure your protocol buffer data, including .proto file syntax and how to generate data access classes from your .proto files. It covers the proto3 version of the protocol buffers language: for information on the older proto2 syntax, see the Proto2 Language Guide.

This is a reference guide – for a step by step example that uses many of the features described in this document, see the tutorial for your chosen language (currently proto2 only; more proto3 documentation is coming soon).

Defining A Message Type

First let's look at a very simple example. Let's say you want to define a search request message format, where each search request has a query string, the particular page of results you are interested in, and a number of results per page. Here's the .proto file you use to define the message type.

 syntax = "proto3"; 
 message SearchRequest {  
  string query = 1;  
  int32 page_number = 2;  
  int32 result_per_page = 3;
 }
  • The first line of the file specifies that you're using proto3 syntax: if you don't do this the protocol buffer compiler will assume you are using proto2. This must be the first non-empty, non-comment line of the file.

  • The SearchRequest message definition specifies three fields (name/value pairs), one for each piece of data that you want to include in this type of message. Each field has a name and a type.

Specifying Field Types

In the above example, all the fields are scalar types: two integers (page_number and result_per_page) and a string (query). However, you can also specify composite types for your fields, including enumerations and other message types.

Assigning Field Numbers

As you can see, each field in the message definition has a unique number. These field numbers are used to identify your fields in the message binary format, and should not be changed once your message type is in use. Note that field numbers in the range 1 through 15 take one byte to encode, including the field number and the field's type (you can find out more about this in Protocol Buffer Encoding). Field numbers in the range 16 through 2047 take two bytes. So you should reserve the numbers 1 through 15 for very frequently occurring message elements. Remember to leave some room for frequently occurring elements that might be added in the future.

The smallest field number you can specify is 1, and the largest is 2^29 - 1, or 536,870,911. You also cannot use the numbers 19000 through 19999 (FieldDescriptor::kFirstReservedNumber through FieldDescriptor::kLastReservedNumber), as they are reserved for the Protocol Buffers implementation - the protocol buffer compiler will complain if you use one of these reserved numbers in your .proto. Similarly, you cannot use any previously reserved field numbers.

Specifying Field Rules

Message fields can be one of the following:

  • singular: a well-formed message can have zero or one of this field (but not more than one). This is the default field rule for proto3 syntax.

  • repeated: this field can be repeated any number of times (including zero) in a well-formed message. The order of the repeated values will be preserved.

In proto3, repeated fields of scalar numeric types use packed encoding by default.

You can find out more about packed encoding in Protocol Buffer Encoding.

Adding More Message Types

Multiple message types can be defined in a single .proto file. This is useful if you are defining multiple related messages – so, for example, if you wanted to define the reply message format that corresponds to your SearchRequest message type, you could add it to the same .proto:

 message SearchRequest {
     string query = 1;  
     int32 page_number = 2;  
     int32 result_per_page = 3;
 }
 message SearchResponse {
 ...
 }
Adding Comments

To add comments to your .proto files, use C/C++-style // and /* ... */ syntax.

 /* SearchRequest represents a search query, with pagination options to
  * indicate which results to include in the response. */
 message SearchRequest {  
     string query = 1;  
     int32 page_number = 2;  // Which page number do we want?  
     int32 result_per_page = 3;  // Number of results to return per page.
 }
Reserved Fields

If you update a message type by entirely removing a field, or commenting it out, future users can reuse the field number when making their own updates to the type. This can cause severe issues if they later load old versions of the same .proto, including data corruption, privacy bugs, and so on. One way to make sure this doesn't happen is to specify that the field numbers (and/or names, which can also cause issues for JSON serialization) of your deleted fields are reserved. The protocol buffer compiler will complain if any future users try to use these field identifiers.

 message Foo {  
  reserved 2, 15, 9 to 11;  
  reserved "foo", "bar";
 }

Note that you can't mix field names and field numbers in the same reserved statement.

What's Generated From Your .proto?

When you run the protocol buffer compiler on a .proto, the compiler generates the code in your chosen language you'll need to work with the message types you've described in the file, including getting and setting field values, serializing your messages to an output stream, and parsing your messages from an input stream.

  • For C++, the compiler generates a .h and .cc file from each .proto, with a class for each message type described in your file.

  • For Java, the compiler generates a .java file with a class for each message type, as well as special Builder classes for creating message class instances.

  • Python is a little different – the Python compiler generates a module with a static descriptor of each message type in your .proto, which is then used with a metaclass to create the necessary Python data access class at runtime.

  • For Go, the compiler generates a .pb.go file with a type for each message type in your file.

  • For Ruby, the compiler generates a .rb file with a Ruby module containing your message types.

  • For Objective-C, the compiler generates a pbobjc.h and pbobjc.m file from each .proto, with a class for each message type described in your file.

  • For C#, the compiler generates a .cs file from each .proto, with a class for each message type described in your file.

  • For Dart, the compiler generates a .pb.dart file with a class for each message type in your file.

You can find out more about using the APIs for each language by following the tutorial for your chosen language (proto3 versions coming soon). For even more API details, see the relevant API reference (proto3 versions also coming soon).
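
For example, here is roughly what using the generated Python module looks like. This is a minimal sketch; it assumes the SearchRequest definition above was saved as search.proto (the file name is an assumption) and compiled with protoc --python_out=., which produces search_pb2.py:

 import search_pb2  # hypothetical module generated by protoc

 request = search_pb2.SearchRequest(query="protocol buffers", page_number=1)
 request.result_per_page = 10        # fields are plain attributes
 data = request.SerializeToString()  # serialize to the binary wire format

 parsed = search_pb2.SearchRequest()
 parsed.ParseFromString(data)        # parse back from bytes
 print(parsed.query)                 # -> "protocol buffers"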

Scalar Value Types

A scalar message field can have one of the following types – the table shows the type specified in the .proto file, and the corresponding type in the automatically generated class:

| .proto Type | Notes | C++ Type | Java Type | Python Type[2] | Go Type | Ruby Type | C# Type | PHP Type | Dart Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| double | | double | double | float | float64 | Float | double | float | double |
| float | | float | float | float | float32 | Float | float | float | double |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long[3] | int64 | Bignum | long | integer/string[5] | Int64 |
| uint32 | Uses variable-length encoding. | uint32 | int[1] | int/long[3] | uint32 | Fixnum or Bignum (as required) | uint | integer | int |
| uint64 | Uses variable-length encoding. | uint64 | long[1] | int/long[3] | uint64 | Bignum | ulong | integer/string[5] | Int64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long[3] | int64 | Bignum | long | integer/string[5] | Int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int[1] | int/long[3] | uint32 | Fixnum or Bignum (as required) | uint | integer | int |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long[1] | int/long[3] | uint64 | Bignum | ulong | integer/string[5] | Int64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int |
| sfixed64 | Always eight bytes. | int64 | long | int/long[3] | int64 | Bignum | long | integer/string[5] | Int64 |
| bool | | bool | boolean | bool | bool | TrueClass/FalseClass | bool | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text, and cannot be longer than 2^32. | string | String | str/unicode[4] | string | String (UTF-8) | string | string | String |
| bytes | May contain any arbitrary sequence of bytes no longer than 2^32. | string | ByteString | str | []byte | String (ASCII-8BIT) | ByteString | string | List<int> |

You can find out more about how these types are encoded when you serialize your message in Protocol Buffer Encoding.

[1] In Java, unsigned 32-bit and 64-bit integers are represented using their signed counterparts, with the top bit simply being stored in the sign bit.

[2] In all cases, setting values to a field will perform type checking to make sure it is valid.

[3] 64-bit or unsigned 32-bit integers are always represented as long when decoded, but can be an int if an int is given when setting the field. In all cases, the value must fit in the type represented when set. See [2].

[4] Python strings are represented as unicode on decode but can be str if an ASCII string is given (this is subject to change).

[5] Integer is used on 64-bit machines and string is used on 32-bit machines.

Default Values

When a message is parsed, if the encoded message does not contain a particular singular element, the corresponding field in the parsed object is set to the default value for that field. These defaults are type-specific:

  • For strings, the default value is the empty string.

  • For bytes, the default value is empty bytes.

  • For bools, the default value is false.

  • For numeric types, the default value is zero.

  • For enums, the default value is the first defined enum value, which must be 0.

  • For message fields, the field is not set. Its exact value is language-dependent. See the generated code guide for details.

The default value for repeated fields is empty (generally an empty list in the appropriate language).

Note that for scalar message fields, once a message is parsed there's no way of telling whether a field was explicitly set to the default value (for example whether a boolean was set to false) or just not set at all: you should bear this in mind when defining your message types. For example, don't have a boolean that switches on some behaviour when set to false if you don't want that behaviour to also happen by default. Also note that if a scalar message field is set to its default, the value will not be serialized on the wire.

See the generated code guide for your chosen language for more details about how defaults work in generated code.
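
As a quick Python illustration of these rules (a sketch; search_pb2 is the hypothetical module generated from the SearchRequest example above):

 import search_pb2  # hypothetical generated module

 request = search_pb2.SearchRequest()
 print(request.query)        # "" - empty string default
 print(request.page_number)  # 0  - numeric default

 # An unset scalar field and a field explicitly set to its default are
 # indistinguishable, and neither is written to the wire:
 request.page_number = 0
 print(len(request.SerializeToString()))  # 0 bytes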

Enumerations

When you're defining a message type, you might want one of its fields to only have one of a pre-defined list of values. For example, let's say you want to add a corpus field for each SearchRequest, where the corpus can be UNIVERSAL, WEB, IMAGES, LOCAL, NEWS, PRODUCTS or VIDEO. You can do this very simply by adding an enum to your message definition with a constant for each possible value.

In the following example we've added an enum called Corpus with all the possible values, and a field of type Corpus:

 message SearchRequest {  
    string query = 1;  
    int32 page_number = 2;  
    int32 result_per_page = 3;  
    enum Corpus {    
        UNIVERSAL = 0;    
        WEB = 1;    
        IMAGES = 2;    
        LOCAL = 3;    
        NEWS = 4;    
        PRODUCTS = 5;    
        VIDEO = 6;  
    }  
    Corpus corpus = 4;
 }

As you can see, the Corpus enum's first constant maps to zero: every enum definition must contain a constant that maps to zero as its first element. This is because:

  • There must be a zero value, so that we can use 0 as a numeric default value.

  • The zero value needs to be the first element, for compatibility with the proto2 semantics where the first enum value is always the default.

You can define aliases by assigning the same value to different enum constants. To do this you need to set the allow_alias option to true, otherwise the protocol compiler will generate an error message when aliases are found.

 enum EnumAllowingAlias {  
    option allow_alias = true;  
    UNKNOWN = 0;  
    STARTED = 1;  
    RUNNING = 1;
 }
 enum EnumNotAllowingAlias {  
    UNKNOWN = 0;  
    STARTED = 1;
    // RUNNING = 1;  // Uncommenting this line will cause a compile error inside Google and a warning message outside.
 }

Enumerator constants must be in the range of a 32-bit integer. Since enum values use varint encoding on the wire, negative values are inefficient and thus not recommended. You can define enums within a message definition, as in the above example, or outside – these enums can be reused in any message definition in your .proto file. You can also use an enum type declared in one message as the type of a field in a different message, using the syntax *MessageType*.*EnumType*.

When you run the protocol buffer compiler on a .proto that uses an enum, the generated code will have a corresponding enum for Java or C++, or a special EnumDescriptor class for Python that's used to create a set of symbolic constants with integer values in the runtime-generated class.

During deserialization, unrecognized enum values will be preserved in the message, though how this is represented when the message is deserialized is language-dependent. In languages that support open enum types with values outside the range of specified symbols, such as C++ and Go, the unknown enum value is simply stored as its underlying integer representation. In languages with closed enum types such as Java, a case in the enum is used to represent an unrecognized value, and the underlying integer can be accessed with special accessors. In either case, if the message is serialized the unrecognized value will still be serialized with the message.

For more information about how to work with message enums in your applications, see the generated code guide for your chosen language.
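
In Python, for example, the enum constants end up as plain integers on the generated class, and the Corpus wrapper offers name/value lookups. A sketch, again assuming the hypothetical search_pb2 module generated from the example above:

 import search_pb2  # hypothetical generated module

 request = search_pb2.SearchRequest()
 request.corpus = search_pb2.SearchRequest.WEB         # enum constants are ints
 print(request.corpus)                                 # 1

 print(search_pb2.SearchRequest.Corpus.Name(1))        # "WEB"
 print(search_pb2.SearchRequest.Corpus.Value("NEWS"))  # 4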

Reserved Values

If you update an enum type by entirely removing an enum entry, or commenting it out, future users can reuse the numeric value when making their own updates to the type. This can cause severe issues if they later load old versions of the same .proto, including data corruption, privacy bugs, and so on. One way to make sure this doesn't happen is to specify that the numeric values (and/or names, which can also cause issues for JSON serialization) of your deleted entries are reserved. The protocol buffer compiler will complain if any future users try to use these identifiers. You can specify that your reserved numeric value range goes up to the maximum possible value using the max keyword.

 enum Foo {  
    reserved 2, 15, 9 to 11, 40 to max;  
    reserved "FOO", "BAR";
 }

Note that you can't mix field names and numeric values in the same reserved statement.

Using Other Message Types

You can use other message types as field types. For example, let's say you wanted to include Result messages in each SearchResponse message – to do this, you can define a Result message type in the same .proto and then specify a field of type Result in SearchResponse:

 message SearchResponse {  
  repeated Result results = 1;
 }
 message Result {  
    string url = 1;  
    string title = 2;  
    repeated string snippets = 3;
 }
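
From Python, repeated message fields are populated with add() rather than by assignment. A sketch, assuming both messages were compiled into the hypothetical search_pb2 module:

 import search_pb2  # hypothetical generated module

 response = search_pb2.SearchResponse()
 result = response.results.add()        # append a new Result and return it
 result.url = "http://example.com"
 result.title = "Example"
 result.snippets.append("an example snippet")
 print(len(response.results))           # 1
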
Importing Definitions

In the above example, the Result message type is defined in the same file as SearchResponse – what if the message type you want to use as a field type is already defined in another .proto file?

You can use definitions from other .proto files by importing them. To import another .proto's definitions, you add an import statement to the top of your file:

import "myproject/other_protos.proto";

By default you can only use definitions from directly imported .proto files. However, sometimes you may need to move a .proto file to a new location. Instead of moving the .proto file directly and updating all the call sites in a single change, now you can put a dummy .proto file in the old location to forward all the imports to the new location using the import public notion. import public dependencies can be transitively relied upon by anyone importing the proto containing the import public statement. For example:

 // new.proto
 // All definitions are moved here

 // old.proto
 // This is the proto that all clients are importing.
 import public "new.proto";
 import "other.proto";

 // client.proto
 import "old.proto";  // You use definitions from old.proto and new.proto, but not other.proto

The protocol compiler searches for imported files in a set of directories specified on the protocol compiler command line using the -I/--proto_path flag. If no flag was given, it looks in the directory in which the compiler was invoked. In general you should set the --proto_path flag to the root of your project and use fully qualified names for all imports.

Using proto2 Message Types

It's possible to import proto2 message types and use them in your proto3 messages, and vice versa. However, proto2 enums cannot be used directly in proto3 syntax (it's okay if an imported proto2 message uses them).

Nested Types

You can define and use message types inside other message types, as in the following example – here the Result message is defined inside the SearchResponse message:

message SearchResponse {  
    message Result {    
        string url = 1;    
        string title = 2;    
        repeated string snippets = 3;  
    }  
    repeated Result results = 1; 
}

If you want to reuse this message type outside its parent message type, you refer to it as *Parent*.*Type*:

message SomeOtherMessage {  
	SearchResponse.Result result = 1; 
}

You can nest messages as deeply as you like:

message Outer {  // Level 0  
	message MiddleAA {  // Level 1    
		message Inner {   // Level 2      
			int64 ival = 1;      
			bool  booly = 2;    
		}  
	}  
	message MiddleBB {  // Level 1    
		message Inner {   // Level 2      
			int32 ival = 1;      
			bool  booly = 2;    
         }  
	} 
} 

Updating A Message Type

If an existing message type no longer meets all your needs – for example, you'd like the message format to have an extra field – but you'd still like to use code created with the old format, don't worry! It's very simple to update message types without breaking any of your existing code. Just remember the following rules:

  • Don't change the field numbers for any existing fields.

  • If you add new fields, any messages serialized by code using your "old" message format can still be parsed by your new generated code. You should keep in mind the default values for these elements so that new code can properly interact with messages generated by old code. Similarly, messages created by your new code can be parsed by your old code: old binaries simply ignore the new field when parsing. See the Unknown Fields section for details.

  • Fields can be removed, as long as the field number is not used again in your updated message type. You may want to rename the field instead, perhaps adding the prefix "OBSOLETE_", or make the field number reserved, so that future users of your .proto can't accidentally reuse the number.

  • int32, uint32, int64, uint64, and bool are all compatible – this means you can change a field from one of these types to another without breaking forwards- or backwards-compatibility. If a number is parsed from the wire which doesn't fit in the corresponding type, you will get the same effect as if you had cast the number to that type in C++ (e.g. if a 64-bit number is read as an int32, it will be truncated to 32 bits).

  • sint32 and sint64 are compatible with each other but are not compatible with the other integer types.

  • string and bytes are compatible as long as the bytes are valid UTF-8.

  • Embedded messages are compatible with bytes if the bytes contain an encoded version of the message.

  • fixed32 is compatible with sfixed32, and fixed64 with sfixed64.

  • enum is compatible with int32, uint32, int64, and uint64 in terms of wire format (note that values will be truncated if they don't fit). However be aware that client code may treat them differently when the message is deserialized: for example, unrecognized proto3 enum types will be preserved in the message, but how this is represented when the message is deserialized is language-dependent. Int fields always just preserve their value.

  • Changing a single value into a member of a new oneof is safe and binary compatible. Moving multiple fields into a new oneof may be safe if you are sure that no code sets more than one at a time. Moving any fields into an existing oneof is not safe.

Unknown Fields

Unknown fields are well-formed protocol buffer serialized data representing fields that the parser does not recognize. For example, when an old binary parses data sent by a new binary with new fields, those new fields become unknown fields in the old binary.

Originally, proto3 messages always discarded unknown fields during parsing, but in version 3.5 we reintroduced the preservation of unknown fields to match the proto2 behavior. In versions 3.5 and later, unknown fields are retained during parsing and included in the serialized output.

Any

The Any message type lets you use messages as embedded types without having their .proto definition. An Any contains an arbitrary serialized message as bytes, along with a URL that acts as a globally unique identifier for and resolves to that message's type. To use the Any type, you need to import google/protobuf/any.proto.

import "google/protobuf/any.proto"; 
message ErrorStatus {  
	string message = 1;  
	repeated google.protobuf.Any details = 2; 
}

The default type URL for a given message type is type.googleapis.com/*packagename*.*messagename*.

Different language implementations will support runtime library helpers to pack and unpack Any values in a typesafe manner – for example, in Java, the Any type will have special pack() and unpack() accessors, while in C++ there are PackFrom() and UnpackTo() methods:

 // Storing an arbitrary message type in Any.
 NetworkErrorDetails details = ...;
 ErrorStatus status;
 status.add_details()->PackFrom(details);

 // Reading an arbitrary message from Any.
 ErrorStatus status = ...;
 for (const Any& detail : status.details()) {
   if (detail.Is<NetworkErrorDetails>()) {
     NetworkErrorDetails network_error;
     detail.UnpackTo(&network_error);
     ... processing network_error ...
   }
 }

Currently the runtime libraries for working with Any types are under development.

If you are already familiar with proto2 syntax, the Any type replaces extensions.
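
In Python the helpers are Pack(), Unpack(), and Is() on the Any message itself. The sketch below assumes ErrorStatus and a NetworkErrorDetails message were generated into hypothetical error_status_pb2 and network_pb2 modules:

 import error_status_pb2  # hypothetical generated module containing ErrorStatus
 import network_pb2       # hypothetical module containing NetworkErrorDetails

 # Storing an arbitrary message type in Any.
 status = error_status_pb2.ErrorStatus(message="something failed")
 status.details.add().Pack(network_pb2.NetworkErrorDetails())

 # Reading an arbitrary message from Any.
 for detail in status.details:
     if detail.Is(network_pb2.NetworkErrorDetails.DESCRIPTOR):
         network_error = network_pb2.NetworkErrorDetails()
         detail.Unpack(network_error)   # returns False if the types don't match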

Oneof

If you have a message with many fields and where at most one field will be set at the same time, you can enforce this behavior and save memory by using the oneof feature.

Oneof fields are like regular fields except all the fields in a oneof share memory, and at most one field can be set at the same time. Setting any member of the oneof automatically clears all the other members. You can check which value in a oneof is set (if any) using a special case() or WhichOneof() method, depending on your chosen language.

Using Oneof

To define a oneof in your .proto you use the oneof keyword followed by your oneof name, in this case test_oneof:

 message SampleMessage {  
  oneof test_oneof {    
  string name = 4;    
  SubMessage sub_message = 9;  
    }
 }

You then add your oneof fields to the oneof definition. You can add fields of any type, but cannot use repeated fields.

In your generated code, oneof fields have the same getters and setters as regular fields. You also get a special method for checking which value (if any) in the oneof is set. You can find out more about the oneof API for your chosen language in the relevant API reference.
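
In Python that special method is WhichOneof(), which returns the name of the field currently set in the oneof (or None). A sketch, assuming SampleMessage above was generated into a hypothetical sample_pb2 module:

 import sample_pb2  # hypothetical generated module

 message = sample_pb2.SampleMessage()
 print(message.WhichOneof("test_oneof"))  # None - nothing set yet

 message.name = "hello"
 print(message.WhichOneof("test_oneof"))  # "name"

 message.sub_message.SetInParent()        # setting sub_message clears name
 print(message.WhichOneof("test_oneof"))  # "sub_message"
 print(message.name)                      # "" - back to the default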

Oneof Features
  • Setting a oneof field will automatically clear all other members of the oneof. So if you set several oneof fields, only the last field you set will still have a value.

    SampleMessage message;
    message.set_name("name");
    CHECK(message.has_name());
    message.mutable_sub_message();  // Will clear name field.
    CHECK(!message.has_name());

  • If the parser encounters multiple members of the same oneof on the wire, only the last member seen is used in the parsed message.

  • A oneof cannot be repeated.

  • Reflection APIs work for oneof fields.

  • If you set a oneof field to the default value (such as setting an int32 oneof field to 0), the "case" of that oneof field will be set, and the value will be serialized on the wire.

  • If you're using C++, make sure your code doesn't cause memory crashes. The following sample code will crash because sub_message was already deleted by calling the set_name() method.

    SampleMessage message;
    SubMessage* sub_message = message.mutable_sub_message();
    message.set_name("name");  // Will delete sub_message
    sub_message->set_...       // Crashes here

  • Again in C++, if you Swap() two messages with oneofs, each message will end up with the other's oneof case: in the example below, msg1 will have a sub_message and msg2 will have a name.

    SampleMessage msg1;
    msg1.set_name("name");
    SampleMessage msg2;
    msg2.mutable_sub_message();
    msg1.swap(&msg2);
    CHECK(msg1.has_sub_message());
    CHECK(msg2.has_name());

Backwards-compatibility issues

Be careful when adding or removing oneof fields. If checking the value of a oneof returns None/NOT_SET, it could mean that the oneof has not been set or it has been set to a field in a different version of the oneof. There is no way to tell the difference, since there's no way to know if an unknown field on the wire is a member of the oneof.

Tag Reuse Issues
  • Move fields into or out of a oneof: You may lose some of your information (some fields will be cleared) after the message is serialized and parsed. However, you can safely move a single field into a new oneof and may be able to move multiple fields if it is known that only one is ever set.

  • Delete a oneof field and add it back: This may clear your currently set oneof field after the message is serialized and parsed.

  • Split or merge oneof: This has similar issues to moving regular fields.

Maps

If you want to create an associative map as part of your data definition, protocol buffers provides a handy shortcut syntax:

map<key_type, value_type> map_field = N;

...where the key_type can be any integral or string type (so, any scalar type except for floating point types and bytes). Note that enum is not a valid key_type. The value_type can be any type except another map.

So, for example, if you wanted to create a map of projects where each Project message is associated with a string key, you could define it like this:

map<string, Project> projects = 3;
  • Map fields cannot be repeated.

  • Wire format ordering and map iteration ordering of map values is undefined, so you cannot rely on your map items being in a particular order.

  • When generating text format for a .proto, maps are sorted by key. Numeric keys are sorted numerically.

  • When parsing from the wire or when merging, if there are duplicate map keys the last key seen is used. When parsing a map from text format, parsing may fail if there are duplicate keys.

  • If you provide a key but no value for a map field, the behavior when the field is serialized is language-dependent. In C++, Java, and Python the default value for the type is serialized, while in other languages nothing is serialized.

The generated map API is currently available for all proto3 supported languages. You can find out more about the map API for your chosen language in the relevant API reference.
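
In the Python API a map field behaves much like a dict. A sketch, assuming a hypothetical message containing map<string, int32> counts = 1; compiled into a stats_pb2 module:

 import stats_pb2  # hypothetical generated module

 stats = stats_pb2.Stats()
 stats.counts["apples"] = 3                               # scalar maps support assignment
 stats.counts["pears"] = stats.counts.get("pears", 0) + 1
 print("apples" in stats.counts)                          # True
 for key, value in stats.counts.items():                  # iteration order is not guaranteed
     print(key, value)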

Backwards compatibility

The map syntax is equivalent to the following on the wire, so protocol buffers implementations that do not support maps can still handle your data:

message MapFieldEntry {  
    key_type key = 1;  
    value_type value = 2; 
} 
repeated MapFieldEntry map_field = N; 

Any protocol buffers implementation that supports maps must both produce and accept data that can be accepted by the above definition.

Packages

You can add an optional package specifier to a .proto file to prevent name clashes between protocol message types.

package foo.bar; 
message Open { ... }

You can then use the package specifier when defining fields of your message type:

message Foo {
  ...
  foo.bar.Open open = 1;
  ...
}

The way a package specifier affects the generated code depends on your chosen language:

  • In C++ the generated classes are wrapped inside a C++ namespace. For example, Open would be in the namespace foo::bar.

  • In Java, the package is used as the Java package, unless you explicitly provide an option java_package in your .proto file.

  • In Python, the package directive is ignored, since Python modules are organized according to their location in the file system.

  • In Go, the package is used as the Go package name, unless you explicitly provide an option go_package in your .proto file.

  • In Ruby, the generated classes are wrapped inside nested Ruby namespaces, converted to the required Ruby capitalization style (first letter capitalized; if the first character is not a letter, PB_ is prepended). For example, Open would be in the namespace Foo::Bar.

  • In C# the package is used as the namespace after converting to PascalCase, unless you explicitly provide an option csharp_namespace in your .proto file. For example, Open would be in the namespace Foo.Bar.

Packages and Name Resolution

Type name resolution in the protocol buffer language works like C++: first the innermost scope is searched, then the next-innermost, and so on, with each package considered to be "inner" to its parent package. A leading '.' (for example, .foo.bar.Baz) means to start from the outermost scope instead.

The protocol buffer compiler resolves all type names by parsing the imported .proto files. The code generator for each language knows how to refer to each type in that language, even if it has different scoping rules.

Defining Services

If you want to use your message types with an RPC (Remote Procedure Call) system, you can define an RPC service interface in a .proto file and the protocol buffer compiler will generate service interface code and stubs in your chosen language. So, for example, if you want to define an RPC service with a method that takes your SearchRequest and returns a SearchResponse, you can define it in your .proto file as follows:

service SearchService {  rpc Search (SearchRequest) returns (SearchResponse); }

The most straightforward RPC system to use with protocol buffers is gRPC: a language- and platform-neutral open source RPC system developed at Google. gRPC works particularly well with protocol buffers and lets you generate the relevant RPC code directly from your .proto files using a special protocol buffer compiler plugin.

If you don't want to use gRPC, it's also possible to use protocol buffers with your own RPC implementation. You can find out more about this in the Proto2 Language Guide.

There are also a number of ongoing third-party projects to develop RPC implementations for Protocol Buffers. For a list of links to projects we know about, see the third-party add-ons wiki page.

JSON Mapping

Proto3 supports a canonical encoding in JSON, making it easier to share data between systems. The encoding is described on a type-by-type basis in the table below.

If a value is missing in the JSON-encoded data or if its value is null, it will be interpreted as the appropriate default value when parsed into a protocol buffer. If a field has the default value in the protocol buffer, it will be omitted in the JSON-encoded data by default to save space. An implementation may provide options to emit fields with default values in the JSON-encoded output.

| proto3 | JSON | JSON example | Notes |
| --- | --- | --- | --- |
| message | object | {"fooBar": v, "g": null, …} | Generates JSON objects. Message field names are mapped to lowerCamelCase and become JSON object keys. If the json_name field option is specified, the specified value will be used as the key instead. Parsers accept both the lowerCamelCase name (or the one specified by the json_name option) and the original proto field name. null is an accepted value for all field types and treated as the default value of the corresponding field type. |
| enum | string | "FOO_BAR" | The name of the enum value as specified in proto is used. Parsers accept both enum names and integer values. |
| map<K,V> | object | {"k": v, …} | All keys are converted to strings. |
| repeated V | array | [v, …] | null is accepted as the empty list []. |
| bool | true, false | true, false | |
| string | string | "Hello World!" | |
| bytes | base64 string | "YWJjMTIzIT8kKiYoKSctPUB+" | JSON value will be the data encoded as a string using standard base64 encoding with paddings. Either standard or URL-safe base64 encoding with/without paddings are accepted. |
| int32, fixed32, uint32 | number | 1, -10, 0 | JSON value will be a decimal number. Either numbers or strings are accepted. |
| int64, fixed64, uint64 | string | "1", "-10" | JSON value will be a decimal string. Either numbers or strings are accepted. |
| float, double | number | 1.1, -10.0, 0, "NaN", "Infinity" | JSON value will be a number or one of the special string values "NaN", "Infinity", and "-Infinity". Either numbers or strings are accepted. Exponent notation is also accepted. |
| Any | object | {"@type": "url", "f": v, …} | If the Any contains a value that has a special JSON mapping, it will be converted as follows: {"@type": xxx, "value": yyy}. Otherwise, the value will be converted into a JSON object, and the "@type" field will be inserted to indicate the actual data type. |
| Timestamp | string | "1972-01-01T10:00:20.021Z" | Uses RFC 3339, where generated output will always be Z-normalized and uses 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. |
| Duration | string | "1.000340012s", "1s" | Generated output always contains 0, 3, 6, or 9 fractional digits, depending on required precision, followed by the suffix "s". Accepted are any fractional digits (also none) as long as they fit into nano-seconds precision and the suffix "s" is required. |
| Struct | object | { … } | Any JSON object. See struct.proto. |
| Wrapper types | various types | 2, "2", "foo", true, "true", null, 0, … | Wrappers use the same representation in JSON as the wrapped primitive type, except that null is allowed and preserved during data conversion and transfer. |
| FieldMask | string | "f.fooBar,h" | See field_mask.proto. |
| ListValue | array | [foo, bar, …] | |
| Value | value | | Any JSON value |
| NullValue | null | | JSON null |
| Empty | object | {} | An empty JSON object |
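
In Python this mapping is exposed through google.protobuf.json_format. A sketch using the hypothetical search_pb2 module from earlier:

 from google.protobuf import json_format
 import search_pb2  # hypothetical generated module

 request = search_pb2.SearchRequest(query="hello", result_per_page=10)

 # Field names come out in lowerCamelCase by default and defaults are omitted.
 print(json_format.MessageToJson(request))
 # -> {"query": "hello", "resultPerPage": 10}  (whitespace may differ)

 # Parsing accepts both lowerCamelCase and the original proto field names.
 parsed = json_format.Parse('{"query": "hi", "page_number": 2}',
                            search_pb2.SearchRequest())
 print(parsed.page_number)  # 2
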
JSON options

A proto3 JSON implementation may provide the following options:

  • Emit fields with default values: Fields with default values are omitted by default in proto3 JSON output. An implementation may provide an option to override this behavior and output fields with their default values.

  • Ignore unknown fields: Proto3 JSON parser should reject unknown fields by default but may provide an option to ignore unknown fields in parsing.

  • Use proto field name instead of lowerCamelCase name: By default proto3 JSON printer should convert the field name to lowerCamelCase and use that as the JSON name. An implementation may provide an option to use proto field name as the JSON name instead. Proto3 JSON parsers are required to accept both the converted lowerCamelCase name and the proto field name.

  • Emit enum values as integers instead of strings: The name of an enum value is used by default in JSON output. An option may be provided to use the numeric value of the enum value instead.

Options

Individual declarations in a .proto file can be annotated with a number of options. Options do not change the overall meaning of a declaration, but may affect the way it is handled in a particular context. The complete list of available options is defined in google/protobuf/descriptor.proto.

Some options are file-level options, meaning they should be written at the top-level scope, not inside any message, enum, or service definition. Some options are message-level options, meaning they should be written inside message definitions. Some options are field-level options, meaning they should be written inside field definitions. Options can also be written on enum types, enum values, service types, and service methods; however, no useful options currently exist for any of these.

Here are a few of the most commonly used options:

  • java_package (file option): The package you want to use for your generated Java classes. If no explicit java_package option is given in the .proto file, then by default the proto package (specified using the "package" keyword in the .proto file) will be used. However, proto packages generally do not make good Java packages since proto packages are not expected to start with reverse domain names. If not generating Java code, this option has no effect. option java_package = "com.example.foo";

  • java_multiple_files (file option): Causes top-level messages, enums, and services to be defined at the package level, rather than inside an outer class named after the .proto file.

    option java_multiple_files = true;

  • java_outer_classname (file option): The class name for the outermost Java class (and hence the file name) you want to generate. If no explicit java_outer_classname is specified in the .proto file, the class name will be constructed by converting the .proto file name to camel-case (so foo_bar.proto becomes FooBar.java). If not generating Java code, this option has no effect. option java_outer_classname = "Ponycopter";

  • optimize_for (file option): Can be set to SPEED, CODE_SIZE, or LITE_RUNTIME. This affects the C++ and Java code generators (and possibly third-party generators) in the following ways:

    • SPEED (default): The protocol buffer compiler will generate code for serializing, parsing, and performing other common operations on your message types. This code is highly optimized.

    • CODE_SIZE: The protocol buffer compiler will generate minimal classes and will rely on shared, reflection-based code to implement serialization, parsing, and various other operations. The generated code will thus be much smaller than with SPEED, but operations will be slower. Classes will still implement exactly the same public API as they do in SPEED mode. This mode is most useful in apps that contain a very large number of .proto files and do not need all of them to be blindingly fast.

    • LITE_RUNTIME: The protocol buffer compiler will generate classes that depend only on the "lite" runtime library (libprotobuf-lite instead of libprotobuf). The lite runtime is much smaller than the full library (around an order of magnitude smaller) but omits certain features like descriptors and reflection. This is particularly useful for apps running on constrained platforms like mobile phones. The compiler will still generate fast implementations of all methods as it does in SPEED mode. Generated classes will only implement the MessageLite interface in each language, which provides only a subset of the methods of the full Message interface.

    option optimize_for = CODE_SIZE;

  • cc_enable_arenas (file option): Enables arena allocation for C++ generated code.

  • objc_class_prefix (file option): Sets the Objective-C class prefix which is prepended to all Objective-C generated classes and enums from this .proto. There is no default. You should use prefixes that are between 3-5 uppercase characters as recommended by Apple. Note that all 2 letter prefixes are reserved by Apple.

  • deprecated (field option): If set to true, indicates that the field is deprecated and should not be used by new code. In most languages this has no actual effect. In Java, this becomes a @Deprecated annotation. In the future, other language-specific code generators may generate deprecation annotations on the field's accessors, which will in turn cause a warning to be emitted when compiling code which attempts to use the field. If the field is not used by anyone and you want to prevent new users from using it, consider replacing the field declaration with a reserved statement. int32 old_field = 6 [deprecated=true];

Custom Options

Protocol Buffers also allows you to define and use your own options. This is an advanced feature which most people don't need. If you do think you need to create your own options, see the Proto2 Language Guide for details. Note that creating custom options uses extensions, which are permitted only for custom options in proto3.

Generating Your Classes

To generate the Java, Python, C++, Go, Ruby, Objective-C, or C# code you need to work with the message types defined in a .proto file, you need to run the protocol buffer compiler protoc on the .proto. If you haven't installed the compiler, download the package and follow the instructions in the README. For Go, you also need to install a special code generator plugin for the compiler: you can find this and installation instructions in the golang/protobuf repository on GitHub.

The Protocol Compiler is invoked as follows:

 protoc --proto_path=*IMPORT_PATH* --cpp_out=*DST_DIR* --java_out=*DST_DIR* --python_out=*DST_DIR* --go_out=*DST_DIR* --ruby_out=*DST_DIR* --objc_out=*DST_DIR* --csharp_out=*DST_DIR* *path/to/file*.proto
  • IMPORT_PATH specifies a directory in which to look for .proto files when resolving import directives. If omitted, the current directory is used. Multiple import directories can be specified by passing the --proto_path option multiple times; they will be searched in order. -I=*IMPORT_PATH* can be used as a short form of --proto_path.

  • You can provide one or more output directives. As an extra convenience, if the DST_DIR ends in .zip or .jar, the compiler will write the output to a single ZIP-format archive file with the given name. .jar outputs will also be given a manifest file as required by the Java JAR specification. Note that if the output archive already exists, it will be overwritten; the compiler is not smart enough to add files to an existing archive.

  • You must provide one or more .proto files as input. Multiple .proto files can be specified at once. Although the files are named relative to the current directory, each file must reside in one of the IMPORT_PATHs so that the compiler can determine its canonical name.

 

2. Style Guide

This document provides a style guide for .proto files. By following these conventions, you'll make your protocol buffer message definitions and their corresponding classes consistent and easy to read.

Note that protocol buffer style has evolved over time, so it is likely that you will see .proto files written in different conventions or styles. Please respect the existing style when you modify these files. Consistency is key. However, it is best to adopt the current best style when you are creating a new .proto file.

Standard file formatting

  • Keep the line length to 80 characters.

  • Use an indent of 2 spaces.

File structure

Files should be named lower_snake_case.proto

All files should be ordered in the following manner:

  1. License header (if applicable)

  2. File overview

  3. Syntax

  4. Package

  5. Imports (sorted)

  6. File options

  7. Everything else

Packages

Package name should be in lowercase, and should correspond to the directory hierarchy. e.g., if a file is in my/package/, then the package name should be my.package.

Message and field names

Use CamelCase (with an initial capital) for message names – for example, SongServerRequest. Use underscore_separated_names for field names (including oneof field and extension names) – for example, song_name.

message SongServerRequest {  required string song_name = 1; }

Using this naming convention for field names gives you accessors like the following:

C++:  
const string& song_name() { ... }  
void set_song_name(const string& x) { ... } 
Java:  
public String getSongName() { ... }  
public Builder setSongName(String v) { ... } 

If your field name contains a number, the number should appear after the letter instead of after the underscore, e.g., use song_name1 instead of song_name_1.

Repeated fields

Use pluralized names for repeated fields.

  repeated string keys = 1;  
  ...  
  repeated MyMessage accounts = 17; 

Enums

Use CamelCase (with an initial capital) for enum type names and CAPITALS_WITH_UNDERSCORES for value names:

enum Foo {  
    FOO_UNSPECIFIED = 0;  
    FOO_FIRST_VALUE = 1;  
    FOO_SECOND_VALUE = 2; 
}

Each enum value should end with a semicolon, not a comma. Prefer prefixing enum values instead of surrounding them in an enclosing message. The zero value enum should have the suffix UNSPECIFIED.

Services

If your .proto defines an RPC service, you should use CamelCase (with an initial capital) for both the service name and any RPC method names:

service FooService {  rpc GetSomething(FooRequest) returns (FooResponse); }

Things to avoid

  • Required fields (only for proto2)

  • Groups (only for proto2)

 

3. Encoding

This document describes the binary wire format for protocol buffer messages. You don't need to understand this to use protocol buffers in your applications, but it can be very useful to know how different protocol buffer formats affect the size of your encoded messages.

A Simple Message

Let's say you have the following very simple message definition:

message Test1 {  optional int32 a = 1; }

In an application, you create a Test1 message and set a to 150. You then serialize the message to an output stream. If you were able to examine the encoded message, you'd see three bytes:

08 96 01

So far, so small and numeric – but what does it mean? Read on...

Base 128 Varints

To understand your simple protocol buffer encoding, you first need to understand varints. Varints are a method of serializing integers using one or more bytes. Smaller numbers take a smaller number of bytes.

Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.

So, for example, here is the number 1 – it's a single byte, so the msb is not set:

0000 0001

And here is 300 – this is a bit more complicated:

1010 1100 0000 0010

How do you figure out that this is 300? First you drop the msb from each byte, as this is just there to tell us whether we've reached the end of the number (as you can see, it's set in the first byte as there is more than one byte in the varint):

 1010 1100 0000 0010 → 010 1100  000 0010

You reverse the two groups of 7 bits because, as you remember, varints store numbers with the least significant group first. Then you concatenate them to get your final value:

000 0010  010 1100 →  000 0010 ++ 010 1100 →  100101100 →  256 + 32 + 8 + 4 = 300
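
The same steps can be written out as a small Python sketch of a varint decoder:

 def decode_varint(data):
     """Decode one varint from the front of `data`; return (value, bytes used)."""
     result = 0
     shift = 0
     for i, byte in enumerate(data):
         result |= (byte & 0x7F) << shift  # low 7 bits, least significant group first
         if not byte & 0x80:               # msb clear -> last byte of this varint
             return result, i + 1
         shift += 7
     raise ValueError("truncated varint")

 print(decode_varint(bytes([0xAC, 0x02])))  # (300, 2)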

Message Structure

As you know, a protocol buffer message is a series of key-value pairs. The binary version of a message just uses the field's number as the key – the name and declared type for each field can only be determined on the decoding end by referencing the message type's definition (i.e. the .proto file).

When a message is encoded, the keys and values are concatenated into a byte stream. When the message is being decoded, the parser needs to be able to skip fields that it doesn't recognize. This way, new fields can be added to a message without breaking old programs that do not know about them. To this end, the "key" for each pair in a wire-format message is actually two values – the field number from your .proto file, plus a wire type that provides just enough information to find the length of the following value. In most language implementations this key is referred to as a tag.

The available wire types are as follows:

| Type | Meaning | Used For |
| --- | --- | --- |
| 0 | Varint | int32, int64, uint32, uint64, sint32, sint64, bool, enum |
| 1 | 64-bit | fixed64, sfixed64, double |
| 2 | Length-delimited | string, bytes, embedded messages, packed repeated fields |
| 3 | Start group | groups (deprecated) |
| 4 | End group | groups (deprecated) |
| 5 | 32-bit | fixed32, sfixed32, float |

Each key in the streamed message is a varint with the value (field_number << 3) | wire_type – in other words, the last three bits of the number store the wire type.

Now let's look at our simple example again. You now know that the first number in the stream is always a varint key, and here it's 08, or (dropping the msb):

000 1000

You take the last three bits to get the wire type (0) and then right-shift by three to get the field number (1). So you now know that the field number is 1 and the following value is a varint. Using your varint-decoding knowledge from the previous section, you can see that the next two bytes store the value 150.

 96 01 = 1001 0110  0000 0001
        → 000 0001  ++  001 0110   (drop the msb and reverse the groups of 7 bits)
        → 10010110
        → 128 + 16 + 4 + 2 = 150
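
In Python terms, the key and value decoding above looks like this (a small sketch, decoding the bytes by hand):

 data = bytes([0x08, 0x96, 0x01])

 key = data[0]              # single-byte varint: 0x08
 wire_type = key & 0x07     # low three bits  -> 0 (varint)
 field_number = key >> 3    # remaining bits  -> 1

 # The value 150 follows as the two-byte varint 96 01:
 value = (data[1] & 0x7F) | ((data[2] & 0x7F) << 7)
 print(field_number, wire_type, value)  # 1 0 150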

More Value Types

Signed Integers

As you saw in the previous section, all the protocol buffer types associated with wire type 0 are encoded as varints. However, there is an important difference between the signed int types (sint32 and sint64) and the "standard" int types (int32 and int64) when it comes to encoding negative numbers. If you use int32 or int64 as the type for a negative number, the resulting varint is always ten bytes long – it is, effectively, treated like a very large unsigned integer. If you use one of the signed types, the resulting varint uses ZigZag encoding, which is much more efficient.

ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value (for instance, -1) have a small varint encoded value too. It does this in a way that "zig-zags" back and forth through the positive and negative integers, so that -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on, as you can see in the following table:

| Signed Original | Encoded As |
| --- | --- |
| 0 | 0 |
| -1 | 1 |
| 1 | 2 |
| -2 | 3 |
| 2147483647 | 4294967294 |
| -2147483648 | 4294967295 |

In other words, each value n is encoded using

 (n << 1) ^ (n >> 31)

for sint32s, or

 (n << 1) ^ (n >> 63)

for the 64-bit version.

Note that the second shift – the (n >> 31) part – is an arithmetic shift. So, in other words, the result of the shift is either a number that is all zero bits (if n is positive) or all one bits (if n is negative).

When the sint32 or sint64 is parsed, its value is decoded back to the original, signed version.
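
Because Python integers are arbitrary precision and >> on a negative int is already an arithmetic shift, ZigZag encoding and decoding can be sketched directly from the formulas above:

 def zigzag_encode32(n):
     return (n << 1) ^ (n >> 31)   # use (n >> 63) for the 64-bit version

 def zigzag_decode(z):
     return (z >> 1) ^ -(z & 1)

 print(zigzag_encode32(-1))           # 1
 print(zigzag_encode32(2147483647))   # 4294967294
 print(zigzag_decode(4294967295))     # -2147483648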

Non-varint Numbers

Non-varint numeric types are simple – double and fixed64 have wire type 1, which tells the parser to expect a fixed 64-bit lump of data; similarly float and fixed32 have wire type 5, which tells it to expect 32 bits. In both cases the values are stored in little-endian byte order.

Strings

A wire type of 2 (length-delimited) means that the value is a varint encoded length followed by the specified number of bytes of data.

 message Test2 {  optional string b = 2; }

Setting the value of b to "testing" gives you:

 12 07 74 65 73 74 69 6e 67

The last seven bytes (74 65 73 74 69 6e 67) are the UTF-8 encoding of "testing". The key here is 0x12 → field number = 2, type = 2. The length varint in the value is 7, and lo and behold, we find seven bytes following it – our string.

Embedded Messages

Here's a message definition with an embedded message of our example type, Test1:

message Test3 {  optional Test1 c = 3; }

And here's the encoded version, again with the Test1's a field set to 150:

 1a 03 08 96 01

As you can see, the last three bytes are exactly the same as our first example (08 96 01), and they're preceded by the number 3 – embedded messages are treated in exactly the same way as strings (wire type = 2).

Optional And Repeated Elements

If a proto2 message definition has repeated elements (without the [packed=true] option), the encoded message has zero or more key-value pairs with the same field number. These repeated values do not have to appear consecutively; they may be interleaved with other fields. The order of the elements with respect to each other is preserved when parsing, though the ordering with respect to other fields is lost. In proto3, repeated fields use packed encoding, which you can read about below.

For any non-repeated fields in proto3, or optional fields in proto2, the encoded message may or may not have a key-value pair with that field number.

Normally, an encoded message would never have more than one instance of a non-repeated field. However, parsers are expected to handle the case in which they do. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees. For embedded message fields, the parser merges multiple instances of the same field, as if with the Message::MergeFrom method – that is, all singular scalar fields in the latter instance replace those in the former, singular embedded messages are merged, and repeated fields are concatenated. The effect of these rules is that parsing the concatenation of two encoded messages produces exactly the same result as if you had parsed the two messages separately and merged the resulting objects. That is, this:

MyMessage message; message.ParseFromString(str1 + str2);

is equivalent to this:

MyMessage message, message2; message.ParseFromString(str1); message2.ParseFromString(str2); message.MergeFrom(message2);

This property is occasionally useful, as it allows you to merge two messages even if you do not know their types.

Packed Repeated Fields

Version 2.1.0 introduced packed repeated fields, which in proto2 are declared like repeated fields but with the special [packed=true] option. In proto3, repeated fields of scalar numeric types are packed by default. These function like repeated fields, but are encoded differently. A packed repeated field containing zero elements does not appear in the encoded message. Otherwise, all of the elements of the field are packed into a single key-value pair with wire type 2 (length-delimited). Each element is encoded the same way it would be normally, except without a key preceding it.

For example, imagine you have the message type:

message Test4 {  repeated int32 d = 4 [packed=true]; }

Now let's say you construct a Test4, providing the values 3, 270, and 86942 for the repeated field d. Then, the encoded form would be:

 22        // key (field number 4, wire type 2)
 06        // payload size (6 bytes)
 03        // first element (varint 3)
 8E 02     // second element (varint 270)
 9E A7 05  // third element (varint 86942)

Only repeated fields of primitive numeric types (types which use the varint, 32-bit, or 64-bit wire types) can be declared "packed".

Note that although there's usually no reason to encode more than one key-value pair for a packed repeated field, parsers must be prepared to accept multiple key-value pairs. In this case, the payloads should be concatenated. Each pair must contain a whole number of elements.

Protocol buffer parsers must be able to parse repeated fields that were compiled as packed as if they were not packed, and vice versa. This permits adding [packed=true] to existing fields in a forward- and backward-compatible way.
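
You can verify the encoding shown above from Python. A sketch, assuming Test4 was compiled into a hypothetical test4_pb2 module:

 import test4_pb2  # hypothetical generated module

 msg = test4_pb2.Test4()
 msg.d.extend([3, 270, 86942])
 print(msg.SerializeToString().hex())  # 2206038e029ea705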

Field Order

Field numbers may be used in any order in a .proto file. The order chosen has no effect on how the messages are serialized.

When a message is serialized, there is no guaranteed order for how its known or unknown fields should be written. Serialization order is an implementation detail and the details of any particular implementation may change in the future. Therefore, protocol buffer parsers must be able to parse fields in any order.

Implications
  • Do not assume the byte output of a serialized message is stable. This is especially true for messages with transitive bytes fields representing other serialized protocol buffer messages.

  • By default, repeated invocations of serialization methods on the same protocol buffer message instance may not return the same byte output; i.e. the default serialization is not deterministic.

    • Deterministic serialization only guarantees the same byte output for a particular binary. The byte output may change across different versions of the binary.

  • The following checks may fail for a protocol buffer message instance foo (a Python sketch after this list shows the safer alternative):

    • foo.SerializeAsString() == foo.SerializeAsString()

    • Hash(foo.SerializeAsString()) == Hash(foo.SerializeAsString())

    • CRC(foo.SerializeAsString()) == CRC(foo.SerializeAsString())

    • FingerPrint(foo.SerializeAsString()) == FingerPrint(foo.SerializeAsString())

  • Here are a few example scenarios where logically equivalent protocol buffer messages foo and bar may serialize to different byte outputs:

    • bar is serialized by an old server that treats some fields as unknown.

    • bar is serialized by a server that is implemented in a different programming language and serializes fields in different order.

    • bar has a field that serializes in non-deterministic manner.

    • bar has a field that stores a serialized byte output of a protocol buffer message which is serialized differently.

    • bar is serialized by a new server that serializes fields in different order due to an implementation change.

    • Both foo and bar are concatenations of the same individual messages, but concatenated in a different order.
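The practical consequence is to compare parsed messages, not their serialized bytes. A minimal sketch, again using a hypothetical generated module my_message_pb2 with a MyMessage type:

 import my_message_pb2  # hypothetical generated module, as in the earlier sketch

 a = my_message_pb2.MyMessage(name="x", values=[1, 2, 3])
 b = my_message_pb2.MyMessage()
 b.ParseFromString(a.SerializeToString())

 # Safe: message equality compares field values.
 assert a == b

 # Not safe: byte-for-byte equality of the serialized forms is not guaranteed
 # across implementations, versions, or even repeated serializations.
 # a.SerializeToString() == b.SerializeToString()   # may fail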

 

4. Techniques

 

This page describes some commonly-used design patterns for dealing with Protocol Buffers. You can also send design and usage questions to the Protocol Buffers discussion group.

Streaming Multiple Messages

If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)
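One common way to do this in Python is a fixed-size length prefix before each message. The sketch below uses a 4-byte little-endian prefix; that framing is an arbitrary choice made here for illustration, not part of the protocol buffer format:

 import struct

 def write_message(stream, msg):
     """Write one message, preceded by its 4-byte little-endian size."""
     data = msg.SerializeToString()
     stream.write(struct.pack("<I", len(data)))
     stream.write(data)

 def read_message(stream, msg_class):
     """Read one size-prefixed message; return None at end of stream."""
     header = stream.read(4)
     if len(header) < 4:
         return None
     (size,) = struct.unpack("<I", header)
     msg = msg_class()
     msg.ParseFromString(stream.read(size))
     return msg

Any framing works as long as the writer and reader agree on it; a varint size prefix (the convention used by the Java writeDelimitedTo/parseDelimitedFrom methods) is another common choice.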

Large Data Sets

Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

That said, Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are really just a collection of small pieces, where each small piece may be a structured piece of data. Even though Protocol Buffers cannot handle the entire set at once, using Protocol Buffers to encode each piece greatly simplifies your problem: now all you need is to handle a set of byte strings rather than a set of structures.

Protocol Buffers do not include any built-in support for large data sets because different situations call for different solutions. Sometimes a simple list of records will do while other times you may want something more like a database. Each solution should be developed as a separate library, so that only those who need it need to pay the costs.

Self-describing Messages

Protocol Buffers do not contain descriptions of their own types. Thus, given only a raw message without the corresponding .proto file defining its type, it is difficult to extract any useful data.

However, note that the contents of a .proto file can itself be represented using protocol buffers. The file src/google/protobuf/descriptor.proto in the source code package defines the message types involved. protoc can output a FileDescriptorSet – which represents a set of .proto files – using the --descriptor_set_out option. With this, you could define a self-describing protocol message like so:

 syntax = "proto3"; import "google/protobuf/any.proto"; import "google/protobuf/descriptor.proto"; message SelfDescribingMessage {  // Set of FileDescriptorProtos which describe the type and its dependencies.  google.protobuf.FileDescriptorSet descriptor_set = 1;   // The message and its type, encoded as an Any message.  google.protobuf.Any message = 2; } 

By using classes like DynamicMessage (available in C++ and Java), you can then write tools which can manipulate SelfDescribingMessages.

All that said, the reason that this functionality is not included in the Protocol Buffer library is because we have never had a use for it inside Google.

This technique requires support for dynamic messages using descriptors. Please check that your platforms support this feature before using self-describing messages.
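As a rough illustration of the Python side, the sketch below loads a FileDescriptorSet produced with protoc --descriptor_set_out and builds a message class at runtime. The file name types.desc and the type name tutorial.Person are hypothetical, and depending on your protobuf release the factory call may be MessageFactory.GetPrototype (older releases) or message_factory.GetMessageClass (newer ones):

 from google.protobuf import descriptor_pb2, descriptor_pool, message_factory

 # Hypothetical input, e.g. produced with:
 #   protoc --include_imports --descriptor_set_out=types.desc addressbook.proto
 fds = descriptor_pb2.FileDescriptorSet()
 with open("types.desc", "rb") as f:
     fds.ParseFromString(f.read())

 # Register every file descriptor in a fresh pool.
 pool = descriptor_pool.DescriptorPool()
 for file_proto in fds.file:
     pool.Add(file_proto)

 # Build a message class dynamically and use it to parse raw bytes.
 descriptor = pool.FindMessageTypeByName("tutorial.Person")  # hypothetical type name
 PersonClass = message_factory.MessageFactory(pool).GetPrototype(descriptor)

 person = PersonClass()
 # person.ParseFromString(raw_bytes)  # raw_bytes: a serialized tutorial.Person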

 

5. Protocol Buffer Basics: Python

This tutorial provides a basic Python programmer's introduction to working with protocol buffers. By walking through creating a simple example application, it shows you how to

  • Define message formats in a .proto file.

  • Use the protocol buffer compiler.

  • Use the Python protocol buffer API to write and read messages.

This isn't a comprehensive guide to using protocol buffers in Python. For more detailed reference information, see the Protocol Buffer Language Guide, the Python API Reference, the Python Generated Code Guide, and the Encoding Reference.

Why Use Protocol Buffers?

The example we're going to use is a very simple "address book" application that can read and write people's contact details to and from a file. Each person in the address book has a name, an ID, an email address, and a contact phone number.

How do you serialize and retrieve structured data like this? There are a few ways to solve this problem:

  • Use Python pickling. This is the default approach since it's built into the language, but it doesn't deal well with schema evolution, and also doesn't work very well if you need to share data with applications written in C++ or Java.

  • You can invent an ad-hoc way to encode the data items into a single string – such as encoding 4 ints as "12:3:-23:67". This is a simple and flexible approach, although it does require writing one-off encoding and parsing code, and the parsing imposes a small run-time cost. This works best for encoding very simple data.

  • Serialize the data to XML. This approach can be very attractive since XML is (sort of) human readable and there are binding libraries for lots of languages. This can be a good choice if you want to share data with other applications/projects. However, XML is notoriously space intensive, and encoding/decoding it can impose a huge performance penalty on applications. Also, navigating an XML DOM tree is considerably more complicated than navigating simple fields in a class normally would be.

Protocol buffers are the flexible, efficient, automated solution to solve exactly this problem. With protocol buffers, you write a .proto description of the data structure you wish to store. From that, the protocol buffer compiler creates a class that implements automatic encoding and parsing of the protocol buffer data with an efficient binary format. The generated class provides getters and setters for the fields that make up a protocol buffer and takes care of the details of reading and writing the protocol buffer as a unit. Importantly, the protocol buffer format supports the idea of extending the format over time in such a way that the code can still read data encoded with the old format.

Where to Find the Example Code

The example code is included in the source code package, under the "examples" directory. Download it here.

Defining Your Protocol Format

To create your address book application, you'll need to start with a .proto file. The definitions in a .proto file are simple: you add a message for each data structure you want to serialize, then specify a name and a type for each field in the message. Here is the .proto file that defines your messages, addressbook.proto.

 syntax = "proto2"; 
 package tutorial;
 message Person {  
  required string name = 1;  
  required int32 id = 2;  
  optional string email = 3;
 
  enum PhoneType {    
  MOBILE = 0;    
  HOME = 1;    
  WORK = 2;  
  }  
  message PhoneNumber {    
  required string number = 1;    
  optional PhoneType type = 2 [default = HOME];  
  }  
  repeated PhoneNumber phones = 4;
 }
 message AddressBook {  
  repeated Person people = 1;
 }

As you can see, the syntax is similar to C++ or Java. Let's go through each part of the file and see what it does.

The .proto file starts with a package declaration, which helps to prevent naming conflicts between different projects. In Python, packages are normally determined by directory structure, so the package you define in your .proto file will have no effect on the generated code. However, you should still declare one to avoid name collisions in the Protocol Buffers name space as well as in non-Python languages.

Next, you have your message definitions. A message is just an aggregate containing a set of typed fields. Many standard simple data types are available as field types, including bool, int32, float, double, and string. You can also add further structure to your messages by using other message types as field types – in the above example the Person message contains PhoneNumber messages, while the AddressBook message contains Person messages. You can even define message types nested inside other messages – as you can see, the PhoneNumber type is defined inside Person. You can also define enum types if you want one of your fields to have one of a predefined list of values – here you want to specify that a phone number can be one of MOBILE, HOME, or WORK.

The " = 1", " = 2" markers on each element identify the unique "tag" that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional elements. Each element in a repeated field requires re-encoding the tag number, so repeated fields are particularly good candidates for this optimization.

Each field must be annotated with one of the following modifiers:

  • required: a value for the field must be provided, otherwise the message will be considered "uninitialized". Serializing an uninitialized message will raise an exception. Parsing an uninitialized message will fail. Other than this, a required field behaves exactly like an optional field.

  • optional: the field may or may not be set. If an optional field value isn't set, a default value is used. For simple types, you can specify your own default value, as we've done for the phone number type in the example. Otherwise, a system default is used: zero for numeric types, the empty string for strings, false for bools. For embedded messages, the default value is always the "default instance" or "prototype" of the message, which has none of its fields set. Calling the accessor to get the value of an optional (or required) field which has not been explicitly set always returns that field's default value.

  • repeated: the field may be repeated any number of times (including zero). The order of the repeated values will be preserved in the protocol buffer. Think of repeated fields as dynamically sized arrays.

==Required Is Forever You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead. Some engineers at Google have come to the conclusion that using required does more harm than good; they prefer to use only optional and repeated. However, this view is not universal. ==

You'll find a complete guide to writing .proto files – including all the possible field types – in the Protocol Buffer Language Guide. Don't go looking for facilities similar to class inheritance, though – protocol buffers don't do that.

Compiling Your Protocol Buffers

Now that you have a .proto, the next thing you need to do is generate the classes you'll need to read and write AddressBook (and hence Person and PhoneNumber) messages. To do this, you need to run the protocol buffer compiler protoc on your .proto:

  1. If you haven't installed the compiler, download the package and follow the instructions in the README.

  2. Now run the compiler, specifying the source directory (where your application's source code lives – the current directory is used if you don't provide a value), the destination directory (where you want the generated code to go; often the same as $SRC_DIR), and the path to your .proto. In this case, you run:

    protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/addressbook.proto

    Because you want Python classes, you use the --python_out option – similar options are provided for other supported languages.

This generates addressbook_pb2.py in your specified destination directory.

The Protocol Buffer API

Unlike when you generate Java and C++ protocol buffer code, the Python protocol buffer compiler doesn't generate your data access code for you directly. Instead (as you'll see if you look at addressbook_pb2.py) it generates special descriptors for all your messages, enums, and fields, and some mysteriously empty classes, one for each message type:

 class Person(message.Message):
   __metaclass__ = reflection.GeneratedProtocolMessageType

   class PhoneNumber(message.Message):
     __metaclass__ = reflection.GeneratedProtocolMessageType
     DESCRIPTOR = _PERSON_PHONENUMBER
   DESCRIPTOR = _PERSON

 class AddressBook(message.Message):
   __metaclass__ = reflection.GeneratedProtocolMessageType
   DESCRIPTOR = _ADDRESSBOOK

The important line in each class is __metaclass__ = reflection.GeneratedProtocolMessageType. While the details of how Python metaclasses work are beyond the scope of this tutorial, you can think of them as a kind of template for creating classes. At load time, the GeneratedProtocolMessageType metaclass uses the specified descriptors to create all the Python methods you need to work with each message type and adds them to the relevant classes. You can then use the fully-populated classes in your code.

The end effect of all this is that you can use the Person class as if it defined each field of the Message base class as a regular field. For example, you could write:

 import addressbook_pb2 
 person = addressbook_pb2.Person()
 person.id = 1234
 person.name = "John Doe"
 person.email = "jdoe@example.com"
 phone = person.phones.add()
 phone.number = "555-4321"
 phone.type = addressbook_pb2.Person.HOME

Note that these assignments are not just adding arbitrary new fields to a generic Python object. If you were to try to assign a field that isn't defined in the .proto file, an AttributeError would be raised. If you assign a field to a value of the wrong type, a TypeError will be raised. Also, reading the value of a field before it has been set returns the default value.

 person.no_such_field = 1  # raises AttributeError 
 person.id = "1234"        # raises TypeError

For more information on exactly what members the protocol compiler generates for any particular field definition, see the Python generated code reference.

Enums

Enums are expanded by the metaclass into a set of symbolic constants with integer values. So, for example, the constant addressbook_pb2.Person.PhoneType.WORK has the value 2.
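The generated enum wrapper can also convert between names and numbers. A small sketch using the addressbook_pb2 module from this tutorial:

 import addressbook_pb2

 # Enum values are plain integers...
 assert addressbook_pb2.Person.PhoneType.WORK == 2

 # ...and the wrapper maps between names and values.
 print(addressbook_pb2.Person.PhoneType.Name(2))        # WORK
 print(addressbook_pb2.Person.PhoneType.Value("WORK"))  # 2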

Standard Message Methods

Each message class also contains a number of other methods that let you check or manipulate the entire message, including:

  • IsInitialized(): checks if all the required fields have been set.

  • __str__(): returns a human-readable representation of the message, particularly useful for debugging. (Usually invoked as str(message) or print message.)

  • CopyFrom(other_msg): overwrites the message with the given message's values.

  • Clear(): clears all the elements back to the empty state.

These methods implement the Message interface. For more information, see the complete API documentation for Message.
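Here is a short sketch of these methods in action, using the Person type generated above (Python 3 print syntax):

 import addressbook_pb2

 person = addressbook_pb2.Person()
 person.name = "John Doe"

 # The required 'id' field has not been set yet.
 print(person.IsInitialized())   # False
 person.id = 1234
 print(person.IsInitialized())   # True

 backup = addressbook_pb2.Person()
 backup.CopyFrom(person)         # backup is now an identical copy

 person.Clear()                  # back to the empty state
 print(person.name)              # "" (the default value)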

Parsing and Serialization

Finally, each protocol buffer class has methods for writing and reading messages of your chosen type using the protocol buffer binary format. These include:

  • SerializeToString(): serializes the message and returns it as a string. Note that the bytes are binary, not text; we only use the str type as a convenient container.

  • ParseFromString(data): parses a message from the given string.

These are just a couple of the options provided for parsing and serialization. Again, see the Message API reference for a complete list.
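For example, a full round trip through the binary format looks like this (a small sketch using the generated Person class):

 import addressbook_pb2

 person = addressbook_pb2.Person(name="John Doe", id=1234)

 data = person.SerializeToString()   # binary wire-format bytes
 restored = addressbook_pb2.Person()
 restored.ParseFromString(data)

 assert restored == person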

==Protocol Buffers and O-O Design Protocol buffer classes are basically dumb data holders (like structs in C); they don't make good first class citizens in an object model. If you want to add richer behaviour to a generated class, the best way to do this is to wrap the generated protocol buffer class in an application-specific class. Wrapping protocol buffers is also a good idea if you don't have control over the design of the .proto file (if, say, you're reusing one from another project). In that case, you can use the wrapper class to craft an interface better suited to the unique environment of your application: hiding some data and methods, exposing convenience functions, etc. You should never add behaviour to the generated classes by inheriting from them. This will break internal mechanisms and is not good object-oriented practice anyway. ==

Writing A Message

Now let's try using your protocol buffer classes. The first thing you want your address book application to be able to do is write personal details to your address book file. To do this, you need to create and populate instances of your protocol buffer classes and then write them to an output stream.

Here is a program which reads an AddressBook from a file, adds one new Person to it based on user input, and writes the new AddressBook back out to the file again. The parts which directly call or reference code generated by the protocol compiler are highlighted.

 #! /usr/bin/python

 import addressbook_pb2
 import sys

 # This function fills in a Person message based on user input.
 def PromptForAddress(person):
   person.id = int(raw_input("Enter person ID number: "))
   person.name = raw_input("Enter name: ")

   email = raw_input("Enter email address (blank for none): ")
   if email != "":
     person.email = email

   while True:
     number = raw_input("Enter a phone number (or leave blank to finish): ")
     if number == "":
       break

     phone_number = person.phones.add()
     phone_number.number = number

     type = raw_input("Is this a mobile, home, or work phone? ")
     if type == "mobile":
       phone_number.type = addressbook_pb2.Person.PhoneType.MOBILE
     elif type == "home":
       phone_number.type = addressbook_pb2.Person.PhoneType.HOME
     elif type == "work":
       phone_number.type = addressbook_pb2.Person.PhoneType.WORK
     else:
       print "Unknown phone type; leaving as default value."

 # Main procedure: Reads the entire address book from a file,
 #   adds one person based on user input, then writes it back out to the same
 #   file.
 if len(sys.argv) != 2:
   print "Usage:", sys.argv[0], "ADDRESS_BOOK_FILE"
   sys.exit(-1)

 address_book = addressbook_pb2.AddressBook()

 # Read the existing address book.
 try:
   f = open(sys.argv[1], "rb")
   address_book.ParseFromString(f.read())
   f.close()
 except IOError:
   print sys.argv[1] + ": Could not open file. Creating a new one."

 # Add an address.
 PromptForAddress(address_book.people.add())

 # Write the new address book back to disk.
 f = open(sys.argv[1], "wb")
 f.write(address_book.SerializeToString())
 f.close()

Reading A Message

Of course, an address book wouldn't be much use if you couldn't get any information out of it! This example reads the file created by the above example and prints all the information in it.

 #! /usr/bin/python

 import addressbook_pb2
 import sys

 # Iterates through all people in the AddressBook and prints info about them.
 def ListPeople(address_book):
   for person in address_book.people:
     print "Person ID:", person.id
     print "  Name:", person.name
     if person.HasField('email'):
       print "  E-mail address:", person.email

     for phone_number in person.phones:
       if phone_number.type == addressbook_pb2.Person.PhoneType.MOBILE:
         print "  Mobile phone #: ",
       elif phone_number.type == addressbook_pb2.Person.PhoneType.HOME:
         print "  Home phone #: ",
       elif phone_number.type == addressbook_pb2.Person.PhoneType.WORK:
         print "  Work phone #: ",
       print phone_number.number

 # Main procedure: Reads the entire address book from a file and prints all
 #   the information inside.
 if len(sys.argv) != 2:
   print "Usage:", sys.argv[0], "ADDRESS_BOOK_FILE"
   sys.exit(-1)

 address_book = addressbook_pb2.AddressBook()

 # Read the existing address book.
 f = open(sys.argv[1], "rb")
 address_book.ParseFromString(f.read())
 f.close()

 ListPeople(address_book)

Extending a Protocol Buffer

Sooner or later after you release the code that uses your protocol buffer, you will undoubtedly want to "improve" the protocol buffer's definition. If you want your new buffers to be backwards-compatible, and your old buffers to be forward-compatible – and you almost certainly do want this – then there are some rules you need to follow. In the new version of the protocol buffer:

  • you must not change the tag numbers of any existing fields.

  • you must not add or delete any required fields.

  • you may delete optional or repeated fields.

  • you may add new optional or repeated fields but you must use fresh tag numbers (i.e. tag numbers that were never used in this protocol buffer, not even by deleted fields).

(There are some exceptions to these rules, but they are rarely used.)

If you follow these rules, old code will happily read new messages and simply ignore any new fields. To the old code, optional fields that were deleted will simply have their default value, and deleted repeated fields will be empty. New code will also transparently read old messages. However, keep in mind that new optional fields will not be present in old messages, so you will need to either check explicitly whether they're set with has_, or provide a reasonable default value in your .proto file with [default = value] after the tag number. If the default value is not specified for an optional element, a type-specific default value is used instead: for strings, the default value is the empty string. For booleans, the default value is false. For numeric types, the default value is zero. Note also that if you added a new repeated field, your new code will not be able to tell whether it was left empty (by new code) or never set at all (by old code) since there is no has_ flag for it.
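For instance, suppose a later revision of addressbook.proto added a hypothetical optional string nickname = 5 field to Person. New code reading data written by old code could fall back explicitly, as in this sketch:

 import addressbook_pb2

 person = addressbook_pb2.Person()
 person.ParseFromString(old_bytes)   # old_bytes: data written by the old code

 # 'nickname' is the hypothetical new optional field; old writers never set it.
 if person.HasField("nickname"):
     display_name = person.nickname
 else:
     display_name = person.name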

Advanced Usage

Protocol buffers have uses that go beyond simple accessors and serialization. Be sure to explore the Python API reference to see what else you can do with them.

One key feature provided by protocol message classes is reflection. You can iterate over the fields of a message and manipulate their values without writing your code against any specific message type. One very useful way to use reflection is for converting protocol messages to and from other encodings, such as XML or JSON. A more advanced use of reflection might be to find differences between two messages of the same type, or to develop a sort of "regular expressions for protocol messages" in which you can write expressions that match certain message contents. If you use your imagination, it's possible to apply Protocol Buffers to a much wider range of problems than you might initially expect!

Reflection is provided as part of the Message interface.
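For instance, Message.ListFields() lets generic code walk whichever fields are set, without knowing the message type in advance; a minimal sketch:

 import addressbook_pb2

 person = addressbook_pb2.Person(name="John Doe", id=1234, email="jdoe@example.com")

 # ListFields() yields (FieldDescriptor, value) pairs for every field that is set.
 for field_descriptor, value in person.ListFields():
     print(field_descriptor.name, "=", value)
 # name = John Doe
 # id = 1234
 # email = jdoe@example.com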

 

 

 

Original article: https://www.cnblogs.com/yao-zhang/p/12073213.html