Serialisation performance

Performance on basic types

Serialisation and deserialisation rates (in MByte/s) using OutputArchive and InputArchive were measured for arrays of basic types where the size of the array is 1MB, 8MB, 64MB or 512MB.

These measurements were obtained on an i7-7700HQ laptop with 16GB RAM running Windows 10 Pro.

The serialisation and deserialisation rates were equal to within the accuracy of the measurements.

Type	Rate for 1MB of data (MByte/s)	Rate for 8MB of data (MByte/s)	Rate for 64MB of data (MByte/s)	Rate for 512MB of data (MByte/s)
bool, int8, uint8	3200	3100	2900	2700
int16, uint16	5900	5700	4800	4500
int32, uint32, float32	12000	11000	7700	6700
int64, uint64, float64	22000	20000	9800	7600
128 bit guid	36000	27000	10000	7800

Serialisation of smaller arrays achieves higher rates because the data fits in the L1/L2/L3 caches. For larger buffers the rate is limited by the sustained bandwidth to main memory, which is less than 8GB/sec on this machine.

Example: serialisation of a TIN

Consider a triangulated irregular network (TIN), implemented with the following data types:


struct Triangle
{
    int v0, v1, v2;
};

struct Point
{
    double x,y,z;
};

struct Tin
{
    std::vector<Point> vertices;
    std::vector<Triangle> triangles;
};

Serialisation with Google Protobuf involves defining corresponding message types in a .proto file:


syntax = "proto3";

package Geometry;

message M_Triangle
{
	int32 v0 = 1;
	int32 v1 = 2;
	int32 v2 = 3;
}

message M_Point 
{
	double x = 1;
	double y = 2;
	double z = 3;
}

message M_Tin 
{
	repeated M_Point vertices = 1;
	repeated M_Triangle triangles = 2;
}

The following results were obtained serialising and deserialising a TIN with 718490 vertices and 1430166 triangles (about 35MB of data) on an i7-7700HQ laptop with 16GB RAM running Windows 10 Pro:

Function	Time (milliseconds)
Serialise with Google Protobuf	359
Deserialise with Google Protobuf	348
Serialise with CEDA OutputArchive	4.7
Deserialise with CEDA InputArchive	4.8

In this example CEDA is about 75x faster than Google Protobuf at both serialisation (359/4.7 ≈ 76) and deserialisation (348/4.8 ≈ 73).

The protobuf serialisation/deserialisation rate is about 110 MB/s. The CEDA serialisation/deserialisation rate is about 6.8 GB/s.
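As a rough check, the rates above can be reproduced from the measured times. The sketch below assumes that each rate is the output size from the size table that follows divided by the measured time, and that MB and GB here mean 2^20 and 2^30 bytes; neither assumption is stated in the source. The function name rate_mib_per_s is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Throughput in MiB/s (1 MiB = 2^20 bytes) for a given byte count and elapsed time.
// Assumption: this is how the quoted rates were derived; the source does not say.
double rate_mib_per_s(std::uint64_t bytes, double milliseconds)
{
    assert(milliseconds > 0.0);
    const double mib = static_cast<double>(bytes) / (1024.0 * 1024.0);
    return mib / (milliseconds / 1000.0);
}
```

With these assumptions, rate_mib_per_s(40764955, 359) is a little over 108 MiB/s and rate_mib_per_s(34405758, 4.7) is roughly 6980 MiB/s, or about 6.8 GiB/s, which agrees with the figures quoted above.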

CEDA generated a smaller output:

Format	Size (bytes)
Google Protobuf	40764955
CEDA	34405758

The Google Protobuf format is about 18% larger than CEDA.
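The raw array data in this TIN occupies 718490 × 24 + 1430166 × 12 = 34405752 bytes, assuming a 4-byte int, an 8-byte double and no padding. That is within 6 bytes of the CEDA output size. This suggests, though the source does not state it, that the output is close to a raw copy of the arrays, which would also account for rates near memory bandwidth. The following is a minimal sketch of that idea and is not the CEDA OutputArchive/InputArchive API. The helper names (write_array, read_array, serialise_tin, deserialise_tin) and the 8-byte element count are hypothetical, and the byte layout is tied to the endianness and struct layout of the host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

struct Triangle { int v0, v1, v2; };
struct Point { double x, y, z; };
struct Tin
{
    std::vector<Point> vertices;
    std::vector<Triangle> triangles;
};

// Hypothetical helper: append an 8-byte element count followed by the raw array bytes.
template <typename T>
void write_array(std::vector<unsigned char>& out, const std::vector<T>& v)
{
    static_assert(std::is_trivially_copyable<T>::value, "bulk copy needs a trivially copyable type");
    const std::uint64_t count = v.size();
    const std::size_t nbytes = v.size() * sizeof(T);
    const std::size_t start = out.size();
    out.resize(start + sizeof(count) + nbytes);
    std::memcpy(out.data() + start, &count, sizeof(count));
    if (nbytes != 0)
        std::memcpy(out.data() + start + sizeof(count), v.data(), nbytes);  // one block copy
}

// Hypothetical helper: read an array written by write_array; returns the new read position.
template <typename T>
std::size_t read_array(const std::vector<unsigned char>& in, std::size_t pos, std::vector<T>& v)
{
    static_assert(std::is_trivially_copyable<T>::value, "bulk copy needs a trivially copyable type");
    std::uint64_t count = 0;
    assert(pos + sizeof(count) <= in.size());
    std::memcpy(&count, in.data() + pos, sizeof(count));
    pos += sizeof(count);
    const std::size_t nbytes = static_cast<std::size_t>(count) * sizeof(T);
    assert(nbytes <= in.size() - pos);
    v.resize(static_cast<std::size_t>(count));
    if (nbytes != 0)
        std::memcpy(v.data(), in.data() + pos, nbytes);
    return pos + nbytes;
}

std::vector<unsigned char> serialise_tin(const Tin& tin)
{
    std::vector<unsigned char> out;
    write_array(out, tin.vertices);
    write_array(out, tin.triangles);
    return out;
}

Tin deserialise_tin(const std::vector<unsigned char>& buf)
{
    Tin tin;
    std::size_t pos = read_array(buf, 0, tin.vertices);
    read_array(buf, pos, tin.triangles);
    return tin;
}
```

Because each array is copied as a single block, the cost per element is essentially that of memcpy. A format that tags and encodes every field separately has to do work for each of the roughly 6.4 million fields in this TIN.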

Summary of disadvantages in using protobuf messages in your C++ application

As someone said on Hacker News:

Protobuf's abysmal performance, questionable integration into the C++ type system, append-only expandability, and annoying naming conventions and default values are why I usually try and steer away from it.

and also:

But yes, once you want real high performance, protobuf will disappoint you when you benchmark and find it responsible for all the CPU use.

There are a number of reasons why one might not want to build a large C++ application on top of protobuf message types:

The syntax for accessing the members is cumbersome. For example you cannot access the x,y,z elements of a Point message type as though they are members of a struct.
When used as the basis for writing entire files, it represents an example of the inadequate persistence system antipattern. It's unfortunately very common for desktop applications to write data directly to files instead of using a DBMS.
Protobuf messages have an excessive memory footprint. See for example "C++ protobuf uses much more memory than what is required to hold the message" on Stack Overflow.
The semantics for field presence are confusing and inconsistent. There are also ad hoc limitations on the way message types can be defined. Repeated oneof is illegal. Map fields and repeated fields cannot be used in oneofs. Singular submessages implicitly mean at most one. Oneofs implicitly mean at most one of. Maps and oneofs cannot be repeated. Optional repeated is illegal.
Usually integers are encoded in a variable length format even though this has a significant impact on performance.
Protobuf tends to be inefficient in the way it allocates memory. By default it performs heap allocations for each message object, each of its subobjects, and several field types, such as strings. This can potentially be avoided using arenas, but that isn't straightforward and greatly complicates application code.
Protobuf serialisation performance is typically only around 1 Gbit/sec. This is very poor, so it's not a good technology to be locked into. Note that 25 Gbit/s ethernet is common in the cloud, and most machines now have NVMe drives which might support data transfers at 50 Gbit/s.
Protobuf messages are type intrusive and standard library types such as std::vector, std::deque, std::list, std::map, std::set, std::optional, std::variant cannot be used.
Fixed size arrays aren't available in Protobuf messages, so for example you cannot easily optimise for the representation of a 3x3 matrix.
Protobuf provides its own implementation of associative containers (i.e. maps), and if that doesn't meet your performance requirements there's no way of replacing it with an alternative.
The type intrusive protobuf messages may imply a need to copy between a protobuf message and other types in your application.
Serialisation to a file using a single high level protobuf message is only reasonable for relatively small files. For large datasets it's unsuitable. Indeed the Protocol Buffers documentation states:
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
There is no support for some of the most fundamental data structures on disk, such as B-trees.
Schema evolution isn't encapsulated in the deserialisation function, it is instead exposed to all clients of the data type.
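
To illustrate the point above about variable length integer encoding: the protobuf wire format stores int32 fields such as v0, v1 and v2 as base-128 varints, with 7 payload bits per byte and the high bit marking that more bytes follow. The functions below are a minimal sketch of that encoding for unsigned values. The names encode_varint and decode_varint are hypothetical and are not part of the protobuf API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append value as a base-128 varint: least significant 7-bit group first,
// with the high bit set on every byte except the last.
void encode_varint(std::uint64_t value, std::vector<std::uint8_t>& out)
{
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Decode one varint starting at pos and advance pos past it.
std::uint64_t decode_varint(const std::vector<std::uint8_t>& in, std::size_t& pos)
{
    std::uint64_t value = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        assert(pos < in.size());
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0)
            break;  // high bit clear: this was the last byte
    }
    return value;
}
```

For example, encode_varint(300, buf) appends the two bytes 0xAC 0x02. Each byte needs a shift, a mask and a data-dependent branch, and the encoded length varies per value, so an array of integers cannot be written or read as a single block copy in the way a fixed-width representation can.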