Monday 5 May 2014

Simple Binary Encoding

Financial systems communicate by sending and receiving vast numbers of messages in many different formats. When people use terms like "vast" I normally think, "really... how many?" So let's quantify "vast" for the finance industry. Market data feeds from financial exchanges can typically emit tens or hundreds of thousands of messages per second, and aggregate feeds like OPRA can peak at over 10 million messages per second, with volumes growing year-on-year. This presentation gives a good overview.

In this crazy world we still see significant use of ASCII encoded presentations, such as FIX tag value, and some slightly more sane binary encoded presentations like FAST. Some markets even commit the sin of sending out market data as XML! Well, I cannot complain too much, as they have at times provided me a good income writing ultra-fast XML parsers.

Last year the CME, who are a member of the FIX community, commissioned Todd Montgomery, of 29West LBM fame, and myself to build the reference implementation of the new FIX Simple Binary Encoding (SBE) standard. SBE is a codec aimed at addressing the efficiency issues in low-latency trading, with a specific focus on market data. The CME, working within the FIX community, have done a great job of coming up with an encoding presentation that can be this efficient. Maybe it is a suitable atonement for the sins of past FIX tag value implementations. Todd and I worked on the Java and C++ implementations, and later we were helped on the .NET side by the amazing Olivier Deheurles at Adaptive. Working on a cool technical problem with such a team is a dream job.

SBE Overview

SBE is an OSI layer 6 presentation for encoding/decoding messages in binary format to support low-latency applications. Of the many applications I profile with performance issues, message encoding/decoding is often the most significant cost. I've seen many applications that spend significantly more CPU time parsing and transforming XML and JSON than executing business logic. SBE is designed to make this part of a system the most efficient it can be. SBE follows a number of design principles to achieve this goal. Adhering to these design principles sometimes means that features available in other codecs are not offered. For example, many codecs allow strings to be encoded at any field position in a message; SBE only allows variable-length fields, such as strings, to be grouped at the end of a message.

The SBE reference implementation consists of a compiler that takes a message schema as input and then generates language-specific stubs. The stubs are used to directly encode and decode messages from buffers. The SBE tool can also generate a binary representation of the schema that can be used for the on-the-fly decoding of messages in a dynamic environment, such as for a log viewer or network sniffer.

The design principles drive the implementation of a codec that ensures messages are streamed through memory without backtracking, copying, or unnecessary allocation. Memory access patterns should not be underestimated in the design of a high-performance application. Low-latency systems in any language especially need to consider all allocation, to avoid the resulting issues in reclamation. This applies to both managed-runtime and native languages. SBE is totally allocation-free in all three language implementations.

The end result of applying these design principles is a codec that has ~16-25 times greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.

The sweet spot for SBE is as a codec for structured data that is mostly fixed-size fields such as numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many may find some of the restrictions a usability issue. Those users would be better off with another codec more suited to string encoding.

Message Structure

A message must be capable of being read or written sequentially to preserve the streaming access design principle, i.e. with no need to backtrack. Some codecs insert location pointers for variable length fields, such as string types, that have to be indirected for access. This indirection comes at the cost of extra instructions plus the loss of support from the hardware prefetchers. SBE's design allows for pure sequential access and copy-free native access semantics.

Figure 1
SBE messages have a common header that identifies the type and version of the message body to follow. The header is followed by the root fields of the message, which are all fixed length with static offsets. The root fields are very similar to a struct in C. If the message is more complex, then one or more repeating groups, similar to the root block, can follow. Repeating groups can nest other repeating group structures. Finally, variable-length strings and blobs come at the end of the message. Fields may also be optional. The XML schema describing the SBE presentation can be found here.

SbeTool and the Compiler

To use SBE it is first necessary to define a schema for your messages. SBE provides a language independent type system supporting integers, floating point numbers, characters, arrays, constants, enums, bitsets, composites, grouped structures that repeat, and variable length strings and blobs.

A message schema can be input into the SbeTool and compiled to produce stubs in a range of languages, or to generate binary metadata suitable for decoding messages on-the-fly.

    java [-Doption=value] -jar sbe.jar <message-declarations-file.xml>

SbeTool and the compiler are written in Java. The tool can currently output stubs in Java, C++, and C#.
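For example, the target language and output directory can be selected with system properties when invoking the tool. The property names below are assumptions based on the tool's documentation, so check the README for the version you are using; this is an illustrative invocation rather than a definitive one.

    java -Dsbe.target.language=Cpp -Dsbe.output.dir=generated -jar sbe.jar <message-declarations-file.xml>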

Programming with Stubs

A full example of messages defined in a schema with supporting code can be found here. The generated stubs follow a flyweight pattern with instances reused to avoid allocation. The stubs wrap a buffer at an offset and then read it sequentially and natively.
    // Write the message header first
    MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion)
                  .blockLength(CAR.sbeBlockLength())
                  .templateId(CAR.sbeTemplateId())
                  .schemaId(CAR.sbeSchemaId())
                  .version(CAR.sbeSchemaVersion());

    // Then write the body of the message
    car.wrapForEncode(directBuffer, bufferOffset)
       .serialNumber(1234)
       .modelYear(2013)
       .available(BooleanType.TRUE)
       .code(Model.A)
       .putVehicleCode(VEHICLE_CODE, srcOffset);
Messages can be written via the generated stubs in a fluent manner. Each field appears as a generated pair of methods to encode and decode.
    // Read the header and lookup the appropriate template to decode
    MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion);

    final int templateId = MESSAGE_HEADER.templateId();
    final int actingBlockLength = MESSAGE_HEADER.blockLength();
    final int schemaId = MESSAGE_HEADER.schemaId();
    final int actingVersion = MESSAGE_HEADER.version();

    // Once the template is located then the fields can be decoded.
    car.wrapForDecode(directBuffer, bufferOffset, actingBlockLength, actingVersion);

    final StringBuilder sb = new StringBuilder();
    sb.append("\ncar.templateId=").append(car.sbeTemplateId());
    sb.append("\ncar.schemaId=").append(schemaId);
    sb.append("\ncar.schemaVersion=").append(car.sbeSchemaVersion());
    sb.append("\ncar.serialNumber=").append(car.serialNumber());
    sb.append("\ncar.modelYear=").append(car.modelYear());
    sb.append("\ncar.available=").append(car.available());
    sb.append("\ncar.code=").append(car.code());

The generated code in all languages gives performance similar to casting a C struct over the memory.
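The examples above only touch the fixed-length root fields. Repeating groups and the variable-length data at the end of a message are written and read in the same fluent, sequential style. The sketch below is illustrative only; it assumes the fuelFigures group (speed, mpg) and the make string from the Car example schema in the repository, and the exact generated method names may differ between versions.

    // Encode a repeating group followed by a variable-length string at the end
    // (group and field names assumed from the example Car schema)
    car.fuelFiguresCount(2)
       .next().speed(30).mpg(35.9f)
       .next().speed(55).mpg(49.0f);

    final byte[] make = "Honda".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
    car.putMake(make, 0, make.length);

    // When decoding, groups and variable-length data must be consumed in the same order
    for (final Car.FuelFigures fuelFigures : car.fuelFigures())
    {
        final int speed = fuelFigures.speed();
        final float mpg = fuelFigures.mpg();
    }

    final byte[] makeBuffer = new byte[128];
    final int makeLength = car.getMake(makeBuffer, 0, makeBuffer.length);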

On-The-Fly Decoding

The compiler produces an intermediate representation (IR) for the input XML message schema. This IR can be serialised in the SBE binary format to be used for later on-the-fly decoding of messages that have been stored. It is also useful for tools, such as a network sniffer, that will not have been compiled with the stubs. A full example of the IR being used can be found here.

Direct Buffers

SBE, via Agrona, provides an abstraction to Java, with the MutableDirectBuffer class, to work with buffers that are byte[], heap or direct ByteBuffer buffers, and off-heap memory addresses returned from Unsafe.allocateMemory(long) or JNI. In low-latency applications, messages are often encoded/decoded in memory-mapped files via MappedByteBuffer and thus can be transferred to a network channel by the kernel, avoiding user space copies.
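As a sketch of how this looks in code, assuming Agrona's UnsafeBuffer implementation of MutableDirectBuffer (package and class names depend on the Agrona version in use):

    // UnsafeBuffer implements MutableDirectBuffer and can wrap a byte[],
    // a heap or direct ByteBuffer, or a raw off-heap address
    final MutableDirectBuffer heapBuffer   = new UnsafeBuffer(new byte[1024]);
    final MutableDirectBuffer directBuffer = new UnsafeBuffer(ByteBuffer.allocateDirect(1024));

    // A memory-mapped file region can be wrapped in the same way
    try (FileChannel channel = FileChannel.open(Paths.get("messages.dat"),
        StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE))
    {
        final MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
        final MutableDirectBuffer mappedBuffer = new UnsafeBuffer(mapped);

        // The generated flyweights encode and decode through the same abstraction
        MESSAGE_HEADER.wrap(mappedBuffer, 0, messageTemplateVersion);
    }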

C++ and C# have built-in support for direct memory access and do not require such an abstraction as the Java version does. A DirectBuffer abstraction was added for C# to support endianness and encapsulate the unsafe pointer access.

Message Extension and Versioning

SBE schemas carry a version number that allows for message extension. A message can be extended by adding fields at the end of a block. Fields cannot be removed or reordered for backwards compatibility.

Extension fields must be optional, otherwise a newer template reading an older message would not work. Templates carry metadata for min, max, null, time unit, character encoding, etc.; these are accessible via static (class-level) methods on the stubs.
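As a sketch of how this plays out on the decode side: a stub generated from a newer schema version can read an older message, and an extension field that did not exist in that older version comes back as its null value, which can be compared against the generated metadata. The discountedPrice field below is hypothetical, and the nullValue accessor naming follows the stub convention described above (an assumption, so check the generated code).

    // actingVersion comes from the message header of the (older) incoming message
    car.wrapForDecode(directBuffer, bufferOffset, actingBlockLength, actingVersion);

    // 'discountedPrice' is a hypothetical optional field added in a newer schema version
    final long discountedPrice = car.discountedPrice();
    if (discountedPrice == Car.discountedPriceNullValue())
    {
        // The field was not present in the older message; fall back to a default
    }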

Byte Ordering and Alignment

The message schema allows for precise alignment of fields by specifying offsets. Fields are by default encoded in little-endian form unless otherwise specified in a schema. For maximum performance, native encoding with fields on word-aligned boundaries should be used. The penalty for accessing non-aligned fields on some processors can be very significant. For alignment one must consider the framing protocol and buffer locations in memory.

Message Protocols

I often see people complain that a codec cannot support a particular presentation in a single message. However, this can often be addressed with a protocol of messages. Protocols are a great way to split an interaction into its component parts; these parts are then often composable for many interactions between systems. For example, the IR implementation of schema metadata is more complex than can be supported by the structure of a single message. We encode IR by first sending a template message providing an overview, followed by a stream of messages, each encoding the tokens from the compiler IR. This allows for the design of a very fast OTF decoder which can be implemented as a threaded interpreter with much less branching than the typical switch-based state machines.

Protocol design is an area that most developers don't seem to get an opportunity to learn. I feel this is a great loss. The fact that so many developers will call an "encoding" such as ASCII a "protocol" is very telling. The value of protocols is so obvious when one gets to work with a programmer like Todd who has spent his life successfully designing protocols.

Stub Performance

The stubs provide a significant performance advantage over the dynamic OTF decoding. For accessing primitive fields we believe the performance is reaching the limits of what is possible from a general purpose tool. The generated assembly code is very similar to what a compiler will generate for accessing a C struct, even from Java!

Regarding the general performance of the stubs, we have observed that C++ has a very marginal advantage over Java, which we believe is due to runtime-inserted safepoint checks. The C# version lags a little further behind due to its runtime not being as aggressive with inlining methods as the Java runtime. Stubs for all three languages are capable of encoding or decoding typical financial messages in tens of nanoseconds. This effectively makes the encoding and decoding of messages almost free for most applications relative to the rest of the application logic.

Feedback

This is the first version of SBE and we would welcome feedback. The reference implementation is constrained by the FIX community specification. It is possible to influence the specification, but please don't expect pull requests to be accepted that go significantly against the specification. Support for JavaScript, Python, Erlang, and other languages has been discussed and would be very welcome.


Update: 08-May-2014

Thanks to feedback from Kenton Varda, the creator of GPB, we were able to improve the benchmarks to get the best performance out of GPB. Below are the results for the changes to the Java benchmarks.

The C++ GPB examples, after optimisation, show approximately a doubling of throughput compared to the initial results. It should be noted that with GPB you often have to do the opposite in Java compared to C++ to get performance improvements, such as allocating objects rather than reusing them.

Before GPB Optimisation:
     [exec] Benchmark                                          Mode  Thr   Cnt  Sec         Mean   Mean error    Units
     [exec] u.c.r.protobuf.CarBenchmark.testDecode           thrpt   1     30    1      462.817        6.474   ops/ms
     [exec] u.c.r.protobuf.CarBenchmark.testEncode           thrpt   1     30    1      326.018        2.972   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testDecode    thrpt   1     30    1     1148.050       17.194   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testEncode    thrpt   1     30    1     1242.252       12.248   ops/ms

     [exec] u.c.r.sbe.CarBenchmark.testDecode                thrpt   1     30    1    10436.476      102.114   ops/ms
     [exec] u.c.r.sbe.CarBenchmark.testEncode                thrpt   1     30    1    11657.190       65.168   ops/ms
     [exec] u.c.r.sbe.MarketDataBenchmark.testDecode         thrpt   1     30    1    34078.646      261.775   ops/ms
     [exec] u.c.r.sbe.MarketDataBenchmark.testEncode         thrpt   1     30    1    29193.600      443.638   ops/ms
After GPB Optimisation:
     [exec] Benchmark                                          Mode  Thr   Cnt  Sec         Mean   Mean error    Units
     [exec] u.c.r.protobuf.CarBenchmark.testDecode           thrpt   1     30    1      619.467        4.429   ops/ms
     [exec] u.c.r.protobuf.CarBenchmark.testEncode           thrpt   1     30    1      433.711       10.364   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testDecode    thrpt   1     30    1     2088.998       60.619   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testEncode    thrpt   1     30    1     1316.123       19.816   ops/ms


Throughput msg/ms - Before GPB Optimisation
Test                   Protocol Buffers          SBE    Ratio
Car Encode                      326.018    11657.190    35.76
Car Decode                      462.817    10436.476    22.55
Market Data Encode             1242.252    29193.600    23.50
Market Data Decode             1148.050    34078.646    29.68

Throughput msg/ms - After GPB Optimisation
Test                   Protocol Buffers          SBE    Ratio
Car Encode                      433.711    11657.190    26.88
Car Decode                      619.467    10436.476    16.85
Market Data Encode             1316.123    29193.600    22.18
Market Data Decode             2088.998    34078.646    16.31