otf.pack: Serialisation library#
pack is a layer that is intended to be used on top of existing serialisation libraries. It converts arbitrary python values to a simple human readable format.
What differentiates pack from other serialisation libraries is that it is very explicit. Out of the box pack only handles a small set of types; we don't rely on inheritance or the runtime structure of values to handle anything that isn't of a supported type. Support for new types can be added via register().
Supported types#
Out of the box, the types that are supported are:
API#
Text format#
The text format uses the same syntax as python; we just support a very restricted subset of the language and have a different set of builtins (e.g.: ref and nan).
Todo
The text format doesn’t have a proper spec.
- otf.pack.dump_text(obj, indent=None, width=60, format=None)#
Serialise obj
dump_text() supports several formats. Let's take a sample value with a shared reference:
>>> v = {'nan': math.nan, '1_5': [1, 2, 3, 4, 5]}
>>> v2 = [v, v]
COMPACT means it will all be printed on one line:
>>> print(dump_text(v2, format=COMPACT))
[{'nan': nan, '1_5': [1, 2, 3, 4, 5]}, ref(10)]
PRETTY will use the width and indent arguments to pretty print the output:
>>> print(dump_text(v2, format=PRETTY, width=20))
[
    {
        'nan': nan,
        '1_5': [
            1,
            2,
            3,
            4,
            5
        ]
    },
    ref(10)
]
EXECUTABLE will print code that can run in a python environment where the last statement is the value we're building:
>>> print(dump_text(v2, format=EXECUTABLE, width=40))
_0 = {
    'nan': float("nan"),
    '1_5': [1, 2, 3, 4, 5]
}

[_0, _0]
- Parameters
  - obj – The value to serialise
  - indent (int | None) – indentation (for the PRETTY and EXECUTABLE formats)
  - width (int) – Maximum line length (for the PRETTY and EXECUTABLE formats).
  - format – One of None, COMPACT, PRETTY, EXECUTABLE. If the value is None then the format will be COMPACT if indent wasn't specified and PRETTY otherwise.
Valid arguments for the format keyword of dump_text() are:
- otf.pack.COMPACT#
- otf.pack.PRETTY#
- otf.pack.EXECUTABLE#
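The ref(...) builtin seen in the COMPACT output exists because the encoder must notice shared references before it can emit them. A toy sketch (not otf's implementation) of how such sharing can be detected with id():

```python
# A toy sketch (NOT otf's internals) of spotting the shared containers
# that the ref(...) builtin encodes in the text format.
def find_shared(value, _seen=None, _shared=None):
    """Return the ids of the containers that occur more than once in value."""
    if _seen is None:
        _seen, _shared = set(), set()
    if isinstance(value, (list, dict)):
        if id(value) in _seen:
            # Second sighting: this container is shared.
            _shared.add(id(value))
        else:
            _seen.add(id(value))
            children = value.values() if isinstance(value, dict) else value
            for child in children:
                find_shared(child, _seen, _shared)
    return _shared

v = {'nan': float('nan'), '1_5': [1, 2, 3, 4, 5]}
v2 = [v, v]  # v occurs twice, so an encoder must emit a reference for it
```

Here find_shared(v2) returns {id(v)}, which is why dump_text prints the dict once and a ref(...) the second time.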
Binary format#
The binary format used by OTF is MessagePack with a couple of extensions. See Description of the binary format
- otf.pack.dump_bin(obj)#
Serialise obj to the binary format
Note
This feature is only available if otf was installed with msgpack (e.g.: via pip install otf[msgpack]).
- otf.pack.load_bin(packed)#
Read an object written in binary format
Note
This feature is only available if otf was installed with msgpack (e.g.: via pip install otf[msgpack]).
Adding support for new types#
- otf.pack.register(function=None, /, *, type=None, pickle=False)#
Register a function to use while packing objects of a given type.
function is expected to take objects of type T and to return a tuple describing how to recreate the object: a function and a serialisable value.
If type is not specified, register() uses the type annotation on the first argument to deduce which type to register the function for.
If register() is used as a simple decorator (with no arguments) it acts as though called with the default values for all of its parameters.
Here are three equivalent ways to add support for the complex type:
>>> @register
... def _reduce_complex(c: complex):
...     return complex, (c.real, c.imag)
>>> @register()
... def _reduce_complex(c: complex):
...     return complex, (c.real, c.imag)
>>> @register(type=complex)
... def _reduce_complex(c: complex):
...     return complex, (c.real, c.imag)
- Parameters
  - function – The reduction we are registering
  - type – The type we are registering the function for
  - pickle – If set to True, function is registered via copyreg.pickle() to be used in pickle.
Converting between formats#
otf.pack is built upon the concept of reducers and accumulators (you might know those as Fold and Unfold if you're a functional programmer).
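The split can be illustrated with a toy reducer over runtime lists and two toy accumulators. This is a sketch of the fold/unfold idea only; the class and method names are hypothetical, not otf's API:

```python
class TextAcc:
    # Toy accumulator that builds a textual representation.
    def constant(self, v):
        return repr(v)
    def list_(self, items):
        return "[" + ", ".join(items) + "]"

class RuntimeAcc:
    # Toy accumulator that rebuilds runtime values.
    def constant(self, v):
        return v
    def list_(self, items):
        return list(items)

def toy_reduce(value, acc):
    # The reducer walks the source representation; the accumulator
    # decides what to build, so one reducer serves many destinations.
    if isinstance(value, list):
        return acc.list_(toy_reduce(v, acc) for v in value)
    return acc.constant(value)
```

Swapping the accumulator changes the output format without touching the reducer, which is exactly how otf.pack pairs a source-format reducer with a destination-format accumulator.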
Documents are converted from one format to another by performing a reduce on them from their source format with an accumulator for their destination format. For instance, if you wanted to go from the text representation of a document to its binary representation you could do:
>>> otf.pack.reduce_text('[1, 2, 3, 4]', otf.pack.BinPacker())
b'\x94\x01\x02\x03\x04'
This doesn't use the runtime representation as an intermediate value, which has the advantage of letting us introspect and fix documents that rely on constructors that don't exist anymore. Let's say you have a binary value that won't load because it relies on a constructor that doesn't exist anymore:
>>> bin_doc = (
... b'\x94\x92\xc7\x14\x02mymath.angle:degrees\x00\x92\xd4\x03\x02Z\x92'
... b'\xd4\x03\x02\xcc\xb4\x92\xd4\x03\x02\xcd\x01\x0e'
... )
>>> otf.pack.load_bin(bin_doc)
Traceback (most recent call last):
...
LookupError: Constructor not found: 'mymath.angle'
You can convert that binary representation into a format that is easier to read and can be fixed in an editor:
>>> print(otf.pack.reduce_bin(bin_doc, otf.pack.PrettyPrinter()))
[
mymath.angle(degrees=0),
mymath.angle(degrees=90),
mymath.angle(degrees=180),
mymath.angle(degrees=270)
]
Let’s say that you know that the new system uses angles represented as time on a
clock. You create a new binary value that can be loaded by that system even if
you don’t have the mymath
package installed on your machine:
>>> doc = """[
... mymath.clock_angle(hours=3),
... mymath.clock_angle(hours=12),
... mymath.clock_angle(hours=9),
... mymath.clock_angle(hours=6),
... ]"""
>>> new_bin_doc = otf.pack.reduce_text(doc, otf.pack.BinPacker())
The defined reducers are:
- otf.pack.reduce_runtime_value(obj, acc, string_hashcon_length=32)#
- otf.pack.reduce_text(orig, acc)#
- otf.pack.reduce_bin(packed, acc)#
Read a binary encoded value.
The Accumulators are:
- class otf.pack.RuntimeValueBuilder#
An accumulator that builds runtime values.
- class otf.pack.CompactPrinter#
Serialize a value as human readable text.
- class otf.pack.PrettyPrinter(indent=4, width=80)#
Convert a value into a multiline document
- class otf.pack.ExecutablePrinter(indent=4, width=80, add_imports=True)#
- class otf.pack.BinPacker#
Writer for binary encoded values.
Utility functions#
- otf.pack.copy(v)#
Copy a value using its representation.
copy(v) is equivalent to load_text(dump_text(v)).
- Parameters
  - v –
- otf.pack.dis(raw, out=None, level=1)#
Output a disassembly of raw.
Warning
This function is for diagnostic purposes only. We make no guarantees that the output format will not change.
The level argument controls how much detail gets printed. At level<=1 only the OTF instructions are printed:
>>> dis(dump_bin((1, 2, 3)))
0001: CUSTOM('tuple'):
0002: LIST:
0003: 1
0004: 2
0005: 3
At level=2 the msgpack instructions are also printed:
>>> dis(dump_bin((1, 2, 3)), level=2)
msgpack: Array(len=2)
msgpack: ExtType(2, b'tuple')
otf: 0001: CUSTOM('tuple'):
msgpack: Array(len=3)
otf: 0002: LIST:
msgpack: int(1)
otf: 0003: 1
msgpack: int(2)
otf: 0004: 2
msgpack: int(3)
otf: 0005: 3
At level>=3 a hexdump of the source is also included:
>>> dis(dump_bin((1, 2, 3)), level=3)
raw: 92
msgpack: Array(len=2)
raw: C7 05 02 74 75 70 6C 65
msgpack: ExtType(2, b'tuple')
otf: 0001: CUSTOM('tuple'):
raw: 93
msgpack: Array(len=3)
otf: 0002: LIST:
raw: 01
msgpack: int(1)
otf: 0003: 1
raw: 02
msgpack: int(2)
otf: 0004: 2
raw: 03
msgpack: int(3)
otf: 0005: 3
- Parameters
  - raw (bytes) – value to disassemble
  - out (file-like) – text file where the output will be written (defaults to sys.stdout).
  - level (int) – between 1 and 3.
Description of the text format#
OTF's text representation is a subset of python. In fact we use the cpython parser to read values. The easiest way to specify the language is to use an ASDL [ASDL97] representation based on the one for the python grammar:
-- builtin types are:
-- identifier, string, constant
module OTF {
doc = Doc(import* imports, assign* bindings, expr value)
-- We do not support import as
import = Import(identifier* names)
-- We only support assigning to a name
assign = Assign(identifier target, expr? annotation, expr value)
expr = Dict(expr* keys, expr* values)
| Set(expr* elts)
| List(expr* elts)
| Tuple(expr* elts)
| Call(function_identifier func, expr* args, keyword* keywords)
-- We don't support f-strings here
| JoinedStr(string* values)
| Constant(constant value)
| Name(identifier id)
-- In the python grammar this is part of expr
function_identifier = Attribute(function_identifier value, identifier attr)
| FName(identifier id)
-- keyword arguments supplied to call (**kwargs not supported)
keyword = (identifier arg, expr value)
}
An OTF serialised value consists of 3 parts:
- imports: which are ignored by our parser
- bindings: in the form of <name> = <value>
- value: an expression that can use any of the values defined in the bindings.
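Because the language is a subset of python, the three-part structure can be demonstrated with the standard ast module. This is illustrative only (the sample document is made up, and otf's own parser enforces much stricter rules than ast.parse):

```python
import ast

# A made-up OTF-style text document: an import, one binding, then the value.
doc = """import math

_0 = [1, 2, 3]
[_0, _0]"""

module = ast.parse(doc)

# Imports are ignored by the OTF parser; bindings are plain assignments;
# the final statement is the value expression.
imports = [s for s in module.body if isinstance(s, ast.Import)]
bindings = [s for s in module.body if isinstance(s, ast.Assign)]
value = module.body[-1]
assert isinstance(value, ast.Expr)
```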
Todo
This part needs to be filled in properly.
Description of the binary format#
OTF’s binary format is valid MessagePack:
>>> import msgpack
>>> packed = otf.pack.dump_bin([None, {1: 1}, -0.])
>>> msgpack.unpackb(packed, strict_map_key=False)
[None, {1: 1}, -0.0]
MessagePack allows for application specific extension types. Here are the ones used by OTF:
Extension 0: Arbitrary precision ints#
MessagePack only supports encoding integers in the \([-2^{63}, 2^{64}-1]\) interval. Python, on the other hand, supports arbitrarily large integers. We encode integers outside the native MessagePack range as a 2's-complement, little-endian payload inside an Extension Type of code 0:
>>> import msgpack
>>> msgpack.unpackb(otf.pack.dump_bin(2**72))
ExtType(code=0, data=b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01')
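Since the payload is plain 2's-complement little-endian bytes, it can be read and written by hand with int.from_bytes and int.to_bytes (a sketch of decoding the extension without otf):

```python
# The extension-0 payload for 2**72, as shown above:
data = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'

# Decoding: interpret the bytes as a signed little-endian integer.
value = int.from_bytes(data, byteorder='little', signed=True)

# Encoding goes the other way; one extra bit is reserved for the sign,
# hence the + 8 in the byte-length computation.
n = 2**72
encoded = n.to_bytes((n.bit_length() + 8) // 8, byteorder='little', signed=True)
```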
Extension 2: Custom constructors#
OTF can be extended to support arbitrary types via register(). For those we need to save:
- The full name of the constructor.
- The names of keyword arguments.
- The full list of values to pass back to the constructor.
>>> @otf.pack.register
... def _(c: complex):
... return complex, (c.real,), {"imag": c.imag}
>>>
>>> packed = otf.pack.dump_bin(complex(1, .5))
>>> otf.pack.dis(packed)
0001: CUSTOM('complex:imag'):
0002: 1.0
0003: 0.5
We will call the constructor complex with one keyword argument imag; the full list of arguments is [1.0, 0.5]. The last arguments of this list are the keyword arguments.
>>> print(otf.pack.reduce_bin(packed, otf.pack.text.PrettyPrinter()))
complex(1.0, imag=0.5)
In raw MessagePack this is encoded as an array where the first element is an
ExtType
of code 2 and the rest of the list are the arguments:
>>> import msgpack
>>> msgpack.unpackb(packed)
[ExtType(code=2, data=b'complex:imag'), 1.0, 0.5]
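The colon-separated shape string can be pulled apart to recover the call. This sketch applies the rule stated above (the trailing arguments are the keyword arguments); the constructor lookup table is a stand-in for otf's real registry:

```python
# The ExtType data and trailing array elements from the example above:
shape = b'complex:imag'.decode()
args = [1.0, 0.5]

# The shape is the constructor's full name followed by the keyword names.
name, *kwarg_names = shape.split(':')

# The last len(kwarg_names) arguments are passed by keyword.
n_positional = len(args) - len(kwarg_names)
positional = args[:n_positional]
keywords = dict(zip(kwarg_names, args[n_positional:]))

# Stand-in constructor lookup (otf resolves the full dotted name instead).
constructor = {'complex': complex}[name]
value = constructor(*positional, **keywords)  # complex(1.0, imag=0.5)
```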
Extension 3: Interned constructor#
The argument passed to the ExtType for custom constructors (the full name of the constructor and the names of the keyword arguments) is called the "shape" of a custom constructor. In order to save space, if the same shape appears twice in a document the encoder uses the "interned_custom" instruction to reuse a previous shape declaration.
>>> @otf.pack.register
... def _(c: complex):
... return complex, (), {"real": c.real, "imag": c.imag}
>>>
>>> packed = otf.pack.dump_bin([complex(1, .5), complex(2)])
>>> otf.pack.dis(packed)
0001: LIST:
0002: CUSTOM('complex:real:imag'):
0003: 1.0
0004: 0.5
0005: INTERNED_CUSTOM(3):
0006: 2.0
0007: 0.0
This results in smaller MessagePack payloads:
>>> import msgpack, pprint
>>>
>>> pprint.pp(msgpack.unpackb(packed), width=60)
[[ExtType(code=2, data=b'complex:real:imag'), 1.0, 0.5],
[ExtType(code=3, data=b'\x03'), 2.0, 0.0]]
OTF’s serialised values should be self-descriptive; interning shapes encourages clients to focus on the readability of their output format.
Design choice#
Not prioritising speed#
The main focus is to provide an easy and safe way to serialise arbitrary python values. For a python serialisation library to be fast (e.g.: msgpack and _pickle), a significant part of it has to be written in C. Both MsgPack and Pickle offer fallbacks in pure python. Maintaining a big C stub (let alone a fallback in pure python) would be a significant cost. OTF is still a young project and being nimble is important.
If you want to save large amounts of data we recommend you use a format that is tailored to the type of data you are saving. Some good examples would be:
Protocol Buffers for structured data (i.e.: data with a fixed schema).
Relative references#
There are several ways to encode shared references. Here are a couple of options we rejected:
Marking: You could “mark” the values that will be used later as shared references either by encoding them in a separate section and referring to them later or by having an instruction to save them the first time the encoder sees them. We chose to avoid this route because:
It requires two different instructions (one to save the value and one to refer to it).
You have to do an additional pass over the value you’re about to encode to detect all the shared references.
Absolute references: instead of using an offset relative to the current position to encode the references we could have used a position relative to the start of the document. We chose to use a relative position because that allows us to insert an existing document without having to rewrite it.
Not allowing recursive values#
Recursive values are somewhat of a rare corner case and they can cause a lot of headaches. The naive approach is to assume you can construct values in two stages: initialise them as empty, create their children, and then fill them. This works fine for a value like:
value = [[],]
value[0].append(0)
But now let’s consider this case:
value = ([],)
value[0].append(0)
In this case you can't create value as an empty tuple and then fill it… The corner cases get even more problematic when you consider that we rely heavily on user provided serialisation functions (via register()). We don't actually know how much of its argument a constructor introspects…
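A genuinely recursive value makes the problem concrete: building it at runtime is easy, but a two-stage decoder cannot reproduce the construction order, because the immutable tuple can only be created after the inner list is complete, while the list's content (the tuple itself) doesn't exist yet.

```python
# Building a recursive value at runtime is easy...
v = ([],)
v[0].append(v)
assert v[0][0] is v  # the tuple contains a list that contains the tuple

# ...but a two-stage decoder is stuck: tuples are immutable, so the
# tuple must be constructed *after* its child list is filled, yet the
# list can only be filled once the tuple exists. Neither can go first.
```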
Not relying on runtime representation or inheritance#
One of the main philosophical differences between pickle
and otf.pack
is that we are a lot more explicit. otf.pack
only supports values of types
that it explicitly knows how to handle. pickle
, on the other hand, will try
to serialise values based upon their runtime representation (e.g.: classes whose
__dict__
is picklable). These classes can, in turn, have instances of other
classes that will also be pickled based on their runtime representation. In
practice this ties the output of pickle pretty tightly with the exact version of
the code that generated it. On the other hand, by forcing clients to think
carefully about how values are serialised, we make it much easier to reason
about backward compatibility.
Using python and MessagePack for the serialisation languages#
Learning a new syntax, even the syntax of a domain specific language, is tedious. We made it a point to re-use python for our text representation. Similarly, MessagePack is one of the most widespread formats for saving unstructured data. Most languages already have libraries to read MessagePack. Depending on how many custom types you use in your application, reading back values written by OTF in another language should be relatively easy. The same definitely cannot be said for pickle.