otf.pack: Serialisation library#
pack is a layer that is intended to be used on top of existing serialisation libraries. It converts arbitrary python values to a simple human readable format.
What differentiates pack from other serialisation libraries is that it is very explicit. Out of the box pack only handles a small set of types; we don't rely on inheritance or the runtime structure of values to handle anything that isn't of a supported type. Support for new types can be added via register().
Supported types#
Out of the box, the types that are supported are:
API#
Text format#
The text format uses the same syntax as python; we just support a very restricted subset of the language and have a different set of builtins (e.g.: ref and nan).
Todo
The text format doesn’t have a proper spec.
- otf.pack.dump_text(obj, indent=None, width=60, format=None)#
Serialise obj
dump_text() supports several formats. Let's take a sample value with a shared reference:
>>> v = {'nan': math.nan, '1_5': [1, 2, 3, 4, 5]}
>>> v2 = [v, v]
COMPACT means it will all be printed on one line:
>>> print(dump_text(v2, format=COMPACT))
[{'nan': nan, '1_5': [1, 2, 3, 4, 5]}, ref(10)]
PRETTY will use the width and indent arguments to pretty print the output:
>>> print(dump_text(v2, format=PRETTY, width=20))
[
    {
        'nan': nan,
        '1_5': [
            1,
            2,
            3,
            4,
            5
        ]
    },
    ref(10)
]
EXECUTABLE will print code that can run in a python environment where the last statement is the value we're building:
>>> print(dump_text(v2, format=EXECUTABLE, width=40))
_0 = {
    'nan': float("nan"),
    '1_5': [1, 2, 3, 4, 5]
}

[_0, _0]
- Parameters
  - obj – The value to serialise
  - indent (int | None) – indentation (for the PRETTY and EXECUTABLE formats)
  - width (int) – Maximum line length (for the PRETTY and EXECUTABLE formats).
  - format – One of None, COMPACT, PRETTY, EXECUTABLE. If the value is None then the format will be COMPACT if indent wasn't specified and PRETTY otherwise.
Valid arguments for the format keyword of dump_text() are:
- otf.pack.COMPACT#
- otf.pack.PRETTY#
- otf.pack.EXECUTABLE#
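The ref(...) builtin seen in the COMPACT output exists because the encoder must notice shared references before it can emit them. A toy sketch (not otf's implementation) of how such sharing can be detected with id():

```python
# A toy sketch (NOT otf's internals) of spotting the shared containers
# that the ref(...) builtin encodes in the text format.
def find_shared(value, _seen=None, _shared=None):
    """Return the ids of the containers that occur more than once in value."""
    if _seen is None:
        _seen, _shared = set(), set()
    if isinstance(value, (list, dict)):
        if id(value) in _seen:
            # Second sighting: this container is shared.
            _shared.add(id(value))
        else:
            _seen.add(id(value))
            children = value.values() if isinstance(value, dict) else value
            for child in children:
                find_shared(child, _seen, _shared)
    return _shared

v = {'nan': float('nan'), '1_5': [1, 2, 3, 4, 5]}
v2 = [v, v]  # v occurs twice, so an encoder must emit a reference for it
```

Here find_shared(v2) returns {id(v)}, which is why dump_text prints the dict once and a ref(...) the second time.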
Binary format#
The binary format used by OTF is MessagePack with a couple of extensions. See Description of the binary format
- otf.pack.dump_bin(obj)#
Serialise obj to the binary format
Note
This feature is only available if otf was installed with msgpack (e.g.: via pip install otf[msgpack]).
- otf.pack.load_bin(packed)#
Read an object written in binary format
Note
This feature is only available if otf was installed with msgpack (e.g.: via pip install otf[msgpack]).
Adding support for new types#
- otf.pack.register(function=None, /, *, type=None, pickle=False)#
Register a function to use while packing objects of a given type.
function is expected to take objects of type T and to return a tuple describing how to recreate the object: a function and a serialisable value.
If type is not specified, register() uses the type annotation on the first argument to deduce which type to register the function for.
If register() is used as a simple decorator (with no arguments) it acts as though called with the default values for all of its parameters.
Here are three equivalent ways to add support for the complex type:
>>> @register
... def _reduce_complex(c: complex):
...     return complex, (c.real, c.imag)
>>> @register()
... def _reduce_complex(c: complex):
...     return complex, (c.real, c.imag)
>>> @register(type=complex)
... def _reduce_complex(c: complex):
...     return complex, (c.real, c.imag)
- Parameters
  - function – The reduction we are registering
  - type – The type we are registering the function for
  - pickle – If set to True, function is registered via copyreg.pickle() to be used in pickle.
Converting between formats#
otf.pack is built upon the concept of reducers and accumulators (you might know those as Fold and Unfold if you're a functional programmer).
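The split can be illustrated with a toy reducer over runtime lists and two toy accumulators. This is a sketch of the fold/unfold idea only; the class and method names are hypothetical, not otf's API:

```python
class TextAcc:
    # Toy accumulator that builds a textual representation.
    def constant(self, v):
        return repr(v)
    def list_(self, items):
        return "[" + ", ".join(items) + "]"

class RuntimeAcc:
    # Toy accumulator that rebuilds runtime values.
    def constant(self, v):
        return v
    def list_(self, items):
        return list(items)

def toy_reduce(value, acc):
    # The reducer walks the source representation; the accumulator
    # decides what to build, so one reducer serves many destinations.
    if isinstance(value, list):
        return acc.list_(toy_reduce(v, acc) for v in value)
    return acc.constant(value)
```

Swapping the accumulator changes the output format without touching the reducer, which is exactly how otf.pack pairs a source-format reducer with a destination-format accumulator.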
Documents are converted from one format to another by performing a reduce on them from their source format with an accumulator for their destination format. For instance, if you wanted to go from the text representation of a document to its binary representation you could do:
>>> otf.pack.reduce_text('[1, 2, 3, 4]', otf.pack.BinPacker())
b'\x94\x01\x02\x03\x04'
This doesn't use the runtime representation as an intermediate value, which has the advantage of letting us introspect and fix documents that rely on constructors that don't exist anymore. Let's say you have a binary value that won't load because it relies on a constructor that doesn't exist anymore:
>>> bin_doc = (
... b'\x94\x92\xc7\x14\x02mymath.angle:degrees\x00\x92\xd4\x03\x02Z\x92'
... b'\xd4\x03\x02\xcc\xb4\x92\xd4\x03\x02\xcd\x01\x0e'
... )
>>> otf.pack.load_bin(bin_doc)
Traceback (most recent call last):
...
LookupError: Constructor not found: 'mymath.angle'
You can convert that binary representation into a format that is easier to read and can be fixed in an editor:
>>> print(otf.pack.reduce_bin(bin_doc, otf.pack.PrettyPrinter()))
[
mymath.angle(degrees=0),
mymath.angle(degrees=90),
mymath.angle(degrees=180),
mymath.angle(degrees=270)
]
Let’s say that you know that the new system uses angles represented as time on a
clock. You create a new binary value that can be loaded by that system even if
you don’t have the mymath
package installed on your machine:
>>> doc = """[
... mymath.clock_angle(hours=3),
... mymath.clock_angle(hours=12),
... mymath.clock_angle(hours=9),
... mymath.clock_angle(hours=6),
... ]"""
>>> new_bin_doc = otf.pack.reduce_text(doc, otf.pack.BinPacker())
The defined reducers are:
- otf.pack.reduce_runtime_value(obj, acc, string_hashcon_length=32)#
- otf.pack.reduce_text(orig, acc)#
- otf.pack.reduce_bin(packed, acc)#
Read a binary encoded value.
The Accumulators are:
- class otf.pack.RuntimeValueBuilder#
An accumulator that builds runtime values.
- class otf.pack.CompactPrinter#
Serialize a value as human readable text.
- class otf.pack.PrettyPrinter(indent=4, width=80)#
Convert a value into a multiline document
- class otf.pack.ExecutablePrinter(indent=4, width=80, add_imports=True)#
- class otf.pack.BinPacker#
Writer for binary encoded values.
Utility functions#
- otf.pack.copy(v)#
Copy a value using its representation.
copy(v) is equivalent to load_text(dump_text(v)).
- Parameters
  - v –
- otf.pack.dis(raw, out=None, level=1)#
Output a disassembly of raw.
Warning
This function is for diagnostic purposes only. We make no guarantees that the output format will not change.
The level argument controls how much detail gets printed. At level<=1 only the OTF instructions are printed:
>>> dis(dump_bin((1, 2, 3)))
0001: CUSTOM('tuple'):
0002: LIST:
0003: 1
0004: 2
0005: 3
At level=2 the msgpack instructions are also printed:
>>> dis(dump_bin((1, 2, 3)), level=2)
msgpack: Array(len=2)
msgpack: ExtType(2, b'tuple')
otf: 0001: CUSTOM('tuple'):
msgpack: Array(len=3)
otf: 0002: LIST:
msgpack: int(1)
otf: 0003: 1
msgpack: int(2)
otf: 0004: 2
msgpack: int(3)
otf: 0005: 3
At level>=3 a hexdump of the source is also included:
>>> dis(dump_bin((1, 2, 3)), level=3)
raw: 92
msgpack: Array(len=2)
raw: C7 05 02 74 75 70 6C 65
msgpack: ExtType(2, b'tuple')
otf: 0001: CUSTOM('tuple'):
raw: 93
msgpack: Array(len=3)
otf: 0002: LIST:
raw: 01
msgpack: int(1)
otf: 0003: 1
raw: 02
msgpack: int(2)
otf: 0004: 2
raw: 03
msgpack: int(3)
otf: 0005: 3
- Parameters
  - raw (bytes) – value to disassemble
  - out (file-like) – text file where the output will be written (defaults to sys.stdout).
  - level (int) – between 1 and 3.
Description of the text format#
OTF's text representation is a subset of python. In fact we use the cpython parser to read values. The easiest way to specify the language is to use an ASDL [ASDL97] representation based on the one for the python grammar:
-- builtin types are:
-- identifier, string, constant
module OTF {
doc = Doc(import* imports, assign* bindings, expr value)
-- We do not support import as
import = Import(identifier* names)
-- We only support assigning to a name
assign = Assign(identifier target, expr? annotation, expr value)
expr = Dict(expr* keys, expr* values)
| Set(expr* elts)
| List(expr* elts)
| Tuple(expr* elts)
| Call(function_identifier func, expr* args, keyword* keywords)
-- We don't support f-strings here
| JoinedStr(string* values)
| Constant(constant value)
| Name(identifier id)
-- In the python grammar this is part of expr
function_identifier = Attribute(function_identifier value, identifier attr)
| FName(identifier id)
-- keyword arguments supplied to call (**kwargs not supported)
keyword = (identifier arg, expr value)
}
An OTF serialised value consists of 3 parts:
- imports: which are ignored by our parser
- bindings: in the form of <name> = <value>
- value: an expression that can use any of the values defined in the bindings.
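Because the language is a subset of python, the three-part structure can be demonstrated with the standard ast module. This is illustrative only (the sample document is made up, and otf's own parser enforces much stricter rules than ast.parse):

```python
import ast

# A made-up OTF-style text document: an import, one binding, then the value.
doc = """import math

_0 = [1, 2, 3]
[_0, _0]"""

module = ast.parse(doc)

# Imports are ignored by the OTF parser; bindings are plain assignments;
# the final statement is the value expression.
imports = [s for s in module.body if isinstance(s, ast.Import)]
bindings = [s for s in module.body if isinstance(s, ast.Assign)]
value = module.body[-1]
assert isinstance(value, ast.Expr)
```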
Todo
This part needs to be filled in properly.
Description of the binary format#
OTF’s binary format is valid MessagePack:
>>> import msgpack
>>> packed = otf.pack.dump_bin([None, {1: 1}, -0.])
>>> msgpack.unpackb(packed, strict_map_key=False)
[None, {1: 1}, -0.0]
MessagePack allows for application specific extension types. Here are the ones used by OTF:
Extension 0: Arbitrary precision ints#
MessagePack only supports encoding integers in the \([-2^{63}, 2^{64}-1]\) interval. Python, on the other hand, supports arbitrarily large integers. We encode integers outside the native MessagePack range as a 2's-complement, little-endian payload inside an Extension Type of code 0:
>>> import msgpack
>>> msgpack.unpackb(otf.pack.dump_bin(2**72))
ExtType(code=0, data=b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01')
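Since the payload is plain 2's-complement little-endian bytes, it can be read and written by hand with int.from_bytes and int.to_bytes (a sketch of decoding the extension without otf):

```python
# The extension-0 payload for 2**72, as shown above:
data = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'

# Decoding: interpret the bytes as a signed little-endian integer.
value = int.from_bytes(data, byteorder='little', signed=True)

# Encoding goes the other way; one extra bit is reserved for the sign,
# hence the + 8 in the byte-length computation.
n = 2**72
encoded = n.to_bytes((n.bit_length() + 8) // 8, byteorder='little', signed=True)
```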
Extension 2: Custom constructors#
OTF can be extended to support arbitrary types via register(). For those we need to save:
- The full name of the constructor.
- The names of keyword arguments.
- The full list of values to pass back to the constructor.
>>> @otf.pack.register
... def _(c: complex):
... return complex, (c.real,), {"imag": c.imag}
>>>
>>> packed = otf.pack.dump_bin(complex(1, .5))
>>> otf.pack.dis(packed)
0001: CUSTOM('complex:imag'):
0002: 1.0
0003: 0.5
We will call the constructor complex with one keyword argument imag; the full list of arguments is [1.0, 0.5]. The last arguments of this list are the keyword arguments.
>>> print(otf.pack.reduce_bin(packed, otf.pack.text.PrettyPrinter()))
complex(1.0, imag=0.5)
In raw MessagePack this is encoded as an array where the first element is an
ExtType
of code 2 and the rest of the list are the arguments:
>>> import msgpack
>>> msgpack.unpackb(packed)
[ExtType(code=2, data=b'complex:imag'), 1.0, 0.5]
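The colon-separated shape string can be pulled apart to recover the call. This sketch applies the rule stated above (the trailing arguments are the keyword arguments); the constructor lookup table is a stand-in for otf's real registry:

```python
# The ExtType data and trailing array elements from the example above:
shape = b'complex:imag'.decode()
args = [1.0, 0.5]

# The shape is the constructor's full name followed by the keyword names.
name, *kwarg_names = shape.split(':')

# The last len(kwarg_names) arguments are passed by keyword.
n_positional = len(args) - len(kwarg_names)
positional = args[:n_positional]
keywords = dict(zip(kwarg_names, args[n_positional:]))

# Stand-in constructor lookup (otf resolves the full dotted name instead).
constructor = {'complex': complex}[name]
value = constructor(*positional, **keywords)  # complex(1.0, imag=0.5)
```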
Extension 3: Interned constructor#
The argument passed to the ExtType for custom constructors (the full name of the constructor and the names of the keyword arguments) is called the "shape" of a custom constructor. In order to save space, if the same shape appears twice in a document the encoder uses the "interned_custom" instruction to reuse a previous shape declaration.
>>> @otf.pack.register
... def _(c: complex):
... return complex, (), {"real": c.real, "imag": c.imag}
>>>
>>> packed = otf.pack.dump_bin([complex(1, .5), complex(2)])
>>> otf.pack.dis(packed)
0001: LIST:
0002: CUSTOM('complex:real:imag'):
0003: 1.0
0004: 0.5
0005: INTERNED_CUSTOM(3):
0006: 2.0
0007: 0.0
This results in smaller MessagePack payloads:
>>> import msgpack, pprint
>>>
>>> pprint.pp(msgpack.unpackb(packed), width=60)
[[ExtType(code=2, data=b'complex:real:imag'), 1.0, 0.5],
[ExtType(code=3, data=b'\x03'), 2.0, 0.0]]
OTF’s serialised values should be self-descriptive; interning shapes encourages clients to focus on the readability of their output format.
Design choice#
Not prioritising speed#
The main focus is to provide an easy and safe way to serialise arbitrary python values. For a python serialisation library to be fast (e.g.: msgpack and _pickle), a significant part of it has to be written in C. Both MsgPack and Pickle offer fallbacks in pure python. Maintaining a big C stub (let alone a fallback in pure python) would be a significant cost. OTF is still a young project and being nimble is important.
If you want to save large amounts of data we recommend you use a format that is tailored to the type of data you are saving. Some good examples would be:
Protocol Buffers for structured data (i.e.: data with a fixed schema).
Relative references#
There are several ways to encode shared references. Here are a couple of options we rejected:
Marking: You could “mark” the values that will be used later as shared references either by encoding them in a separate section and referring to them later or by having an instruction to save them the first time the encoder sees them. We chose to avoid this route because:
It requires two different instructions (one to save the value and one to refer to it).
You have to do an additional pass over the value you’re about to encode to detect all the shared references.
Absolute references: instead of using an offset relative to the current position to encode the references we could have used a position relative to the start of the document. We chose to use a relative position because that allows us to insert an existing document without having to rewrite it.
Not allowing recursive values#
Recursive values are somewhat of a rare corner case and they can cause a lot of headaches. The naive approach is to assume you can construct values in two stages: initialise them as empty, create their children, and then fill them. This works fine for a value like:
value = [[],]
value[0].append(0)
But now let’s consider this case:
value = ([],)
value[0].append(0)
In this case you can't create value as an empty tuple and then fill it… The corner cases get even more problematic when you consider that we rely heavily on user provided serialisation functions (via register()). We don't actually know how much of its argument a constructor introspects…
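A genuinely recursive value makes the problem concrete: building it at runtime is easy, but a two-stage decoder cannot reproduce the construction order, because the immutable tuple can only be created after the inner list is complete, while the list's content (the tuple itself) doesn't exist yet.

```python
# Building a recursive value at runtime is easy...
v = ([],)
v[0].append(v)
assert v[0][0] is v  # the tuple contains a list that contains the tuple

# ...but a two-stage decoder is stuck: tuples are immutable, so the
# tuple must be constructed *after* its child list is filled, yet the
# list can only be filled once the tuple exists. Neither can go first.
```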
Not relying on runtime representation or inheritance#
One of the main philosophical differences between pickle
and otf.pack
is that we are a lot more explicit. otf.pack
only supports values of types
that it explicitly knows how to handle. pickle
, on the other hand, will try
to serialise values based upon their runtime representation (e.g.: classes whose
__dict__
is picklable). These classes can, in turn, have instances of other
classes that will also be pickled based on their runtime representation. In
practice this ties the output of pickle pretty tightly with the exact version of
the code that generated it. On the other hand, by forcing clients to think
carefully about how values are serialised, we make it much easier to reason
about backward compatibility.
Using python and MessagePack for the serialisation languages#
Learning a new syntax, even the syntax of a domain specific language, is tedious. We made it a point to re-use python for our text representation. Similarly, MessagePack is one of the most widespread formats for saving unstructured data. Most languages already have libraries to read MessagePack. Depending on how many custom types you use in your application, reading back values written by OTF in another language should be relatively easy. The same definitely cannot be said for pickle.