January 19, 2024
Blog

SQLite’s New Binary JSON Format

A new data format is being used by the world’s most-used database format. Here’s what digital forensics practitioners need to know.

SQLite is one of, if not outright the most deployed database engine in the world. Being both free and good would have been enough to sustain its popularity, but the SQLite development team continue to add new features to the engine, so that it can keep up with the new, trendy kids on the database engine block. One such feature is support for direct querying and modifying of embedded JSON data.

Support for querying JSON fields stored in a database has been available in SQLite since version 3.9.0, released late 2015 via an officially supported extension which, at the time, could be optionally added to the library when it was compiled. Since version 3.38.0, released early 2022, this functionality has been added as a default part of the official library. SQLite also supports the additional syntax supported by the JSON5 project. 

The newest version (at the time of writing) is 3.45.0 which brings with it further enhancements to the JSON functionality. In particular: internally SQLite uses a faster, smaller, binary JSON format. This format is serializable and available to applications using the database if developers are happy to choose performance over readability (for them, not their users) of the data being stored.

The SQLite team have opted to create their own binary JSON format which they refer to as “JSONB”, rather than making use of one of the pre-existing binary JSON or JSON-adjacent formats already available. This choice makes sense given the specific use-case of integrating it into their query engine, but this does leave us with YABJF (Yet Another Binary JSON Format) to understand and recognize.

The first thing to note is that if you encounter a database which contains JSONB, as long as you are using a tool with the latest version of the SQLite library (which could just be the official CLI tool), converting to text-based JSON for review is trivial. Given a table called “table” and a column called “data”, you could do the following:

SELECT json("data") FROM "table";

This query will work, even if there is a mix of text-based JSON and JSONB in the column.

But how do you recognize the format in the first place? The first thing to get out of the way is that there is no “file signature” i.e., a fixed sequence of bytes at the start of the data – but there are other clues we can use to detect it. For some tips on that, we should first take a moment to examine how the format works (here come the bits and bytes – but don’t worry, this one isn’t too complicated!). The full format is documented on the SQLite website, but I’ll do my best to summarize it here.

The format is “tagged”, meaning that each element of data starts with a tag (a single byte in this case) which encodes the type of data and the length of said data in bytes; the data then follows the tag (possibly prefixed with an additional length field – we’ll get to this). The data might be a string, a number, or in the case of [arrays] and {objects} other tagged values.

A tag byte encodes the type of data in the four least-significant bits, which means that there are 16 different types of data which can be encoded (although 3 values are currently reserved).

Data table

Note that numbers are stored as text, which initially seems a little unusual in a binary format, but given that one of the design goals of the format is to convert quickly between text and binary, this choice does make sense.  

In testing, I have been unable to generate “TextJ” or “Text5” values, as any data I have generated during testing that I believe should be quoted is converted to a “TextRaw” type. I suspect that “TextJ” and “Text5” are used internally in SQLite but are not directly serialized, but that is conjecture on my part.

The length of the record is stored as a number of bytes. This length value is stored in the 4 most significant bits of the tag byte if it is 11 bytes or less; if the data is greater than 11 bytes in length, the 4 most significant bits give the magnitude of a big-endian, unsigned integer which follows the tag byte directly (preceding the actual data).

Data table

Let’s take a look at an example:

This JSON:

JSON Code

Is encoded in JSONB thus:

JSONB table data

As there isn’t any “file header” in JSONB, the data starts with the root value’s tag.  

The first byte we encounter is 0xCC; looking up the least significant 4 bits in the data-types table above, we can see that 0xC relates to an object. The upper 4 bits (also 0xC, or 12 in decimal) give a length value greater than 11 so the length of the data is encoded after the tag; looking up 0xC in our length table tells us that the length is encoded as an 8-bit integer. The length, therefore, is found in the following byte: 0x6F, which is 111 in decimal, which given that this is the root value should account for the remaining JSONB data (which it does).

For the next 111 bytes we shall be reading pairs of keys and values which make up this root object’s properties; let’s decode the first of them:

The next tag byte we encounter is 0x47, which has a type of 0x7 (Text) and a length of 0x4; because the length is less than or equal to 11, the length value encoded in the tag is the actual length, so we can read 4 bytes from here and interpret them as Unicode encoded with UTF-8 and find the first key “text”.  

Next, we find the tag 0x97, which has a type of 0x7 (Text) and a length of 0x9; again, as the length is less than or equal to 11, we can again treat this as the actual length, read 9 bytes and decode it as UTF-8 to find the value of the key “some text”.

We can continue to read the data if we wish, but I’ll leave that as an exercise for the reader as we’ve already encountered the clue that we can use to detect the format at a glance: the root tag.

JSON data will tend to have either an object or array as its root object (otherwise, in the context of SQLite, it would be better stored as a native datatype). This means that JSONB data will almost always start with tags of those types: 0x?B or 0x?C. The upper 4-bits encodes the length, which is, of course variable, however, it is unlikely (though not impossible!) in many cases that the length will be under 12 bytes, and I would be even more surprised to find more than 4GB(!) of JSON data, requiring a 64-bit length value, so this leaves us with one of:

0xCB, 0xCC, 0xDB, 0xDC, 0xEB, 0xEC

…as likely candidates at the start of the data, followed in close proximity by data which is likely to be human readable given that almost all data types are actually encoded as text. You could also validate that the length directly following the first tag accounts for the rest of the data.

The above description probably doesn’t describe this format exclusively but given the context of finding it stored in an SQLite database, it should at least ring alarm-bells that you may be seeing JSONB.

As part of our research into the format, we built a Python module which can read and convert JSONB data which we have open sourced here. Given that the format can be converted or queried easily by SQLite itself (as long as you have an up-to-date library), this is provided more for research purposes, although it might be useful for further research or dealing with recovered deleted data from databases.

We're here to help

Our experts are on hand to learn about your organisation and suggest the best approach to meet your needs. Contact an expert today.

Get in touch
hexes