(You don't) Insert unicode NULL character as Postgres jsonb

With JSON data type it’s easy to treat Postgres as a document database, which doesn’t need strong schema. One can define a table, that has a field of a type jsonb and insert any valid JSON string (a “document”).

I’ve learned lately, that Postgres’s jsonb prohibits insertion of a valid JSON string if the string contains NULL (U+0000) character. Postgres’s own docs on JSON Types says:

RFC 7159 permits JSON strings to contain Unicode escape sequences denoted by \uXXXX. In the input function for the json type, Unicode escapes are allowed regardless of the database encoding, and are checked only for syntactic correctness. However, the input function for jsonb is stricter: it disallows Unicode escapes for non-ASCII characters (those above U+007F) unless the database encoding is UTF8. The jsonb type also rejects \u0000 (because that cannot be represented in PostgreSQL’s text type), and it insists that any use of Unicode surrogate pairs to designate characters outside the Unicode Basic Multilingual Plane be correct.

In my case, a Go backend inserts tracing logs to Postgres. A trace consists of multiple “spans”, some of which can contain the reply from an external API. As we found out, sometimes, in the event of a failure, the API replies with an empty GIF <facepalm/>. Our backend converts the response to a string, marshals it to a JSON and later tries to insert the JSON into a Postgres table.

Consider the following Go code:

// data is an empty GIF
var data = []byte{
    0x47, 0x49, 0x46, 0x38, 0x39, 0x61, 0x01, 0x00,
    0x01, 0x00, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00,
    0xff, 0xff, 0xff, 0x21, 0xf9, 0x04, 0x01, 0x00,
    0x00, 0x00, 0x00, 0x2c, 0x00, 0x00, 0x00, 0x00,
    0x01, 0x00, 0x01, 0x00, 0x00, 0x02, 0x01, 0x44,
    0x00, 0x3b,
}

func main() {
    v, _ := json.Marshal(struct {
        Resp interface{} `json:"resp,omitempty"`
    }{
        Resp: string(data),
    })
    fmt.Printf("%s\n", v)
    // Output (truncated for readability):
    // {"resp":"GIF89a\u0001\u0000\u0001\u0000\ufffd\···\u0001D\u0000;"}
}

Above, json.Marshal produces a perfectly valid JSON. But if I try to insert it into a Postgres table as jsonb, the insert fails with “unsupported Unicode escape sequence”:

= CREATE TABLE logs (data jsonb);
= INSERT INTO logs VALUES ('{"resp":"GIF89a\u0001\u0000\u0001\u0000\ufffd\···\u0001D\u0000;"}');

ERROR:  unsupported Unicode escape sequence
LINE 1: insert into logs values ('{"resp":"GIF89a\u0001\u0000\u0001\...
                                 ^
DETAIL:  \u0000 cannot be converted to text.
CONTEXT:  JSON data, line 1: {"resp":...

Because in my code, there were only a couple of places where I didn’t control the actual data, that went into a span, the way I’ve chosen to handle that was by introducing a wrapper type, that implements json.Marshaller. The wrapper checks the value is a valid UTF-8 sequence and doesn’t contain NULL character before it marshals the value into a JSON string. If the value is not a valid UTF-8, the marshaller sees it as a binary data and base64-encodes it.

// RawText handles invalid UTF-8 and NULL-bytes, encoding them as base64-string.
// Because we have to make sure the resulting JSON will be compatible with Postgres's jsonb,
// we must use RawText when we don't control the data, e.g. when log the error from an external API.
// Refer to https://www.postgresql.org/docs/10/datatype-json.html
type RawText []byte

func (v RawText) MarshalEasyJSON(w *Writer) {
    if utf8.Valid(v) && !bytes.ContainsRune(v, '\u0000') {
        // "valid" text is marshalled as string
        w.String(string(v))
    } else {
        // "invalid" text is marshalled as binary data
        w.Raw(json.Marshal([]byte(v)))
    }
}

Note, the code above is a marshaller for github.com/mailru/easyjson, which we use in the project.

Here is how it looks in practice:

func main() {
    v, _ := json.Marshal(struct {
        Resp1 interface{} `json:"resp1,omitempty"`
        Resp2 interface{} `json:"resp2,omitempty"`
    }{
        Resp1: RawText(bin),             // wrap the bin data into RawText
        Rest2: RawText("normal string"), // wrap (copy) a string into RawText
    })
    fmt.Printf("%s\n", v)
    // Output:
    // {
    // "resp1":"R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",
    // "resp2":"normal string"
    // }
}