Re: '{"x": 42, "y": null}'::jsonb != '{"x": 42}'::jsonb ... Really?

Bryn Llewellyn <bryn@xxxxxxxxxxxx> · Fri, 17 Jun 2022 17:14:14 -0700

mmoncure@xxxxxxxxx wrote:

yeah. I would expect for json or jsonb, two values, "a, b",  "a" is distinct from "b" should give the same answer as "a::text is distinct from b::text".

tgl@xxxxxxxxxxxxx wrote:

bryn@xxxxxxxxxxxx wrote:

I'm going to try to think like this: The number of possible spellings of the names of keys in a JSON object is some flavor of infinite. So including this in an object:

"k": null

really is saying something. It says that I do know about "k" and that yet I have simply no information available about its value.

I'd read it as asserting that key "k" is meaningful for this object, but the correct value for that key is not known.

I have a hard time with your assertion that {"x": 42, "y": null} should be considered equivalent to {"x": 42}, because it would
render key-exists predicates useless. Either you have to say that key "y" is claimed to exist in both of these objects and indeed every object, or you have to make it fail if the key's value is null (so that it'd say "false" in both of these cases). Either of those options seems both weird and useless.

The quirkiness that my first example showed can be seen differently from how I saw it first. When I now consider this _expression_:

('{"x": 42}'::jsonb)->>'y'

it seems that its evaluation should simply raise an exception. But you said:

This produces "key y not present in JSON" but someone decided that was too unfriendly and so we instead produce SQL NULL.

Right. This is hard to justify from a purist semantic point of view, but having the operator throw an error in such cases would make it close to unusable on not-uniformly-structured data. And really the point of using JSON inside a SQL database is to cope with irregularly-structured data, so fuzziness seems like what we want.

Thank you very much for this, Tom. And thanks again to you, David, for your input. I hope that it's clear that the purpose of my questions is to discover what I'm missing—both w.r.t. actual semantics and w.r.t. the use cases that motivate PostgreSQL's functionality. Sorry if my questions (here and on other topics) might seem to challenge established wisdom and precedent.

Thank you very much for this, Tom. And thanks again to you, David, for your earlier input. I hope that it's clear that the purpose of my questions is to discover what I'm missing—both w.r.t. actual semantics and w.r.t. the use cases that motivate PostgreSQL's functionality. Sorry if, contrary to my intention, my questions (here and on other topics) might seem to challenge established wisdom and precedent.

I delayed my reply until I'd had time to think, to do some study, and (most importantly) to implement a complete, self-contained proof-of-concept to substantiate my conclusion. I'm ready, now, to report back.

Summary: I can meet my goal by using PG's native functionality appropriately.

So, w.r.t. this email’s subject, this:

  select strip_null_keys('{"x": 42, "y": null}'::jsonb) = '{"x": 42}'::jsonb;

returns "true".

The point at issue is whether the presence of « "some key": null » is different from the absence of "some key". And you (all) have argued that the two locutions for what I have wanted to see as one notion are indeed different. Well, yes, of course they are. This is tautologically true if you think only of the Unicode text of a JSON document that's so far waiting to be ingested by who knows what system. And you pointed out that, in PostgreSQL

  my_doc ? 'some key'

detects the difference. You've convinced me that some use cases will care about this—in other words, it's a distinction *with* a difference. But, in my use case, the two distinct locutions bring no semantic difference. But the actual distinction causes me a problem when I try to prove that this pair of transformations is idempotent:

  JSON → relational → JSON

But that problem is behind me now.

So... (and if you read this far at all) you can stop now unless you're interested to read further.

Before getting to my use case, I noted that Tom said this about the fact that using ->> to read a non-existent key returns a SQL null:

[This] is hard to justify from a purist semantic point of view, but having the operator throw an error in such cases would make it close to unusable on not-uniformly-structured data.

It's easy to fix this, when the app requires strictness, by implementing a user-defined operator pair, say +>> and its partner. I did this (to respect what my use case needs) —and it passed all my tests. However, it turned out, for reasons that I explain below, that I didn't need it. The operator's implementation function reads the value with the appropriate native operator and only if it returns SQL null (or JSON null) does it do more testing. First it checks if the key is absent with the ? operator—raising an error if it is so. Then it reads the value (again) with the -> native operator and raises an error if it gets a JSON null.

————————————————————

My use case

I should stress that what I have, so far, is simply a demo implementation on my laptop. All the tests that I've managed to invent (including the idempotency test) work as I expect them to. However, any aspect of my code can be changed in a heartbeat if I realize that it's suspect.

Here's an example of the documents that I have in my table's "jsonb" column.

{
  "isbn"    : "978-0-14-303809-2",
  "title"   : "Joy Luck Club",
  "year"    : 2006, 
  "authors" : [
                {"given name": "Amy", "family name" : "Tan"}
              ],
  "genre"   : "Novel"
}

The documents are very definitely supposed to adhere to a JSON Schema.

  https://json-schema.org/

But I don't need the formality of JSON Schema's notation for such a simple case. Prose will do fine.

/* ————— START OF SPEC —————————————————————————————— */

The document's top-level object may use only these keys:

"isbn" — string
values must be unique across the entire set of documents (in other words, it defines the unique business key); values must have this pattern:

  « ^[0-9]{3}-[0-9]{1}-[0-9]{2}-[0-9]{6}-[0-9]{1}$ »

"title" — string

"year" — number
must be a positive integral value

"authors" — array of objects;
must be at least one object

"genre" — string

Each object in the "authors" array object may use only these keys:

"family name" — string

"given name" — string

String values other than for "isbn" are unconstrained.

Any key other than the seven listed here is illegal. The "genre" and "given name" keys are not required. All the other keys are required.

The meaning of *required* is that no extracted value must bring a SQL null (so a required key must not have a JSON null value).

And the meaning of *not required* is simply "no information is available for this key" (with no nuances). The spec author goes further by adding a rule: this meaning must be expressed by the absence of such a key.

/* ————— END OF SPEC ———————————————————————————————— */

The rule that « "some key": null » is not allowed brings the benefit that a document can be maximally terse. Moreover, it helps document authors by removing the need to decide which locution to use; and it helps programmers to write code to check that incoming documents adhere to the spec and then to extract their meaning (for example to a classic relational representation).

It's easy to see that the information content implies these business rules:

—Each book must have at least one author. Each author may be among the authors of one or several books.

—Each book may be of exactly one (known) genre. Each genre may classify one or several books.

My code does this:

—Ingests the input JSON documents into a "source(k... primary key, book_info jsonb)" table.

—Shreds the books facts into a classic Codd-and-Date relational representation (with the obvious tables "books", "genres", "authors", and "books_authors" with the usual PK and FK constraints.

—Transforms the aggregated set of facts for each book back to a set of JSON documents for transport to a different system. Critically, these must adhere to the same JSON Schema (and rules) that govern the incoming documents.

This is where the idempotency requirement that I mentioned above comes from:

  JSON → relational → JSON

It's been suggested that this is an ignoble and unattainable goal. I disagree—on both counts.

I implemented two complementary functions:

—"no_null_keys()" checks that a "jsonb" value has no occurrences of « "some key": null »

—"strip_null_keys()" removes « "some key": null » occurrences from a "jsonb" value

The code checks with "no_null_keys()" that, as expected, no ingested JSON document has an occurrence of « "some key": null ».

And it uses "strip_null_keys()" on the output of "to_jsonb()" — and, as appropriate, any other built-in JSON function that produces a "jsonb" value.

It was straightforward to implement these two functions by using REGEXP built-in functionality on the canonically formatted "text" value produced by the "jsonb::text" typecast.

The check on the incoming documents is included in the "j_books_book_info_is_conformant(jsonb)" function that is the basis of a constraint that's created on the "source" table's "book_info" column.

More code is needed to implement other constraints like, for example, the value of the "isbn" (string) key must satisfy a particular regular _expression_ and the value of the "year" (number) key must convert to a positive integer (You've heard about that test already.)

These tests, too, (and other tests) are included in the "j_books_book_info_is_conformant(jsonb)" function.

Critically, one test uses "jsonb_object_keys()" to scan the top-level object to ensure that all the required keys are present, that every key has the specified JSON data type, and that no keys that the JSON Schema doesn't mention are present. A similar test does the same for the "authors" array. This is why I can be sure that the native ->> and -> operators are sufficient for my purpose.

I considered using the "j_books_book_info_is_conformant(jsonb)" function in a constraint for a domain based on "jsonb" but decided against that.