Thank you very much for this, Tom. And thanks again to you, David, for your earlier input. I hope that it's clear that the purpose of my questions is to discover what I'm missing, both w.r.t. actual semantics and w.r.t. the use cases that motivate PostgreSQL's functionality. Sorry if, contrary to my intention, my questions (here and on other topics) might seem to challenge established wisdom and precedent.

I delayed my reply until I'd had time to think, to do some study, and (most importantly) to implement a complete, self-contained proof-of-concept to substantiate my conclusion. I'm ready, now, to report back.

Summary: I can meet my goal by using PG's native functionality appropriately. So, w.r.t. this email's subject, this:

select strip_null_keys('{"x": 42, "y": null}'::jsonb) = '{"x": 42}'::jsonb;

returns "true".

The point at issue is whether the presence of « "some key": null » is different from the absence of "some key". You (all) have argued that the two locutions, which I had wanted to see as expressing one notion, are indeed different. Well, yes, of course they are. This is tautologically true if you think only of the Unicode text of a JSON document that's so far waiting to be ingested by who knows what system. And you pointed out that, in PostgreSQL,

my_doc ? 'some key'

detects the difference. You've convinced me that some use cases will care about this; in other words, it's a distinction *with* a difference. But, in my use case, the two distinct locutions bring no semantic difference.
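For the record, the difference is directly observable with the native "?" operator (a minimal illustration, not taken from my proof-of-concept):

```sql
-- "?" reports key presence, regardless of whether the value is JSON null.
select
  '{"x": 42, "y": null}'::jsonb ? 'y' as with_null_key,  -- true
  '{"x": 42}'::jsonb            ? 'y' as without_key;    -- false
```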
But the actual distinction causes me a problem when I try to prove that this pair of transformations is idempotent:

JSON → relational → JSON

That problem is behind me now. So (if you've read this far at all) you can stop now, unless you're interested in reading further.

Before getting to my use case: I noted that Tom said this about the fact that using ->> to read a non-existent key returns a SQL null:

« [This] is hard to justify from a purist semantic point of view, but having the operator throw an error in such cases would make it close to unusable on not-uniformly-structured data. »

It's easy to fix this, when the app requires strictness, by implementing a user-defined operator pair, say +>> and its partner. I did this (to respect what my use case needs) and it passed all my tests. However, it turned out, for reasons that I explain below, that I didn't need it. The operator's implementation function reads the value with the appropriate native operator; only if that returns SQL null (or JSON null) does it do more testing. First it checks, with the ? operator, whether the key is absent, raising an error if it is. Then it reads the value again with the native -> operator and raises an error if it gets a JSON null.

————————————————————

My use case

I should stress that what I have, so far, is simply a demo implementation on my laptop. All the tests that I've managed to invent (including the idempotency test) work as I expect them to. However, any aspect of my code can be changed in a heartbeat if I realize that it's suspect.

Here's an example of the documents that I have in my table's "jsonb" column:

{
  "isbn"    : "978-0-14-303809-2",
  "title"   : "Joy Luck Club",
  "year"    : 2006,
  "authors" : [ {"given name": "Amy", "family name" : "Tan"} ],
  "genre"   : "Novel"
}

The documents are very definitely supposed to adhere to a JSON Schema. But I don't need the formality of JSON Schema's notation for such a simple case. Prose will do fine.
/* ————— START OF SPEC —————————————————————————————— */

The document's top-level object may use only these keys:

"isbn" — string; values must be unique across the entire set of documents (in other words, "isbn" defines the unique business key); values must have this pattern:
« ^[0-9]{3}-[0-9]{1}-[0-9]{2}-[0-9]{6}-[0-9]{1}$ »

"title" — string

"year" — number; must be a positive integral value

"authors" — array of objects; must contain at least one object

"genre" — string

Each object in the "authors" array may use only these keys:

"family name" — string

"given name" — string

String values other than for "isbn" are unconstrained. Any key other than the seven listed here is illegal.

The "genre" and "given name" keys are not required. All the other keys are required. The meaning of *required* is that no extracted value must bring a SQL null (so a required key must not have a JSON null value). And the meaning of *not required* is simply "no information is available for this key" (with no nuances). The spec author goes further by adding a rule: this meaning must be expressed by the absence of such a key.

/* ————— END OF SPEC ———————————————————————————————— */

The rule that « "some key": null » is not allowed brings the benefit that a document can be maximally terse. Moreover, it helps document authors by removing the need to decide which locution to use; and it helps programmers who write the code that checks that incoming documents adhere to the spec and then extracts their meaning (for example, to a classic relational representation).

It's easy to see that the information content implies these business rules:

—Each book must have at least one author. Each author may be among the authors of one or several books.

—Each book may be of exactly one (known) genre. Each genre may classify one or several books.

My code does this:

—Ingests the input JSON documents into a "source(k... primary key, book_info jsonb)" table.
—Shreds the book facts into a classic Codd-and-Date relational representation (with the obvious tables "books", "genres", "authors", and "books_authors", with the usual PK and FK constraints).

—Transforms the aggregated set of facts for each book back to a set of JSON documents for transport to a different system. Critically, these must adhere to the same JSON Schema (and rules) that govern the incoming documents. This is where the idempotency requirement that I mentioned above comes from:

JSON → relational → JSON

It's been suggested that this is an ignoble and unattainable goal. I disagree, on both counts.

I implemented two complementary functions:

—"no_null_keys()" checks that a "jsonb" value has no occurrences of « "some key": null »

—"strip_null_keys()" removes « "some key": null » occurrences from a "jsonb" value

The code checks with "no_null_keys()" that, as expected, no ingested JSON document has an occurrence of « "some key": null ». And it uses "strip_null_keys()" on the output of "to_jsonb()" and, as appropriate, of any other built-in JSON function that produces a "jsonb" value. It was straightforward to implement these two functions by using the REGEXP built-in functionality on the canonically formatted "text" value produced by the "jsonb::text" typecast.

The check on the incoming documents is included in the "j_books_book_info_is_conformant(jsonb)" function that is the basis of a constraint created on the "source" table's "book_info" column. More code is needed to implement other constraints: for example, the value of the "isbn" (string) key must satisfy a particular regular expression, and the value of the "year" (number) key must convert to a positive integer. (You've heard about that test already.) These tests, too, (and other tests) are included in the "j_books_book_info_is_conformant(jsonb)" function.
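For illustration only (my real versions use regular expressions on the "jsonb::text" value, as I said, and this sketch handles only the top level of an object), the pair could alternatively be written with jsonb primitives:

```sql
-- A top-level-only sketch of the two helper functions.
create function no_null_keys(j in jsonb)
  returns boolean
  language sql
as $body$
  select not exists (
    select 1
    from jsonb_each(j) as kv(key, value)
    where kv.value = 'null'::jsonb);
$body$;

create function strip_null_keys(j in jsonb)
  returns jsonb
  language sql
as $body$
  -- "coalesce" handles the case where every key is stripped.
  select coalesce(jsonb_object_agg(kv.key, kv.value), '{}'::jsonb)
  from jsonb_each(j) as kv(key, value)
  where kv.value <> 'null'::jsonb;
$body$;
```

Worth noting: the built-in "jsonb_strip_nulls()" function removes null-valued object fields at every nesting level, so it may serve instead of a hand-written "strip_null_keys()" for some use cases.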
Critically, one test uses "jsonb_object_keys()" to scan the top-level object to ensure that all the required keys are present, that every key has the specified JSON data type, and that no keys that the JSON Schema doesn't mention are present. A similar test does the same for the "authors" array. This is why I can be sure that the native ->> and -> operators are sufficient for my purpose.

I considered using the "j_books_book_info_is_conformant(jsonb)" function in a constraint for a domain based on "jsonb" but decided against that.
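The shape of that key-scanning test is roughly this (a sketch with the key lists inlined and the per-key data-type checks omitted; the function name is just a placeholder):

```sql
-- A sketch of the top-level key check: only the five allowed keys may
-- appear, and the four required keys must all be present.
create function top_level_keys_ok(doc in jsonb)
  returns boolean
  language plpgsql
as $body$
declare
  allowed  constant text[] := array['isbn', 'title', 'year', 'authors', 'genre'];
  required constant text[] := array['isbn', 'title', 'year', 'authors'];
  key text;
begin
  for key in select jsonb_object_keys(doc) loop
    if not (key = any(allowed)) then
      return false;
    end if;
  end loop;
  return doc ?& required;  -- "?&": do ALL of these keys exist?
end;
$body$;
```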
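Finally, for completeness, here's the shape of the strict operator pair that I mentioned at the start. This is a sketch only (the operator name +>> and the implementation function's name are simply my choices), and, as I said, I ended up not needing it:

```sql
-- A strict variant of "->>": raises an error on an absent key or a
-- JSON null value, instead of returning SQL null.
create function strict_field_as_text(doc in jsonb, key in text)
  returns text
  language plpgsql
as $body$
declare
  t constant text := doc ->> key;  -- the cheap native read, done first
begin
  if t is null then
    if not (doc ? key) then
      raise exception 'key "%" is absent', key;
    elsif (doc -> key) = 'null'::jsonb then
      raise exception 'key "%" has a JSON null value', key;
    end if;
  end if;
  return t;
end;
$body$;

create operator +>> (
  leftarg  = jsonb,
  rightarg = text,
  function = strict_field_as_text);
```

So « select '{"a": 17}'::jsonb +>> 'a' » succeeds, while reading a missing key or a JSON null value raises an error rather than silently yielding SQL null.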