Re: working around JSONB's lack of stats?

Jim Nasby <Jim.Nasby@xxxxxxxxxxxxxx> · Fri, 30 Jan 2015 19:34:40 -0600

On 1/30/15 2:26 PM, Josh Berkus wrote:
On 01/28/2015 03:50 PM, Peter Geoghegan wrote:
On Wed, Jan 28, 2015 at 3:42 PM, Josh Berkus <josh@xxxxxxxxxxxx> wrote:
jsonb_col @> '[ "key1" ]'
or jsonb_col ? 'key1'
         if in MCE, assign % from MCE
         otherwise assign 1% of non-MCE %

jsonb_col @> '{ "key1": "value1" }'
         if in MCE, assign MCE% * 0.1
         otherwise assign 0.01 of non-MCE %

Does that make sense?

I suspect it makes a lot less sense. The way people seem to want to
use jsonb is as a document store with a bit of flexibility. Individual
JSON documents tend to be fairly homogeneous in structure within a
table, just like with systems like MongoDB. Strings within arrays are
keys for our purposes, and these are often used for tags and so on.
But Strings that are the key of an object/pair are much less useful to
index, in my estimation.

Yeah, I see your point; except for arrays, people are usually searching
for a key:value pair, and the existence of the key is not in doubt.

That would make the "element" the key:value pair, no?  But
realistically, we would only want to do that for simple keys and values.

Although: if you "flatten" a nested JSON structure into just keys with
scalar values (and array items as their own thing), then you could have
a series of expanded key:value pairs to put into MCE.

For example:

{ house : { city : San Francisco,
      sqft: 1200,
      color: blue,
      occupants: [ mom, dad, child1 ]
      }
   occupation: programmer
}

... would get flattened out into the following pairs:

city: san francisco
sqft: 1200
color: blue
occupants: [ mom ]
occupants: [ dad ]
occupants: [ child1 ]
occupation: programmer

This would probably work because there aren't a lot of data structures
where people would have the same key:value pair in different locations
in the JSON, and care about it stats-wise.  Alternatetly, if the same
key-value pair appears multiple times in the same sample row, we could
cut the MC% by that multiple.

Even if there were multiple occurrences, this would probably still be an 
improvement.

Another idea... at one time in the past when discussing statistics on 
multiple columns, one idea was to build statistics on indexes. If we 
built that, we could also do the same thing for at least JSONB (not sure 
about JSON). Obviously doesn't help for stuff you haven't indexed, but 
presumably if you care about performance and have any significant size 
of data you've also indexed parts of the JSON, yes?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance