On 29.1.2015 00:03, Josh Berkus wrote:
> On 01/28/2015 11:48 AM, Tomas Vondra wrote:
>> On 27.1.2015 08:06, Josh Berkus wrote:
>>> Folks,
>>>
>> ...
>>>
>>> On a normal column, I'd raise n_distinct to reflect the higher
>>> selectivity of the search terms. However, since @> uses contsel,
>>> n_distinct is ignored. Anyone know a clever workaround I don't
>>> currently see?
>>
>> I don't see any reasonable workaround :-(
>>
>> ISTM we'll have to invent a way to collect useful stats about contents
>> of JSON/JSONB documents. JSONB is cool, but at the moment we're mostly
>> relying on defaults that may be reasonable, but still misfire in many
>> cases. Do we have any ideas of how that might work?
>>
>> We're already collecting stats about contents of arrays, and maybe we
>> could do something similar for JSONB? The nested nature of JSON makes
>> that rather incompatible with the flat MCV/histogram stats, though.
>
> Well, I was thinking about this.
>
> We already have most_common_elem (MCE) for arrays and tsearch. What
> if we put JSONB's most common top-level keys (or array elements,
> depending) in the MCE array? Then we could still apply a simple rule
> for any path criteria below the top-level keys, say assuming that any
> sub-key criteria would match 10% of the time. While it wouldn't be
> perfect, it would be better than what we have now.

So how would that work with your 'tags' example? ISTM most of your
documents have 'tags' as a top-level key, so that key would end up in
the MCV list. But there's no info about the elements of the 'tags'
array (hence the 10% default, which may work in this particular case,
but it's hardly a general solution and I doubt it's somehow superior to
the defaults we're using right now).

I think a 'proper' solution to JSONB stats needs to somehow reflect the
nested structure. What I was thinking about is tracking MCV for
"complete paths", i.e.
for a document:

    { "keyA" : { "keyB" : "x",
                 "keyC" : "z" },
      "keyD" : [1, 2, 3, 4] }

we'd extract three paths

    "keyA.keyB"
    "keyA.keyC"
    "keyD"

and aggregate that over all the documents to select the MCV paths. And
then, for each of those MCV paths, track the most common values. ISTM
this would allow better estimations, but it has issues too:

Firstly, it does not match the MCV structure, because it requires
storing (a) MCV paths and (b) MCV values for those paths. Moreover, (b)
probably stores different data types (some values are strings, some
integers, etc.). Arrays might be handled just like regular arrays, i.e.
by tracking stats of the elements, but it's still mixed data types.

Secondly, I think it's based on the assumption of independence (i.e.
that the occurrence of one path does not depend on the occurrence of a
different path in the same document). Same for values x paths. Which
may or may not be true - it's essentially the same as the assumption of
independence for predicates on multiple columns. While I do have ideas
on how to approach this in the multi-column case, handling this for
JSONB is going to be much more complex, I think.

But the first question (what stats to collect and how to store them) is
the most important at this point, I guess.

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
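[Editor's note: the "complete paths" scheme sketched in the message above can be illustrated with a small Python mock-up. This is purely illustrative - the sample documents and all function names are invented for the example, arrays are treated as leaves (their elements counted individually, analogous to the existing element stats for arrays), and mixed-type values are folded to text, which is one possible answer to the "different data types" issue raised above.]

```python
from collections import Counter

def extract_paths(doc, prefix=""):
    """Recursively yield (complete_path, leaf_value) pairs for a JSON
    document. Nested objects extend the dotted path; arrays and scalars
    are leaves, as in the scheme described above."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from extract_paths(value, path)
        else:
            yield path, value

# Hypothetical sample of JSONB documents (not from the original thread).
docs = [
    {"keyA": {"keyB": "x", "keyC": "z"}, "keyD": [1, 2, 3, 4]},
    {"keyA": {"keyB": "x"}, "tags": ["postgres", "jsonb"]},
    {"keyA": {"keyB": "y"}, "tags": ["postgres"]},
]

# (a) aggregate complete paths over all documents -> candidate MCV paths
path_counts = Counter(p for d in docs for p, _ in extract_paths(d))

# (b) per-path value frequencies; array elements are counted one by one,
# and every value is folded to its repr() so mixed types share one counter
value_counts = {}
for d in docs:
    for p, v in extract_paths(d):
        bucket = value_counts.setdefault(p, Counter())
        if isinstance(v, list):
            bucket.update(map(repr, v))
        else:
            bucket[repr(v)] += 1

print(path_counts.most_common(2))          # e.g. keyA.keyB in all 3 docs
print(value_counts["keyA.keyB"].most_common(1))
```

A real implementation inside ANALYZE would of course sample rows and cap both counters (as the existing MCV/MCE machinery does), but the two-level structure - MCV paths, then MCV values per path - is the part that doesn't fit the current flat stats format.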