On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby <pryzby@xxxxxxxxxxxxx> wrote:
On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote:
> I think the non-extended stats code also has trouble with booleans.
> pg_stats gives me a correlation of 0.8 or higher for the flag column.
It's not due to the boolean though; you see the same thing if you do:
CREATE INDEX aaa_f ON aaa((flag::text));
ANALYZE aaa;
correlation | 0.81193
or:
ALTER TABLE aaa ADD flag2 int; UPDATE aaa SET flag2= flag::int
correlation | 0.81193
I think it's caused by having so few (2) values to correlate.
most_common_vals | {f,t}
most_common_freqs | {0.9014,0.0986}
correlation | 0.822792
It thinks there's somewhat-high correlation since it gets a list of x and y
values (integer positions by logical and physical sort order) and 90% of the x
list (logical value) are the same value ('t'), and the CTIDs are in order on
the new index, so 90% of the values are 100% correlated.
But there is no index involved (except in the case of the functional index). The correlation of table columns to physical order of the table doesn't depend on the existence of an index, or the physical order within an index.
But I do see that ties within the logical order of the column values are broken to agree with the physical order. That is wrong, right? Is there any argument that this is desirable?
It looks like it could be fixed with a few extra double calcs per distinct value. Considering we already sorted the sample values using SQL-callable collation dependent comparators, I doubt a few C-level double calcs is going to be meaningful.
Cheers,
Jeff