Re: hashjoins, index loops to retrieve pk/ux constrains in pg12

Arturas Mazeika <mazeika@xxxxxxxxx> · Wed, 29 Sep 2021 15:05:36 +0200

Hi Tom,

I agree that the query needs to be correct first and fast second. I also agree that this query works only if there are no duplicate names across schemas (if one creates a table with the same table, index, and constraint names in a different schema, it would not work). Those assumptions do hold on our customer systems. We use intermediate liquibase scripts to keep track of our database (schema) changes, and those scripts fire queries like the one above, i.e., we cannot directly influence what the query looks like.

Given these very hard constraints (i.e., the query is formulated using information_schema rather than querying the catalogs directly), is it possible to assess why the hash-join plan is chosen? At the end of the day, the I/O block hit rate of this query with hash joins is 3-4 orders of magnitude higher than with sort/index joins. Is there anything one can do on the configuration side to avoid such hash-join pitfalls?

Cheers,
Arturas

On Tue, Sep 28, 2021 at 4:13 PM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Arturas Mazeika <mazeika@xxxxxxxxx> writes:

> Thanks a lot for having a look at the query once again in more detail. In
> short, you are right, I fired the liquibase scripts and observed the exact
> query that was hanging in pg_stat_activity. The query was:

> SELECT
>       FK.TABLE_NAME       as "TABLE_NAME"
>       , CU.COLUMN_NAME    as "COLUMN_NAME"
>       , PK.TABLE_NAME     as "REFERENCED_TABLE_NAME"
>       , PT.COLUMN_NAME    as "REFERENCED_COLUMN_NAME"
>       , C.CONSTRAINT_NAME as "CONSTRAINT_NAME"
> FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS C
> INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS FK
>       ON C.CONSTRAINT_NAME = FK.CONSTRAINT_NAME
> INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS PK
>       ON C.UNIQUE_CONSTRAINT_NAME = PK.CONSTRAINT_NAME
> INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE CU
>       ON C.CONSTRAINT_NAME = CU.CONSTRAINT_NAME
> INNER JOIN (
>       SELECT
>               i1.TABLE_NAME
>               , i2.COLUMN_NAME
>       FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS i1
>       INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE i2
>               ON i1.CONSTRAINT_NAME = i2.CONSTRAINT_NAME
>       WHERE i1.CONSTRAINT_TYPE = 'PRIMARY KEY'
> ) PT ON PT.TABLE_NAME = PK.TABLE_NAME
> WHERE lower(FK.TABLE_NAME)='secrole_condcollection'

TBH, before worrying about performance you should be worrying about
correctness.  constraint_name alone is not a sufficient join key
for these tables, so who's to say whether you're even getting the
right answers?

Per SQL spec, the join key to use is probably constraint_catalog
plus constraint_schema plus constraint_name.  You might say you
don't need to compare constraint_catalog because that's fixed
within any one Postgres database, and that observation would be
correct.  But you can't ignore the schema.

What's worse, the SQL-spec join keys are based on the assumption that
constraint names are unique within schemas, which is not enforced in
Postgres.  Maybe you're all right here, because you're only looking
at primary key constraints, which are associated with indexes, which
being relations do indeed have unique-within-schema names.  But you
still can't ignore the schema.
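
As a minimal sketch of the schema-aware join keys described above (an illustration only, not the query liquibase actually issues), the original statement could be rewritten roughly as follows. It keeps the structure of the quoted query and adds only the schema comparisons; constraint_catalog is left out because it is fixed within one database. The column names are those of the standard information_schema views. The result is still only as reliable as the assumption that constraint names are unique within a schema.

```sql
SELECT fk.table_name     AS "TABLE_NAME",
       cu.column_name    AS "COLUMN_NAME",
       pk.table_name     AS "REFERENCED_TABLE_NAME",
       pt.column_name    AS "REFERENCED_COLUMN_NAME",
       c.constraint_name AS "CONSTRAINT_NAME"
FROM information_schema.referential_constraints c
-- The foreign key constraint itself: match schema and name.
JOIN information_schema.table_constraints fk
  ON fk.constraint_schema = c.constraint_schema
 AND fk.constraint_name   = c.constraint_name
-- The referenced unique/primary key constraint: note the unique_* columns.
JOIN information_schema.table_constraints pk
  ON pk.constraint_schema = c.unique_constraint_schema
 AND pk.constraint_name   = c.unique_constraint_name
JOIN information_schema.key_column_usage cu
  ON cu.constraint_schema = c.constraint_schema
 AND cu.constraint_name   = c.constraint_name
JOIN (
    SELECT i1.table_schema, i1.table_name, i2.column_name
    FROM information_schema.table_constraints i1
    JOIN information_schema.key_column_usage i2
      ON i2.constraint_schema = i1.constraint_schema
     AND i2.constraint_name   = i1.constraint_name
    WHERE i1.constraint_type = 'PRIMARY KEY'
) pt
  ON pt.table_schema = pk.table_schema
 AND pt.table_name   = pk.table_name
WHERE lower(fk.table_name) = 'secrole_condcollection';
```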

On the whole I don't think you're buying anything by going through
the SQL-spec information views, because this query is clearly pretty
dependent on Postgres-specific assumptions even if it looks like it's
portable.  And you're definitely giving up a lot of performance, since
those views have so many complications from trying to map the spec's
view of what's-a-constraint onto the Postgres objects (not to mention
the spec's arbitrary opinions about which objects you're allowed to
see).  This query would probably be simpler, more correct, and a
lot faster if rewritten to query the Postgres catalogs directly.
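
A rough sketch of what such a catalog-based query might look like, assuming the goal is to list the foreign key columns of one table together with the columns they reference. The schema name 'public' is an assumption for illustration. Unlike the information_schema version, this pairs each referencing column with its referenced column by position, and it also reports foreign keys that point at a unique constraint rather than a primary key, so the output is not row-for-row identical to the original query.

```sql
SELECT fk_tbl.relname AS "TABLE_NAME",
       fk_col.attname AS "COLUMN_NAME",
       pk_tbl.relname AS "REFERENCED_TABLE_NAME",
       pk_col.attname AS "REFERENCED_COLUMN_NAME",
       con.conname    AS "CONSTRAINT_NAME"
FROM pg_catalog.pg_constraint con
JOIN pg_catalog.pg_class     fk_tbl ON fk_tbl.oid = con.conrelid
JOIN pg_catalog.pg_namespace fk_ns  ON fk_ns.oid  = fk_tbl.relnamespace
JOIN pg_catalog.pg_class     pk_tbl ON pk_tbl.oid = con.confrelid
-- Pair the n-th referencing column with the n-th referenced column.
CROSS JOIN LATERAL unnest(con.conkey, con.confkey) AS k(fk_attnum, pk_attnum)
JOIN pg_catalog.pg_attribute fk_col
  ON fk_col.attrelid = con.conrelid
 AND fk_col.attnum   = k.fk_attnum
JOIN pg_catalog.pg_attribute pk_col
  ON pk_col.attrelid = con.confrelid
 AND pk_col.attnum   = k.pk_attnum
WHERE con.contype = 'f'                 -- foreign key constraints only
  AND fk_ns.nspname = 'public'          -- assumed schema; adjust as needed
  AND lower(fk_tbl.relname) = 'secrole_condcollection';
```

Because the joins here go through OIDs rather than names, the duplicate-name problem across schemas does not arise.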

                        regards, tom lane