Hi Tom,
I agree that the query needs to be correct first and fast second. I also agree that it works only if names are unique across schemas: if someone creates a table with the same table, index, and constraint names in a different schema, it breaks. That assumption does hold on our customer systems. We use intermediate liquibase scripts to keep track of our database (schema) changes, and those scripts fire queries like the one mentioned above, so we cannot directly influence what the query looks like.
Given these very hard constraints (i.e., the query is formulated against information_schema rather than the Postgres catalogs directly), is it possible to assess why the hash join plan is chosen? At the end of the day, the I/O block hits of this query with hash joins are 3-4 orders of magnitude higher than with sort/index joins. Is there anything one can do on the configuration side to avoid such hash-join pitfalls?
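For reference, a minimal sketch of how the two plan shapes can be compared within one session. It assumes the standard planner setting enable_hashjoin and uses a shortened two-view stand-in for the liquibase query, so the buffer numbers it reports will not match the full query:

```sql
-- Sketch only: compare buffer usage with and without hash joins.
-- The SELECT is a shortened stand-in for the full liquibase query.
BEGIN;

-- Plan as chosen by default.
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.constraint_name
FROM information_schema.referential_constraints AS c
JOIN information_schema.table_constraints AS fk
  ON c.constraint_name = fk.constraint_name
WHERE lower(fk.table_name) = 'secrole_condcollection';

-- Discourage hash joins for the rest of this transaction only.
SET LOCAL enable_hashjoin = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT c.constraint_name
FROM information_schema.referential_constraints AS c
JOIN information_schema.table_constraints AS fk
  ON c.constraint_name = fk.constraint_name
WHERE lower(fk.table_name) = 'secrole_condcollection';

ROLLBACK;
```

SET LOCAL keeps the change inside the transaction, so nothing leaks into other sessions; whether disabling hash joins more widely is a sensible configuration is exactly what I am asking.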
Cheers,
Arturas
On Tue, Sep 28, 2021 at 4:13 PM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Arturas Mazeika <mazeika@xxxxxxxxx> writes:
> Thanks a lot for having a look at the query once again in more detail. In
> short, you are right, I fired the liquibase scripts and observed the exact
> query that was hanging in pg_stat_activity. The query was:
> SELECT
> FK.TABLE_NAME as "TABLE_NAME"
> , CU.COLUMN_NAME as "COLUMN_NAME"
> , PK.TABLE_NAME as "REFERENCED_TABLE_NAME"
> , PT.COLUMN_NAME as "REFERENCED_COLUMN_NAME"
> , C.CONSTRAINT_NAME as "CONSTRAINT_NAME"
> FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS C
> INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS FK ON
> C.CONSTRAINT_NAME = FK.CONSTRAINT_NAME
> INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS PK ON
> C.UNIQUE_CONSTRAINT_NAME = PK.CONSTRAINT_NAME
> INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE CU ON C.CONSTRAINT_NAME
> = CU.CONSTRAINT_NAME
> INNER JOIN (
> SELECT
> i1.TABLE_NAME
> , i2.COLUMN_NAME
> FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS i1
> INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE i2 ON
> i1.CONSTRAINT_NAME = i2.CONSTRAINT_NAME
> WHERE i1.CONSTRAINT_TYPE = 'PRIMARY KEY'
> ) PT ON PT.TABLE_NAME = PK.TABLE_NAME WHERE
> lower(FK.TABLE_NAME)='secrole_condcollection'
TBH, before worrying about performance you should be worrying about
correctness. constraint_name alone is not a sufficient join key
for these tables, so who's to say whether you're even getting the
right answers?
Per SQL spec, the join key to use is probably constraint_catalog
plus constraint_schema plus constraint_name. You might say you
don't need to compare constraint_catalog because that's fixed
within any one Postgres database, and that observation would be
correct. But you can't ignore the schema.
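As a rough sketch of joins that do not ignore the schema (outer part of the query only; the PT subquery is left out here and would need the same treatment, and constraint_catalog is skipped for the reason just given):

```sql
-- Sketch: join on (constraint_schema, constraint_name) rather than
-- on constraint_name alone. The PT subquery of the original query
-- is omitted for brevity.
SELECT
    fk.table_name     AS "TABLE_NAME",
    cu.column_name    AS "COLUMN_NAME",
    pk.table_name     AS "REFERENCED_TABLE_NAME",
    c.constraint_name AS "CONSTRAINT_NAME"
FROM information_schema.referential_constraints AS c
JOIN information_schema.table_constraints AS fk
  ON  fk.constraint_schema = c.constraint_schema
  AND fk.constraint_name   = c.constraint_name
JOIN information_schema.table_constraints AS pk
  ON  pk.constraint_schema = c.unique_constraint_schema
  AND pk.constraint_name   = c.unique_constraint_name
JOIN information_schema.key_column_usage AS cu
  ON  cu.constraint_schema = c.constraint_schema
  AND cu.constraint_name   = c.constraint_name
WHERE lower(fk.table_name) = 'secrole_condcollection';
```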
What's worse, the SQL-spec join keys are based on the assumption that
constraint names are unique within schemas, which is not enforced in
Postgres. Maybe you're all right here, because you're only looking
at primary key constraints, which are associated with indexes, which
being relations do indeed have unique-within-schema names. But you
still can't ignore the schema.
On the whole I don't think you're buying anything by going through
the SQL-spec information views, because this query is clearly pretty
dependent on Postgres-specific assumptions even if it looks like it's
portable. And you're definitely giving up a lot of performance, since
those views have so many complications from trying to map the spec's
view of whats-a-constraint onto the Postgres objects (not to mention
the spec's arbitrary opinions about which objects you're allowed to
see). This query would probably be simpler, more correct, and a
lot faster if rewritten to query the Postgres catalogs directly.
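A minimal sketch of what such a catalog-based query might look like, under some assumptions: it pairs each referencing column with its referenced column by position (which is not exactly what the PT subquery above returns), it does not filter by schema, and it applies none of the information_schema visibility rules:

```sql
-- Sketch: foreign keys of one table, read from pg_constraint directly.
-- Assumption: positional pairing of conkey/confkey is the intended
-- column mapping. To restrict to one schema, join pg_namespace on
-- fk_tbl.relnamespace.
SELECT
    fk_tbl.relname AS "TABLE_NAME",
    fk_col.attname AS "COLUMN_NAME",
    pk_tbl.relname AS "REFERENCED_TABLE_NAME",
    pk_col.attname AS "REFERENCED_COLUMN_NAME",
    con.conname    AS "CONSTRAINT_NAME"
FROM pg_catalog.pg_constraint AS con
JOIN pg_catalog.pg_class AS fk_tbl ON fk_tbl.oid = con.conrelid
JOIN pg_catalog.pg_class AS pk_tbl ON pk_tbl.oid = con.confrelid
-- Pair the i-th referencing column with the i-th referenced column.
CROSS JOIN LATERAL unnest(con.conkey, con.confkey)
     AS k(fk_attnum, pk_attnum)
JOIN pg_catalog.pg_attribute AS fk_col
  ON  fk_col.attrelid = con.conrelid
  AND fk_col.attnum   = k.fk_attnum
JOIN pg_catalog.pg_attribute AS pk_col
  ON  pk_col.attrelid = con.confrelid
  AND pk_col.attnum   = k.pk_attnum
WHERE con.contype = 'f'   -- foreign-key constraints only
  AND lower(fk_tbl.relname) = 'secrole_condcollection';
```

Because the joins go through OIDs rather than names, the duplicate-name problem described above does not arise in this form.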
regards, tom lane