Hi all,
I can reproduce the error in the subject from time to time when querying catalog tables while DDL is happening concurrently. Here's a bash script that reproduces it (not always, you might have to run it many times until you see ERROR: could not open relation with OID XXXX):
#!/usr/bin/env bash
psql -c "create table test(x serial primary key); select oid, relname from pg_class where relname='test'"
# The next two queries will run concurrently
psql -c "select oid, relname, pg_sleep(3), pg_get_indexdef(oid) from pg_class join pg_index on indexrelid=pg_class.oid WHERE relname='test_pkey';" 2>&1 1>/tmp/pgbug.log &
sleep 1
psql -c "drop table test"
cat /tmp/pgbug.log
wait
psql -c "create table test(x serial primary key); select oid, relname from pg_class where relname='test'"
# The next two queries will run concurrently
psql -c "select oid, relname, pg_sleep(3), pg_get_indexdef(oid) from pg_class join pg_index on indexrelid=pg_class.oid WHERE relname='test_pkey';" 2>&1 1>/tmp/pgbug.log &
sleep 1
psql -c "drop table test"
cat /tmp/pgbug.log
wait
I am confused as to how this is possible. I assume if the row with the test_pkey index exists in the pg_index catalog table, that the snapshot of the catalog tables contains the test table itself and is generally consistent, so querying the catalog should not run into such errors.
I've seen this happen in Production without pg_sleep in the mix, too. I added pg_sleep to the example above only because it makes the error easier to reproduce.
Is there something I can do to avoid this? Is my understanding of how the catalog tables work wrong?
Thanks,
Marcelo.