ETL - sql orchestrator is stuck when there is not sleep() between queries

Allan Barrielle <allan.barrielle@xxxxxxxxx> · Thu, 8 Jul 2021 01:00:28 +0200

Hello guys, I'm facing a problem. Currently I'm working on a Data transformation Pipeline on Postgres. The strategy is,

We select every tables in a given schema ( 50 tables ), we apply some case when, translation, enum and load it into a different new schema with a CREATE TABLE SCHEMA_2.table_1 AS SELECT * FROM SCHEMA_1.TABLE_1, then we do it again about 3 more times and everytime it’s a new schema, new table. We only keep and don’t drop the schema1. 

To orchestrate the whole, we've got a bunch of .sql files that we run by using psql directly. That's our "strategy".

So we're copying a lot of data, but it allows us to debug, and investigate business bugs, because we can plug us into schema 2,3 and search why it's an issue.

All is fine, and can work great.
But sometimes, some queries that used to take about 20 secs to complete can suddenly end in 5mins.
 Important all queries have the same shape -> CREATE TABLE SELECT AS *(a bit of transform) FROM TABLE). No update, nothing, it’s dead simple.

We are just trying to copy a table from schema1, to schema2, to schema3 and finally schema3. That’s it.
The thing to understand here is schema2, schema3 are dropped at every pipeline transformation, so everytime we run the script, it drops everything from schema2 to the final stage.

We tuned the config a little bit, and we tried kind of everything ( synchronous_commit, wal, vacuum )
Nothing works, it’s very random, some query won’t simply work ( even after hours ).

We use different machines, different config, and different datasets.

The only thing that makes it work every time, in 100% cases, is to put a sleep(10sec) between each schema.
So we select 50 tables, we create a new schema with it, then we sleep 10 sec then we do again the same query but with the freshly created schema and we create a third schema, sleep 10s and again..

And that makes the whole pipeline successful each time.

So, It seems it's a background process inside postgres, that should ingest a lot of data, and we have to give him time to take a rest, like a bg_writers  or something else ? 
I disabled autovacuum=off . Same.
Why does the query never end even after hours ? Why there is no log about where the query is stuck.
To be clear, if I kill the stuck query and run again it will work.

I don't know much about what's going on inside Postgres, which randomly takes a lot of time, with the same code, same data.

 PostgreSQL 12.4 (Debian 12.4-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

Thank you so much for your time..