Hi,
What I'm trying to do:
- I have 3rd party DB (any type Postgres, Oracle...) I have no control over it. Let's call it REMOTE_DB
- I have my own Postgres instance 13.2. I fully control it. Let's call it MY_PG
- REMOTE_DB has a 20-200M records table
- I need from time to time to fetch 100K - 1M records from REMOTE_DB, insert into MY_PG and then do sophisticated joins to produce some analytical results. Let's call it REMOTE_DB_FETCH
- REMOTE_DB_FETCH saves records to MY_PG
- MY_PG has table that mimics schema of source table stored in REMOTE_DB. It mimics PK too.
- there are concurrent processes that run REMOTE_DB_FETCH to insert data into MY_PG
Long story short:
- I cache subset of REMOTE_DB at MY_PG
- Do analytics in MY_PG
What I've tried to far
- Concurrent REMOTE_DB_FETCH work with UPSERT "on conflict do nothing"
- I consider to disable vacuum too: https://gist.github.com/valyala/ae3cbfa4104f1a022a2af9b8656b1131
What else can I try in order to UPSERT from REMOTE_DB to MY_PG faster from concurrent REMOTE_DB_FETCH processes?
I've created isolated test to try out performance. Inserting around 48K rows into a table with 150 fields with only PK constraint on to fields: varchar and bigint.
Table is unlogged
isUnlogged: [{relpersistence=u, relname=cache_1694}, {relpersistence=u, relname=cache_1694_pkey}]
sql
INSERT INTO "cache_1694" ("varchar_column", "bigint_column", "another_column" /* more column 150 in total */) VALUES (?, ?, ? /**/)
ON CONFLICT ("varchar_column", "bigint_column") DO NOTHING;
It took 30 sec to insert 48819 records. looks slow, what can I try else? I'm running postgres on my Mac Pro with an SSD disk. Here are the settings:
postgres:
image: postgres:13.2
volumes:
- ~/pgdata:/var/lib/postgresql/data
ports:
- "5432:5432"
environment:
- POSTGRES_USER=pguser
- POSTGRES_PASSWORD=pguser
- POSTGRES_DB=pgdb
- PGDATA=/var/lib/postgresql/data/pgdata
command:
- "postgres"
- "-c"
- "max_connections=50"
- "-c"
- "shared_buffers=1GB"
- "-c"
- "effective_cache_size=1GB"
- "-c"
- "synchronous_commit=off"
SELECT * FROM pg_settings where pg_settings.name like '%commi%'
shows
{
"name": "synchronous_commit",
"setting": "off"
}
]
Strange thing is that LOGGED table performance is not much greater than UNLOGGED. Same 30 sec for 48819 inserted duplicated rows