Hello list,
I am having performance problems inserting data into Postgres and would like to
ask for help figuring them out, as I have run out of ideas.
I have a process that generates a CSV file with 1 million records every 5
minutes; each file is about 240 MB. I need this data inserted into a table in
Postgres. One peculiarity is that the files may contain duplicates: a row in
the first file might also appear in the second, the third, and so on. I don't
care about the duplicates, so I have a unique constraint on my table to discard
them.
The data in the CSV is pretty simple:
```
user_id,name,url
```
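For illustration, a couple of rows (made-up values, not my real data) would look like this:
```
user_id,name,url
1001,Alice,https://example.com/alice
1002,Bob,https://example.com/bob
```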
The table is defined like this:
```
create unlogged table users_no_dups (
    created_ts timestamp without time zone,
    user_id bigint not null,
    name text,
    url text,
    unique (user_id)
);
```
The table is created as `unlogged` as a way to improve performance; I am aware that this can lead to data loss.
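Note that the `unique(user_id)` constraint means Postgres maintains an implicit unique btree index on `user_id`, roughly equivalent to the following (the name shown is the one Postgres generates automatically):
```
-- implicit index behind unique(user_id); the name is auto-generated by Postgres
create unique index users_no_dups_user_id_key on users_no_dups (user_id);
```
That index is what the `on conflict do nothing` in the insert step below relies on.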
My process for inserting data into the table is as follows:
* Create an intermediary table `users` as follows:
```
create unlogged table users (
    created_ts timestamp without time zone default current_timestamp,
    user_id bigint,
    name text,
    url text
) with (autovacuum_enabled = false, toast.autovacuum_enabled = false);
```
* Use `COPY` to load the data from the CSV file into the intermediary table:
```
copy users (user_id, name, url)
from 'myfile.csv'
with (format csv, header true, delimiter ',', quote '"', escape '\');
```
* Insert the data from the `users` table into the `users_no_dups` table:
```
insert into users_no_dups (
    created_ts,
    user_id,
    name,
    url
)
select
    created_ts,
    user_id,
    name,
    url
from users
on conflict do nothing;
```
* Drop the `users` table
* Repeat the whole thing for the next file.
Running the above loop worked fine for about 12 hours, with each file taking
about 30 seconds to process: roughly 4 seconds to create the `users` table and
load the CSV data into it, and anywhere between 20 and 30 seconds to insert the
data from `users` into `users_no_dups`.
All of a sudden inserting from `users` into `users_no_dups` started taking 20+
minutes.
I recreated the table with a `fillfactor` of `30` and tried again, and things
ran well again at the same ~30 seconds per file. But after about another 12
hours, everything got really slow again.
Recreating the table now doesn't provide any real improvement. I tried
recreating it with a `fillfactor` of `10`, but that took too long and too much
space (the table was 300 GB with the fillfactor set to 30; with it set to 10 it
grew to almost 1 TB).
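The recreate was essentially the following (a sketch, shown for the `fillfactor = 30` case; note that the table-level fillfactor only affects the heap, while the unique index keeps its own fillfactor):
```
-- same definition as before, just with a non-default heap fillfactor
create unlogged table users_no_dups (
    created_ts timestamp without time zone,
    user_id bigint not null,
    name text,
    url text,
    unique (user_id)
) with (fillfactor = 30);
```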
Watching iotop, the `INSERT` statement's write speed now stays between 20 and
100 KB/s. When I first started inserting, the write speed was consistently
above 100 MB/s.
If I copy the `users_no_dups` table into another table with the same structure
(say `users_no_dups_2`), the write speed also reaches 100 MB/s or more until
the last ~2 GB of data are being copied. Then the speed drops back to the
20-100 KB/s range and stays there (again, observed in `iotop`).
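For reference, the copy test was essentially something like this (a sketch, not the exact statements I ran):
```
-- scratch copy with the same structure (including the unique constraint/index)
create unlogged table users_no_dups_2 (like users_no_dups including all);
-- bulk-copy all rows; this is the write load watched in iotop
insert into users_no_dups_2 select * from users_no_dups;
```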
I have applied the following custom configuration to my Postgres installation
to try to improve performance:
```
ssl = off
shared_buffers = 8GB
work_mem = 12GB
maintenance_work_mem = 12GB
max_stack_depth = 4MB
synchronous_commit = off
wal_writer_flush_after = 128MB
max_wal_size = 32GB
min_wal_size = 80MB
effective_cache_size = 96GB
```
Information about the machine:
```
Processor: 2x Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz (6 cores each, 12 threads)
RAM: 256GB
Disk1: 2TB SSD SATA-3 Samsung Evo 860
Disk2: 6TB HDD SATA-3 Seagate Exos Enterprise 7200RPM
Disk3: 8TB HDD SATA-3 Seagate Exos Enterprise 7200RPM
Disk1 and Disk2 are configured as a single logical volume.
Table `users_no_dups` is in a tablespace on `Disk3`. The default tablespace is
on the logical volume made up of `Disk1` and `Disk2`.
OS: Ubuntu Linux 19.10
Postgres version: PostgreSQL 11.7 (Ubuntu 11.7-0ubuntu0.19.10.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008, 64-bit
```
Any ideas why I am seeing this decrease in insert performance, or any
suggestions on how I can figure this out?
Sorry for the wall of text; I'm just trying to provide as much information as I have.
Henrique