Hi All,
The database is postgresql 9.3, running on debian7, with 8 cpu cores and 8096MB physical memory.
There is a big table, with 70 more columns. It would be constantly at 700 rows/sec. It's not feasible to use COPY, because the data is not predefined or provisioned, and it's generated on demand by clients.
To make a clean test env, I clone a new table, removing the indexes (keeping the primary key) and triggers, and use pgbench to test insert statement purely.
Here is some key items in the postgresql.conf:
--------------
shared_buffers = 1024MB
work_mem = 32MB
maintenance_work_mem = 128MB
bgwriter_delay = 20ms
synchronous_commit = off
checkpoint_segments = 64
checkpoint_completion_target = 0.9
effective_cache_size = 4096MB
log_min_duration_statement = 1000
--------------
The problem:
The log would show that the duration of some inserts exceed 1 second.
I use iotop to view the io activities of the pg backend worker process, it shows that it writes some MB per second. It's the most confused part. The commit is async, so it would not do and wait the wal writing, as well as the shared buffers. I doubt it would be flush at the shared buffer allocation. So I modify the codes, print the pgBufferUsage.blk_write_time along with the long duration printing. But I found it's a small fraction the total duration. I also add codes to record the total time on lock waiting within the statement execution and print it, and it's also a small fraction of the duration. I could not explain the result.
Then I use systemtap to check what files the process frequenlty write out:
-----
Tracing 20991 (/usr/lib/postgresql/9.3/bin/postgres)...
Please wait for 30 seconds.
=== Top 10 file writes ===
#1: 6004 times, 49184768 bytes writes in file 21986.44.
#2: 400 times, 3276800 bytes writes in file 21988.3.
#3: 12 times, 98304 bytes writes in file 57ED.
#4: 10 times, 81920 bytes writes in file 57F0.
#5: 10 times, 81920 bytes writes in file 57EE.
#6: 9 times, 73728 bytes writes in file 57F1.
#7: 9 times, 73728 bytes writes in file 57F3.
#8: 8 times, 65536 bytes writes in file 57EB.
#9: 8 times, 65536 bytes writes in file 57F2.
#10: 4 times, 32768 bytes writes in file 57EF.
-----
The "21986.44" is the table data file, and the "21988.3" is the primary key index, and the others are subtrans files.
Obviously, the process does a lot of IO (vfs level), and I doubt the duration spikes are due to the IO contention from time to time.
But I do not understand that why the process do so many IO with async commit? And it does not even happen at the shared buffer flushing and locks waiting. Where's the code path doing these IO?
Regards,
Jinhua Luo