Postgres 10, slave not catching up with master

Hello,

I have a database running on an i3.8xlarge AWS instance (256GB RAM, 32 CPU cores, 4x 1.9TB NVMe drives) with about 5TB of disk space occupied, ext4, Ubuntu 16.04.

It is a multi-tenant DB with about 40000 tables and an insert-heavy workload.

I started a new slave with identical hardware specs, using streaming replication (SR). The initial sync from the master took about 4 hours, after which the slave started applying WAL. However, it does not seem able to catch up: even a day later, the delay is still around 3 hours (measured with now() - pg_last_xact_replay_timestamp()). It fluctuates by a few hundred seconds, but stays around the 3-hour mark.
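The time-based delay can be complemented by a byte-based one. The following is a minimal sketch, assuming LSN strings of the form returned by pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn() on the standby. The helper names lsn_to_int and lag_bytes are hypothetical, and the LSN values in the example are made up for illustration, not taken from the server described here.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN string such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


def lag_bytes(receive_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL already received by the standby but not yet replayed."""
    return lsn_to_int(receive_lsn) - lsn_to_int(replay_lsn)


if __name__ == "__main__":
    # Illustrative values only (assumption), not measurements from this setup.
    backlog = lag_bytes("2A/10000000", "29/F0000000")
    print(f"replay backlog: {backlog / 1024 ** 2:.0f} MiB")
```

If the byte backlog stays roughly constant while the time delay also stays constant, that would suggest replay is keeping pace with WAL generation but never gaining on it.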

Disk IO is low at about 10% (measured with iostat), there are no connected clients, and the recovery process uses around 90% of a single CPU core.

I tried tuning various parameters, but to no avail. The only thing I found suspicious is that stracing the recovery process constantly shows many errors such as:

lseek(428, 0, SEEK_END)                 = 780124160
lseek(30, 0, SEEK_END)                  = 212992
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(680, 0, SEEK_END)                 = 493117440
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(774, 0, SEEK_END)                 = 583368704

...[snip]...

read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(774, 0, SEEK_END)                 = 583368704
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(277, 0, SEEK_END)                 = 502882304
lseek(6, 516096, SEEK_SET)              = 516096
read(6, "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"..., 8192) = 8192
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(735, 0, SEEK_END)                 = 272809984
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(277, 0, SEEK_END)                 = 502882304

ls -l fd/9
lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]
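To judge how much of the trace is made up of the failing reads, the output can be tallied per system call and file descriptor. This is a minimal sketch, assuming the trace has been saved as text in the format shown above; the function name tally and the regular expression are illustrative, not part of strace. Note that EAGAIN on a one-byte read from a pipe generally just means a non-blocking read found no data, so a high count is not necessarily a failure in itself.

```python
import re
from collections import Counter

# A few lines copied from the strace output above.
SAMPLE = """\
lseek(428, 0, SEEK_END)                 = 780124160
lseek(30, 0, SEEK_END)                  = 212992
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(680, 0, SEEK_END)                 = 493117440
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(774, 0, SEEK_END)                 = 583368704
"""

# syscall name, first argument (the fd), return value, optional errno name
LINE_RE = re.compile(
    r"^(?P<call>\w+)\((?P<fd>\d+),.*?\)\s+=\s+(?P<ret>-?\d+)(?:\s+(?P<errno>E[A-Z]+))?"
)


def tally(text):
    """Count calls per (syscall, fd) and errors per (syscall, fd, errno)."""
    calls = Counter()
    errors = Counter()
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        key = (m["call"], int(m["fd"]))
        calls[key] += 1
        if m["errno"]:
            errors[key + (m["errno"],)] += 1
    return calls, errors


if __name__ == "__main__":
    calls, errors = tally(SAMPLE)
    for key, count in calls.most_common():
        print(key, count)
    print("errors:", dict(errors))
```

Grouping by file descriptor makes it easy to see that every error in the sample comes from the same descriptor, the pipe on fd 9, while the lseek calls are spread over many different files.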


Perf top on recovery produces:

  27.76%  postgres            [.] pglz_decompress
   9.90%  [kernel]            [k] entry_SYSCALL_64_after_swapgs
   7.09%  postgres            [.] hash_search_with_hash_value
   4.26%  libpthread-2.23.so  [.] llseek
   3.64%  libpthread-2.23.so  [.] __read_nocancel
   2.80%  [kernel]            [k] __fget_light
   2.67%  postgres            [.] 0x000000000034d3ba
   1.85%  [kernel]            [k] ext4_llseek
   1.84%  postgres            [.] pg_comp_crc32c_sse42
   1.44%  postgres            [.] hash_any
   1.35%  postgres            [.] 0x000000000036afad
   1.29%  postgres            [.] MarkBufferDirty
   1.21%  postgres            [.] XLogReadRecord
[...]
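Summing the listed samples per binary gives a rough picture of where the CPU time goes. This is a minimal sketch that only covers the entries shown above; the list is truncated, so the totals are lower bounds and do not add up to 100%. The function name share_by_object is hypothetical.

```python
from collections import defaultdict

# The perf top lines listed above (the truncated remainder is not included).
PERF_TOP = """\
27.76%  postgres            [.] pglz_decompress
 9.90%  [kernel]            [k] entry_SYSCALL_64_after_swapgs
 7.09%  postgres            [.] hash_search_with_hash_value
 4.26%  libpthread-2.23.so  [.] llseek
 3.64%  libpthread-2.23.so  [.] __read_nocancel
 2.80%  [kernel]            [k] __fget_light
 2.67%  postgres            [.] 0x000000000034d3ba
 1.85%  [kernel]            [k] ext4_llseek
 1.84%  postgres            [.] pg_comp_crc32c_sse42
 1.44%  postgres            [.] hash_any
 1.35%  postgres            [.] 0x000000000036afad
 1.29%  postgres            [.] MarkBufferDirty
 1.21%  postgres            [.] XLogReadRecord
"""


def share_by_object(text):
    """Sum the sample percentages per shared object (second column)."""
    totals = defaultdict(float)
    for line in text.splitlines():
        parts = line.split()
        # Expected columns: percentage, object, symbol type, symbol name
        if len(parts) < 4 or not parts[0].endswith("%"):
            continue
        totals[parts[1]] += float(parts[0].rstrip("%"))
    return dict(totals)


if __name__ == "__main__":
    for obj, pct in sorted(share_by_object(PERF_TOP).items(), key=lambda kv: -kv[1]):
        print(f"{pct:6.2f}%  {obj}")
```

In this listing, pglz_decompress alone accounts for more than the kernel and libpthread entries combined, while the lseek and read related symbols match the system calls seen in the strace output.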

I tried raising the process limits to unlimited with prlimit, but that made no difference.

I can turn off WAL compression, but I doubt it is the main culprit. Any ideas appreciated.

Regards,
Boris

