Hi all! We have a problem with a serious, recurring incident on our core system: CPU load spikes to 300-400 and the whole database becomes unresponsive. From the database's point of view nothing unusual is happening - memory looks fine, disk I/O is OK - the only visible problem is the huge CPU load. The kernel counters that grow with the load are always the same:
* page tables size
* Committed_AS
* Active(anon)
<http://www.postgresql-archive.org/file/t342733/pagetables.png>
The total number of connections also increases very fast and exceeds max_connections (1000), but I suppose that is a symptom rather than the root cause of the CPU load (see the pg_stat_activity sketch below).

System:
* CentOS Linux release 7.2.1511 (Core)
* Linux 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
* postgresql95-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-contrib-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-docs-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-libs-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-server-9.5.5-1PGDG.rhel7.x86_64
* 4 sockets / 80 cores
* vm.dirty_background_bytes = 0
* vm.dirty_background_ratio = 2
* vm.dirty_bytes = 0
* vm.dirty_expire_centisecs = 3000
* vm.dirty_ratio = 20
* vm.dirty_writeback_centisecs = 500

After the first incident we changed the following:
* increased shared_buffers to 16GB, entirely on huge pages (previously 2GB)
* adjusted vm.nr_hugepages to 8000 (we have 2 MB pages)
* set vm.overcommit_memory = 2 and vm.overcommit_ratio = 99
* disabled transparent huge pages (unfortunately they had been set to 'always')

It is a highly transactional database. Today I ran:

select now(), txid_current();

and comparing two runs gave 3,339,351 transactions between 2017-10-10 14:42 and 2017-10-10 16:24 (roughly 550 transactions per second).

* db size: 1.1 TB
* RAM: over 500 GB
* biggest tables (the rest are not big): 369 GB, 48 GB, 48 GB, 34 GB, 23 GB, 19 GB, 19 GB, 17 GB, 16 GB, 12 GB, 9910 MB

We have captured some database statistics, for instance bgwriter and buffercache. Today the load spikes happened at:
1) 10:44
2) 11:04
and then several more times during the day.

The first occurrence was yesterday around 6 PM. What we observed back then was, among other things, an autovacuum (to prevent wraparound) running on the biggest table (369 GB). We ran VACUUM FREEZE manually after this happened, but before that we gathered statistics with this query:

SELECT oid::regclass::text AS table,
       age(relfrozenxid) AS xid_age,
       mxid_age(relminmxid) AS mxid_age,
       least(
           (SELECT setting::int FROM pg_settings
             WHERE name = 'autovacuum_freeze_max_age') - age(relfrozenxid),
           (SELECT setting::int FROM pg_settings
             WHERE name = 'autovacuum_multixact_freeze_max_age') - mxid_age(relminmxid)
       ) AS tx_before_wraparound_vacuum,
       pg_size_pretty(pg_total_relation_size(oid)) AS size,
       pg_stat_get_last_autovacuum_time(oid) AS last_autovacuum
  FROM pg_class
 ORDER BY tx_before_wraparound_vacuum;

For the biggest table, which was being vacuumed, it showed:
* xid_age: 217310511
* mxid_age: 8156548
* tx_before_wraparound_vacuum: -17310511
* size: 369 GB
* last_autovacuum: 2017-09-30 01:57:33.972068+02

So, from the kernel stats we know that the failure happens when the database is trying to allocate a huge number of pages (page tables size, Active(anon), Committed_AS). But what is triggering this situation? I suppose it could be the lazy autovacuum (we run with the standard settings): autovacuum had to read the whole 369 GB yesterday to freeze old xids, and today it did the same on some other tables. Another idea is that shared_buffers is still too small.
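To break down the connection pile-up during the next spike, something along these lines could be sampled every few seconds - just a minimal sketch, assuming PostgreSQL 9.5 where pg_stat_activity still has the boolean "waiting" column rather than wait_event:

SELECT state, waiting, count(*)
  FROM pg_stat_activity
 GROUP BY state, waiting
 ORDER BY count(*) DESC;

If most backends show waiting = true during the spike, that would point more towards lock/buffer contention than towards genuinely CPU-bound queries.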
Today the buffer cache looked like this (snapshot before and after the failure at 10:44):
<http://www.postgresql-archive.org/file/t342733/buffercache1040.png>
In the charts, "c" means count and the number after "c" is the usage count, so "c5dirty" is the count of dirty buffers with usagecount = 5.

And before and after the spike at 11:04:
<http://www.postgresql-archive.org/file/t342733/buffercache1104.png>

My interpretation is the following: the count of clean buffers with a high usagecount is decreasing, and the count of buffers with usagecount 0 and 1 is very unstable -> so buffers have no time to age in shared buffers before they are thrown out?

bgwriter stats:
<http://www.postgresql-archive.org/file/t342733/bgwriter.png>
The largest number of buffers is cleaned by the backends themselves - so there are no free buffers with usagecount 0 and LWLock contention occurs? Would increasing shared_buffers be the solution? (A sketch of how such snapshots can be taken is at the end of this message.)

Please help - it's happening quite often and I'm not sure which way is the right one...
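For completeness, the kind of query behind those buffer cache snapshots - a minimal sketch, assuming the pg_buffercache extension from postgresql95-contrib is installed (not necessarily the exact query we run):

SELECT usagecount,
       isdirty,
       count(*) AS buffers
  FROM pg_buffercache
 GROUP BY usagecount, isdirty
 ORDER BY usagecount, isdirty;

and the bgwriter counters, to see how much cleaning the backends do themselves:

SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, maxwritten_clean,
       buffers_backend, buffers_alloc
  FROM pg_stat_bgwriter;

When buffers_backend grows much faster than buffers_clean between two snapshots, the backends are evicting dirty buffers themselves, which would match the "no free buffers with usagecount 0" picture above.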