Hi all! We have a problem with a serious, recurring incident on our core system: CPU load spikes to 300-400 and the whole database becomes unresponsive. From the database's point of view nothing unusual is happening - memory looks fine, disk I/O is OK - the only visible problem is the huge CPU load. The kernel counters that grow with the load are always the same:
* page tables size
* Committed_AS
* Active(anon)
<http://www.postgresql-archive.org/file/t342733/pagetables.png>
The total number of connections also increases very fast and exceeds max_connections (1000), but I suppose that is a symptom rather than the root cause of the CPU load (see the pg_stat_activity sketch below).

System:
* CentOS Linux release 7.2.1511 (Core)
* Linux 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
* postgresql95-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-contrib-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-docs-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-libs-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-server-9.5.5-1PGDG.rhel7.x86_64
* 4 sockets / 80 cores
* vm.dirty_background_bytes = 0
* vm.dirty_background_ratio = 2
* vm.dirty_bytes = 0
* vm.dirty_expire_centisecs = 3000
* vm.dirty_ratio = 20
* vm.dirty_writeback_centisecs = 500

After the first incident we changed the following:
* increased shared_buffers to 16GB, entirely on huge pages (previously 2GB)
* adjusted vm.nr_hugepages to 8000 (we have 2 MB pages)
* set vm.overcommit_memory = 2 and vm.overcommit_ratio = 99
* disabled transparent huge pages (unfortunately they had been set to 'always')

It is a highly transactional database. Today I ran:

select now(), txid_current();

and comparing two runs gave 3,339,351 transactions between 2017-10-10 14:42 and 2017-10-10 16:24 (roughly 550 transactions per second).

* db size: 1.1 TB
* RAM: over 500 GB
* biggest tables (the rest are not big): 369 GB, 48 GB, 48 GB, 34 GB, 23 GB, 19 GB, 19 GB, 17 GB, 16 GB, 12 GB, 9910 MB

We have captured some database statistics, for instance bgwriter and buffercache. Today the load spikes happened at:
1) 10:44
2) 11:04
and then several more times during the day.

The first occurrence was yesterday around 6 PM. What we observed back then was, among other things, an autovacuum (to prevent wraparound) running on the biggest table (369 GB). We ran VACUUM FREEZE manually after this happened, but before that we gathered statistics with this query:

SELECT oid::regclass::text AS table,
       age(relfrozenxid) AS xid_age,
       mxid_age(relminmxid) AS mxid_age,
       least(
           (SELECT setting::int FROM pg_settings
             WHERE name = 'autovacuum_freeze_max_age') - age(relfrozenxid),
           (SELECT setting::int FROM pg_settings
             WHERE name = 'autovacuum_multixact_freeze_max_age') - mxid_age(relminmxid)
       ) AS tx_before_wraparound_vacuum,
       pg_size_pretty(pg_total_relation_size(oid)) AS size,
       pg_stat_get_last_autovacuum_time(oid) AS last_autovacuum
  FROM pg_class
 ORDER BY tx_before_wraparound_vacuum;

For the biggest table, which was being vacuumed, it showed:
* xid_age: 217310511
* mxid_age: 8156548
* tx_before_wraparound_vacuum: -17310511
* size: 369 GB
* last_autovacuum: 2017-09-30 01:57:33.972068+02

So, from the kernel stats we know that the failure happens when the database is trying to allocate a huge number of pages (page tables size, Active(anon), Committed_AS). But what is triggering this situation? I suppose it could be the lazy autovacuum (we run with the standard settings): autovacuum had to read the whole 369 GB yesterday to freeze old xids, and today it did the same on some other tables. Another idea is that shared_buffers is still too small.
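To break down the connection pile-up during the next spike, something along these lines could be sampled every few seconds - just a minimal sketch, assuming PostgreSQL 9.5 where pg_stat_activity still has the boolean "waiting" column rather than wait_event:

SELECT state, waiting, count(*)
  FROM pg_stat_activity
 GROUP BY state, waiting
 ORDER BY count(*) DESC;

If most backends show waiting = true during the spike, that would point more towards lock/buffer contention than towards genuinely CPU-bound queries.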
Today the buffer cache looked like this (snapshot before and after the failure at 10:44):
<http://www.postgresql-archive.org/file/t342733/buffercache1040.png>
In the charts, "c" means count and the number after "c" is the usage count, so "c5dirty" is the count of dirty buffers with usagecount = 5.

And before and after the spike at 11:04:
<http://www.postgresql-archive.org/file/t342733/buffercache1104.png>

My interpretation is the following: the count of clean buffers with a high usagecount is decreasing, and the count of buffers with usagecount 0 and 1 is very unstable -> so buffers have no time to age in shared buffers before they are thrown out?

bgwriter stats:
<http://www.postgresql-archive.org/file/t342733/bgwriter.png>
The largest number of buffers is cleaned by the backends themselves - so there are no free buffers with usagecount 0 and LWLock contention occurs? Would increasing shared_buffers be the solution? (A sketch of how such snapshots can be taken is at the end of this message.)

Please help - it's happening quite often and I'm not sure which way is the right one...
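For completeness, the kind of query behind those buffer cache snapshots - a minimal sketch, assuming the pg_buffercache extension from postgresql95-contrib is installed (not necessarily the exact query we run):

SELECT usagecount,
       isdirty,
       count(*) AS buffers
  FROM pg_buffercache
 GROUP BY usagecount, isdirty
 ORDER BY usagecount, isdirty;

and the bgwriter counters, to see how much cleaning the backends do themselves:

SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, maxwritten_clean,
       buffers_backend, buffers_alloc
  FROM pg_stat_bgwriter;

When buffers_backend grows much faster than buffers_clean between two snapshots, the backends are evicting dirty buffers themselves, which would match the "no free buffers with usagecount 0" picture above.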