Greetings, We have been having some pretty painful site outages related to heavy swapping, even though the system may appear to have as much as 10GB of free memory. It seems that as soon as the system hits ~22GB (of 32GB) of memory usage it starts to swap. As soon as we go below ~22GB, swap is released. During the worst outages we see: heavy swapping (10-15GB) heavy disk IO (up to 600ms) heavy CPU load: 4.0 (load over 100+ with 26 cores) available memory: (6-8GB) Here's the host information, it's a dedicated physical host, not a VM. ubuntu-14.04.1 LTS Linux db004 3.13.0-34-generic #60-Ubuntu SMP Wed Aug 13 15:45:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux 24 cores, 32GB RAM, 32GB swapfile RAID 10 with SSDs Postgres version: postgresql-9.3.5 postgis-2.1.4 extension Here's our /etc/sysctl.conf: kernel.shmmax = 16850395136 kernel.shmall = 4113866 vm.overcommit_memory = 2 vm.overcommit_ratio = 50 vm.swappiness = 0 vm.dirty_ratio = 10 ## maximum 3.2GB dirty cache size vm.dirty_background_ratio = 5 ## ratio at which disk begins to flush cache (1.6GB) We are setting our shared_buffers, work_mem and maintenance_work_mem very high which I suspected is reason for swapping. The huge mystery to me is that during heavy swap there appears to be ~8-10GB free memory in cache. We wondered whether that unused memory in cache was being consumed by our memory settings in postgresql.conf: shared_buffers = 8GB ## 25% of system memory (32GB) maintenance_work_mem = 8GB ## autovacuuming on, 3 workers (defaults) work_mem = 256MB ## we have as many as 170 connections We have average around 170 connections and have been moving everything to pgbouncer to reduce this count. Rough estimate of cost with work_mem set to 256MB: 170 x 256MB = 43510MB (43GB) Is it possible that the high work_mem setting is causing the connections to hold on to the extra available memory? This is our assumption so we plan on dialing down work_mem and maintenance_work_mem to sane values. We have also ordered more RAM (increase from 32GB to 96GB) but I would like to understand what is happening. Here are the values we are going to change: maintenance_work_mem = 1GB work_mem = 64MB One last thing, we disabled THP (tranparent huge pages) because we were seeing compaction errors: [db003.prod:~] root% egrep 'compact_(fail|stall)' /proc/vmstat compact_stall 34682729 compact_fail 32915396 Here are our competing theories of why we are swapping as much as 10-15GB with 6GB-10GB of free memory: 1.) maintenance_work_mem and work_mem are set too high causing postgres to allocate too much memory 2.) The 6GB - 10GB of free memory observed during swapping is postgresql's shared buffer 3.) Linux filesystem caching is tuned incorrectly, maybe SSD related? 4.) This is a NUMA related issue similar to the Mysql Swap Insanity issue: http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/ http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html http://www.postgresql.org/message-id/CAGTBQpacrSDcN10rTwRbH+AGm2_y0Qao6CJDoyvEp504iFbdrw@xxxxxxxxxxxxxx Note, we zone_reclaim_mode appears disabled on our database host. root% cat /proc/sys/vm/zone_reclaim_mode 0 Hoping someone here can help us sort this out because it's a huge mystery for us. We are going to add more RAM but I'm trying to understand what is happening. Thank you kindly. Christian Gough -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance