Hi all,

We finally made some headway on this - we noticed messages like the one below in /var/log/messages whenever the issue happened:

Mar 26 07:39:58 site-db01b kernel: postmaster: page allocation failure. order:1, mode:0x20
Mar 26 07:39:58 site-db01b kernel: Pid: 39066, comm: postmaster Not tainted 2.6.32-279.el6.x86_64 #1
Mar 26 07:39:58 site-db01b kernel: Call Trace:
Mar 26 07:39:58 site-db01b kernel: <IRQ>  [<ffffffff8112759f>] ? __alloc_pages_nodemask+0x77f/0x940
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8116297a>] ? fallback_alloc+0x1ba/0x270
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81161d62>] ? kmem_getpages+0x62/0x170
Mar 26 07:39:58 site-db01b kernel: [<ffffffff811623cf>] ? cache_grow+0x2cf/0x320
Mar 26 07:39:58 site-db01b kernel: [<ffffffff811626f9>] ? ____cache_alloc_node+0x99/0x160
Mar 26 07:39:58 site-db01b kernel: [<ffffffff811634db>] ? kmem_cache_alloc+0x11b/0x190
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8142df32>] ? sk_clone+0x22/0x2e0
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8142dc68>] ? sk_prot_alloc+0x48/0x1c0
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81494ae3>] ? tcp_create_openreq_child+0x23/0x450
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8147bb86>] ? inet_csk_clone+0x16/0xd0
Mar 26 07:39:58 site-db01b kernel: [<ffffffff814935be>] ? tcp_v4_rcv+0x4fe/0x8d0
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81471200>] ? ip_local_deliver_finish+0x0/0x2d0
Mar 26 07:39:58 site-db01b kernel: [<ffffffff814712dd>] ? ip_local_deliver_finish+0xdd/0x2d0
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8149239d>] ? tcp_v4_syn_recv_sock+0x4d/0x310
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81494886>] ? tcp_check_req+0x226/0x460
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81491dbb>] ? tcp_v4_do_rcv+0x35b/0x430
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81489cfd>] ? tcp_rcv_established+0x38d/0x800
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81470fb5>] ? ip_rcv+0x275/0x350
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81470a2d>] ? ip_rcv_finish+0x12d/0x440
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81471568>] ? ip_local_deliver+0x98/0xa0
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8143a7bb>] ? __netif_receive_skb+0x49b/0x6f0
Mar 26 07:39:58 site-db01b kernel: [<ffffffffa02fe1b4>] ? tg3_poll_work+0x654/0xe30 [tg3]
Mar 26 07:39:58 site-db01b kernel: [<ffffffffa02fe9dc>] ? tg3_poll_msix+0x4c/0x150 [tg3]
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8143ca38>] ? netif_receive_skb+0x58/0x60
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8143cb40>] ? napi_skb_finish+0x50/0x70
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8143f193>] ? net_rx_action+0x103/0x2f0
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8143f079>] ? napi_gro_receive+0x39/0x50
Mar 26 07:39:58 site-db01b kernel: [<ffffffff810db800>] ? handle_IRQ_event+0x60/0x170
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81073ca5>] ? irq_exit+0x85/0x90
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
Mar 26 07:39:58 site-db01b kernel: [<ffffffff81073f1f>] ? __do_softirq+0x11f/0x1e0
Mar 26 07:39:58 site-db01b kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30

Doing some digging on that, we disabled TSO, GSO, GRO and LRO on the NIC - and that seems to have helped.
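For reference, turning the offloads off looks roughly like this (a sketch only - "eth0" and the exact feature names depend on the driver and ethtool version, so check "ethtool -k" first):

  # disable the segmentation/receive offloads, then re-check the state
  sudo ethtool -K eth0 tso off gso off gro off lro off
  sudo ethtool -k eth0

The resulting state on our box is below.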
$ sudo ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off

However, I'm looking for more information on what's actually happening. The stack trace above seems to indicate that the kernel was unable to allocate two contiguous 4k pages (an order-1, 8k allocation) for the network stack - likely needed for GSO/GRO. What I'm wondering is why the kernel couldn't find even 8k on a 768G RAM server with the bulk of memory sitting in buffers/cache:

root@site-db01b:/proc # free -m
             total       used       free     shared    buffers     cached
Mem:        775382     773354       2028          0       1403     738735
-/+ buffers/cache:      33215     742166
Swap:            0          0          0

Looking a little deeper, I saw signs of memory being heavily fragmented:

root@site-db01b:/var/log # cat /proc/buddyinfo
Node 0, zone      DMA      1      1      2      2      2      1      0      0      1      1      3
Node 0, zone    DMA32      8      7      8      7     10      8      7     11      9      5     92
Node 0, zone   Normal  13069      0      0      0      0      0      0      0      0      0      1
Node 1, zone   Normal 652315  36885   1168      0      1      1      0      1      1      1      0

Node 0's Normal zone has 13069 free 4k blocks but zero free 8k blocks, which is likely what caused the allocation failure, I'd think.

A little while later, though, buddyinfo changed and suddenly there was a lot more memory available in 8k blocks:

root@site-db01b:/proc # cat /proc/buddyinfo
Node 0, zone      DMA      1      1      2      2      2      1      0      0      1      1      3
Node 0, zone    DMA32      8      7      8      7     10      8      7     11      9      5     92
Node 0, zone   Normal   9645   5495   1115      0      0      0      0      0      0      0      1
Node 1, zone   Normal 409734  10953      1      0      1      1      0      1      1      1      0

(Note the change in the "Node 0, zone Normal" line - free 8k blocks went up from 0 to 5495.)

Does anyone have any idea why memory was so fragmented, and what causes it to get defragmented? Is it something postgres does? Are there kernel-specific settings that control it?

Thanks,
Karthik

On 3/14/14 3:37 PM, "Anand Kumar, Karthik" <Karthik.AnandKumar@xxxxxxxxxxxxxx> wrote:

>For anyone that's still following - we tried upgrading to postgres 9.3.3 -
>that hasn't helped.
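Coming back to the fragmentation question above: the kernel-side knobs that look relevant are sketched here. This is untested speculation on my part - the value below is only a placeholder, and /proc/sys/vm/compact_memory is only present on kernels built with memory compaction:

  # watch free blocks per order (columns are orders 0..10, i.e. 4k, 8k, ..., 4M)
  watch -n 5 cat /proc/buddyinfo

  # ask the kernel to compact/defragment memory right now, where the knob exists
  echo 1 > /proc/sys/vm/compact_memory

  # keep a larger reserve free, so atomic order-1 allocations from softirq
  # context are less likely to hit empty free lists (value is a placeholder)
  sysctl -w vm.min_free_kbytes=1048576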
>
>Running an strace on the pid that was consuming the highest CPU at the
>time of the outage shows:
>
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91881569, {{12, 1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(89325587, {{14, 1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(90964037, {{4, 1, 0}}, 1) = 0
>semop(90308657, {{5, 1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(88866821, {{12, 1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(90439733, {{13, 1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(90341426, {{2, 1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(90308657, {{5, 1, 0}}, 1) = 0
>semop(91881569, {{12, 1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(88866821, {{12, 1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91881569, {{12, 1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(90865730, {{5, 1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(91521110, {{12, -1, 0}}, 1) = 0
>semop(90865730, {{5, 1, 0}}, 1) = 0
>
>I've seen other people talk of this problem with a lot of semop calls,
>haven't yet found a clear solution.
>Anyone have any ideas?
>
>I've also downloaded the perf tool based on
>http://rhaas.blogspot.com/2012/06/perf-good-bad-ugly.html - will see what
>that has to show.
>
>Thanks,
>Karthik
>
>
>On 3/11/14 1:06 PM, "John R Pierce" <pierce@xxxxxxxxxxxx> wrote:
>
>>On 3/11/2014 10:20 AM, Anand Kumar, Karthik wrote:
>>> We typically see about 500-700 active queries at a time
>>
>>if these are primarily small/fast queries, like OLTP operations, and you
>>DONT have 200-400 CPU cores on this server, you will likely find that if
>>you use a queueing mechanism to only execute about 2X your CPU core
>>count concurrently, you will get MORE total transactions/second than
>>trying to do 500-700 at once.
>>
>>if your apps are using persistent connections, then the session pooling
>>model won't do any good, you should use transaction pooling. you want
>>the actual active query count to be tunable, probably down around 2X the
>>cpu core count, depending on various things. some folks say, CPU
>>cores/threads plus disk spindles is the optimal number.
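(For anyone following along: a transaction-pooling setup along the lines John describes might look roughly like the pgbouncer sketch below. This is illustrative only - the database name, paths, port and pool size are placeholders, not our actual configuration.)

  ; pgbouncer.ini - illustrative sketch only
  [databases]
  sitedb = host=127.0.0.1 port=5432 dbname=sitedb

  [pgbouncer]
  listen_addr = 127.0.0.1
  listen_port = 6432
  auth_type = md5
  auth_file = /etc/pgbouncer/userlist.txt
  ; apps hold persistent connections, so pool per transaction, not per session
  pool_mode = transaction
  ; cap concurrently executing queries near 2x the CPU core count
  default_pool_size = 64
  ; the 500-700 client connections queue here instead of inside postgres
  max_client_conn = 1000

The application would then connect to port 6432 instead of 5432, and pgbouncer hands out server connections one transaction at a time.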
>> >> >> >>-- >>john r pierce 37N 122W >>somewhere on the middle of the left coast >> >> >> >>-- >>Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) >>To make changes to your subscription: >>http://www.postgresql.org/mailpref/pgsql-general > > > >-- >Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) >To make changes to your subscription: >http://www.postgresql.org/mailpref/pgsql-general -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general