Thanks, Tom.

The system is running 16 cores (dual E5620s) and 96GB of RAM. Nothing changed on the system when this started happening - at least nothing in the configuration; it has not been touched since it was set up. Load averages are between 6 and 8 (all three of them). Memory-wise I do not see a problem - there is plenty of free memory, and only 120K of swap has been used. The only thing I see when I run iostat is a burst of load on the drives every so often. mfid0 holds the database, mfid1 holds pg_xlog.
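To make it easier to line those bursts up against the "pgstat wait timeout" entries in the PostgreSQL log, I can prepend a timestamp to each sample with a small wrapper - just a sketch, the interval and output path are arbitrary:

    # stamp every iostat output line with the wall-clock time so disk
    # spikes can be matched against timestamps in the PostgreSQL log
    iostat -x 5 | while IFS= read -r line; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') $line"
    done >> /var/tmp/iostat-stamped.log

Here is a representative (untimestamped) run: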
db1# iostat -x 5
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    1.6   0
ad6        0.0      0.0      0.0      0.0     0    0.8   0
mfid0     11.6    172.2    171.7   2988.5     0  178.1   6
mfid1      0.1    734.0      1.1   5898.8     0    0.2   4
ar0        0.3      0.7      1.0     10.3     0  114.7   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      2.8     73.4     22.4   1961.9     0    0.2   2
mfid1      0.0   1784.9      0.0  14303.2     0    0.0   5
ar0        0.0      1.0      0.0      6.8     0    0.8   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      2.2    652.7     44.8   7671.1  2053   99.6  10
mfid1      0.0   1525.9      0.0  12207.6     1    0.1  12
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.0   2202.4      8.0  19776.3     0  632.0  68
mfid1      0.0    574.3      0.0   4666.4     0    0.5  23
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.8     75.8     14.4   2242.8     0    0.3   2
mfid1      0.0   1722.5      0.0  13802.0     0    0.0   5
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.4     71.4     11.2   1991.6     0    0.2   1
mfid1      0.0   1708.9      0.0  13671.4     0    0.0   4
ar0        0.2      2.4      3.2     38.4     0    2.5   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.0     78.0      8.0   2222.5     0    0.1   1
mfid1      0.0   1541.0      0.0  12352.1     0    0.0   5
ar0        0.0      0.6      0.0      9.6     0    0.6   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.4     56.6     11.2   1629.1     0    0.3   2
mfid1      0.0   1558.3      0.0  12468.3     0    0.0   6
ar0        0.0      0.6      0.0      3.2     0    0.5   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.8     72.2     14.4   1706.7     0    0.2   1
mfid1      0.0   1696.5      0.0  13596.2     0    0.0   5
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.0   1757.7      0.0  16093.5   351  648.0  48
mfid1      0.0    503.9      0.0   4031.3     0    0.2  10
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.0    125.0      0.0   2542.5     0  404.4   1
mfid1      0.0   1743.9      0.0  13951.1     0    0.1   5
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.2     73.4      9.6   2122.6     0    0.1   1
mfid1      0.0   1624.3      0.0  13020.0     0    0.0   4
ar0        0.0      2.0      0.0     32.0     0    1.8   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.2     67.4      1.6   1955.8     0    0.1   0
mfid1      0.0   1334.5      0.0  10699.7     0    0.0   5
ar0        0.0      0.4      0.0      6.4     0    0.5   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.0     64.6      8.0   1789.1     0    0.2   1
mfid1      0.0   1541.0      0.0  12352.1     0    0.0   5
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.0     78.4      8.0   1855.3     0    0.2   1
mfid1      0.0   1652.1      0.0  13217.0     0    0.0   5
ar0        0.0      0.8      0.0      6.4     0    0.7   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.6    691.6      4.8   7919.5  3461   82.4  11
mfid1      0.0   1477.7      0.0  11821.7     1    0.3  15
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.4   1346.2      3.2  12558.1     0  855.1  41
mfid1      0.0   1020.8      0.0   8190.6     0    0.5  39
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.0     62.2      8.0   1792.6     0    0.3   2
mfid1      0.0   1636.3      0.0  13112.4     0    0.0   4
ar0        0.0      0.8      0.0     12.8     0    0.8   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.2     63.0      1.6   1818.1     0    0.1   0
mfid1      0.0   1710.6      0.0  13684.6     0    0.0   5
ar0        0.0      1.2      0.0     19.2     0    1.2   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.8     83.8      6.4   2330.6     0    0.1   1
mfid1      0.0   1218.8      0.0   9750.2     0    0.0   4
ar0        0.0      0.4      0.0      6.4     0    0.5   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.0     68.6      0.0   1655.7     0    0.1   0
mfid1      0.0   1708.9      0.0  13671.4     0    0.0   5
ar0        0.0      0.6      0.0      3.2     0    0.5   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.0     71.6      0.0   1999.4     0    0.0   0
mfid1      0.0   1071.3      0.0   8594.1     0    0.1   3
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.4   1373.6      3.2  12956.5     0  550.7  35
mfid1      0.0    999.8      0.0   8022.7     0    0.8  37
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.8     68.8      8.0   1908.5     0    0.2   1
mfid1      0.0   1370.5      0.0  10962.3     0    0.0   3
ar0        0.0      0.6      0.0      9.6     0    0.6   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.2     63.8      9.6   1912.1     0    0.3   2
mfid1      0.0   1355.8      0.0  10846.2     0    0.0   3
ar0        0.0      1.0      0.0     16.0     0    1.1   0

Looking at /var/log/messages, everything appears to be ok. The only thing I found, but it was from over a month ago, was the following:

Nov 17 06:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 06:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 06:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 06:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 06:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 06:35:03 db1 kernel: MCA: Misc 0x4701041000041280
Nov 17 07:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 07:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 07:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 07:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 07:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 07:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 17 09:35:03 db1 kernel: MCA: Bank 8, Status 0x88000040000200cf
Nov 17 09:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 09:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 09:35:03 db1 kernel: MCA: CPU 0 COR (1) MS channel ?? memory error
Nov 17 09:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 18 13:34:59 db1 kernel: MCA: Bank 8, Status 0xcc0000800001009f
Nov 18 13:34:59 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 18 13:34:59 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 18 13:34:59 db1 kernel: MCA: CPU 0 COR (2) OVER RD channel ?? memory error
Nov 18 13:34:59 db1 kernel: MCA: Address 0x84aa98000
Nov 18 13:34:59 db1 kernel: MCA: Misc 0x4701041000044707

But it has not reoccurred.
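On your suggestion about query duration logging: I will enable it and try to correlate the slow responses with the pgstat timeouts and the disk bursts above. Roughly what I have in mind for postgresql.conf - the threshold is only a first guess:

    log_min_duration_statement = 250   # ms; log any statement slower than this
    log_line_prefix = '%m %p '         # millisecond timestamp + PID, to match against iostat samples
    log_checkpoints = on               # in case the periodic write bursts are checkpoints

followed by "pg_ctl reload" (or SELECT pg_reload_conf()) to pick the settings up without a restart.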
> -----Original Message-----
> From: Tom Lane [mailto:tgl@xxxxxxxxxxxxx]
> Sent: Thursday, December 22, 2011 10:55 AM
> To: Benjamin Krajmalnik
> Cc: pgsql-admin
> Subject: Re: Problem with pgstat timneouts
>
> "Benjamin Krajmalnik" <kraj@xxxxxxxxxxxxx> writes:
> > About a month ago, I started receiving quite a few pgstat timeouts on
> > my production database.
>
> > PostgreSQL 9.0.3 on amd64-portbld-freebsd8.1, compiled by GCC cc (GCC)
> > 4.2.1 20070719 [FreeBSD], 64-bit
>
> > I am not sure where to start looking for the cause.
>
> Yeah, so are the rest of us :-(.  People have been reporting sometimes
> seeing these messages without any clear cause.  We've been unable to
> find any bug in Postgres that would explain them, and the obvious
> explanation of "your system is so overloaded that the stats collector
> is too slow to respond" isn't very satisfying if response time seems
> snappy otherwise.  I think we have seen one case where the problem was
> provably caused by significant (multi-second) clock skew between
> different processors, but that's likely not applicable to many people.
>
> Can you reconstruct what changed in your system around the time you
> started seeing these?
>
> > It appears that these timeouts are affecting the performance of the
> > application - sometimes it hangs for a few seconds before data gets
> > returned, which was not the case before this.
>
> Usually these timeouts are reported by autovacuum, and so should not be
> affecting the performance of foreground queries, at least not queries
> that are not explicitly requesting pgstats data.  The fact that your
> log messages don't include any query context reinforces the idea that
> they are coming from autovacuum for you too.  So if you are seeing
> query response time problems, maybe the pgstat timeouts are another
> symptom, not the cause, of an underlying too-much-load problem.  You
> might try turning on query duration logging and see if you can
> correlate response-time glitches and pgstat timeouts to anything else
> observable, such as peaks of system load average.
>
> 			regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin