Thanks, Tom.

The system is running 16 cores (dual E5620s) and 96GB of RAM. Nothing changed on the system when this started happening - at least nothing in the configuration; it has not been touched since it was set up. Load averages are between 6 and 8 (all three of them). Memory-wise I do not see a problem - there is plenty of free memory, and only 120K of swap has been used. The only thing I see when I run iostat is a burst of load on the drives every so often. mfid0 holds the database, mfid1 holds pg_xlog.
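To make it easier to line those bursts up against the "pgstat wait timeout" entries in the PostgreSQL log, I can prepend a timestamp to each sample with a small wrapper - just a sketch, the interval and output path are arbitrary:

    # stamp every iostat output line with the wall-clock time so disk
    # spikes can be matched against timestamps in the PostgreSQL log
    iostat -x 5 | while IFS= read -r line; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') $line"
    done >> /var/tmp/iostat-stamped.log

Here is a representative (untimestamped) run: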
db1# iostat -x 5
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    1.6   0
ad6        0.0      0.0      0.0      0.0     0    0.8   0
mfid0     11.6    172.2    171.7   2988.5     0  178.1   6
mfid1      0.1    734.0      1.1   5898.8     0    0.2   4
ar0        0.3      0.7      1.0     10.3     0  114.7   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      2.8     73.4     22.4   1961.9     0    0.2   2
mfid1      0.0   1784.9      0.0  14303.2     0    0.0   5
ar0        0.0      1.0      0.0      6.8     0    0.8   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      2.2    652.7     44.8   7671.1  2053   99.6  10
mfid1      0.0   1525.9      0.0  12207.6     1    0.1  12
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.0   2202.4      8.0  19776.3     0  632.0  68
mfid1      0.0    574.3      0.0   4666.4     0    0.5  23
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.8     75.8     14.4   2242.8     0    0.3   2
mfid1      0.0   1722.5      0.0  13802.0     0    0.0   5
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.4     71.4     11.2   1991.6     0    0.2   1
mfid1      0.0   1708.9      0.0  13671.4     0    0.0   4
ar0        0.2      2.4      3.2     38.4     0    2.5   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.0     78.0      8.0   2222.5     0    0.1   1
mfid1      0.0   1541.0      0.0  12352.1     0    0.0   5
ar0        0.0      0.6      0.0      9.6     0    0.6   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.4     56.6     11.2   1629.1     0    0.3   2
mfid1      0.0   1558.3      0.0  12468.3     0    0.0   6
ar0        0.0      0.6      0.0      3.2     0    0.5   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.8     72.2     14.4   1706.7     0    0.2   1
mfid1      0.0   1696.5      0.0  13596.2     0    0.0   5
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.0   1757.7      0.0  16093.5   351  648.0  48
mfid1      0.0    503.9      0.0   4031.3     0    0.2  10
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.0    125.0      0.0   2542.5     0  404.4   1
mfid1      0.0   1743.9      0.0  13951.1     0    0.1   5
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.2     73.4      9.6   2122.6     0    0.1   1
mfid1      0.0   1624.3      0.0  13020.0     0    0.0   4
ar0        0.0      2.0      0.0     32.0     0    1.8   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.2     67.4      1.6   1955.8     0    0.1   0
mfid1      0.0   1334.5      0.0  10699.7     0    0.0   5
ar0        0.0      0.4      0.0      6.4     0    0.5   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.0     64.6      8.0   1789.1     0    0.2   1
mfid1      0.0   1541.0      0.0  12352.1     0    0.0   5
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.0     78.4      8.0   1855.3     0    0.2   1
mfid1      0.0   1652.1      0.0  13217.0     0    0.0   5
ar0        0.0      0.8      0.0      6.4     0    0.7   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.6    691.6      4.8   7919.5  3461   82.4  11
mfid1      0.0   1477.7      0.0  11821.7     1    0.3  15
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.4   1346.2      3.2  12558.1     0  855.1  41
mfid1      0.0   1020.8      0.0   8190.6     0    0.5  39
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.0     62.2      8.0   1792.6     0    0.3   2
mfid1      0.0   1636.3      0.0  13112.4     0    0.0   4
ar0        0.0      0.8      0.0     12.8     0    0.8   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.2     63.0      1.6   1818.1     0    0.1   0
mfid1      0.0   1710.6      0.0  13684.6     0    0.0   5
ar0        0.0      1.2      0.0     19.2     0    1.2   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.8     83.8      6.4   2330.6     0    0.1   1
mfid1      0.0   1218.8      0.0   9750.2     0    0.0   4
ar0        0.0      0.4      0.0      6.4     0    0.5   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.0     68.6      0.0   1655.7     0    0.1   0
mfid1      0.0   1708.9      0.0  13671.4     0    0.0   5
ar0        0.0      0.6      0.0      3.2     0    0.5   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.0     71.6      0.0   1999.4     0    0.0   0
mfid1      0.0   1071.3      0.0   8594.1     0    0.1   3
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.4   1373.6      3.2  12956.5     0  550.7  35
mfid1      0.0    999.8      0.0   8022.7     0    0.8  37
ar0        0.0      0.0      0.0      0.0     0    0.0   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      0.8     68.8      8.0   1908.5     0    0.2   1
mfid1      0.0   1370.5      0.0  10962.3     0    0.0   3
ar0        0.0      0.6      0.0      9.6     0    0.6   0
                        extended device statistics
device     r/s      w/s     kr/s     kw/s  wait  svc_t  %b
ad4        0.0      0.0      0.0      0.0     0    0.0   0
ad6        0.0      0.0      0.0      0.0     0    0.0   0
mfid0      1.2     63.8      9.6   1912.1     0    0.3   2
mfid1      0.0   1355.8      0.0  10846.2     0    0.0   3
ar0        0.0      1.0      0.0     16.0     0    1.1   0

Looking at /var/log/messages, everything appears to be ok. The only thing I found, but it was from over a month ago, was the following:

Nov 17 06:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 06:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 06:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 06:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 06:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 06:35:03 db1 kernel: MCA: Misc 0x4701041000041280
Nov 17 07:35:03 db1 kernel: MCA: Bank 8, Status 0x8c0000400001009f
Nov 17 07:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 07:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 07:35:03 db1 kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
Nov 17 07:35:03 db1 kernel: MCA: Address 0x84aa98000
Nov 17 07:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 17 09:35:03 db1 kernel: MCA: Bank 8, Status 0x88000040000200cf
Nov 17 09:35:03 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 17 09:35:03 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 17 09:35:03 db1 kernel: MCA: CPU 0 COR (1) MS channel ?? memory error
Nov 17 09:35:03 db1 kernel: MCA: Misc 0x4701041000046141
Nov 18 13:34:59 db1 kernel: MCA: Bank 8, Status 0xcc0000800001009f
Nov 18 13:34:59 db1 kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
Nov 18 13:34:59 db1 kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
Nov 18 13:34:59 db1 kernel: MCA: CPU 0 COR (2) OVER RD channel ?? memory error
Nov 18 13:34:59 db1 kernel: MCA: Address 0x84aa98000
Nov 18 13:34:59 db1 kernel: MCA: Misc 0x4701041000044707

But it has not reoccurred.
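On your suggestion about query duration logging: I will enable it and try to correlate the slow responses with the pgstat timeouts and the disk bursts above. Roughly what I have in mind for postgresql.conf - the threshold is only a first guess:

    log_min_duration_statement = 250   # ms; log any statement slower than this
    log_line_prefix = '%m %p '         # millisecond timestamp + PID, to match against iostat samples
    log_checkpoints = on               # in case the periodic write bursts are checkpoints

followed by "pg_ctl reload" (or SELECT pg_reload_conf()) to pick the settings up without a restart.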
> -----Original Message-----
> From: Tom Lane [mailto:tgl@xxxxxxxxxxxxx]
> Sent: Thursday, December 22, 2011 10:55 AM
> To: Benjamin Krajmalnik
> Cc: pgsql-admin
> Subject: Re: Problem with pgstat timneouts
>
> "Benjamin Krajmalnik" <kraj@xxxxxxxxxxxxx> writes:
> > About a month ago, I started receiving quite a few pgstat timeouts on
> > my production database.
>
> > PostgreSQL 9.0.3 on amd64-portbld-freebsd8.1, compiled by GCC cc (GCC)
> > 4.2.1 20070719 [FreeBSD], 64-bit
>
> > I am not sure where to start looking for the cause.
>
> Yeah, so are the rest of us :-(.  People have been reporting sometimes
> seeing these messages without any clear cause.  We've been unable to
> find any bug in Postgres that would explain them, and the obvious
> explanation of "your system is so overloaded that the stats collector
> is too slow to respond" isn't very satisfying if response time seems
> snappy otherwise.  I think we have seen one case where the problem was
> provably caused by significant (multi-second) clock skew between
> different processors, but that's likely not applicable to many people.
>
> Can you reconstruct what changed in your system around the time you
> started seeing these?
>
> > It appears that these timeouts are affecting the performance of the
> > application - sometimes it hangs for a few seconds before data gets
> > returned, which was not the case before this.
>
> Usually these timeouts are reported by autovacuum, and so should not be
> affecting the performance of foreground queries, at least not queries
> that are not explicitly requesting pgstats data.  The fact that your
> log messages don't include any query context reinforces the idea that
> they are coming from autovacuum for you too.  So if you are seeing
> query response time problems, maybe the pgstat timeouts are another
> symptom, not the cause, of an underlying too-much-load problem.  You
> might try turning on query duration logging and see if you can
> correlate response-time glitches and pgstat timeouts to anything else
> observable, such as peaks of system load average.
>
> 			regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin