Hi all,
We are operating a 1.5GB postgresql database for a year and we have
problems for nearly a month. Usually everything is OK with the database,
queries are executed fast even if they are complicated but sometimes and
for half an hour, we have a general slow down.
The server is a dedicated quad xeon with 4GB and a RAID1 array for the
system and a RAID10 one for postgresql data. We have very few
updates/inserts/deletes during the day.
Postgresql version is 7.4.8.
- the database is vacuumed, analyzed regularly (but we are not using
autovacuum) and usually performs pretty well ;
- IOs are OK, the database is entirely in RAM (see top.txt and
iostat.txt attached) ;
- CPUs are usually 25% idle, load is never really growing and its
maximum is below 5 ;
- I attached two plans for a simple query, the one is what we have when
the server is fast, the other when we have a slow down: it's exactly the
same plan but, as you can see it, the time to fetch the first row from
indexes is quite high for the slow query ;
- during this slow down, queries that usually take 500ms can take up to
60 seconds (and sometimes even more) ;
- we have up to 130 permanent connections from our apache servers during
this slow down as we have a lot of apache children waiting for a response.
I attached a vmstat output. Context switches are quite high but I'm not
sure it can be called a context switch storm and that this is the cause
of our problem.
Thanks for any advice or idea to help us understanding this problem and
hopefully solve it.
Regards,
--
Guillaume
[root@bd root]# iostat 10
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 7.20 0.00 92.00 0 920
sda1 0.00 0.00 0.00 0 0
sda2 6.40 0.00 78.40 0 784
sda3 0.00 0.00 0.00 0 0
sda4 0.00 0.00 0.00 0 0
sda5 0.00 0.00 0.00 0 0
sda6 0.80 0.00 13.60 0 136
sdb 5.00 0.00 165.60 0 1656
sdb1 5.00 0.00 165.60 0 1656
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 1.30 0.00 20.80 0 208
sda1 0.00 0.00 0.00 0 0
sda2 0.70 0.00 9.60 0 96
sda3 0.00 0.00 0.00 0 0
sda4 0.00 0.00 0.00 0 0
sda5 0.00 0.00 0.00 0 0
sda6 0.60 0.00 11.20 0 112
sdb 5.40 0.00 173.60 0 1736
sdb1 5.40 0.00 173.60 0 1736
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 2.20 0.00 28.00 0 280
sda1 0.00 0.00 0.00 0 0
sda2 2.20 0.00 28.00 0 280
sda3 0.00 0.00 0.00 0 0
sda4 0.00 0.00 0.00 0 0
sda5 0.00 0.00 0.00 0 0
sda6 0.00 0.00 0.00 0 0
sdb 5.20 0.00 171.20 0 1712
sdb1 5.20 0.00 171.20 0 1712
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..13.52 rows=2 width=1119) (actual time=0.154..0.167 rows=1 loops=1)
-> Index Scan using pk_newslang on newslang nl (cost=0.00..3.87 rows=1 width=1004) (actual time=0.053..0.055 rows=1 loops=1)
Index Cond: (((codelang)::text = 'FRA'::text) AND (3498704 = numnews))
-> Nested Loop Left Join (cost=0.00..9.61 rows=2 width=119) (actual time=0.050..0.059 rows=1 loops=1)
-> Index Scan using pk_news on news n (cost=0.00..3.31 rows=2 width=98) (actual time=0.021..0.023 rows=1 loops=1)
Index Cond: (numnews = 3498704)
-> Index Scan using pk_photo on photo p (cost=0.00..3.14 rows=1 width=25) (actual time=0.021..0.025 rows=1 loops=1)
Index Cond: (p.numphoto = "outer".numphoto)
Total runtime: 0.320 ms
(9 rows)
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..13.52 rows=2 width=1119) (actual time=155.286..155.305 rows=1 loops=1)
-> Index Scan using pk_newslang on newslang nl (cost=0.00..3.87 rows=1 width=1004) (actual time=44.575..44.579 rows=1 loops=1)
Index Cond: (((codelang)::text = 'FRA'::text) AND (3498704 = numnews))
-> Nested Loop Left Join (cost=0.00..9.61 rows=2 width=119) (actual time=110.648..110.660 rows=1 loops=1)
-> Index Scan using pk_news on news n (cost=0.00..3.31 rows=2 width=98) (actual time=0.169..0.174 rows=1 loops=1)
Index Cond: (numnews = 3498704)
-> Index Scan using pk_photo on photo p (cost=0.00..3.14 rows=1 width=25) (actual time=110.451..110.454 rows=1 loops=1)
Index Cond: (p.numphoto = "outer".numphoto)
Total runtime: 155.514 ms
(9 rows)
17:08:41 up 19 days, 15:16, 1 user, load average: 4.03, 4.26, 4.36
288 processes: 285 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 59.0% 0.0% 8.8% 0.2% 0.0% 0.0% 31.9%
cpu00 52.3% 0.0% 13.3% 0.9% 0.0% 0.0% 33.3%
cpu01 65.7% 0.0% 7.6% 0.0% 0.0% 0.0% 26.6%
cpu02 58.0% 0.0% 7.6% 0.0% 0.0% 0.0% 34.2%
cpu03 60.0% 0.0% 6.6% 0.0% 0.0% 0.0% 33.3%
Mem: 3857224k av, 3495880k used, 361344k free, 0k shrd, 92160k buff
2374048k actv, 463576k in_d, 37708k in_c
Swap: 4281272k av, 25412k used, 4255860k free 2173392k cached
[root@bd root]# vmstat 10
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
4 0 25412 250160 91736 2170944 0 1 3 3 4 3 5 3 3 0
7 0 25412 247160 91736 2171024 0 0 0 116 588 4483 65 4 31 0
3 0 25412 235456 91752 2171132 0 0 0 129 491 3670 70 4 26 0
5 0 25412 233696 91760 2171216 0 0 0 152 530 4768 61 4 34 0
5 0 25412 233248 91768 2171232 0 0 0 183 624 5379 59 5 36 0
9 0 25412 195332 91788 2171304 0 0 0 127 541 4811 58 5 37 0
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match