Hi,
We have been running into a strange problem with the loading of large
objects on the production servers of a system we are about to roll out.
At various times (usually separated by a few days), the loading process
against the server will suddenly either take an extremely long time
(going from the usual 15-20 minutes up to 45 minutes or more) or simply
hang altogether (i.e., after 8+ hours we have not had the patience to
wait and see whether it eventually finishes). This is on an 8.3.3
server running on Debian Etch; most of the previous testing occurred on
8.1.10, where we did not encounter this problem (though we had other
issues on that platform).
I have run some tests loading large objects into a PostgreSQL database
(8.3.3) with a more rapid turnaround than in production (loading,
deleting, vacuuming, and then starting the load over again) and have
been able to provoke what seems to be the same issue there, usually
within 12 hours.
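For reference, the delete/vacuum part of that cycle looks roughly like
the following (a simplified libpqxx sketch; the test_blobs table and its
lo column are placeholder names rather than our actual schema, and the
loading part is sketched further below):

#include <pqxx/pqxx>

// Simplified sketch of the delete/vacuum half of the test cycle.
// "test_blobs" and its "lo" column are placeholder names.
void clean_up(pqxx::connection &conn)
{
    {
        pqxx::work txn(conn);
        // Unlink the large objects referenced by the table, then clear it.
        txn.exec("SELECT lo_unlink(lo) FROM test_blobs");
        txn.exec("DELETE FROM test_blobs");
        txn.commit();
    }
    // VACUUM cannot run inside a transaction block, hence the nontransaction.
    pqxx::nontransaction vac(conn);
    vac.exec("VACUUM ANALYZE");
}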
Essentially, we have a C++ program (using pqxx) loading a large number
of large objects (somewhere around 80,000) that exhibits strange
behaviour. It varies quite a lot in loading time (from 1600 to 2500
seconds), and at random times it will simply hang (at least it seems
so, given that it has not progressed at all after 7-8 hours). It
displays this behaviour irrespective of whether we load all 80,000
objects in one go or break the loading up into smaller consecutive
batches of 1,000 blobs at a time (though it does seem easier to provoke
the problem with the former approach). It can run fine for hours at a
time and then suddenly lock up (in the latest test I did, it loaded the
data correctly 22 times in a row before locking up on the 23rd
run-through).
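The core of the loading loop itself is essentially along these lines
(again simplified; the in-memory blob data and the placeholder
test_blobs table stand in for what the production loader actually does,
with one transaction committed per batch of 1,000 blobs):

#include <algorithm>
#include <cstddef>
#include <pqxx/pqxx>
#include <string>
#include <vector>

// Simplified sketch of the batch loader: each batch of blobs is written
// as large objects and committed in its own transaction. The table
// "test_blobs" is a placeholder for the real schema.
void load_blobs(pqxx::connection &conn, const std::vector<std::string> &blobs)
{
    const std::size_t batch = 1000;
    for (std::size_t i = 0; i < blobs.size(); i += batch)
    {
        pqxx::work txn(conn);
        const std::size_t end = std::min(blobs.size(), i + batch);
        for (std::size_t j = i; j < end; ++j)
        {
            pqxx::largeobjectaccess lo(txn);  // creates a new large object
            lo.write(blobs[j]);               // write the blob contents into it
            txn.exec("INSERT INTO test_blobs (lo) VALUES ("
                     + pqxx::to_string(lo.id()) + ")");
        }
        txn.commit();
    }
}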
The pattern seems to be identical each time; ps aux shows the process
(wdb) idle in transaction, with no apparent activity on the system:
postgres 19965  0.0  0.0  72628  5584 ?  S   Feb17  0:01 /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/postgresql/8.3/main -c config_file=/etc/postgresql/8.3/main/postgresql.conf
postgres 19967  0.4  0.0  72748 27636 ?  Ss  Feb17  6:38 postgres: writer process
postgres 19968  0.0  0.0  72628  1784 ?  Ss  Feb17  0:09 postgres: wal writer process
postgres 19969  0.0  0.0  72940  1924 ?  Ss  Feb17  0:00 postgres: autovacuum launcher process
postgres 19970  0.0  0.0  42352  1736 ?  Ss  Feb17  0:02 postgres: stats collector process
postgres 16617  0.0  0.0  95156 10948 ?  Ss  08:42  0:00 postgres: autovacuum worker process   wdb
postgres 18834  0.0  0.1 118632 37320 ?  Ss  09:06  0:01 postgres: wdb wdb [local] idle in transaction
A core dump of the program that we took earlier seemed to indicate that
it was waiting for the database to return from the insert statement.
top indicates that both postgres and the application are idle (no load
on the server whatsoever).
pg_locks and pg_stat_activity seem to indicate that there may be some
issue with autovacuum?
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |           mode           | granted
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------
 relation   |    16385 |     2683 |      |       |            |               |         |       |          | 2/1122117          | 16617 | RowExclusiveLock         | t
 relation   |    16385 |    10969 |      |       |            |               |         |       |          | 3/1105             | 27895 | AccessShareLock          | t
 relation   |    16385 |     2613 |      |       |            |               |         |       |          | 2/1122117          | 16617 | ShareUpdateExclusiveLock | t
 virtualxid |          |          |      |       | 3/1105     |               |         |       |          | 3/1105             | 27895 | ExclusiveLock            | t
 virtualxid |          |          |      |       | 2/1122117  |               |         |       |          | 2/1122117          | 16617 | ExclusiveLock            | t
(5 rows)
 datid | datname | procpid | usesysid | usename  |                    current_query                      | waiting |          xact_start           |          query_start          |         backend_start         | client_addr | client_port
-------+---------+---------+----------+----------+-------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------------+-------------
 16385 | wdb     |   16617 |       10 | postgres | autovacuum: VACUUM ANALYZE pg_catalog.pg_largeobject   | f       | 2009-02-18 08:42:45.066922+00 | 2009-02-18 08:42:45.066922+00 | 2009-02-18 08:42:45.065857+00 |             |
 16385 | wdb     |   27895 |    16384 | wdb      | select * from pg_stat_activity ;                        | f       | 2009-02-18 12:21:47.925813+00 | 2009-02-18 12:21:47.925813+00 | 2009-02-18 11:09:31.201507+00 |             |          -1
 16385 | wdb     |   18834 |    16384 | wdb      | <IDLE> in transaction                                   | f       |                               | 2009-02-18 09:06:26.479633+00 | 2009-02-18 09:06:23.443698+00 |             |          -1
(3 rows)
Any leads as to how to figure out the problem or alternatively how to
work around it would be much appreciated.
I noticed that the disk on the test machine seemed to be running low on
disk space, caused by an error in the FSM settings, but this was not
the case on the production systems, so I assume that this is not the
problem here either (I will fix it for future test runs, though).
Regards,
Michael Akinde