Hi everyone,

On 28/10/2011 02:32, Tom Lane wrote:
> Sören Meyer-Eppler <soerenme@xxxxxxxxxx> writes:
>> A fairly complex insert query on an empty destination table will run for
>> an indefinite amount of time (we waited literally days for the query to
>> complete). This does not happen every time we run the query but often.
>> Now ordinarily I'd assume we did something wrong with our indices or
>> query, but the execution plan looks sane and, more tellingly, we have:
>> - no CPU load
>> - no network traffic
>> - no disk I/O
>> - no other load on the server except this single query
>> and strace displaying a seemingly endless list of lseek calls.
>
>> So my assumption is that we are not running into bad Big-O() runtime
>> behavior but rather into some locking problem.
>
> If it were blocked on a lock, it wouldn't be doing lseeks().
>
> The most obvious explanation for a backend that's doing repeated lseeks
> and nothing else is that it's repeatedly doing seqscans on a table
> that's fully cached in shared buffers. I'd wonder about queries
> embedded in often-executed plpgsql functions, for instance. Can you
> identify which table the lseeks are issued against?

I'm resuming this old thread as we've recently been hit twice by a similar issue, on both 9.1.1 and 9.1.2. Details follow the quoted part.

The first time we saw this behaviour was a few weeks ago, on 9.1.1. I decided not to post then, as I had messed things up with gdb while trying to gather more information. This time I've kept the query running, in case I need to provide more details.

So, here they are: Linux (Ubuntu Oneiric x64) on an EC2 instance (8 virtual cores, 68GB RAM). Storage is on EBS, using xfs as the filesystem. Currently running PostgreSQL 9.1.2, manually compiled.

A rather long import process has been stuck for ~12hrs on a query that normally returns in a few seconds or minutes. Just like the OP, there's almost no activity on the box, only a single postgres process at 100% CPU. Two separate (although similar) queries have triggered the issue, and they are run one after the other. The import also doesn't fail every time: I think it happened twice in about 10 runs. We're expected to go live in about two weeks, so these were only test runs.

The explain looks like this: http://explain.depesz.com/s/TqD

The strace output:

...
2011-12-28 09:33:59.506037500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.555167500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.604315500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.653482500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.676128500 lseek(67, 0, SEEK_END) = 65511424
2011-12-28 09:33:59.676287500 write(67, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
...
2011-12-28 09:34:04.935944500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:34:04.945279500 lseek(67, 0, SEEK_END) = 65519616
2011-12-28 09:34:04.945394500 write(67, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
...

so an lseek on fd 64 roughly every 50ms, and an lseek/write on fd 67 every 5s.

Lsof shows:

postgres 1991 postgres 64u REG 252,0 27623424 134250744 /data/postgresql/main/base/16387/44859
postgres 1991 postgres 67u REG 252,0 65732608 134250602 /data/postgresql/main/base/16387/44675

with 64 being the carts table and 67 cart_products, according to pg_class' relfilenode. All the tables involved are regular tables except log_conversion_item, which is unlogged.
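For reference, the fd-to-relation mapping above was done with something along these lines, 44859 and 44675 being the filenode numbers taken from the lsof paths (just a sketch, run while connected to the import database):

SELECT relfilenode, relname, relkind
FROM pg_class
WHERE relfilenode IN (44859, 44675);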
I'm also trying an EXPLAIN ANALYZE of the SELECT part, but it seems to take a while too and is seemingly calling only gettimeofday.

Here are the custom values in postgresql.conf:

max_connections = 200
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 48GB
work_mem = 128MB
wal_buffers = 16MB
checkpoint_segments = 64
shared_buffers = 8GB
synchronous_commit = off
random_page_cost = 2
archive_mode = off
wal_level = hot_standby
max_wal_senders = 2
wal_keep_segments = 256

Streaming replication was accidentally left enabled during the run, but I'm pretty sure it wasn't set up yet when we hit the previous failure.

I'll be happy to provide more information. For now we're going to disconnect the slave and use that for another test import run.

Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/
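P.S. In case it helps, this is the kind of check I can run against the stuck backend (pid 1991, from the lsof output above) to confirm it isn't simply waiting on a lock, as per the quoted remark. Just a sketch using the 9.1 catalog columns, no results claimed:

SELECT procpid, waiting, current_query
FROM pg_stat_activity
WHERE procpid = 1991;

SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE pid = 1991 AND NOT granted;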