Hi everyone,

On 28/10/2011 02:32, Tom Lane wrote:
> Sören Meyer-Eppler <soerenme@xxxxxxxxxx> writes:
>> A fairly complex insert query on an empty destination table will run for
>> an indefinite amount of time (we waited literally days for the query to
>> complete). This does not happen every time we run the query but often.
>> Now ordinarily I'd assume we did something wrong with our indices or
>> query, but the execution plan looks sane and, more tellingly, we have:
>> - no CPU load
>> - no network traffic
>> - no disk I/O
>> - no other load on the server except this single query
>> and strace displaying a seemingly endless list of lseek calls.
>
>> So my assumption is that we are not running into bad Big-O() runtime
>> behavior but rather into some locking problem.
>
> If it were blocked on a lock, it wouldn't be doing lseeks().
>
> The most obvious explanation for a backend that's doing repeated lseeks
> and nothing else is that it's repeatedly doing seqscans on a table
> that's fully cached in shared buffers. I'd wonder about queries
> embedded in often-executed plpgsql functions, for instance. Can you
> identify which table the lseeks are issued against?

I'm resuming this old thread as we've recently been hit twice by a similar issue, on both 9.1.1 and 9.1.2. Details follow the quoted part.

The first time we saw this behaviour was a few weeks ago, on 9.1.1. I decided not to post then, as I had messed things up with gdb while trying to gather more information. This time I've kept the query running, in case I need to provide more details.

So, here they are: Linux (Ubuntu Oneiric x64) on an EC2 instance (8 virtual cores, 68GB RAM). Storage is on EBS, using xfs as the filesystem. Currently running PostgreSQL 9.1.2, manually compiled.

A rather long import process has been stuck for ~12hrs on a query that normally returns in a few seconds or minutes. Just like the OP, there's almost no activity on the box, only a single postgres process at 100% CPU. Two separate (although similar) queries have triggered the issue, and they are run one after the other. The import also doesn't fail every time: I think it happened twice in about 10 runs. We're expected to go live in about two weeks, so these were only test runs.

The explain looks like this: http://explain.depesz.com/s/TqD

The strace output:

...
2011-12-28 09:33:59.506037500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.555167500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.604315500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.653482500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.676128500 lseek(67, 0, SEEK_END) = 65511424
2011-12-28 09:33:59.676287500 write(67, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
...
2011-12-28 09:34:04.935944500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:34:04.945279500 lseek(67, 0, SEEK_END) = 65519616
2011-12-28 09:34:04.945394500 write(67, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
...

so an lseek on fd 64 roughly every 50ms, and an lseek/write on fd 67 every 5s.

Lsof shows:

postgres 1991 postgres 64u REG 252,0 27623424 134250744 /data/postgresql/main/base/16387/44859
postgres 1991 postgres 67u REG 252,0 65732608 134250602 /data/postgresql/main/base/16387/44675

with 64 being the carts table and 67 cart_products, according to pg_class' relfilenode. All the tables involved are regular tables except log_conversion_item, which is unlogged.
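For reference, the fd-to-relation mapping above was done with something along these lines, 44859 and 44675 being the filenode numbers taken from the lsof paths (just a sketch, run while connected to the import database):

SELECT relfilenode, relname, relkind
FROM pg_class
WHERE relfilenode IN (44859, 44675);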
I'm also trying an EXPLAIN ANALYZE of the SELECT part, but it seems to take a while too and is seemingly calling only gettimeofday.

Here are the custom values in postgresql.conf:

max_connections = 200
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 48GB
work_mem = 128MB
wal_buffers = 16MB
checkpoint_segments = 64
shared_buffers = 8GB
synchronous_commit = off
random_page_cost = 2
archive_mode = off
wal_level = hot_standby
max_wal_senders = 2
wal_keep_segments = 256

Streaming replication was accidentally left enabled during the run, but I'm pretty sure it wasn't set up yet when we hit the previous failure.

I'll be happy to provide more information. For now we're going to disconnect the slave and use that for another test import run.

Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/
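P.S. In case it helps, this is the kind of check I can run against the stuck backend (pid 1991, from the lsof output above) to confirm it isn't simply waiting on a lock, as per the quoted remark. Just a sketch using the 9.1 catalog columns, no results claimed:

SELECT procpid, waiting, current_query
FROM pg_stat_activity
WHERE procpid = 1991;

SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE pid = 1991 AND NOT granted;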