Hi, I have a weird slow-query issue I can't seem to find the cause of, so I'm hoping someone here can shed some light on it.

Context: I have an application which (among other things) implements a sort of job queue using PostgreSQL for persistence. While I know an RDBMS is not necessarily the ideal tool for this, there are some constraints and requirements which make including messaging/queueing middleware less than trivial. To reduce locking issues we use advisory locks and CTE queries to select jobs from the queue (inspired by https://github.com/chanks/que/blob/master/lib/que/sql.rb), as well as some explicit or optimistic locking during job processing. As the volume of jobs to process isn't extremely high, performance seems very good - except for the issue I'm describing below.

The problem: most queries execute fast, but sometimes queries on the job table (which contains the job queue) take exactly 122 seconds (+0-50 ms) to execute, for no clear reason. I have set log_min_duration_statement to 200 ms, and the only queries that get logged are these 2-minute queries. I have also enabled log_lock_waits, but no lock waits are logged. While extremely high performance is not expected, a query taking two minutes is way too much - especially since other concurrent queries seem to complete perfectly on time. As a reference: in a time span of less than 4 minutes the application processes a test load of 2000+ jobs going through 3 stages in the job-processing 'pipeline', each stage using the CTE query to pick up a job. That's 6000+ executions of the CTE plus several thousand executions of other queries, of which only a few (usually 1-5 per test run) inexplicably take two minutes to complete. The offending queries seem to be quite random: SELECT ... FOR UPDATE queries, CTE queries, and sometimes even regular SELECTs on our job table take inexplicably long.
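For illustration, the core of the advisory-lock dequeue pattern we borrowed looks roughly like this - a simplified sketch, not our exact production query (the real one is a recursive CTE over a job_priority view, included in the log excerpt further below):

```sql
-- Simplified sketch of the advisory-lock dequeue pattern (not the exact
-- production query). pg_try_advisory_lock() never blocks: it returns true
-- if this session claimed the lock, false if another worker holds it.
WITH candidate AS (
    SELECT id, pg_try_advisory_lock(id) AS locked
    FROM job
    WHERE status = 'RECEIVED'
    ORDER BY received ASC
    LIMIT 1
)
SELECT id FROM candidate WHERE locked;
```

If the candidate row is already locked by another worker, no row is returned and the caller simply retries; the real recursive query instead walks on to the next candidate within the same statement.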
PostgreSQL version: PostgreSQL 9.3.4, compiled by Visual C++ build 1600, 64-bit

How PostgreSQL was installed: through the GUI installer from EnterpriseDB

Changes made to the settings in the postgresql.conf file:

            name             |   current_setting    |        source
-----------------------------+----------------------+----------------------
 application_name            | psql                 | client
 client_encoding             | WIN1252              | client
 DateStyle                   | ISO, DMY             | configuration file
 default_text_search_config  | pg_catalog.english   | configuration file
 lc_messages                 | English_Ireland.1252 | configuration file
 lc_monetary                 | English_Ireland.1252 | configuration file
 lc_numeric                  | English_Ireland.1252 | configuration file
 lc_time                     | English_Ireland.1252 | configuration file
 listen_addresses            | *                    | configuration file
 log_autovacuum_min_duration | 100ms                | configuration file
 log_destination             | stderr               | configuration file
 log_line_prefix             | %m                   | configuration file
 log_lock_waits              | on                   | configuration file
 log_min_duration_statement  | 200ms                | configuration file
 log_timezone                | Europe/Brussels      | configuration file
 logging_collector           | on                   | configuration file
 max_connections             | 100                  | configuration file
 max_stack_depth             | 2MB                  | environment variable
 port                        | 5432                 | configuration file
 shared_buffers              | 128MB                | configuration file
 tcp_keepalives_count        | 0                    | configuration file
 tcp_keepalives_idle         | 30                   | configuration file
 tcp_keepalives_interval     | 30                   | configuration file
 TimeZone                    | Europe/Brussels      | configuration file

Operating system and version: Windows 8.1 64-bit (version 6.3 build 9600)

Program used to connect to PostgreSQL: custom Java application using Hibernate, HikariCP and the pgjdbc-ng 0.6 driver. The application runs on the same machine as the PostgreSQL database (at least for now, during development).

Relevant or unusual information in the PostgreSQL server logs?: aside from the slow query log, there is nothing special in the logs, I think.
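For reference, a listing like the one above can be produced with a catalog query along these lines (a sketch; the exact filter I used may have differed slightly):

```sql
-- List settings that do not come from built-in defaults,
-- similar to the table above (run in psql).
SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override')
ORDER BY lower(name);
```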
For example, this is the entirety of what was logged during one test run. I have shortened the select query a bit (replacing a list of fields with *) for readability, as Hibernate doesn't produce the cleanest queries:

2016-06-27 22:53:06.764 CEST LOG:  database system was shut down at 2016-06-27 22:53:04 CEST
2016-06-27 22:53:06.892 CEST LOG:  database system is ready to accept connections
2016-06-27 22:53:07.039 CEST LOG:  autovacuum launcher started
2016-06-27 22:55:07.764 CEST LOG:  automatic analyze of table "app.public.job" system usage: CPU 0.00s/0.06u sec elapsed 0.20 sec
2016-06-27 22:57:09.570 CEST LOG:  duration: 122006.000 ms  bind cached-1453392550: select job.* from job inner join project on job.project_id = project.id inner join category on project.category_id = category.id inner join type on job.type_id = type.id where job.id=$1
2016-06-27 22:57:09.570 CEST DETAIL:  parameters: $1 = '34309'
2016-06-27 22:57:09.572 CEST LOG:  duration: 122011.000 ms  bind cached--195714239:
    WITH RECURSIVE jobs AS (
        SELECT (i).*, pg_try_advisory_lock((i).id) AS locked
        FROM (SELECT i FROM job_priority AS i
              WHERE status = $1
              ORDER BY priority DESC, received ASC
              LIMIT 1) AS i1
        UNION ALL (
            SELECT (i).*, pg_try_advisory_lock((i).id) AS locked
            FROM (SELECT (SELECT i FROM job_priority AS i
                          WHERE status = $2
                            AND (-priority, received) > (-jobs.priority, jobs.received)
                          ORDER BY priority DESC, received ASC
                          LIMIT 1) AS i
                  FROM jobs
                  WHERE jobs.id IS NOT NULL
                  LIMIT 1) AS i1
        )
    )
    SELECT id, path, status, received, attempts, last_modified, version
    FROM jobs
    WHERE locked
    LIMIT 1
2016-06-27 22:57:09.572 CEST DETAIL:  parameters: $1 = 'RECEIVED', $2 = 'RECEIVED'
2016-06-27 22:58:07.332 CEST LOG:  automatic analyze of table "app.public.job" system usage: CPU 0.00s/0.06u sec elapsed 0.11 sec
2016-06-27 22:59:07.693 CEST LOG:  automatic vacuum of table "app.public.job": index scans: 1
	pages: 0 removed, 140 remain
	tuples: 136 removed, 2391 remain
	buffer usage: 546 hits, 0 misses, 209 dirtied
	avg read
rate: 0.000 MB/s, avg write rate: 3.388 MB/s
	system usage: CPU 0.00s/0.00u sec elapsed 0.48 sec

Full table and index schema:

CREATE TABLE job (
    id            INTEGER PRIMARY KEY,
    a             VARCHAR(64) NOT NULL,
    path          VARCHAR(512) NOT NULL UNIQUE,
    status        VARCHAR(20) NOT NULL,
    project_id    INTEGER NOT NULL REFERENCES project(id),
    type_id       INTEGER NOT NULL REFERENCES type(id),
    b             INTEGER NOT NULL,
    c             INTEGER NOT NULL,
    d             VARCHAR(32) NOT NULL,
    received      TIMESTAMP NOT NULL,
    done          TIMESTAMP,
    e             VARCHAR(512),
    error         VARCHAR(512),
    error_time    TIMESTAMP,
    attempts      INTEGER NOT NULL,
    last_modified TIMESTAMP,
    version       INTEGER
);

\d+ output:

    Column     |            Type             | Modifiers | Storage  | Stats target | Description
---------------+-----------------------------+-----------+----------+--------------+-------------
 id            | integer                     | not null  | plain    |              |
 a             | character varying(64)       | not null  | extended |              |
 path          | character varying(512)      | not null  | extended |              |
 status        | character varying(20)       | not null  | extended |              |
 project_id    | integer                     | not null  | plain    |              |
 type_id       | integer                     | not null  | plain    |              |
 b             | integer                     | not null  | plain    |              |
 c             | integer                     | not null  | plain    |              |
 d             | character varying(32)       | not null  | extended |              |
 received      | timestamp without time zone | not null  | plain    |              |
 done          | timestamp without time zone |           | plain    |              |
 e             | character varying(512)      |           | extended |              |
 error         | character varying(512)      |           | extended |              |
 error_time    | timestamp without time zone |           | plain    |              |
 attempts      | integer                     | not null  | plain    |              |
 last_modified | timestamp without time zone |           | plain    |              |
 version       | integer                     |           | plain    |              |
Indexes:
    "job_pkey" PRIMARY KEY, btree (id)
    "job_path_key" UNIQUE CONSTRAINT, btree (path)
Foreign-key constraints:
    "job_project_id_fkey" FOREIGN KEY (project_id) REFERENCES project(id)
    "job_type_id_fkey" FOREIGN KEY (type_id) REFERENCES type(id)
Has OIDs: no

Additional information on the table:
- a few columns contain mostly NULLs (i.e.
a column indicating any error messages which may occur during processing of the job)
- Due to the nature of the application the table gets a high number of inserts/updates in a short time. As mentioned above, in the test in which I observe this issue the job table grows from 0 to 2000-2500 rows very quickly, with each row receiving on average 4 more updates.
- Note: the CTE selects from a job_priority view, which is a JOIN of 3 tables (job - project - category), with categories having a priority that impacts job processing order. The test case I am currently using comprises only 5 projects and a single category.

Explain: unfortunately it's not always the same query that causes issues. If you want I can still run the most common queries through EXPLAIN ANALYZE, but I haven't included that output here as this email is already long enough...

History: this is during development of an application, so there is no production history.

Hardware: i7 CPU, 15 GB of RAM, SSD. This is a development machine, so there's lots of other stuff running too (but no other databases in active use).

Things I tried:
- Upgrading to PostgreSQL 9.5.3, compiled by Visual C++ build 1800, 64-bit. This did not solve the problem; queries still take 122 seconds from time to time.
- Enabling auto_explain to obtain more information. Unfortunately, no plan is logged for the 2-minute queries. If I manually force a query to take a long time (e.g. SELECT pg_sleep(5)), or if I set auto_explain.log_min_duration low enough, plans are logged for slow queries, _except_ for these 2-minute mysteries. This makes it very hard to see exactly why the query takes so long. Is this a bug in auto_explain? Are cached plans never logged (the documentation doesn't indicate this)? Some other reason why no plan would be logged?
Note that this happens both in PostgreSQL 9.3.4 and 9.5.3.
- Running some queries through EXPLAIN ANALYZE; however, these execute quickly (a few ms) and don't seem to indicate a clear problem to me.

When I run my test there is very high I/O on the system for a few minutes. However, I am reluctant to point to that as the sole cause, for the following reasons:
- other concurrent queries execute fast (it's a multi-threaded application, and other threads continue normally)
- if I look at the resource monitor, some queries remain hung up even after disk I/O has quieted down
- if I/O performance were the cause, I would expect to see a lot more variance; as it is, the queries always take exactly 122 s (with only about 50 ms of variance)

Given the reasonably small dataset (a pg_dump of the full database containing about 2500 jobs is less than 1 MB), I would think the whole database fits in memory anyway, which makes this issue all the more puzzling.

Have I missed something obvious?

Best regards,
Roel

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance