>> At which commit ID?

83fcc615020647268bb129cbf86f7661feee6412 (5/6)

>> do you mean that these were separate PostgreSQL clusters, and they were all running the same query and they all crashed like this?

A few worker nodes; a table is hash partitioned by "aTable.did" by Citus, and further partitioned by PG10 by time range on field "ts". As far as I could tell, Citus just does a query rewrite and executes the same type of query on all nodes.

>> so this happened at the same time or at different times?

At the same time. The queries are simple count and sum queries; here is the relevant part from one of the worker nodes:

2018-05-23 01:24:01.492 UTC [130536] ERROR: dsa_allocate could not find 7 free pages
2018-05-23 01:24:01.492 UTC [130536] CONTEXT: parallel worker
STATEMENT: COPY (SELECT count(1) AS count, sum(worker_column_1) AS sum FROM (SELECT subquery.avg AS worker_column_1 FROM (SELECT aTable.did, avg((aTable.sum OPERATOR(pg_catalog./) (aTable.count)::double precision)) AS avg FROM public.aTable_102117 aTable WHERE ((aTable.ts OPERATOR(pg_catalog.>=) '2018-04-25 00:00:00+00'::timestamp with time zone) AND (aTable.ts OPERATOR(pg_catalog.<=) '2018-04-30 00:00:00+00'::timestamp with time zone) AND (aTable.v OPERATOR(pg_catalog.=) 12345)) GROUP BY aTable.did) subquery) worker_subquery) TO STDOUT WITH (FORMAT binary)

>> a parallel worker process

I think this is more of a PG10 parallel bg worker issue. I don't think Citus just lets each worker PG server do its own planning.

I will try to do more experiments on this, and see whether any specific query causes the parallel query execution to fail. As far as I can tell, the level of concurrency triggered this issue. That is, when executing tens of queries like the one shown above on the worker nodes, the PG10 core may or may not spawn more bg workers, depending on the stats.

Thanks for your time!

On Tue, May 22, 2018 at 9:44 PM, Thomas Munro <thomas.munro@xxxxxxxxxxxxxxxx> wrote:
> On Wed, May 23, 2018 at 4:10 PM, Sand Stone <sand.m.stone@xxxxxxxxx> wrote:
>>>>dsa_allocate could not find 7 free pages
>> I just this error message again on all of my worker nodes (I am using
>> Citus 7.4 rel). The PG core is my own build of release_10_stable
>> (10.4) out of GitHub on Ubuntu.
>
> At which commit ID?
>
> All of your worker nodes... so this happened at the same time or at
> different times? I don't know much about Citus -- do you mean that
> these were separate PostgreSQL clusters, and they were all running the
> same query and they all crashed like this?
>
>> What's the best way to debug this? I am running pre-production tests
>> for the next few days, so I could gather info. if necessary (I cannot
>> pinpoint a query to repro this yet, as we have 10K queries running
>> concurrently).
>
> Any chance of an EXPLAIN plan for the query that crashed like this?
> Do you know if it's using multiple Gather[Merge] nodes and parallel
> bitmap heap scans? Was it a regular backend process or a parallel
> worker process (or a Citus worker process, if that is a thing?) that
> raised the error?
>
> --
> Thomas Munro
> http://www.enterprisedb.com