Just as a follow-up: I tried the parallel execution again (in a stress-test environment), and now the crash seems to be gone. I will keep an eye on this for the next few weeks.

My theory is that the Citus cluster created and shut down a lot of TCP connections between the coordinator and the workers. If running on untuned Linux machines, the TCP ports might run out. Of course, I am using "newer" PG10 bits and Citus 7.5 this time.
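For reference, this is the kind of Linux tuning I have in mind for the ephemeral-port theory. The values below are only illustrative, not recommendations; the right numbers depend on the machines:

    # how many ephemeral ports the kernel may hand out, and current socket/TIME_WAIT counts
    sysctl net.ipv4.ip_local_port_range
    ss -s

    # widen the ephemeral port range and allow reuse of TIME_WAIT sockets
    # for outgoing connections (coordinator -> workers)
    sysctl -w net.ipv4.ip_local_port_range="10240 65000"
    sysctl -w net.ipv4.tcp_tw_reuse=1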

On Wed, May 23, 2018 at 7:06 AM Sand Stone <sand.m.stone@xxxxxxxxx> wrote:
>
> >> At which commit ID?
> 83fcc615020647268bb129cbf86f7661feee6412 (5/6)
>
> >> do you mean that these were separate PostgreSQL clusters, and they were all running the same query and they all crashed like this?
> A few worker nodes; a table is hash partitioned by "aTable.did" by Citus, and further partitioned by PG10 by time range on field "ts". As far as I could tell, Citus just does a query rewrite and executes the same type of queries on all nodes.
>
> >> so this happened at the same time or at different times?
> At the same time. The queries are simple count and sum queries; here is the relevant part from one of the worker nodes:
>
> 2018-05-23 01:24:01.492 UTC [130536] ERROR:  dsa_allocate could not find 7 free pages
> 2018-05-23 01:24:01.492 UTC [130536] CONTEXT:  parallel worker
> STATEMENT:  COPY (SELECT count(1) AS count, sum(worker_column_1) AS sum FROM (SELECT subquery.avg AS worker_column_1 FROM (SELECT aTable.did, avg((aTable.sum OPERATOR(pg_catalog./) (aTable.count)::double precision)) AS avg FROM public.aTable_102117 aTable WHERE ((aTable.ts OPERATOR(pg_catalog.>=) '2018-04-25 00:00:00+00'::timestamp with time zone) AND (aTable.ts OPERATOR(pg_catalog.<=) '2018-04-30 00:00:00+00'::timestamp with time zone) AND (aTable.v OPERATOR(pg_catalog.=) 12345)) GROUP BY aTable.did) subquery) worker_subquery) TO STDOUT WITH (FORMAT binary)
>
> >> a parallel worker process
> I think this is more of a PG10 parallel bg worker issue. I don't think Citus just lets each worker PG server do its own planning.
>
> I will try to do more experiments about this and see if there is any specific query that causes the parallel query execution to fail. As far as I can tell, the level of concurrency triggered this issue. That is, executing 10s of queries like the one shown on the worker nodes; depending on the stats, the PG10 core may or may not spawn more bg workers.
>
> Thanks for your time!
>
> On Tue, May 22, 2018 at 9:44 PM, Thomas Munro <thomas.munro@xxxxxxxxxxxxxxxx> wrote:
> > On Wed, May 23, 2018 at 4:10 PM, Sand Stone <sand.m.stone@xxxxxxxxx> wrote:
> >>>> dsa_allocate could not find 7 free pages
> >> I just saw this error message again on all of my worker nodes (I am using the Citus 7.4 rel). The PG core is my own build of release_10_stable (10.4) out of GitHub, on Ubuntu.
> >
> > At which commit ID?
> >
> > All of your worker nodes... so this happened at the same time or at different times?  I don't know much about Citus -- do you mean that these were separate PostgreSQL clusters, and they were all running the same query and they all crashed like this?
> >
> >> What's the best way to debug this? I am running pre-production tests for the next few days, so I could gather info if necessary (I cannot pinpoint a query to repro this yet, as we have 10K queries running concurrently).
> >
> > Any chance of an EXPLAIN plan for the query that crashed like this?  Do you know if it's using multiple Gather[Merge] nodes and parallel bitmap heap scans?  Was it a regular backend process or a parallel worker process (or a Citus worker process, if that is a thing?) that raised the error?
> >
> > --
> > Thomas Munro
> > http://www.enterprisedb.com