Just as a follow-up: I tried the parallel execution again (in a stress-test environment), and now the crash seems to be gone. I will keep an eye on this for the next few weeks.

My theory is that the Citus cluster created and shut down a lot of TCP connections between the coordinator and the workers. If running on untuned Linux machines, the TCP ports might run out. Of course, I am using "newer" PG10 bits and Citus 7.5 this time.
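For reference, this is the kind of Linux tuning I have in mind for the ephemeral-port theory. The values below are only illustrative, not recommendations; the right numbers depend on the machines:

    # how many ephemeral ports the kernel may hand out, and current socket/TIME_WAIT counts
    sysctl net.ipv4.ip_local_port_range
    ss -s

    # widen the ephemeral port range and allow reuse of TIME_WAIT sockets
    # for outgoing connections (coordinator -> workers)
    sysctl -w net.ipv4.ip_local_port_range="10240 65000"
    sysctl -w net.ipv4.tcp_tw_reuse=1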

On Wed, May 23, 2018 at 7:06 AM Sand Stone <sand.m.stone@xxxxxxxxx> wrote:
>
> >> At which commit ID?
> 83fcc615020647268bb129cbf86f7661feee6412 (5/6)
>
> >> do you mean that these were separate PostgreSQL clusters, and they were all running the same query and they all crashed like this?
> A few worker nodes; a table is hash partitioned by "aTable.did" by Citus, and further partitioned by PG10 by time range on field "ts". As far as I could tell, Citus just does a query rewrite and executes the same type of queries on all nodes.
>
> >> so this happened at the same time or at different times?
> At the same time. The queries are simple count and sum queries; here is the relevant part from one of the worker nodes:
>
> 2018-05-23 01:24:01.492 UTC [130536] ERROR:  dsa_allocate could not find 7 free pages
> 2018-05-23 01:24:01.492 UTC [130536] CONTEXT:  parallel worker
> STATEMENT:  COPY (SELECT count(1) AS count, sum(worker_column_1) AS sum FROM (SELECT subquery.avg AS worker_column_1 FROM (SELECT aTable.did, avg((aTable.sum OPERATOR(pg_catalog./) (aTable.count)::double precision)) AS avg FROM public.aTable_102117 aTable WHERE ((aTable.ts OPERATOR(pg_catalog.>=) '2018-04-25 00:00:00+00'::timestamp with time zone) AND (aTable.ts OPERATOR(pg_catalog.<=) '2018-04-30 00:00:00+00'::timestamp with time zone) AND (aTable.v OPERATOR(pg_catalog.=) 12345)) GROUP BY aTable.did) subquery) worker_subquery) TO STDOUT WITH (FORMAT binary)
>
> >> a parallel worker process
> I think this is more of a PG10 parallel bg worker issue. I don't think Citus just lets each worker PG server do its own planning.
>
> I will try to do more experiments about this and see if there is any specific query that causes the parallel query execution to fail. As far as I can tell, the level of concurrency triggered this issue. That is, executing 10s of queries like the one shown on the worker nodes; depending on the stats, the PG10 core may or may not spawn more bg workers.
>
> Thanks for your time!
>
> On Tue, May 22, 2018 at 9:44 PM, Thomas Munro <thomas.munro@xxxxxxxxxxxxxxxx> wrote:
> > On Wed, May 23, 2018 at 4:10 PM, Sand Stone <sand.m.stone@xxxxxxxxx> wrote:
> >>>> dsa_allocate could not find 7 free pages
> >> I just saw this error message again on all of my worker nodes (I am using the Citus 7.4 rel). The PG core is my own build of release_10_stable (10.4) out of GitHub, on Ubuntu.
> >
> > At which commit ID?
> >
> > All of your worker nodes... so this happened at the same time or at different times?  I don't know much about Citus -- do you mean that these were separate PostgreSQL clusters, and they were all running the same query and they all crashed like this?
> >
> >> What's the best way to debug this? I am running pre-production tests for the next few days, so I could gather info if necessary (I cannot pinpoint a query to repro this yet, as we have 10K queries running concurrently).
> >
> > Any chance of an EXPLAIN plan for the query that crashed like this?  Do you know if it's using multiple Gather[Merge] nodes and parallel bitmap heap scans?  Was it a regular backend process or a parallel worker process (or a Citus worker process, if that is a thing?) that raised the error?
> >
> > --
> > Thomas Munro
> > http://www.enterprisedb.com