Re: dsa_allocate() failure

Sand Stone <sand.m.stone@xxxxxxxxx> · Wed, 23 May 2018 07:06:41 -0700

>> At which commit ID?
83fcc615020647268bb129cbf86f7661feee6412 (5/6)

>>do you mean that these were separate PostgreSQL clusters, and they were all running the same query and they all crashed like this?
A few worker nodes. The table is hash partitioned on "aTable.did" by
Citus, and further partitioned by PG10 by time range on field "ts". As
far as I can tell, Citus just does a query rewrite and executes the
same type of query on all nodes.

>>so this happened at the same time or at different times?
At the same time. The queries are simple count and sum queries. Here
is the relevant part from one of the worker nodes:
2018-05-23 01:24:01.492 UTC [130536] ERROR:  dsa_allocate could not
find 7 free pages
2018-05-23 01:24:01.492 UTC [130536] CONTEXT:  parallel worker
STATEMENT:  COPY (SELECT count(1) AS count, sum(worker_column_1) AS
sum FROM (SELECT subquery.avg AS worker_column_1 FROM (SELECT
aTable.did, avg((aTable.sum OPERATOR(pg_catalog./)
(aTable.count)::double precision)) AS avg FROM public.aTable_102117
aTable WHERE ((aTable.ts OPERATOR(pg_catalog.>=) '2018-04-25
00:00:00+00'::timestamp with time zone) AND (aTable.ts
OPERATOR(pg_catalog.<=) '2018-04-30 00:00:00+00'::timestamp with time
zone) AND (aTable.v OPERATOR(pg_catalog.=) 12345)) GROUP BY
aTable.did) subquery) worker_subquery) TO STDOUT WITH (FORMAT binary)
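
To answer the "regular backend or parallel worker" question across many
nodes, the log excerpt above could be summarized per process with a
small script. This is only a sketch: it assumes the unwrapped server log
uses the same prefix as the excerpt (timestamp, "UTC", PID in brackets),
and the function name summarize_dsa_errors is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed prefix, as in the excerpt: "<timestamp> UTC [<pid>] <LEVEL>:  <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) UTC "
    r"\[(?P<pid>\d+)\] (?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)
DSA_RE = re.compile(r"dsa_allocate could not find (\d+) free pages")


def summarize_dsa_errors(lines):
    """Group dsa_allocate errors by PID (hypothetical helper).

    Returns {pid: {"count": int, "pages": [int, ...], "parallel_worker": bool}}.
    """
    summary = defaultdict(
        lambda: {"count": 0, "pages": [], "parallel_worker": False}
    )
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # e.g. continuation lines of a multi-line STATEMENT
        pid = int(m.group("pid"))
        level = m.group("level")
        msg = m.group("msg")
        if level == "ERROR":
            hit = DSA_RE.search(msg)
            if hit:
                summary[pid]["count"] += 1
                summary[pid]["pages"].append(int(hit.group(1)))
        elif level == "CONTEXT" and "parallel worker" in msg:
            # Only flag PIDs that already reported a dsa_allocate error.
            if pid in summary:
                summary[pid]["parallel_worker"] = True
    return dict(summary)


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < postgresql.log
    for pid, info in sorted(summarize_dsa_errors(sys.stdin).items()):
        print(pid, info)
```

Run against each worker node's log, this would show whether every
failing PID carries the "parallel worker" context line.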

>> a parallel worker process
I think this is more of a PG10 parallel bg worker issue. I don't think
Citus just lets each worker PG server do its own planning.

I will run more experiments on this and see whether any specific
query causes the parallel query execution to fail. As far as I can
tell, the level of concurrency triggered this issue. That is, when
executing tens of queries like the one shown above on the worker
nodes, the PG10 core may or may not spawn more bg workers, depending
on the stats.

Thanks for your time!

On Tue, May 22, 2018 at 9:44 PM, Thomas Munro
<thomas.munro@xxxxxxxxxxxxxxxx> wrote:
> On Wed, May 23, 2018 at 4:10 PM, Sand Stone <sand.m.stone@xxxxxxxxx> wrote:
>>>>dsa_allocate could not find 7 free pages
>> I just saw this error message again on all of my worker nodes (I am using
>> Citus 7.4 rel). The PG core is my own build of release_10_stable
>> (10.4) out of GitHub on Ubuntu.
>
> At which commit ID?
>
> All of your worker nodes... so this happened at the same time or at
> different times?  I don't know much about Citus -- do you mean that
> these were separate PostgreSQL clusters, and they were all running the
> same query and they all crashed like this?
>
>> What's the best way to debug this? I am running pre-production tests
>> for the next few days, so I could gather info. if necessary (I cannot
>> pinpoint a query to repro this yet, as we have 10K queries running
>> concurrently).
>
> Any chance of an EXPLAIN plan for the query that crashed like this?
> Do you know if it's using multiple Gather[Merge] nodes and parallel
> bitmap heap scans?  Was it a regular backend process or a parallel
> worker process (or a Citus worker process, if that is a thing?) that
> raised the error?
>
> --
> Thomas Munro
> http://www.enterprisedb.com