>> dsa_allocate could not find 7 free pages

I just got this error message again on all of my worker nodes (I am using
Citus 7.4 rel). The PG core is my own build of release_10_stable (10.4)
out of GitHub on Ubuntu.

What's the best way to debug this? I am running pre-production tests for
the next few days, so I could gather info if necessary. (I cannot pinpoint
a query to repro this yet, as we have 10K queries running concurrently.)

On Mon, Jan 29, 2018 at 1:35 PM, Rick Otten <rottenwindfish@xxxxxxxxx> wrote:
> If I do a "set max_parallel_workers_per_gather=0;" before I run the query
> in that session, it runs just fine.
> If I set it to 2, the query dies with the dsa_allocate error.
>
> I'll use that as a workaround until 10.2 comes out. Thanks! I have
> something that will help.
>
>
> On Mon, Jan 29, 2018 at 3:52 PM, Thomas Munro
> <thomas.munro@xxxxxxxxxxxxxxxx> wrote:
>>
>> On Tue, Jan 30, 2018 at 5:37 AM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> > Rick Otten <rottenwindfish@xxxxxxxxx> writes:
>> >> I'm wondering if there is anything I can tune in my PG 10.1 database to
>> >> avoid these errors:
>> >
>> >> $ psql -f failing_query.sql
>> >> psql:failing_query.sql:46: ERROR:  dsa_allocate could not find 7 free
>> >> pages
>> >> CONTEXT:  parallel worker
>> >
>> > Hmm.  There's only one place in the source code that emits that message
>> > text:
>> >
>> >     /*
>> >      * Ask the free page manager for a run of pages.  This should always
>> >      * succeed, since both get_best_segment and make_new_segment should
>> >      * only return a non-NULL pointer if it actually contains enough
>> >      * contiguous freespace.  If it does fail, something in our backend
>> >      * private state is out of whack, so use FATAL to kill the process.
>> >      */
>> >     if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
>> >         elog(FATAL,
>> >              "dsa_allocate could not find %zu free pages", npages);
>> >
>> > Now maybe that comment is being unreasonably optimistic, but it sure
>> > appears that this is supposed to be a can't-happen case, in which case
>> > you've found a bug.
>>
>> This is probably the bug fixed here:
>>
>> https://www.postgresql.org/message-id/E1eQzIl-0004wM-K3%40gemulon.postgresql.org
>>
>> That was back-patched, so 10.2 will contain the fix.  The bug was not
>> in dsa.c itself, but in the parallel query code that mixed up DSA
>> areas, corrupting them.  The problem comes up when the query plan has
>> multiple Gather nodes (and a particular execution pattern) -- is that
>> the case here, in the EXPLAIN output?  That seems plausible given the
>> description of a 50-branch UNION.  The only workaround until 10.2
>> would be to reduce max_parallel_workers_per_gather to 0 to prevent
>> parallelism completely for this query.
>>
>> --
>> Thomas Munro
>> http://www.enterprisedb.com
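For reference, the session-level workaround Rick describes, plus a way to check Thomas's question about multiple Gather nodes, would look roughly like this (a sketch; `failing_query.sql` stands in for whatever query triggers the error):

```sql
-- Check whether the plan has more than one Gather node,
-- which is the pattern the pre-10.2 bug needs to trigger:
EXPLAIN SELECT ...;   -- run the failing query's EXPLAIN and look for
                      -- multiple "Gather" lines in the output

-- Workaround until 10.2: disable parallelism for this session only,
-- then run the affected query. Other sessions keep their normal setting.
SET max_parallel_workers_per_gather = 0;
\i failing_query.sql

-- Restore the default afterwards if the session continues:
RESET max_parallel_workers_per_gather;
```

Since `SET` only affects the current session, this avoids changing `postgresql.conf` and penalizing queries that parallelize safely.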