On Mon, Feb 4, 2019 at 6:52 PM Jakub Glapa <jakub.glapa@xxxxxxxxx> wrote:
> I see the error showing up every night on 2 different servers. But it's a
> bit of a heisenbug because if I go there now it won't be reproducible.

Huh. Ok, well, that's a lot more frequent than I thought. Is it always the same query? Any chance you can get the plan? Are there more things going on on the server, like perhaps concurrent parallel queries?

> It was suggested by Justin Pryzby that I recompile pg src with his patch
> that would cause a coredump.

Small correction to Justin's suggestion: don't abort() after elog(ERROR, ...), it'll never be reached.

> But I don't feel comfortable doing this, especially if I would have to run
> this with prod data.
> My question is: can I do anything like increasing the logging level or
> enabling some additional options?
> It's a production server, but I'm willing to sacrifice a bit of its
> performance if that would help.

If you're able to run a throwaway copy of your production database on another system that you don't have to worry about crashing, you could just replace ERROR with PANIC and run a high-speed loop of the query that crashed in production, or something. This might at least tell us whether that condition is reached via something dereferencing a dsa_pointer, or via something manipulating the segment lists while allocating/freeing.

In my own 100% unsuccessful attempts to reproduce this I was mostly running the same query (based on my guess at what ingredients are needed), but perhaps it requires a particular allocation pattern that will require more randomness to reach... hmm.

--
Thomas Munro
http://www.enterprisedb.com