RE: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode

Reid Thompson <Reid.Thompson@xxxxxxxxxxxx> · Mon, 19 Apr 2021 16:54:16 +0000

Responses inline below
Thanks,
reid

-----Original Message-----
From: Tom Lane <tgl@xxxxxxxxxxxxx> 
Sent: Monday, April 19, 2021 9:54 AM
To: Reid Thompson <Reid.Thompson@xxxxxxxxxxxx>
Cc: pgsql-general@xxxxxxxxxxxxxxxxxxxx
Subject: Re: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode

Reid Thompson <Reid.Thompson@xxxxxxxxxxxx> writes:
> Hi I'm looking for some guidance related to the subject line issue.

Is this repeatable?  If so you've found a bug, I think.

Alvaro has noted   https://www.postgresql.org/message-id/20210419155928.GA3253%40alvherre.pgsql  that the mentioned bug fix on the error reporting should correct this ( as the hash join will abort with the fix rather than not recognizing that it needs to, and then attempt a later read)
Somewhat -- I think it's dependent on what else may be writing tmp files ( i.e. it's happened twice in the past few days with successful runs in the interim and I believe that this query has been in place for a long time prior to the first occurrence)

>   1.  That the error message has been updated ( i.e. SUCCESS is not 
> the proper value)

Yeah, what this really indicates is an incomplete read (file shorter than expected).  Since 11.8, we've improved the error reporting for that, but that wouldn't in itself fix whatever the underlying problem is.  

See message noted above from Alvaro.

>   2.  That the error is due to running out of temporary space either disk space or maybe temp_buffers?

That could be the proximate cause, although then there would be a bug that the original write failure wasn't detected.  But it seems about as likely that there's just some inconsistency between what the temp file writing code wrote and what the reading code expects to read.

Is this a parallelized hash join by any chance?  That's new in v11 if memory serves, so it'd be interesting to see if disabling enable_parallel_hash changes anything.

Explain on the host does not indicate parallelization anywhere - it does indicate that we can narrow the result set by optimizing our query which we are implementing now.  Hopefully this will get us to a scheduled upgrade timeframe.

Anyway, I'd counsel updating to current (11.11), and then if you can still reproduce the problem, try to reduce it to a self-contained test case.

                        regards, tom lane