We have a long script of sql that we run, several
thousand lines of sql. If I execute the script from start to
finish, somewhere in the middle of it, one sql command will
hang and take 2 to 3 hours. During this time, "htop" shows
100% cpu usage with a postgresql process, so it appears to
be doing something.
If I stop that command, then immediately rerun the rest
of the script starting from the command that "hung", the
"hung" command then completes in 5 seconds and the rest of
the script just continues on perfectly. I can interrupt the
command within 30 mins of hanging, or 2 hours of hanging;
regardless of when I interrrupt the hung command, it always
immediately completes successully in < 5 seconds when I
rerun the script immediately after that.
I have noticed this in multiple scripts, multiple
commands; one was an "insert into", a different script hung
on a "update" and a third one would hang on a "create table
x as (select ... )". These are all short running commands
normally, and we have indexes on all the columns for the
appropriate "where" clauses.
I can't send the sql (forbidden), but the actual command
does not appear to be the issue. Rerunning the same little
section of sql manually which includes the hung command,
always takes less than a minute, but the entire sql which
should only take 30 minutes in total will always hang at the
same spot; and then interrupting it (regardless of how long
it was running) and rerunning immediately completes.
The "hung" command does eventually finish at times after
2 to 3 hours, most often we can't afford to wait for it to
run since this batch of sql has to run daily.
Some other notes:
- Environment is Ubuntu 16.04; postgresql version was
originaly 9.5(?) but we have upgraded to 11.1 using
postgresql packages for ubuntu and the problem has persisted
across postgresql versions.
- There are no other queries running on the postgresql
server during this time, and server load is idling (other
than whatever psql is doing).
- It is an amazon server, with ZFS as the files system,
it has 50% disk free on this storage partition that
postgresql database is on.
- We have 60 gigs of ram total, 8 cores.
- We have run the entire script using "create unlogged
table" everywhere as well, no difference.
- We have made tunings to the postgresql configuration to
make it use more RAM.
- All sql commands are sent through a custom Java
application; so we use java + JDBC to connect to
postgresql for this. The script has custom commands
which are interpreted by the Java app, so we can't test
running the exact script start to finish in psql native
tools since the native tools won't understand our custom
sql. However the java app is very simple. All individual sql
commands are committed immediately after they run so there
is no long transaction in play here. The psql jdbc driver is
"postgresql-42.2.4".
- One may suspect the custom java app; but again, htop
shows postgresql using 100% CPU usage on one core, so
postgresql is chewing on the command, doing something.
I have attached our current postgresql.conf file.
We're kind of pulling out our hair here, any ideas?