autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

"Jim Hurne" <jhurne@xxxxxxxxxx> · Tue, 16 Jun 2020 11:59:37 -0400

We have a cloud service that uses PostgreSQL to temporarily store binary 
content. We're using PostgreSQL's Large Objects to store the binary 
content. Each large object lives anywhere from a few hundred milliseconds 
to 5-10 minutes, after which it is deleted.

Normally, this works just fine and we have no issues. However, roughly 
every 3 months or so, disk usage on our PostgreSQL database starts to 
increase at an alarming rate. More specifically, the pg_largeobject 
table's disk usage steadily increases even though we've deleted most of 
the large objects. This will continue until the database server runs out 
of disk space (and then all sorts of bad things happen of course).

It looks to us like autovacuum against the pg_largeobject table stops 
working because the pg_stat_all_tables.last_autovacuum column stops 
updating, or is updated infrequently, while it is still updated on other 
tables:

 schemaname |         relname         | n_live_tup | n_dead_tup | 
last_autovacuum
------------+-------------------------+------------+------------+-------------------------------
 pg_catalog | pg_largeobject          |      37205 |   92614852 | 
2020-06-15 01:55:09.037101+00
 pg_toast   | pg_toast_973434724      |        281 |       3701 | 
2020-06-15 17:22:54.746452+00
 pg_catalog | pg_largeobject_metadata |        320 |       1236 | 
2020-06-15 17:23:49.765878+00
 pg_catalog | pg_shdepend             |        344 |        933 | 
2020-06-15 17:23:54.448855+00
 pg_toast   | pg_toast_2619           |         52 |         85 | 
2020-06-15 17:22:01.387933+00
 pg_catalog | pg_statistic            |        413 |        100 | 
2020-06-15 17:15:52.656791+00
 pg_catalog | pg_class                |        349 |          2 | 
2020-06-12 17:58:13.147931+00
 pg_catalog | pg_attribute            |       2633 |          1 | 
2020-06-12 17:58:13.000114+00

Looking at our PostgreSQL logs, it looks like the autovacuum task takes 
increasingly more time to run (against the pg_largeobject table):

2020-06-12T19:41:58.335931494Z stderr F         system usage: CPU: user: 
0.02 s, system: 0.00 s, elapsed: 1.77 s
2020-06-12T19:42:59.704884752Z stderr F         system usage: CPU: user: 
0.02 s, system: 0.02 s, elapsed: 3.06 s
2020-06-12T19:44:01.928300512Z stderr F         system usage: CPU: user: 
0.06 s, system: 0.01 s, elapsed: 5.44 s
2020-06-12T19:45:14.124209167Z stderr F         system usage: CPU: user: 
0.11 s, system: 0.05 s, elapsed: 17.13 s
2020-06-12T19:46:16.28758936Z stderr F  system usage: CPU: user: 0.13 s, 
system: 0.08 s, elapsed: 19.04 s
2020-06-12T19:47:34.264882626Z stderr F         system usage: CPU: user: 
0.20 s, system: 0.19 s, elapsed: 36.22 s
2020-06-12T19:49:15.383436343Z stderr F         system usage: CPU: user: 
0.28 s, system: 0.38 s, elapsed: 74.06 s
2020-06-12T19:53:47.229361981Z stderr F         system usage: CPU: user: 
0.66 s, system: 1.06 s, elapsed: 214.12 s
2020-06-12T20:19:39.619748109Z stderr F         system usage: CPU: user: 
4.13 s, system: 5.30 s, elapsed: 1461.16 s
2020-06-12T21:30:31.634637945Z stderr F         system usage: CPU: user: 
13.73 s, system: 19.59 s, elapsed: 4225.61 s
2020-06-12T23:54:32.511015886Z stderr F         system usage: CPU: user: 
33.57 s, system: 41.90 s, elapsed: 8514.23 s
2020-06-13T04:23:32.230960572Z stderr F         system usage: CPU: user: 
66.09 s, system: 82.95 s, elapsed: 16011.25 s
2020-06-13T12:00:37.43434175Z stderr F  system usage: CPU: user: 99.42 s, 
system: 130.48 s, elapsed: 27296.71 s
2020-06-14T02:40:25.19122979Z stderr F  system usage: CPU: user: 202.96 s, 
system: 263.66 s, elapsed: 52653.66 s
2020-06-15T01:55:09.03766272Z stderr F  system usage: CPU: user: 317.54 s, 
system: 544.48 s, elapsed: 83550.21 s

We have several instances of our cloud service, and each instance has it's 
own database. On other instances, the autovacuum elapsed is consistently 
less than 25 seconds on every run.

Other than the increasing elapsed times for the autovacuum, we don't see 
any other indication in the logs of a problem (no error messages, etc).

We're currently using PostgreSQL version 10.10. Our service is JVM-based 
and we're using the PostgreSQL JDBC driver version 42.2.5.

Have we stumbled upon a potential bug here, or do we need to tweak some 
autovacuum settings?  What should we look at next or what should we try 
next to further troubleshoot this?

Regards,

Jim Hurne