Random memory related errors on live postgres 14.13 instance on Ubuntu 22.04 LTS

Ian J Cottee <ian@xxxxxxxxxx> · Wed, 30 Oct 2024 07:34:03 +0000

Hello everyone, I’ve been using postgres for over 25 years
now and have never had any major issues that were not caused by my own stupidity.
In the last 24 hours, however, I’ve had a number of issues on one client's server which I assume
are caused by a bug in postgres or possibly a hardware fault (they are running on a Linode), but I need
some clarification and would welcome advice on how to proceed. I will also forward this mail to Linode support to ask them to check for any memory issues they can detect.

This particular Postgres is running on Ubuntu LTS 22.04 and
has the following version information:

```

PostgreSQL 14.13 (Ubuntu 14.13-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit

```

The quick summary is that over a 24 hour period I had the
following errors appear in the postgres logs at different times causing the
system processes to restart:

stuck spinlock detected 
free(): corrupted unsorted chunks 
double free or corruption (!prev)
corrupted size vs. prev_size 
corrupted double-linked list 
*** stack smashing detected ***: terminated 
Segmentation fault 
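
A minimal sketch of how the signatures listed above could be counted in a saved copy of the log. The function name `count_crash_signatures` and the idea of passing the log as a single string are assumptions for illustration, not part of any postgres tooling. Whitespace is collapsed before matching so that messages wrapped across lines, as in the excerpts in this mail, are still found.

```python
from collections import Counter

# Signatures copied from the summary list above. Several of them appear in
# the log without a timestamp prefix, presumably because they are written
# directly to stderr by the C library rather than by postgres itself.
CRASH_SIGNATURES = [
    "stuck spinlock detected",
    "free(): corrupted unsorted chunks",
    "double free or corruption (!prev)",
    "corrupted size vs. prev_size",
    "corrupted double-linked list",
    "*** stack smashing detected ***",
    "Segmentation fault",
]


def count_crash_signatures(log_text: str) -> Counter:
    """Count how often each known crash signature occurs in log_text."""
    # Collapse all runs of whitespace (including newlines) to single spaces
    # so that a message split across two lines still matches.
    flat = " ".join(log_text.split())
    counts = Counter()
    for signature in CRASH_SIGNATURES:
        occurrences = flat.count(signature)
        if occurrences:
            counts[signature] = occurrences
    return counts
```

Calling `count_crash_signatures(open(path).read())` on a log file (with `path` being wherever the log happens to live) gives a per-signature tally that can be compared between days.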

Here’s the more detailed breakdown. 

On Monday evening this week, the following event occurred on the
server

```

2024-10-28 18:12:47.145 GMT [575437] xxx@xxx PANIC: stuck spinlock detected at LWLockWaitListLock, ./build/../src/backend/storage/lmgr/lwlock.c:913

```

Followed by:

```

2024-10-28 18:12:47.249 GMT [1880289] LOG: terminating any other active server processes

2024-10-28 18:12:47.284 GMT [1880289] LOG: all server processes terminated; reinitializing

```

And eventually 

```

2024-10-28 18:12:48.474 GMT [575566] xxx@xxx FATAL: the database system is in recovery mode

2024-10-28 18:12:48.476 GMT [575550] LOG: database system was not properly shut down; automatic recovery in progress

2024-10-28 18:12:48.487 GMT [575550] LOG: redo starts at DD/405E83A8

2024-10-28 18:12:48.487 GMT [575550] LOG: invalid record length at DD/405EF818: wanted 24, got 0

2024-10-28 18:12:48.487 GMT [575550] LOG: redo done at DD/405EF7E0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

2024-10-28 18:12:48.515 GMT [1880289] LOG: database system is ready to accept connections

```

Neither I nor any of the users noticed this, as they have all
usually finished by 17:30. Later, however:

```

2024-10-28 20:27:15.258 GMT [611459] xxx@xxx LOG: unexpected EOF on client connection with an open transaction

2024-10-28 21:01:05.934 GMT [620373] xxx@xxxx LOG: unexpected EOF on client connection with an open transaction

free(): corrupted unsorted chunks

2024-10-28 21:15:02.203 GMT [1880289] LOG: server process (PID 623803) was terminated by signal 6: Aborted

2024-10-28 21:15:02.204 GMT [1880289] LOG: terminating any other active server processes

```

This time it could not recover and I didn’t notice until
early the next morning whilst doing some routine checks. 

```

2024-10-28 21:15:03.643 GMT [623807] LOG: database system was not properly shut down; automatic recovery in progress

2024-10-28 21:15:03.655 GMT [623807] LOG: redo starts at DD/47366740

2024-10-28 21:15:03.663 GMT [623807] LOG: invalid record length at DD/475452A0: wanted 24, got 0

2024-10-28 21:15:03.663 GMT [623807] LOG: redo done at DD/47545268 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

2024-10-28 21:15:03.682 GMT [623829] xxx@xxx FATAL: the database system is in recovery mode

double free or corruption (!prev)

2024-10-28 21:15:03.832 GMT [1880289] LOG: startup process (PID 623807) was terminated by signal 6: Aborted

2024-10-28 21:15:03.832 GMT [1880289] LOG: aborting startup due to startup process failure

2024-10-28 21:15:03.835 GMT [1880289] LOG: database system is shut down

```

When I noticed in the morning, it was able to start without
an issue. From googling, it appeared to be a memory issue, and I wondered whether the
problem had been sorted now that the server process had stopped completely and restarted.
The problem was not sorted, although all the above errors were recovered from
automatically without any input from me and without the client noticing.

```

corrupted size vs. prev_size 

2024-10-29 09:55:24.417 GMT [894747] LOG: background worker "parallel worker" (PID 947642) was terminated by signal 6: Aborted

```

```

corrupted double-linked list 

2024-10-29 13:14:28.322 GMT [894747] LOG: background worker "parallel worker" (PID 1019071) was terminated by signal 6: Aborted

```

```

*** stack smashing detected ***: terminated 

2024-10-28 15:24:30.331 GMT [1880289] LOG: background worker "parallel worker" (PID 528630) was terminated by signal 6: Aborted

```

```

2024-10-28 15:40:26.617 GMT [1880289] LOG: background worker "parallel worker" (PID 533515) was terminated by signal 11: Segmentation fault

2024-10-28 15:40:26.617 GMT [1880289] DETAIL: Failed process was running: SELECT "formula_line".id FROM "formul\

```

I rebooted the server at 18:30 and have had no further
issues so far, although work has yet to start. When rebooting the server,
postgres seemed to take a long time to terminate. 

Now there is one odd thing that has been happening recently. Due to a bug in my code I've had more concurrency errors than would normally be expected (strictly speaking these are serialization failures rather than deadlocks):

```
2024-10-29 19:26:51.680 GMT [71152] xxx@xxx ERROR:  could not serialize access due to concurrent update
```

I believe I have fixed that bug in my code this morning, and these errors did not seem to coincide with the memory errors described earlier, but I'm raising it in case it is related.
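
One way the "did not coincide" impression could be checked more systematically is to compare timestamps. The sketch below is only an illustration under stated assumptions: it expects each log entry on a single line with the `YYYY-MM-DD HH:MM:SS.mmm GMT` prefix seen in the excerpts, and the helper names `timestamps` and `near_pairs` as well as the 60 second window are arbitrary choices, not anything postgres provides.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp prefix used in the log excerpts in this mail.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) GMT")


def timestamps(lines, needle):
    """Return the timestamps of all log lines that contain `needle`."""
    found = []
    for line in lines:
        match = TS.match(line)
        if match and needle in line:
            found.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return found


def near_pairs(errors, crashes, window=timedelta(seconds=60)):
    """Return (error, crash) timestamp pairs that lie within `window` of each other."""
    return [(e, c) for e in errors for c in crashes if abs(e - c) <= window]
```

For example, `near_pairs(timestamps(lines, "could not serialize access"), timestamps(lines, "was terminated by signal"))` returns an empty list when no serialization error was logged within a minute of a process termination.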

Comments and insights are warmly welcomed. 

Best regards

Ian Cottee