Hi Laurenz. Thanks for writing. I can tell you that while the error message usually identifies just one byte (though sometimes up to three), inspection of the data has frequently shown several bytes impacted. It often seems that the corruption begins at one point in the row and continues to the end.
At this point, we've taken a few mitigating steps: we have switched all the VARCHAR columns in the alert_instance and alert_attribute tables to use "plain" storage. Also, we've dropped a (name, value) index we were using for the alert_attribute table.
I looked at the checksums feature sometime earlier. This is something we could try, with some effort, but it appears a sum is applied to a page leaving shared_buffers and then checked when the page is read back. Our Slony-I replication service is pulling the latest changes into subscriber DBs every couple seconds. Normally, when data corruption occurs, it appears very quickly with a Slony-I COPY failure. It seems there may not be time to write a checksum.
Tom :-)
On Tuesday, March 26, 2019, 4:25:33 AM EDT, Laurenz Albe <laurenz.albe@xxxxxxxxxxx> wrote:
Thomas Tignor wrote:
> We are experiencing intermittent DB corruption in postgres 9.5.14. We are trying to
> identify and eliminate all sources. We are using two independent services for data
> replication, Slony-I v2.2.6 and a custom service developed in-house. Both are based
> on COPY operations. DB corruption is observed when COPY operations fail with an error
> of the form: 'invalid byte sequence for encoding "UTF8"'.
> This occurs with a frequency ranging between a few days and several weeks.
It might be a storage problem, but it could also be a problem with how the data
get into the database in the first place.
I would consider it unlikely that there is a bug in the TOAST code that changes
just one byte.
You could use block checksums to see if the storage is at fault, but unfortunately
it requires an "initdb", dump and restore to switch to checksums.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
> We are experiencing intermittent DB corruption in postgres 9.5.14. We are trying to
> identify and eliminate all sources. We are using two independent services for data
> replication, Slony-I v2.2.6 and a custom service developed in-house. Both are based
> on COPY operations. DB corruption is observed when COPY operations fail with an error
> of the form: 'invalid byte sequence for encoding "UTF8"'.
> This occurs with a frequency ranging between a few days and several weeks.
It might be a storage problem, but it could also be a problem with how the data
get into the database in the first place.
I would consider it unlikely that there is a bug in the TOAST code that changes
just one byte.
You could use block checksums to see if the storage is at fault, but unfortunately
it requires an "initdb", dump and restore to switch to checksums.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com