Also, the Ceph version might be important (if you are running a pre-Jewel
release, there was a data corruption bug affecting db-type workloads in
0.94.9 and earlier).
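If you are not sure which release you are on, a quick check (admin access
to the cluster assumed) is something like:

  # version of the locally installed client binaries
  ceph --version

  # version actually running on a given daemon, e.g. the first OSD
  ceph tell osd.0 version

Anything reporting 0.94.x is Hammer; Jewel is 10.2.x.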
regards
Mark
On 04/01/18 10:42, Rui DeSousa wrote:
Interesting setup. What are your Ceph settings? Are you using caching,
and is it set to write-back or write-through?
If you’re using caching it should be write-through, i.e. with the max
dirty bytes set to zero:
rbd cache max dirty = 0
http://docs.ceph.com/docs/master/rbd/rbd-config-ref/
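As a rough sketch, the relevant settings in the [client] section of
ceph.conf would look something like this (option names from the page
above; these librbd options only apply if the image goes through librbd,
the kernel rbd driver does not use this cache):

  [client]
      # enable the librbd cache but never hold dirty data:
      # with max dirty = 0, writes are not acked until they reach the cluster
      rbd cache = true
      rbd cache max dirty = 0
      rbd cache writethrough until flush = true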
On Jan 3, 2018, at 3:38 PM, Jorge Daniel <elgaita@xxxxxxxxxxx> wrote:
We're using Ceph; our PGDATA resides on ext4 volumes, mounted like this:
/dev/rbd5 on /pg_data type ext4
(rw,relatime,discard,stripe=1024,data=ordered)
Jorge Daniel Fernandez
------------------------------------------------------------------------
*From:* Rui DeSousa <rui.desousa@xxxxxxxxxx>
*Sent:* Wednesday, January 3, 2018 5:20 PM
*To:* Jorge Daniel
*Subject:* Re: pt_toast table seems to be
I’m not a docker expert, nor have I ever run Postgres in a docker
container; but my thought is that docker is not honoring sync() true to
form, i.e. it is buffering for performance instead. If it buffers the
write but then crashes before the data is actually saved, that will lead
to corruption. I’ve seen similar issues with RAID controllers that buffer
the sync() call but eventually fail to actually save the data for
whatever reason: buffer overrun, system crash, etc.
How is the Postgres data stored in the docker container? What file
system is it using? Is it using a docker data volume, a directory mount,
or a storage plugin?
Doing a quick Google search, it does seem that docker has its own
storage model which introduces its own COW semantics, plugins, etc.;
although, I can’t find anything on whether it guarantees sync() calls.
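One way to sanity-check it (a rough sketch) would be to run pg_test_fsync
from inside the container against the same filesystem that holds PGDATA;
if the reported fsync rates are far faster than the backing storage could
plausibly sync, something in between is probably just buffering the call:

  # run inside the container; path is a placeholder for a scratch file on the PGDATA filesystem
  pg_test_fsync -f "$PGDATA"/pg_test_fsync.tmp -s 5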
On Jan 3, 2018, at 2:43 PM, Jorge Daniel <elgaita@xxxxxxxxxxx> wrote:
Hi Rui, every time autoanalyze ran against that table, after a long time
running, the OOM killer acted up and killed the autovacuum; in some cases
the engine went into recovery, and in other cases it caused the crash of
the whole container. We're pretty sure these last ones are what led to
this issue.
We were able to reproduce the OOM kill, and our findings led us to a
strange Docker setting for SHMMAX:
Linux 434d18e30d83 4.10.0-42-generic #46~16.04.1-Ubuntu SMP Mon Dec
4 15:57:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
postgres@434d18e30d83:~$ cat /proc/sys/kernel/shmmax
18446744073692774399
After we fixed that number, we were able to run the ANALYZE and VACUUM
to the end without any OOM kill.
This is related to moby issue #16315, "Container is reporting incorrect
/proc/sys/kernel/shmmax value":
https://github.com/moby/moby/issues/16315
I don't have that syslog anymore to examine whether sync() worked
properly or not.
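(Roughly, the check and one way to pin the value explicitly look like
this; the 8 GB limit below is only an example, not our production
setting:)

  # what the container actually sees
  docker exec <container> cat /proc/sys/kernel/shmmax

  # kernel.shmmax is IPC-namespaced, so on Docker versions that support --sysctl
  # the container can be started with an explicit limit, e.g. 8 GB:
  docker run --sysctl kernel.shmmax=8589934592 ...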
Jorge Daniel Fernandez
------------------------------------------------------------------------
*From:* Rui DeSousa <rui.desousa@xxxxxxxxxx>
*Sent:* Wednesday, January 3, 2018 4:23 PM
*To:* Jorge Daniel
*Cc:* pgsql-admin@xxxxxxxxxxxxxxxxxxxx
*Subject:* Re: pt_toast table seems to be
Do you mean the Postgres instance is killed, or the docker instance is
killed? For the OOM, what’s the memory configuration and how much swap
is allocated?
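For the memory side, a rough sketch of what I’d check on the host (paths
and values below are just placeholders):

  free -m
  swapon --show
  # the PostgreSQL docs recommend disabling overcommit so the OOM killer stays away
  cat /proc/sys/vm/overcommit_memory      # 2 = don't overcommit
  # and check whether the postmaster is protected from the OOM killer
  cat /proc/$(head -1 "$PGDATA"/postmaster.pid)/oom_score_adj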
I would look at your disk subsystem from the perspective of Postgres:
when it issues a sync(), does your setup honor it? If your docker
instance is crashing, then it seems like sync() might not be honored,
which would lead to the corruption issues you’re describing.
On Jan 3, 2018, at 10:25 AM, Jorge Daniel <elgaita@xxxxxxxxxxx> wrote:
Hi guys,
This is my first post in the community, so my apologies in advance about
the formalities.
In a production db, after several OOM killer events on the Postgres
running in a docker container (now fixed), we are left with a kind of
corrupted pg_toast 😐:
2017-11-29 23:46:13.147 PST rhost=10.149.54.5(54750) app=PostgreSQL
JDBC Driver:user=veon:db=veon:ERROR: unexpected chunk number 0
(expected 1) for toast value 17143928 in pg_toast_77809
2017-11-29 23:46:13.147 PST rhost=10.149.54.5(54750) app=PostgreSQL
JDBC Driver:user=veon:db=veon:STATEMENT: SELECT "id",
"needs_pumping", "needs_lucene", "surveybatchid", "ip_address",
"is_open_proxy", "is_partial", "cookies", "social_media_source",
"social_media_review_id", "social_media_pull_id",
"social_media_pull_id_long", "source_config_unitid",
"source_config_source_id", "updated",
"external_id_printable_coder", "unique_reusable_id",
"invite_errmsg", "reminder_errmsg", "last_seen_pagename",
"last_submitted_pagename", "has_content_fields", "logid",
"alertid", "language_name", "e_status", "e_reminder_status",
"e_lastupdated", "e_creationdate", "e_sampleddate",
"e_responsedate", "e_invitationdate", "reminderdate",
"e_delivereddate", "e_remindeddate", "e_accepteddate",
"e_initialfinishdate", "e_ta_completed_date", "e_expirationdate",
"e_survey_method", "e_survey_source", "e_survey_type",
"parse_status", "tagging_attempts", "e_optout",
"e_bounce_category", "feed_fileid", "feed_file_recordid",
"startdate", "e_lastname", "e_firstname", "e_address",
"e_address2", "e_city", "e_state", "e_postalcode", "e_phone",
"e_email", "is_mobile", "is_mobile_first", "is_mobile_finished",
"is_cookie_confirmation_needed", "exclude_from_problem_tracker",
"is_invitation_error", "page_number_seen_last",
"page_number_seen_highest", "pages_submitted",
"pages_validation_failed", "last_submit_date", "user_agent",
"jsonb_fields", "thrift", "episodeid", "e_unitid",
"e_committed_survey_specid", "customer_veonid", "xmin" FROM
"survey_prior" WHERE survey_veon.surveyid = ANY($1)
The usual fix is:
REINDEX TABLE pg_toast_77809;
After this we can select * from survey_prior with no problem, and we can
dump the entire table if we wish.
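(In case it helps, a query along these lines maps the toast table from
the error back to the table that owns it:)

  -- OID taken from the error message above
  SELECT c.oid::regclass AS owning_table
  FROM pg_class c
  WHERE c.reltoastrelid = 'pg_toast.pg_toast_77809'::regclass;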
Then the jobs resume and we're happy until we hit the same error again.
It occurs randomly; we suspect an UPDATE on a certain row is creating
the corruption.
We can't reproduce the ERROR because we can't do a full scan of the
table while the error is still alive (the on-call guys apply the fix
immediately to resume production).
So how can we discover the bad rows and expose them, or how do we check
the consistency of the pg_toast table?
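(The kind of full scan we have in mind, if we can get a window, is
something along these lines: it walks the table row by row, forces every
column to be detoasted, and logs the ctid of any row that raises the
chunk error instead of aborting.)

  DO $$
  DECLARE
      rec   record;
      dummy text;
  BEGIN
      FOR rec IN SELECT ctid FROM survey_prior LOOP
          BEGIN
              -- casting the whole row to text forces detoasting of every column
              SELECT t::text INTO dummy
              FROM survey_prior t
              WHERE t.ctid = rec.ctid;
          EXCEPTION WHEN OTHERS THEN
              RAISE NOTICE 'bad row at ctid %: %', rec.ctid, SQLERRM;
          END;
      END LOOP;
  END $$;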
We need this evidence to ask for a window and a complete dump/restore
that will wipe out this problem.
Thx in advance
Jorge Daniel Fernandez