Hi guys
This is my first post in the community, so my apologies in advance if I miss any formalities.
In a production DB, after several OOM-killer events hit the Postgres instance running in Docker (the OOM issue itself is now fixed), we are left with what looks like a corrupted pg_toast table 😐:
2017-11-29 23:46:13.147 PST rhost=10.149.54.5(54750) app=PostgreSQL JDBC Driver:user=veon:db=veon:ERROR: unexpected chunk number 0 (expected 1) for toast value 17143928 in pg_toast_77809
2017-11-29 23:46:13.147 PST rhost=10.149.54.5(54750) app=PostgreSQL JDBC Driver:user=veon:db=veon:STATEMENT: SELECT "id", "needs_pumping", "needs_lucene", "surveybatchid", "ip_address", "is_open_proxy", "is_partial", "cookies", "social_media_source", "social_media_review_id", "social_media_pull_id", "social_media_pull_id_long", "source_config_unitid", "source_config_source_id", "updated", "external_id_printable_coder", "unique_reusable_id", "invite_errmsg", "reminder_errmsg", "last_seen_pagename", "last_submitted_pagename", "has_content_fields", "logid", "alertid", "language_name", "e_status", "e_reminder_status", "e_lastupdated", "e_creationdate", "e_sampleddate", "e_responsedate", "e_invitationdate", "reminderdate", "e_delivereddate", "e_remindeddate", "e_accepteddate", "e_initialfinishdate", "e_ta_completed_date", "e_expirationdate", "e_survey_method", "e_survey_source", "e_survey_type", "parse_status", "tagging_attempts", "e_optout", "e_bounce_category", "feed_fileid", "feed_file_recordid", "startdate", "e_lastname", "e_firstname", "e_address", "e_address2", "e_city", "e_state", "e_postalcode", "e_phone", "e_email", "is_mobile", "is_mobile_first", "is_mobile_finished", "is_cookie_confirmation_needed", "exclude_from_problem_tracker", "is_invitation_error", "page_number_seen_last", "page_number_seen_highest", "pages_submitted", "pages_validation_failed", "last_submit_date", "user_agent", "jsonb_fields", "thrift", "episodeid", "e_unitid", "e_committed_survey_specid", "customer_veonid", "xmin" FROM "survey_prior" WHERE survey_veon.surveyid = ANY($1)
The usual fix is:

REINDEX TABLE pg_toast.pg_toast_77809;
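(For context, this is how we map the toast relation back to its owning table — just a standard catalog lookup, so nothing exotic, and it points at survey_prior as expected:)

    SELECT n.nspname, c.relname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.reltoastrelid = 'pg_toast.pg_toast_77809'::regclass;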
After this we can SELECT * FROM survey_prior with no problem, and we can dump the entire table if we wish.
Then the jobs resume and we're happy until we hit the same error again. It occurs randomly; we suspect an UPDATE on a certain row is creating the corruption.
We can't reproduce the ERROR because we can't do a full scan of the table while the error is still live (the on-call guys apply the fix immediately to resume production).
So how can we discover the bad rows and expose them, or how do we check the consistency of the pg_toast table?
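What I had in mind was something like the following loop — a minimal sketch, where I'm guessing which columns (cookies, jsonb_fields, thrift) are actually TOASTed, and catching OTHERS because I'm not sure which SQLSTATE the chunk error raises:

    DO $$
    DECLARE
      rec record;
    BEGIN
      FOR rec IN SELECT id FROM survey_prior ORDER BY id LOOP
        BEGIN
          -- force detoasting of the likely-toasted columns by reading them
          PERFORM length(cookies::text),
                  length(jsonb_fields::text),
                  length(thrift::text)
          FROM survey_prior
          WHERE id = rec.id;
        EXCEPTION WHEN OTHERS THEN
          RAISE NOTICE 'bad row: id = %, error: %', rec.id, SQLERRM;
        END;
      END LOOP;
    END $$;

I realise the exception block forces a subtransaction per row, so this would be slow on a table this size — is that a sane approach, or is there a proper consistency check for toast tables?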
We need this evidence to ask for a maintenance window and a complete dump/restore that will wipe out this problem.
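(The plan for that window would be something along these lines, assuming the dump completes cleanly after a REINDEX and a logical dump is enough to leave the corruption behind:)

    pg_dump -Fc -d veon -f veon.dump
    # drop and recreate the database, then:
    pg_restore -d veon veon.dump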
Thx in advance
Jorge Daniel Fernandez