Re: Memory Errors

Sam Nelson <samn@xxxxxxxxxxxxxxxxxxx> · Tue, 21 Sep 2010 10:57:46 -0600

Okay, we're finally getting the last bits of corruption fixed, and I remembered to ask my boss about the kill script.
The only details I have are these:

1) The script does nothing if there are fewer than 1000 locks on tables in the database

2) If there are 1000 or more locks, it will grab the processes in pg_stat_activity that are in a waiting state

3) For each of those waiting processes, it makes a system call to kill $pid

The kill is not pg_terminate_backend or pg_cancel_backend, and it's also not a kill -9. Just a normal kill (so the default signal, SIGTERM).
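For anyone trying to reproduce this, the three steps above can be sketched roughly as follows. This is a reconstruction from the description only, not the actual script, which I haven't seen. The function names, the two SQL strings, and the idea of passing the lock count and waiting PIDs in as plain values are all assumptions made for illustration. The column names procpid and waiting are the ones pg_stat_activity had in the PostgreSQL releases current in 2010.

```python
import os
import signal

# Step 1 threshold, taken from the description above.
LOCK_THRESHOLD = 1000

# Hypothetical queries: the real script's SQL is unknown. They are shown
# only to indicate where the two inputs below would come from.
COUNT_LOCKS_SQL = "SELECT count(*) FROM pg_locks WHERE locktype = 'relation'"
WAITING_PIDS_SQL = "SELECT procpid FROM pg_stat_activity WHERE waiting"


def pids_to_kill(lock_count, waiting_pids, threshold=LOCK_THRESHOLD):
    """Return the PIDs the script would signal (steps 1 and 2).

    Below the threshold nothing is selected; at or above it, every
    waiting backend is selected.
    """
    if lock_count < threshold:
        return []
    return list(waiting_pids)


def kill_waiting(lock_count, waiting_pids, send_signal=os.kill):
    """Send a plain SIGTERM to each selected PID (step 3).

    SIGTERM is what a bare `kill $pid` sends. `send_signal` is injectable
    so the logic can be exercised without signalling real processes.
    """
    signalled = []
    for pid in pids_to_kill(lock_count, waiting_pids):
        send_signal(pid, signal.SIGTERM)
        signalled.append(pid)
    return signalled
```

Passing a recording function as send_signal, for example a lambda that appends to a list, lets you check which backends would be hit without actually killing anything.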

As far as the postgres and EC2 instances go, we're not really sure if anyone shut down, created, or migrated them in a weird way, but Kevin (my boss) said that it wouldn't surprise him.

All I can say is that we were getting one new corrupted row every day while the kill script was running, and we haven't gotten any new corruption since we stopped it.

As far as the table with memory errors goes, we had asked them to rebuild the table, and they came back saying that they no longer need that table. So they're just going to drop it.

We'll try to keep digging, but I'm not sure we'll get much more info than that. We're quite busy and my ability to remember things is ... questionable.

-Sam

On Thu, Sep 9, 2010 at 8:14 AM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:

On Wed, Sep 8, 2010 at 6:55 PM, Sam Nelson <samn@xxxxxxxxxxxxxxxxxxx> wrote:

> Even if the corruption wasn't a result of that, we weren't too excited about
> the process being there to begin with. We thought there had to be a better
> solution than just killing the processes. So we had a discussion about the
> intent of that script and my boss dealt with something that solved the same
> problem without killing queries, then had them stop that daemon and we have
> been working with that database to make sure it doesn't go screwy again. No
> new corruption has shown up since stopping that daemon.
>
> That memory allocation issue looked drastically different from the toast
> value errors, though, so it seemed like a separate problem. But now it's
> looking like more corruption.
>
> ---
>
> We're requesting that they do a few things (this is their production
> database, so we usually don't alter any data unless they ask us to),
> including deleting those rows. My memory is insufficient, so there's a good
> chance that I'll forget to post back to the mailing list with the results,
> but I'll try to remember to do so.
>
> Thank you for the help - I'm sure I'll be back soon with many more
> questions.

Any information on repeatable data corruption, whether it is ec2
improperly flushing data on instance resets, postgres misbehaving
under atypical conditions, or bad interactions between ec2 and
postgres is highly valuable. The only cases of 'understandable' data
corruption are hardware failures, sync issues (either fsync off, or
fsync not honored by hardware), torn pages on non journaling file
systems, etc.

Naturally people are going to be skeptical of ec2 since you are so
abstracted from the hardware. Maybe all your problems stem from a
single explainable incident -- but we definitely want to get to the
bottom of this...please keep us updated!

merlin

--

Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
