My (our) complaints about EC2 aren't particularly extensive, but the last time I posted to the mailing list mentioning that they were using EC2, the first reply blamed the corruption on EC2.
It's not that we have no complaints (some aspects are very frustrating); I was just trying to stave off anyone who was going to reply with "Tell them to stop using EC2".
-- More detail about the script that kills queries:
Honestly, we (or at least I) haven't discovered which type of kill they were doing, but it does seem to be the culprit in some way. I don't talk to the customers (that's my boss's job), so I didn't get to ask for specifics; if my boss did ask, he didn't tell me.
The previous issue involved toast corruption showing up very regularly (e.g. once a day, in some cases), the end result being that we had to delete the corrupted rows. Coming back the next day to see the same corruption on different rows was not very encouraging.
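(As an aside, in case it helps anyone hitting the same thing: one brute-force way to narrow down which rows are bad is to force each value to be detoasted and note which ids throw the "missing chunk number ... for toast value ..." style errors. A rough sketch in Python/psycopg2 follows; the table, column, and connection details are made-up placeholders, not our customer's schema.)

    # Rough sketch: find rows whose TOASTed values can't be read back.
    # Table, column, and connection details are hypothetical placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=customer_db")  # placeholder DSN
    conn.autocommit = True  # one statement per transaction

    cur = conn.cursor()
    cur.execute("SELECT id FROM some_table ORDER BY id")
    ids = [row[0] for row in cur.fetchall()]

    bad_ids = []
    for row_id in ids:
        try:
            # length() forces the value to be fully detoasted; corrupted rows
            # raise errors like "missing chunk number N for toast value ...".
            cur.execute(
                "SELECT length(big_text_col) FROM some_table WHERE id = %s",
                (row_id,),
            )
            cur.fetchone()
        except psycopg2.DatabaseError as exc:
            print("row %s: %s" % (row_id, exc))
            bad_ids.append(row_id)

    print("corrupted rows:", bad_ids)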
We found out afterwards that they had a script running as a daemon that would, every ten minutes (I believe), check the number of locks on the table and kill all waiting queries if there were >= 1000 locks.
Even if the corruption wasn't a result of that script, we weren't too excited about it being there in the first place; there had to be a better solution than just killing processes. So we discussed the intent of the script, my boss put something in place that solved the same problem without killing queries, and then had them stop the daemon. We've been keeping an eye on that database to make sure it doesn't go screwy again, and no new corruption has shown up since the daemon was stopped.
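For the curious, the gist of what that daemon was doing - and a gentler variant of it - looks roughly like the sketch below. The table name, threshold, interval, and connection details are placeholders, not their actual script. The important difference is that pg_cancel_backend() sends SIGINT and only cancels the backend's current query, whereas something like kill -9 (SIGKILL) crashes the backend and forces the whole cluster through crash recovery.

    # Rough sketch of a lock-watcher that cancels (rather than kill -9's)
    # queries still waiting on one table.  Table name, threshold, interval,
    # and connection details are hypothetical placeholders.
    import time
    import psycopg2

    DSN = "dbname=customer_db"  # placeholder
    TABLE = "some_table"        # placeholder
    LOCK_THRESHOLD = 1000       # roughly what their script checked for
    CHECK_INTERVAL = 600        # seconds; "every ten minutes"

    def check_once(conn):
        cur = conn.cursor()
        # Count locks currently held or requested on the table.
        cur.execute(
            """
            SELECT count(*)
            FROM pg_locks l
            JOIN pg_class c ON c.oid = l.relation
            WHERE c.relname = %s
            """,
            (TABLE,),
        )
        if cur.fetchone()[0] < LOCK_THRESHOLD:
            return
        # Cancel the current query of every backend still *waiting* for a
        # lock on that table.  pg_cancel_backend() sends SIGINT, so the
        # backend stays alive; nothing like kill -9.
        cur.execute(
            """
            SELECT pg_cancel_backend(pid)
            FROM (SELECT DISTINCT l.pid
                  FROM pg_locks l
                  JOIN pg_class c ON c.oid = l.relation
                  WHERE c.relname = %s AND NOT l.granted) waiters
            """,
            (TABLE,),
        )

    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    while True:
        check_once(conn)
        time.sleep(CHECK_INTERVAL)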
That memory allocation issue looked drastically different from the toast value errors, though, so it seemed like a separate problem. But now it's looking like more corruption.
---
We're requesting that they do a few things (this is their production database, so we usually don't alter any data unless they ask us to), including deleting those rows. My memory is insufficient, so there's a good chance that I'll forget to post back to the mailing list with the results, but I'll try to remember to do so.
Thank you for the help - I'm sure I'll be back soon with many more questions.
-Sam
On Wed, Sep 8, 2010 at 2:58 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Merlin Moncure <mmoncure@xxxxxxxxx> writes:
> On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson <samn@xxxxxxxxxxxxxxxxxxx> wrote:
>> So ... yes, it seems that those four id's are somehow part of the problem.
>> They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
>> either), so memtest isn't available, but no new corruption has cropped up
>> since they stopped killing the waiting queries (I just double checked - they
>> were getting corrupted rows constantly, and we haven't gotten one since that
>> script stopped killing queries).

> That's actually a startling indictment of ec2 -- how were you killing
> your queries exactly? You say this is repeatable? What's your
> setting of full_page_writes?

I think we'd established that they were doing kill -9 on backend
processes :-(. However, PG has a lot of track record that says that
backend crashes don't result in corrupt data. What seems more likely
to me is that the corruption is the result of some shortcut taken while
shutting down or migrating the ec2 instance, so that some writes that
Postgres thought got to disk didn't really.
regards, tom lane