On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson <samn@xxxxxxxxxxxxxxxxxxx> wrote: > It figures I'd have an idea right after posting to the mailing list. > Yeah, running COPY foo TO stdout; gets me a list of data before erroring > out, so I did a copy (select * from foo order by id asc) to stdout; to see > if I could make some kind of guess as to whether this was related to a > single row or something else. > I got the id of the last row the copy to command was able to grab normally > and tried to figure out the next id. The following started to make me think > along the lines of some kinda bad corruption (even before getting responses > that agreed with that): > Assuming that the last id copied was 1500: > 1) select * from foo where id = (select min(id) from foo where id > 1500); > Results in 0 rows > 2) select min(id) from foo where id > 1500; > Results in, for example, 200000 > 3) select max(id) from foo where id > 1500; > Results in, for example, 90000 (a much lower number than returned by min) > 4) select id from foo where id > 1500 order by id asc limit 10; > Results in (for example): > 200000 > 202000 > 210273 > 220980 > 15005 > 15102 > 15104 > 15110 > 15111 > 15113 > So ... yes, it seems that those four id's are somehow part of the problem. > They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes > either), so memtest isn't available, but no new corruption has cropped up > since they stopped killing the waiting queries (I just double checked - they > were getting corrupted rows constantly, and we haven't gotten one since that > script stopped killing queries). That's actually a startling indictment of ec2 -- how were you killing your queries exactly? You say this is repeatable? What's your setting of full_page_writes? one way to identify and potentially nuke bad records of this kind is to do something like: select max(length(field1)) from foo order by 1 desc limit 5; where field1 is the first varlen field (text, bytea, etc) from left to right order. look for bogously high values and move on to the next field if you don't find any. once you hit a bad value, try deleting the record by it's key. once you've found/deleted them all, immediately pull off a dump, then rebuild the table. as always, take a filesystem dump before doing this type of surgery... merlin merlin -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general