Re: Corrupted Data ?

Ioana Danes <ioanadanes@xxxxxxxxx> · Fri, 12 Aug 2016 11:49:42 -0400

On Fri, Aug 12, 2016 at 11:44 AM, Ioana Danes <ioanadanes@xxxxxxxxx> wrote:

On Fri, Aug 12, 2016 at 11:34 AM, Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:
On 08/12/2016 08:30 AM, Ioana Danes wrote:

On Fri, Aug 12, 2016 at 11:26 AM, Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:

    On 08/12/2016 08:10 AM, Ioana Danes wrote:

        On Fri, Aug 12, 2016 at 10:47 AM, Francisco Olarte <folarte@xxxxxxxxxxxxxx> wrote:

            CCing to the list...

        Thanks

            On Fri, Aug 12, 2016 at 4:10 PM, Ioana Danes <ioanadanes@xxxxxxxxx> wrote:

            >> given 318220 and 318216 are just a bit away ( 4db08/4db0c ), and it
            >> repeats sporadically, have you ruled out ( by having page checksums or
            >> other mechanism ) a potential disk read/write error ?
            >>
            >> > Also the index is correct on db3 as the record in case (with drawid =
            >> > 318216) is retrieved if I filter by drawid = 318220
            >>
            >> Specially if this happens, you may have some slightly bad disks/ram/
            >> leading to this kind of problems.
            >>

            >

            > Could be. I also had some issues with an rsync between db3 and drdb a week
            > ago that did not complete for bigger files (> 200MB) and gave me some
            > corruption messages. Then the system was rebooted and everything seemed
            > fine but apparently it is not.
            > I am planning to drop & create the table from a good backup and if that does
            > not fix the issue then I will rebuild the server.

            I would check whatever logs you can ( syslog or eventlog, smart log,
            etc. ) hunting for disk errors ( sometimes they are reported ). This
            kind of problem, with programs as tested as postgres and rsync, tends
            to indicate controller/RAM/disk going bad ( in your case it could be
            caused by a single bit getting flipped in a sector for the data
            portion of the table, and not being propagated either because it
            happened after your sync of drdb or because it was synced from the WAL
            and not the table, or because it was read from the disk cache ).
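A quick way to sanity-check the "single bit flipped" reading is to XOR the two drawid values mentioned above and list the bit positions where they differ. This is only an illustrative sketch: the two numbers come from the thread, while the helper name `differing_bits` is made up for the example and is not part of postgres or any tool discussed here.

```python
from typing import List


def differing_bits(a: int, b: int) -> List[int]:
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b  # XOR leaves a 1 only where the two values disagree
    return [pos for pos in range(diff.bit_length()) if (diff >> pos) & 1]


# The two drawid values reported in the thread.
id_a, id_b = 318220, 318216

print(hex(id_a), hex(id_b))        # 0x4db0c 0x4db08
print(differing_bits(id_a, id_b))  # [2]
```

Exactly one differing position (bit 2, value 4) is consistent with a single flipped bit, although on its own it does not say where the flip happened (disk, RAM, controller or in transit).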

        I agree, unfortunately I did not find any clues about corruption or any
        anomalies in the logs.
        I will work tonight to rebuild that table and see where I go from there.

    The db3 database is on a different machine from all the other
    databases you set up, correct?

Yes, they are all different VMs. The first 3 dbs are on the same cluster, but
drdb is a remote machine.

Aah, another player in the mix.

What virtualization technology are you using?

kvm
Sorry, I should add more info:
kernel 4.7
and the filesystem is xfs vs ext3/ext4

Thank you

        Thanks,

        ioana

            Francisco Olarte.

    --

    Adrian Klaver

    adrian.klaver@xxxxxxxxxxx
