Re: Large file corruption inside of RBD

All the OSDs are backed by xfs.  Each RBD is formatted with ext4.
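
For what it's worth, here is roughly how I double-checked that -- the OSD data
path below is just the default location, and the rbd device names are whatever
the kernel client assigned on my box, so adjust both for your layout:

# mount -t xfs | grep /var/lib/ceph/osd
# blkid /dev/rbd0 /dev/rbd1

which should show the OSD data directories mounted as xfs and TYPE="ext4" on
the mapped RBD devices.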

Thanks for the response.
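
Next on my end, I'm going to try to pin down the size threshold with a simple
checksum test, something along these lines (file names, sizes, and mount
points below are just examples):

Generate a random file of a given size and checksum it:
# dd if=/dev/urandom of=/tmp/test.img bs=1M count=1024
# sha1sum /tmp/test.img

Copy it into the suspect RBD and checksum the copy:
# cp -p /tmp/test.img /mnt/rbd2
# sha1sum /mnt/rbd2/test.img

Then repeat with larger counts (2048, 4096, ...) until the sums stop matching,
and try the same with a few different file types.  I'll report back with what
I find.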

On Mon, Feb 11, 2013 at 6:12 PM, Mike Lowe <j.michael.lowe@xxxxxxxxx> wrote:
> Are your RBDs backed by btrfs?  I struggled for a very long time with corruption of RBD images until Sage and Samuel helped find a btrfs bug that can truncate sparse files if they are written to at a lower offset right after a higher offset.  The fix for this is now in 3.8-rc7, and the commit is here: https://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=d468abec6b9fd7132d012d33573ecb8056c7c43f
>
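
[Side note for anyone who finds this thread while running btrfs-backed OSDs:
as I read the bug Mike describes, the failing pattern is a sparse file that
gets a write at a high offset immediately followed by a write at a lower
offset.  A rough illustration with dd -- the file name and offsets here are
arbitrary, and the second write needs conv=notrunc so dd itself doesn't
truncate the file:

# dd if=/dev/zero of=sparse.img bs=1M count=1 seek=4096 conv=notrunc
# dd if=/dev/zero of=sparse.img bs=1M count=1 seek=16 conv=notrunc
# ls -ls sparse.img

On an affected btrfs kernel that second write can leave the file truncated;
our OSDs are on xfs, so that particular bug shouldn't apply here.]
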
> On Feb 11, 2013, at 6:06 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>
>> Hey folks,
>>
>> Noticed this today and it has me stumped.
>>
>> I have a 10GB raw VM disk image that I've placed inside of an
>> ext4-formatted RBD.  When I do this, it gets corrupted in weird ways.
>> I was prepared to show fsck results to show this, but then I found an
>> easier way was just by looking at the sha1sum for the file.  Here's
>> what I see.
>>
>> disk image sitting on regular (non-RBD) ext4 filesystem:
>> # sha1sum disk.img
>> cfd37c33b9de926644f7b13e604374348662bc60  disk.img
>>
>> same disk image sitting in RBD #1
>> # cp -p disk.img /mnt/rbd1
>> # sha1sum /mnt/rbd1/disk.img
>> cfd37c33b9de926644f7b13e604374348662bc60  disk.img
>>
>> Great, they match.  But then comes the problematic RBD:
>> # cp -p disk.img /mnt/rbd2
>> # sha1sum /mnt/rbd2/disk.img
>> a28d0735c0f0863a3f84151122da75a56bf5022b  disk.img
>>
>> They don't match.  I can also confirm that fsck'ing the filesystem
>> contained in disk.img reveals numerous errors in the latter case,
>> while the filesystem is clean in the first two.
>>
>> I'm running 0.48.2argonaut on this particular cluster.
>> RBDs were mapped with the kernel client.  Kernel is 3.2.0-29-generic,
>> running on Ubuntu 12.04.1.
>>
>> The only weird thing I've observed is that while the copy was going to
>> RBD #2, I saw this in ceph -w:
>> 2013-02-11 22:18:14.134683 osd.2 [WRN] client.7830
>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034857 4.127 to
>> osd.2 not [4,2] in e2459/2459
>> 2013-02-11 22:18:14.135159 osd.2 [WRN] client.7830
>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034858 4.127 to
>> osd.2 not [4,2] in e2459/2459
>> 2013-02-11 22:18:14.136699 osd.2 [WRN] client.7830
>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034859 4.127 to
>> osd.2 not [4,2] in e2459/2459
>> 2013-02-11 22:18:14.139479 osd.2 [WRN] client.7830
>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034860 4.127 to
>> osd.2 not [4,2] in e2459/2459
>> 2013-02-11 22:18:14.139588 osd.2 [WRN] client.7830
>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034861 4.127 to
>> osd.2 not [4,2] in e2459/2459
>> 2013-02-11 22:18:14.139667 osd.2 [WRN] client.7830
>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034862 4.127 to
>> osd.2 not [4,2] in e2459/2459
>> 2013-02-11 22:18:14.139748 osd.2 [WRN] client.7830
>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034863 4.127 to
>> osd.2 not [4,2] in e2459/2459
>> 2013-02-11 22:18:14.139827 osd.2 [WRN] client.7830
>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034864 4.127 to
>> osd.2 not [4,2] in e2459/2459
>>
>> I hadn't seen this one before.
>>
>> Full disclosure:
>>
>> I had a ceph node failure last week (a week ago today) where all three
>> OSD processes on one of my nodes got killed by the OOM killer.  I
>> haven't had a chance to go back and look for errors, gather logs, or
>> ask the list for any advice on what went wrong.  Restarting my OSDs
>> brought everything back in line -- the cluster handled the failed OSDs
>> just fine, with one exception: one of my RBDs went
>> read-only/write-protected.  Even after the cluster was back to
>> HEALTH_OK, it remained read-only.  I had to unmount, unmap, re-map, and
>> re-mount the RBD to get it back.  It just so happens that that RBD is
>> the one giving me problems now.  So they could be related.  =)
>>
>> It's a small cluster:
>>
>> # ceph -s
>>   health HEALTH_OK
>>   monmap e1: 3 mons at
>> {a=10.40.30.0:6789/0,b=10.40.30.1:6789/0,c=10.40.30.2:6789/0},
>> election epoch 4, quorum 0,1,2 a,b,c
>>   osdmap e2459: 9 osds: 9 up, 9 in
>>    pgmap v9525714: 2880 pgs: 2880 active+clean; 2841 GB data, 5649 GB
>> used, 11109 GB / 16758 GB avail
>>   mdsmap e1: 0/0/1 up
>>
>> # ceph osd tree
>> dumped osdmap tree epoch 2459
>> # id  weight  type name       up/down reweight
>> -1    18      pool default
>> -3    18              rack unknownrack
>> -2    6                       host ceph0
>> 0     2                               osd.0   up      1
>> 1     2                               osd.1   up      1
>> 2     2                               osd.2   up      1
>> -4    6                       host ceph1
>> 3     2                               osd.3   up      1
>> 4     2                               osd.4   up      1
>> 5     2                               osd.5   up      1
>> -5    6                       host ceph2
>> 6     2                               osd.6   up      1
>> 7     2                               osd.7   up      1
>> 8     2                               osd.8   up      1
>>
>> But yeah, I'm just stumped about why files going into that particular
>> RBD get corrupted.  I tried a smaller file (~140MB) and it was fine.
>> I haven't gotten to do enough testing to find the threshold for
>> corruption, or whether it only happens for specific file types.  I did
>> a similar test with qcow2 images (10G virtual, 4.4GB actual), and the
>> fsck results were the same -- immediate corruption inside that RBD.  I
>> did not capture the sha1sum for those files though; I expect they
>> would differ.  =)
>>
>> Thanks,
>>
>> - Travis
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

