Re: Large file corruption inside of RBD

Are your RBDs backed by btrfs?  I struggled for a very long time with corruption of RBD images until Sage and Samuel helped track down a btrfs bug that can truncate sparse files if they are written to at a lower offset right after a higher offset.  The fix is now in 3.8-rc7, and the commit is here: https://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=d468abec6b9fd7132d012d33573ecb8056c7c43f
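
For anyone who wants to check whether their kernel is affected, the
trigger was roughly "write at a high offset into a sparse file, then
immediately write at a lower offset."  Here's a minimal sketch of that
pattern in Python -- my own illustration, not the actual reproducer;
the path and offsets are arbitrary:

    import os

    PATH = "/mnt/btrfs-test/sparse-file"   # any file on a btrfs mount (made-up path)

    with open(PATH, "wb") as f:
        f.seek(8 * 1024 * 1024)            # leave a hole, land at a high offset
        f.write(b"A" * 4096)               # write at the high offset first...
        f.seek(0)
        f.write(b"B" * 4096)               # ...then right away at a lower offset

    # With the fix in place the size should always be 8 MB + 4 KB; if the
    # truncation bug hits, the file can come back shorter than that.
    print(os.stat(PATH).st_size)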

On Feb 11, 2013, at 6:06 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:

> Hey folks,
> 
> Noticed this today and it has me stumped.
> 
> I have a 10GB raw VM disk image that I've placed inside of an
> ext4-formatted RBD.  When I do this, it gets corrupted in weird ways.
> I was prepared to show fsck results to demonstrate this, but then
> found it was easier to just compare sha1sums of the file.  Here's
> what I see.
> 
> disk image sitting on regular (non-RBD) ext4 filesystem:
> # sha1sum disk.img
> cfd37c33b9de926644f7b13e604374348662bc60  disk.img
> 
> same disk image sitting in RBD #1:
> # cp -p disk.img /mnt/rbd1
> # sha1sum /mnt/rbd1/disk.img
> cfd37c33b9de926644f7b13e604374348662bc60  disk.img
> 
> Great, they match.  But then comes the problematic RBD:
> # cp -p disk.img /mnt/rbd2
> # sha1sum /mnt/rbd2/disk.img
> a28d0735c0f0863a3f84151122da75a56bf5022b  disk.img
> 
> They don't match.  I can also confirm that fsck'ing the filesystem
> contained in disk.img reveals numerous errors in the latter case,
> while the filesystem is clean in the first two.
> 
> I'm running 0.48.2argonaut on this particular cluster.
> RBDs were mapped with the kernel client.  Kernel is 3.2.0-29-generic,
> running in Ubuntu 12.04.1.
> 
> The only weird thing I've observed is that while the copy was going to
> RBD #2, I saw this in ceph -w:
> 2013-02-11 22:18:14.134683 osd.2 [WRN] client.7830
> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034857 4.127 to
> osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.135159 osd.2 [WRN] client.7830
> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034858 4.127 to
> osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.136699 osd.2 [WRN] client.7830
> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034859 4.127 to
> osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.139479 osd.2 [WRN] client.7830
> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034860 4.127 to
> osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.139588 osd.2 [WRN] client.7830
> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034861 4.127 to
> osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.139667 osd.2 [WRN] client.7830
> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034862 4.127 to
> osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.139748 osd.2 [WRN] client.7830
> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034863 4.127 to
> osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.139827 osd.2 [WRN] client.7830
> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034864 4.127 to
> osd.2 not [4,2] in e2459/2459
> 
> I hadn't seen this one before.
> 
> Full disclosure:
> 
> I had a ceph node failure last week (a week ago today) where all three
> OSD processes on one of my nodes were killed by the OOM killer.  I
> haven't had a chance to go back and look for errors, gather logs, or
> ask the list for advice on what went wrong.  Restarting my OSDs
> brought everything back in line -- the cluster handled the failed OSDs
> just fine, with one exception: one of my RBDs went
> read-only/write-protected.  Even after the cluster was back to
> HEALTH_OK, it remained read-only.  I had to unmount, unmap, re-map,
> and re-mount that RBD to get it back.  It just so happens that it is
> the RBD giving me problems now, so the two could be related.  =)
> 
> It's a small cluster:
> 
> # ceph -s
>   health HEALTH_OK
>   monmap e1: 3 mons at
> {a=10.40.30.0:6789/0,b=10.40.30.1:6789/0,c=10.40.30.2:6789/0},
> election epoch 4, quorum 0,1,2 a,b,c
>   osdmap e2459: 9 osds: 9 up, 9 in
>    pgmap v9525714: 2880 pgs: 2880 active+clean; 2841 GB data, 5649 GB
> used, 11109 GB / 16758 GB avail
>   mdsmap e1: 0/0/1 up
> 
> # ceph osd tree
> dumped osdmap tree epoch 2459
> # id	weight	type name	up/down	reweight
> -1	18	pool default
> -3	18		rack unknownrack
> -2	6			host ceph0
> 0	2				osd.0	up	1	
> 1	2				osd.1	up	1	
> 2	2				osd.2	up	1	
> -4	6			host ceph1
> 3	2				osd.3	up	1	
> 4	2				osd.4	up	1	
> 5	2				osd.5	up	1	
> -5	6			host ceph2
> 6	2				osd.6	up	1	
> 7	2				osd.7	up	1	
> 8	2				osd.8	up	1
> 
> But yeah, I'm just stumped about why files going into that particular
> RBD get corrupted.  I tried a smaller file (~140MB) and it was fine.
> I haven't done enough testing yet to find the size threshold for
> corruption, or whether it only happens with specific file types.  I
> did a similar test with qcow2 images (10G virtual, 4.4GB actual), and
> the fsck results were the same -- immediate corruption inside that
> RBD.  I did not capture the sha1sums for those files, though I expect
> they would differ.  =)
> 
> Thanks,
> 
> - Travis
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

