Large file corruption inside of RBD

Travis Rhoden <trhoden@xxxxxxxxx> · Mon, 11 Feb 2013 18:06:48 -0500

Hey folks,

Noticed this today and it has me stumped.

I have a 10GB raw VM disk image that I've placed inside an
ext4-formatted RBD.  When I do this, the image gets corrupted in weird
ways.  I was going to post fsck results to demonstrate, but comparing
the file's sha1sum turned out to be an easier way to show it.  Here's
what I see.

disk image sitting on regular (non-RBD) ext4 filesystem:
# sha1sum disk.img
cfd37c33b9de926644f7b13e604374348662bc60  disk.img

same disk image sitting in RBD #1
# cp -p disk.img /mnt/rbd1
# sha1sum /mnt/rbd1/disk.img
cfd37c33b9de926644f7b13e604374348662bc60  disk.img

Great, they match.  But then comes the problematic RBD:
# cp -p disk.img /mnt/rbd2
# sha1sum /mnt/rbd2/disk.img
a28d0735c0f0863a3f84151122da75a56bf5022b  disk.img

They don't match.  I can also confirm that fsck'ing the filesystem
contained in disk.img reveals numerous errors in the latter case,
while the filesystem is clean in the first two.
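
A checksum only says that the two copies differ, not where.  As a
minimal sketch (not something from the original test), the following
compares two copies in fixed-size chunks and reports which chunks
differ.  The function name differing_chunks and the 4 MiB chunk size
are assumptions for illustration; 4 MiB is the usual default RBD
object size, but file offsets inside ext4 do not map one-to-one onto
RBD objects, so treat any alignment pattern as a hint only.

```python
CHUNK = 4 * 1024 * 1024  # assumption: matches the default 4 MiB RBD object size


def differing_chunks(path_a, path_b, chunk_size=CHUNK):
    """Return the indices of fixed-size chunks whose contents differ.

    A length mismatch shows up as differing chunks at the tail.
    """
    bad = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(index)
            index += 1
    return bad
```

For the files above this would be called as
differing_chunks("disk.img", "/mnt/rbd2/disk.img"); multiplying each
returned index by the chunk size gives the byte offset of the damaged
region.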

I'm running 0.48.2argonaut on this particular cluster.
RBDs were mapped with the kernel client.  Kernel is 3.2.0-29-generic,
running in Ubuntu 12.04.1.

The only weird thing I've observed is that while the copy was going to
RBD #2, I saw this in ceph -w:
2013-02-11 22:18:14.134683 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034857 4.127 to osd.2 not [4,2] in e2459/2459
2013-02-11 22:18:14.135159 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034858 4.127 to osd.2 not [4,2] in e2459/2459
2013-02-11 22:18:14.136699 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034859 4.127 to osd.2 not [4,2] in e2459/2459
2013-02-11 22:18:14.139479 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034860 4.127 to osd.2 not [4,2] in e2459/2459
2013-02-11 22:18:14.139588 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034861 4.127 to osd.2 not [4,2] in e2459/2459
2013-02-11 22:18:14.139667 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034862 4.127 to osd.2 not [4,2] in e2459/2459
2013-02-11 22:18:14.139748 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034863 4.127 to osd.2 not [4,2] in e2459/2459
2013-02-11 22:18:14.139827 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034864 4.127 to osd.2 not [4,2] in e2459/2459

I hadn't seen this one before.
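
To see whether the warnings all point at the same placement group and
OSD, the ceph -w output can be tallied.  This is a rough sketch that
assumes the log format is exactly what is pasted above; the regex
group names (reporter, pg, target, acting and so on) are my own
reading of the fields, not official Ceph terminology, and the helper
names are hypothetical.

```python
import re
from collections import Counter

# Pattern derived only from the log lines quoted above; the group names
# are an interpretation of the fields, not names used by Ceph itself.
MISDIRECTED = re.compile(
    r"(?P<reporter>osd\.\d+) \[WRN\] (?P<client>client\.\d+) \S+ "
    r"misdirected (?P<request>\S+) (?P<pg>\S+) to "
    r"(?P<target>osd\.\d+) not \[(?P<acting>[\d,]+)\] in (?P<epochs>\S+)"
)


def parse_misdirected(text):
    """Extract one dict per 'misdirected' warning, tolerating wrapped lines."""
    flat = " ".join(text.split())  # undo mail-client line wrapping
    return [m.groupdict() for m in MISDIRECTED.finditer(flat)]


def summarize(entries):
    """Count warnings per (pg, target OSD) pair."""
    return Counter((e["pg"], e["target"]) for e in entries)
```

Feeding the eight warnings above through parse_misdirected and
summarize should show every one of them referring to 4.127 and osd.2,
which would suggest the client kept sending requests for that one
placement group to osd.2 while the map listed [4,2] for it.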

Full disclosure:

I had a ceph node failure last week (a week ago today) where all three
OSD processes on one of my nodes got killed by the OOM killer.  I
haven't had a chance to go back and look for errors, gather logs, or
ask the list for any advice on what went wrong.  Restarting my OSDs
brought everything back in line -- the cluster handled the failed OSDs
just fine, with one exception.  One of my RBDs went
read-only/write-protected.  Even after the cluster was back to
HEALTH_OK, it remained read-only.  I had to unmount, unmap, map, and
mount my RBD to get it back.  It just so happens that that RBD is the
one giving me problems now.  So they could be related.  =)

It's a small cluster:

# ceph -s
   health HEALTH_OK
   monmap e1: 3 mons at {a=10.40.30.0:6789/0,b=10.40.30.1:6789/0,c=10.40.30.2:6789/0}, election epoch 4, quorum 0,1,2 a,b,c
   osdmap e2459: 9 osds: 9 up, 9 in
    pgmap v9525714: 2880 pgs: 2880 active+clean; 2841 GB data, 5649 GB used, 11109 GB / 16758 GB avail
   mdsmap e1: 0/0/1 up

# ceph osd tree
dumped osdmap tree epoch 2459
# id	weight	type name	up/down	reweight
-1	18	pool default
-3	18		rack unknownrack
-2	6			host ceph0
0	2				osd.0	up	1	
1	2				osd.1	up	1	
2	2				osd.2	up	1	
-4	6			host ceph1
3	2				osd.3	up	1	
4	2				osd.4	up	1	
5	2				osd.5	up	1	
-5	6			host ceph2
6	2				osd.6	up	1	
7	2				osd.7	up	1	
8	2				osd.8	up	1

But yeah, I'm just stumped about why files going into that particular
RBD get corrupted.  I tried a smaller file (~140MB) and it was fine.
I haven't done enough testing to find the size threshold for
corruption, or to tell whether it only happens for specific file
types.  I did a similar test with qcow2 images (10G virtual, 4.4GB
actual), and the fsck results were the same -- immediate corruption
inside that RBD.  I did not capture the sha1sum for those files
though.  I expect they would differ.  =)
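
One way to look for that size threshold is to write files of
increasing size with known contents and hash them on read-back.  The
sketch below is only an illustration under stated assumptions: the
helper names (pattern_block, write_pattern, file_sha1,
first_bad_size) are hypothetical, the pattern is arbitrary, and a
read straight after a write may be served from the page cache, so
dropping caches or remounting between the write and the read would
give a more honest result.

```python
import hashlib
import os

BLOCK = 1024 * 1024  # write and hash in 1 MiB pieces


def pattern_block(index, size=BLOCK):
    """Deterministic block whose content depends on its index."""
    seed = hashlib.sha1(str(index).encode()).digest()  # 20 bytes
    reps = size // len(seed) + 1
    return (seed * reps)[:size]


def write_pattern(path, total_bytes):
    """Write total_bytes of pattern data, fsync, return the expected SHA-1."""
    digest = hashlib.sha1()
    written = 0
    index = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            block = pattern_block(index, min(BLOCK, total_bytes - written))
            f.write(block)
            digest.update(block)
            written += len(block)
            index += 1
        f.flush()
        os.fsync(f.fileno())  # push the data out to the block device
    return digest.hexdigest()


def file_sha1(path):
    """SHA-1 of a file, read back in BLOCK-sized pieces."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(BLOCK), b""):
            digest.update(block)
    return digest.hexdigest()


def first_bad_size(directory, sizes):
    """Return the first size whose read-back hash mismatches, else None."""
    for size in sizes:
        path = os.path.join(directory, "pattern-%d.bin" % size)
        expected = write_pattern(path, size)
        actual = file_sha1(path)
        os.remove(path)
        if actual != expected:
            return size
    return None
```

Something like first_bad_size("/mnt/rbd2", [2**n for n in range(27, 34)])
would step from 128MB up to 8GB; the same call against /mnt/rbd1 gives
a baseline from the RBD that behaves.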

Thanks,

 - Travis
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com