Are your RBDs backed by btrfs? I struggled for a very long time with corruption of RBD images until Sage and Samuel helped track down a btrfs bug that can truncate sparse files if they are written to at a lower offset right after a higher offset. The fix for this is now in 3.8-rc7, and the commit is here:

https://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=d468abec6b9fd7132d012d33573ecb8056c7c43f
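If you want to check whether your kernel is affected independently of Ceph, the access pattern is roughly the one sketched below. This is only a rough illustration (the mount point, file name, and offsets are made up, and I can't promise this exact sequence reproduces it on its own), but it shows the shape of the problem: write at a high offset into a sparse file, then immediately write at a lower offset.

    # create a sparse file on a btrfs mount (path is just an example)
    truncate -s 1G /mnt/btrfs-test/sparsefile
    # write a block at a high offset, then right away at a lower offset
    dd if=/dev/urandom of=/mnt/btrfs-test/sparsefile bs=4K count=1 seek=200000 conv=notrunc
    dd if=/dev/urandom of=/mnt/btrfs-test/sparsefile bs=4K count=1 seek=100 conv=notrunc
    # on an affected kernel the apparent size can come back smaller than expected
    ls -ls /mnt/btrfs-test/sparsefile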
On Feb 11, 2013, at 6:06 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:

> Hey folks,
>
> Noticed this today and it has me stumped.
>
> I have a 10GB raw VM disk image that I've placed inside of an
> ext4-formatted RBD. When I do this, it gets corrupted in weird ways.
> I was prepared to show fsck results to show this, but then I found an
> easier way was just by looking at the sha1sum for the file. Here's
> what I see.
>
> disk image sitting on regular (non-RBD) ext4 filesystem:
> # sha1sum disk.img
> cfd37c33b9de926644f7b13e604374348662bc60 disk.img
>
> same disk image sitting in RBD #1
> # cp -p disk.img /mnt/rbd1
> # sha1sum /mnt/rbd1/disk.img
> cfd37c33b9de926644f7b13e604374348662bc60 disk.img
>
> Great, they match. But then comes the problematic RBD:
> # cp -p disk.img /mnt/rbd2
> # sha1sum /mnt/rbd2/disk.img
> a28d0735c0f0863a3f84151122da75a56bf5022b disk.img
>
> They don't match. I can also confirm that fsck'ing the filesystem
> contained in disk.img reveals numerous errors in the latter case,
> while the system is clean in the first two.
>
> I'm running 0.48.2argonaut on this particular cluster.
> RBDs were mapped with the kernel client. Kernel is 3.2.0-29-generic,
> running in Ubuntu 12.04.1.
>
> The only weird thing I've observed is that while the copy was going to
> RBD #2, I saw this in ceph -w:
>
> 2013-02-11 22:18:14.134683 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034857 4.127 to osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.135159 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034858 4.127 to osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.136699 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034859 4.127 to osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.139479 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034860 4.127 to osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.139588 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034861 4.127 to osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.139667 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034862 4.127 to osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.139748 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034863 4.127 to osd.2 not [4,2] in e2459/2459
> 2013-02-11 22:18:14.139827 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034864 4.127 to osd.2 not [4,2] in e2459/2459
>
> I hadn't seen this one before.
>
> Full disclosure:
>
> I had a ceph node failure last week (a week ago today) where all three
> OSD processes on one of my nodes got killed by OOM. I haven't had a
> chance to go back and look for errors, gather logs, or ask the list
> for any advice on what went wrong. Restarting my OSDs brought
> everything back inline -- the cluster handled the failed OSDs just
> fine, with one exception. One of my RBDs went
> read-only/write-protected. Even after the cluster was back to
> HEALTH_OK, it remained read-only. I had to unmount, unmap, map, mount
> my RBD to get it back. It just so happens that that RBD is the one
> giving me problems now. So they could be related. =)
>
> It's a small cluster:
>
> # ceph -s
>    health HEALTH_OK
>    monmap e1: 3 mons at {a=10.40.30.0:6789/0,b=10.40.30.1:6789/0,c=10.40.30.2:6789/0}, election epoch 4, quorum 0,1,2 a,b,c
>    osdmap e2459: 9 osds: 9 up, 9 in
>    pgmap v9525714: 2880 pgs: 2880 active+clean; 2841 GB data, 5649 GB used, 11109 GB / 16758 GB avail
>    mdsmap e1: 0/0/1 up
>
> # ceph osd tree
> dumped osdmap tree epoch 2459
> # id   weight  type name               up/down  reweight
> -1     18      pool default
> -3     18        rack unknownrack
> -2     6           host ceph0
> 0      2              osd.0            up       1
> 1      2              osd.1            up       1
> 2      2              osd.2            up       1
> -4     6           host ceph1
> 3      2              osd.3            up       1
> 4      2              osd.4            up       1
> 5      2              osd.5            up       1
> -5     6           host ceph2
> 6      2              osd.6            up       1
> 7      2              osd.7            up       1
> 8      2              osd.8            up       1
>
> But yeah, I'm just stumped about why files going into that particular
> RBD get corrupted. I tried a smaller file (~140MB) and it was fine.
> I haven't gotten to do enough testing to find the threshold for
> corruption. Or if it only happens for specific file types. I did a
> similar test with qcow2 images (10G virtual, 4.4GB actual), and the
> fsck results were the same -- immediate corruption inside that RBD. I
> did not capture the sha1sum for those files though. I expect they
> would differ. =)
>
> Thanks,
>
> - Travis
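Regarding narrowing down the size threshold mentioned above: a loop along these lines might help pin it down. It is only a sketch; the temporary paths and the list of sizes are placeholders, and /mnt/rbd2 is assumed to be the suspect RBD mount.

    # copy progressively larger files into the suspect RBD and compare checksums
    for size in 128 256 512 1024 2048 4096; do
        dd if=/dev/urandom of=/tmp/test-${size}M.bin bs=1M count=${size}
        cp /tmp/test-${size}M.bin /mnt/rbd2/
        sync
        # drop the page cache so the second checksum is read back from the RBD
        echo 3 > /proc/sys/vm/drop_caches
        echo "${size}M:"
        sha1sum /tmp/test-${size}M.bin /mnt/rbd2/test-${size}M.bin
    done

The first size at which the two sums diverge gives you a rough threshold to report.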