Hi,

I can't reproduce it...

root@misc01 ~ # rbd -p cor create --size 5000 seb1
root@misc01 ~ # rbd -p cor create --size 5000 seb2
root@misc01 ~ # rbd -p cor map seb1
root@misc01 ~ # rbd -p cor map seb2
root@misc01 ~ # rbd showmapped
id pool image snap device
0  cor  seb1  -    /dev/rbd0
1  cor  seb2  -    /dev/rbd1
root@misc01 ~ # mkfs.ext4 /dev/rbd0
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1024 blocks, Stripe width=1024 blocks
320000 inodes, 1280000 blocks
64000 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1312817152
40 block groups
32768 blocks per group, 32768 fragments per group
8000 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

root@misc01 ~ # mkfs.ext4 /dev/rbd1
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1024 blocks, Stripe width=1024 blocks
320000 inodes, 1280000 blocks
64000 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1312817152
40 block groups
32768 blocks per group, 32768 fragments per group
8000 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

root@misc01 ~ # mkdir -p /mnt/rbd/1
root@misc01 ~ # mkdir -p /mnt/rbd/2
root@misc01 ~ # mount /dev/rbd0 /mnt/rbd/0
mount: mount point /mnt/rbd/0 does not exist
root@misc01 ~ # mount /dev/rbd0 /mnt/rbd/1
root@misc01 ~ # mount /dev/rbd1 /mnt/rbd/2
root@misc01 ~ # df -h
Filesystem                           Size  Used Avail Use% Mounted on
/dev/mapper/server--management-root   19G  6.5G   12G  37% /
udev                                 5.9G  4.0K  5.9G   1% /dev
tmpfs                                2.4G  368K  2.4G   1% /run
none                                 5.0M  4.0K  5.0M   1% /run/lock
none                                 5.9G   17M  5.9G   1% /run/shm
/dev/sdb1                            228M   27M  189M  13% /boot
cgroup                               5.9G     0  5.9G   0% /sys/fs/cgroup
/dev/mapper/server--management-ceph   50G  885M   47G   2% /srv/ceph/mds0
/dev/mapper/server--management-lxc    50G  5.3G   43G  12% /var/lib/lxc
/dev/rbd0                            4.9G  202M  4.5G   5% /mnt/rbd/1
/dev/rbd1                            4.9G  202M  4.5G   5% /mnt/rbd/2
root@misc01 ~ # wget http://cloud-images.ubuntu.com/precise/current/precise-server-cloudimg-i386-disk1.img
--2013-02-12 09:32:34--  http://cloud-images.ubuntu.com/precise/current/precise-server-cloudimg-i386-disk1.img
Resolving cloud-images.ubuntu.com (cloud-images.ubuntu.com)... 91.189.88.141
Connecting to cloud-images.ubuntu.com (cloud-images.ubuntu.com)|91.189.88.141|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 229638144 (219M) [application/octet-stream]
Saving to: `precise-server-cloudimg-i386-disk1.img'

100%[====================================================================================================================================================================================================>] 229,638,144  698K/s   in 5m 23s

2013-02-12 09:37:56 (695 KB/s) - `precise-server-cloudimg-i386-disk1.img' saved [229638144/229638144]

root@misc01 ~ # sha1sum precise-server-cloudimg-i386-disk1.img
25cde0523e060e2bce68f9f3ebaed52b38e98417  precise-server-cloudimg-i386-disk1.img
root@misc01 ~ # cp precise-server-cloudimg-i386-disk1.img /mnt/rbd/1/
root@misc01 ~ # sha1sum /mnt/rbd/1/precise-server-cloudimg-i386-disk1.img
25cde0523e060e2bce68f9f3ebaed52b38e98417  /mnt/rbd/1/precise-server-cloudimg-i386-disk1.img
root@misc01 ~ # cp /mnt/rbd/1/precise-server-cloudimg-i386-disk1.img /mnt/rbd/2/
root@misc01 ~ # sync
root@misc01 ~ # sha1sum /mnt/rbd/2/precise-server-cloudimg-i386-disk1.img
25cde0523e060e2bce68f9f3ebaed52b38e98417  /mnt/rbd/2/precise-server-cloudimg-i386-disk1.img
root@misc01 ~ # ceph -v
ceph version 0.48.3argonaut (commit:920f82e805efec2cae05b79c155c07df0f3ed5dd)
root@misc01 ~ # uname -r
3.2.0-23-generic
root@misc01 ~ # du -h precise-server-cloudimg-i386-disk1.img
219M    precise-server-cloudimg-i386-disk1.img

I tried the same with a bigger file:

root@misc01 ~ # du -h disk
4.5G    disk
root@misc01 ~ # sha1sum disk
a1986abe9d779b296913e8d4f3bea8e5df992419  disk
root@misc01 ~ # cp disk /mnt/rbd/1/
root@misc01 ~ # sha1sum /mnt/rbd/1/disk
a1986abe9d779b296913e8d4f3bea8e5df992419  /mnt/rbd/1/disk
root@misc01 ~ # cp /mnt/rbd/1/disk /mnt/rbd/2/disk
root@misc01 ~ # sha1sum /mnt/rbd/2/disk
a1986abe9d779b296913e8d4f3bea8e5df992419  /mnt/rbd/2/disk

I know 0.48.3 fixed a critical bug, but that fix prevents data loss or
corruption after a power loss or kernel panic event, so your case seems
a bit different.
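For what it's worth, here is the same round-trip check as a small
script, so it is easy to rerun. This is only a sketch: the image and
mount points are the ones from the transcript above, and the
drop_caches step is an extra assumption on my part, to force the
read-back to come from the OSDs rather than the local page cache.

SRC=precise-server-cloudimg-i386-disk1.img
orig=$(sha1sum "$SRC" | awk '{print $1}')   # checksum of the local copy
cp "$SRC" /mnt/rbd/1/                       # copy onto the first RBD
cp /mnt/rbd/1/"$SRC" /mnt/rbd/2/            # then from RBD 1 to RBD 2
sync
echo 3 > /proc/sys/vm/drop_caches           # assumption: re-read from the cluster, not the cache
for f in /mnt/rbd/1/"$SRC" /mnt/rbd/2/"$SRC"; do
    got=$(sha1sum "$f" | awk '{print $1}')
    if [ "$got" = "$orig" ]; then echo "$f: OK"; else echo "$f: MISMATCH ($got)"; fi
done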
--
Regards,
Sébastien Han.


On Tue, Feb 12, 2013 at 12:15 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
> All the OSDs are backed by xfs.  Each RBD is formatted with ext4.
>
> Thanks for the response.
>
> On Mon, Feb 11, 2013 at 6:12 PM, Mike Lowe <j.michael.lowe@xxxxxxxxx> wrote:
>> Are your RBDs backed by btrfs?  I struggled for a very long time with
>> corruption of RBD images until Sage and Samuel helped find a btrfs bug
>> that can truncate sparse files if they are written to at a lower
>> offset right after a higher offset.  The fix for this is now in
>> 3.8-rc7 and the commit is here:
>> https://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=d468abec6b9fd7132d012d33573ecb8056c7c43f
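For context, the write pattern Mike describes, a write at a lower
offset immediately after one at a higher offset in a sparse file, looks
roughly like this. It is only a sketch with arbitrary offsets on a
hypothetical scratch btrfs mount; whether it actually trips the race on
a given kernel is another matter.

cd /mnt/btrfs-scratch        # assumption: a scratch btrfs mount point
truncate -s 4G sparse-test   # create a sparse 4 GB file
dd if=/dev/urandom of=sparse-test bs=1M count=4 seek=3000 conv=notrunc  # write at a high offset
dd if=/dev/urandom of=sparse-test bs=1M count=4 seek=1 conv=notrunc     # then right away at a lower offset
sync
ls -ls sparse-test           # with the bug, the file can come back truncated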
>>
>> On Feb 11, 2013, at 6:06 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>
>>> Hey folks,
>>>
>>> Noticed this today and it has me stumped.
>>>
>>> I have a 10GB raw VM disk image that I've placed inside of an
>>> ext4-formatted RBD.  When I do this, it gets corrupted in weird ways.
>>> I was prepared to show fsck results to demonstrate this, but then I
>>> found an easier way: just compare the sha1sum of the file.  Here's
>>> what I see.
>>>
>>> disk image sitting on a regular (non-RBD) ext4 filesystem:
>>> # sha1sum disk.img
>>> cfd37c33b9de926644f7b13e604374348662bc60  disk.img
>>>
>>> same disk image sitting in RBD #1:
>>> # cp -p disk.img /mnt/rbd1
>>> # sha1sum /mnt/rbd1/disk.img
>>> cfd37c33b9de926644f7b13e604374348662bc60  disk.img
>>>
>>> Great, they match.  But then comes the problematic RBD:
>>> # cp -p disk.img /mnt/rbd2
>>> # sha1sum /mnt/rbd2/disk.img
>>> a28d0735c0f0863a3f84151122da75a56bf5022b  disk.img
>>>
>>> They don't match.  I can also confirm that fsck'ing the filesystem
>>> contained in disk.img reveals numerous errors in the latter case,
>>> while the filesystem is clean in the first two.
>>>
>>> I'm running 0.48.2argonaut on this particular cluster.  The RBDs were
>>> mapped with the kernel client.  The kernel is 3.2.0-29-generic,
>>> running on Ubuntu 12.04.1.
>>>
>>> The only weird thing I've observed is that while the copy was going to
>>> RBD #2, I saw this in ceph -w:
>>>
>>> 2013-02-11 22:18:14.134683 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034857 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.135159 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034858 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.136699 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034859 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.139479 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034860 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.139588 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034861 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.139667 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034862 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.139748 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034863 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.139827 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034864 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>>
>>> I hadn't seen this one before.
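Those "misdirected" warnings usually mean the client sent requests for
PG 4.127 to osd.2 while the OSD map at that epoch lists the acting set
as [4,2], i.e. osd.4 as the primary. One way to check where the cluster
currently maps that PG, and which epoch it is on:

ceph pg map 4.127        # shows the up and acting OSD sets for the PG
ceph -s | grep osdmap    # current osdmap epoch, to compare with the e2459 in the warnings

If the client and the OSD disagree on the current map (one of them is
behind), requests can land on an OSD that is not the primary until the
maps converge.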
>>>
>>> Full disclosure:
>>>
>>> I had a ceph node failure last week (a week ago today) where all three
>>> OSD processes on one of my nodes were killed by the OOM killer.  I
>>> haven't had a chance to go back and look for errors, gather logs, or
>>> ask the list for any advice on what went wrong.  Restarting my OSDs
>>> brought everything back in line -- the cluster handled the failed OSDs
>>> just fine, with one exception: one of my RBDs went
>>> read-only/write-protected.  Even after the cluster was back to
>>> HEALTH_OK, it remained read-only.  I had to unmount, unmap, re-map,
>>> and re-mount that RBD to get it back.  It just so happens that that
>>> RBD is the one giving me problems now, so the two could be related. =)
>>>
>>> It's a small cluster:
>>>
>>> # ceph -s
>>>    health HEALTH_OK
>>>    monmap e1: 3 mons at
>>> {a=10.40.30.0:6789/0,b=10.40.30.1:6789/0,c=10.40.30.2:6789/0},
>>> election epoch 4, quorum 0,1,2 a,b,c
>>>    osdmap e2459: 9 osds: 9 up, 9 in
>>>    pgmap v9525714: 2880 pgs: 2880 active+clean; 2841 GB data, 5649 GB
>>> used, 11109 GB / 16758 GB avail
>>>    mdsmap e1: 0/0/1 up
>>>
>>> # ceph osd tree
>>> dumped osdmap tree epoch 2459
>>> # id    weight  type name       up/down reweight
>>> -1      18      pool default
>>> -3      18              rack unknownrack
>>> -2      6                       host ceph0
>>> 0       2                               osd.0   up      1
>>> 1       2                               osd.1   up      1
>>> 2       2                               osd.2   up      1
>>> -4      6                       host ceph1
>>> 3       2                               osd.3   up      1
>>> 4       2                               osd.4   up      1
>>> 5       2                               osd.5   up      1
>>> -5      6                       host ceph2
>>> 6       2                               osd.6   up      1
>>> 7       2                               osd.7   up      1
>>> 8       2                               osd.8   up      1
>>>
>>> But yeah, I'm just stumped about why files going into that particular
>>> RBD get corrupted.  I tried a smaller file (~140MB) and it was fine.
>>> I haven't gotten to do enough testing to find the threshold for
>>> corruption, or whether it only happens with specific file types.  I
>>> did a similar test with qcow2 images (10G virtual, 4.4GB actual), and
>>> the fsck results were the same -- immediate corruption inside that
>>> RBD.  I did not capture the sha1sums for those files, though I expect
>>> they would differ. =)
>>>
>>> Thanks,
>>>
>>> - Travis

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com