Re: Large file corruption inside of RBD

Hi,

I can't reproduce it...

root@misc01 ~ # rbd -p cor create --size 5000 seb1
root@misc01 ~ # rbd -p cor create --size 5000 seb2
root@misc01 ~ # rbd -p cor map seb1
root@misc01 ~ # rbd -p cor map seb2

root@misc01 ~ # rbd showmapped
id pool image snap device
0 cor seb1 - /dev/rbd0
1 cor seb2 - /dev/rbd1

root@misc01 ~ # mkfs.ext4 /dev/rbd0
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1024 blocks, Stripe width=1024 blocks
320000 inodes, 1280000 blocks
64000 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1312817152
40 block groups
32768 blocks per group, 32768 fragments per group
8000 inodes per group
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736
Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done


root@misc01 ~ # mkfs.ext4 /dev/rbd1
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1024 blocks, Stripe width=1024 blocks
320000 inodes, 1280000 blocks
64000 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1312817152
40 block groups
32768 blocks per group, 32768 fragments per group
8000 inodes per group
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736
Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done


root@misc01 ~ # mkdir -p /mnt/rbd/1
root@misc01 ~ # mkdir -p /mnt/rbd/2
root@misc01 ~ # mount /dev/rbd0 /mnt/rbd/0
mount: mount point /mnt/rbd/0 does not exist
root@misc01 ~ # mount /dev/rbd0 /mnt/rbd/1
root@misc01 ~ # mount /dev/rbd1 /mnt/rbd/2

root@misc01 ~ # df -h
Filesystem                           Size  Used Avail Use% Mounted on
/dev/mapper/server--management-root   19G  6.5G   12G  37% /
udev                                 5.9G  4.0K  5.9G   1% /dev
tmpfs                                2.4G  368K  2.4G   1% /run
none                                 5.0M  4.0K  5.0M   1% /run/lock
none                                 5.9G   17M  5.9G   1% /run/shm
/dev/sdb1                            228M   27M  189M  13% /boot
cgroup                               5.9G     0  5.9G   0% /sys/fs/cgroup
/dev/mapper/server--management-ceph   50G  885M   47G   2% /srv/ceph/mds0
/dev/mapper/server--management-lxc    50G  5.3G   43G  12% /var/lib/lxc
/dev/rbd0                            4.9G  202M  4.5G   5% /mnt/rbd/1
/dev/rbd1                            4.9G  202M  4.5G   5% /mnt/rbd/2

root@misc01 ~ # wget http://cloud-images.ubuntu.com/precise/current/precise-server-cloudimg-i386-disk1.img
--2013-02-12 09:32:34--  http://cloud-images.ubuntu.com/precise/current/precise-server-cloudimg-i386-disk1.img

Resolving cloud-images.ubuntu.com (cloud-images.ubuntu.com)... 91.189.88.141
Connecting to cloud-images.ubuntu.com
(cloud-images.ubuntu.com)|91.189.88.141|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 229638144 (219M) [application/octet-stream]
Saving to: `precise-server-cloudimg-i386-disk1.img'

100%[==========================================>] 229,638,144  698K/s   in 5m 23s


2013-02-12 09:37:56 (695 KB/s) -
`precise-server-cloudimg-i386-disk1.img' saved [229638144/229638144]

root@misc01 ~ # sha1sum precise-server-cloudimg-i386-disk1.img
25cde0523e060e2bce68f9f3ebaed52b38e98417  precise-server-cloudimg-i386-disk1.img

root@misc01 ~ # cp precise-server-cloudimg-i386-disk1.img /mnt/rbd/1/

root@misc01 ~ # sha1sum /mnt/rbd/1/precise-server-cloudimg-i386-disk1.img
25cde0523e060e2bce68f9f3ebaed52b38e98417  /mnt/rbd/1/precise-server-cloudimg-i386-disk1.img

root@misc01 ~ # cp /mnt/rbd/1/precise-server-cloudimg-i386-disk1.img /mnt/rbd/2/

root@misc01 ~ # sync

root@misc01 ~ # sha1sum /mnt/rbd/2/precise-server-cloudimg-i386-disk1.img
25cde0523e060e2bce68f9f3ebaed52b38e98417  /mnt/rbd/2/precise-server-cloudimg-i386-disk1.img
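
To make sure that last checksum really comes back from the RBD and not just from the page cache, one could also drop the caches and read the file again (just a sketch, same mount point as above):

root@misc01 ~ # sync
root@misc01 ~ # echo 3 > /proc/sys/vm/drop_caches
root@misc01 ~ # sha1sum /mnt/rbd/2/precise-server-cloudimg-i386-disk1.img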

root@misc01 ~ # ceph -v
ceph version 0.48.3argonaut (commit:920f82e805efec2cae05b79c155c07df0f3ed5dd)

root@misc01 ~ # uname -r
3.2.0-23-generic

root@misc01 ~ # du -h precise-server-cloudimg-i386-disk1.img
219M precise-server-cloudimg-i386-disk1.img

Tried the same with:

root@misc01 ~ # du -h disk
4.5G disk

root@misc01 ~ # sha1sum disk
a1986abe9d779b296913e8d4f3bea8e5df992419  disk

root@misc01 ~ # cp disk /mnt/rbd/1/
root@misc01 ~ # sha1sum /mnt/rbd/1/disk
a1986abe9d779b296913e8d4f3bea8e5df992419  /mnt/rbd/1/disk

root@misc01 ~ # cp /mnt/rbd/1/disk /mnt/rbd/2/disk
root@misc01 ~ # sha1sum /mnt/rbd/2/disk
a1986abe9d779b296913e8d4f3bea8e5df992419  /mnt/rbd/2/disk
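
To stress it a bit more, a small loop could repeat the copy and checksum a few times (just a sketch, reusing the file and mount points from the test above):

root@misc01 ~ # for i in $(seq 1 5); do cp /mnt/rbd/1/disk /mnt/rbd/2/disk; sync; sha1sum /mnt/rbd/2/disk; done

Every pass should print the same hash as the source file.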

I know 0.48.3 fixed a critical bug, but that fix addressed data loss and corruption after a power loss or kernel panic, so your issue looks a bit different.
--
Regards,
Sébastien Han.


On Tue, Feb 12, 2013 at 12:15 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
> All the OSDs are backed by xfs.  Each RBD is formatted with ext4.
>
> Thanks for the response.
>
> On Mon, Feb 11, 2013 at 6:12 PM, Mike Lowe <j.michael.lowe@xxxxxxxxx> wrote:
>> Are your RBDs backed by btrfs?  I struggled for a very long time with corruption of RBD images until Sage and Samuel helped find a btrfs bug that can truncate sparse files if they are written to at a lower offset right after a higher offset.  The fix for this is now in 3.8-rc7 and the commit is here: https://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=d468abec6b9fd7132d012d33573ecb8056c7c43f
>>
>> On Feb 11, 2013, at 6:06 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>
>>> Hey folks,
>>>
>>> Noticed this today and it has me stumped.
>>>
>>> I have a 10GB raw VM disk image that I've placed inside of an
>>> ext4-formatted RBD.  When I do this, it gets corrupted in weird ways.
>>> I was prepared to show fsck results to show this, but then I found an
>>> easier way was just by looking at the sha1sum for the file.  Here's
>>> what I see.
>>>
>>> disk image sitting on regular (non-RBD) ext4 filesystem:
>>> # sha1sum disk.img
>>> cfd37c33b9de926644f7b13e604374348662bc60  disk.img
>>>
>>> same disk image sitting in RBD #1
>>> # cp -p disk.img /mnt/rbd1
>>> # sha1sum /mnt/rbd1/disk.img
>>> cfd37c33b9de926644f7b13e604374348662bc60  disk.img
>>>
>>> Great, they match.  But then comes the problematic RBD:
>>> # cp -p disk.img /mnt/rbd2
>>> # sha1sum /mnt/rbd2/disk.img
>>> a28d0735c0f0863a3f84151122da75a56bf5022b  disk.img
>>>
>>> They don't match.  I can also confirm that fsck'ing the filesystem
>>> contained in disk.img reveals numerous errors in the latter case,
>>> while the system is clean in the first two.
>>>
>>> I'm running 0.48.2argonaut on this particular cluster.
>>> RBDs were mapped with the kernel client.  Kernel is 3.2.0-29-generic,
>>> running in Ubuntu 12.04.1.
>>>
>>> The only weird thing I've observed is that while the copy was going to
>>> RBD #2, I saw this in ceph -w:
>>> 2013-02-11 22:18:14.134683 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034857 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.135159 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034858 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.136699 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034859 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.139479 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034860 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.139588 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034861 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.139667 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034862 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.139748 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034863 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>> 2013-02-11 22:18:14.139827 osd.2 [WRN] client.7830
>>> 10.40.30.0:0/1548040543 misdirected client.7830.1:48034864 4.127 to
>>> osd.2 not [4,2] in e2459/2459
>>>
>>> I hadn't seen this one before.
>>>
>>> Full disclosure:
>>>
>>> I had a ceph node failure last week (a week ago today) where all three
>>> OSD processes on one of my nodes got killed by OOM.  I haven't had a
>>> chance to go back and look for errors, gather logs,  or ask the list
>>> for any advice on what went wrong.  Restarting my OSDs brought
>>> everything back inline -- the cluster handled the failed OSDs just
>>> fine, with one exception.  One of my RBDs went
>>> read-only/write-protected.  Even after the cluster was back to
>>> HEALTH_OK, it remained read-only.  I had to unmount, unmap, map, mount
>>> my RBD to get it back.  It just so happens that that RBD is the one
>>> giving me problems now.  So they could be related.  =)
>>>
>>> It's a small cluster:
>>>
>>> # ceph -s
>>>   health HEALTH_OK
>>>   monmap e1: 3 mons at
>>> {a=10.40.30.0:6789/0,b=10.40.30.1:6789/0,c=10.40.30.2:6789/0},
>>> election epoch 4, quorum 0,1,2 a,b,c
>>>   osdmap e2459: 9 osds: 9 up, 9 in
>>>    pgmap v9525714: 2880 pgs: 2880 active+clean; 2841 GB data, 5649 GB
>>> used, 11109 GB / 16758 GB avail
>>>   mdsmap e1: 0/0/1 up
>>>
>>> # ceph osd tree
>>> dumped osdmap tree epoch 2459
>>> # id  weight  type name       up/down reweight
>>> -1    18      pool default
>>> -3    18              rack unknownrack
>>> -2    6                       host ceph0
>>> 0     2                               osd.0   up      1
>>> 1     2                               osd.1   up      1
>>> 2     2                               osd.2   up      1
>>> -4    6                       host ceph1
>>> 3     2                               osd.3   up      1
>>> 4     2                               osd.4   up      1
>>> 5     2                               osd.5   up      1
>>> -5    6                       host ceph2
>>> 6     2                               osd.6   up      1
>>> 7     2                               osd.7   up      1
>>> 8     2                               osd.8   up      1
>>>
>>> But yeah, I'm just stumped about why files going into that particular
>>> RBD get corrupted.  I tried a smaller file (~140MB) and it was fine.
>>> I haven't gotten to do enough testing to find the threshold for
>>> corruption.  Or if it only happens for specific file types.  I did a
>>> similar test with qcow2 images (10G virtual, 4.4GB actual), and the
>>> fsck results were the same -- immediate corruption inside that RBD.  I
>>> did not capture the sha1sum for those files though.  I expect they
>>> would differ.  =)
>>>
>>> Thanks,
>>>
>>> - Travis
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


