Sébastien,

Thanks so much for trying!  I really think the problem is probably
related to the error/crash I had last week.  A couple of other tidbits:

The two RBDs are actually in different pools.  Maybe that's one reason
they behave differently.

The warning messages I saw:

2013-02-11 22:18:14.134683 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034857 4.127 to osd.2 not [4,2] in e2459/2459

Those are from osd.2.  When I lost a node last week, it was "ceph1",
which has OSDs 3, 4, and 5.  So it kind of all adds up that maybe
something hasn't fully recovered from that.  The message seems to be
saying "I'm osd.2, and I got a request that should have been sent to
osd.4".  Since osd.4 was one of the OSDs that crashed, that seems
suspicious...

Does anyone know how to diagnose the "misdirected client" warnings?
They make me nervous. =)  Perhaps restart osd.4 in this case?

 - Travis

On Tue, Feb 12, 2013 at 4:28 AM, Sébastien Han <han.sebastien@xxxxxxxxx> wrote:
> Hi,
>
> I can't reproduce it...
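One way to sanity-check the misdirected warnings above is to pull the PG id and the expected acting set out of a warning line and compare them against what the cluster currently reports. A minimal sketch, not from the thread: the `sed` parsing is an assumption about the warning format shown above, and the follow-up `ceph` commands (which do exist in argonaut-era releases) are shown only as comments.

```shell
# Warning line copied verbatim from the log above.
warn='2013-02-11 22:18:14.134683 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034857 4.127 to osd.2 not [4,2] in e2459/2459'

# The PG id is the token after the request id; the acting set is the
# bracketed list after "not".
pg=$(printf '%s\n' "$warn" | sed -n 's/.*misdirected [^ ]* \([0-9a-f.]*\) to .*/\1/p')
acting=$(printf '%s\n' "$warn" | sed -n 's/.*not \(\[[0-9,]*]\).*/\1/p')
echo "pg=$pg acting=$acting"    # pg=4.127 acting=[4,2]

# Then, against the cluster:
#   ceph pg map 4.127    # where the cluster maps that PG right now
#   ceph osd dump        # confirm clients and OSDs agree on the map epoch
```

If `ceph pg map` shows the same acting set the warning names, the client's view of the osdmap is stale, which is consistent with restarting the OSD (or remapping the RBD) clearing it.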
>
> root@misc01 ~ # rbd -p cor create --size 5000 seb1
> root@misc01 ~ # rbd -p cor create --size 5000 seb2
> root@misc01 ~ # rbd -p cor map seb1
> root@misc01 ~ # rbd -p cor map seb2
>
> root@misc01 ~ # rbd showmapped
> id  pool  image  snap  device
> 0   cor   seb1   -     /dev/rbd0
> 1   cor   seb2   -     /dev/rbd1
>
> root@misc01 ~ # mkfs.ext4 /dev/rbd0
> mke2fs 1.42 (29-Nov-2011)
> Filesystem label=
> OS type: Linux
> Block size=4096 (log=2)
> Fragment size=4096 (log=2)
> Stride=1024 blocks, Stripe width=1024 blocks
> 320000 inodes, 1280000 blocks
> 64000 blocks (5.00%) reserved for the super user
> First data block=0
> Maximum filesystem blocks=1312817152
> 40 block groups
> 32768 blocks per group, 32768 fragments per group
> 8000 inodes per group
> Superblock backups stored on blocks:
>         32768, 98304, 163840, 229376, 294912, 819200, 884736
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (32768 blocks): done
> Writing superblocks and filesystem accounting information: done
>
> root@misc01 ~ # mkfs.ext4 /dev/rbd1
> mke2fs 1.42 (29-Nov-2011)
> Filesystem label=
> OS type: Linux
> Block size=4096 (log=2)
> Fragment size=4096 (log=2)
> Stride=1024 blocks, Stripe width=1024 blocks
> 320000 inodes, 1280000 blocks
> 64000 blocks (5.00%) reserved for the super user
> First data block=0
> Maximum filesystem blocks=1312817152
> 40 block groups
> 32768 blocks per group, 32768 fragments per group
> 8000 inodes per group
> Superblock backups stored on blocks:
>         32768, 98304, 163840, 229376, 294912, 819200, 884736
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (32768 blocks): done
> Writing superblocks and filesystem accounting information: done
>
> root@misc01 ~ # mkdir -p /mnt/rbd/1
> root@misc01 ~ # mkdir -p /mnt/rbd/2
> root@misc01 ~ # mount /dev/rbd0 /mnt/rbd/0
> mount: mount point /mnt/rbd/0 does not exist
> root@misc01 ~ # mount /dev/rbd0 /mnt/rbd/1
> root@misc01 ~ # mount /dev/rbd1 /mnt/rbd/2
>
> root@misc01 ~ # df -h
> Filesystem                            Size  Used Avail Use% Mounted on
> /dev/mapper/server--management-root    19G  6.5G   12G  37% /
> udev                                  5.9G  4.0K  5.9G   1% /dev
> tmpfs                                 2.4G  368K  2.4G   1% /run
> none                                  5.0M  4.0K  5.0M   1% /run/lock
> none                                  5.9G   17M  5.9G   1% /run/shm
> /dev/sdb1                             228M   27M  189M  13% /boot
> cgroup                                5.9G     0  5.9G   0% /sys/fs/cgroup
> /dev/mapper/server--management-ceph    50G  885M   47G   2% /srv/ceph/mds0
> /dev/mapper/server--management-lxc     50G  5.3G   43G  12% /var/lib/lxc
> /dev/rbd0                             4.9G  202M  4.5G   5% /mnt/rbd/1
> /dev/rbd1                             4.9G  202M  4.5G   5% /mnt/rbd/2
>
> root@misc01 ~ # wget http://cloud-images.ubuntu.com/precise/current/precise-server-cloudimg-i386-disk1.img
> --2013-02-12 09:32:34--  http://cloud-images.ubuntu.com/precise/current/precise-server-cloudimg-i386-disk1.img
> Resolving cloud-images.ubuntu.com (cloud-images.ubuntu.com)... 91.189.88.141
> Connecting to cloud-images.ubuntu.com (cloud-images.ubuntu.com)|91.189.88.141|:80... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 229638144 (219M) [application/octet-stream]
> Saving to: `precise-server-cloudimg-i386-disk1.img'
>
> 100%[====================>] 229,638,144  698K/s   in 5m 23s
>
> 2013-02-12 09:37:56 (695 KB/s) - `precise-server-cloudimg-i386-disk1.img' saved [229638144/229638144]
>
> root@misc01 ~ # sha1sum precise-server-cloudimg-i386-disk1.img
> 25cde0523e060e2bce68f9f3ebaed52b38e98417  precise-server-cloudimg-i386-disk1.img
>
> root@misc01 ~ # cp precise-server-cloudimg-i386-disk1.img /mnt/rbd/1/
> root@misc01 ~ # sha1sum /mnt/rbd/1/precise-server-cloudimg-i386-disk1.img
> 25cde0523e060e2bce68f9f3ebaed52b38e98417  /mnt/rbd/1/precise-server-cloudimg-i386-disk1.img
>
> root@misc01 ~ # cp /mnt/rbd/1/precise-server-cloudimg-i386-disk1.img /mnt/rbd/2/
> root@misc01 ~ # sync
> root@misc01 ~ # sha1sum /mnt/rbd/2/precise-server-cloudimg-i386-disk1.img
> 25cde0523e060e2bce68f9f3ebaed52b38e98417  /mnt/rbd/2/precise-server-cloudimg-i386-disk1.img
>
> root@misc01 ~ # ceph -v
> ceph version 0.48.3argonaut (commit:920f82e805efec2cae05b79c155c07df0f3ed5dd)
>
> root@misc01 ~ # uname -r
> 3.2.0-23-generic
>
> root@misc01 ~ # du -h precise-server-cloudimg-i386-disk1.img
> 219M    precise-server-cloudimg-i386-disk1.img
>
> Tried the same with:
>
> root@misc01 ~ # du -h disk
> 4.5G    disk
>
> root@misc01 ~ # sha1sum disk
> a1986abe9d779b296913e8d4f3bea8e5df992419  disk
>
> root@misc01 ~ # cp disk /mnt/rbd/1/
> root@misc01 ~ # sha1sum /mnt/rbd/1/disk
> a1986abe9d779b296913e8d4f3bea8e5df992419  /mnt/rbd/1/disk
>
> root@misc01 ~ # cp /mnt/rbd/1/disk /mnt/rbd/2/disk
> root@misc01 ~ # sha1sum /mnt/rbd/2/disk
> a1986abe9d779b296913e8d4f3bea8e5df992419  /mnt/rbd/2/disk
>
> I know 0.48.3 fixed a critical bug, but that one prevents data loss or
> corruption after a power loss or kernel panic event, so it seems a bit
> different.
> --
> Regards,
> Sébastien Han.
>
> On Tue, Feb 12, 2013 at 12:15 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>> All the OSDs are backed by xfs.  Each RBD is formatted with ext4.
>>
>> Thanks for the response.
>>
>> On Mon, Feb 11, 2013 at 6:12 PM, Mike Lowe <j.michael.lowe@xxxxxxxxx> wrote:
>>> Are your RBDs backed by btrfs?  I struggled for a very long time with
>>> corruption of RBD images until Sage and Samuel helped find a btrfs bug
>>> that can truncate sparse files if they are written to at a lower offset
>>> right after a higher offset.  The fix for this is now in 3.8rc7, and the
>>> commit is here:
>>> https://git.kernel.org/?p=linux/kernel/git/josef/btrfs-next.git;a=commit;h=d468abec6b9fd7132d012d33573ecb8056c7c43f
>>>
>>> On Feb 11, 2013, at 6:06 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> Noticed this today and it has me stumped.
>>>>
>>>> I have a 10GB raw VM disk image that I've placed inside of an
>>>> ext4-formatted RBD.
>>>> When I do this, it gets corrupted in weird ways.
>>>> I was prepared to show fsck results to demonstrate this, but then I
>>>> found an easier way was just to look at the sha1sum of the file.
>>>> Here's what I see.
>>>>
>>>> Disk image sitting on a regular (non-RBD) ext4 filesystem:
>>>> # sha1sum disk.img
>>>> cfd37c33b9de926644f7b13e604374348662bc60  disk.img
>>>>
>>>> Same disk image sitting in RBD #1:
>>>> # cp -p disk.img /mnt/rbd1
>>>> # sha1sum /mnt/rbd1/disk.img
>>>> cfd37c33b9de926644f7b13e604374348662bc60  disk.img
>>>>
>>>> Great, they match.  But then comes the problematic RBD:
>>>> # cp -p disk.img /mnt/rbd2
>>>> # sha1sum /mnt/rbd2/disk.img
>>>> a28d0735c0f0863a3f84151122da75a56bf5022b  disk.img
>>>>
>>>> They don't match.  I can also confirm that fsck'ing the filesystem
>>>> contained in disk.img reveals numerous errors in the latter case,
>>>> while the filesystem is clean in the first two.
>>>>
>>>> I'm running 0.48.2argonaut on this particular cluster.
>>>> The RBDs were mapped with the kernel client.  The kernel is
>>>> 3.2.0-29-generic, running on Ubuntu 12.04.1.
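The copy-and-checksum test above can be wrapped in a small helper for repeated runs. A minimal sketch, not from the thread: the function name and the paths in the usage line are hypothetical; `sync` flushes dirty pages as in the thread, and the commented `drop_caches` write is an optional way to force the re-read to come from the device instead of the page cache.

```shell
# usage: checkcopy /path/to/disk.img /mnt/rbd2   (example paths, hypothetical)
checkcopy() {
    src=$1
    dst=$2/$(basename "$src")
    cp "$src" "$dst"
    sync    # flush the copy out to the block device
    # Optionally force a re-read from the device rather than the page cache:
    #   echo 3 > /proc/sys/vm/drop_caches
    a=$(sha1sum "$src" | awk '{print $1}')
    b=$(sha1sum "$dst" | awk '{print $1}')
    if [ "$a" = "$b" ]; then
        echo "OK $dst"
    else
        echo "MISMATCH $dst ($a vs $b)"
        return 1
    fi
}
```

Running it over a range of file sizes would narrow down the corruption threshold mentioned later in the thread.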
>>>>
>>>> The only weird thing I've observed is that while the copy was going to
>>>> RBD #2, I saw this in ceph -w:
>>>>
>>>> 2013-02-11 22:18:14.134683 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034857 4.127 to osd.2 not [4,2] in e2459/2459
>>>> 2013-02-11 22:18:14.135159 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034858 4.127 to osd.2 not [4,2] in e2459/2459
>>>> 2013-02-11 22:18:14.136699 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034859 4.127 to osd.2 not [4,2] in e2459/2459
>>>> 2013-02-11 22:18:14.139479 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034860 4.127 to osd.2 not [4,2] in e2459/2459
>>>> 2013-02-11 22:18:14.139588 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034861 4.127 to osd.2 not [4,2] in e2459/2459
>>>> 2013-02-11 22:18:14.139667 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034862 4.127 to osd.2 not [4,2] in e2459/2459
>>>> 2013-02-11 22:18:14.139748 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034863 4.127 to osd.2 not [4,2] in e2459/2459
>>>> 2013-02-11 22:18:14.139827 osd.2 [WRN] client.7830 10.40.30.0:0/1548040543 misdirected client.7830.1:48034864 4.127 to osd.2 not [4,2] in e2459/2459
>>>>
>>>> I hadn't seen this one before.
>>>>
>>>> Full disclosure:
>>>>
>>>> I had a ceph node failure last week (a week ago today) where all three
>>>> OSD processes on one of my nodes got killed by the OOM killer.  I
>>>> haven't had a chance to go back and look for errors, gather logs, or
>>>> ask the list for any advice on what went wrong.  Restarting the OSDs
>>>> brought everything back in line -- the cluster handled the failed OSDs
>>>> just fine, with one exception: one of my RBDs went
>>>> read-only/write-protected.  Even after the cluster was back to
>>>> HEALTH_OK, it remained read-only.  I had to unmount, unmap, map, and
>>>> mount the RBD to get it back.  It just so happens that that RBD is the
>>>> one giving me problems now, so they could be related. =)
>>>>
>>>> It's a small cluster:
>>>>
>>>> # ceph -s
>>>>    health HEALTH_OK
>>>>    monmap e1: 3 mons at {a=10.40.30.0:6789/0,b=10.40.30.1:6789/0,c=10.40.30.2:6789/0}, election epoch 4, quorum 0,1,2 a,b,c
>>>>    osdmap e2459: 9 osds: 9 up, 9 in
>>>>    pgmap v9525714: 2880 pgs: 2880 active+clean; 2841 GB data, 5649 GB used, 11109 GB / 16758 GB avail
>>>>    mdsmap e1: 0/0/1 up
>>>>
>>>> # ceph osd tree
>>>> dumped osdmap tree epoch 2459
>>>> # id  weight  type name           up/down  reweight
>>>> -1    18      pool default
>>>> -3    18        rack unknownrack
>>>> -2    6           host ceph0
>>>> 0     2              osd.0        up       1
>>>> 1     2              osd.1        up       1
>>>> 2     2              osd.2        up       1
>>>> -4    6           host ceph1
>>>> 3     2              osd.3        up       1
>>>> 4     2              osd.4        up       1
>>>> 5     2              osd.5        up       1
>>>> -5    6           host ceph2
>>>> 6     2              osd.6        up       1
>>>> 7     2              osd.7        up       1
>>>> 8     2              osd.8        up       1
>>>>
>>>> But yeah, I'm just stumped about why files going into that particular
>>>> RBD get corrupted.  I tried a smaller file (~140MB) and it was fine.
>>>> I haven't done enough testing yet to find the threshold for
>>>> corruption, or whether it only happens for specific file types.  I did
>>>> a similar test with qcow2 images (10G virtual, 4.4GB actual), and the
>>>> fsck results were the same -- immediate corruption inside that RBD.  I
>>>> did not capture the sha1sums for those files, though.  I expect they
>>>> would differ.
>>>> =)
>>>>
>>>> Thanks,
>>>>
>>>> - Travis
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com