On Tue, Sep 13, 2016 at 1:59 PM, Nikolay Borisov <kernel@xxxxxxxx> wrote:
>
>
> On 09/13/2016 01:33 PM, Ilya Dryomov wrote:
>> On Tue, Sep 13, 2016 at 12:08 PM, Nikolay Borisov <kernel@xxxxxxxx> wrote:
>>> Hello list,
>>>
>>>
>>> I have the following cluster:
>>>
>>> ceph status
>>>     cluster a2fba9c1-4ca2-46d8-8717-a8e42db14bb0
>>>      health HEALTH_OK
>>>      monmap e2: 5 mons at {alxc10=xxxxx:6789/0,alxc11=xxxxx:6789/0,alxc5=xxxxx:6789/0,alxc6=xxxx:6789/0,alxc7=xxxxx:6789/0}
>>>             election epoch 196, quorum 0,1,2,3,4 alxc10,alxc5,alxc6,alxc7,alxc11
>>>      mdsmap e797: 1/1/1 up {0=alxc11.xxxx=up:active}, 2 up:standby
>>>      osdmap e11243: 50 osds: 50 up, 50 in
>>>       pgmap v3563774: 8192 pgs, 3 pools, 1954 GB data, 972 kobjects
>>>             4323 GB used, 85071 GB / 89424 GB avail
>>>                 8192 active+clean
>>>   client io 168 MB/s rd, 11629 kB/s wr, 3447 op/s
>>>
>>> It's running ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432) and kernel 4.4.14.
>>>
>>> I have multiple rbd devices which are used as the root filesystems for
>>> lxc-based containers, with ext4 on top. At some point I want to create
>>> an rbd snapshot; the sequence of operations I use is:
>>>
>>> 1. freezefs -f /path/to/where/ext4-ontop-of-rbd-is-mounted
>>
>> fsfreeze?
>
> Yes, indeed, my bad.
>
>>
>>>
>>> 2. rbd snap create "${CEPH_POOL_NAME}/${name-of-blockdev}@${name-of-snapshot}"
>>>
>>> 3. freezefs -u /path/to/where/ext4-ontop-of-rbd-is-mounted
>>>
>>> <= At this point normal container operation continues =>
>>>
>>> 4. Mount the newly created snapshot at a 2nd location as read-only and
>>> rsync the files from it to a remote server.
>>>
>>> However, as I start rsyncing to the remote server, certain files in the
>>> snapshot are reported as corrupted.
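Just so we're on the same page, here is the whole cycle condensed into
commands, as a sketch (assuming pool rbd, image foo, the live filesystem
mounted at /mnt and the backup mount at /backup; adjust names to taste):

# fsfreeze -f /mnt
# rbd snap create rbd/foo@snap
# fsfreeze -u /mnt
# rbd map rbd/foo@snap
/dev/rbd0
# mount -o ro /dev/rbd0 /backup
# rsync -a /backup/ remote:/backup/
# umount /backup
# rbd unmap /dev/rbd0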
>> Can you share some dmesg snippets?  Is there a pattern - the same
>> file/set of files, etc?
>
> [1718059.910038] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
> [1718060.044540] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #52269: comm rsync: deleted inode referenced: 46393
> [1718060.044978] EXT4-fs (rbd143): previous I/O error to superblock detected
> [1718060.045246] rbd: rbd143: write 1000 at 0 result -30
> [1718060.045249] blk_update_request: I/O error, dev rbd143, sector 0
> [1718060.045487] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
> [1718071.404057] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #385038: comm rsync: deleted inode referenced: 46581
> [1718071.404466] EXT4-fs (rbd143): previous I/O error to superblock detected
> [1718071.404739] rbd: rbd143: write 1000 at 0 result -30
> [1718071.404742] blk_update_request: I/O error, dev rbd143, sector 0
> [1718071.404999] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
> [1718071.419172] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #769039: comm rsync: deleted inode referenced: 410848
> [1718071.419575] EXT4-fs (rbd143): previous I/O error to superblock detected
> [1718071.419844] rbd: rbd143: write 1000 at 0 result -30
> [1718071.419847] blk_update_request: I/O error, dev rbd143, sector 0
> [1718071.420081] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
> [1718071.420758] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #769039: comm rsync: deleted inode referenced: 410848
> [1718071.421196] EXT4-fs (rbd143): previous I/O error to superblock detected
> [1718071.421441] rbd: rbd143: write 1000 at 0 result -30
> [1718071.421443] blk_update_request: I/O error, dev rbd143, sector 0
> [1718071.421671] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
> [1718071.543020] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #52269: comm rsync: deleted inode referenced: 46393
> [1718071.543422] EXT4-fs (rbd143): previous I/O error to superblock detected
> [1718071.543680] rbd: rbd143: write 1000 at 0 result -30
> [1718071.543682] blk_update_request: I/O error, dev rbd143, sector 0
> [1718071.543945] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
> [1718083.388635] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #385038: comm rsync: deleted inode referenced: 46581
> [1718083.389060] EXT4-fs (rbd143): previous I/O error to superblock detected
> [1718083.389324] rbd: rbd143: write 1000 at 0 result -30
> [1718083.389327] blk_update_request: I/O error, dev rbd143, sector 0
> [1718083.389561] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
> [1718083.403910] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #769039: comm rsync: deleted inode referenced: 410848
> [1718083.404319] EXT4-fs (rbd143): previous I/O error to superblock detected
> [1718083.404581] rbd: rbd143: write 1000 at 0 result -30
> [1718083.404583] blk_update_request: I/O error, dev rbd143, sector 0
> [1718083.404816] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
> [1718083.405484] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #769039: comm rsync: deleted inode referenced: 410848
> [1718083.405893] EXT4-fs (rbd143): previous I/O error to superblock detected
> [1718083.406140] rbd: rbd143: write 1000 at 0 result -30
> [1718083.406142] blk_update_request: I/O error, dev rbd143, sector 0
> [1718083.406373] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
> [1718083.534736] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #52269: comm rsync: deleted inode referenced: 46393
> [1718083.535184] EXT4-fs (rbd143): previous I/O error to superblock detected
> [1718083.535449] rbd: rbd143: write 1000 at 0 result -30
> [1718083.535452] blk_update_request: I/O error, dev rbd143, sector 0
> [1718083.535684] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
> [1718615.793617] rbd: image c12867: WARNING: kernel layering is EXPERIMENTAL!
> [1718615.806239] rbd: rbd143: added with size 0xc80000000
> [1718615.860688] EXT4-fs (rbd143): write access unavailable, skipping orphan cleanup
> [1718615.861105] EXT4-fs (rbd143): mounted filesystem without journal. Opts: noload
> [1718617.810076] rbd: rbd144: added with size 0xa00000000
> [1718617.862650] EXT4-fs (rbd144): write access unavailable, skipping orphan cleanup
> [1718617.863044] EXT4-fs (rbd144): mounted filesystem without journal. Opts: noload
>
>
> Some of the files which exhibit this:
> rsync: readlink_stat("/var/snapshots/c11579-backup-1473764092/var/cpanel/configs.cache/_etc_sysconfig_named___default") failed: Structure needs cleaning (117)
> IO error encountered -- skipping file deletion
> rsync: readlink_stat("/var/snapshots/c11579-backup-1473764092/var/run/queueprocd.pid") failed: Structure needs cleaning (117)
> rsync: readlink_stat("/var/snapshots/c11579-backup-1473764092/var/run/cphulkd_processor.pid") failed: Structure needs cleaning (117)
> rsync: readlink_stat("/var/snapshots/c11579-backup-1473764092/var/run/cpdavd.pid") failed: Structure needs cleaning (117)
>
> The files are different every time.
>
>
>>
>>>
>>> freezefs implies filesystem syncing. I also tested with manually doing
>>> sync/syncfs on the filesystem being snapshotted, before and after the
>>> freezefs, and the corruption is still present. So it's unlikely there
>>> are dirty buffers in the page cache. I'm using the kernel rbd driver
>>> for the clients. The current theory is that there are caches, other
>>> than the Linux page cache, which are not being flushed. Reading the
>>> docs implies that only librbd uses separate caching, but I'm not using
>>> librbd.
>>
>> What happens if you run fsck -n on the snapshot (ro mapping)?
>
> fsck -n -f run on the RO snapshot:
>
> http://paste.ubuntu.com/23173304/
>
>>
>> What happens if you create a clone from the snapshot and run fsck (rw
>> mapping)?
>
> fsck -f -n run on the RW clone of the aforementioned snapshot (it has
> considerably fewer errors):
> http://paste.ubuntu.com/23173306/
>
>>
>> What happens if you mount the clone without running fsck and run rsync?
>
> My colleagues told me that running rsync from the clone without running
> fsck first doesn't cause rsync to error out. This means that somehow the
> initial snapshot seems broken, but a clone of it isn't.

That's very odd. Hmm, it could be about whether ext4 is able to do
journal replay on mount: when you mount a snapshot, you get a read-only
block device; when you mount a clone image, you get a read-write block
device. Let's try this again, supposing the image is foo and the
snapshot is snap:

# fsfreeze -f /mnt
# rbd snap create foo@snap
# rbd map foo@snap
/dev/rbd0
# file -s /dev/rbd0
# fsck.ext4 -n /dev/rbd0
# mount /dev/rbd0 /foo
# umount /foo
<full dmesg>
# file -s /dev/rbd0
# fsck.ext4 -n /dev/rbd0

# rbd clone foo@snap bar
# rbd map bar
/dev/rbd1
# file -s /dev/rbd1
# fsck.ext4 -n /dev/rbd1
# mount /dev/rbd1 /bar
# umount /bar
<full dmesg>
# file -s /dev/rbd1
# fsck.ext4 -n /dev/rbd1

Could you please provide the output for the above?
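One more thing worth trying in the meantime, since the dmesg above shows
ext4 failing superblock writes to the read-only device: mount the
snapshot with journal loading disabled (your later mounts in the dmesg
already show "Opts: noload"), along these lines:

# mount -o ro,noload /dev/rbd0 /foo

noload makes ext4 skip loading (and thus replaying) the journal, so
nothing tries to write to the read-only mapping. Note that if the
journal does contain committed transactions, skipping replay can itself
yield an inconsistent view, so treat this as a diagnostic aid rather
than a fix.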
>
>
>>
>> Can you try taking more than one snapshot and then compare them?
>
> What do you mean? Checksumming the content or something else?

Yeah, I was thinking md5sum on the /dev/rbd<x> devices for starters, but
let's figure out the snap vs clone question first.

Thanks,

                Ilya
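P.S. For the checksum comparison, something along these lines should do,
as a sketch (assuming the image is foo and the snapshots are snap1 and
snap2):

# rbd map foo@snap1
/dev/rbd0
# rbd map foo@snap2
/dev/rbd1
# md5sum /dev/rbd0 /dev/rbd1
# rbd unmap /dev/rbd0
# rbd unmap /dev/rbd1

Identical sums would mean the two snapshots captured the same on-disk
state; differing sums would point at something changing underneath.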