Re: Consistency problems when taking RBD snapshot

On 09/13/2016 01:33 PM, Ilya Dryomov wrote:
> On Tue, Sep 13, 2016 at 12:08 PM, Nikolay Borisov <kernel@xxxxxxxx> wrote:
>> Hello list,
>>
>>
>> I have the following cluster:
>>
>> ceph status
>>     cluster a2fba9c1-4ca2-46d8-8717-a8e42db14bb0
>>      health HEALTH_OK
>>      monmap e2: 5 mons at {alxc10=xxxxx:6789/0,alxc11=xxxxx:6789/0,alxc5=xxxxx:6789/0,alxc6=xxxx:6789/0,alxc7=xxxxx:6789/0}
>>             election epoch 196, quorum 0,1,2,3,4 alxc10,alxc5,alxc6,alxc7,alxc11
>>      mdsmap e797: 1/1/1 up {0=alxc11.xxxx=up:active}, 2 up:standby
>>      osdmap e11243: 50 osds: 50 up, 50 in
>>       pgmap v3563774: 8192 pgs, 3 pools, 1954 GB data, 972 kobjects
>>             4323 GB used, 85071 GB / 89424 GB avail
>>                 8192 active+clean
>>   client io 168 MB/s rd, 11629 kB/s wr, 3447 op/s
>>
>> It's running ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432) and kernel 4.4.14
>>
>> I have multiple rbd devices which are used as the root filesystems for lxc-based containers, with ext4 on top. At some point I want
>> to create an rbd snapshot; the sequence of operations I perform is:
>>
>> 1. freezefs -f /path/to/where/ext4-ontop-of-rbd-is-mounted
> 
> fsfreeze?

Yes, indeed, my bad. 
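With the correct name, the sequence I run is roughly the following (the pool, image and snapshot names below are placeholders for the real ones):

  fsfreeze -f /path/to/where/ext4-ontop-of-rbd-is-mounted
  rbd snap create "${CEPH_POOL_NAME}/${IMAGE_NAME}@${SNAP_NAME}"
  fsfreeze -u /path/to/where/ext4-ontop-of-rbd-is-mounted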

> 
>>
>> 2. rbd snap create "${CEPH_POOL_NAME}/${name-of-blockdev}@${name-of-snapshot}"
>>
>> 3. freezefs -u /path/to/where/ext4-ontop-of-rbd-is-mounted
>>
>> <= At this point normal container operation continues =>
>>
>> 4. Mount the newly created snapshot read-only at a second location and rsync the files from it to a remote server.
>>
>> However, as I start rsyncing to the remote server, certain files in the snapshot are reported as corrupted.
> 
> Can you share some dmesg snippets?  Is there a pattern - the same
> file/set of files, etc?

[1718059.910038] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
[1718060.044540] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #52269: comm rsync: deleted inode referenced: 46393
[1718060.044978] EXT4-fs (rbd143): previous I/O error to superblock detected
[1718060.045246] rbd: rbd143: write 1000 at 0 result -30
[1718060.045249] blk_update_request: I/O error, dev rbd143, sector 0
[1718060.045487] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
[1718071.404057] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #385038: comm rsync: deleted inode referenced: 46581
[1718071.404466] EXT4-fs (rbd143): previous I/O error to superblock detected
[1718071.404739] rbd: rbd143: write 1000 at 0 result -30
[1718071.404742] blk_update_request: I/O error, dev rbd143, sector 0
[1718071.404999] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
[1718071.419172] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #769039: comm rsync: deleted inode referenced: 410848
[1718071.419575] EXT4-fs (rbd143): previous I/O error to superblock detected
[1718071.419844] rbd: rbd143: write 1000 at 0 result -30
[1718071.419847] blk_update_request: I/O error, dev rbd143, sector 0
[1718071.420081] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
[1718071.420758] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #769039: comm rsync: deleted inode referenced: 410848
[1718071.421196] EXT4-fs (rbd143): previous I/O error to superblock detected
[1718071.421441] rbd: rbd143: write 1000 at 0 result -30
[1718071.421443] blk_update_request: I/O error, dev rbd143, sector 0
[1718071.421671] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
[1718071.543020] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #52269: comm rsync: deleted inode referenced: 46393
[1718071.543422] EXT4-fs (rbd143): previous I/O error to superblock detected
[1718071.543680] rbd: rbd143: write 1000 at 0 result -30
[1718071.543682] blk_update_request: I/O error, dev rbd143, sector 0
[1718071.543945] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
[1718083.388635] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #385038: comm rsync: deleted inode referenced: 46581
[1718083.389060] EXT4-fs (rbd143): previous I/O error to superblock detected
[1718083.389324] rbd: rbd143: write 1000 at 0 result -30
[1718083.389327] blk_update_request: I/O error, dev rbd143, sector 0
[1718083.389561] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
[1718083.403910] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #769039: comm rsync: deleted inode referenced: 410848
[1718083.404319] EXT4-fs (rbd143): previous I/O error to superblock detected
[1718083.404581] rbd: rbd143: write 1000 at 0 result -30
[1718083.404583] blk_update_request: I/O error, dev rbd143, sector 0
[1718083.404816] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
[1718083.405484] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #769039: comm rsync: deleted inode referenced: 410848
[1718083.405893] EXT4-fs (rbd143): previous I/O error to superblock detected
[1718083.406140] rbd: rbd143: write 1000 at 0 result -30
[1718083.406142] blk_update_request: I/O error, dev rbd143, sector 0
[1718083.406373] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
[1718083.534736] EXT4-fs error (device rbd143): ext4_lookup:1584: inode #52269: comm rsync: deleted inode referenced: 46393
[1718083.535184] EXT4-fs (rbd143): previous I/O error to superblock detected
[1718083.535449] rbd: rbd143: write 1000 at 0 result -30
[1718083.535452] blk_update_request: I/O error, dev rbd143, sector 0
[1718083.535684] Buffer I/O error on dev rbd143, logical block 0, lost sync page write
[1718615.793617] rbd: image c12867: WARNING: kernel layering is EXPERIMENTAL!
[1718615.806239] rbd: rbd143: added with size 0xc80000000
[1718615.860688] EXT4-fs (rbd143): write access unavailable, skipping orphan cleanup
[1718615.861105] EXT4-fs (rbd143): mounted filesystem without journal. Opts: noload
[1718617.810076] rbd: rbd144: added with size 0xa00000000
[1718617.862650] EXT4-fs (rbd144): write access unavailable, skipping orphan cleanup
[1718617.863044] EXT4-fs (rbd144): mounted filesystem without journal. Opts: noload 


Some of the files which exhibit this:
rsync: readlink_stat("/var/snapshots/c11579-backup-1473764092/var/cpanel/configs.cache/_etc_sysconfig_named___default") failed: Structure needs cleaning (117)
IO error encountered -- skipping file deletion
rsync: readlink_stat("/var/snapshots/c11579-backup-1473764092/var/run/queueprocd.pid") failed: Structure needs cleaning (117)
rsync: readlink_stat("/var/snapshots/c11579-backup-1473764092/var/run/cphulkd_processor.pid") failed: Structure needs cleaning (117)
rsync: readlink_stat("/var/snapshots/c11579-backup-1473764092/var/run/cpdavd.pid") failed: Structure needs cleaning (117) 

The files are different every time. 
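For reference, the snapshot itself is mapped and mounted read-only roughly like this (the /dev/rbdX device is whatever rbd map prints, and the rsync destination below is just a placeholder):

  rbd map --read-only "${CEPH_POOL_NAME}/${IMAGE_NAME}@${SNAP_NAME}"
  mount -o ro,noload /dev/rbd143 /var/snapshots/c11579-backup-1473764092
  rsync -a /var/snapshots/c11579-backup-1473764092/ backup-host:/backups/c11579/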


> 
>>
>> freezefs implies a filesystem sync, but I also tested manually running sync/syncfs on the filesystem being snapshotted, both before
>> and after the freeze, and the corruption is still present. So it's unlikely there are dirty buffers in the page cache.
>> I'm using the kernel rbd driver for the clients. The current theory is that there are some caches, other than the Linux page cache,
>> which are not being flushed. Reading the docs suggests that only librbd uses separate caching, but I'm not using librbd.
> 
> What happens if you run fsck -n on the snapshot (ro mapping)?

fsck -n -f run on the RO snapshot: 

http://paste.ubuntu.com/23173304/
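The invocation was roughly the following, against the read-only mapping (the device name is whatever rbd map printed):

  rbd map --read-only "${CEPH_POOL_NAME}/${IMAGE_NAME}@${SNAP_NAME}"
  fsck.ext4 -n -f /dev/rbd143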

> 
> What happens if you run clone from the snapshot and run fsck (rw
> mapping)?

fsck -f -n run on the RW clone from the aforementioned snapshot (it has considerably fewer errors):
http://paste.ubuntu.com/23173306/
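For reference, the clone was created and checked roughly like this (the clone name and the /dev/rbdX device are placeholders; the snapshot has to be protected before cloning):

  rbd snap protect "${CEPH_POOL_NAME}/${IMAGE_NAME}@${SNAP_NAME}"
  rbd clone "${CEPH_POOL_NAME}/${IMAGE_NAME}@${SNAP_NAME}" "${CEPH_POOL_NAME}/${IMAGE_NAME}-clone"
  rbd map "${CEPH_POOL_NAME}/${IMAGE_NAME}-clone"
  fsck.ext4 -f -n /dev/rbd144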

> 
> What happens if you mount the clone without running fsck and run rsync?

My colleagues told me that running rsync from the clone, without running fsck first, doesn't
cause rsync to error out. This means the initial snapshot somehow seems broken,
but a clone of it isn't. That's very odd.


> 
> Can you try taking more than one snapshot and then compare them?

What do you mean? Checksumming the content or something else?
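If you mean comparing the raw block contents, I could map two snapshots read-only and compare or checksum the devices, something like this (the device names are whatever rbd map assigns):

  rbd map --read-only "${CEPH_POOL_NAME}/${IMAGE_NAME}@${SNAP_NAME_1}"
  rbd map --read-only "${CEPH_POOL_NAME}/${IMAGE_NAME}@${SNAP_NAME_2}"
  cmp /dev/rbd143 /dev/rbd144    # or md5sum both devices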


> 
> Thanks,
> 
>                 Ilya
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


