Re: Consistency problems when taking RBD snapshot

On 09/15/2016 03:15 PM, Ilya Dryomov wrote:
> On Thu, Sep 15, 2016 at 12:54 PM, Nikolay Borisov <kernel@xxxxxxxx> wrote:
>>
>>
>> On 09/15/2016 01:24 PM, Ilya Dryomov wrote:
>>> On Thu, Sep 15, 2016 at 10:22 AM, Nikolay Borisov
>>> <n.borisov@xxxxxxxxxxxxxx> wrote:
>>>>
>>>>
>>>> On 09/15/2016 09:22 AM, Nikolay Borisov wrote:
>>>>>
>>>>>
>>>>> On 09/14/2016 05:53 PM, Ilya Dryomov wrote:
>>>>>> On Wed, Sep 14, 2016 at 3:30 PM, Nikolay Borisov <kernel@xxxxxxxx> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 09/14/2016 02:55 PM, Ilya Dryomov wrote:
>>>>>>>> On Wed, Sep 14, 2016 at 9:01 AM, Nikolay Borisov <kernel@xxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 09/14/2016 09:55 AM, Adrian Saul wrote:
>>>>>>>>>>
>>>>>>>>>> I found I could ignore the XFS issues and just mount it with the appropriate options (below from my backup scripts):
>>>>>>>>>>
>>>>>>>>>>         #
>>>>>>>>>>         # Mount with nouuid (conflicting XFS) and norecovery (ro snapshot)
>>>>>>>>>>         #
>>>>>>>>>>         if ! mount -o ro,nouuid,norecovery  $SNAPDEV /backup${FS}; then
>>>>>>>>>>                 echo "FAILED: Unable to mount snapshot $DATESTAMP of $FS - cleaning up"
>>>>>>>>>>                 rbd unmap $SNAPDEV
>>>>>>>>>>                 rbd snap rm ${RBDPATH}@${DATESTAMP}
>>>>>>>>>>                 exit 3;
>>>>>>>>>>         fi
>>>>>>>>>>         echo "Backup snapshot of $RBDPATH mounted at: /backup${FS}"
>>>>>>>>>>
>>>>>>>>>> Without using clones, it's impossible to mount the snapshot without norecovery.
>>>>>>>>>
>>>>>>>>> But shouldn't freezing the fs and doing a snapshot constitute a "clean
>>>>>>>>> unmount" hence no need to recover on the next mount (of the snapshot) -
>>>>>>>>> Ilya?
>>>>>>>>
>>>>>>>> I *thought* it should (well, except for orphan inodes), but now I'm not
>>>>>>>> sure.  Have you tried reproducing with loop devices yet?
>>>>>>>
>>>>>>> Here is what the checksum tests showed:
>>>>>>>
>>>>>>> fsfreeze -f /mountpoint
>>>>>>> md5sum /dev/rbd0
>>>>>>> f33c926373ad604da674bcbfbe6460c5  /dev/rbd0
>>>>>>> rbd snap create xx@xxx && rbd snap protect xx@xxx
>>>>>>> rbd map xx@xxx
>>>>>>> md5sum /dev/rbd1
>>>>>>> 6f702740281874632c73aeb2c0fcf34a  /dev/rbd1
>>>>>>>
>>>>>>> where rbd1 is a snapshot of the rbd0 device. So the checksum is indeed
>>>>>>> different, worrying.
>>>>>>
>>>>>> Sorry, for the filesystem device you should do
>>>>>>
>>>>>> md5sum <(dd if=/dev/rbd0 iflag=direct bs=8M)
>>>>>>
>>>>>> to get what's actually on disk, so that it's apples to apples.
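For reference, the apples-to-apples comparison suggested above can be sketched in Python. This is a minimal sketch: the device paths in the comment are illustrative, and it uses plain buffered reads rather than true direct I/O (which would require os.O_DIRECT with aligned buffers), so on a live device you should still prefer the dd iflag=direct form:

```python
import hashlib

def device_md5(path, chunk_size=8 * 1024 * 1024):
    """Checksum a device or file in 8 MiB chunks, mirroring
    `md5sum <(dd if=... bs=8M)`. Note: buffered reads, not
    O_DIRECT, so the page cache is not bypassed here."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until EOF, feeding each into the digest.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Illustrative usage against a frozen image and its snapshot:
# same = device_md5("/dev/rbd0") == device_md5("/dev/rbd1")
```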
>>>>>
>>>>> root@alxc13:~# rbd showmapped  |egrep "device|c11579"
>>>>> id  pool image  snap      device
>>>>> 47  rbd  c11579 -         /dev/rbd47
>>>>> root@alxc13:~# fsfreeze -f /var/lxc/c11579
>>>>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>>>>> 12800+0 records in
>>>>> 12800+0 records out
>>>>> 107374182400 bytes (107 GB) copied, 617.815 s, 174 MB/s
>>>>> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63      <--- Check sum after freeze
>>>>> root@alxc13:~# rbd snap create rbd/c11579@snap_test
>>>>> root@alxc13:~# rbd map c11579@snap_test
>>>>> /dev/rbd1
>>>>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>>>>> 12800+0 records in
>>>>> 12800+0 records out
>>>>> 107374182400 bytes (107 GB) copied, 610.043 s, 176 MB/s
>>>>> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63     <--- Check sum of snapshot
>>>>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>>>>> 12800+0 records in
>>>>> 12800+0 records out
>>>>> 107374182400 bytes (107 GB) copied, 592.164 s, 181 MB/s
>>>>> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63    <--- Check sum of original device, not changed - GOOD
>>>>> root@alxc13:~# file -s /dev/rbd1
>>>>> /dev/rbd1: Linux rev 1.0 ext4 filesystem data (extents) (large files) (huge files)
>>>>> root@alxc13:~# fsfreeze -u /var/lxc/c11579
>>>>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>>>>> 12800+0 records in
>>>>> 12800+0 records out
>>>>> 107374182400 bytes (107 GB) copied, 647.01 s, 166 MB/s
>>>>> 92b7182591d7d7380435cfdea79a8897  /dev/fd/63   <--- After unfreeze checksum is different - OK
>>>>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>>>>> 12800+0 records in
>>>>> 12800+0 records out
>>>>> 107374182400 bytes (107 GB) copied, 590.556 s, 182 MB/s
>>>>> bc3b68f0276c608d9435223f89589962  /dev/fd/63 <--- Why the heck the checksum of the snapshot is different after unfreeze? BAD?
>>>>> root@alxc13:~# file -s /dev/rbd1
>>>>> /dev/rbd1: Linux rev 1.0 ext4 filesystem data (needs journal recovery) (extents) (large files) (huge files)
>>>>> root@alxc13:~#
>>>>>
>>>>
>>>> And something even more peculiar - taking an md5sum some hours after the
>>>> above test produced this:
>>>>
>>>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>>>> 12800+0 records in
>>>> 12800+0 records out
>>>> 107374182400 bytes (107 GB) copied, 636.836 s, 169 MB/s
>>>> e68e41616489d41544cd873c73defb08  /dev/fd/63
>>>>
>>>> Meaning the read-only snapshot has somehow "mutated", even though it
>>>> wasn't recreated; it's the same old snapshot. Is this normal?
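A straightforward way to confirm that a supposedly read-only device is changing over time is to re-checksum it at intervals and compare the digests. A minimal sketch (the path, interval, and round count are illustrative, and this again uses buffered reads rather than direct I/O):

```python
import hashlib
import time

def watch_for_mutation(path, interval=60, rounds=3, chunk=8 * 1024 * 1024):
    """Checksum `path` several times, `interval` seconds apart.
    Returns the list of digests; more than one distinct value
    means the content changed between reads."""
    def digest():
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    seen = [digest()]
    for _ in range(rounds - 1):
        time.sleep(interval)
        seen.append(digest())
    return seen

# Illustrative usage: a read-only snapshot should yield one distinct digest.
# digests = watch_for_mutation("/dev/rbd1", interval=3600, rounds=4)
# mutated = len(set(digests)) > 1
```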
>>>
>>> Hrm, I wonder if it missed a snapshot context update.  Please pastebin
>>> entire dmesg for that boot.
>>
>> The machine has been up for more than 2 days, and the dmesg buffer has
>> wrapped several times since then. The node is also rather busy, so
>> there's plenty of irrelevant noise in dmesg. I grepped for rbd1/rbd0
>> and found no matching lines, so it's unlikely you'd get anything useful.
> 
> Kernel messages are logged, you can get to them with journalctl -k or
> syslog.  Grep for libceph?
> 
>>
>>>
>>> Have those devices been remapped or alxc13 rebooted since then?  If
>>> not, what's the output of
>>>
>>> $ rados -p rbd listwatchers $(rbd info c11579 | grep block_name_prefix
>>> | awk '{ print $2 }' | sed 's/rbd_data/rbd_header/')
>>
>> watcher=xx.xxx.xxx.xx:0/3416829538 client.157729 cookie=673
>> watcher=xx.xxx.xxx.xx:0/3416829538 client.157729 cookie=676
> 
> What's the output of
> 
> $ cat /sys/bus/rbd/devices/47/client_id
> $ cat /sys/bus/rbd/devices/1/client_id

cat /sys/bus/rbd/devices/47/client_id
client157729
cat /sys/bus/rbd/devices/1/client_id
client157729

Client client157729 is alxc13, judging by the IP address shown by the
rados -p ... command above. So it's the only client where the rbd
images are mapped.

>>
>>
>>>
>>> and can you check whether that snapshot is continuing to mutate as the
>>> image is mutated - freeze /var/lxc/c11579 again and check rbd47 and
>>> rbd1?
>>
>> That would take a bit more time since it involves downtime to production
>> workloads.
>>
>> Btw, are you on IRC in ceph/ceph-devel ?
> 
> dis on #ceph-devel, but I'd rather do this via email.
> 
> Thanks,
> 
>                 Ilya
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


