Re: Invalid RBD object maps of snapshots on Mimic

On Thu, Jan 10, 2019 at 10:50 AM Oliver Freyermuth
<freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>
> Dear Jason and list,
>
> Am 10.01.19 um 16:28 schrieb Jason Dillaman:
> > On Thu, Jan 10, 2019 at 4:01 AM Oliver Freyermuth
> > <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Dear Cephalopodians,
> >>
> >> I performed several consistency checks now:
> >> - Exporting an RBD snapshot before and after the object map rebuilding.
> >> - Exporting a backup as raw image, all backups (re)created before and after the object map rebuilding.
> >> - md5summing all of that for a snapshot for which the rebuilding was actually needed.
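> >> (Concretely, roughly like the following — pool, image and snapshot names are placeholders:)
> >>
> >>   rbd export rbd/one-vm-disk@nightly-2019-01-09 before.raw    # export with the original object map
> >>   rbd object-map rebuild rbd/one-vm-disk@nightly-2019-01-09   # rebuild the snapshot's object map
> >>   rbd export rbd/one-vm-disk@nightly-2019-01-09 after.raw     # export again afterwards
> >>   md5sum before.raw after.raw                                 # compare the two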
> >>
> >> The good news: I found that all checksums are the same. So the backups are (at least for those I checked) not broken.
> >>
> >> I also checked the source and found:
> >> https://github.com/ceph/ceph/blob/master/src/include/rbd/object_map_types.h
> >> So to my understanding, the object map entries are OBJECT_EXISTS, but should be OBJECT_EXISTS_CLEAN.
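> >> For reference, the object-map states defined in that header are (quoting from current master, so treat as illustrative):
> >>
> >>   OBJECT_NONEXISTENT  = 0
> >>   OBJECT_EXISTS       = 1
> >>   OBJECT_PENDING      = 2
> >>   OBJECT_EXISTS_CLEAN = 3
> >>
> >> so "marked as 1, but should be 3" means EXISTS where EXISTS_CLEAN was expected.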
> >> Do I understand correctly that OBJECT_EXISTS_CLEAN relates to the object being unchanged ("clean") as compared to another snapshot / the main volume?
> >>
> >> If so, this would explain why the backups, exports etc. are all okay: the backup tools only got "too many" objects in the fast-diff and
> >> hence extracted more objects from Ceph-RBD than necessary. Since both Benji and Backy2 deduplicate again in their backends,
> >> this causes only a minor network-traffic inefficiency.
> >>
> >> Is my understanding correct?
> >> Then the underlying issue would still be a bug, but (as it seems) a harmless one.
> >
> > Yes, your understanding is correct in that it's harmless from a
> > data-integrity point-of-view.
> >
> > During the creation of a snapshot, the current object map (for the
> > HEAD revision) is copied to a new object map for that snapshot, and
> > then all the objects in the HEAD revision's object map are marked
> > as EXISTS_CLEAN (if they were EXISTS). Somehow an IO operation is
> > causing the object map to think there is an update, but apparently
> > no object update is actually occurring (or at least the OSD doesn't
> > think a change occurred).
>
> thanks a lot for the clarification! Good to know my understanding is correct.
>
> I re-checked all object maps just now. Again, the most recent snapshots show this issue, but only those.
> The only "special" thing which probably not everybody is doing would likely be us running fstrim in the machines
> running from the RBD regularly, to conserve space.
>
> I am not sure how exactly the DISCARD operation is handled in rbd. But since this was my guess, I just did an fstrim inside one of the VMs,
> and checked the object-maps again. I get:
> 2019-01-10 16:44:25.320 7f06f67fc700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.4f587327b23c6.0000000000000040 marked as 1, but should be 3
> In this case, I got it for the volume itself and not a snapshot.
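> (The reproduction is essentially just the following — the image name is a placeholder:)
>
>   fstrim -av                             # inside the VM: trim all mounted filesystems
>   rbd object-map check rbd/one-vm-disk   # afterwards, on a client node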
>
> So it seems to me that sometimes, a DISCARD causes the object map to mark objects as updated although they have not changed.
> Sadly, due to my lack of in-depth code knowledge and of a real debug setup, I cannot track it down further :-(.
>
> Cheers and hope that helps a code expert in tracking it down (at least it's not affecting data integrity),

Thanks, that definitely provides a good investigation starting point.

>         Oliver
>
> >
> >> I'll let you know if it happens again to some of our snapshots, and if so, if it only happens to newly created ones...
> >>
> >> Cheers,
> >>          Oliver
> >>
> >> Am 10.01.19 um 01:18 schrieb Oliver Freyermuth:
> >>> Dear Cephalopodians,
> >>>
> >>> inspired by http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032092.html I did a check of the object-maps of our RBD volumes
> >>> and snapshots. We are running 13.2.1 on the cluster I am talking about, all hosts (OSDs, MONs, RBD client nodes) still on CentOS 7.5.
> >>>
> >>> Sadly, I found that for at least 50 % of the snapshots (only the snapshots, not the volumes themselves), I got something like:
> >>> --------------------------------------------------------------------------------------------------
> >>> 2019-01-09 23:00:06.481 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0000000000000260 marked as 1, but should be 3
> >>> 2019-01-09 23:00:06.563 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0000000000000840 marked as 1, but should be 3
> >>> --------------------------------------------------------------------------------------------------
> >>> 2019-01-09 23:00:09.166 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0000000000000480 marked as 1, but should be 3
> >>> 2019-01-09 23:00:09.228 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0000000000000840 marked as 1, but should be 3
> >>> --------------------------------------------------------------------------------------------------
> >>> It often appears to affect 1-3 entries in the map of a snapshot. The Object Map was *not* marked invalid before I ran the check.
> >>> After rebuilding it, the check is fine again.
> >>>
> >>> The cluster has not yet seen any Ceph update (it was installed as 13.2.1, we plan to upgrade to 13.2.4 soonish).
> >>> There have been no major causes for worry so far. We purged a single OSD disk, balanced PGs with upmap, modified the CRUSH topology slightly, etc.
> >>> The cluster was never in a prolonged unhealthy period, nor did we have to repair any PG.
> >>>
> >>> Is this a known error?
> >>> Is it harmful, or is this just something like reference counting being off, and objects being in the map which did not really change in the snapshot?
> >>>
> >>> Our usecase, in case that helps to understand or reproduce:
> >>> - RBDs are used as disks for qemu/kvm virtual machines.
> >>> - Every night:
> >>>     - We run an fstrim in the VM (which propagates discards to RBD and frees unused blocks), fsfreeze it, take a snapshot, and thaw it again.
> >>>     - After that, we run two backups with Benji backup ( https://benji-backup.me/ ) and Backy2 backup ( http://backy2.com/docs/ )
> >>>       which seems to work rather well so far.
> >>>     - We purge some old snapshots.
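> >>> (As a rough sketch — domain, pool and image names are placeholders, and the exact benji/backy2 invocations are left out:)
> >>>
> >>>   ssh root@one-vm 'fstrim -av'                           # trim inside the VM
> >>>   virsh domfsfreeze one-vm                               # freeze guest filesystems (needs qemu-guest-agent)
> >>>   rbd snap create rbd/one-vm-disk@nightly-$(date +%F)    # take the snapshot
> >>>   virsh domfsthaw one-vm                                 # thaw again
> >>>   # ... benji / backy2 differential backups from the new snapshot ...
> >>>   rbd snap rm rbd/one-vm-disk@nightly-<some-old-date>    # purge old snapshots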
> >>>
> >>> We use the following RBD feature flags:
> >>> layering, exclusive-lock, object-map, fast-diff, deep-flatten
> >>>
> >>> Since Benji and Backy2 are optimized for differential RBD backups to deduplicated storage, they leverage "rbd diff" (and hence make use of fast-diff, I would think).
> >>> If rbd diff produces wrong output due to this issue, it would affect our backups (but it would also affect classic backups of snapshots via "rbd export"...).
> >>> In case the issue is known or understood, can somebody extrapolate whether this means "rbd diff" contains too many blocks or actually misses changed blocks?
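> >>> (For illustration, the kind of query these tools issue — a sketch with placeholder names:)
> >>>
> >>>   rbd diff --from-snap nightly-2019-01-08 --whole-object \
> >>>       rbd/one-vm-disk@nightly-2019-01-09 --format json
> >>>
> >>> With fast-diff enabled, --whole-object lets this be answered from the object maps alone, so a wrong map would directly show up here.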
> >>>
> >>>
> >>> From now on, we are running daily, full object-map checks on all volumes and snapshots, and automatically rebuild any object map the check finds invalid.
> >>> Hopefully, this will allow us to correlate the appearance of these issues with "something" happening on the cluster.
> >>> I did not detect a clear pattern in the affected snapshots, though; it seemed rather random...
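> >>> (Our check job is roughly the following — a sketch, assuming "rbd object-map check" exits non-zero when it finds inconsistencies:)
> >>>
> >>>   for img in $(rbd ls rbd); do
> >>>       rbd object-map check rbd/$img || rbd object-map rebuild rbd/$img
> >>>       for snap in $(rbd snap ls rbd/$img --format json | jq -r '.[].name'); do
> >>>           rbd object-map check rbd/$img@$snap || rbd object-map rebuild rbd/$img@$snap
> >>>       done
> >>>   done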
> >>>
> >>> Maybe it would also help in understanding this issue if somebody else using RBD in a similar manner on Mimic could check their object maps.
> >>> Since this issue does not show up until a check is performed, it stayed under our radar for many months...
> >>>
> >>> Cheers,
> >>>        Oliver
> >>>
> >>
> >
> >
> >
>
>
> --
> Oliver Freyermuth
> Universität Bonn
> Physikalisches Institut, Raum 1.047
> Nußallee 12
> 53115 Bonn
> --
> Tel.: +49 228 73 2367
> Fax:  +49 228 73 7869
> --
>


-- 
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



