On Wed, Apr 3, 2019 at 9:59 AM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
>
> On 03/04/2019 14:47, Jason Dillaman wrote:
> > On Tue, Apr 2, 2019 at 7:11 PM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
> >>
> >> Hi all,
> >>
> >> would it be useful, when building/rebuilding the rbd object map as
> >> well as the snap diff map, to have OSD-side processing of multiple
> >> objects instead of having to test one object at a time, which is
> >> very slow?
> >
> > Technically, it tests up to "rbd_concurrent_management_ops" (defaults
> > to 10) objects at a time.
>
> Yes, but each test is still counted as a separate client iop. For
> example, if your cluster can do 50K iops, a 1TB rbd image has ~250K
> objects, so it will take 5 sec to iterate over; a 10TB image will need
> 50 sec, assuming no other load.
>
> >> One approach to speed things up would be to extend the pgnls
> >> operation filter to accept an object prefix plus the ability to
> >> access a snap context; currently the filter interface only supports
> >> xattrs.
> >
> > Searching potentially hundreds of millions of objects on the OSDs to
> > find a very small percentage of related RBD data objects doesn't seem
> > too efficient to me at first glance. You also wouldn't necessarily
> > want to block the OSD's PG thread searching through and filtering
> > tons of object metadata, since that would impact other unrelated IO.
> > To reduce the impact, you would want to limit the listing to querying
> > some fixed number of objects, which, given the expected low
> > percentage of hits, would result in zero or very few actual results
> > per query.
>
> The hit percentage is the ratio of the size of the image to that of
> the other images within the same pool, so if you have 50 images of
> equal size in a pool it is 2%, but rocksdb will iterate keys in
> reverse-hash order, so I expect these will mostly be in-memory
> lookups. The rados iterator sends 1k objects at a time to pgnls, so it
> is limited.

50 images is really not the "cloud" scale that Ceph is trying to
optimize against. If the iterator needed to eliminate the vast majority
of objects from its filter, your 1K-object result would have required
potentially hundreds of thousands of unrelated objects to be scanned,
potentially polluting the various caches in use within the OSD. At
cloud scale, would the cost of this metadata scan and the resulting
massive filtering be less than a targeted object lookup when factoring
in network latency?

> Still, I agree the hit ratio does not look nice, hence the second
> proposed method, which does not have this problem.
>
> >> Another approach would be to add a new pg op that does hit testing:
> >> the client would pass an object prefix plus a vector of object_nos
> >> and optional to/from snap ids. The client would pre-assign the
> >> objects to pgs and would send, say, 1k objects per op at a time.
> >> Sorting the objects in reverse-hash order may further speed up the
> >> rocksdb lookups.

That seems like it would be a highly RBD-centric design being
implemented as a PG op. What would happen if a single object in the set
is being rebalanced to another PG while the client sends the op? Would
the entire op need to be aborted, targets recalculated, and resent?
Calculating the snap diffs is also a very different operation from
computing object existence.
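For a concrete baseline, here is roughly what a rebuild reduces to
today, sketched against the Python rados bindings. The image id and
object count below are made up, and librbd actually issues its own
internal ops with up to "rbd_concurrent_management_ops" in flight
rather than serial stat calls, but each probe is still a separate
client op:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

image_id = '101558e6e0862'   # hypothetical image id
num_objects = 262144         # 1TB image at the default 4MB object size

object_map = [False] * num_objects
for object_no in range(num_objects):
    oid = 'rbd_data.%s.%016x' % (image_id, object_no)
    try:
        ioctx.stat(oid)      # one client op per object
        object_map[object_no] = True
    except rados.ObjectNotFound:
        pass                 # object was never written

ioctx.close()
cluster.shutdown()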
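And, for comparison, a very rough sketch of the client side of your
proposed hit-test op. Note that object_hash(), reverse_bits(), and
send_hit_test_op() are all placeholders (none of them exist in
librados), and the PG mapping is simplified to a plain modulo of the
name hash:

import hashlib

BATCH = 1024

def object_hash(oid):
    # placeholder for the rjenkins hash Ceph applies to object names
    return int.from_bytes(hashlib.md5(oid.encode()).digest()[:4], 'little')

def reverse_bits(h):
    # rough model of the bit-reversed hash order objects are stored in
    return int('{:032b}'.format(h & 0xffffffff)[::-1], 2)

def send_hit_test_op(pg, prefix, object_nos):
    # the proposed new PG op: return the subset of object_nos whose
    # objects (prefix + 16-hex-digit object_no) exist in this PG
    raise NotImplementedError('hypothetical op, not implemented')

def hit_test(prefix, num_objects, pg_count):
    # pre-assign object numbers to PGs (the real mapping is a stable
    # mod of the name hash, simplified here)
    batches = {}
    for object_no in range(num_objects):
        h = object_hash('%s%016x' % (prefix, object_no))
        batches.setdefault(h % pg_count, []).append(
            (reverse_bits(h), object_no))

    exists = set()
    for pg, objs in batches.items():
        objs.sort()   # reverse-hash order, for rocksdb key locality
        for i in range(0, len(objs), BATCH):
            chunk = [object_no for _, object_no in objs[i:i + BATCH]]
            exists.update(send_hit_test_op(pg, prefix, chunk))
    return exists

Even in sketch form, the PG pre-assignment in hit_test() is the part
that concerns me, since the mapping can change underneath the client
mid-rebalance, per the question above.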
> >> Would either of these approaches be useful? Any gotchas?

I just wonder if you are constantly running into cases where you need
to rebuild the object-map? If so, that is a bug we should fix in
librbd. The rebuilding process can also just be run in the background
against live workloads, so there really shouldn't be downtime. There
are definitely improvements that can be made to the current object-map
implementation (e.g. the max image size limit, the OSD per-object
write ops/sec limit, the requirement for exclusive-lock).

> >> /Maged
>

--
Jason