Re: speeding rbd object/diff map building - osd side

Maged Mokhtar <mmokhtar@xxxxxxxxxxx> · Wed, 3 Apr 2019 15:59:06 +0200

On 03/04/2019 14:47, Jason Dillaman wrote:
On Tue, Apr 2, 2019 at 7:11 PM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:

Hi all,

Would it be useful, when building or rebuilding the rbd object map as well as
the snap diff map, to have the osd side process multiple objects per request,
instead of having to test 1 object at a time, which is very slow?

Technically, it tests up to "rbd_concurrent_management_ops" (defaults
to 10) objects at a time.

Yes, but each test is still counted as a separate client iop. For example, if
your cluster can do 50K iops, a 1TB rbd image has 250K objects and so will
take 5 sec to iterate over; a 10TB image will need 50 sec, assuming no other load.
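
The estimate above can be reproduced with a few lines of Python. This is only a back-of-envelope sketch: it assumes the default 4 MiB rbd object size, one client iop per object tested, and a cluster that sustains its full iops rate for the scan. The function name and parameters are made up for illustration.

```python
def scan_estimate(image_bytes, object_bytes=4 * 1024**2, cluster_iops=50_000):
    """Return (object_count, seconds) for testing every object once.

    Assumes one client iop per object and no other load on the cluster.
    """
    # Round up so a partially filled last object is still counted.
    objects = -(-image_bytes // object_bytes)
    return objects, objects / cluster_iops


TIB = 1024**4

if __name__ == "__main__":
    for size_tib in (1, 10):
        objects, seconds = scan_estimate(size_tib * TIB)
        print(f"{size_tib} TiB: {objects} objects, about {seconds:.1f} s")
```

With binary units this gives 262144 objects and about 5.2 s for 1 TiB, and about 52.4 s for 10 TiB, in line with the rounded figures above.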

One approach to speed things up would be to extend the pgnls operation
filter to accept an object prefix and to give it access to a snap context;
currently the filter interface only supports xattr.

Searching potentially hundreds of millions of objects on the OSDs to
find a very small percentage of related RBD data objects doesn't seem
too efficient to me at first glance. You also wouldn't necessarily
want to block the OSD's PG thread searching through and filtering tons
of object metadata since that would impact other unrelated IO. To
reduce the impact, you would want to limit the listing to querying
some fixed amount of objects, which given the expected low percentage
of hits, would result in 0 or very few actual results per query.

The hit percentage is the ratio of the image's size to the total size of all
images within the same pool, so if you have 50 images of equal size in a pool
it is 2%. But rocksdb will iterate keys in reverse-hash order, so I expect
they will mostly be in-memory lookups. The rados iterator asks pgnls for
1k objects at a time, so each listing op is limited.
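
To put rough numbers on the hit ratio, here is a small Python sketch. It assumes every image is fully allocated and that the image's objects are spread uniformly across the listing order, so each 1k listing batch sees the same fraction of hits. The function name and the example pool (50 images of 1 TiB with 4 MiB objects) are illustrative only.

```python
def listing_estimate(image_objects, pool_objects, batch=1000):
    """Return (hit_ratio, expected_hits_per_batch, listing_calls).

    Assumes the image's objects are uniformly spread over the pool listing.
    """
    ratio = image_objects / pool_objects
    # Number of listing calls needed to walk the whole pool, rounded up.
    calls = -(-pool_objects // batch)
    return ratio, ratio * batch, calls


if __name__ == "__main__":
    per_image = 262144  # objects in a 1 TiB image with 4 MiB objects
    ratio, hits, calls = listing_estimate(per_image, 50 * per_image)
    print(f"hit ratio {ratio:.0%}, about {hits:.0f} hits per batch, {calls} calls")
```

Under these assumptions each 1k batch returns about 20 matching objects, and the whole pool has to be walked to find them all.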

Still, I agree the hit ratio does not look nice, hence the second method
proposed, which does not have this problem.

Another approach would be to add a new pg op that does hit testing: the
client would pass an object prefix + a vector of object_no and optional
to/from snap ids. The client would pre-assign the objects to pgs and
would send, say, 1k objects per op at a time. Sorting the objects in
reverse-hash order may further speed up the rocksdb lookups.
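
The client-side part of this could look roughly like the Python sketch below: build each object name from the prefix and object_no, group the object numbers by pg, and cut each group into batches of at most 1k. The pg mapping here is a stand-in (CRC32 modulo pg_num), not Ceph's real placement hash, and the prefix `rbd_data.abc123` is a made-up example. The proposed op itself and the reverse-hash sorting are not modeled.

```python
import zlib
from collections import defaultdict


def object_name(prefix, object_no):
    # RBD data objects are named "<prefix>.<object number as 16 hex digits>".
    return f"{prefix}.{object_no:016x}"


def toy_pg(name, pg_num):
    # Stand-in only: real clients use Ceph's own hash and pg mapping.
    return zlib.crc32(name.encode()) % pg_num


def build_batches(prefix, object_nos, pg_num, batch_size=1000):
    """Group object numbers by pg and split each group into batches.

    Returns a list of (pg, [object_no, ...]) tuples, one per op to send.
    """
    per_pg = defaultdict(list)
    for object_no in object_nos:
        per_pg[toy_pg(object_name(prefix, object_no), pg_num)].append(object_no)

    batches = []
    for pg in sorted(per_pg):
        nos = per_pg[pg]
        for start in range(0, len(nos), batch_size):
            batches.append((pg, nos[start:start + batch_size]))
    return batches


if __name__ == "__main__":
    batches = build_batches("rbd_data.abc123", range(5000), pg_num=8)
    print(f"{len(batches)} ops for 5000 objects across 8 pgs")
```

Each tuple corresponds to one hit-testing op sent to a single pg, so the number of client ops drops from one per object to roughly one per 1k objects per pg.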

Would either of these approaches be useful? Any gotchas?

/Maged