On 03/04/2019 21:49, Jason Dillaman wrote:
On Wed, Apr 3, 2019 at 9:59 AM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
On 03/04/2019 14:47, Jason Dillaman wrote:
On Tue, Apr 2, 2019 at 7:11 PM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
Hi all,
Would it be useful, when building/rebuilding the rbd object map as well as the snap diff map, to have OSD-side processing of multiple objects, instead of having to test one object at a time, which is very slow?
Technically, it tests up to "rbd_concurrent_management_ops" (defaults
to 10) objects at a time.
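For illustration, the per-object test amounts to roughly the following, sketched here with the Python rados/rbd bindings, ignoring snapshots for brevity (pool and image names are placeholders; librbd actually does this in C++ and keeps up to rbd_concurrent_management_ops requests in flight):

import rados
import rbd

# Illustrative only: one stat round trip per data object of a format-2 image.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

with rbd.Image(ioctx, 'test-image') as image:
    info = image.stat()
    prefix = info['block_name_prefix']      # e.g. "rbd_data.<image id>"
    exists = [False] * info['num_objs']
    for object_no in range(info['num_objs']):
        oid = '%s.%016x' % (prefix, object_no)
        try:
            ioctx.stat(oid)                 # separate request per data object
            exists[object_no] = True
        except rados.ObjectNotFound:
            pass

ioctx.close()
cluster.shutdown()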
Yes, but each object is still a separate client iop. For example, if your cluster can do 50K iops, a 1TB rbd image has ~250K objects (at the default 4MB object size), so it will take 5 sec to iterate over; a 10TB image will need 50 sec, assuming no other load.
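(Spelling out the arithmetic, assuming the default 4 MB object size:)

object_size = 4 * 2**20                    # default 4 MiB objects
num_objects = (1 * 2**40) // object_size   # 262144 objects for a 1 TiB image
cluster_iops = 50000
print(num_objects / cluster_iops)          # ~5 seconds for 1 TiB
print(10 * num_objects / cluster_iops)     # ~52 seconds for 10 TiB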
One approach to speed things up would be to extend the pgnls operation filter to accept an object prefix plus the ability to access a snap context; currently the filter interface only supports xattr matching.
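Roughly, the filter semantics I have in mind would look like this (Python sketch for clarity only; the real filter would live in the OSD's pgnls handling, and all names below are made up):

# Hypothetical semantics for the proposed pgnls filter: keep objects whose
# name starts with the image's data prefix and, optionally, whose clones
# overlap a requested snap-id range.
def proposed_pgnls_filter(object_name, object_snap_ids,
                          prefix, from_snap=None, to_snap=None):
    if not object_name.startswith(prefix):
        return False
    if from_snap is None and to_snap is None:
        return True                        # object-map / existence case
    # snap-diff case: does any clone fall inside the requested range?
    return any((from_snap is None or snap >= from_snap) and
               (to_snap is None or snap <= to_snap)
               for snap in object_snap_ids)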
Searching potentially hundreds of millions of objects on the OSDs to
find a very small percentage of related RBD data objects doesn't seem
too efficient to me at first glance. You also wouldn't necessarily
want to block the OSD's PG thread searching through and filtering tons
of object metadata since that would impact other unrelated IO. To
reduce the impact, you would want to limit the listing to querying some fixed number of objects, which, given the expected low percentage of hits, would result in 0 or very few actual results per query.
The hit percentage is the ratio of the image's size to that of the other images within the same pool, so if you have 50 equally sized images in a pool it is 2%. However, RocksDB will iterate keys in reverse-hash order, so I expect these will mostly be in-memory lookups. The rados iterator sends 1k objects at a time to pgnls, so it is limited.
50 images is really not the "cloud" scale that Ceph is trying to optimize for. If the iterator needed to eliminate the vast majority of objects via its filter, your 1K-object result could mean potentially hundreds of thousands of unrelated objects being scanned, polluting the various caches in use within the OSD. At cloud scale, would the cost of this metadata scan and the resulting massive filtering be less than a targeted object lookup, when factoring in network latency?
Agree
Still, I agree the hit ratio does not look nice, hence the second proposed method, which does not have this problem.
Another approach would be to add a new PG op that does hit testing: the client would pass an object prefix plus a vector of object_no and optional to/from snap ids. The client would pre-assign the objects to PGs and would send, say, 1k objects per op at a time. Sorting the objects in reverse-hash order may further speed up the RocksDB lookups.
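A rough sketch of the client side of that proposal (Python; pg_of() and send_hit_test_op() are placeholders, since no such op exists in RADOS today):

from collections import defaultdict

BATCH = 1000                               # ~1k objects per op, as proposed

def build_hit_map(prefix, num_objs, pg_of, send_hit_test_op,
                  from_snap=None, to_snap=None):
    # Pre-assign each data object to its PG on the client side.
    per_pg = defaultdict(list)
    for object_no in range(num_objs):
        oid = '%s.%016x' % (prefix, object_no)
        per_pg[pg_of(oid)].append(object_no)

    hits = set()
    for pg, object_nos in per_pg.items():
        # Sorting in (reverse-)hash order should help RocksDB locality;
        # here we just batch ~1k object numbers per op to the PG.
        for i in range(0, len(object_nos), BATCH):
            hits |= send_hit_test_op(pg, prefix, object_nos[i:i + BATCH],
                                     from_snap, to_snap)
    return hits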
That seems like it would be a highly RBD-centric design being implemented as a PG op. What would happen if a single object in the set is being rebalanced to another PG while the client sends the op?
Would the entire op need to be aborted, targets recalculated, and
resent? Calculating the snap diffs is also a very different operation
from computing object existence.
If there is an epoch change, or in case of error, we can fall back on the current rbd implementation. The snap diff operation would use the same code as now, but it would be called server side. Basically we would be doing the same operations on the same number of objects as we do now, but instead of 1 object per op we would do many, and pre-sorting the objects by their keys will likely make many of the object and snap-context lookups in-memory.
Would either of these approaches be useful? Any gotchas?
I just wonder if you are constantly running into cases where you need
to rebuild the object-map? If so, that is a bug we should fix in
librbd. The rebuilding process can also just be run in the background
against live workloads, so there really shouldn't be downtime. There
are definitely improvements that can be made to the current object-map
implementation (e.g. max image limit, OSD per-object write ops/sec
limit, requirement for exclusive-lock).
The main use case I hope to speed up is on-demand generation of object and diff maps to support export/diff operations in an active/active setup.
Thanks a lot Jason. /Maged