Hi cephers,
I've been looking into better balancing our clusters with upmaps lately,
and ran into upmap cases that behave in a less than ideal way. If there
is any cycle in the upmaps like
ceph osd pg-upmap-items <pgid> a b b a
or
ceph osd pg-upmap-items <pgid> a b b c c a
the upmap validation passes, the upmap gets added to the osdmap, but
then gets silently ignored. This matters for EC pools, where the position
of each OSD in the set is significant; it is irrelevant for replicated
pools, where the order of OSDs is not significant.
The relevant code in OSDMap::_apply_upmap even has a comment about this:
    if (q != pg_upmap_items.end()) {
      // NOTE: this approach does not allow a bidirectional swap,
      // e.g., [[1,2],[2,1]] applied to [0,1,2] -> [0,2,1].
      for (auto& r : q->second) {
        // make sure the replacement value doesn't already appear
        ...
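To make the failure concrete, here is a simplified sketch of a sequential
apply loop of the kind that comment describes. It is not the actual Ceph
code: the function name apply_upmap_items_sequential is made up for this
illustration, plain int vectors stand in for the real types, and any checks
on the target OSD's state are left out. It only shows why every pair of a
cycle gets skipped when each target is already present in the set.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Simplified illustration only; NOT the real OSDMap::_apply_upmap.
// Each (from, to) pair is applied in order against the current set, and a
// pair is skipped whenever `to` already appears anywhere in the set.
std::vector<int> apply_upmap_items_sequential(
    std::vector<int> raw,
    const std::vector<std::pair<int, int>>& items)
{
  for (const auto& r : items) {
    bool exists = false;       // is the replacement already in the set?
    std::ptrdiff_t pos = -1;   // first position holding the source OSD
    for (std::size_t i = 0; i < raw.size(); ++i) {
      if (raw[i] == r.second) {
        exists = true;
        break;
      }
      if (raw[i] == r.first && pos < 0) {
        pos = static_cast<std::ptrdiff_t>(i);
      }
    }
    if (!exists && pos >= 0) {
      raw[pos] = r.second;     // plain replacement: target was not present
    }
  }
  return raw;
}
```

With this sketch, [[1,2],[2,1]] applied to [0,1,2] returns [0,1,2]: the
first pair is skipped because 2 is already in the set, and the second
because 1 is. The three-way cycle [[0,1],[1,2],[2,0]] is skipped the same
way, while a non-cyclic item such as [[1,3]] yields [0,3,2].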
I'm trying to understand the reason for this limitation: is it just a
matter of coding convenience (OSDMap::_apply_upmap could handle these
cycles correctly with a bit more care), or is there some inherent
limitation somewhere else that prevents these cases from working?
I did notice that just updating
crush weights (without using upmaps) produces similar changes to the UP
set (swaps OSDs in EC pools sometimes), so the OSDs seem to be perfectly
capable of doing backfills for osdmap changes that shuffle the order of
OSDs in the UP set. Some insight/history here would be appreciated.
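For illustration, one shape a "more careful approach" could take is to
apply all pairs at once against the original set and only then check for
duplicates. The sketch below is purely hypothetical and not something Ceph
implements: the name apply_upmap_items_simultaneous, the "first pair for a
source wins" rule, and the "ignore the whole upmap on a duplicate" policy
are all assumptions made for this example. It also assumes the set has no
repeated placeholder entries.

```cpp
#include <cassert>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical alternative; NOT Ceph's behavior. All (from, to) pairs are
// looked up against the ORIGINAL set, so cycles become a permutation.
std::vector<int> apply_upmap_items_simultaneous(
    const std::vector<int>& raw,
    const std::vector<std::pair<int, int>>& items)
{
  std::unordered_map<int, int> remap;
  for (const auto& r : items) {
    remap.emplace(r.first, r.second);  // assumption: first pair per source wins
  }
  std::vector<int> out(raw);
  for (auto& osd : out) {
    auto it = remap.find(osd);
    if (it != remap.end()) {
      osd = it->second;
    }
  }
  // Assumed policy: if the result repeats an OSD, ignore the whole upmap.
  std::set<int> seen(out.begin(), out.end());
  if (seen.size() != out.size()) {
    return raw;
  }
  return out;
}
```

Under these assumptions [[1,2],[2,1]] applied to [0,1,2] gives [0,2,1] and
[[0,1],[1,2],[2,0]] gives [1,2,0]. Note that this is not a drop-in change:
a chained item such as [[1,3],[3,4]] gives [0,3,2] here, whereas applying
the pairs one after another would give [0,4,2], so old and new code could
compute different sets from the same osdmap.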
Either way, the behavior of validation passing on an upmap and then the
upmap getting silently ignored is not ideal. I do realize that all
clients would have to agree on this code, since clients independently
execute it to find the OSDs to access (so rolling out a change to this
is challenging).
Andras