Hi,
I can think of a couple of reasons.
1) You have a compound key to identify the object and you might want to
quickly determine what shard/pool/pg/namespace is associated with the
object before you even look at something like the object name when
parsing the key.
2) We've noticed recently in synthetic tests that performing operations
against large numbers of ghobject_t objects in an unexpected order can
be absolutely brutal on performance (up to 160x slower!).
Depending on how the key is used here (I haven't looked closely yet),
the goal may have been to make it easier to sort the keys before
performing bulk operations. If you look at the ghobject_t cmp()
function you'll see that it first compares max, then the shard_id, then
hobject_t which compares pool, the bitwise key, namespace, etc. This
pretty much matches the order of entries being used for the compound key
you observed.
Based on testing we've done recently, it looks like we absolutely must
strive to maintain proper ordering when doing bulk operations against
large numbers of objects. Given some of the issues people have seen
with slow omap operations in some cases, I suspect we will have to audit
all relevant areas of the code to make sure we aren't using suboptimal
ordering anywhere. You can see how bad it can get in the "remove" tests
documented in these PRs:
https://github.com/ceph/ceph/pull/40351
https://github.com/ceph/ceph/pull/39976
FWIW, these tests are not actually benchmarking ceph itself; they just
showcase what the objectstore does when remove is performed against a
vector of ghobject_t objects sorted in name order (object1, object2,
object3...) instead of proper ghobject_t ordering, under various memory
configurations (part of the reason the ordering matters is that it
changes the rocksdb IO patterns). If you were to change the bluestore code
in such a way that you made the order of bulk operations less optimal,
you could potentially make certain parts of the code incredibly slow (or
simply break parsing of that key if you don't update it). I wouldn't
suggest changing this code for your new cluster unless you are just
experimenting for development purposes or very very sure you know what
you are doing!
Mark
On 3/26/21 5:56 AM, 7onghc@xxxxxxxxx wrote:
Hi, I'm reading the function 'get_object_key' in src/os/bluestore/BlueStore.cc, and trying to understand why the onode key follows this order:
- shard_id
- hobj.pool
- hobj.hash_reverse_bits
- hobj.nspace
...
Would it be reasonable to change this order for a new cluster?
I only know that RocksDB stores omap and lists objects using the prefix 'O'.
So if I move 'hobj.nspace' to the head, will it be faster to list objects in a namespace using 'rados ls -N {namespace}'?
===================================================
template<typename S>
static void get_object_key(CephContext *cct, const ghobject_t& oid, S *key)
{
  key->clear();

  size_t max_len = ENCODED_KEY_PREFIX_LEN +
                   (oid.hobj.nspace.length() * 3 + 1) +
                   (oid.hobj.get_key().length() * 3 + 1) +
                   1 + // for '<', '=', or '>'
                   (oid.hobj.oid.name.length() * 3 + 1) +
                   8 + 8 + 1;
  key->reserve(max_len);

  _key_encode_prefix(oid, key);

  append_escaped(oid.hobj.nspace, key);

  if (oid.hobj.get_key().length()) {
    // is a key... could be < = or >.
    append_escaped(oid.hobj.get_key(), key);
    // (ASCII chars < = and > sort in that order, yay)
    int r = oid.hobj.get_key().compare(oid.hobj.oid.name);
    if (r) {
      key->append(r > 0 ? ">" : "<");
      append_escaped(oid.hobj.oid.name, key);
    } else {
      // same as no key
      key->append("=");
    }
  } else {
    // no key
    append_escaped(oid.hobj.oid.name, key);
    key->append("=");
  }

  _key_encode_u64(oid.hobj.snap, key);
  _key_encode_u64(oid.generation, key);

  key->push_back(ONODE_KEY_SUFFIX);
}
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx