Re: Why onode key conforms such an order in get_object_key?


 



Hi,


I can think of a couple of reasons.


1) You have a compound key to identify the object and you might want to quickly determine what shard/pool/pg/namespace is associated with the object before you even look at something like the object name when parsing the key.

2) We've noticed recently that in synthetic tests, performing operations against large numbers of ghobject_t objects in an unexpected order can be absolutely brutal on performance (up to 160x slower!). Depending on how the key is used here (I haven't looked closely yet), the goal may have been to make it easier to sort the keys before performing bulk operations. If you look at the ghobject_t cmp() function you'll see that it first compares max, then the shard_id, then the hobject_t, which in turn compares pool, the bitwise key, namespace, etc. This pretty much matches the order of the entries in the compound key you observed.
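
To make the comparison order in point 2 concrete, here is a minimal sketch. ToyObject, key_order_less and sort_for_bulk_op are hypothetical names made up for illustration, and the struct is a heavily simplified stand-in, not the real ghobject_t (it omits max, the locator key, snap and generation). It only shows the idea of comparing fields in the same order they appear in the key, and sorting a batch by that order before a bulk operation.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical, simplified stand-in for ghobject_t. NOT the real Ceph type.
struct ToyObject {
  int8_t shard;
  int64_t pool;
  uint32_t hash_reversed;  // stand-in for the bit-reversed hash ("bitwise key")
  std::string nspace;
  std::string name;
};

// Compare fields in the same order as they are laid out in the onode key:
// shard, pool, reversed hash, namespace, then name.
inline bool key_order_less(const ToyObject& a, const ToyObject& b) {
  return std::tie(a.shard, a.pool, a.hash_reversed, a.nspace, a.name) <
         std::tie(b.shard, b.pool, b.hash_reversed, b.nspace, b.name);
}

// Sort a batch into key order before handing it to a bulk operation.
inline void sort_for_bulk_op(std::vector<ToyObject>& objs) {
  std::sort(objs.begin(), objs.end(), key_order_less);
}
```

Note that in this ordering the namespace only breaks ties between objects that already share shard, pool and hash, so objects in one namespace are not adjacent to each other.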


Based on testing we've done recently, it looks like we absolutely must strive to maintain proper ordering when doing bulk operations against large numbers of objects.  Given some of the issues people have seen with slow omap operations in some cases, I suspect we will have to audit all relevant areas of the code to make sure we aren't using suboptimal ordering anywhere.  You can see how bad it can get in the "remove" tests documented in these PRs:


https://github.com/ceph/ceph/pull/40351

https://github.com/ceph/ceph/pull/39976


FWIW, these tests are not actually benchmarking Ceph itself. They just showcase what the objectstore does when remove is performed against a vector of ghobject_t objects sorted in name order (object1, object2, object3...) instead of proper ghobject_t ordering, under various memory configurations (part of the reason the ordering matters is that it changes the rocksdb IO patterns). If you were to change the bluestore code in such a way that you made the order of bulk operations less optimal, you could potentially make certain parts of the code incredibly slow (or simply break parsing of that key if you don't update the parsing code to match). I wouldn't suggest changing this code for your new cluster unless you are just experimenting for development purposes or are very, very sure you know what you are doing!
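
As a rough illustration of why name order and key order differ, here is a small sketch. Everything in it is an assumption for demonstration purposes: toy_hash is plain FNV-1a (not the hash Ceph uses), and toy_key is a made-up two-field key (big-endian hash followed by the name), far simpler than what get_object_key produces. The point is only that once a hash precedes the name in the key, a batch sorted by name lands at effectively random positions in the key space, while a batch sorted by key walks it sequentially.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the object hash (FNV-1a); Ceph uses its own hash.
inline uint32_t toy_hash(const std::string& s) {
  uint32_t h = 2166136261u;  // FNV-1a offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;          // FNV-1a prime
  }
  return h;
}

// Append a 32-bit value most-significant byte first, so that byte-wise
// (lexicographic) comparison of keys agrees with numeric comparison.
inline void append_u32_be(uint32_t v, std::string* key) {
  for (int shift = 24; shift >= 0; shift -= 8) {
    key->push_back(static_cast<char>((v >> shift) & 0xff));
  }
}

// Made-up toy key: big-endian hash, then the object name.
inline std::string toy_key(const std::string& name) {
  std::string key;
  append_u32_be(toy_hash(name), &key);
  key += name;
  return key;
}

// Return the names reordered so that their toy keys are ascending.
inline std::vector<std::string> names_in_key_order(std::vector<std::string> names) {
  std::sort(names.begin(), names.end(),
            [](const std::string& a, const std::string& b) {
              return toy_key(a) < toy_key(b);
            });
  return names;
}
```

Feeding names_in_key_order a list like object1, object2, object3... and printing the result is a quick way to see how little the two orders have in common.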


Mark


On 3/26/21 5:56 AM, 7onghc@xxxxxxxxx wrote:
Hi, I'm reading the function 'get_object_key' in src/os/bluestore/BlueStore.cc, and trying to understand why the onode key follows this order:

- shard_id
- hobj.pool
- hobj.hash_reverse_bits
- hobj.nspace
...

Would it be reasonable to change this order for a new cluster?

I only know that RocksDB stores omap and lists objects using the prefix 'O'.
So if I move 'hobj.nspace' to the head, will it be faster to list the objects in a namespace using 'rados ls -N {namespace}'?

===================================================

template<typename S>
static void get_object_key(CephContext *cct, const ghobject_t& oid, S *key)
{
   key->clear();

   size_t max_len = ENCODED_KEY_PREFIX_LEN +
                   (oid.hobj.nspace.length() * 3 + 1) +
                   (oid.hobj.get_key().length() * 3 + 1) +
                    1 + // for '<', '=', or '>'
                   (oid.hobj.oid.name.length() * 3 + 1) +
                    8 + 8 + 1;
   key->reserve(max_len);

   _key_encode_prefix(oid, key);

   append_escaped(oid.hobj.nspace, key);

   if (oid.hobj.get_key().length()) {
     // is a key... could be < = or >.
     append_escaped(oid.hobj.get_key(), key);
     // (ASCII chars < = and > sort in that order, yay)
     int r = oid.hobj.get_key().compare(oid.hobj.oid.name);
     if (r) {
       key->append(r > 0 ? ">" : "<");
       append_escaped(oid.hobj.oid.name, key);
     } else {
       // same as no key
       key->append("=");
     }
   } else {
     // no key
     append_escaped(oid.hobj.oid.name, key);
     key->append("=");
   }

   _key_encode_u64(oid.hobj.snap, key);
   _key_encode_u64(oid.generation, key);

   key->push_back(ONODE_KEY_SUFFIX);
}
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx




