Re: Why onode key conforms such an order in get_object_key?

Mark Nelson <mnelson@xxxxxxxxxx> · Fri, 26 Mar 2021 07:01:25 -0500

Hi,

I can think of a couple of reasons.

1) You have a compound key to identify the object and you might want to 
quickly determine what shard/pool/pg/namespace is associated with the 
object before you even look at something like the object name when 
parsing the key.

2) We've noticed recently in synthetic tests that performing operations 
against large numbers of ghobject_t objects in an unexpected order can 
be absolutely brutal on performance (up to 160x slower!).  
Depending on how the key is used here (I haven't looked closely yet), 
the goal may have been to make it easier to sort the keys before 
performing bulk operations.  If you look at the ghobject_t cmp() 
function you'll see that it first compares max, then the shard_id, then 
hobject_t which compares pool, the bitwise key, namespace, etc.  This 
pretty much matches the order of entries being used for the compound key 
you observed.
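
To illustrate why matching the comparison order matters, here is a 
minimal sketch, not the BlueStore code. It assumes a toy object type 
(ToyObject, toy_less, append_be and toy_key are all hypothetical names), 
unsigned fixed-width fields encoded most significant byte first, and 
strings that contain only bytes greater than the '!' terminator. Under 
those assumptions, comparing the encoded keys byte by byte gives the 
same answer as comparing the objects field by field:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>

// Hypothetical stand-in for ghobject_t; every name here is illustrative.
struct ToyObject {
  uint8_t shard;
  uint64_t pool;
  uint32_t hash;       // stands in for the bit-reversed hash
  std::string nspace;  // assumed to contain only bytes greater than '!'
  std::string name;    // same assumption
};

// Field-by-field comparison, most significant field first.
bool toy_less(const ToyObject& a, const ToyObject& b) {
  return std::tie(a.shard, a.pool, a.hash, a.nspace, a.name) <
         std::tie(b.shard, b.pool, b.hash, b.nspace, b.name);
}

// Append the low `bytes` bytes of v, most significant byte first, so that
// byte-wise string comparison agrees with numeric comparison.
void append_be(uint64_t v, int bytes, std::string* out) {
  assert(bytes >= 1 && bytes <= 8);
  for (int i = bytes - 1; i >= 0; --i)
    out->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
}

// Concatenate the fields in the same order that toy_less compares them.
std::string toy_key(const ToyObject& o) {
  std::string k;
  append_be(o.shard, 1, &k);
  append_be(o.pool, 8, &k);
  append_be(o.hash, 4, &k);
  k += o.nspace;
  k.push_back('!');  // terminator sorts below every allowed string byte
  k += o.name;
  k.push_back('!');
  return k;
}
```

If the key fields were concatenated in a different order than the 
comparison uses, the two orderings would diverge, and a batch sorted 
with the comparator would no longer walk the key space sequentially.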

Based on testing we've done recently, it looks like we absolutely must 
strive to maintain proper ordering when doing bulk operations against 
large numbers of objects.  Given some of the issues people have seen 
with slow omap operations in some cases, I suspect we will have to audit 
all relevant areas of the code to make sure we aren't using suboptimal 
ordering anywhere.  You can see how bad it can get in the "remove" tests 
documented in these PRs:

https://github.com/ceph/ceph/pull/40351

https://github.com/ceph/ceph/pull/39976

FWIW, these tests are not actually benchmarking Ceph itself; they just 
showcase what the objectstore does when remove is performed against a 
vector of ghobject_t objects sorted in name order (object1, object2, 
object3...) instead of proper ghobject_t ordering, under various memory 
configurations. (Part of the reason the ordering matters is that it 
changes RocksDB IO patterns.) If you were to change the bluestore code 
in such a way that you made the order of bulk operations less optimal, 
you could potentially make certain parts of the code incredibly slow (or 
simply break parsing of that key if you don't update it).  I wouldn't 
suggest changing this code for your new cluster unless you are just 
experimenting for development purposes or very very sure you know what 
you are doing!
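
As a rough sketch of what "maintaining proper ordering" for a bulk 
operation could look like, the snippet below sorts a batch into key 
order before applying an operation to each entry. BatchEntry and 
apply_in_key_order are hypothetical names, the hash values are made up, 
and the three-field comparison is a simplification of the real 
ghobject_t ordering:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical, simplified object identity; not the Ceph type.
struct BatchEntry {
  uint64_t pool;
  uint32_t hash;  // stands in for the bit-reversed hash
  std::string name;
};

// Sort a copy of the batch by (pool, hash, name), then apply `op` to each
// entry in that order. The caller's vector is left untouched.
void apply_in_key_order(std::vector<BatchEntry> batch,
                        const std::function<void(const BatchEntry&)>& op) {
  std::sort(batch.begin(), batch.end(),
            [](const BatchEntry& a, const BatchEntry& b) {
              return std::tie(a.pool, a.hash, a.name) <
                     std::tie(b.pool, b.hash, b.name);
            });
  for (const auto& e : batch) op(e);
}
```

With entries named object1, object2 and object3 whose hashes are 0x9, 
0x1 and 0x5, the operation is applied to object2, then object3, then 
object1: name order and key order are generally unrelated.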

Mark

On 3/26/21 5:56 AM, 7onghc@xxxxxxxxx wrote:
Hi, I'm reading the function 'get_object_key' in src/os/bluestore/BlueStore.cc and trying to understand why the onode key fields follow this order:

- shard_id
- hobj.pool
- hobj.hash_reverse_bits
- hobj.nspace
...

Would it be reasonable to change this order for a new cluster?

I only know that RocksDB stores omap and lists objects using the prefix 'O'.
So if I move 'hobj.nspace' to the head, will listing objects in a namespace with 'rados ls -N {namespace}' be faster?

===================================================

template<typename S>
static void get_object_key(CephContext *cct, const ghobject_t& oid, S *key)
{
   key->clear();

   size_t max_len = ENCODED_KEY_PREFIX_LEN +
                   (oid.hobj.nspace.length() * 3 + 1) +
                   (oid.hobj.get_key().length() * 3 + 1) +
                    1 + // for '<', '=', or '>'
                   (oid.hobj.oid.name.length() * 3 + 1) +
                    8 + 8 + 1;
   key->reserve(max_len);

   _key_encode_prefix(oid, key);

   append_escaped(oid.hobj.nspace, key);

   if (oid.hobj.get_key().length()) {
     // is a key... could be < = or >.
     append_escaped(oid.hobj.get_key(), key);
     // (ASCII chars < = and > sort in that order, yay)
     int r = oid.hobj.get_key().compare(oid.hobj.oid.name);
     if (r) {
       key->append(r > 0 ? ">" : "<");
       append_escaped(oid.hobj.oid.name, key);
     } else {
       // same as no key
       key->append("=");
     }
   } else {
     // no key
     append_escaped(oid.hobj.oid.name, key);
     key->append("=");
   }

   _key_encode_u64(oid.hobj.snap, key);
   _key_encode_u64(oid.generation, key);

   key->push_back(ONODE_KEY_SUFFIX);
}
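
The "* 3 + 1" terms in max_len suggest that an escaped string can grow 
to at most three output bytes per input byte, plus one terminator byte. 
The sketch below is a hypothetical escaper written to show that bound; 
toy_escape, its choice of escaped byte ranges and its '!' terminator are 
assumptions for illustration, not a copy of append_escaped:

```cpp
#include <string>

// Hypothetical escaper: bytes at or below '#' become '#' plus two hex
// digits, bytes at or above '~' become '~' plus two hex digits, everything
// else is copied as-is. A '!' terminator ends the string.
std::string toy_escape(const std::string& in) {
  static const char hex[] = "0123456789abcdef";
  std::string out;
  out.reserve(in.size() * 3 + 1);  // worst case: every byte is escaped
  for (unsigned char c : in) {
    if (c <= '#' || c >= '~') {
      out.push_back(c <= '#' ? '#' : '~');
      out.push_back(hex[c >> 4]);
      out.push_back(hex[c & 0x0f]);
    } else {
      out.push_back(static_cast<char>(c));
    }
  }
  out.push_back('!');  // sorts below every byte the loop can emit
  return out;
}
```

In this toy scheme the terminator sorts below every emitted byte, so a 
string that is a prefix of another still sorts first after escaping, 
which is the property a sorted key-value store needs from a 
variable-length field.
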
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
