Re: [Gluster-users] Quick update on glusterd's volume scalability improvements

Atin Mukherjee <amukherj@xxxxxxxxxx> · Sat, 30 Mar 2019 09:46:29 +0530

On Sat, 30 Mar 2019 at 08:06, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:

On Fri, Mar 29, 2019 at 6:42 AM Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
All,

As many of you already know, the design with which GlusterD (referred to as GD1 from here on) was implemented has some fundamental scalability bottlenecks, especially in the way it handshakes configuration metadata and replicates it across all the peers. The initial design assumed that GD1 would have to deal with only a few tens of nodes/peers and volumes, so the magnitude of the scaling bottleneck it could introduce was never realized or estimated.

Ever since Gluster was adopted in the container storage world as one of the storage backends, the business needs have changed. From tens of volumes, the requirements have grown to hundreds and now to thousands. We introduced brick multiplexing, which gave some relief through better control over the memory footprint when many bricks/volumes are hosted on a node, but this wasn't enough. In one of our (I represent Red Hat) customer deployments, we saw that on a 3-node cluster, whenever the number of volumes goes beyond ~1500 and one of the storage pods gets rebooted for some reason, the handshaking takes a huge amount of time to complete. The cost is a factor not only of the n x n peer handshaking but also of iterating over the volumes, building up the dictionary and sending it over the wire. As a result, the hard timeout of an RPC request, which is 10 minutes, expires, and the cluster goes into a state where none of the CLI commands go through; they just get stuck.

With this problem around and growing demand for volume scalability, we started looking into these areas in GD1 to improve (a) volume scalability and (b) node scalability. While (b) is a separate topic for some other day, we're going to focus on (a) today.

When we took a deep dive into this volume scalability problem, we realized that most of the delay in the friend handshaking and in exchanging handshake packets between peers in the cluster came from iterating over the in-memory data structures of the volumes and putting them into the dictionary sequentially. With around 2k volumes, the function glusterd_add_volumes_to_export_dict() was quite costly and the most time-consuming part. In the pstack output taken when a glusterd instance was restarted in one of the pods, we could always see that control was iterating in this function. Based on our testing on a 16 vCPU, 32 GB RAM, 3-node cluster, this function alone took almost 7.5 minutes. The bottleneck is primarily due to the sequential iteration over volumes and the sequential updating of the dictionary with a lot of (un)necessary keys.

So what we tried first was to make this loop work on a worker thread model, so that each thread processes a range of the volume list rather than all of it, giving us more parallelism within glusterd. But with that we still didn't see any improvement, and the primary reason was that our dictionary APIs need locking. So the next idea was to make the threads work on multiple dictionaries and, once all the volumes have been iterated, merge those dictionaries into a single one. Along with these changes, a few other improvements were made: skipping the comparison of snapshots if no snapshot is available, and excluding tiering keys if the volume type is not tier. With this enhancement [1], the overall time to build the dictionary from the in-memory structures is 2 minutes 18 seconds, which is close to a 3x improvement. We firmly believe that with this improvement we should be able to scale up to 2000 volumes on a 3-node cluster, which would help our users by supporting more PVCs/volumes.
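
To make the "one dictionary per worker, merge at the end" idea concrete, here is a minimal sketch in Python. It is not the glusterd code (which is C) and not the contents of patch [1]. The key names, the volume fields and the helper names (volume_keys, build_shard, build_export_dict) are all made up for illustration. Python threads will not actually speed up CPU-bound work because of the interpreter lock, so this only shows the structure: no dictionary is ever shared between workers while they are filling it, so no lock is needed.

```python
from concurrent.futures import ThreadPoolExecutor


def volume_keys(index, volume):
    """Flatten one volume into handshake key/value pairs (hypothetical keys)."""
    prefix = f"volume{index}"
    keys = {f"{prefix}.name": volume["name"], f"{prefix}.type": volume["type"]}
    # Only emit tiering keys for tier volumes, mirroring the idea in the mail.
    if volume["type"] == "tier":
        keys[f"{prefix}.tier-hot-count"] = volume.get("hot_count", 0)
    return keys


def build_shard(start, volumes):
    """Build a private dictionary for one contiguous range of volumes."""
    shard = {}
    for offset, volume in enumerate(volumes):
        shard.update(volume_keys(start + offset, volume))
    return shard


def build_export_dict(volumes, workers=4):
    """Each worker fills its own dict; the shards are merged afterwards."""
    chunk = max(1, -(-len(volumes) // workers))  # ceiling division
    ranges = [(i + 1, volumes[i:i + chunk])
              for i in range(0, len(volumes), chunk)]
    merged = {"count": len(volumes)}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for shard in pool.map(lambda r: build_shard(*r), ranges):
            # The merge runs in the calling thread only, so it needs no lock.
            merged.update(shard)
    return merged
```

Because every key carries the volume index as a prefix, the shards never collide, and the merged result is the same regardless of how many workers were used.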

Patch [1] is still in testing and might undergo a few minor changes, but we welcome your reviews and comments on it. We plan to get this work completed, tested and released in glusterfs-7.

Last but not least, I'd like to give a shout-out to Mohit Agrawal (in cc) for all the work done on this over the last few days. Thank you, Mohit!

This sounds good! Thank you for the update on this work.

Did you ever consider using etcd with GD1 (as it is used with GD2)?

Honestly, I had thought about it a few times, but the primary reason for not going in that direction was bandwidth: such an improvement isn't a small, short-term task, and GD2 tasks were on our plate too. If any other contributors are willing to take this up, I am more than happy to collaborate and provide guidance.

Having etcd as a backing store for configuration could remove expensive handshaking as well as persistence of configuration on every node. I am interested in understanding if you are aware of any drawbacks with that approach. If there haven't been any thoughts in that direction, it might be a fun experiment to try.

Thanks,
Vijay
-- 
- Atin (atinm)
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel