All,
As many of you already know that the design logic with which GlusterD (here on to be referred as GD1) was implemented has some fundamental scalability bottlenecks at design level, especially around it's way of handshaking configuration meta data and replicating them across all the peers. While the initial design was adopted with a factor in mind that GD1 will have to deal with just few tens of nodes/peers and volumes, the magnitude of the scaling bottleneck this design can bring in was never realized and estimated.
Ever since Gluster has been adopted in container storage land as one of the storage backends, the business needs have changed. From tens of volumes, the requirements have translated to hundreds and now to thousands. We introduced brick multiplexing which had given some relief to have a better control on the memory footprint when having many number of bricks/volumes hosted in the node, but this wasn't enough. In one of our (I represent Red Hat) customer's deployment we had seen on a 3 nodes cluster, whenever the number of volumes go beyond ~1500 and for some reason if one of the storage pods get rebooted, the overall time it takes to complete the overall handshaking (not only in a factor of n X n peer handshaking but also the number of volume iterations, building up the dictionary and sending it over the write) consumes a huge time as part of the handshaking process, the hard timeout of an rpc request which is 10 minutes gets expired and we see cluster going into a state where none of the cli commands go through and get stuck.
With such problem being around and more demand of volume scalability, we started looking into these areas in GD1 to focus on improving (a) volume scalability (b) node scalability. While (b) is a separate topic for some other day we're going to focus on more on (a) today.
So what we tried out was making this loop to work on a worker thread model so that multiple threads can process a range of volume list and not all of them so that we can get more parallelism within glusterd. But with that we still didn't see any improvement and the primary reason for that was our dictionary APIs need locking. So the next idea was to actually make threads work on multiple dictionaries and then once all the volumes are iterated the subsequent dictionaries to be merged into a single one. Along with these changes there are few other improvements done on skipping comparison of snapshots if there's no snap available, excluding tiering keys if the volume type is not tier. With this enhancement [1] we see the overall time it took to complete building up the dictionary from the in-memory structure is 2 minutes 18 seconds which is close ~3x improvement. We firmly believe that with this improvement, we should be able to scale up to 2000 volumes on a 3 node cluster and that'd help our users to get benefited with supporting more PVCs/volumes.
Patch [1] is still in testing and might undergo few minor changes. But we welcome you for review and comment on it. We plan to get this work completed, tested and release in glusterfs-7.
Last but not the least, I'd like to give a shout to Mohit Agrawal (In cc) for all the work done on this for last few days. Thank you Mohit!
_______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx https://lists.gluster.org/mailman/listinfo/gluster-users