----- Original Message -----
> From: "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx>
> To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
> Cc: gluster-devel@xxxxxxxxxxx
> Sent: Tuesday, July 1, 2014 1:48:09 AM
> Subject: Re: Feature review: Improved rebalance performance
>
> > From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
> >
> > Hi Shyam,
> >
> > On Thursday 26 June 2014 14:41:13 Shyamsundar Ranganathan wrote:
> > > It also touches upon a rebalance-on-access-like mechanism where we could potentially move data out of existing bricks to a newer brick faster, in the case of brick addition, and vice versa for brick removal, and heal the rest of the data on access.
> >
> > Will this "rebalance on access" feature be enabled always, or only during a brick addition/removal to move files that do not go to the affected brick while the main rebalance is populating or removing files from the brick?
>
> The rebalance on access, in my head, stands as follows (a little more detailed than what is in the feature page):
>
> Step 1: Initiation of the process
> - Admin chooses to "rebalance _changed_" bricks
> - This could mean added/removed/changed-size bricks
> [3]- Rebalance on access is triggered, so as to move files when they are accessed, but asynchronously
> [1]- Background rebalance acts only to (re)move data to/from these bricks
> [2]- This would also change the layout for all directories to include the new configuration of the cluster, so that newer data is placed in the correct bricks
>
> Step 2: Completion of background rebalance
> - Once background rebalance is complete, the rebalance status is noted as success/failure based on what the background rebalance process did
> - This will not stop the on-access rebalance, as data is still all over the place, and enhancements like lookup-unhashed=auto will have trouble
>
> Step 3: Admin can initiate a full rebalance
> - When this is complete, the on-access rebalance would be turned off, as the cluster is rebalanced!
>
> Step 2.5/4: Choosing to stop the on-access rebalance
> - This can be initiated by the admin, post step 3 (which is more logical) or between steps 2 and 3, in which case lookup-everywhere for files etc. cannot be avoided, due to [2] above
>
> Issues and possible solutions:
>
> [4] One other thought is to create link files, as a part of [1], for files that do not belong to the right bricks but are _not_ going to be rebalanced because their source/destination is not a changed brick. This _should_ be faster than moving data around and rebalancing these files. It should also avoid the problem that, post a "rebalance _changed_" command, the cluster may have files in the wrong place based on the layout, as the link files would be present to correct the situation. In this situation the rebalance on access can be left on indefinitely, and turning it off does not serve much purpose.
>
> Enabling rebalance on access always is fine, but I am not sure it buys us Gluster states that mean the cluster is in a balanced situation, for other actions like the lookup-unhashed change mentioned above, which may need more than just the link files in place. Examples could be mismatched or overly space-committed bricks with old, not-accessed data, etc., but I do not have a clear example yet.
>
> Just stating, the core intention of "rebalance _changed_" is to create space in existing bricks faster when the cluster grows, or to be able to remove bricks from the cluster faster.
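To make [1] and [4] a bit more concrete, here is a rough sketch of the per-file decision the background crawl could make (plain Python, not Gluster code; the hash function, the layout shape and all the names are simplified stand-ins):

import hashlib

SPACE = 1 << 32

def file_hash(name):
    # Stand-in for DHT's filename hash, mapped into a 32-bit space.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % SPACE

def hashed_subvol(layout, name):
    # layout: list of (subvol, start, end) ranges covering the hash space.
    h = file_hash(name)
    for subvol, start, end in layout:
        if start <= h <= end:
            return subvol
    raise ValueError("layout does not cover the whole hash space")

def crawl_action(name, current_subvol, new_layout, changed_subvols):
    # [1]: only move data when a changed brick is the source or destination.
    # [4]: otherwise leave a link file on the new hashed subvolume so lookups
    #      still resolve, without paying for a data move.
    target = hashed_subvol(new_layout, name)
    if target == current_subvol:
        return "nothing to do"
    if current_subvol in changed_subvols or target in changed_subvols:
        return "migrate data to %s" % target
    return "create link file on %s pointing to %s" % (target, current_subvol)

With changed_subvols = {"brick4"}, for example, a file that now hashes to brick2 but lives on brick1 only gets a link file, while anything whose source or destination is brick4 is actually migrated; that is where the speed-up over a full rebalance comes from, and the link files keep the namespace consistent in the meantime.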
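And on [3], one possible shape for the on-access piece, anticipating the per-node index idea that comes up further down in this mail (again just a sketch: the index is a plain in-memory queue here, and the xattr key and trigger value are placeholders, not a statement of what DHT actually exposes):

import os
import queue
import threading

MIGRATE_XATTR = "trusted.distribute.migrate-data"   # placeholder trigger key

candidates = queue.Queue()

def on_lookup(path, cached_subvol, hashed_subvol):
    # Called from the IO path: never migrate inline, just record the candidate
    # and let the lookup/IO continue untouched.
    if cached_subvol != hashed_subvol:
        candidates.put(path)

def migration_worker():
    # Drains the candidate list (a stand-in for the per-node index) and
    # triggers migration asynchronously, roughly what a sync task driven by a
    # setxattr would do, off the IO path.
    while True:
        path = candidates.get()
        try:
            os.setxattr(path, MIGRATE_XATTR, b"force")
        except OSError:
            pass  # file deleted, already migrated, etc.; just drop it
        candidates.task_done()

threading.Thread(target=migration_worker, daemon=True).start()

The only point of the queue/index indirection is that the access path never blocks on a migration; whether the trigger is issued from the client or from a server-side crawl of an index is exactly the trade-off discussed below.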
> Redoing a "rebalance _changed_" again due to a gluster configuration change, i.e. expanding the cluster again, say, needs some thought. It does not matter whether rebalance on access is running or not; the only thing it may impact is the choice of files already put into the on-access queue based on the older layout, due to the older cluster configuration. Just noting this here.
>
> In short, if we do [4] then we can leave rebalance on access turned on always, unless we have some other counter-examples or use cases that have not been thought of. Doing [4] seems logical, so I would state that we should, but from the performance angle of improving rebalance, we need to determine its worth against the impact on IO access paths of not having [4] (again, considering the improvement that lookup-unhashed brings, it may be obvious that [4] should be done).
>
> A note on [3]: the intention is to start an asynchronous sync task that rebalances the file on access, and not impact the IO path. So if a file is identified by the IO path as needing a rebalance, then a sync task with the required xattr to trigger a file move is set up, and setxattr is called; that should take care of the file migration while letting the IO path progress as is.
>
> Reading through your mail, a better way of doing this, by sharing the load, would be to use an index, so that each node in the cluster has a list of accessed files that need a rebalance. The above method for [3] would be client-heavy and would incur a network read and write, whereas the index manner of doing things on the node could help with local reads and remote writes and in spreading the work. It would incur a walk/crawl of the index, but each entry returned is a candidate, and the walk is limited, so it should not be a bad thing by itself.
>
> > I like all the proposed ideas. I think they would improve the performance of the rebalance operation considerably. Probably we will need to define some policies to limit the amount of bandwidth that rebalance is allowed to use and at which hours, but this can be determined later.
>
> This [5] section of the feature page touches upon the same issue, i.e. being aware of IO path requirements and not letting rebalance hog the node's resources. But, as you state, it needs more thought, and probably should be done once we see some improvements and also see that we are utilizing the resources heavily.
>
> > I would also consider using the index or changelog xlators to track renames and let rebalance consume them. Currently a file or directory rename can mean that files correctly placed in the right brick need to be moved to another brick. A full rebalance crawling all the file system seems too expensive for this kind of local change (the effects of this are orders of magnitude smaller than adding or removing a brick). Having a way to list pending moves due to renames without scanning all the file system would be great.
>
> Hmmm... to my knowledge a rename of a file does not move the file; it rather creates a link file if the hashed subvolume of the new name is different from the older subvolume where the file was placed. The rename of a directory does not change its layout (unless 'a still to be analyzed' lookup races with the rename for layout fetching and healing). On any future layout fixes due to added or removed bricks, the layout overlaps are computed so as to minimize data movement.
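On that last point, that layout overlaps are computed to minimize data movement: a tiny illustration of why the placement of the new brick's range matters (equal-sized ranges, made-up brick names and simple hash-space sampling; this is not the actual DHT self-heal code):

SPACE = 1 << 32

def layout(subvols):
    # Split the hash space into equal contiguous ranges, one per subvolume.
    n = len(subvols)
    step = SPACE // n
    return [(sv, i * step, SPACE - 1 if i == n - 1 else (i + 1) * step - 1)
            for i, sv in enumerate(subvols)]

def owner(lay, h):
    return next(sv for sv, start, end in lay if start <= h <= end)

def moved_fraction(old, new, samples=10000):
    # Fraction of sampled hash values whose owning subvolume changed, i.e.
    # roughly the fraction of files a rebalance would have to move.
    hashes = range(0, SPACE, SPACE // samples)
    return sum(owner(old, h) != owner(new, h) for h in hashes) / len(hashes)

old = layout(["b1", "b2", "b3"])
appended = layout(["b1", "b2", "b3", "b4"])    # new brick tacked on at the end
overlapped = layout(["b1", "b4", "b2", "b3"])  # new brick placed to preserve overlap

print(moved_fraction(old, appended))    # roughly 0.50
print(moved_fraction(old, overlapped))  # roughly 0.33

Going from three equal ranges to four, simply appending the new brick's range reassigns roughly half of the hash space, while placing it where it disturbs the old assignment least reassigns only about a third; that difference is what maximizing overlap buys during a rebalance.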
> Are you suggesting a change in behavior here, or am I missing something?
>
> > Another thing to consider for future versions is to modify the current DHT to a consistent hashing, and even the hash value (using the gfid instead of a hash of the name would solve the rename problem). The consistent hashing would drastically reduce the number of files that need to be moved and already solves some of the current problems. This change needs a lot of thinking though.
>
> Firstly, I agree that this is an area to explore and nail down better in the _hopefully_ near future, and that it takes some thinking time to get this straight while learning from the current implementation.
>
> Also, I would like to point to a commit that changes this for directories, using the GFID-based hash rather than the name-based hash, here [6].

I don't think this is what Xavi meant. This only changes how hash ranges are distributed across subvolumes. To decide which subvolume a file goes to, we still hash on the name. We cannot use the gfid, for the reasons I have pointed out in another mail.

> It does not address the rename problem, but it starts to do things along the lines you put down here.
>
> > Xavi
>
> [5] http://www.gluster.org/community/documentation/index.php/Features/improve_rebalance_performance#Make_rebalance_aware_of_IO_path_requirements
> [6] http://review.gluster.org/#/c/7493/
>
> Shyam

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel