OK, no current DHT workaround… Wasn’t there a xlator that would tend to put files on the local brick (maybe with NFS mount)?

BR
Jan

On 2014/11/26, 1:15 AM, "Shyam" <srangana@xxxxxxxxxx> wrote:

>On 11/25/2014 05:03 PM, Anand Avati wrote:
>>
>> On Tue Nov 25 2014 at 1:28:59 PM Shyam <srangana@xxxxxxxxxx> wrote:
>>
>> On 11/12/2014 01:55 AM, Anand Avati wrote:
>> >
>> > On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy <jdarcy@xxxxxxxxxx> wrote:
>> >
>> > (Personally I would have done this by "mixing in" the parent GFID to the hash calculation, but that alternative was ignored.)
>> >
>> > Actually when DHT was implemented, the concept of GFID did not (yet) exist. Due to backward compatibility it has just remained this way even later. Including the GFID into the hash has benefits.
>>
>> I am curious here as this is interesting.
>>
>> So the layout start subvol assignment for a directory to be based on its GFID was provided so that files with the same name distribute better than ending up in the same bricks, right?
>>
>> Right, e.g. we wouldn't want all the README.txt files in the various directories of a volume to end up on the same server. The way it is achieved today is: the per-server hash-range assignment is "rotated" by a certain amount (how much it is rotated is determined by a separate hash on the directory path) at the time of mkdir.
>>
>> Instead, as we _now_ have GFID, we could use that, including the name, to get a similar/better distribution, or use GFID+name to determine the hashed subvol.
>>
>> What we could do now is include the parent directory gfid as an input into the DHT hash function.
>>
>> Today, we do approximately:
>>
>>     int hashval = dm_hash ("readme.txt")
>>     hash_ranges[] = inode_ctx_get (parent_dir)
>>     subvol = find_subvol (hash_ranges, hashval)
>>
>> Instead, we could:
>>
>>     int hashval = new_hash ("readme.txt", parent_dir.gfid)
>>     hash_ranges[] = global_value
>>     subvol = find_subvol (hash_ranges, hashval)
>>
>> The idea here would be that on dentry creates we would need to generate the GFID and not let the bricks generate the same, so that we can choose the subvol to wind the FOP to.
>>
>> The GFID would be that of the parent (as an entry name is always in the context of a parent directory/inode). Also, the GFID for a new entry is already generated by the client; the brick does not generate a GFID.
>>
>> This eliminates the need for a layout per sub-directory and all the (interesting) problems that come with it, and instead can be replaced by a layout at root. Not sure if it handles all the use cases and paths that we have now (which needs more understanding).
>>
>> I do understand there is a backward compatibility issue here, but other than this, it sounds better than the current scheme, as there is a single layout to read/optimize/stash/etc. across clients.
>>
>> Can I understand the rationale of this better, as to what you folks are thinking? Am I missing something, or over-reading the benefits that this can provide?
>>
>> I think you understand it right. The benefit is that one could have a single hash layout for the entire volume, and the directory "specific-ness" is implemented by including the directory gfid into the hash function.
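
For concreteness, here is a minimal, compilable sketch of the "single global layout, parent GFID mixed into the hash" variant described in the pseudocode above. It is only an illustration: the FNV-1a hash, the even four-way split of the 32-bit hash space, and names such as name_gfid_hash() and find_subvol() are stand-ins, not GlusterFS's actual dm_hash() or DHT layout code.

    /* Minimal sketch of "one volume-wide layout, parent GFID mixed into the
     * hash".  NOT GlusterFS code: the FNV-1a hash, the even split of the
     * 32-bit hash space and all names here are illustrative stand-ins for
     * the dm_hash()/new_hash()/find_subvol() pseudocode above. */

    #include <stdint.h>
    #include <stdio.h>

    #define GFID_LEN     16   /* a GFID is a 16-byte UUID               */
    #define SUBVOL_COUNT  4   /* hypothetical number of DHT subvolumes  */

    /* FNV-1a over the parent GFID bytes followed by the entry name. */
    static uint32_t name_gfid_hash(const unsigned char *pgfid, const char *name)
    {
        uint32_t h = 2166136261u;

        for (int i = 0; i < GFID_LEN; i++) {
            h ^= pgfid[i];
            h *= 16777619u;
        }
        for (const char *p = name; *p; p++) {
            h ^= (unsigned char)*p;
            h *= 16777619u;
        }
        return h;
    }

    /* Single global layout: the 32-bit hash space split evenly across the
     * subvolumes, with no per-directory layout xattr involved. */
    static int find_subvol(uint32_t hashval)
    {
        uint32_t range = 0xffffffffu / SUBVOL_COUNT + 1;

        return (int)(hashval / range);
    }

    int main(void)
    {
        unsigned char dir_a[GFID_LEN] = { 0x11 };  /* two fake parent GFIDs */
        unsigned char dir_b[GFID_LEN] = { 0x7e };

        /* The same name under different parents no longer necessarily maps
         * to the same subvolume, which is the distribution property being
         * discussed. */
        printf("README.txt in dir A -> subvol %d\n",
               find_subvol(name_gfid_hash(dir_a, "README.txt")));
        printf("README.txt in dir B -> subvol %d\n",
               find_subvol(name_gfid_hash(dir_b, "README.txt")));
        return 0;
    }

Compiled standalone with any C99 compiler, the two lookups need no per-directory state at all; only find_subvol()'s notion of the global ranges would have to change when bricks are added or removed.
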
>>
>> The way I see it, the compromise would be something like:
>>
>> Pro per-directory range: By having per-directory hash ranges, we can do easier incremental rebalance. Partial progress is well tolerated and does not impact the entire volume. While a given directory is undergoing rebalance, we need to enter "unhashed lookup" mode for that directory alone, and only for that period of time.
>>
>> Con per-directory range: Just the new "hash assignment" phase (which impacts placement of new files/data, not movement of old data) is itself an extended process, crawling the entire volume with complex per-directory operations. The number of points in the system where things can "break" (i.e., result in overlaps and holes in ranges) is high.
>>
>> Pro single layout with dir GFID in hash: Avoids the numerous parts (per-dir hash ranges) that can potentially "break".
>>
>> Con single layout with dir GFID in hash: Rebalance phase 1 (assigning the new layout) is atomic for the entire volume - unhashed lookup has to be "on" for all dirs for the entire period. To mitigate this, we could explore versioning the centralized hash ranges, and store the version used by each directory in its xattrs (and update the version as the rebalance progresses). But now we have more centralized metadata (may or may not be a worthy compromise - not sure).
>
>Agreed, the auto-unhashed option would have to wait longer before being re-armed.
>
>Just throwing some more thoughts on the same:
>
>Unhashed-auto can also benefit from just linkto creations, rather than requiring a data rebalance (i.e., movement of data). So in phase-0 we could just create the linkto files and then turn on auto-unhashed, as lookups would then find the (linkto) file.
>
>Other abilities, like giving directories weighted layout ranges based on the size of bricks, could be affected - i.e., it would force a rebalance when a brick's size is increased, as that would need a root layout change, rather than newly created directories simply getting the better weights.
>
>> In summary, including GFID into the hash calculation does open up interesting possibilities and is worthy of serious consideration.
>
>Yes, something to consider for Gluster 4.0 (or earlier, if done right with backward compatibility handled).
>
>Thanks,
>Shyam
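
To illustrate the versioning mitigation floated in the "Con single layout" paragraph above, here is a rough sketch. The structures, names and the xattr are hypothetical, not existing GlusterFS code; real code would also have to carry the actual hash ranges for each version.

    /* Sketch of the "versioned central layout" mitigation: the volume-wide
     * layout carries a version, each directory records the version it was
     * last laid out/rebalanced against, and only lagging directories need
     * the expensive unhashed (fan-out) lookup.  All names are made up. */

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct volume_layout {
        uint32_t version;       /* bumped when the global ranges change */
        int      subvol_count;  /* the ranges themselves omitted here   */
    };

    struct dir_info {
        const char *path;
        uint32_t    layout_version;  /* would live in a per-directory xattr,
                                        e.g. a hypothetical
                                        "trusted.dht.layout-version" */
    };

    /* Directories already migrated to the current layout can trust the
     * hashed subvolume; everyone else falls back to broadcast lookup. */
    static bool needs_unhashed_lookup(const struct volume_layout *vol,
                                      const struct dir_info *dir)
    {
        return dir->layout_version < vol->version;
    }

    int main(void)
    {
        struct volume_layout vol = { .version = 3, .subvol_count = 4 };
        struct dir_info dirs[] = {
            { "/exports/a", 3 },   /* already rebalanced against v3     */
            { "/exports/b", 2 },   /* rebalance has not reached it yet  */
        };

        for (unsigned i = 0; i < sizeof(dirs) / sizeof(dirs[0]); i++)
            printf("%s: %s lookup\n", dirs[i].path,
                   needs_unhashed_lookup(&vol, &dirs[i])
                       ? "unhashed (fan-out)" : "hashed");
        return 0;
    }

As noted in the thread, the cost is extra centralized metadata: the current layout version and its ranges have to be available to every client.
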