Re: Single layout at root (Was EHT / DHT)


 



On 11/25/2014 05:03 PM, Anand Avati wrote:


On Tue Nov 25 2014 at 1:28:59 PM Shyam <srangana@xxxxxxxxxx> wrote:

    On 11/12/2014 01:55 AM, Anand Avati wrote:
     >
     >
     > On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy <jdarcy@xxxxxxxxxx> wrote:
     >
     >       (Personally I would have
     >     done this by "mixing in" the parent GFID to the hash
    calculation, but
     >     that alternative was ignored.)
     >
     >
     > Actually when DHT was implemented, the concept of GFID did not (yet)
     > exist. Due to backward compatibility it has just remained this
    way even
     > later. Including the GFID into the hash has benefits.

    I am curious here as this is interesting.

    So the layout's start-subvol assignment for a directory being based on
    its GFID was introduced so that files with the same name distribute
    better, rather than all ending up on the same bricks, right?


Right; e.g. we wouldn't want all the README.txt files in the various
directories of a volume to end up on the same server. The way it is
achieved today is that the per-server hash-range assignment is "rotated"
by a certain amount at the time of mkdir (how much it is rotated is
determined by a separate hash on the directory path).
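
Roughly, it could be sketched like this (the names and exact arithmetic
are illustrative only, not the actual DHT code):

   /* Assign per-directory hash ranges at mkdir time.  Each subvol gets an
    * equal slice of the 32-bit hash space, and the slices are "rotated"
    * across subvols by an amount derived from a hash of the directory
    * path.  Remainder handling etc. omitted for brevity. */
   void assign_dir_layout (const char *dir_path, int num_subvols,
                           uint32_t ranges[][2])
   {
           uint32_t chunk  = 0xffffffffu / num_subvols;
           int      rotate = dm_hash (dir_path) % num_subvols;
           int      i, slot;

           for (i = 0; i < num_subvols; i++) {
                   slot = (i + rotate) % num_subvols;
                   ranges[slot][0] = i * chunk;            /* range start */
                   ranges[slot][1] = (i + 1) * chunk - 1;  /* range end   */
           }
   }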

    Instead, as we _now_ have GFIDs, we could use the parent GFID along with
    the name to get a similar or better distribution, i.e. use GFID+name to
    determine the hashed subvol.

What we could do now is include the parent directory's GFID as an input
to the DHT hash function.

Today, we do approximately:
   int hashval = dm_hash ("readme.txt");          /* hash of the name only     */
   hash_ranges[] = inode_ctx_get (parent_dir);    /* per-directory hash ranges */
   subvol = find_subvol (hash_ranges, hashval);

Instead, we could:
   int hashval = new_hash ("readme.txt", parent_dir.gfid);  /* name + parent GFID     */
   hash_ranges[] = global_value;                            /* one volume-wide layout */
   subvol = find_subvol (hash_ranges, hashval);
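
One possible shape for new_hash(), purely as a sketch (the GFID encoding
and the choice of hash function would need real thought; uuid_unparse()
is used here only for illustration):

   #include <stdio.h>
   #include <stdint.h>
   #include <uuid/uuid.h>

   uint32_t dm_hash (const char *name);  /* the existing name hash from the
                                            pseudocode above */

   /* Mix the parent GFID into the name hash by prefixing the name with the
    * GFID's string form, so the same name under different directories
    * hashes differently.  Illustrative only; not a proposed format. */
   uint32_t new_hash (const char *name, const uuid_t parent_gfid)
   {
           char gfid_str[37];
           char key[4096];

           uuid_unparse (parent_gfid, gfid_str);
           snprintf (key, sizeof (key), "%s/%s", gfid_str, name);

           return dm_hash (key);
   }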

    The idea here would be that on dentry creates we would need to generate
    the GFID on the client, rather than letting the bricks generate it, so
    that we can choose the subvol to wind the FOP to.


The GFID used would be that of the parent (as an entry name is always in
the context of a parent directory/inode). Also, the GFID for a new entry
is already generated by the client; the brick does not generate a GFID.

    This eliminates the need for a layout per sub-directory, and all the
    (interesting) problems that come with it, replacing them with a single
    layout at the root. I am not sure it handles all the use cases and code
    paths we have now (which needs more understanding).

    I do understand there is a backward-compatibility issue here, but other
    than that, this sounds better than the current scheme, as there is a
    single layout to read/optimize/stash/etc. across clients.

    Can I understand the rationale for this better, i.e. what you folks are
    thinking? Am I missing something, or over-reading the benefits this can
    provide?


I think you understand it right. The benefit is that one could have a
single hash layout for the entire volume, with the directory
"specific-ness" implemented by including the directory GFID in the hash
function. The way I see it, the compromise would be something like:

Pro per-directory ranges: with per-directory hash ranges, we can do
easier incremental rebalance. Partial progress is well tolerated and
does not impact the entire volume. While a given directory is undergoing
rebalance, we need to enter "unhashed lookup" mode for that directory
alone, and only for that period of time.

Con per-directory ranges: even just the new "hash assignment" phase
(which affects placement of new files/data, without moving old data) is
an extended process, crawling the entire volume with complex
per-directory operations. The number of points in the system where
things can "break" (i.e., result in overlaps and holes in the ranges) is
high.

Pro single layout with dir GFID in hash: avoids the numerous pieces
(per-directory hash ranges) which can potentially "break".

Con single layout with dir GFID in hash: rebalance phase 1 (assigning
the new layout) is atomic for the entire volume - unhashed lookup has to
be "on" for all directories for the entire period. To mitigate this, we
could explore versioning the centralized hash ranges, and storing the
version used by each directory in its xattrs (updating the version as
the rebalance progresses). But then we have more centralized metadata
(may or may not be a worthy compromise - not sure).
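
To make the versioning idea concrete, a lookup could go something like
this (all names, xattr keys and helpers here are hypothetical):

   /* Pick the subvol for an entry using the hash-range version that the
    * parent directory was last balanced against.  The xattr key, types
    * and helper functions are hypothetical, just to illustrate the idea. */
   struct dht_layout_version {
           uint32_t version;
           uint32_t ranges[MAX_SUBVOLS][2];
   };

   xlator_t *
   select_subvol (dir_inode_t *parent, const char *name)
   {
           uint32_t                    ver;
           struct dht_layout_version  *lv;
           uint32_t                    hashval;

           /* version stamped on the directory as rebalance progresses */
           ver = get_xattr_u32 (parent, "trusted.dht.layout-version");

           /* centralized, versioned hash ranges for the whole volume */
           lv = lookup_global_layout (ver);

           hashval = new_hash (name, parent->gfid);
           return find_subvol (lv->ranges, hashval);
   }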

Agreed, auto-unhashed would have to wait longer before being re-armed.

Just throwing in some more thoughts on the same:

Auto-unhashed can also benefit from just creating linkto files, rather
than requiring a data rebalance (i.e. actual movement of data). So in
phase 0 we could just create the linkto files and then turn auto-unhashed
back on, as lookups would then find the (linkto) file.
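
A rough sketch of that phase-0 pass (all names here are illustrative, not
existing DHT functions):

   /* Phase 0: for every entry whose new hashed subvol differs from the
    * subvol its data currently lives on, create a zero-byte linkto file
    * on the new hashed subvol pointing at the cached subvol.  No data is
    * moved; once this completes, auto-unhashed lookups can be re-enabled. */
   void
   phase0_create_linktos (dir_inode_t *dir)
   {
           entry_t *entry;

           foreach_entry (dir, entry) {
                   xlator_t *new_hashed = select_subvol (dir, entry->name);
                   xlator_t *cached     = entry->cached_subvol;

                   if (new_hashed != cached)
                           create_linkto (new_hashed, dir, entry->name, cached);
           }
   }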

Other abilities, like giving directories weighted layout ranges based on
brick sizes, could be affected: increasing a brick's size would force a
rebalance, as it would need a change to the root layout, rather than just
the newly created directories picking up the better weights.
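
For reference, size-weighted ranges with a single root layout would look
roughly like this (illustrative sketch only), which is why any change in
brick sizes touches every directory at once:

   /* Give each subvol a share of the 32-bit hash space proportional to
    * its brick size.  With a single root layout, changing any sizes[]
    * entry changes every range, and so affects all directories at once. */
   void
   weighted_ranges (const uint64_t sizes[], int n, uint32_t ranges[][2])
   {
           uint64_t total = 0;
           uint32_t start = 0;
           int      i;

           for (i = 0; i < n; i++)
                   total += sizes[i];

           for (i = 0; i < n; i++) {
                   uint32_t span = (uint32_t) (0xffffffffu *
                                               ((double) sizes[i] / total));
                   ranges[i][0] = start;
                   ranges[i][1] = (i == n - 1) ? 0xffffffffu : start + span;
                   start = ranges[i][1] + 1;
           }
   }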


In summary, including the GFID in the hash calculation does open up
interesting possibilities and is worthy of serious consideration.

Yes, something to consider for Gluster 4.0 (or earlier, if done right
with backward compatibility handled).

Thanks,
Shyam
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel



