Re: Single layout at root (Was EHT / DHT)

Anand Avati <avati@xxxxxxxxxxx> · Tue, 25 Nov 2014 22:03:40 +0000

On Tue Nov 25 2014 at 1:28:59 PM Shyam <srangana@xxxxxxxxxx> wrote:
On 11/12/2014 01:55 AM, Anand Avati wrote:

> On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy <jdarcy@xxxxxxxxxx> wrote:
>
>     (Personally I would have
>     done this by "mixing in" the parent GFID to the hash calculation, but
>     that alternative was ignored.)
>
> Actually when DHT was implemented, the concept of GFID did not (yet)
> exist. Due to backward compatibility it has just remained this way even
> later. Including the GFID into the hash has benefits.

I am curious here as this is interesting.

So the layout start subvol assignment for a directory to be based on its GFID was provided so that files with the same name distribute better than ending up in the same bricks, right?

Right. For example, we wouldn't want every README.txt in the various directories of a volume to end up on the same server. The way this is achieved today is that the per-server hash-range assignment is "rotated" by a certain amount at mkdir time; how much it is rotated is determined by a separate hash of the directory path.
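To make the rotation concrete, here is a minimal Python sketch, not the actual DHT code. It assumes a 32-bit hash space cut into equal contiguous ranges, models the "rotation" as a shift in which server receives which range, and uses CRC32 of the directory path as a stand-in for the "separate hash". The names build_layout and HASH_SPACE are made up for illustration.

```python
import zlib

HASH_SPACE = 1 << 32  # assumption: hashes are 32-bit values


def build_layout(subvols, dir_path):
    """Split the hash space into equal ranges and rotate their owners.

    Returns a list of (start, end_inclusive, subvol) tuples sorted by start.
    """
    n = len(subvols)
    # Stand-in for the "separate hash on the directory path".
    rotation = zlib.crc32(dir_path.encode("utf-8")) % n
    chunk = HASH_SPACE // n
    layout = []
    for i in range(n):
        start = i * chunk
        # The last range absorbs the remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == n - 1 else start + chunk - 1
        layout.append((start, end, subvols[(i + rotation) % n]))
    return layout
```

With this model, two directories whose paths hash to different rotations send the same file name to different servers, even though the hash of the name itself is identical.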

Instead as we _now_ have GFID, we could use that including the name to get a similar/better distribution, or GFID+name to determine hashed subvol.

What we could do now is include the parent directory's GFID as an input to the DHT hash function.

Today, we do approximately:
  int hashval = dm_hash ("readme.txt")
  hash_ranges[] = inode_ctx_get (parent_dir)
  subvol = find_subvol (hash_ranges, hashval)

Instead, we could:
  int hashval = new_hash ("readme.txt", parent_dir.gfid)
  hash_ranges[] = global_value
  subvol = find_subvol (hash_ranges, hashval)
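The two variants above can be sketched as runnable Python. This is only an illustration under stated assumptions: name_hash and name_gfid_hash are stand-ins for dm_hash and new_hash (CRC32 and truncated SHA-1 were picked for convenience, not because DHT uses them), a layout is modeled as a sorted list of (start, end, subvol) tuples, and per-directory layouts are kept in a plain dict keyed by path instead of the inode context.

```python
import bisect
import hashlib
import uuid
import zlib


def find_subvol(layout, hashval):
    """Return the subvol whose inclusive range contains hashval.

    layout is a list of (start, end, subvol) tuples sorted by start.
    """
    starts = [start for start, _end, _subvol in layout]
    idx = bisect.bisect_right(starts, hashval) - 1
    start, end, subvol = layout[idx]
    if not (start <= hashval <= end):
        raise LookupError("hash falls in a hole in the layout")
    return subvol


def name_hash(name):
    # Stand-in for dm_hash: a 32-bit hash of the name alone.
    return zlib.crc32(name.encode("utf-8"))


def name_gfid_hash(name, parent_gfid):
    # Stand-in for new_hash: mixes the parent's 16-byte GFID into the input.
    digest = hashlib.sha1(parent_gfid.bytes + name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def lookup_today(per_dir_layouts, parent_path, name):
    # Per-directory layout, hash of the name only.
    return find_subvol(per_dir_layouts[parent_path], name_hash(name))


def lookup_proposed(global_layout, parent_gfid, name):
    # One volume-wide layout, hash of (name, parent GFID).
    return find_subvol(global_layout, name_gfid_hash(name, parent_gfid))


if __name__ == "__main__":
    layout = [(0, 2**31 - 1, "server-a"), (2**31, 2**32 - 1, "server-b")]
    for _ in range(3):
        # Same name under three different parents, one shared layout.
        print(lookup_proposed(layout, uuid.uuid4(), "README.txt"))
```

The point of the second variant is that the per-directory variation moves out of the layout and into the hash input, so only one layout has to be stored and kept consistent.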

The idea here would be that on dentry creates we would need to generate the GFID and not let the bricks generate the same, so that we can choose the subvol to wind the FOP to.

The GFID would be that of the parent (as an entry name is always in the context of a parent directory/inode). Also, the GFID for a new entry is already generated by the client; the brick does not generate it.

This eliminates the need for a layout per sub-directory and all the (interesting) problems that it comes with and instead can be replaced by a layout at root. Not sure if it handles all use cases and paths that we have now (which needs more understanding).

I do understand there is a backward compatibility issue here, but other than this, this sounds better than the current scheme, as there is a single layout to read/optimize/stash/etc. across clients.

Can I understand the rationale of this better, as to what you folks are thinking. Am I missing something or over reading on the benefits that this can provide?

I think you understand it right. The benefit is that one could have a single hash layout for the entire volume, with the directory "specific-ness" implemented by including the directory GFID in the hash function. The way I see it, the trade-offs would be something like:

Pro per directory range: Per-directory hash ranges make incremental rebalance easier. Partial progress is well tolerated and does not impact the entire volume. While a given directory is undergoing rebalance, only that directory needs to enter "unhashed lookup" mode, and only for that period of time.

Con per directory range: Even the new "hash assignment" phase (which only affects placement of new files/data and does not move old data) is an extended process that crawls the entire volume with complex per-directory operations. The number of points in the system where things can "break" (i.e., result in overlaps and holes in ranges) is high.

Pro single layout with dir GFID in hash: Avoids the numerous parts (per-dir hash ranges) that can potentially "break".

Con single layout with dir GFID in hash: Rebalance phase 1 (assigning the new layout) is atomic for the entire volume, so unhashed lookup has to be "on" for all dirs for the entire period. To mitigate this, we could explore versioning the centralized hash ranges and storing the version used by each directory in its xattrs (updating the version as the rebalance progresses). But then we have more centralized metadata, which may or may not be a worthy compromise; I'm not sure.
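One possible reading of the versioning idea, as a rough Python sketch. Everything here is hypothetical: the class and function names, the "layout-version" and "rebalancing" keys, and the use of a dict in place of real xattrs. Each directory keeps using the layout version recorded on it until rebalance reaches it, so only the directory currently being migrated needs unhashed lookup.

```python
VERSION_KEY = "layout-version"  # made-up xattr name


class VersionedLayouts:
    """Volume-wide hash layouts keyed by version number."""

    def __init__(self, initial_layout):
        self.layouts = {1: initial_layout}
        self.current = 1

    def publish(self, new_layout):
        # Phase 1: register a new layout without touching any directory.
        self.current += 1
        self.layouts[self.current] = new_layout
        return self.current


def layout_for(versions, dir_xattrs):
    # A directory keeps the layout version recorded in its xattrs.
    return versions.layouts[dir_xattrs.get(VERSION_KEY, 1)]


def rebalance_dir(versions, dir_xattrs, migrate):
    """Move one directory to the current layout and record the new version.

    migrate(old_layout, new_layout) is a caller-supplied callback that
    moves the directory's files; it is a placeholder for the real work.
    """
    old_layout = layout_for(versions, dir_xattrs)
    new_layout = versions.layouts[versions.current]
    dir_xattrs["rebalancing"] = True  # unhashed lookup for this dir only
    migrate(old_layout, new_layout)
    dir_xattrs[VERSION_KEY] = versions.current
    dir_xattrs["rebalancing"] = False
```

The cost named above is visible in the sketch: the table of layouts is volume-wide state that every client must agree on, and old versions have to be retained for as long as any directory still refers to them.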

In summary, including the GFID in the hash calculation does open up interesting possibilities and is worthy of serious consideration.

HTH,
Avati
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel