Re: GFID to Path Conversion

regards
Aravinda

On 01/11/2016 08:00 PM, Shyam wrote:
On 01/06/2016 04:46 AM, Aravinda wrote:

regards
Aravinda

On 01/06/2016 02:49 AM, Shyam wrote:
On 12/09/2015 12:47 AM, Aravinda wrote:
Hi,

Sharing a draft design for GFID to Path conversion. (Directory GFID to
path is very easy in DHT v1; this design may not work in the case of
DHT 2.0.)

(current thought) DHT2 would extend to directories the manner in which
the name,pGFID pair is stored for files. So reverse path walking would
leverage the same mechanism as explained below.

Of course, as this would involve MDS hopping, the intention would be
to *not* use this in IO critical paths, and rather use this in the
tool set that needs reverse path walks to provide information to admins.


Performance and storage space impact are yet to be analyzed.

Storing the required information
--------------------------------
Metadata related to the parent GFID and basename will reside with the
file. The PGFID and a hash of the basename become part of the xattr
key name, and the basename is saved as the value.

     Xattr Key = meta.<PGFID>.<HASH(BASENAME)>
     Xattr Value = <BASENAME>

I would think we should keep the xattr name constant, and specialize
the value, instead of encoding data in the xattr name itself. The
issue is of course that multiple xattr name:value pairs with a
constant name are not feasible, which needs some thought.
If we use a single xattr for multiple values, then updating one basename
will require parsing the existing xattr before the update (in the case of hardlinks).
I wrote about other experiments done to update and read xattrs here:
http://www.gluster.org/pipermail/gluster-devel/2015-December/047380.html

Agreed and understood. I am more thinking about how we will enumerate all such xattrs when we only know the name prefix. We would probably do a listxattr in that case, would that be right?
To create or update the xattr, no search is required. For example, to create d1/d2/f1:

pgfid = get_gfid(d1/d2)
xattr_name = "meta." + pgfid + "." + HASH(f1)
value = "f1"
setxattr(d1/d2/f1, xattr_name, value)

In case of Rename(d1/d2/f1 => d1/d3/f3),
pgfid_old = get_gfid(d1/d2)
pgfid_new = get_gfid(d1/d3)
xattr_name_old = "meta." + pgfid_old + "." + HASH(f1)
xattr_name_new = "meta." + pgfid_new + "." + HASH(f3)
value_new = "f3"
removexattr(d1/d2/f1, xattr_name_old)
setxattr(d1/d3/f3, xattr_name_new, value_new)
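
For concreteness, here is a minimal runnable sketch of the create and rename updates above, operating directly on a brick backend path. Assumptions not in the original mail: crc32 stands in for the non-crypto hash, and the parent GFID is read from the directory's trusted.gfid xattr; the helper names are illustrative only.

import os
import uuid
import zlib

def get_gfid(path):
    # Assumption: trusted.gfid on the brick backend holds the raw 16-byte GFID
    return str(uuid.UUID(bytes=os.getxattr(path, "trusted.gfid")))

def meta_xattr_name(pgfid, basename):
    # Key = meta.<PGFID>.<HASH(BASENAME)>; crc32 is only an illustrative non-crypto hash
    return "meta.%s.%08x" % (pgfid, zlib.crc32(basename.encode()) & 0xffffffff)

def record_create(parent_dir, basename):
    # MKNOD/CREATE/LINK/SYMLINK: add the (PGFID, BN) xattr on the new name
    pgfid = get_gfid(parent_dir)
    path = os.path.join(parent_dir, basename)
    os.setxattr(path, meta_xattr_name(pgfid, basename), basename.encode())

def record_rename(old_parent, old_name, new_parent, new_name):
    # Applied after the rename has happened: drop the old key, add the new one
    new_path = os.path.join(new_parent, new_name)
    os.removexattr(new_path, meta_xattr_name(get_gfid(old_parent), old_name))
    os.setxattr(new_path, meta_xattr_name(get_gfid(new_parent), new_name),
                new_name.encode())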

Example of populating the xattrs: https://gist.github.com/aravindavk/5307489f68cbcfb37d3d

Each xattr can be handled independently (thread safe), since one xattr's key/value does not depend on another's basename.

To read the xattrs and convert them to paths (Python example: https://gist.github.com/aravindavk/d1d0ca9c874b7d3d8d86):

paths = []
all_xattrs = listxattr(PATH)
for xattr_name in all_xattrs:
    if xattr_name.startswith("meta."):
        paths.append(getxattr(PATH, xattr_name))
print(paths)



A non-crypto hash is suitable for this purpose.
Number of xattrs on a file = number of links.

Converting GFID to Path
-----------------------
Example GFID: 78e8bce0-a8c9-4e67-9ffb-c4c4c7eff038

Here is where we get into a bit of a problem if a file has links:
which path to follow would be a dilemma. We could return all paths,
but tools like glusterfind or backup-related ones would prefer a single
path. One thought is that we could feed a pGFID:GFID pair as input,
but this still does not solve the case of a file having links within
the same pGFID.

Anyway, something to note or consider.


1. List all xattrs of the GFID file in the brick backend
($BRICK_ROOT/.glusterfs/78/e8/78e8bce0-a8c9-4e67-9ffb-c4c4c7eff038).
2. If an xattr key starts with "meta.", split it to get the parent GFID and
collect the xattr value.
3. Convert the parent GFID to a path using recursive readlink until the full path is resolved.

This is the part which should/would change with DHT2 in my opinion.
Sort of repeating step (2) here instead of a readlink.

4. Join the converted parent directory path and the xattr value (basename).
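
Putting steps 1-4 together, here is a minimal sketch for DHT v1 on a single brick. Assumptions not in the original mail: directory GFID entries under .glusterfs are symlinks that os.path.realpath() can resolve recursively, and the keys follow the meta.<PGFID>.<HASH> layout described earlier.

import os

def gfid_backend_path(brick_root, gfid):
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)

def dir_gfid_to_path(brick_root, pgfid):
    # Step 3: directory GFIDs are symlinks; realpath follows them recursively
    return os.path.realpath(gfid_backend_path(brick_root, pgfid))

def gfid_to_paths(brick_root, gfid):
    paths = []
    gfid_path = gfid_backend_path(brick_root, gfid)
    for key in os.listxattr(gfid_path):                    # step 1
        if not key.startswith("meta."):                    # step 2
            continue
        pgfid = key.split(".")[1]                          # meta.<PGFID>.<HASH>
        basename = os.getxattr(gfid_path, key).decode()
        paths.append(os.path.join(dir_gfid_to_path(brick_root, pgfid),
                                  basename))               # step 4
    return paths

With hardlinks this returns every recorded path, which ties into the single-path-vs-all-paths question raised above.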

Recording
---------
MKNOD/CREATE/LINK/SYMLINK: Add new Xattr(PGFID, BN)

Most of these operations as they exist today are not atomic, i.e. we
create the file, then add the xattrs, and then possibly hardlink the
GFID, so by the time the GFID makes its presence, the file is all
ready and (maybe) hence consistent.

The other way to look at this is that we get the GFID representation
ready, and then hard link the name into the name tree. Alternatively,
we could leverage O_TMPFILE to create the file, encode all its inode
information, and then bring it to life in the namespace. This is
orthogonal to this design, but brings in the need to be consistent on
failures.

Either way, if a failure occurs midway, we have no way to recover the
information for the inode and set it right. Thoughts?

RENAME: Remove old xattr(PGFID1, BN1), Add new xattr(PGFID2, BN2)
UNLINK: If Link count > 1 then Remove xattr(PGFID, BN)
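
A minimal sketch of the UNLINK rule, using the hypothetical helpers from the earlier sketch; the link-count check simply mirrors the rule as stated (on a real brick the extra .glusterfs hardlink would also have to be accounted for).

import os

def record_unlink(parent_dir, basename):
    path = os.path.join(parent_dir, basename)
    if os.stat(path).st_nlink > 1:
        # other links remain, so only this name's (PGFID, BN) xattr is removed
        os.removexattr(path, meta_xattr_name(get_gfid(parent_dir), basename))
    os.unlink(path)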

Heal on Lookup
--------------
Healing on lookup can be enabled if required; by default we can
disable this option, since it may have performance implications
during reads.

Enabling the logging
---------------------
This can be enabled using a volume set option. Option name TBD.

Rebuild Index
-------------
An offline activity that crawls the backend filesystem and builds all the
required xattrs.
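
A minimal sketch of such a crawler, reusing the hypothetical get_gfid()/meta_xattr_name() helpers from the earlier sketch; error handling and resumability are left out.

import os

def rebuild_index(brick_root):
    for parent, dirs, files in os.walk(brick_root):
        # do not descend into the GFID backend namespace
        dirs[:] = [d for d in dirs if d != ".glusterfs"]
        pgfid = get_gfid(parent)
        for name in files:
            os.setxattr(os.path.join(parent, name),
                        meta_xattr_name(pgfid, name), name.encode())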

Frequency of the rebuild? I would assume this would be run when the
option is enabled, and later almost never, unless we want to recover
from some inconsistency in the data (how to detect the same would be
an open question).

Also, I think once this option is enabled, we should prevent disabling
it (or at least until the packages are downgraded), as this would
be a hinge that multiple other features may depend on; so we would
consider this an on-disk change that is made once and later
maintained for the volume, rather than something turned on/off.

Which means the initial index rebuild would be a volume version
conversion from the current representation to this one, and may need
additional thought on how we maintain volume versions.


Comments and Suggestions Welcome.

regards
Aravinda

On 11/25/2015 10:08 AM, Aravinda wrote:

regards
Aravinda

On 11/24/2015 11:25 PM, Shyam wrote:
There seem to be other interested consumers in gluster for the same
information, and I guess we need a good base design to address this
on-disk change, so that it can be leveraged in the various use cases
appropriately.

Request a few folks to list out how they would use this feature and
also what performance characteristics they expect around the same.

- gluster find class of utilities
- change log processors
- swift on file
- inotify support on gluster
- Others?
- Debugging utilities for users/admins (show the path for GFIDs displayed in
log files)
- Retrigger sync in Geo-replication (Geo-rep reports failed GFIDs in its
logs; we can retrigger the sync if the path is known instead of only the GFID)

[3] is an attempt in XFS to do the same; possibly there is a later
thread around the same that discusses newer approaches.

[4] slide 13 onwards talks about how cephfs does this. (see cephfs
inode backtraces)

Aravinda, could you put up a design for the same, and how and where
this information is added, etc.? It would help to review it from other
xlators' perspective (like the existing DHT).

Shyam
[3] http://oss.sgi.com/archives/xfs/2014-01/msg00224.html
[4]
http://events.linuxfoundation.org/sites/events/files/slides/CephFS-Vault.pdf


On 10/27/2015 10:02 AM, Shyam wrote:
Aravinda, List,

The topic is interesting and also relevant in the case of DHT2
where we
lose the hierarchy on a single brick (unlike the older DHT) and so
some
of the thoughts here are along the same lines as what we are debating
w.r.t DHT2 as well.

Here is another option, extending the current thought, that I would
like to put forward; it is pretty much inspired by the Linux kernel
NFS implementation (based on my current understanding of the same)
[1] [2].

If gluster server/brick processes handed out handles (which are
currently just the GFID (or inode #) of the file) that encode
pGFID/GFID, then on any handle-based operation we get the pGFID/GFID
for the object being operated on. This solves the first part of the
problem, where we are encoding the pGFID in the xattr; here we not
only do that but further hand out the handle with that relationship.

It also helps when an object is renamed and we still allow the older handle to be used for operations. Not a bad thing in some cases, and
possibly not the best thing to do in some other cases (say access).

To carry this knowledge back to a name, what you propose can be
stored on the object itself, thus giving us the ability to create a
short dentry tree of pGFID->name(GFID).

This of course changes the gluster RPC wire protocol, as we need to
encode/send the pGFID as well in some cases (or this could be done by
adding it to the xdata payload).

Shyam

[1] http://nfs.sourceforge.net/#faq_c7
[2]
https://www.kernel.org/doc/Documentation/filesystems/nfs/Exporting

On 10/27/2015 03:07 AM, Aravinda wrote:
Hi,

We have a volume option called "build-pgfid:on" to enable recording
the parent GFID in a file xattr. This simplifies GFID to path
conversion.
Is it possible to also save the basename in the xattr along with the
PGFID? It helps in converting GFID to path easily without doing a crawl.

Example structure,

dir1 (3c789e71-24b0-4723-92a2-7eb3c14b4114)
     - f1 (0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
     - f2 (f1e7ad00-6500-4284-b21c-d02766ecc336)
dir2 (6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed)
     - h1 (0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)

Where file f1 and h1 are hardlinks. Note the same GFID.

Backend,

.glusterfs
      - 3c/78/3c789e71-24b0-4723-92a2-7eb3c14b4114
      - 0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c
      - f1/e7/f1e7ad00-6500-4284-b21c-d02766ecc336
      - 6c/3b/6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed

Since f1 and h1 are hardlinks across directories, the file's xattrs
will have two parent GFIDs. The xattr dump will be:

trusted.pgfid.3c789e71-24b0-4723-92a2-7eb3c14b4114=1
trusted.pgfid.6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed=1

The number shows the number of hardlinks per parent GFID.

If we know the GFID of a file, to get the path:
1. Identify which brick has that file using the pathinfo xattr.
2. Get all parent GFIDs (using listxattr on the backend GFID path
.glusterfs/0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
3. Crawl those directories to find files with the same inode as
.glusterfs/0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c
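
A minimal sketch of steps 2 and 3 above, reusing the hypothetical dir_gfid_to_path() helper from the earlier sketch; the break-early optimization mentioned below is omitted for brevity.

import os

def gfid_to_paths_pgfid_only(brick_root, gfid):
    gfid_path = os.path.join(brick_root, ".glusterfs",
                             gfid[0:2], gfid[2:4], gfid)
    target_ino = os.stat(gfid_path).st_ino
    paths = []
    for key in os.listxattr(gfid_path):
        if not key.startswith("trusted.pgfid."):
            continue
        parent = dir_gfid_to_path(brick_root, key[len("trusted.pgfid."):])
        for entry in os.scandir(parent):   # crawl limited to this one directory
            if entry.inode() == target_ino:
                paths.append(entry.path)
    return paths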

Updating of the PGFID xattr is to be done on:
1. CREATE/MKNOD - Add xattr
2. RENAME - If moved to a different directory, update the PGFID
3. UNLINK - If the number of links is more than 1, reduce the link
count and remove the respective parent PGFID
4. LINK - Add the PGFID if the link is to a different directory;
increment the count

Advantages:
1. Crawling is limited to a few directories instead of a full file
system crawl.
2. The crawl can break early once the search finds as many hardlinks
as the xattr value indicates.

Disadvantages:
1. Crawling is expensive if a directory has a lot of files.
2. The PGFID must be updated on CREATE/MKNOD/RENAME/UNLINK/LINK.
3. This method of conversion will not work if the file is deleted.

We can improve the performance of GFID to path conversion if we also
record the basename in the file xattr.

trusted.pgfid.3c789e71-24b0-4723-92a2-7eb3c14b4114=f1
trusted.pgfid.6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed=h1

Note: multiple basenames are delimited by a zero byte.
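
A tiny sketch of handling the zero-byte-delimited value: appending one basename requires reading and re-writing the whole value, which is part of the update overhead listed below. The key and helper name here are illustrative only.

import os

def add_basename(path, pgfid, basename):
    key = "trusted.pgfid." + pgfid
    try:
        names = os.getxattr(path, key).split(b"\x00")
    except OSError:
        names = []                       # xattr not present yet
    names.append(basename.encode())
    os.setxattr(path, key, b"\x00".join(names))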

Additional overhead compared to storing only the PGFID:
1. Space
2. The number of xattrs grows with the number of hardlinks
3. Possible max-size issue for the xattr value?
4. Updates are needed even when a file is renamed within the same
directory.
5. Updating the xattr value involves parsing in the case of multiple
hardlinks.

Are there any performance issues except during the initial indexing?
(Assume the PGFIDs and basenames are populated by a separate script.)

Comments and Suggestions Welcome.




