Re: RFC/Review: libgfapi object handle based extensions

On Mon, Sep 30, 2013 at 3:40 AM, Shyamsundar Ranganathan <srangana@xxxxxxxxxx> wrote:
Avati, Amar,

Amar, Anand S and I had a discussion on this comment, and here is an answer to your queries as I see it. Let me know if I am missing something here.

(This is not an NFS Ganesha requirement, FYI, as Ganesha will only do a single lookup, or preserve a single object handle, per filesystem object in its cache.)

Currently a glfs_object is an opaque pointer to an object (it is a _handle_ to the object). The object itself contains a ref'd inode, which is the actual pointer to the object.

1) The similarity and differences of object handles to fds

The intention of multiple object handles is in line with multiple fds per file: an application using the library is free to look up (and/or create, and its equivalents) and acquire as many object handles as it wants for a particular object, and can hence determine the lifetime of each such handle in its view. So in essence one thread can hold an object handle to perform, say, attribute-related operations, while another thread has the same object looked up to perform IO.

So do you mean a glfs_object is meant to be a *per-operation* handle? If one thread wants to perform a chmod() and another thread wants to perform chown(), and both attempt to resolve the same name and end up getting different handles, do both of them then unref the glfs_object right after their operation?

 
Where object handles depart from the notion of fds is when an unlink is performed. POSIX defines that open fds are still _open_ for activities on the file, so the life of an fd, and of the actual object it points to, lasts until the fd is closed. In the case of object handles, though, the moment any handle is used to unlink the object (which, BTW, is done using the parent object handle and the name of the child), all handles pointing to the object are still valid pointers, but operations on them will result in ENOENT, as the actual object has since been unlinked and removed by the underlying filesystem.

Not always. If the file had hardlinks, the handle should still be valid. And if there were no hardlinks and you unlinked the last link, further operations must return ESTALE, not ENOENT. ENOENT is for when a basename does not resolve to a handle (in entry operations), e.g. when you try to unlink the same entry a second time. ESTALE is for when a presented handle does not exist, e.g. when you try to operate (read, chmod) on a handle which got deleted.
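The ENOENT/ESTALE distinction can be sketched with a minimal mock (the structs and helpers below are hypothetical stand-ins for illustration, not gfapi code): an entry operation that fails to resolve a name returns ENOENT, while an operation on a handle whose object has gone away returns ESTALE.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only -- not real gfapi types. */
struct mock_inode  { int exists; };
struct mock_handle { struct mock_inode *inode; };

/* Entry operation: the basename no longer resolves -> ENOENT.
 * (Unlinking the same entry a second time hits this path.) */
static int mock_unlink(struct mock_inode *entry)
{
    if (entry == NULL || !entry->exists)
        return -ENOENT;
    entry->exists = 0;
    return 0;
}

/* Handle operation: the presented handle's object is gone -> ESTALE. */
static int mock_chmod(struct mock_handle *h)
{
    if (!h->inode->exists)
        return -ESTALE;
    return 0;
}
```

After the entry is unlinked once, a handle operation fails with ESTALE while a repeated entry operation fails with ENOENT, which is the distinction being drawn above.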
 
The departure from fds is valid in my perspective: the handle points to an object which has since been removed, so there are no semantics here that require the object to be preserved for further operations just because a reference to it is held.

The departure is only in the behavior of unlinked files. That is orthogonal to whether you want to return separate handles each time a component is looked up. I fail to see how the "departure from fd behavior" justifies creating a new glfs_object per lookup?
 
So in essence, each time an object handle is returned by the API, it has to be closed for its life to end. Additionally, if the object it points to is removed from the underlying system, the handle points to an entry that no longer exists and returns ENOENT on operations using it.

2) The issue/benefit of having the same object handle irrespective of looking it up multiple times

If we have a 1:1 relationship of object handles (i.e. struct glfs_object) to inodes, then the caller gets the same pointer to the handle. Hence having multiple handles, from the caller's perspective, boils down to handing out ref-counted glfs_object(s) for the same inode.

Other than the memory footprint, this will still not make the object live past its unlink. The pointer handed out will still be valid until the last ref count is dropped (i.e. the object handle is closed), at which point the object handle can be destroyed.

If I understand what you say above correctly, you intend to solve the problem of "unlinked files must return error" at your API layer? That's wrong. The right way is to ref-count glfs_object and return the same one, precisely because you should NOT make the decision about the end of life of an inode at that layer. A hardlink may have been created by another client, and the glfs_object may therefore still be valid.
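The ref-counting argument can be made concrete with a hypothetical model (again, not gfapi code): the inode's on-disk link count and the handle layer's reference count are independent, so removing one name neither invalidates outstanding refs nor ends the object's life while other links remain.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model: one shared, ref-counted handle per inode, with the
 * inode's hardlink count tracked separately from handle references. */
struct mock_inode {
    int nlink;    /* hardlinks on disk */
    int refs;     /* outstanding glfs_object-style references */
};

static struct mock_inode *mock_get(struct mock_inode *in)
{
    in->refs++;          /* same pointer handed out, just re-ref'd */
    return in;
}

static void mock_put(struct mock_inode *in)
{
    in->refs--;          /* memory lives until the last ref is dropped */
}

static void mock_unlink_one(struct mock_inode *in)
{
    in->nlink--;         /* removing a name does not touch refs */
}

/* A handle operation only fails once every link is gone. */
static int mock_op(struct mock_inode *in)
{
    return (in->nlink > 0) ? 0 : -ESTALE;
}
```

With two hardlinks, unlinking one name leaves operations on every outstanding handle succeeding; only after the last link goes do they fail with ESTALE, and only after the last unref can the handle memory be reclaimed.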

You are also returning separate glfs_objects for different hardlinks of a file. Does that mean a glfs_object represents a dentry? Or a per-operation reference to an inode?

So again, however many handles were handed out for the same inode, each has to be closed, etc.

3) Graph switches

In the case of graph switches, handles used in operations after the switch get refreshed with an inode from the new graph, if we have an N:1 object-to-inode relationship.

In the case of 1:1 this is done once, but is there some multi-thread safety that needs to be in place? I think this is already covered by the glfs_resolve_inode implementation as suggested earlier, but it is good to check.
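The refresh-on-graph-switch idea, with the thread-safety question answered by a lock, can be sketched roughly as follows (a hypothetical model; the names do not correspond to real gfapi internals, and a generation counter stands in for the graph pointer):

```c
#include <pthread.h>

/* Hypothetical sketch of refreshing a cached inode after a graph switch. */
static int active_generation = 1;   /* bumped on each graph switch */

struct mock_object {
    pthread_mutex_t lock;
    int             generation;     /* graph the cached inode belongs to */
    int             inode_id;       /* stand-in for the ref'd inode */
};

/* Return an inode valid for the active graph, refreshing if stale.
 * The lock makes the check-and-refresh atomic across threads, so two
 * concurrent callers cannot both refresh or see a torn update. */
static int mock_resolve_inode(struct mock_object *obj)
{
    int id;

    pthread_mutex_lock(&obj->lock);
    if (obj->generation != active_generation) {
        obj->generation = active_generation;   /* refresh happens once */
        obj->inode_id++;                       /* new inode, same GFID */
    }
    id = obj->inode_id;
    pthread_mutex_unlock(&obj->lock);
    return id;
}
```

The key property is that after a switch the first resolver does the refresh and every later call sees the already-refreshed inode, which is the "done once" behavior described above.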

4) Renames

In the case of renames the inode remains the same, hence all handed-out object handles are still valid and will operate on the right object.

5) unlinks and recreation of the same _named_ object in the background

Example: an application gets a handle for an object, say one named "a.txt", and in the background (or via another application/client) this is deleted and recreated.

This will return ENOENT, as the GFID would have changed from the previously held object to the new one, even though the names are the same. This seems like the right behaviour, and does not change in the case of a 1:1 or an N:1 object-handle-to-inode mapping.

So, bottom line, I see object handles like fds, with the difference noted above. Having them in a 1:1 relationship or an N:1 relationship does not seem to be an issue from what I understand; what am I missing here?

The issue is this. From what I understand, the usage of glfs_object in the FSAL is not like a per-operation handle, but something stored long term (many minutes, hours, days) in the per-inode context of the NFS Ganesha layer. Now NFS Ganesha may be doing the "right thing" by not re-looking up an already looked up name and therefore avoiding a leak (I'm not so sure, it still needs to verify every so often if the mapping is still valid). From NFS Ganesha's point of view the handle is changing on every lookup.

Now consider what happens in the case of READDIRPLUS. A list of names and handles is returned to the client. The list of names can possibly include names which were previously looked up as well. Both are supposed to represent the same GFID, but here we will be returning new glfs_objects. When a client performs an operation on a GFID, on which glfs_object will the operation be performed at the gfapi layer? This part seems very ambiguous.
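The ambiguity can be shown with a tiny mock of the current per-lookup allocation (hypothetical code, not the real glfs_h_lookupat path): two lookups of the same GFID yield distinct handle pointers, so a later by-GFID operation has no single handle to target.

```c
#include <stdlib.h>

/* Hypothetical mock of the current behavior: a fresh glfs_object-like
 * allocation per lookup, even when the inode is already cached. */
struct mock_object { int gfid; };

static struct mock_object *mock_lookup(int gfid)
{
    struct mock_object *obj = malloc(sizeof(*obj));
    if (!obj)
        return NULL;
    obj->gfid = gfid;   /* same GFID, yet a brand-new handle each time */
    return obj;
}
```

Two calls with the same GFID return unequal pointers even though they denote one filesystem object, which is precisely the N-handles-per-inode situation being questioned.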

What would really help is if you can tell what a glfs_object is supposed to represent. An on-disk inode (i.e. GFID)? An in-memory per-graph inode (i.e. inode_t)? A dentry? A per-operation handle to an on-disk inode? A per-operation handle to an in-memory per-graph inode? A per-operation handle to a dentry? In its current form, it does not seem to fit any of these categories.


Avati


Shyam



From: "Anand Avati" <avati@xxxxxxxxxxx>
To: "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxx>
Sent: Monday, September 30, 2013 10:35:05 AM

Subject: Re: RFC/Review: libgfapi object handle based extensions

I see a pretty core issue - lifecycle management of 'struct glfs_object'. What is the structure representing? When is it created? When is it destroyed? How does it relate to inode_t?

Looks like for every lookup() we are creating a new glfs_object, even if the looked up inode was already looked up before (in the cache) and had a glfs_object created for it in the recent past.

We need a stronger, clearer relationship between the two. It is probably necessary for a glfs_object to represent multiple inode_t's at different points in time depending on graph switches, but for a given inode_t we need only one glfs_object. We definitely must NOT have a new glfs_object per lookup call.

Avati

On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan <srangana@xxxxxxxxxx> wrote:
Avati,

Please find the updated patch set for review at gerrit.
http://review.gluster.org/#/c/5936/

Changes were made to address points (1), (2) and (3) below, by using the suggested glfs_resolve_inode approach.

I have not yet changed glfs_h_unlink to use glfs_resolve_at (more on this a little later).

So currently, the review request is for all APIs other than,
glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid

glfs_resolve_at: Using this function the terminal name will be a force look up anyway (as force_lookup will be passed as 1 based on !next_component). We need to avoid this _extra_ lookup in the unlink case, which is why all the inode_grep(s) etc. were added to the glfs_h_lookup in the first place.

Having said the above, we should still leverage glfs_resolve_at anyway, as there seem to be other corner cases where the resolved inode and subvol may be from different graphs. So I think I want to modify glfs_resolve_at to make force_lookup conditional, based on whether iatt is NULL or not. IOW, change the call to glfs_resolve_component with the conditional (reval || (!next_component && iatt)), so that callers that do not want the iatt filled can skip the syncop_lookup.
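The proposed conditional can be sketched as a small predicate (a hypothetical helper for illustration; the real change would be inline in glfs_resolve_at's call to glfs_resolve_component):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper mirroring the proposed condition: force a lookup
 * on reval, or on the terminal component (next_component == NULL) only
 * when the caller asked for the iatt to be filled. Callers that pass a
 * NULL iatt -- e.g. the unlink path -- skip the extra syncop_lookup. */
static bool should_force_lookup(bool reval, const char *next_component,
                                const void *iatt)
{
    return reval || (next_component == NULL && iatt != NULL);
}
```

Under this sketch the unlink case (terminal component, NULL iatt, no reval) yields false, which is exactly the extra lookup the proposal wants to avoid.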

Request comments on the glfs_resolve_at proposal.

Shyam.

----- Original Message -----

> From: "Anand Avati" <avati@xxxxxxxxxxx>
> To: "Shyamsundar Ranganathan" <srangana@xxxxxxxxxx>
> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxx>
> Sent: Wednesday, September 18, 2013 11:39:27 AM
> Subject: Re: RFC/Review: libgfapi object handle based extensions

> Minor comments are made in gerrit. Here is a larger (more important) comment
> for which email is probably more convenient.

> There is a problem in the general pattern of the fops, for example
> glfs_h_setattrs() (and others too)

> 1. glfs_validate_inode() has the assumption that object->inode deref is a
> guarded operation, but here we are doing an unguarded deref in the parameter
> to glfs_resolve_base().

> 2. A more important issue, glfs_active_subvol() and glfs_validate_inode() are
> not atomic. glfs_active_subvol() can return an xlator from one graph, but by
> the time glfs_validate_inode() is called, a graph switch could have happened
> and inode can get resolved to a different graph. And in syncop_XXXXXX() we
> end up calling on graph1 with inode belonging to graph2.

> 3. ESTALE_RETRY is a fundamentally wrong thing to do with handle based
> operations. The ESTALE_RETRY macro exists for path based FOPs where the
> resolved handle could have turned stale by the time we perform the FOP
> (where resolution and FOP are non-atomic). Over here, the handle is
> predetermined, and it does not make sense to retry on ESTALE (notice that FD
> based fops in glfs-fops.c also do not have ESTALE_RETRY for this same
> reason)

> I think the pattern should be similar to FD based fops which specifically
> address both the above problems. Here's an outline:

> glfs_h_XXXX (struct glfs *fs, glfs_object *object, ...)
> {
>         xlator_t *subvol = NULL;
>         inode_t  *inode  = NULL;

>         __glfs_entry_fs (fs);

>         subvol = glfs_active_subvol (fs);
>         if (!subvol) { errno = EIO; ... goto out; }

>         inode = glfs_resolve_inode (fs, object, subvol);
>         if (!inode) { errno = ESTALE; ... goto out; }

>         loc.inode = inode;
>         ret = syncop_XXXX (subvol, &loc, ...);
> }

> Notice the signature of glfs_resolve_inode(). What it does: given a
> glfs_object, and a subvol, it returns an inode_t which is resolved on that
> subvol. This way the syncop_XXX() is performed with matching subvol and
> inode. Also it returns the inode pointer so that no unsafe object->inode
> deref is done by the caller. Again, this is the same pattern followed by the
> fd based fops already.

> Also, as mentioned in one of the comments, please consider using
> glfs_resolve_at() and avoiding manual construction of loc_t.

> Thanks,
> Avati



