Re: dht: selfheal of missing directories on nameless (by GFID) LOOKUP

On Mon, May 5, 2014 at 12:32 AM, Anand Avati <avati@xxxxxxxxxxx> wrote:



On Sun, May 4, 2014 at 9:22 AM, Niels de Vos <ndevos@xxxxxxxxxx> wrote:
Hi,

bug 1093324 has been opened and we have identified the following cause:

1. an NFS-client does a LOOKUP of a directory on a volume
2. the NFS-client receives a filehandle (contains volume-id + GFID)
3. add-brick is executed, but the new brick does not have any
   directories yet
4. the NFS-client creates a new file in the directory, this request is
   in the format of <filehandle>/<filename>, <filehandle> was received
   in step 2
5. the NFS-server does a LOOKUP on the parent directory identified by
   the filehandle - nameless LOOKUP, only GFID is known
6. the old brick(s) return successfully
7. the new brick returns ESTALE
8. the NFS-server returns ESTALE to the NFS-client

In this case, the NFS-client should not receive an ESTALE. There is also
no ESTALE error passed to the client when this procedure is done over
FUSE or samba/libgfapi.

Selfhealing a directory entry based only on a GFID is not always
possible. Files do not have a unique filename (hardlinks), so it is not
trivial to find a filename for a GFID (expensive operation, and the
result could be a list). However, for a directory this is simpler.
A directory is not hardlinked in the .glusterfs directory; directories
are maintained as symbolic links. This makes it possible to find the
name of a directory when only the GFID is known.
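
A minimal sketch of that lookup, for illustration only. It assumes the
usual brick layout, in which a directory's handle lives at
.glusterfs/<first 2 hex chars>/<next 2 hex chars>/<gfid> and is a symlink
whose target ends in <parent-gfid>/<dirname>. The function names are
hypothetical and this is not the posix-xlator code:

```python
import os
import uuid


def gfid_handle_path(brick_root, gfid):
    """Build the .glusterfs handle path for a GFID (layout is an assumption)."""
    g = str(uuid.UUID(gfid))  # normalise to canonical lowercase form
    return os.path.join(brick_root, ".glusterfs", g[0:2], g[2:4], g)


def resolve_dir_gfid(brick_root, gfid):
    """Return (parent_gfid, name) for a directory GFID, or None.

    None means the handle is missing on this brick, or it is not a
    symlink (regular files are hardlinked, so no name can be read back).
    """
    handle = gfid_handle_path(brick_root, gfid)
    if not os.path.islink(handle):
        return None
    # One readlink() per nameless LOOKUP that finds a directory.
    target = os.readlink(handle)
    parent_path, name = os.path.split(target)
    return os.path.basename(parent_path), name
```

The single readlink() here is the extra syscall mentioned in the proposal
further down; the parent GFID still has to be resolved the same way to
build a full path.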

Currently DHT is not able to selfheal directories on a nameless LOOKUP.
I think that it should be possible to change this, and to fix the ESTALE
returned by the NFS-server.

At least two changes would be needed, and this is where I would like to
hear opinions from others about it:

- The posix-xlator should be able to return the directory name when
  a GFID is given. This can be part of the LOOKUP-reply (dict), and that
  would add a readlink() syscall for each nameless LOOKUP that finds
  a directory. Or (suggested by Pranith) add a virtual xattr and handle
  this specific request with an additional FGETXATTR call.

I think the LOOKUP-reply with readlink() is better than a new over-the-wire FOP.
 

- DHT should selfheal the directory when at least one ESTALE is returned
  by the bricks.


This also makes sense, except when even the parent directory is missing on that server (yet to be healed). Another important point is that directories with the same GFID may be present at different locations, as different dentries, on the various servers. A lookup of <dir-gfid>/"name" should succeed transparently, independent of how <dir-gfid>'s dentries differ across servers.

Just want to be sure, among the following two scenarios
1. Different <pargfid>/name combinations, having same gfid
2. Same <pargfid>/name combination, having different gfids

are you saying 1 is legal (though only as a transient state during ops like rename etc)? How about 2, isn't it illegal even as a transient state (one should never ever see 2 at any point in time)?

 

However, if you want to heal, the choice of server from which you select the dir's parent and name becomes important, as the self-heal will impose that on the other servers. For example, one of the AFR subvolumes may not yet have healed the parent directories. Or the N-1 servers may each return a different par-gfid/dir-name in the LOOKUP reply. So it can quickly get hairy.

As a general approach, using the LOOKUP-reply to send parent info from the posix level makes sense. But we also need a more detailed proposal on how that info is used at the cluster xlator levels to achieve a higher level goal, like self-heal.
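
To make the concern concrete, here is a rough sketch of one conservative
policy a cluster xlator could apply to the parent info in the LOOKUP
replies: heal only when every brick that has the directory reports the same
(par-gfid, name) pair. This policy is an assumption for illustration, not
something agreed in this thread, and pick_heal_source is a hypothetical
name:

```python
def pick_heal_source(replies):
    """Choose the (parent_gfid, name) to impose on bricks missing the dir.

    replies maps a brick name to the (parent_gfid, name) pair from its
    LOOKUP reply, or to None when that brick returned ESTALE.
    Returns (pair_or_None, bricks_missing_the_dir).
    """
    found = {b: r for b, r in replies.items() if r is not None}
    missing = sorted(b for b, r in replies.items() if r is None)
    distinct = set(found.values())
    if len(distinct) != 1:
        # No brick has the directory, or bricks disagree on parent/name
        # (e.g. a rename in flight, or parents not yet healed): do not heal.
        return None, missing
    return distinct.pop(), missing


# Two bricks agree, one is missing the directory: safe to heal brick3.
print(pick_heal_source({
    "brick1": ("pargfid-1", "docs"),
    "brick2": ("pargfid-1", "docs"),
    "brick3": None,
}))
```

Refusing to heal on disagreement sidesteps the question of which server
wins, at the cost of leaving the new brick unhealed until the dentries
converge.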
 
When all bricks return ESTALE, the ESTALE is valid and
  should be passed on to the upper layers (NFS-server -> NFS-client).

Yes.
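
The ESTALE handling agreed above can be sketched as a small decision
function. This is only an illustration of the rule (heal when some bricks
succeed and at least one returns ESTALE, pass ESTALE up when all bricks
return it); the function name and result strings are hypothetical, and
errors other than ESTALE are not covered by this thread:

```python
import errno


def aggregate_nameless_lookup(brick_errnos):
    """Combine per-brick results of a nameless LOOKUP on a directory.

    brick_errnos holds one entry per brick: 0 for success, otherwise
    the errno that brick returned.
    """
    if not brick_errnos:
        raise ValueError("no brick replies")
    if all(e == errno.ESTALE for e in brick_errnos):
        # Nobody knows this GFID: the ESTALE is valid, pass it upwards.
        return "estale"
    if all(e == 0 for e in brick_errnos):
        return "ok"
    if all(e in (0, errno.ESTALE) for e in brick_errnos):
        # Some bricks have the directory, at least one does not: selfheal.
        return "heal"
    # Any other errno: handling is not discussed in the thread.
    return "other"
```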

Thanks

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel




--
Raghavendra G