Re: distributed files/directories and [cm]time updates

Hi Xavier,

A patch that implements a metadata cache in the posix layer has been sent for review. It makes the following changes:

Whenever there is a fresh lookup on an object (file/directory/symlink), the posix xlator saves the stat attributes of that object in its cache.
As of now, whenever there is a fop on an object, posix tries to build the HANDLE of the object by looking into the gfid-based backend (i.e. the .glusterfs directory) and doing a stat to check whether the gfid exists. The patch changes posix to check its own cache first and return if it finds the attributes there. If not, it looks into the actual gfid backend.
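The cache-first resolution described above can be sketched roughly as follows. This is only an illustration in Python, not code from the patch: the StatCache class, its method names, and the backend_stats counter are all hypothetical, and a plain file path stands in for the gfid handle under .glusterfs.

```python
import os
import tempfile


class StatCache:
    """Hypothetical in-memory stat cache keyed by gfid (not the patch's real structure)."""

    def __init__(self):
        self._entries = {}
        self.backend_stats = 0  # how many times we had to stat the backend

    def store(self, gfid, st):
        """Called after a fresh lookup: remember the object's attributes."""
        self._entries[gfid] = st

    def resolve(self, gfid, handle_path):
        """Return cached attributes if present, otherwise stat the gfid handle."""
        st = self._entries.get(gfid)
        if st is not None:
            return st
        self.backend_stats += 1
        # Raises FileNotFoundError if the handle does not exist on disk.
        return os.lstat(handle_path)


with tempfile.TemporaryDirectory() as d:
    handle = os.path.join(d, "handle")  # stand-in for a .glusterfs handle path
    open(handle, "w").close()
    cache = StatCache()
    cache.resolve("gfid-1", handle)             # miss: goes to the backend
    cache.store("gfid-1", os.lstat(handle))     # what a fresh lookup would do
    cache.resolve("gfid-1", handle)             # hit: no backend stat
    print("backend stats:", cache.backend_stats)
```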

As of now, though, there is no cache invalidation. Whenever a setattr() fop changes the attributes of an object, the new stat info is saved in the cache once the fop has succeeded on disk.
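A rough sketch of that update-on-success behaviour, again in Python and again only an illustration: setattr_mode is a hypothetical helper, a plain dict stands in for the posix cache, and a mode change stands in for setattr() in general.

```python
import os
import tempfile


def setattr_mode(cache, gfid, path, mode):
    """Apply a mode change on disk, then refresh the cached attributes.

    `cache` is a plain dict standing in for the posix cache (an assumption).
    If os.chmod raises, the cache entry is left untouched.
    """
    os.chmod(path, mode)          # the on-disk operation; raises OSError on failure
    cache[gfid] = os.lstat(path)  # only reached when the operation succeeded
    return cache[gfid]


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "file")
    open(path, "w").close()
    cache = {}
    st = setattr_mode(cache, "gfid-1", path, 0o600)
    print(oct(st.st_mode & 0o777))
```

Because the entry is only replaced after the disk operation returns, a failed setattr() leaves the previously cached attributes in place.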

The patch can be found here: http://review.gluster.org/#/c/12157/

Regards,
Raghavendra

On Tue, Jan 26, 2016 at 2:51 AM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Pranith,

On 26/01/16 03:47, Pranith Kumar Karampuri wrote:
hi,
       Traditionally gluster has been using the ctime/mtime of the
files/dirs on the bricks as stat output. The problem we are seeing with
this approach is that software which depends on it gets confused when
there are differences in these times. Tar especially gives "file changed
as we read it" whenever it detects ctime differences when stat is served
from different bricks. The way we have been trying to solve it is to
serve the stat structures from the same brick in afr, and max-time in
dht. But that doesn't avoid the problem completely. Because there is no
way to change ctime at the moment (lutimes() only allows mtime, atime),
there is little we can do to make sure ctimes match after
self-heals/xattr updates/rebalance. I am wondering if any of you have
solved these problems before; if yes, how did you go about doing it? It
seems like applications which depend on this for backups get confused
the same way. The only way out I see is to move ctime to an xattr, but
that will need more iops and gluster has to keep updating it on quite a
few fops.
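
The ctime limitation mentioned above is easy to reproduce outside gluster. The sketch below uses Python's os.utime(), which sets atime and mtime in the same way; the backdate helper is just an illustrative name. On Linux, the resulting ctime is the time of the call, not the requested value.

```python
import os
import tempfile


def backdate(path, when):
    """Set atime and mtime to `when` (seconds since the epoch); return the new stat."""
    os.utime(path, (when, when))
    return os.stat(path)


with tempfile.NamedTemporaryFile() as f:
    st = backdate(f.name, 1_000_000_000)  # a date in September 2001
    print("mtime:", st.st_mtime)  # the value we asked for
    print("ctime:", st.st_ctime)  # set by the kernel at the time of the call
```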

I did think about this when I was first writing ec. The idea was that the point in time at which each fop is executed would be controlled by the client, by adding a special xattr to each regular fop. Of course this would require support inside the storage/posix xlator. At that time, adding the needed support to other xlators seemed too complex to me, so I decided to do something similar to afr.

Anyway, the idea was like this: when a write fop needs to be sent, for example, dht/afr/ec sets the current time in a special xattr, say 'glusterfs.time'. This can be done in such a way that if the time has already been set by a higher xlator, it is not modified. This way DHT could set the time in fops involving multiple afr subvolumes. For other fops, it would be afr that sets the time. It could also be set directly by the topmost xlator (fuse), but that time could be incorrect because lower xlators could delay the fop execution and reorder it. This would need more thinking.
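
The "do not modify if already set" rule can be sketched like this. It is a Python illustration only: the xdata dict stands in for the dictionary passed along with a fop, stamp_fop is a hypothetical helper, and only the key name 'glusterfs.time' comes from the example above.

```python
import time

TIME_KEY = "glusterfs.time"  # key name taken from the example in this mail


def stamp_fop(xdata, now=None):
    """Set the fop time unless a higher xlator has already set it."""
    if TIME_KEY not in xdata:
        xdata[TIME_KEY] = time.time() if now is None else now
    return xdata


xdata = {}
stamp_fop(xdata, now=100.0)  # e.g. dht, for a fop spanning several afr subvolumes
stamp_fop(xdata, now=200.0)  # e.g. afr, lower in the stack: leaves the time alone
print(xdata[TIME_KEY])
```

With this rule every xlator in the stack can call the same helper unconditionally, and the highest one that calls it decides the time.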

That xattr will be received by storage/posix. This xlator will determine what times need to be modified and will change them. In the case of a write, it can decide to modify mtime and, maybe, atime. For a mkdir or create, it will set the times of the new file/directory and also the mtime of the parent directory. It depends on the specific fop being processed.
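
One way to picture the per-fop decision is a small policy table. The sketch below is Python and purely illustrative: the POLICY table, the apply_fop_time helper, and the exact set of fields per fop are assumptions (the write entry, for instance, updates mtime and ctime and leaves the "maybe atime" question open).

```python
# Hypothetical policy table: which timestamps each fop updates on the
# target object and on its parent directory.
POLICY = {
    "write":  {"target": ("mtime", "ctime"), "parent": ()},
    "mkdir":  {"target": ("atime", "mtime", "ctime"), "parent": ("mtime", "ctime")},
    "create": {"target": ("atime", "mtime", "ctime"), "parent": ("mtime", "ctime")},
}


def apply_fop_time(fop, fop_time, target, parent=None):
    """Copy the time received with the fop into the fields the policy selects."""
    rule = POLICY[fop]
    for field in rule["target"]:
        target[field] = fop_time
    if parent is not None:
        for field in rule["parent"]:
            parent[field] = fop_time


parent = {"atime": 1, "mtime": 1, "ctime": 1}
newdir = {}
apply_fop_time("mkdir", 50, newdir, parent)
print(newdir, parent)
```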

mtime, atime and ctime (or even others) could be saved in a special posix xattr instead of relying on the file system attributes that cannot be modified (at least for ctime).
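
As an illustration of what such an xattr value could look like, the sketch below packs the three times into a fixed-size binary blob. The layout (three signed 64-bit nanosecond counters, big-endian) and the function names are assumptions made for this example, not an existing gluster format. On Linux the resulting bytes could be stored with os.setxattr(), which is not called here.

```python
import struct

# Hypothetical layout: atime, mtime, ctime as signed 64-bit nanoseconds, big-endian.
_FMT = "!qqq"


def pack_times(atime_ns, mtime_ns, ctime_ns):
    """Serialize the three timestamps into a 24-byte xattr value."""
    return struct.pack(_FMT, atime_ns, mtime_ns, ctime_ns)


def unpack_times(value):
    """Inverse of pack_times: return the timestamps as a dict."""
    atime_ns, mtime_ns, ctime_ns = struct.unpack(_FMT, value)
    return {"atime_ns": atime_ns, "mtime_ns": mtime_ns, "ctime_ns": ctime_ns}


blob = pack_times(1_453_800_000_000_000_000, 1_453_800_000_000_000_001, 1_453_800_000_000_000_002)
print(len(blob), unpack_times(blob))
```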

This solution doesn't require extra fops, so it seems quite clean to me. The additional I/O needed in posix could be minimized by implementing a metadata cache in storage/posix that would read all metadata on lookup and update it on disk only at regular intervals and/or on invalidation. All fops would read from and write to the cache. This would even reduce the number of I/O operations we currently perform for each fop.
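
A minimal sketch of such a write-back cache, in Python. Everything here is an assumption made for illustration: the MetadataCache class, its method names, and the use of a plain dict in place of the on-disk metadata. Time is passed in explicitly so that the flush interval is easy to follow.

```python
class MetadataCache:
    """Hypothetical write-back cache: read on lookup, flush dirty entries periodically."""

    def __init__(self, disk, flush_interval):
        self.disk = disk                  # dict standing in for on-disk metadata
        self.flush_interval = flush_interval
        self.entries = {}
        self.dirty = set()
        self.last_flush = 0.0

    def lookup(self, gfid):
        """Load the object's metadata from disk on first access, then serve from memory."""
        if gfid not in self.entries:
            self.entries[gfid] = dict(self.disk[gfid])
        return self.entries[gfid]

    def update(self, gfid, **times):
        """Fops write into the cache only; the entry is marked dirty."""
        self.lookup(gfid).update(times)
        self.dirty.add(gfid)

    def flush(self, now):
        """Write all dirty entries back to disk; return how many were written."""
        for gfid in self.dirty:
            self.disk[gfid] = dict(self.entries[gfid])
        written = len(self.dirty)
        self.dirty.clear()
        self.last_flush = now
        return written

    def maybe_flush(self, now):
        """Flush only if the configured interval has elapsed since the last flush."""
        if now - self.last_flush < self.flush_interval:
            return 0
        return self.flush(now)


disk = {"gfid-1": {"mtime": 1}}
cache = MetadataCache(disk, flush_interval=5.0)
cache.update("gfid-1", mtime=10)
print(disk["gfid-1"], cache.maybe_flush(3.0), cache.maybe_flush(6.0), disk["gfid-1"])
```

The trade-off is the usual one for write-back caches: fewer disk updates per fop, at the cost of losing the unflushed times if the brick process dies between flushes.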

Xavi

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
