Hi Xavi,

Answer inline:

----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Joseph Fernandes" <josferna@xxxxxxxxxx>
Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Tuesday, January 26, 2016 2:09:43 PM
Subject: Re: distributed files/directories and [cm]time updates

Hi Joseph,

On 26/01/16 09:07, Joseph Fernandes wrote:
> Answer inline:
>
> ----- Original Message -----
> From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
> To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
> Sent: Tuesday, January 26, 2016 1:21:37 PM
> Subject: Re: distributed files/directories and [cm]time updates
>
> Hi Pranith,
>
> On 26/01/16 03:47, Pranith Kumar Karampuri wrote:
>> hi,
>> Traditionally gluster has been using the ctime/mtime of the files/dirs on the bricks as stat output. The problem we are seeing with this approach is that software which depends on it gets confused when there are differences in these times. Tar, especially, gives "file changed as we read it" whenever it detects ctime differences when stat is served from different bricks. The way we have been trying to solve it is to serve the stat structures from the same brick in afr and use the max time in dht, but that doesn't avoid the problem completely. Because there is no way to change ctime at the moment (lutimes() only allows mtime and atime), there is little we can do to make sure ctimes match after self-heals/xattr updates/rebalance. I am wondering if any of you have solved these problems before; if yes, how did you go about doing it? It seems like applications which depend on this for backups get confused the same way. The only way out I see is to bring ctime into an xattr, but that will need more iops and gluster will have to keep updating it on quite a few fops.
>
> I did think about this when I was writing ec at the beginning. The idea was that the point in time at which each fop is executed would be controlled by the client by adding a special xattr to each regular fop. Of course this would require support inside the storage/posix xlator. At that time, adding the needed support to other xlators seemed too complex to me, so I decided to do something similar to afr.
>
> Anyway, the idea was like this: for example, when a write fop needs to be sent, dht/afr/ec sets the current time in a special xattr, for example 'glusterfs.time'. It can be done in a way that if the time is already set by a higher xlator, it's not modified. This way DHT could set the time in fops involving multiple afr subvolumes. For other fops, it would be afr that sets the time. It could also be set directly by the topmost xlator (fuse), but that time could be incorrect because lower xlators could delay the fop execution and reorder it. This would need more thinking.
>
> That xattr will be received by storage/posix. This xlator will determine which times need to be modified and will change them. In the case of a write, it can decide to modify mtime and, maybe, atime. For a mkdir or create, it will set the times of the new file/directory and also the mtime of the parent directory. It depends on the specific fop being processed.
>
> mtime, atime and ctime (or even others) could be saved in a special posix xattr instead of relying on the file system attributes, which cannot be modified (at least not ctime).
>
> This solution doesn't require extra fops, so it seems quite clean to me.
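To make that last idea concrete, here is a rough standalone sketch (this is not GlusterFS code; the xattr key "user.glusterfs.time" and the struct layout are invented for illustration) of keeping all three times in a single posix xattr, so that even ctime becomes something we can set explicitly:

/* Standalone sketch: keep atime/mtime/ctime in one extended attribute
 * on the backend file, so a time chosen on the client side can be
 * applied verbatim -- including ctime, which the kernel does not let
 * us set directly.  Xattr name and layout are made up. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <sys/xattr.h>

#define TIME_XATTR "user.glusterfs.time"   /* hypothetical key */

struct stored_times {
        uint64_t atime_sec, atime_nsec;
        uint64_t mtime_sec, mtime_nsec;
        uint64_t ctime_sec, ctime_nsec;
};

/* Write the times received from the client into the xattr. */
static int store_times(const char *path, const struct stored_times *t)
{
        return setxattr(path, TIME_XATTR, t, sizeof(*t), 0);
}

/* Read them back; a real posix xlator would fall back to the kernel's
 * stat times if the xattr is missing. */
static int load_times(const char *path, struct stored_times *t)
{
        ssize_t len = getxattr(path, TIME_XATTR, t, sizeof(*t));
        return (len == (ssize_t) sizeof(*t)) ? 0 : -1;
}

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        struct timespec now;
        clock_gettime(CLOCK_REALTIME, &now);

        struct stored_times t = {
                .atime_sec = now.tv_sec, .atime_nsec = now.tv_nsec,
                .mtime_sec = now.tv_sec, .mtime_nsec = now.tv_nsec,
                .ctime_sec = now.tv_sec, .ctime_nsec = now.tv_nsec,
        };

        if (store_times(argv[1], &t) != 0) {
                perror("setxattr");
                return 1;
        }

        struct stored_times back;
        if (load_times(argv[1], &back) == 0)
                printf("ctime stored in xattr: %llu\n",
                       (unsigned long long) back.ctime_sec);
        return 0;
}

In a real implementation this would live inside storage/posix and would use the time carried in the fop's 'glusterfs.time' xattr instead of the local clock; stat/lookup would then be answered from the stored values.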
> The additional I/O needed in posix could be minimized by implementing a metadata cache in storage/posix that would read all metadata on lookup and update it on disk only at regular intervals and/or on invalidation. All fops would read from and write into the cache. This would even reduce the number of I/Os we are currently doing for each fop.
>
> >>>>>>>>>> JOE: The idea of a metadata cache is cool for read workloads, but for writes we would end up doing double writes to the disk, i.e. one for the actual write and one to update the xattr. IMHO we cannot have it in a write-back cache (periodic flush to disk), and ctime/mtime/atime data loss or inconsistency will be a problem. Your thoughts?

If we want to have everything in physical storage at all times, gluster will be slow. We only need to be posix compliant, and posix allows some degree of "inconsistency" here, i.e. we are not forced to write to physical storage until the user application sends a flush or similar request. Note that there are xlators that currently take advantage of this, for example write-behind and md-cache. Almost all file systems (if not all) rely on this to improve performance; otherwise they would be really slow.

>>>>>>>>>>> JOE: Agree.

Of course this could cause a temporary inconsistency between bricks, but since all cluster xlators (dht, afr and ec) use special xattrs to track consistency, a crash before flushing the metadata could be detected and repaired (with additional care, even a crash while flushing the metadata could be detected).

>>>>>>>>>> JOE: Well, I am fine with the cache approach, but what level of fault tolerance is acceptable is another question here. Remember that we are building a cache on top of a cache (the Linux system cache) for posix metadata. IMHO a configurable option should be provided to get deterministic consistency, for example how often the metadata should be flushed. I understand the performance implication, but it should be configurable. The reason I am excited about this is that, a long time back when we were thinking about WORM-Retention, our major worry was gluster's control over utime/mtime/ctime and what the cost of maintaining this extra metadata would be. By giving control over the server-side metadata cache settings, we could configure it for the desired consistency and performance.

~Joe
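PS: To make the "configurable flush" idea a bit more concrete, here is a rough standalone sketch (not GlusterFS code; the names, the 3-second default and the flush hooks are all invented) of a write-back metadata cache entry whose flush interval is tunable:

/* Sketch of a per-inode metadata entry that is updated in memory on
 * every fop and written back only when it has been dirty longer than
 * a configurable interval, or when the application explicitly asks
 * for a flush. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct md_cache_entry {
        uint64_t atime_sec;        /* cached times, served from memory  */
        uint64_t mtime_sec;
        uint64_t ctime_sec;
        bool     dirty;            /* true if memory is ahead of disk   */
        time_t   last_flush;       /* when we last wrote to the backend */
};

/* Tunable: how stale the on-disk copy may get, in seconds. */
static int md_flush_interval = 3;

/* Placeholder backend writer -- in a real xlator this would be the
 * setxattr() on the brick file shown in the earlier sketch. */
static void write_back(const char *path, struct md_cache_entry *e)
{
        printf("flushing times of %s to the backend\n", path);
        e->dirty = false;
        e->last_flush = time(NULL);
}

/* Called after every metadata update: only flush if the entry has
 * been dirty for longer than the configured interval. */
static void maybe_flush(const char *path, struct md_cache_entry *e)
{
        if (e->dirty && time(NULL) - e->last_flush >= md_flush_interval)
                write_back(path, e);
}

/* Called on flush()/fsync() from the application or on invalidation:
 * here we must not defer, to stay posix compliant. */
static void force_flush(const char *path, struct md_cache_entry *e)
{
        if (e->dirty)
                write_back(path, e);
}

int main(void)
{
        struct md_cache_entry e = { .dirty = false, .last_flush = time(NULL) };

        /* A write fop updates mtime/ctime only in memory. */
        e.mtime_sec = e.ctime_sec = (uint64_t) time(NULL);
        e.dirty = true;

        maybe_flush("/bricks/b1/file", &e);   /* usually skipped      */
        force_flush("/bricks/b1/file", &e);   /* always hits the disk */
        return 0;
}

Setting md_flush_interval to 0 would give the fully deterministic behaviour asked for above, at the cost of one backend update per fop; larger values trade staleness of the on-disk times for fewer writes, while the cluster xlators' consistency xattrs would still allow detecting a crash that happened before a flush.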