On 03/15/2017 11:31 PM, Soumya Koduri wrote:
> Hi Rafi,
>
> I haven't thoroughly gone through the design, but I have a few
> comments/queries which I have posted inline for now.
>
> On 02/28/2017 01:11 PM, Mohammed Rafi K C wrote:
>> Thanks for the reply, comments are inline.
>>
>> On 02/28/2017 12:50 PM, Niels de Vos wrote:
>>> On Tue, Feb 28, 2017 at 11:21:55AM +0530, Mohammed Rafi K C wrote:
>>>> Hi All,
>>>>
>>>> We discussed the problem $subject in the mail thread [1]. Based on
>>>> the comments and suggestions I will summarize the design (made as
>>>> points for simplicity):
>>>>
>>>> 1) As part of each fop, the top layer will generate a timestamp
>>>> and pass it down along with the other parameters.
>>>>
>>>>     1.1) This will bring a dependency on NTP-synced clients along
>>>>     with the servers.
>>>
>>> What do you mean with "top layer"? Is this on the Gluster client,
>>> or does the time get inserted on the bricks?
>>
>> It is the top layer (master xlator) in the client graph, like fuse,
>> gfapi, or nfs. My mistake, I should have mentioned that. Sorry.
>
> These clients shouldn't include internal client processes like the
> rebalance and self-heal daemons, right? IIUC from [1], we should
> avoid changing times during rebalance and self-heals.
>
> Also, what about fops generated from the underlying layers, e.g.
> getxattr/setxattr, which may modify these time attributes?

Since the timestamps are appended by master xlators like fuse, we will
not have a timestamp for internal daemons, as they don't have a master
xlator loaded. Internal fops won't generate a new timestamp either;
even if we send an internal fop from, say, dht, it will carry only the
one time generated by fuse. So I think this is fine.

>>> I think we should not require a hard dependency on NTP, but have it
>>> strongly suggested. Having a synced time in a clustered environment
>>> is always helpful for reading and matching logs.
>>
>> Agreed, but if we go with option 1, where we generate the time on
>> the client, then the times will not be in sync unless NTP is used.
>>
>>>> 1.2) There can be a difference in time if the fop gets stuck in an
>>>> xlator for various reasons, for example because of locks.
>>>
>>> Or just slow networks? Blocking (mandatory?) locks should be
>>> handled correctly. The time a FOP is blocked can be long.
>>
>> True. The question is whether this can be reflected in the timestamp
>> value, because if it is generated by, say, fuse, then by the time it
>> reaches the brick the clock may have moved ahead. What do you think
>> about that?
>>
>>>> 2) On the server, the posix layer stores the value in memory
>>>> (inode ctx) and will sync the data periodically to disk as an
>>>> extended attribute.
>
> Will you use any timer thread for the asynchronous update?

Yes, maybe a timer thread.

>>>> 2.1) Of course, a sync call will also force it. And if a fop comes
>>>> for an inode which is not linked, we do the sync immediately.
>>>
>>> Does it need to be in the posix layer?
>>
>> You mean storing the time attr? Then it need not be; protocol/server
>> is another candidate, but I feel posix is ahead in the race ;).
>
> I agree with Shyam and Niels that the posix layer doesn't seem right.
> Since having this support comes with a performance cost, how about a
> separate xlator (which shall be optional)?

I take this as a strong point. But I still want to clarify for myself
the performance drop from the periodic sync. I will do a PoC on that.

>>>> 3) Each time an inode is created or initialized, we read the data
>>>> from disk and store it.
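
To make 2) and 3) above concrete, here is a minimal standalone sketch
of the idea: keep the times in memory (as the inode ctx would), flush
them to an extended attribute from a periodic timer, and read them
back when an inode is initialized. This is plain C against
getxattr(2)/setxattr(2), not Gluster's internal API; the xattr key and
all identifiers here are hypothetical:

/* Hedged sketch of design points 2) and 3): an in-memory time cache
 * flushed to an xattr periodically and loaded on inode init.
 * The xattr key and every name below are made up for illustration. */
#include <stdint.h>
#include <string.h>
#include <sys/xattr.h>

#define TIME_XATTR "trusted.glusterfs.mdata"   /* assumed key */

struct mdata {              /* on-disk layout of the xattr value */
    uint64_t ctime;
    uint64_t mtime;
    uint64_t atime;
};

struct time_ctx {           /* what the inode ctx would hold */
    struct mdata md;
    int dirty;              /* set on update, cleared on flush */
};

/* 3) On inode init/link, read the stored times back from disk. */
static void time_ctx_load(const char *path, struct time_ctx *tc)
{
    if (getxattr(path, TIME_XATTR, &tc->md, sizeof(tc->md)) < 0)
        memset(&tc->md, 0, sizeof(tc->md));  /* no xattr yet */
    tc->dirty = 0;
}

/* 2)/2.1) Called from a periodic timer thread, or forced by fsync
 * and by fops that arrive for an unlinked inode. */
static int time_ctx_flush(const char *path, struct time_ctx *tc)
{
    if (!tc->dirty)
        return 0;                            /* nothing to write */
    if (setxattr(path, TIME_XATTR, &tc->md, sizeof(tc->md), 0) < 0)
        return -1;
    tc->dirty = 0;
    return 0;
}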
>>>> 4) Before setting the value in the inode ctx, we compare the
>>>> stored timestamp with the received timestamp, and only store the
>>>> received value if the stored value is less than it.
>
> If we choose not to set this attribute for the self-heal/rebalance
> daemons (as stated above), we would need special handling for the
> requests sent by them (i.e., to heal this time attribute as well on
> the destination file/dir).

I hope the explanation above answers your question.

>>>> 5) So in the best case, the data will be stored in and retrieved
>>>> from memory. We replace the values in the iatt with the values in
>>>> the inode ctx.
>>>>
>>>> 6) File ops that change the parent directory's attr times need to
>>>> be consistent across all the distributed directories across the
>>>> subvolumes. (For example, a create call will change the ctime and
>>>> mtime of the parent dir.)
>>>>
>>>>     6.1) This has to be handled separately, because we only send
>>>>     the fop to the hashed subvolume.
>>>>
>>>>     6.2) We can asynchronously send the time-update setattr fop to
>>>>     the other subvolumes and change the values for the parent
>>>>     directory if the file fop is successful on the hashed
>>>>     subvolume.
>
> The same needs to be handled even during DHT directory healing,
> right?

True.

>>>> 6.3) This will leave a window where the times are inconsistent
>>>> across the dht subvolumes. (Please provide your suggestions.)
>>>
>>> Isn't this the same problem for 'normal' AFR volumes? I guess
>>> self-heal needs to know how to pick the right value for the
>>> [cm]time xattr.
>>
>> Yes, and it needs to be healed, by both self-heal and dht. But until
>> then there can be a difference in the values.
>
> Is this design targeting synchronizing only ctime/mtime? If 'atime'
> is also considered, then since a read/stat done by AFR modifies atime
> only on the first subvol, even the AFR xlator needs to take care of
> updating the other subvols. The same goes for EC as well.

Actually we can extend the effort without many changes. I personally
wanted to do that, but it also depends on the actual use case.

> Thanks,
> Soumya
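
To illustrate 1) and 4) from the design above: the wall-clock time is
read exactly once per fop, at the master xlator, and a stored time is
only ever moved forward. A rough sketch follows (hypothetical names;
in the real design the stamp would travel down with the fop
parameters/xdata, not as a bare integer):

/* Hedged sketch of design points 1) and 4). */
#include <stdint.h>
#include <time.h>

/* 1) Generate the timestamp once per fop at the master xlator
 * (fuse/gfapi/nfs); every layer below it reuses this one value. */
static uint64_t fop_timestamp(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* 4) Compare-and-store: only move the stored time forward. This is
 * what keeps delayed, reordered, or replayed updates (e.g. the async
 * setattr to non-hashed subvolumes in 6.2) from turning time back. */
static void time_store(uint64_t *stored, uint64_t received)
{
    if (received > *stored)
        *stored = received;
}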