On 03/15/2017 11:31 PM, Soumya Koduri wrote:
> Hi Rafi,
>
> I haven't thoroughly gone through the design, but I have a few
> comments/queries which I have posted inline for now.
>
> On 02/28/2017 01:11 PM, Mohammed Rafi K C wrote:
>> Thanks for the reply, comments are inline.
>>
>> On 02/28/2017 12:50 PM, Niels de Vos wrote:
>>> On Tue, Feb 28, 2017 at 11:21:55AM +0530, Mohammed Rafi K C wrote:
>>>> Hi All,
>>>>
>>>> We discussed the problem $subject in the mail thread [1]. Based on
>>>> the comments and suggestions I will summarize the design (made as
>>>> points for simplicity):
>>>>
>>>> 1) As part of each fop, the top layer will generate a timestamp
>>>> and pass it down along with the other parameters.
>>>>
>>>>     1.1) This will bring a dependency on NTP-synced clients along
>>>>     with the servers.
>>>
>>> What do you mean with "top layer"? Is this on the Gluster client,
>>> or does the time get inserted on the bricks?
>>
>> It is the top layer (master xlator) in the client graph, like fuse,
>> gfapi, or nfs. My mistake, I should have mentioned that. Sorry.
>
> These clients shouldn't include internal client processes like the
> rebalance and self-heal daemons, right? IIUC from [1], we should
> avoid changing times during rebalance and self-heals.
>
> Also, what about fops generated from the underlying layers, e.g.
> getxattr/setxattr, which may modify these time attributes?

Since the timestamps are appended by master xlators like fuse, we will
not have a timestamp for internal daemons, as they don't have a master
xlator loaded. Internal fops won't generate a new timestamp either;
even if we send an internal fop from, say, dht, it will carry only the
one time generated by fuse. So I think this is fine.

>>> I think we should not require a hard dependency on NTP, but have it
>>> strongly suggested. Having a synced time in a clustered environment
>>> is always helpful for reading and matching logs.
>>
>> Agreed, but if we go with option 1, where we generate the time on
>> the client, then the times will not be in sync unless NTP is used.
>>
>>>> 1.2) There can be a difference in time if the fop gets stuck in an
>>>> xlator for various reasons, for example because of locks.
>>>
>>> Or just slow networks? Blocking (mandatory?) locks should be
>>> handled correctly. The time a FOP is blocked can be long.
>>
>> True. The question is whether this can be reflected in the timestamp
>> value, because if it is generated by, say, fuse, then by the time it
>> reaches the brick the clock may have moved ahead. What do you think
>> about that?
>>
>>>> 2) On the server, the posix layer stores the value in memory
>>>> (inode ctx) and will sync the data periodically to disk as an
>>>> extended attribute.
>
> Will you use any timer thread for the asynchronous update?

Yes, maybe a timer thread.

>>>> 2.1) Of course, a sync call will also force it. And if a fop comes
>>>> for an inode which is not linked, we do the sync immediately.
>>>
>>> Does it need to be in the posix layer?
>>
>> You mean storing the time attr? Then it need not be; protocol/server
>> is another candidate, but I feel posix is ahead in the race ;).
>
> I agree with Shyam and Niels that the posix layer doesn't seem right.
> Since having this support comes with a performance cost, how about a
> separate xlator (which shall be optional)?

I take this as a strong point. But I still want to clarify for myself
the performance drop from the periodic sync. I will do a PoC on that.

>>>> 3) Each time an inode is created or initialized, we read the data
>>>> from disk and store it.
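
To make 2) and 3) above concrete, here is a minimal standalone sketch
of the idea: keep the times in memory (as the inode ctx would), flush
them to an extended attribute from a periodic timer, and read them
back when an inode is initialized. This is plain C against
getxattr(2)/setxattr(2), not Gluster's internal API; the xattr key and
all identifiers here are hypothetical:

/* Hedged sketch of design points 2) and 3): an in-memory time cache
 * flushed to an xattr periodically and loaded on inode init.
 * The xattr key and every name below are made up for illustration. */
#include <stdint.h>
#include <string.h>
#include <sys/xattr.h>

#define TIME_XATTR "trusted.glusterfs.mdata"   /* assumed key */

struct mdata {              /* on-disk layout of the xattr value */
    uint64_t ctime;
    uint64_t mtime;
    uint64_t atime;
};

struct time_ctx {           /* what the inode ctx would hold */
    struct mdata md;
    int dirty;              /* set on update, cleared on flush */
};

/* 3) On inode init/link, read the stored times back from disk. */
static void time_ctx_load(const char *path, struct time_ctx *tc)
{
    if (getxattr(path, TIME_XATTR, &tc->md, sizeof(tc->md)) < 0)
        memset(&tc->md, 0, sizeof(tc->md));  /* no xattr yet */
    tc->dirty = 0;
}

/* 2)/2.1) Called from a periodic timer thread, or forced by fsync
 * and by fops that arrive for an unlinked inode. */
static int time_ctx_flush(const char *path, struct time_ctx *tc)
{
    if (!tc->dirty)
        return 0;                            /* nothing to write */
    if (setxattr(path, TIME_XATTR, &tc->md, sizeof(tc->md), 0) < 0)
        return -1;
    tc->dirty = 0;
    return 0;
}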
>>>> 4) Before setting the value in the inode ctx, we compare the
>>>> stored timestamp with the received timestamp, and only store the
>>>> received value if the stored value is less than it.
>
> If we choose not to set this attribute for the self-heal/rebalance
> daemons (as stated above), we would need special handling for the
> requests sent by them (i.e., to heal this time attribute as well on
> the destination file/dir).

I hope the explanation above answers your question.

>>>> 5) So in the best case, the data will be stored in and retrieved
>>>> from memory. We replace the values in the iatt with the values in
>>>> the inode ctx.
>>>>
>>>> 6) File ops that change the parent directory's attr times need to
>>>> be consistent across all the distributed directories across the
>>>> subvolumes. (For example, a create call will change the ctime and
>>>> mtime of the parent dir.)
>>>>
>>>>     6.1) This has to be handled separately, because we only send
>>>>     the fop to the hashed subvolume.
>>>>
>>>>     6.2) We can asynchronously send the time-update setattr fop to
>>>>     the other subvolumes and change the values for the parent
>>>>     directory if the file fop is successful on the hashed
>>>>     subvolume.
>
> The same needs to be handled even during DHT directory healing,
> right?

True.

>>>> 6.3) This will leave a window where the times are inconsistent
>>>> across the dht subvolumes. (Please provide your suggestions.)
>>>
>>> Isn't this the same problem for 'normal' AFR volumes? I guess
>>> self-heal needs to know how to pick the right value for the
>>> [cm]time xattr.
>>
>> Yes, and it needs to be healed, by both self-heal and dht. But until
>> then there can be a difference in the values.
>
> Is this design targeting synchronizing only ctime/mtime? If 'atime'
> is also considered, then since a read/stat done by AFR modifies atime
> only on the first subvol, even the AFR xlator needs to take care of
> updating the other subvols. The same goes for EC as well.

Actually we can extend the effort without many changes. I personally
wanted to do that, but it also depends on the actual use case.

> Thanks,
> Soumya
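
To illustrate 1) and 4) from the design above: the wall-clock time is
read exactly once per fop, at the master xlator, and a stored time is
only ever moved forward. A rough sketch follows (hypothetical names;
in the real design the stamp would travel down with the fop
parameters/xdata, not as a bare integer):

/* Hedged sketch of design points 1) and 4). */
#include <stdint.h>
#include <time.h>

/* 1) Generate the timestamp once per fop at the master xlator
 * (fuse/gfapi/nfs); every layer below it reuses this one value. */
static uint64_t fop_timestamp(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* 4) Compare-and-store: only move the stored time forward. This is
 * what keeps delayed, reordered, or replayed updates (e.g. the async
 * setattr to non-hashed subvolumes in 6.2) from turning time back. */
static void time_store(uint64_t *stored, uint64_t received)
{
    if (received > *stored)
        *stored = received;
}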