Re: Consistent time attributes (ctime, atime and mtime) across replica set and distribution set

Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> · Thu, 16 Mar 2017 03:13:48 +0530

On Wed, Mar 15, 2017 at 11:31 PM, Soumya Koduri <skoduri@xxxxxxxxxx> wrote:
Hi Rafi,

I haven't gone through the design thoroughly, but I have a few comments/queries, which I have posted inline for now.

On 02/28/2017 01:11 PM, Mohammed Rafi K C wrote:

Thanks for the reply. Comments are inline.

On 02/28/2017 12:50 PM, Niels de Vos wrote:

On Tue, Feb 28, 2017 at 11:21:55AM +0530, Mohammed Rafi K C wrote:

Hi All,

We discussed the problem $subject in the mail thread [1]. Based on the

comments and suggestions, I will summarize the design (made as points for

simplicity.)

1) As part of each fop, the top layer will generate a timestamp and pass it

down along with the other params.

    1.1) This will introduce a dependency on NTP-synced clients as well as

servers.

What do you mean with "top layer"? Is this on the Gluster client, or

does the time get inserted on the bricks?

It is the top layer (master xlator) in the client graph, like fuse, gfapi,

nfs. My mistake, I should have mentioned it. Sorry for that.

These clients shouldn't include internal client processes like rebalance and self-heal daemons, right? IIUC from [1], we should avoid changing times during rebalance and self-heals.

Also, what about fops generated from the underlying layers (getxattr/setxattr) which may modify these time attributes?

I think we should not require a hard dependency on NTP, but have it

strongly suggested. Having a synced time in a clustered environment is

always helpful for reading and matching logs.

Agreed, but if we go with option 1, where we generate the time on the client,

then the times will not be in sync unless the clients use NTP.

    1.2) There can be a difference in time if the fop is stuck in an xlator for

various reasons, e.g. because of locks.

Or just slow networks? Blocking (mandatory?) locks should be handled

correctly. The time a FOP is blocked can be long.

True. The question is whether this can be included in the timestamp value, because if

it is generated at, say, fuse, then by the time it reaches the brick the time

may have moved ahead. What do you think about it?

2) On the server, the posix layer stores the value in memory (inode ctx)

and syncs the data periodically to disk as an extended attribute.

Will you use a timer thread for the asynchronous update?

     2.1) Of course, a sync call will also force it. And if a fop comes for an

inode which is not linked, we do the sync immediately.
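
Points 2 and 2.1 can be illustrated with a minimal sketch, written in Python for brevity. It is not the proposed implementation: the class `TimeCache`, its fields, the `linked` flag and the `flush_interval` parameter are all hypothetical, a plain dict stands in for the on-disk extended attribute, and the periodic flush is checked on each write instead of by a timer thread (whether a timer thread is used is still an open question in this thread).

```python
import time


class TimeCache:
    """Hypothetical stand-in for the time values kept in the inode ctx."""

    def __init__(self, flush_interval=5.0, clock=time.monotonic):
        self.flush_interval = flush_interval  # assumed tunable, in seconds
        self.clock = clock
        self.memory = {}   # inode id -> timestamp held in memory
        self.disk = {}     # inode id -> timestamp persisted (xattr stand-in)
        self.dirty = set() # inodes whose in-memory value is not yet on disk
        self.last_flush = clock()

    def record(self, inode, timestamp, linked=True):
        """Store the timestamp in memory and decide whether to persist it."""
        self.memory[inode] = timestamp
        self.dirty.add(inode)
        if not linked:
            # Point 2.1: an inode that is not linked is synced immediately.
            self._persist(inode)
        elif self.clock() - self.last_flush >= self.flush_interval:
            # Point 2: periodic sync of everything that is dirty.
            self.sync()

    def _persist(self, inode):
        self.disk[inode] = self.memory[inode]
        self.dirty.discard(inode)

    def sync(self):
        """Persist every dirty entry; an explicit sync call would land here."""
        for inode in list(self.dirty):
            self._persist(inode)
        self.last_flush = self.clock()
```

In this sketch a crash between two flushes loses the in-memory values of linked inodes, which is the trade-off the periodic sync accepts in exchange for fewer xattr writes.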

Does it need to be in the posix layer?

You mean storing the time attr? Then it need not be; protocol/server

is another candidate, but I feel posix is ahead in the race ;).

I agree with Shyam and Niels that the posix layer doesn't seem right. Since this support comes with a performance cost, how about a separate xlator (which would be optional)?

3) Each time an inode is created or initialized, it reads the data

from disk and stores it.

4) Before setting it in the inode_ctx, we compare the stored timestamp with the

timestamp received, and only store it if the stored value is less than

the received value.
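
The comparison in point 4 amounts to a monotonic update: a stored time only ever moves forward. A minimal Python sketch, assuming nanosecond integer timestamps and a hypothetical `TimeAttrs` record (neither is specified in this thread):

```python
from dataclasses import dataclass


@dataclass
class TimeAttrs:
    """Hypothetical per-inode time attributes, in nanoseconds since the epoch."""
    ctime: int = 0
    mtime: int = 0
    atime: int = 0


def update_if_newer(stored: TimeAttrs, received: TimeAttrs) -> bool:
    """Advance each stored field only when the received value is larger.

    Returns True if any field changed, i.e. the in-memory copy is now dirty
    and would need to be synced to disk.
    """
    changed = False
    for field in ("ctime", "mtime", "atime"):
        new = getattr(received, field)
        if getattr(stored, field) < new:
            setattr(stored, field, new)
            changed = True
    return changed
```

With this rule, fops that arrive at the brick out of order cannot move a timestamp backwards. It also means a client with a clock running ahead wins until the other clocks catch up, which is why the NTP dependency in 1.1 matters.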

If we choose not to set this attribute for the self-heal/rebalance daemons (as stated above), we would need special handling for the requests sent by them (i.e., to heal this time attribute as well on the destination file/dir).

5) So in the best case, the data will be stored in and retrieved from memory. We

replace the values in iatt with the values in inode_ctx.

6) File ops that change the parent directory's time attrs need to be

consistent across all the distributed directories across the subvolumes.

(e.g. a create call will change the ctime and mtime of the parent dir)

     6.1) This has to be handled separately because we only send the fop to

the hashed subvolume.

     6.2) We can asynchronously send the time-update setattr fop to the

other subvolumes and change the values for the parent directory if the file

fop is successful on the hashed subvolume.
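
The flow in 6.1 and 6.2 can be sketched as follows. This is an illustrative Python model only: `Subvolume`, `set_dir_times` and `create_file` are hypothetical names, a thread pool stands in for the asynchronous setattr, and times are plain integers.

```python
from concurrent.futures import ThreadPoolExecutor


class Subvolume:
    """Hypothetical stand-in for one DHT subvolume holding directory times."""

    def __init__(self, name):
        self.name = name
        self.dir_times = {}  # directory path -> (ctime, mtime)

    def create(self, parent, timestamp):
        # A create changes the ctime and mtime of the parent directory.
        self.dir_times[parent] = (timestamp, timestamp)
        return True

    def set_dir_times(self, parent, times):
        # Stand-in for the time-update setattr; keep the newer value per field.
        current = self.dir_times.get(parent, (0, 0))
        self.dir_times[parent] = (max(current[0], times[0]),
                                  max(current[1], times[1]))


def create_file(parent, timestamp, hashed, others, pool):
    """Run the create on the hashed subvolume, then fan the times out.

    Returns the futures of the asynchronous updates so a caller can wait
    on them if it wants to.
    """
    if not hashed.create(parent, timestamp):
        return []  # the fop failed, so nothing is propagated
    times = hashed.dir_times[parent]
    return [pool.submit(sv.set_dir_times, parent, times) for sv in others]


subvols = [Subvolume("s0"), Subvolume("s1"), Subvolume("s2")]
with ThreadPoolExecutor(max_workers=2) as pool:
    for future in create_file("/dir", 42, subvols[0], subvols[1:], pool):
        future.result()  # wait only so the example is deterministic
```

Until the submitted updates complete, the non-hashed subvolumes still hold the older directory times.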

The same needs to be handled even during DHT directory healing, right?

     6.3) This will have a window where the times are inconsistent

across dht subvolumes (Please provide your suggestions)

Isn't this the same problem for 'normal' AFR volumes? I guess self-heal

needs to know how to pick the right value for the [cm]time xattr.

Yes, and it needs to be healed, by both self-heal and dht. But until then there can be

a difference in values.

Is this design targeting synchronization of only ctime/mtime? If 'atime' is also considered, then since the read/stat done by AFR would modify atime only on the first subvol, the AFR xlator also needs to take care of updating the other subvols. The same goes for EC.

atime is updated on open, which is sent to all subvols in AFR/EC.

Thanks,

Soumya

_______________________________________________

Gluster-devel mailing list

Gluster-devel@xxxxxxxxxxx

http://lists.gluster.org/mailman/listinfo/gluster-devel

-- 
Pranith
