On 03/08/2017 04:26 AM, Mohammed Rafi K C wrote:
On 03/07/2017 08:28 PM, Shyam wrote:
On 02/28/2017 12:51 AM, Mohammed Rafi K C wrote:
- I would like to see the design in such a form that, when server side
replication and disperse become a reality, parts of the
design/implementation remain the same. This way work need not be
redone when server side anything happens. (I would have extended this
to adapt to DHT2 as well, but will leave that to you)
- As this evolves we need to assign clear responsibilities to
different xlators that we are discussing, which will eventually happen
anyway, but just noting it ahead for clarity as we discuss this further.
I feel it is much more flexible with server side replication/ec,
because we have a leader where we can control behavior such as
generating the timestamp, syncing the data to the other replicas, etc. I
need to think about dht2. Nevertheless, I will include these details in a
document.
Yes, I agree that with server side replication/ec this becomes clearer,
and possibly with less timesync requirements on the clients. Noting them
down would help, thanks.
2) On the server, the posix layer stores the value in memory (inode ctx)
and will sync the data periodically to disk as an extended attr
I believe this should not be a part of the posix xlator. Our posix
store definition is that it uses a local file-system underneath that
is POSIX compliant. This need, for storing the time information
outside of POSIX specification (as an xattr or otherwise), is
something that is gluster specific. As a result we should not fold
this into the posix xlator.
For example, if we replace posix later with a db store, or a key-value
store, we would need to code this cache management of time information
again for these stores, but if we abstract it out, then we do not need
to do the same.
I agree that we may have to re-implement this if we couple it with the
posix xlator. But this is a very small piece of code, where we store the
time in the inode ctx and sync it when required. Also, as Amar pointed
out, each back-end store may have different behavior. We can write this
in an abstract way so that we can re-use it tomorrow. But IMHO, I don't
see this as an xlator.
Re-use is primarily what I am looking at. This looks more like a problem
of "when will inode metadata be flushed to disk", time stamps
being one of the criteria, so implementing it with good abstractions
will help, thanks.
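To make the abstraction point above concrete, here is a minimal sketch of
an in-memory time cache that flushes through a pluggable store. It is only
an illustration, not Gluster code: the names TimeStore, XattrLikeStore and
TimeCache, the flush_interval parameter and the dict standing in for the
on-disk xattr are all assumptions made for this example.

```python
import time
from abc import ABC, abstractmethod


class TimeStore(ABC):
    """Hypothetical back-end interface: persists time attributes of an inode."""

    @abstractmethod
    def persist(self, inode_id, times):
        ...


class XattrLikeStore(TimeStore):
    """Stand-in for a POSIX back end; a dict plays the role of the xattr."""

    def __init__(self):
        self.on_disk = {}

    def persist(self, inode_id, times):
        self.on_disk[inode_id] = dict(times)


class TimeCache:
    """Keeps times in memory (the 'inode ctx') and flushes dirty entries."""

    def __init__(self, store, flush_interval=5.0, clock=time.monotonic):
        self.store = store
        self.flush_interval = flush_interval  # assumed tunable, in seconds
        self.clock = clock
        self.ctx = {}       # inode_id -> {"ctime": ..., "mtime": ...}
        self.dirty = set()  # inodes whose cached times are not yet persisted
        self.last_flush = clock()

    def update(self, inode_id, **times):
        # Record the new times in memory and flush if the interval elapsed.
        self.ctx.setdefault(inode_id, {}).update(times)
        self.dirty.add(inode_id)
        if self.clock() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        # Hand every dirty entry to the back end, whatever that back end is.
        for inode_id in sorted(self.dirty):
            self.store.persist(inode_id, self.ctx[inode_id])
        self.dirty.clear()
        self.last_flush = self.clock()
```

With this split, replacing the posix back end with a db or key-value store
would only need another TimeStore implementation; the caching and flush
policy in TimeCache would stay as it is.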
6) File ops that change the parent directory attr time need to be
consistent across all the distributed directories across the subvolumes.
(for eg: a create call will change ctime and mtime of parent dir)
6.1) This has to be handled separately because we only send the fop to
the hashed subvolume.
6.2) We can asynchronously send the time update setattr fop to the
other subvolumes and change the values for the parent directory if the
file fop is successful on the hashed subvolume.
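A rough sketch of what 6.1 and 6.2 describe, assuming the create goes to
the hashed subvolume first and the parent directory times are then pushed
to the remaining subvolumes in the background. Subvol, set_time and
create_and_propagate are hypothetical names for this illustration, and the
"only move time forward" rule in set_time is an added assumption, not
something stated in the proposal.

```python
from concurrent.futures import ThreadPoolExecutor


class Subvol:
    """Hypothetical stand-in for one DHT subvolume holding directory times."""

    def __init__(self, name):
        self.name = name
        self.dir_times = {}  # dir path -> {"ctime": ..., "mtime": ...}

    def create(self, parent, fname, now):
        # A create updates the parent's ctime and mtime on this subvolume only.
        self.dir_times[parent] = {"ctime": now, "mtime": now}
        return True

    def set_time(self, parent, times):
        # Assumption: never move a time backwards, so a late update cannot
        # overwrite a newer value.
        cur = self.dir_times.get(parent, {"ctime": 0, "mtime": 0})
        self.dir_times[parent] = {k: max(cur[k], times[k]) for k in cur}


def create_and_propagate(hashed, others, parent, fname, now, pool):
    """Create on the hashed subvolume, then update the others asynchronously."""
    if not hashed.create(parent, fname, now):
        return []  # the fop failed, so there is nothing to propagate
    times = dict(hashed.dir_times[parent])
    return [pool.submit(sv.set_time, parent, times) for sv in others]


if __name__ == "__main__":
    svs = [Subvol("s0"), Subvol("s1"), Subvol("s2")]
    with ThreadPoolExecutor(max_workers=2) as pool:
        for fut in create_and_propagate(svs[0], svs[1:], "/dir", "f", 200, pool):
            fut.result()
    print({sv.name: sv.dir_times["/dir"] for sv in svs})
```

Until the returned futures complete, the non-hashed subvolumes still hold
the old parent times, which is the inconsistency window item 6.3 asks
about.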
Am I right in understanding that this is not part of the solution, and
just a suggestion on what we may do in the future, or is it part of
the solution proposed?
If we have an agreement from dht maintainers, I'm ready to take dht part
also as part of this effort :) .
For a full solution, this would be a must, right? else, we again solve
it part way and leave some gaps. I would like to see the problem
addressed in whole if possible as the timestamp issue has been addressed
in parts for way too long IMHO.
From a DHT POV, we assimilate time stamp from all directories and use
the highest. Considering this, if a subvolume that was last updated for
a directory (due to a create/unlink or other call that updated parent
timestamps), goes down, the time information may be incorrect. But, the
cluster is running in a degraded mode anyway, as an entire subvol to DHT
is down (and hence access, modifications and creations that land there
are not possible); in such a situation returning stale timestamps may be
an option. Du, thoughts?
Basically, yes we can asynchronously update time xattr, as stated, but
maybe we can live without it as well?
If the latter (i.e part of the solution proposed), which layer has the
responsibility to asynchronously update other DHT subvolumes?
For example, posix-xlator does not have that knowledge, so it should
not be that xlator, which means it is some other xlator, now is that
on the client or on the server, and how is it crash consistent are
some things that come to mind when reading this, but will wait for the
details before thinking aloud.
Yes, in the proposed solution it is dht that has to initiate the fop to
sync the time attributes to the other subvolumes (synchronously or
asynchronously) after, let's say, a create fop in the hashed subvol (just
an eg ;) ).
I'm totally in agreement on crash consistency, but thinking broadly,
posix normally doesn't guarantee the persistence of data unless
there is an explicit sync call. I thought we could treat this as a
cache coherence problem as well. What do you think?
The issue that I see here is, one of the subvolumes is updated, and the
other not, so when the subvolume that is updated goes down and then
comes back up again, we are really not keeping the guarantee, we will
return an older timestamp one time and a newer one later.
IOW, in POSIX this would still be the older timestamp, whereas in our
case this would be a newer timestamp, as one subvol of the directory got
updated.
The reasoning above is to ask whether what we end up doing can be punted
like the posix guarantee at all.
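The sequence described above can be played out in a few lines, assuming
the directory mtime reported to the client is the highest value among the
subvolumes that are currently up (the DHT behaviour mentioned earlier in
the thread). The subvolume names and the timestamp values are made up for
the illustration.

```python
def observed_mtime(mtimes, up):
    """Highest directory mtime among the subvolumes that are currently up."""
    return max(mtimes[name] for name in up)


# Both subvolumes start with the same directory mtime.
mtimes = {"s0": 100, "s1": 100}
timeline = []

# A create lands on s0 (the hashed subvolume); s1 is not updated.
mtimes["s0"] = 200
timeline.append(observed_mtime(mtimes, {"s0", "s1"}))  # 200

# s0 goes down: only the stale value on s1 is visible.
timeline.append(observed_mtime(mtimes, {"s1"}))        # 100

# s0 comes back up: the newer value reappears.
timeline.append(observed_mtime(mtimes, {"s0", "s1"}))  # 200

# True if any observation is older than the one before it.
went_backwards = any(b < a for a, b in zip(timeline, timeline[1:]))
print(timeline, went_backwards)  # [200, 100, 200] True
```

The reported time goes backwards and then forwards again. A local POSIX
file system that lost an unsynced update would keep reporting the older
value instead.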
6.3) This will have a window where the times are inconsistent
across dht subvolume (Please provide your suggestions)
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel