Hi Noah,
I was able to reproduce your issue with a similar test using the fuse
client and the clock_offset option for the mds. This is what I see
happening:
clientA's clock is a few seconds behind the mds clock
clientA creates the file
- the mds sets the mtime from its current time
- clientA acquires the exclusive capability (cap) for the file
clientA writes to the file
- the mtime is updated locally (at clientA with its current time)
clientA closes the file
- the exclusive cap is flushed to the mds, but the mtime is less
than the create mtime because of the clock skew, so the mds
doesn't update it to the mtime from clientA's write
clientA stats the file
- the mtime from the write (still cached) gets returned. I saw a
race in my tests, where sometimes the mtime was from the cache
(if the flush hadn't completed I assume), and sometimes it was
from the mds.
clientB stats the file
- the exclusive cap is revoked at clientA, but the mtime returned
to clientB is from the mds
The goal of the current implementation is to provide an mtime that is
non-decreasing, but that conflicts with using mtime as a version in this
case. Using mtime as a version has its own set of problems, but I won't
go into that here. I think there are a few alternatives if we want to
try to have a more consistent mtime value across clients.
1. Let the client set the create mtime. This avoids the issue that the
mds and client clocks are out of sync, but in other cases where the
client has a clock a few seconds ahead of other clients, we run into a
similar problem. This might be reasonable considering clients that
share state will more likely have synchronized clocks than the clients
and mds.
2. Provide a config option to always set the mtime on cap flush/revoke,
even if its less than the current mtime. This breaks the non-decreasing
behavior, and requires the user set a config option across the cluster
if they want this.
3. When a client acquires the cap for a file, have the mds provide its
current time as well. As the client updates the mtime, it uses the
timestamp provided by the mds and the time since the cap was acquired.
Except for the skew caused by the message latency, this approach allows
the mtime to be based off the mds time, so it will be consistent across
clients and the mds. It does however, allow a client to set an mtime to
the future (based off of its local time), which might be undesirable,
but that is more like how NFS behaves. Message latency probably won't
be much of an issue either, as the granularity of mtime is a second.
Also, the client can set its cap acquired timestamp to the time at which
the cap was requested, ensuring that the relative increment includes the
round trip latency so that the mtime will always be set further ahead.
Of course, this approach would be a lot more intrusive to implement. :-)
-sam
On 11/20/2012 01:44 PM, Noah Watkins wrote:
This is a description of the clock synchronization issue we are facing
in Hadoop:
Components of Hadoop use mtime as a versioning mechanism. Here is an
example where Client B tests the expected 'version' of a file created
by Client A:
Client A: create file, write data into file.
Client A: expected_mtime <-- lstat(file)
Client A: broadcast expected_mtime to client B
...
Client B: mtime <-- lstat(file)
Client B: test expected_mtime == mtime
Since mtime may be set in Ceph by both client and MDS, inconsistent
mtime view is possible when clocks are not adequately synchronized.
Here is a test that reproduces the problem. In the following output,
issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time
set to an hour earlier than the MDS node.
nwatkins@issdm-22:~$ ssh issdm-18 date && ./test
Tue Nov 20 11:40:28 PST 2012 // MDS TIME
local time: Tue Nov 20 10:42:47 2012 // Client TIME
fstat time: Tue Nov 20 11:40:28 2012 // mtime seen after file
creation (MDS time)
lstat time: Tue Nov 20 10:42:47 2012 // mtime seen after file write
(client time)
Here is the code used to produce that output.
#include <errno.h>
#include <sys/fcntl.h>
#include <sys/time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <sys/xattr.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <cephfs/libcephfs.h>
#include <time.h>
int main(int argc, char **argv)
{
struct stat st;
struct ceph_mount_info *cmount;
struct timeval tv;
/* setup */
ceph_create(&cmount, "admin");
ceph_conf_read_file(cmount, "/users/nwatkins/Projects/ceph.conf");
ceph_mount(cmount, "/");
/* print local time for reference */
gettimeofday(&tv, NULL);
printf("local time: %s", ctime(&tv.tv_sec));
/* create a file */
char buf[256];
sprintf(buf, "/somefile.%d", getpid());
int fd = ceph_open(cmount, buf, O_WRONLY|O_CREAT, 0);
assert(fd > 0);
/* get mtime for this new file */
memset(&st, 0, sizeof(st));
int ret = ceph_fstat(cmount, fd, &st);
assert(ret == 0);
printf("fstat time: %s", ctime(&st.st_mtime));
/* write some data into the file */
ret = ceph_write(cmount, fd, buf, sizeof(buf), -1);
assert(ret == sizeof(buf));
ceph_close(cmount, fd);
memset(&st, 0, sizeof(st));
ret = ceph_lstat(cmount, buf, &st);
assert(ret == 0);
printf("lstat time: %s", ctime(&st.st_mtime));
ceph_shutdown(cmount);
return 0;
}
Note that this output is currently using the short patch from
http://marc.info/?l=ceph-devel&m=133178637520337&w=2 which forces
getattr to always go to the MDS.
diff --git a/src/client/Client.cc b/src/client/Client.cc
index 4a9ae3c..2bb24b7 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -3858,7 +3858,7 @@ int Client::readlink(const char *relpath, char
*buf, loff_t \
size)
int Client::_getattr(Inode *in, int mask, int uid, int gid)
{
- bool yes = in->caps_issued_mask(mask);
+ bool yes = false; //in->caps_issued_mask(mask);
ldout(cct, 10) << "_getattr mask " << ccap_string(mask) << "
issued=" << yes << \
dendl; if (yes)
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html