Re: Hadoop and Ceph client/mds view of modification time

Sage Weil <sage@xxxxxxxxxxx> · Tue, 27 Nov 2012 11:59:50 -0800 (PST)

> On 11/27/2012 12:01 PM, Sage Weil wrote:
> > On Tue, 27 Nov 2012, David Zafman wrote:
> > > 
> > > On Nov 27, 2012, at 9:03 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> > > 
> > > > On Tue, 27 Nov 2012, Sam Lang wrote:
> > > > 
> > > > > 3. When a client acquires the cap for a file, have the mds provide
> > > > > its current time as well.  As the client updates the mtime, it uses
> > > > > the timestamp provided by the mds and the time since the cap was
> > > > > acquired.  Except for the skew caused by the message latency, this
> > > > > approach allows the mtime to be based off the mds time, so it will be
> > > > > consistent across clients and the mds.  It does, however, allow a
> > > > > client to set an mtime to the future (based off of its local time),
> > > > > which might be undesirable, but that is more like how NFS behaves.
> > > > > Message latency probably won't be much of an issue either, as the
> > > > > granularity of mtime is a second.  Also, the client can set its cap
> > > > > acquired timestamp to the time at which the cap was requested,
> > > > > ensuring that the relative increment includes the round trip latency
> > > > > so that the mtime will always be set further ahead.  Of course, this
> > > > > approach would be a lot more intrusive to implement. :-)
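
A minimal sketch of option 3 as described above, not Ceph code. It assumes
the cap grant carries the MDS wall-clock time and that the client records a
local monotonic reading when it requests the cap. CapClock, mds_time_ns and
requested_at_ns are hypothetical names used only for illustration.

```python
import time


class CapClock:
    """Hypothetical helper: MDS time at cap grant plus locally measured elapsed time."""

    def __init__(self, mds_time_ns, requested_at_ns):
        # mds_time_ns: MDS wall-clock time sent with the cap (assumed field).
        # requested_at_ns: local monotonic reading taken when the cap was
        # requested, so the elapsed time includes the round trip.
        self.mds_time_ns = mds_time_ns
        self.requested_at_ns = requested_at_ns

    def mtime_ns(self, monotonic_now_ns=None):
        # The client's wall clock is never consulted; only a local interval is.
        if monotonic_now_ns is None:
            monotonic_now_ns = time.monotonic_ns()
        return self.mds_time_ns + (monotonic_now_ns - self.requested_at_ns)
```

Because the interval is measured from the request rather than the grant,
the result can only err ahead of the MDS clock, by at most one round trip.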
> > > > 
> > > > Yeah, I'm less excited about this one.
> > > > 
> > > > I think that giving consistent behavior from a single client despite
> > > > clock skew is a good goal.  That will make things like pjd's test
> > > > behave consistently, for example.
> > > > 
> > > 
> > > My suggestion is that a client writing to a file will try to use its
> > > local clock unless it would cause the mtime to go backward.  In that
> > > case it will simply perform the minimum mtime advance possible (1
> > > second?).  This handles the case in which one client created a file
> > > using its clock (per previous suggested change), then another client
> > > writes with a clock that is behind.
> 
> We can choose to not decrement at the client, but because mtime is a time_t
> (seconds since epoch), we can't increment by 1 for each write. 1000 writes
> each taking 0.01s would move the mtime 990 seconds into the future.

The mtime resolution is nanoseconds, not seconds, so a minimal advance per
write shouldn't be a problem.
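
To make the arithmetic concrete, here is a small sketch of the worst case,
where every write has to fall back to the minimum advance because the
writer's clock is behind the stored mtime. The function name and the 1 ns
step are illustrative assumptions.

```python
NS_PER_SEC = 1_000_000_000


def forced_advance_ns(writes, step_ns):
    """Total mtime advance when every write adds only the minimum step."""
    return writes * step_ns


writes = 1000
elapsed_ns = writes * (NS_PER_SEC // 100)  # 1000 writes at 0.01 s each = 10 s

# 1 s step: mtime moves 1000 s while only 10 s pass, i.e. 990 s too far.
excess_1s_step = forced_advance_ns(writes, NS_PER_SEC) - elapsed_ns

# 1 ns step: mtime moves 1000 ns in total, far less than the elapsed time.
advance_1ns_step = forced_advance_ns(writes, 1)
```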

> > 
> > That's a possibility (if it's 1ms or 1ns, at least :). We need to verify
> > what POSIX says about that, though: if you utimes(2) an mtime into the
> > future, what happens on write(2)?
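
One way to check the utimes(2)-then-write(2) question empirically on a given
filesystem is sketched below. It only reports what the local filesystem
does; it says nothing about what POSIX requires. The function name and the
one-hour offset are arbitrary choices for illustration.

```python
import os
import tempfile
import time

NS_PER_SEC = 1_000_000_000


def mtime_after_future_utime(offset_s=3600):
    """Set mtime into the future, write again, return (future, before, after) in ns."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"a")
        future_ns = time.time_ns() + offset_s * NS_PER_SEC
        os.utime(path, ns=(future_ns, future_ns))  # push atime and mtime forward
        before_ns = os.stat(path).st_mtime_ns      # mtime as stored by utime
        os.write(fd, b"b")                         # the write under test
        os.fsync(fd)
        after_ns = os.stat(path).st_mtime_ns       # mtime after the write
        return future_ns, before_ns, after_ns
    finally:
        os.close(fd)
        os.unlink(path)
```

If after_ns comes back below before_ns, the filesystem stamped the write
with its own current time and discarded the future mtime.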
> 
> According to http://pubs.opengroup.org/onlinepubs/009695399/, write() only
> requires that mtime be marked for update; it doesn't specify what the new
> value should be:
> 
> "Upon successful completion, where nbyte is greater than 0, write() shall mark
> for update the st_ctime and st_mtime fields of the file, and if the file is a
> regular file, the S_ISUID and S_ISGID bits of the file mode may be cleared."
> 
> In NFS, the server sets the mtime.  It's relatively common to see "Warning:
> file 'foo' has modification time in the future" if you're compiling on NFS and
> your client and NFS server clocks are skewed.  So allowing the mtime to be set
> in the near future would at least follow the principle of least surprise for
> most folks.

We can make this a client config option (set to current time vs 
add epsilon).
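
A rough sketch of how such an option might select between the two
behaviors. This is one reading of "set to current time vs add epsilon", not
an existing Ceph setting: the policy names, the function name and the 1 ns
default are all assumptions.

```python
def next_mtime_ns(prev_mtime_ns, local_now_ns, policy="epsilon", epsilon_ns=1):
    """Pick the mtime for a write under a hypothetical client policy."""
    if policy not in ("current", "epsilon"):
        raise ValueError("unknown policy: %r" % (policy,))
    if local_now_ns > prev_mtime_ns:
        # Local clock is ahead of the stored mtime: both policies agree.
        return local_now_ns
    if policy == "current":
        # Always trust the local clock, even if mtime does not advance.
        return local_now_ns
    # "epsilon": never go backward, advance by the smallest step instead.
    return prev_mtime_ns + epsilon_ns
```

With policy="current" a client whose clock is behind can move the mtime
backward; with policy="epsilon" the mtime is strictly increasing but can run
slightly ahead of that client's clock.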

I also like the idea of providing the timestamp on file creation.  We 
could do both.

sage

> 
> -sam
> 
> > 
> > sage
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 