Re: Hadoop and Ceph client/mds view of modification time

On Tuesday, November 27, 2012 at 8:45 AM, Sam Lang wrote:
>  
> Hi Noah,
>  
> I was able to reproduce your issue with a similar test using the fuse  
> client and the clock_offset option for the mds. This is what I see  
> happening:
>  
> clientA's clock is a few seconds behind the mds clock
>  
> clientA creates the file
> - the mds sets the mtime from its current time
> - clientA acquires the exclusive capability (cap) for the file
>  
> clientA writes to the file
> - the mtime is updated locally (at clientA with its current time)
>  
> clientA closes the file
> - the exclusive cap is flushed to the mds, but the mtime is less
> than the create mtime because of the clock skew, so the mds
> doesn't update it to the mtime from clientA's write
>  
> clientA stats the file
> - the mtime from the write (still cached) gets returned. I saw a
> race in my tests, where sometimes the mtime was from the cache
> (if the flush hadn't completed I assume), and sometimes it was
> from the mds.
>  
> clientB stats the file
> - the exclusive cap is revoked at clientA, but the mtime returned
> to clientB is from the mds

Hurray, I think we all agree about what's happening now! :)

Have you checked to see if the MDS ever sets mtime after create, or is it always dictated by the client following that?
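
For anyone following along, the flush-side rule that produces this boils down to a newer-wins comparison. A condensed sketch (mine, not the actual Ceph source; all names are invented):

    #include <ctime>

    struct Inode {
      time_t mtime;  // initially set from the MDS clock at create
    };

    // Invoked when a client flushes its exclusive cap back to the MDS.
    void handle_cap_flush(Inode &in, time_t client_mtime) {
      // Newer-wins: never move the mtime backwards. Since clientA's clock
      // is behind the MDS, client_mtime compares less than the create-time
      // mtime and the update is silently dropped; clientB then sees the
      // MDS's mtime while clientA still reports its cached one.
      if (client_mtime > in.mtime)
        in.mtime = client_mtime;
    }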
  
>  
> The goal of the current implementation is to provide an mtime that is  
> non-decreasing, but that conflicts with using mtime as a version in this  
> case. Using mtime as a version has its own set of problems, but I won't  
> go into that here. I think there are a few alternatives if we want to  
> try to have a more consistent mtime value across clients.
>  
> 1. Let the client set the create mtime. This avoids the issue that the  
> mds and client clocks are out of sync, but in other cases where the  
> client has a clock a few seconds ahead of other clients, we run into a  
> similar problem. This might be reasonable, though, since clients that  
> share state are more likely to have synchronized clocks with each other  
> than with the mds.
>  
> 2. Provide a config option to always set the mtime on cap flush/revoke,  
> even if it's less than the current mtime. This breaks the non-decreasing  
> behavior, and requires the user to set a config option across the cluster  
> if they want this.
>  
> 3. When a client acquires the cap for a file, have the mds provide its  
> current time as well. As the client updates the mtime, it uses the  
> timestamp provided by the mds and the time since the cap was acquired.
> Except for the skew caused by the message latency, this approach allows  
> the mtime to be based off the mds time, so it will be consistent across  
> clients and the mds. It does, however, allow a client to set an mtime in  
> the future (based on its local time), which might be undesirable,  
> but that is more like how NFS behaves. Message latency probably won't  
> be much of an issue either, as the granularity of mtime is a second.  
> Also, the client can set its cap acquired timestamp to the time at which  
> the cap was requested, ensuring that the relative increment includes the  
> round trip latency so that the mtime will always be set further ahead.  
> Of course, this approach would be a lot more intrusive to implement. :-)
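
Before I respond, just to make option 2 concrete: it's essentially a one-line relaxation of the newer-wins check sketched earlier, gated by a config flag (the flag name here is invented, not a real option):

    #include <ctime>

    struct Inode { time_t mtime; };

    // Invented name for the proposed option; it would be read from the
    // cluster configuration on every MDS.
    bool mds_client_mtime_wins = false;

    void handle_cap_flush(Inode &in, time_t client_mtime) {
      // With the option enabled, the client's mtime is accepted even when
      // it moves time backwards, trading the non-decreasing guarantee for
      // cross-client agreement.
      if (mds_client_mtime_wins || client_mtime > in.mtime)
        in.mtime = client_mtime;
    }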

I actually like this third approach of letting the MDS be authoritative about time even if it's not directly involved. Given that, I wonder if perhaps the client should just have time translation functions it uses everywhere?
However, the problem with that is that the different MDS daemons might also disagree about time. Perhaps they could adopt the master MDS clock or something skanky like that… :/
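
Concretely, I'd picture the client-side piece of that third approach looking something like this (a sketch of the idea only; every name here is invented, not real Ceph API):

    #include <ctime>

    // Per-cap state the client would keep under this scheme.
    struct CapState {
      time_t mds_time_at_grant;      // MDS clock, carried in the cap grant
      time_t local_time_at_request;  // sampled before sending the request,
                                     // so the increment below includes the
                                     // round trip and the resulting mtime
                                     // always lands ahead of the MDS clock
    };

    // The "time translation" the client would apply everywhere it stamps
    // an mtime: local elapsed time, expressed on the MDS clock.
    time_t mtime_for_update(const CapState &cap) {
      time_t elapsed = time(nullptr) - cap.local_time_at_request;
      return cap.mds_time_at_grant + elapsed;
    }

A per-MDS offset table would extend the same translation to multiple daemons, though it doesn't fix the underlying problem of their clocks disagreeing in the first place.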

The fundamental issue of time resolution is why things are the way they are now, of course: you usually don't want things going "backwards" in time, but clock skew is a real problem in large clusters, so we decided not to let mtimes go backwards, on the assumption that the disparity would be temporary and little-noticed, and the behavior easy to understand. Obviously we were wrong about it being little-noticed.

