On Thu, Jul 10, 2008 at 3:23 PM, Peter Staubach <staubach@xxxxxxxxxx> wrote:
> Chuck Lever wrote:
>>
>> On Thu, Jul 10, 2008 at 1:41 PM, Peter Staubach <staubach@xxxxxxxxxx> wrote:
>>>
>>> Chuck Lever wrote:
>>>>
>>>> Hi Peter-
>>>
>>> Hi, Chuck.
>>>
>>>> On Tue, Jul 8, 2008 at 12:08 PM, Peter Staubach <staubach@xxxxxxxxxx> wrote:
>>>>>
>>>>> Hi.
>>>>>
>>>>> I've been looking at a bugzilla which describes a problem where
>>>>> a customer was advised to use either the "noac" or "actimeo=0"
>>>>> mount options to solve a consistency problem that they were
>>>>> seeing in the file attributes. It turned out that this solution
>>>>> did not work reliably for them because sometimes, the local
>>>>> attribute cache was believed to be valid and not timed out.
>>>>> (With an attribute cache timeout of 0, the cache should always
>>>>> appear to be timed out.)
>>>>>
>>>>> In looking at this situation, it appears to me that the problem
>>>>> is that the attribute cache timeout code has an off-by-one
>>>>> error in it. It is assuming that the cache is valid in the
>>>>> region, [read_cache_jiffies, read_cache_jiffies + attrtimeo]. The
>>>>> cache should be considered valid only in the region,
>>>>> [read_cache_jiffies, read_cache_jiffies + attrtimeo). With this
>>>>> change, the options, "noac" and "actimeo=0", work as originally
>>>>> expected.
>>>>>
>>>>> While I was there, I addressed a problem with the jiffies
>>>>> overflowing on 32 bit systems. When overflow occurs, the
>>>>> value of read_cache_jiffies + attrtimeo can be less than the
>>>>> value of read_cache_jiffies. This would cause an unnecessary
>>>>> GETATTR over the wire.
>>>>>
>>>>> Thoughts and/or comments? This is an updated patch which includes
>>>>> the previous support which was added to correct the noac/actimeo=0
>>>>> handling.
>>>>
>>>> A couple of random thoughts below.
>>>
>>> Some thoughts in response --
>>>
>>>>> Signed-off-by: Peter Staubach <staubach@xxxxxxxxxx>
>>>>>
>>>>> --- linux-2.6.25.i686/fs/nfs/dir.c.org
>>>>> +++ linux-2.6.25.i686/fs/nfs/dir.c
>>>>> @@ -1808,7 +1808,7 @@ static int nfs_access_get_cached(struct
>>>>>      cache = nfs_access_search_rbtree(inode, cred);
>>>>>      if (cache == NULL)
>>>>>          goto out;
>>>>> -    if (!time_in_range(jiffies, cache->jiffies, cache->jiffies + nfsi->attrtimeo))
>>>>> +    if (!nfs_time_in_range_open(jiffies, cache->jiffies, cache->jiffies + nfsi->attrtimeo))
>>>>>          goto out_stale;
>>>>>      res->jiffies = cache->jiffies;
>>>>>      res->cred = cache->cred;
>>>>> --- linux-2.6.25.i686/fs/nfs/inode.c.org
>>>>> +++ linux-2.6.25.i686/fs/nfs/inode.c
>>>>> @@ -706,14 +706,7 @@ int nfs_attribute_timeout(struct inode *
>>>>>
>>>>>      if (nfs_have_delegation(inode, FMODE_READ))
>>>>>          return 0;
>>>>> -    /*
>>>>> -     * Special case: if the attribute timeout is set to 0, then always
>>>>> -     * treat the cache as having expired (unless holding
>>>>> -     * a delegation).
>>>>> -     */
>>>>> -    if (nfsi->attrtimeo == 0)
>>>>> -        return 1;
>>>>> -    return !time_in_range(jiffies, nfsi->read_cache_jiffies, nfsi->read_cache_jiffies + nfsi->attrtimeo);
>>>>> +    return !nfs_time_in_range_open(jiffies, nfsi->read_cache_jiffies, nfsi->read_cache_jiffies + nfsi->attrtimeo);
>>>>>  }
>>>>>
>>>>>  /**
>>>>> @@ -1098,7 +1091,7 @@ static int nfs_update_inode(struct inode
>>>>>          nfsi->attrtimeo_timestamp = now;
>>>>>          nfsi->last_updated = now;
>>>>>      } else {
>>>>> -        if (!time_in_range(now, nfsi->attrtimeo_timestamp, nfsi->attrtimeo_timestamp + nfsi->attrtimeo)) {
>>>>> +        if (!nfs_time_in_range_open(now, nfsi->attrtimeo_timestamp, nfsi->attrtimeo_timestamp + nfsi->attrtimeo)) {
>>>>>              if ((nfsi->attrtimeo <<= 1) > NFS_MAXATTRTIMEO(inode))
>>>>>                  nfsi->attrtimeo = NFS_MAXATTRTIMEO(inode);
>>>>>              nfsi->attrtimeo_timestamp = now;
>>>>> --- linux-2.6.25.i686/include/linux/nfs_fs.h.org
>>>>> +++ linux-2.6.25.i686/include/linux/nfs_fs.h
>>>>> @@ -121,7 +121,7 @@ struct nfs_inode {
>>>>>       *
>>>>>       * We need to revalidate the cached attrs for this inode if
>>>>>       *
>>>>> -     *    jiffies - read_cache_jiffies > attrtimeo
>>>>> +     *    jiffies - read_cache_jiffies >= attrtimeo
>>>>>       */
>>>>>      unsigned long read_cache_jiffies;
>>>>>      unsigned long attrtimeo;
>>>>> @@ -244,6 +244,22 @@ static inline unsigned NFS_MAXATTRTIMEO(
>>>>>      return S_ISDIR(inode->i_mode) ? nfss->acdirmax : nfss->acregmax;
>>>>>  }
>>>>>
>>>>> +static inline int nfs_time_in_range_open(unsigned long a,
>>>>> +        unsigned long b, unsigned long c)
>>>>
>>>> All of nfs_time_in_range_open's callers use a sum of 'b' and
>>>> 'nfsi->attrtimeo' for 'c'. Would it be cleaner to pass in
>>>> 'nfsi->attrtimeo' rather than 'b + nfsi->attrtimeo' and do the sum
>>>> here? It might make sense to explicitly check nfsi->attrtimeo for
>>>> zero here to document the special behavior of "actimeo=0".
>>>
>>> The behavior of "actimeo=0" isn't any more special than "actimeo=1".
>>> It simply indicates that the attribute timeout is 0 jiffies long.
>>
>> Right. I'm simply suggesting that adding explicit code is good
>> documentation for this case. It calls it out so developers remember
>> to check that case when they change this code.
>>
>> You are correct that "noac/actimeo=0" is not the common case; however,
>> it is a case that gets ignored and therefore broken easily, and that
>> usually results in corruption of a customer's data.
>>
>>> I thought about reducing the arguments, but it didn't seem to yield
>>> anything any clearer to me.
>>>
>>>> Alternately, checking explicitly if b and c are equal might accomplish
>>>> the same without changing the synopsis.
>>>>
>>>> Also, all of nfs_time_in_range_open's callers negate the return value.
>>>> Would the code in the callers be cleaner if that negation was moved
>>>> into nfs_time_in_range_open? You might rename
>>>> nfs_time_in_range_open() as nfs_cache_has_expired(), for example, to
>>>> make the 'if' statements in the callers make sense in English.
>>>>
>>>> My feeling is that if you have to sit and stare at this for 5 minutes
>>>> to understand precisely how it works, it has already become too
>>>> obfuscated. In addition to fixing any bugs, I wonder if we can make
>>>> it easier to understand and maintain as well.
>>>>
>>>>> +{
>>>>> +    /*
>>>>> +     * If c is less than b, then the jiffies have wrapped.
>>>>> +     * If so, then check to see if a is between b and the
>>>>> +     * max jiffies value or between 0 and the value of c.
>>>>> +     * This is the range between b and c.
>>>>
>>>> include/linux/jiffies.h claims it handles jiffy wrapping correctly.
>>>> Why isn't time_in_range() sufficient if 'c' has wrapped? If it isn't,
>>>> should you fix time_in_range() too?
>>>
>>> Clearly, time_in_range() is not sufficient if the 'c' has
>>> wrapped. It only tests to see if a >= b and a <= c. If 'c'
>>> is less than 'b', then time_in_range() will return false.
>>>
>>> I am reluctant to fix time_in_range() because I don't know
>>> that it is broken. It appears to me that it works for other
>>> uses, because otherwise, someone would have "fixed" it.
>>
>> The only callers I found are the NFS client and the RPC client's auth
>> cache, so it is probably safe to change time_in_range() without
>> concern for breaking someone else's code. It's all ours, baby :-)
>>
>> <fleite@xxxxxxxxxx> introduced time_in_range() a year ago with commit
>> c7e15961 for, it appears from his patch description, very similar
>> reasons to your fix. It might be a good idea to discuss the wrapping
>> bug with him.
>>
>>>> You could then simplify this to "return b != c && time_in_range(a, b,
>>>> c);" or something like that. Or if you negate the return value here:
>>>>
>>>> static inline int nfs_attributes_have_expired(unsigned long current,
>>>>         unsigned long start, unsigned long end)
>>>> {
>>>>     return (start == end) || !time_in_range(current, start, end);
>>>> }
>>>>
>>>> My 0.02USD.
>>>
>>> The change, which makes attrtimeo=0 work for free, is to figure out
>>> that if the attrtimeo is N, then the attribute cache is valid from
>>> time, T, to T + N - 1, not T + N. Thus, the current attribute
>>> cache implementation is off by one because the attribute cache
>>> should expire at time, T + N. The time_in_range() macro was handy
>>> and looked right, but wasn't quite right for the desired semantics.
>>>
>>> Adding tests to check to see if b and c are equal is tuning for
>>> the wrong case, I think. I believe that the majority of file
>>> systems are not mounted with "noac" or "actimeo=0", so the extra
>>> test would just be overhead for the common case.
>>
>> True enough, but you can "fix" that simply by reversing the two checks:
>>
>>     return !time_in_range(a, b, c) || unlikely(b == c);
>>
>> Again, I think there is some value in explicitly documenting the
>> actimeo=0 case here whether or not it is covered implicitly by
>> time_in_range(), precisely because it is not the common case and is
>> often forgotten when changing attribute cache-related logic. This is
>> exactly why we are now here fixing this problem!
>>
>> The comments you added here nicely explain the complexity of the time
>> checks, but do not explicitly state that actimeo=0 must work after any
>> changes to this code -- one of the important reasons that you have
>> open-coded the time comparisons rather than reusing time_in_range().
>>
>> For me this is one of those times where cleverly folding all the cases
>> into a single group of logic makes the code less good because it
>> increases the chances of breakage later on, for example if
>> time_in_range() is changed by someone else who doesn't have local
>> knowledge of NFS.
>
> This was really just an off by one bug. _All_ attribute cache
> timeouts are one clock tick too long.
>
> Adding unlikely() around the test may help to reduce its cost,
> but I don't think that it will make it zero cost. Ordering the
> tests will also help to minimize the cost, but still won't make
> the additional test zero cost.

Because of the "||", the second test never gets executed if the first
test evaluates to true. So effectively, reordering the tests does make
the second test zero cost in the common case. The unlikely() may force
the compiler to rearrange the instructions so the "false" case in the
second test is less expensive, but otherwise it won't have much effect
at all.

> Actually, the _only_ reason that I implemented nfs_time_in_range_open
> instead of just modifying time_in_range() was that I didn't want
> to impact things that were orthogonal to the bug that I needed to
> fix.
> Given that time_in_range() is only used by NFS and RPC,
> perhaps we can safely modify it. If time_in_range had only been
> being used by those three tests, I would have simply updated it.
>
> Simply correcting the math gets us the desired functionality for
> zero additional cost over the broken support. In my viewpoint,
> it is also the easiest to understand because there won't be any
> special cases to worry about and the math will match the desired
> semantics.

Yep, makes sense. I would still like to request a comment that calls
out the actimeo=0 case.

>>>>> +     *
>>>>> +     * Otherwise, just check to see whether a is in [b, c).
>>>>> +     */
>>>>> +    if (c < b)
>>>>> +        return time_after_eq(a, b) || time_before(a, c);
>>>>> +    return time_after_eq(a, b) && time_before(a, c);
>>>>> +}
>>>>> +
>>>>>  static inline int NFS_STALE(const struct inode *inode)
>>>>>  {
>>>>>      return test_bit(NFS_INO_STALE, &NFS_I(inode)->flags);

--
Edward R. Murrow told his generation of journalists no one can
eliminate their prejudices, just recognize them. Here is my bias:
extremes of wealth and poverty cannot be reconciled with a truly just
society.
 -- Bill Moyers
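
For illustration only, the half-open validity check discussed in this
thread can be sketched in userspace C. Everything named here is an
assumption made for the example: after_eq(), before(),
attr_cache_has_expired() and attr_cache_selfcheck() are hypothetical
stand-ins, not kernel functions. The sketch assumes the comparisons are
done on the sign of the difference of two unsigned counters
(two's-complement conversion to long, values less than half the counter
range apart); whether the kernel macros behave the same way is worth
checking against include/linux/jiffies.h.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical userspace stand-ins (not the kernel macros): compare two
 * free-running unsigned counters by the sign of their difference.
 * Assumes two's-complement conversion to long and that the two values
 * are less than half the counter range apart.
 */
static inline int after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

static inline int before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Returns 1 unless now lies in the half-open interval
 * [start, start + timeo). With timeo == 0 the interval is empty, so
 * the cache is always reported as expired (the actimeo=0 case).
 */
static inline int attr_cache_has_expired(unsigned long now,
					 unsigned long start,
					 unsigned long timeo)
{
	unsigned long end = start + timeo;	/* unsigned wrap is well defined */

	return !(after_eq(now, start) && before(now, end));
}

/* Illustrative checks, including an interval that wraps past ULONG_MAX. */
static inline void attr_cache_selfcheck(void)
{
	assert(attr_cache_has_expired(100, 100, 0));	/* actimeo=0 */
	assert(!attr_cache_has_expired(104, 100, 5));	/* last valid tick */
	assert(attr_cache_has_expired(105, 100, 5));	/* expires at T + N */
	assert(!attr_cache_has_expired(1, ULONG_MAX - 2, 5));	/* wrapped */
}
```

Under those assumptions the timeout of 0 needs no special case, and an
interval whose end has wrapped is handled by the same two comparisons.
A comment like the one on attr_cache_has_expired() is one way to call
out the actimeo=0 case without adding a separate test.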