On 20 Feb 2013 04:09:00 +0100 neilb@xxxxxxx wrote: > On 19 Feb 2013 18:08:40 +0100 bstroesser@xxxxxxxxxxxxxx wrote: > > > Second attempt using the correct FROM. Sorry for the noise. > > > > > > Hi, > > > > I found a problem in sunrpc.ko on a SLES11 SP1 (2.6.32.59-0,7.1) and > > fixed it (see first patch ifor 2.6.32.60 below). > > For us the patch works fine (tested on 2.6.32.59-0.7.1). > > > > AFAICS from the code, the problem seems to persist in current kernel > > versions also. Thus, I added the second patch for 3.7.9. > > As the setup to reproduce the problem is quite complex, I couldn't > > test the second patch yet. So consider this one as a RFC. > > > > Best regards, > > Bodo > > > > Please CC me, I'm not on the list. > > > > ========================================= > > From: Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx> > > Date: Fri, 08 Feb 2013 > > Subject: [PATCH] net: sunrpc: fix races in RPC cache > > > > We found the problem and tested the patch on a SLES11 SP1 > > 2.6.32.59-0.7.1 > > > > This patch applies to linux-2.6.32.60 (changed monotonic_seconds --> > > get_seconds()) > > > > Sporadically NFS3 RPC requests to the nfs server are dropped due to > > cache_check() (net/sunrpc/cache.c) returning -ETIMEDOUT for an entry > > of the "auth.unix.gid" cache. > > In this case, no NFS reply is sent to the client. > > > > The reason for the dropped requests are races in cache_check() when > > cache_make_upcall() returns -EINVAL (because it is called for a cache > > without readers) and cache_check() therefore refreshes the cache entry > > (rv == -EAGAIN). > > > > There are three details that need to be changed: > > 1) cache_revisit_request() must not be called before cache_fresh_locked() > > has updated the cache entry, as cache_revisit_request() wakes up > > threads waiting for the cache entry to be updated. > > This certainly seems correct. It is wrong to call cache_revisit_request() so early. > > > The explicit call to cache_revisit_request() is not needed, as > > cache_fresh_unlocked() calls it anyway. > > But cache_fresh_unlocked is only called if "rv == -EAGAIN", however we also need to call it in the case where "age > refresh_age/2" - it must always be called after clearing CACHE_PENDING. > > Also, cache_fresh_unlocked() only calls cache_revisit_request() if CACHE_PENDING is set, but we have just cleared it! Some definitely something wrong here. > (Note that I'm looking at the SLES 2.6.32 code at the moment, mainline is a bit different). > > > > (But in case of not updating the cache entry, cache_revisit_request() > > must be called. Thus, we use a fall through in the "case"). > > Hmm... I don't like case fallthroughs unless they have nice big comments: > /* FALLTHROUGH */ > or similar. :-) > > > 2) CACHE_PENDING must not be cleared before cache_fresh_locked() has > > updated the cache entry, as cache_defer_req() can return without really > > sleeping if it detects CACHE_PENDING unset. > > Agreed. So we should leave the clearing of CACHE_PENDING to cache_fresh_unlocked(). > > > > (In case of not updating the cache entry again we use the fall > > through) > > 3) Imagine a thread that calls cache_check() and gets rv = -ENOENT from > > cache_is_valid(). Then it sets CACHE_PENDINGs and calls > > cache_make_upcall(). > > We assume that meanwhile get_seconds() advances to the next > > sec. and a second thread also calls cache_check(). It gets -EAGAIN from > > cache_is_valid() for the same cache entry. As CACHE_PENDING still is > > set, it calls cache_defer_req() immediately and waits for a wakeup from > > the first thread. > > After cache_make_upcall() returned -EINVAL, the first thread does not > > update the cache entry as it had got rv = -ENOENT, but wakes up the > > second thread by calling cache_revisit_request(). > > Thread two wakes up, calls cache_is_valid() and again gets -EAGAIN. > > Thus, the result of the second cache_check() is -ETIMEDOUT and the > > NFS request is dropped. > > Yep, that's not so good.... > > > > To solve this, the cache entry now is updated not only if rv == -EAGAIN > > but if rv == -ENOENT also. This is a sufficient workaround, as the > > first thread would have to stay in cache_check() between its call to > > cache_is_valid() and clearing CACHE_PENDING for more than 60 seconds > > to break the workaround. > > Still, it isn't nice to just have a work-around. It would be best to have a fix. > The key problem here is that cache_is_valid() is time-sensitive. This have been address in mainline - cache_is_valid doesn't depend on the current time there. So, the solution would be a backport of the current mainline code ... Anyway, I think for SLES11 SP1 and 2.6.32.60 the work-around would be sufficient. BTW: it has the positive side effect, that - while a cache entry is in its second half of life - no longer each cache_check() tries to do a cache_make_upcall(). > > > > > > Signed-off-by: Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx> > > --- > > > > --- a/net/sunrpc/cache.c 2012-08-08 21:35:09.000000000 +0200 > > +++ b/net/sunrpc/cache.c 2013-02-08 14:29:41.000000000 +0100 > > @@ -233,15 +233,14 @@ int cache_check(struct cache_detail *det > > if (!test_and_set_bit(CACHE_PENDING, &h->flags)) { > > switch (cache_make_upcall(detail, h)) { > > case -EINVAL: > > - clear_bit(CACHE_PENDING, &h->flags); > > - cache_revisit_request(h); > > - if (rv == -EAGAIN) { > > + if (rv == -EAGAIN || rv == -ENOENT) { > > set_bit(CACHE_NEGATIVE, &h->flags); > > cache_fresh_locked(h, get_seconds()+CACHE_NEW_EXPIRY); > > + clear_bit(CACHE_PENDING, &h->flags); > > cache_fresh_unlocked(h, detail); > > rv = -ENOENT; > > + break; > > } > > - break; > > > > case -EAGAIN: > > clear_bit(CACHE_PENDING, &h->flags); > > I agree with some of this.... > Maybe: > > switch(cache_make_upcall(detail, h)) { > case -EINVAL: > if (rv) { > set_bit(CACHE_NEGATIVE, &h->flags); > cache_fresh_locked(h, get_seconds() + CACHE_NEW_EXPIRY); > rv = -ENOENT; > } > /* FALLTHROUGH */ > case -EAGAIN: > cache_fresh_unlocked(h, detail); > } I agree, your patch is obviously better than the mine. But let me suggest one little change: I would like to substitute cache_fresh_unlocked() by clear_bit() and cache_revisit_request(), as the call to cache_dequeue() in cache_fresh_unlocked() seems to be obsolete here: switch(cache_make_upcall(detail, h)) { case -EINVAL: if (rv) { set_bit(CACHE_NEGATIVE, &h->flags); cache_fresh_locked(h, get_seconds() + CACHE_NEW_EXPIRY); rv = -ENOENT; } /* FALLTHROUGH */ case -EAGAIN: clear_bit(CACHE_PENDING, &h->flags); cache_revisit_request(h, detail); } > > Though it isn't good that cache_fresh_locked() is being called without holding a lock! Maybe we should import try_to_negate_entry() from mainline. Or simply add write_lock() and write_unlock() around the set_bit() and cache_fresh_locked() > > > ========================================= > > From: Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx> > > Date: Fri, 08 Feb 2013 > > Subject: [PATCH] net: sunrpc: fix races in RPC cache > > > > This patch applies to SLES 11 SP2 linux-3.0.51-0.7.9 and also to > > vanilla linux-3.7.9 > > > > It is untested and is only based on a code review after we analyzed > > the reason for NFS requests being dropped on a > > SLES11 SP1 (linux-2.6.32.59-0.7.1) > > > > Sporadically NFS3 RPC requests to the nfs server are dropped due to > > cache_check() (net/sunrpc/cache.c) returning -ETIMEDOUT for an entry > > of the "auth.unix.gid" cache. > > In this case, no NFS reply is sent to the client. > > > > The reason for the dropped requests are races in cache_check() when > > cache_make_upcall() returns -EINVAL (because it is called for a cache > > without readers) and cache_check() therefore refreshes the cache entry > > (rv == -EAGAIN). > > > > There are two details that need to be changed: > > 1) cache_revisit_request() must not be called before cache_fresh_locked() > > has updated the cache entry, as cache_revisit_request() wakes up > > threads waiting for the cache entry to be updated. > > The explicit call to cache_revisit_request() is not needed, as > > cache_fresh_unlocked() calls it anyway. > > (But in case of not updating the cache entry, cache_revisit_request() > > must be called). > > 2) CACHE_PENDING must not be cleared before cache_fresh_locked() has > > updated the cache entry, as cache_wait_req() called by > > cache_defer_req() can return without really sleeping if it detects > > CACHE_PENDING unset. > > > > Signed-off-by: Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx> > > --- > > > > --- a/net/sunrpc/cache.c 2013-02-08 15:56:07.000000000 +0100 > > +++ b/net/sunrpc/cache.c 2013-02-08 16:04:32.000000000 +0100 > > @@ -230,11 +230,14 @@ static int try_to_negate_entry(struct ca > > rv = cache_is_valid(detail, h); > > if (rv != -EAGAIN) { > > write_unlock(&detail->hash_lock); > > + clear_bit(CACHE_PENDING, &h->flags); > > + cache_revisit_request(h); > > This should just be cache_fresh_unlocked(), as below. > > return rv; > > } > > set_bit(CACHE_NEGATIVE, &h->flags); > > cache_fresh_locked(h, seconds_since_boot()+CACHE_NEW_EXPIRY); > > write_unlock(&detail->hash_lock); > > + clear_bit(CACHE_PENDING, &h->flags); > Clearing this bit is wrong - cache_frsh_unlocked will do that. Yes. My fault. But instead of stripping the clear_bit(), I'd better like to change the following cache_fresh_unlocked() to cache_revisit_request() > > > cache_fresh_unlocked(h, detail); > > return -ENOENT; > > } > So maybe: > > static int try_to_negate_entry(....) > { > int rv; > > write_lock(&detail->hash_lock); > rv = cache_is_valid(detail, h); > if (rv == -EAGAIN) { > set_bit(CACHE_NEGATIVE, &h->flags); > cache_fresh_locked(h, ....); > rv = -ENOENT; > } > write_unlock(&detail->hash_lock); > cache_fresh_unlocked(h, detail); For the same reason as written above, I'd better like clear_bit() and cache_revisit_request() > return rv; > } > ???? > > > > @@ -275,8 +278,6 @@ int cache_check(struct cache_detail *det > > if (!test_and_set_bit(CACHE_PENDING, &h->flags)) { > > switch (cache_make_upcall(detail, h)) { > > case -EINVAL: > > - clear_bit(CACHE_PENDING, &h->flags); > > - cache_revisit_request(h); > > rv = try_to_negate_entry(detail, h); > > break; > > case -EAGAIN: > > Yes, those lines should definitely be removed. > > So maybe this against mainline: > > diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c index 9afa439..7296644 100644 > --- a/net/sunrpc/cache.c > +++ b/net/sunrpc/cache.c > @@ -228,12 +228,11 @@ static int try_to_negate_entry(struct cache_detail *detail, struct cache_head *h > > write_lock(&detail->hash_lock); > rv = cache_is_valid(detail, h); > - if (rv != -EAGAIN) { > - write_unlock(&detail->hash_lock); > - return rv; > + if (rv == -EAGAIN) { > + set_bit(CACHE_NEGATIVE, &h->flags); > + cache_fresh_locked(h, seconds_since_boot()+CACHE_NEW_EXPIRY); > + rv = -ENOENT; > } > - set_bit(CACHE_NEGATIVE, &h->flags); > - cache_fresh_locked(h, seconds_since_boot()+CACHE_NEW_EXPIRY); > write_unlock(&detail->hash_lock); > cache_fresh_unlocked(h, detail); > return -ENOENT; > @@ -275,13 +274,10 @@ int cache_check(struct cache_detail *detail, > if (!test_and_set_bit(CACHE_PENDING, &h->flags)) { > switch (cache_make_upcall(detail, h)) { > case -EINVAL: > - clear_bit(CACHE_PENDING, &h->flags); > - cache_revisit_request(h); > rv = try_to_negate_entry(detail, h); > break; > case -EAGAIN: > - clear_bit(CACHE_PENDING, &h->flags); > - cache_revisit_request(h); > + cache_fresh_unlocked(h, detail); And again: I don't like it ;-) > break; > } > } > > > Is that convincing? Of course. Do you think it would be a good idea to use the same code structure as suggested for 2.6.32.60 ? If so, this would result in the following patch: --- a/net/sunrpc/cache.c 2013-02-20 13:56:30.000000000 +0100 +++ b/net/sunrpc/cache.c 2013-02-20 14:05:08.000000000 +0100 @@ -222,23 +222,6 @@ } } -static int try_to_negate_entry(struct cache_detail *detail, struct cache_head *h) -{ - int rv; - - write_lock(&detail->hash_lock); - rv = cache_is_valid(detail, h); - if (rv != -EAGAIN) { - write_unlock(&detail->hash_lock); - return rv; - } - set_bit(CACHE_NEGATIVE, &h->flags); - cache_fresh_locked(h, seconds_since_boot()+CACHE_NEW_EXPIRY); - write_unlock(&detail->hash_lock); - cache_fresh_unlocked(h, detail); - return -ENOENT; -} - /* * This is the generic cache management routine for all * the authentication caches. @@ -275,10 +258,15 @@ if (!test_and_set_bit(CACHE_PENDING, &h->flags)) { switch (cache_make_upcall(detail, h)) { case -EINVAL: - clear_bit(CACHE_PENDING, &h->flags); - cache_revisit_request(h); - rv = try_to_negate_entry(detail, h); - break; + write_lock(&detail->hash_lock); + rv = cache_is_valid(detail, h); + if (rv == -EAGAIN) { + set_bit(CACHE_NEGATIVE, &h->flags); + cache_fresh_locked(h, seconds_since_boot()+CACHE_NEW_EXPIRY); + rv = -ENOENT; + } + write_unlock(&detail->hash_lock); + /* FALLTHROUGH */ case -EAGAIN: clear_bit(CACHE_PENDING, &h->flags); cache_revisit_request(h); > > Thanks a lot for your very thorough analysis! > > NeilBrown > Thank you for the quick response! Best regards, Bodo ÿôèº{.nÇ+?·?®??+%?Ëÿ±éݶ¥?wÿº{.nÇ+?·¥?{±þwìþ)í?æèw*jg¬±¨¶????Ý¢jÿ¾«þG«?éÿ¢¸¢·¦j:+v?¨?wèjØm¶?ÿþø¯ù®w¥þ?àþf£¢·h??â?úÿ?Ù¥