On 20 Mar 2013 19:45:48 +0100 Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx> wrote:

> On 19 Mar 2013 04:24:00 +0100 NeilBrown <neilb@xxxxxxx> wrote:
>
> > On 14 Mar 2013 18:31:38 +0100 Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx>
> > wrote:
> >
> > > On 13 Mar 2013 07:55:00 +0100 NeilBrown <neilb@xxxxxxx> wrote:
> > >
> > > > On 11 Mar 2013 17:13:41 +0100 Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > AFAICS, there is one more race in the RPC cache.
> > > > >
> > > > > The problem showed up for the "auth.unix.ip" cache, which
> > > > > has a reader.
> > > > >
> > > > > If a server thread tries to find a cache entry, it first
> > > > > does a lookup (or calls ip_map_cached_get() in this specific
> > > > > case). Then, it calls cache_check() for this entry.
> > > > >
> > > > > If the reader updates the cache entry just between the
> > > > > thread's lookup and the cache_check() execution, cache_check()
> > > > > will do an upcall for this cache entry. This is because
> > > > > sunrpc_cache_update() calls cache_fresh_locked(old, 0),
> > > > > which sets expiry_time to 0.
> > > > >
> > > > > Unfortunately, the reply to the upcall will not dequeue
> > > > > the queued cache_request, as the reply will be assigned to
> > > > > the cache entry newly created by the above cache update.
> > > > >
> > > > > The result is a growing queue of orphaned cache_request
> > > > > structures --> memory leak.
> > > > >
> > > > > I found this on SLES11 SP1 with a backport of the latest
> > > > > patches that fix the other RPC races. On this old kernel,
> > > > > the problem also leads to svc_drop() being called for the
> > > > > affected RPC request (after svc_defer()).
> > > > >
> > > > > Best Regards
> > > > > Bodo
> > > >
> > > > I don't think this is a real problem.
> > > > The periodic call to "cache_clean" should find these orphaned requests and
> > > > purge them. So you could get a short-term memory leak, but it should
> > > > correct itself.
> > > > Do you agree?
> > > >
> > > > Thanks,
> > > > NeilBrown
> > >
> > > Meanwhile I found the missing part of the race!
> > >
> > > It's just what I wrote above, but additionally, immediately after
> > > the reader updated the cache, cache_clean() unhashes the old cache
> > > entry. In that case:
> > > 1) the thread will queue a request for a cache entry that isn't
> > >    in the hash lists.
> > > 2) cache_clean() will never access this old cache entry again
> > > 3) every further cache update will call cache_dequeue() for a newer
> > >    cache entry, so the request is never dequeued
> > >
> > > --> memory leak.
> >
> > Yes, I see that now. Thanks.
> >
> > I suspect that bug was introduced by commit 3af4974eb2c7867d6e16.
> > Prior to then, cache_clean would never remove anything with an active
> > reference. If something was about to get CACHE_PENDING, it would have a
> > reference, so cache_clean would leave it alone.
> >
> > I wanted to fix this by getting the last 'put' to call cache_dequeue() if
> > CACHE_PENDING was still set. However, I couldn't get access to the
> > CACHE_PENDING flag and the cache_detail at the same place - so I gave up on
> > that.
> >
> > The patch I have included below sets a flag when a cache item is removed
> > from the cache (by cache_clean) and refuses to lodge an upcall if the flag is
> > set. That will ensure there is never a pending upcall when the item is
> > finally freed.
> >
> > Your patch only addresses the particular situation that you had found. I
> > think it is possible that there might be other situations that also lead to
> > memory leaks, so I wanted a solution that would guarantee that there couldn't
> > be a leak.
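To make the leak mechanism described above concrete, here is a small userspace
C sketch of the sequence - only a model, not SUNRPC code. The struct entry,
struct request, queue_upcall() and dequeue_upcalls() names are simplified
stand-ins for cache_head, cache_request, sunrpc_cache_pipe_upcall() and
cache_dequeue().

/*
 * Userspace model of the orphaned-upcall leak: an upcall queued for an
 * entry that has already been replaced and unhashed is never dequeued.
 */
#include <stdio.h>
#include <stdlib.h>

struct entry {              /* stand-in for struct cache_head     */
	int key;
	long expiry_time;   /* 0 means "replaced by an update"    */
	int hashed;         /* still reachable via the hash list? */
};

struct request {            /* stand-in for struct cache_request  */
	struct entry *item;
	struct request *next;
};

static struct request *queue;   /* pending upcalls */

static void queue_upcall(struct entry *e)       /* ~ queueing an upcall    */
{
	struct request *rq = malloc(sizeof(*rq));
	rq->item = e;
	rq->next = queue;
	queue = rq;
}

static void dequeue_upcalls(struct entry *e)    /* ~ cache_dequeue()       */
{
	struct request **pp = &queue;
	while (*pp) {
		if ((*pp)->item == e) {
			struct request *rq = *pp;
			*pp = rq->next;
			free(rq);
		} else {
			pp = &(*pp)->next;
		}
	}
}

int main(void)
{
	struct entry old  = { .key = 1, .expiry_time = 1000, .hashed = 1 };
	struct entry repl = { .key = 1, .expiry_time = 2000, .hashed = 1 };

	/* 1. A server thread looks up "old" and holds a pointer to it.   */
	struct entry *found = &old;

	/* 2. The reader updates the cache: the update expires "old" and  */
	/*    creates "repl"; cache_clean() then unhashes "old".          */
	old.expiry_time = 0;
	old.hashed = 0;

	/* 3. The thread's cache_check() sees "old" as expired and queues */
	/*    an upcall for it - an entry no longer in the hash lists.    */
	queue_upcall(found);

	/* 4. Replies and later updates only ever dequeue requests for    */
	/*    "repl"; the request for "old" stays queued forever.         */
	dequeue_upcalls(&repl);

	int orphans = 0;
	for (struct request *rq = queue; rq; rq = rq->next)
		orphans++;
	printf("orphaned requests still queued: %d\n", orphans);  /* prints 1 */
	return 0;
}

Running it reports one orphaned request, which corresponds to the cache_request
that nothing will ever dequeue once the old entry has been unhashed.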
> >
> > >
> > > I verified this by inserting some debug instructions. In a test run
> > > the race triggered 6 times, and the queue printed by crash after the
> > > run had 6 orphaned entries.
> > >
> > > As I wrote yesterday, I have a patch that solves the problem by
> > > retesting the state of the cache entry after setting CACHE_PENDING
> > > in cache_check(). I can send it if you like.
> > >
> > > Bodo
> > >
> > > P.S.:
> > > Maybe I'm wrong, but AFAICS, there are two further minor problems
> > > regarding (nearly) expired cache entries:
> > > a) ip_map_cached_get() in some situations can return an already
> > >    outdated cache entry (it uses cache_valid(), which does not fully
> > >    check the state). If cache_check() is called for that entry, it
> > >    will fail, I think.
> > > b) Generally, if a cache entry is returned by sunrpc_cache_lookup()
> > >    and the entry is in the last second of its expiry_time, the
> > >    clock might move to the next second before cache_check is called.
> > >    If this happens, cache_check will fail, I think.
> > > Do you agree?
> >
> > Yes, but I'm not sure how important this is.
> > Normally cache entries should be refreshed well before they expire, so we
> > should normally find an entry with more than half of its lifetime left.
> >
> > In the rare case where the scenario you describe happens, cache_check() will
> > return -ETIMEDOUT.
> > In mainline, that will cause the request to be dropped and the connection
> > closed, so the client will try again, won't hit any race, and so should get
> > a correct result.
> > In SLES11 SP1, we retry the lookup and will hopefully get a better result
> > without having to close the connection.
> >
> > So this should be rare and should fail safe.
> >
> > Does that agree with your understanding?
> >
> >
> > Thanks,
> > NeilBrown
> >
> >
> > From e76e6583405a3b5ff9c8d2ae1184704efb209ef0 Mon Sep 17 00:00:00 2001
> > From: NeilBrown <neilb@xxxxxxx>
> > Date: Tue, 19 Mar 2013 13:58:58 +1100
> > Subject: [PATCH] sunrpc/cache: ensure items removed from cache do not have
> >  pending upcalls.
> >
> > It is possible for a race to set CACHE_PENDING after cache_clean()
> > has removed a cache entry from the cache.
> > If CACHE_PENDING is still set when the entry is finally 'put',
> > the cache_dequeue() will never happen and we can leak memory.
> >
> > So set a new flag 'CACHE_CLEANED' when we remove something from
> > the cache, and don't queue an upcall if it is set.
> >
> > If CACHE_PENDING is set before CACHE_CLEANED, the call that
> > cache_clean() makes to cache_fresh_unlocked() will free memory
> > as needed. If CACHE_PENDING is set after CACHE_CLEANED, the
> > test in sunrpc_cache_pipe_upcall will ensure that the memory
> > is not allocated.
> >
> > Reported-by: <bstroesser@xxxxxxxxxxxxxx>
> > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> >
> > diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
> > index 303399b..8419f7d 100644
> > --- a/include/linux/sunrpc/cache.h
> > +++ b/include/linux/sunrpc/cache.h
> > @@ -57,6 +57,7 @@ struct cache_head {
> >  #define CACHE_VALID     0    /* Entry contains valid data */
> >  #define CACHE_NEGATIVE  1    /* Negative entry - there is no match for the key */
> >  #define CACHE_PENDING   2    /* An upcall has been sent but no reply received yet*/
> > +#define CACHE_CLEANED   3    /* Entry has been cleaned from cache */
> >  
> >  #define CACHE_NEW_EXPIRY 120 /* keep new things pending confirmation for 120 seconds */
> >  
> > diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> > index 1677cfe..8e7ccbb 100644
> > --- a/net/sunrpc/cache.c
> > +++ b/net/sunrpc/cache.c
> > @@ -306,7 +306,7 @@ EXPORT_SYMBOL_GPL(cache_check);
> >   * a current pointer into that list and into the table
> >   * for that entry.
> >   *
> > - * Each time clean_cache is called it finds the next non-empty entry
> > + * Each time cache_clean is called it finds the next non-empty entry
> >   * in the current table and walks the list in that entry
> >   * looking for entries that can be removed.
> >   *
> > @@ -453,6 +453,7 @@ static int cache_clean(void)
> >  			current_index ++;
> >  		spin_unlock(&cache_list_lock);
> >  		if (ch) {
> > +			set_bit(CACHE_CLEANED, &ch->flags);
> >  			cache_fresh_unlocked(ch, d);
> >  			cache_put(ch, d);
> >  		}
> > @@ -1176,6 +1177,9 @@ int sunrpc_cache_pipe_upcall(struct cache_detail *detail, struct cache_head *h)
> >  		warn_no_listener(detail);
> >  		return -EINVAL;
> >  	}
> > +	if (test_bit(CACHE_CLEANED, &h->flags))
> > +		/* Too late to make an upcall */
> > +		return -EAGAIN;
> >  
> >  	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> >  	if (!buf)
> >
>
> I added this patch to my SLES11 SP1 and still had dropped RPC requests,
> but no leaked memory.
>
> The drops occur because cache_check() still makes upcalls for cache
> entries that have been replaced with new ones by sunrpc_cache_update().
> If cache_clean() comes too late to wake up the thread waiting for an
> update of the old cache entry, a timeout will occur and the RPC request
> is dropped.
>
> This will not happen on mainline, but why should we allow upcalls to be
> queued for replaced cache entries? This creates unnecessary traffic on
> the channel to the reader and, as a result, unnecessary calls to
> sunrpc_cache_update().
>
> By inserting, in addition to the above patch,
>     "set_bit(CACHE_CLEANED, &old->flags);"
> into sunrpc_cache_update() between the two lines
>     "cache_fresh_unlocked(tmp, detail);"
>     "cache_fresh_unlocked(old, detail);"
> this unclean behavior can be avoided.
>
> If we do so, maybe CACHE_CLEANED should be renamed to CACHE_OLD or
> something else that reflects both uses of the bit.
>
> Do you agree?

This applies only to SLES11-SP1 (2.6.32+), right? In mainline the request
won't be dropped because the cache item isn't assumed to still be valid.
So we need to make sure that sunrpc_cache_pipe_upcall doesn't make an upcall
on a cache item that has been replaced.

I'd rather not use the CACHE_CLEANED bit (whether renamed or not) as it has a
well-defined meaning - "has been removed from the cache" - and I'd rather not
blur that meaning. We already have a state that means "this has been
replaced": ->expiry_time is 0.

So how about adding

	if (h->expiry_time == 0)
		return -EAGAIN;

to sunrpc_cache_pipe_upcall() in SLES11-SP1? Does that work for you?

Thanks,
NeilBrown
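As an illustration of the two guards discussed in this thread - the
CACHE_CLEANED test from the patch above and the expiry_time == 0 test
suggested for SLES11-SP1 - here is a minimal userspace sketch. The reduced
struct cache_head and the should_upcall() helper are hypothetical stand-ins,
not the real sunrpc_cache_pipe_upcall().

/*
 * Userspace model: refuse to queue an upcall for an item that has been
 * removed from the cache (CACHE_CLEANED) or replaced by an update
 * (expiry_time == 0), since any reply would no longer match it.
 */
#include <stdio.h>
#include <errno.h>

#define CACHE_VALID    0
#define CACHE_NEGATIVE 1
#define CACHE_PENDING  2
#define CACHE_CLEANED  3          /* flag added by the patch above */

struct cache_head {               /* reduced to the fields used here */
	unsigned long flags;
	long expiry_time;
};

static int should_upcall(struct cache_head *h)
{
	if (h->flags & (1UL << CACHE_CLEANED))
		return -EAGAIN;   /* removed from the cache: too late  */
	if (h->expiry_time == 0)
		return -EAGAIN;   /* replaced by a cache update        */
	return 0;                 /* go ahead and queue the upcall     */
}

int main(void)
{
	struct cache_head live     = { .flags = 0, .expiry_time = 1000 };
	struct cache_head cleaned  = { .flags = 1UL << CACHE_CLEANED,
	                               .expiry_time = 1000 };
	struct cache_head replaced = { .flags = 0, .expiry_time = 0 };

	printf("live:     %d\n", should_upcall(&live));      /* 0       */
	printf("cleaned:  %d\n", should_upcall(&cleaned));   /* -EAGAIN */
	printf("replaced: %d\n", should_upcall(&replaced));  /* -EAGAIN */
	return 0;
}

Either condition means the item can no longer receive a matching reply, so
returning -EAGAIN avoids queueing an upcall that would only end up orphaned.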