Re: sunrpc/cache.c: races while updating cache entries

Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx> · 21 Mar 2013 17:41:35 +0100

On 21 Mar 2013 00:34:00 +0100 NeilBrown <neilb@xxxxxxx> wrote:

> On 20 Mar 2013 19:45:48 +0100 Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx>
> wrote:
> 
> > On 19 Mar 2013 04:24:00 +0100 NeilBrown <neilb@xxxxxxx> wrote:
> > 
> > > On 14 Mar 2013 18:31:38 +0100 Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx>
> > > wrote:
> > > 
> > > > On 13 Mar 2013 07:55:00 +0100 NeilBrown <neilb@xxxxxxx> wrote:
> > > > 
> > > > > On 11 Mar 2013 17:13:41 +0100 Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx>
> > > > > wrote:
> > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > AFAICS, there is one more race in RPC cache.
> > > > > > 
> > > > > > The problem showed up for the "auth.unix.ip" cache, that
> > > > > > has a reader.
> > > > > > 
> > > > > > If a server thread tries to find a cache entry, it first
> > > > > > does a lookup (or calls ip_map_cached_get() in this specific
> > > > > > case). Then, it calls cache_check() for this entry.
> > > > > > 
> > > > > > If the reader updates the cache entry just between the
> > > > > > thread's lookup and cache_check() execution, cache_check()
> > > > > > will do an upcall for this cache entry. This is, because
> > > > > > sunrpc_cache_update() calls cache_fresh_locked(old, 0),
> > > > > > which sets expiry_time to 0.
> > > > > > 
> > > > > > Unfortunately, the reply to the upcall will not dequeue
> > > > > > the queued cache_request, as the reply will be assigned to
> > > > > > the cache entry newly created by the above cache update.
> > > > > > 
> > > > > > The result is a growing queue of orphaned cache_request
> > > > > > structures --> memory leak.
> > > > > > 
> > > > > > I found this on a SLES11 SP1 with a backport of the latest
> > > > > > patches that fix the other RPC races. On this old kernel,
> > > > > > the problem also leads to svc_drop() being called for the
> > > > > > affected RPC request (after svc_defer()).
> > > > > > 
> > > > > > Best Regards
> > > > > > Bodo
> > > > > 
> > > > > I don't think this is a real problem.
> > > > > The periodic call to "cache_clean" should find these orphaned requests and
> > > > > purge them.  So you could get a short term memory leak, but it should
> > > > > correct itself.
> > > > > Do you agree?
> > > > > 
> > > > > Thanks,
> > > > > NeilBrown
> > > > 
> > > > Meanwhile I found the missing part of the race!
> > > > 
> > > > It's just what I wrote above but additionally, immediately after
> > > > the reader updated the cache, cache_clean() unhashes the old cache
> > > > entry. In that case:
> > > > 1) the thread will queue a request for a cache entry, that isn't
> > > >    in the hash lists.
> > > > 2) cache_clean() never will access this old cache entry again
> > > > 3) every further cache update will call cache_dequeue() for a newer
> > > >    cache entry, the request is never dequeued
> > > > 
> > > > --> memory leak.
> > > 
> > > Yes, I see that now.  Thanks.
> > > 
> > > I suspect that bug was introduced by commit 3af4974eb2c7867d6e16.
> > > Prior to then, cache_clean would never remove anything with an active
> > > reference.  If something was about to get CACHE_PENDING, it would have a
> > > reference so cache_clean would leave it alone.
> > > 
> > > I wanted to fix this by getting the last 'put' to call cache_dequeue() if
> > > CACHE_PENDING was still set.  However I couldn't get access to the
> > > CACHE_PENDING flag and the cache_detail at the same place - so I gave up on
> > > that.
> > > 
> > > The patch I have included below sets a flag when an cache item is removed
> > > from the cache (by cache_clean) and refuses to lodge an upcall if the flag is
> > > set.  That will ensure there is never a pending upcall when the item is
> > > finally freed.
> > > 
> > > You patch only addresses the particular situation that you had found.  I
> > > think it is possible that there might be other situations that also lead to
> > > memory leaks, so I wanted a solution that would guarantee that there couldn't
> > > be a leak.
> > > 
> > > > 
> > > > I verified this inserting some debug instructions. In a testrun
> > > > that triggered 6 times, and the queue printed by crash after the
> > > > run had 6 orphaned entries.
> > > > 
> > > > As I wrote yesterday, I have a patch that solves the problem by
> > > > retesting the state of the cache entry after setting CACHE_PENDING
> > > > in cache_check(). I can send it if you like.
> > > > 
> > > > Bodo
> > > > 
> > > > P.S.:
> > > > Maybe I'm wrong, but AFAICS, there are two further minor problems
> > > > regarding (nearly) expired cache entries:
> > > > a) ip_map_cached_get() in some situations can return an already
> > > >    outdated (it uses cache_valid that not fully checks the state)
> > > >    cache entry. If cache_check() is caclled for that entry, it will
> > > >    fail, I think
> > > > b) Generally, if a cache entry is returned by sunrpc_cache_lookup()
> > > >    and the entry is in the last second of it's expiry_time, the 
> > > >    clock might move to the next second before cache_check is called.
> > > >    If this happens, cache_check will fail, I think.
> > > > Do you agree?
> > > 
> > > Yes, but I'm not sure how important this is.
> > > Normally cache entries should be refreshed well before they expire, so we
> > > should normally find an entry with more than half of its lifetime left.
> > > 
> > > In the rare case where the scenario you describe happens, cache_check() will
> > > return -ETIMEDOUT.
> > > In mainline, that will cause the request to be dropped and the connection
> > > closed so the client will try again and won't hit any race and so should get
> > > a correct result.
> > > In SLES11SP1, we retry the lookup and will hopefully get a better result
> > > without having to close the connection.
> > > 
> > > So this should be rare and should fail-safe.
> > > 
> > > Does that agree with your understanding?
> > > 
> > > 
> > > Thanks,
> > > NeilBrown
> > > 
> > > 
> > > From e76e6583405a3b5ff9c8d2ae1184704efb209ef0 Mon Sep 17 00:00:00 2001
> > > From: NeilBrown <neilb@xxxxxxx>
> > > Date: Tue, 19 Mar 2013 13:58:58 +1100
> > > Subject: [PATCH] sunrpc/cache: ensure items removed from cache do not have
> > >  pending upcalls.
> > > 
> > > It is possible for a race to set CACHE_PENDING after cache_clean()
> > > has removed a cache entry from the cache.
> > > If CACHE_PENDING is still set when the entry is finally 'put',
> > > the cache_dequeue() will never happen and we can leak memory.
> > > 
> > > So set a new flag 'CACHE_CLEANED' when we remove something from
> > > the cache, and don't queue and upcall if it is set.
> > > 
> > > If CACHE_PENDING is set before CACHE_CLEANED, the call that
> > > cache_clean() makes to cache_fresh_unlocked() will free memory
> > > as needed.  If CACHE_PENDING is set after CACHE_CLEANED, the
> > > test in sunrpc_cache_pipe_upcall will ensure that the memory
> > > is not allocated.
> > > 
> > > Reported-by: <bstroesser@xxxxxxxxxxxxxx>
> > > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > > 
> > > diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
> > > index 303399b..8419f7d 100644
> > > --- a/include/linux/sunrpc/cache.h
> > > +++ b/include/linux/sunrpc/cache.h
> > > @@ -57,6 +57,7 @@ struct cache_head {
> > >  #define	CACHE_VALID	0	/* Entry contains valid data */
> > >  #define	CACHE_NEGATIVE	1	/* Negative entry - there is no match for the key */
> > >  #define	CACHE_PENDING	2	/* An upcall has been sent but no reply received yet*/
> > > +#define	CACHE_CLEANED	3	/* Entry has been cleaned from cache */
> > >  
> > >  #define	CACHE_NEW_EXPIRY 120	/* keep new things pending confirmation for 120 seconds */
> > >  
> > > diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> > > index 1677cfe..8e7ccbb 100644
> > > --- a/net/sunrpc/cache.c
> > > +++ b/net/sunrpc/cache.c
> > > @@ -306,7 +306,7 @@ EXPORT_SYMBOL_GPL(cache_check);
> > >   * a current pointer into that list and into the table
> > >   * for that entry.
> > >   *
> > > - * Each time clean_cache is called it finds the next non-empty entry
> > > + * Each time cache_clean is called it finds the next non-empty entry
> > >   * in the current table and walks the list in that entry
> > >   * looking for entries that can be removed.
> > >   *
> > > @@ -453,6 +453,7 @@ static int cache_clean(void)
> > >  			current_index ++;
> > >  		spin_unlock(&cache_list_lock);
> > >  		if (ch) {
> > > +			set_bit(CACHE_CLEANED, &ch->flags);
> > >  			cache_fresh_unlocked(ch, d);
> > >  			cache_put(ch, d);
> > >  		}
> > > @@ -1176,6 +1177,9 @@ int sunrpc_cache_pipe_upcall(struct cache_detail *detail, struct cache_head *h)
> > >  		warn_no_listener(detail);
> > >  		return -EINVAL;
> > >  	}
> > > +	if (test_bit(CACHE_CLEANED, &h->flags))
> > > +		/* Too late to make an upcall */
> > > +		return -EAGAIN;
> > >  
> > >  	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> > >  	if (!buf)
> > > 
> > 
> > I added this patch to my SLES11 SP1 and still had dropped RPC requests,
> > but no leaked memory.
> > 
> > The drops occur, because cache_check() allows to make upcalls for cache
> > entries, that are substituted with new ones by sunrpc_cache_update().
> > If cache_clean() comes too late to wake up the thread waiting for an
> > update of the old cache entry, a timeout will occur and the RPC request
> > is dropped.
> > 
> > This will not happen on mainline, but why should we allow upcalls to be
> > queued for replaced cache entries? This will create unnecessary traffic
> > on the channel to the reader and as a result unnecessary calls to
> > sunrpc_cache_update()
> > 
> > By additionally to the above patch inserting
> >     "setbit(CACHE_CLEANED, &old->flags);"
> > into sunrpc_cache_update() between the two lines
> >     "cache_fresh_unlocked(tmp, detail);"
> >     "cache_fresh_unlocked(old, detail);"
> > this unclean behavior can be avoided.
> > 
> > If we do so, maybe CACHE_CLEAN should better be named CACHE_OLD or
> > something else that reflects both uses of the bit.
> > 
> > Do you agree?
> 
> This applies only to SLES11-SP1 (2.6.32+) right?
> In mainline the request won't be dropped because the cache item isn't assumed
> to still be valid.

I agree, there will be no drop in mainline.

But in mainline there will be a useless upcall. That upcall might even be
processed by the reader, if cache_clean isn't faster then the reader.
The answer from the reader will again replace the current cache entry
(which is just a few microseconds old) by a new one. The new one must be
allocated, the replaced one must later be cleaned. The single item cache
is invalidated again.

In worst case, the unnecessary replacement also could trigger the next
round of the game ...

So in my opinion it would be better to add the patch you suggested below
to mainline also.

> 
> So we need to make sure that sunrpc_cache_pipe_upcall doesn't make an upcall
> on a cache item that has been replaced.  I'd rather not use the CACHE_CLEAN
> bit (whether renamed or not) as it has a well defined meaning "has been
> removed from cache" and I'd rather not blur that meaning.
> We already have a state that means "this has been replace"- ->expiry_time is
> 0.
> So how about adding
>    if (h->expiry_time == 0)
>          return -EAGAIN;
> to sunrpc_cache_pipe_upcall() in SLES11-SP1.
> 
> Does that work for you?

Yes, that looks good. My test with this fix is running successfully
since 5 hours. I'll let it run until Monday.

Thank you for your help,
Bodo

> 
> Thanks,
> NeilBrown
ÿôèº{.nÇ+?·?®??+%?Ëÿ±éÝ¶¥?wÿº{.nÇ+?·¥?{±þwìþ)í?æèw*jg¬±¨¶????Ý¢jÿ¾«þG«?éÿ¢¸¢·¦j:+v?¨?wèjØm¶?ÿþø¯ù®w¥þ?àþf£¢·h??â?úÿ?Ù¥