Re: [PATCH] sunrpc.ko: RPC cache fix races

Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx> · 25 Feb 2013 20:55:50 +0100

On 24 Feb 2013 23:53:00 +0100 neilb@xxxxxxx wrote:

> On 21 Feb 2013 16:44:41 +0100 Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx>
> wrote:
> 
> > On 21 Feb 2013 00:55:00 +0100 neilb@xxxxxxx wrote:
> > 
> > > On 20 Feb 2013 14:57:07 +0100 bstroesser@xxxxxxxxxxxxxx wrote:
> > > 
> > > > On 20 Feb 2013 04:09:00 +0100 neilb@xxxxxxx wrote:
> > 
> > snip
> > 
> > > > > Maybe:
> > > > > 
> > > > >   switch(cache_make_upcall(detail, h)) {
> > > > >   case -EINVAL:
> > > > >         if (rv) {
> > > > > 		set_bit(CACHE_NEGATIVE, &h->flags);
> > > > > 		cache_fresh_locked(h, get_seconds() + CACHE_NEW_EXPIRY);
> > > > > 		rv = -ENOENT;
> > > > > 	}
> > > > > 	/* FALLTHROUGH */
> > > > >   case -EAGAIN:
> > > > > 	cache_fresh_unlocked(h, detail);
> > > > >   }
> > > > 
> > > > I agree, your patch is obviously better than the mine.
> > > > But let me suggest one little change: I would like to substitute
> > > > cache_fresh_unlocked() by clear_bit() and cache_revisit_request(),
> > > > as the call to cache_dequeue() in cache_fresh_unlocked() seems to
> > > > be obsolete here:
> > > 
> > > It is exactly this sort of thinking (on my part) that got us into this mess
> > > in the first place.  I reasoned that the full locking/testing/whatever wasn't
> > > necessary and took a short cut.  It wasn't a good idea.
> > 
> > Maybe I'm totally wrong, but AFAICS, calling cache_dequeue() here in extreme
> > situations (two threads in parallel calling check_cache() while a first reader
> > is going to open cache access) could again cause a race (?)
> 
> Can you explain the race you see?

O.k., let me try ...

Suppposed there are 2 threads calling cache_check concurrently and a
user process that is going to become a reader for the involved cache.

The first thread calls cache_is_valid() and gets -EAGAIN.
Next, it sets CACHE_PENDING and calls cache_make_upcall(), which
returns -EINVAL, as there is no reader yet.

Meanwhile the second thread also calls cache_is_valid(), also gets
-EAGAIN, but now is interrupted and delayed for a while.

The first thread continues its work, negates the entry and calls
cache_fresh_locked() followed by cache_fresh_unlocked().
In cache_fresh_unlocked it resets CACHE_PENDING and calls
cache_revisit_request(). While this, it is interrupted and delayed.

The user process is scheduled and opens the channel to become a reader.

The second thread wakes up again and sets CACHE_PENDING. Next it calls
cache_make_upcall(), which results in a request being queued, as there
is a reader now.

The first thread wakes up and continues its work calling cache_dequeue().
Maybe the request is dropped before the reader could process it.

Do you agree or did I miss something?

> 
> > 
> > BTW: if there is a reader for a cache, is there any protection against many
> > upcalls being queued for the same cache entry?
> 
> The CACHE_PENDING flag should provide that protection.  We only queue an
> upcall after "test_and_set", and always dequeue after "clear_bit".

Sorry, my question wasn't very clear. CACHE_PENDING *does* provide the
protection against parallel upcalls.

OTOH, as cache_is_valid() is called before test_and_set_bit(CACHE_PENDING)
and then the upcall is made without again calling cache_is_valid(),
I see a small chance of a second request being queued unnecessarily just
after the first request was answered. 

Bodo

> 
> NeilBrown
> 
> 
> > 
> > Bodo
> > 
> > > 
> > > Given that this is obviously difficult code to get right, we should make it
> > > as easy to review as possible.  Have "cache_fresh_unlocked" makes it more
> > > obviously correct, and that is a good thing.
> > > Maybe cache_dequeue isn't needed here, but it won't hurt so I'd much rather
> > > have the clearer code.
> > > In fact, I'd also like to change
> > > 
> > > 			if (test_and_clear_bit(CACHE_PENDING, &ch->flags))
> > > 				cache_dequeue(current_detail, ch);
> > > 			cache_revisit_request(ch);
> > > 
> > > near the end of cache_clean to call  cache_fresh_unlocked().
> > > 
ÿôèº{.nÇ+?·?®??+%?Ëÿ±éÝ¶¥?wÿº{.nÇ+?·¥?{±þwìþ)í?æèw*jg¬±¨¶????Ý¢jÿ¾«þG«?éÿ¢¸¢·¦j:+v?¨?wèjØm¶?ÿþø¯ù®w¥þ?àþf£¢·h??â?úÿ?Ù¥