Re: sunrpc/cache.c: races while updating cache entries

Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx> · 13 Mar 2013 17:47:34 +0100

On 13 Mar 2013 07:55:00 +0100 NeilBrown <neilb@xxxxxxx> wrote:

> On 11 Mar 2013 17:13:41 +0100 Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx>
> wrote:
> 
> > Hi,
> > 
> > AFAICS, there is one more race in RPC cache.
> > 
> > The problem showed up for the "auth.unix.ip" cache, that
> > has a reader.
> > 
> > If a server thread tries to find a cache entry, it first
> > does a lookup (or calls ip_map_cached_get() in this specific
> > case). Then, it calls cache_check() for this entry.
> > 
> > If the reader updates the cache entry just between the
> > thread's lookup and cache_check() execution, cache_check()
> > will do an upcall for this cache entry. This is, because
> > sunrpc_cache_update() calls cache_fresh_locked(old, 0),
> > which sets expiry_time to 0.
> > 
> > Unfortunately, the reply to the upcall will not dequeue
> > the queued cache_request, as the reply will be assigned to
> > the cache entry newly created by the above cache update.
> > 
> > The result is a growing queue of orphaned cache_request
> > structures --> memory leak.
> > 
> > I found this on a SLES11 SP1 with a backport of the latest
> > patches that fix the other RPC races. On this old kernel,
> > the problem also leads to svc_drop() being called for the
> > affected RPC request (after svc_defer()).
> > 
> > Best Regards
> > Bodo
> 
> I don't think this is a real problem.
> The periodic call to "cache_clean" should find these orphaned requests and
> purge them.  So you could get a short term memory leak, but it should
> correct itself.
> Do you agree?
> 
> Thanks,
> NeilBrown

I agree that it _should_ correct itself. I read the source again
and I now see that cache_clean should do the work fine.

OTOH, on my SLES11 SP1, there *is* a memory leak and also RPC
requests are dropped (which can't happen on mainline, but that
made me look into it).
Please see the output from crash on the life system below. That
output is from a time far after all NFS activity was stopped and
the next cycle of cache_clean() had run (ip_map_cache.nextcheck
has changed several times from 3173 to the current value of 10373).

At the end of the crash output you'll find two examples of what
cache_request.item points to. I've looked into some more of them.
They all have next == 0 and refcount.counter >= 1 (The max counter
I saw was 3).

So, the cache_head items all are already unhashed, but the
cache_requests still are queued, right?

I have no idea, how this could happen. Maybe my patches for SLES11
SP1 are not ok.

So, behind the crash output, I also added my patches. Maybe you
could have a look onto it?

There is one more small point, that isn't clear to me: when the
"old" races in auth.unix.gid caused a NFS request to be dropped,
the TCP connection after a few minutes was marked by
svc_age_temp_xprts() to be closed.
When a request is dropped because of the race in auth.unix.ip,
svc_age_temp_xprts() does not mark it to be closed. Why does it
behave so differently?

Thank you and best regards
Bodo

PS:
I also have one further patch, that fixes the problem by retesting
the cache entry's state after having set CACHE_PENDING in
cache_check() and by examining expiry_time only if it is not 0.


crash> print &ip_map_cache.queue
$43 = (struct list_head *) 0xffffffffa04a1478
crash> list 0 -H 0xffffffffa04a1478 -s cache_request
ffff8800bbf80180
struct cache_request {
  q = {
    list = {
      next = 0xffff8800ba28e940,
      prev = 0xffffffffa04a1478
    },
    reader = 0
  },
  item = 0xffff8800bb822c40,
  buf = 0xffff8800baadf000 "nfsd 192.168.10.2\n",
  len = 18,
  readers = 0
}
ffff8800ba28e940
struct cache_request {
  q = {
    list = {
      next = 0xffff8800bb5d9c80,
      prev = 0xffff8800bbf80180
    },
    reader = 0
  },
  item = 0xffff8800bbd88540,
  buf = 0xffff8800bedcb000 "nfsd 192.168.11.2\n",
  len = 18,
  readers = 0
}
ffff8800bb5d9c80
struct cache_request {
  q = {
    list = {
      next = 0xffff8800bbd1b800,
      prev = 0xffff8800ba28e940
    },
    reader = 0
  },
  item = 0xffff880204b1b340,
  buf = 0xffff8800bb21c000 "nfsd 192.168.10.2\n",
  len = 18,
  readers = 0
}
ffff8800bbd1b800
struct cache_request {
  q = {
    list = {
      next = 0xffff8800b5444d00,
      prev = 0xffff8800bb5d9c80
    },
    reader = 0
  },
  item = 0xffff8800bb991940,
  buf = 0xffff8800b28f9000 "nfsd 192.168.11.2\n",
  len = 18,
  readers = 0
}
ffff8800b5444d00
struct cache_request {
  q = {
    list = {
      next = 0xffff8800bf21cc80,
      prev = 0xffff8800bbd1b800
    },
    reader = 0
  },
  item = 0xffff88020a3885c0,
  buf = 0xffff8800b2880000 "nfsd 192.168.10.2\n",
  len = 18,
  readers = 0
}
ffff8800bf21cc80
struct cache_request {
  q = {
    list = {
      next = 0xffff8800bed9cb80,
      prev = 0xffff8800b5444d00
    },
    reader = 0
  },
  item = 0xffff880204371ec0,
  buf = 0xffff8800b1093000 "nfsd 192.168.11.2\n",
  len = 18,
  readers = 0
}
ffff8800bed9cb80
struct cache_request {
  q = {
    list = {
      next = 0xffff8800bac27c40,
      prev = 0xffff8800bf21cc80
    },
    reader = 0
  },
  item = 0xffff8802359741c0,
  buf = 0xffff8800bb54c000 "nfsd 192.168.11.2\n",
  len = 18,
  readers = 0
}
ffff8800bac27c40
struct cache_request {
  q = {
    list = {
      next = 0xffff8800bef1d480,
      prev = 0xffff8800bed9cb80
    },
    reader = 0
  },
  item = 0xffff8800bbd3c140,
  buf = 0xffff8800bb8ae000 "nfsd 192.168.10.2\n",
  len = 18,
  readers = 0
}
ffff8800bef1d480
struct cache_request {
  q = {
    list = {
      next = 0xffff8800baae5f80,
      prev = 0xffff8800bac27c40
    },
    reader = 0
  },
  item = 0xffff8800bbfd62c0,
  buf = 0xffff8800b9ecd000 "nfsd 192.168.10.2\n",
  len = 18,
  readers = 0
}
ffff8800baae5f80
struct cache_request {
  q = {
    list = {
      next = 0xffff8800bf0cd680,
      prev = 0xffff8800bef1d480
    },
    reader = 0
  },
  item = 0xffff8800bb8224c0,
  buf = 0xffff8800bf249000 "nfsd 192.168.11.2\n",
  len = 18,
  readers = 0
}
ffff8800bf0cd680
struct cache_request {
  q = {
    list = {
      next = 0xffff8800bb91a500,
      prev = 0xffff8800baae5f80
    },
    reader = 0
  },
  item = 0xffff8800bf1865c0,
  buf = 0xffff8800bb1c8000 "nfsd 192.168.10.2\n",
  len = 18,
  readers = 0
}
ffff8800bb91a500
struct cache_request {
  q = {
    list = {
      next = 0xffff8800b12f3200,
      prev = 0xffff8800bf0cd680
    },
    reader = 0
  },
  item = 0xffff8800bec7b940,
  buf = 0xffff8800b26b8000 "nfsd 192.168.10.2\n",
  len = 18,
  readers = 0
}
ffff8800b12f3200
struct cache_request {
  q = {
    list = {
      next = 0xffff880234c97840,
      prev = 0xffff8800bb91a500
    },
    reader = 0
  },
  item = 0xffff8800b6516440,
  buf = 0xffff8800bb933000 "nfsd 192.168.10.2\n",
  len = 18,
  readers = 0
}
ffff880234c97840
struct cache_request {
  q = {
    list = {
      next = 0xffffffffa04a1478,
      prev = 0xffff8800b12f3200
    },
    reader = 1
  },
  item = 0x0,
  buf = 0x313a333a303a38 <Address 0x313a333a303a38 out of bounds>,
  len = 0,
  readers = 0
}
crash> print *(struct cache_head *)0xffff8800b6516440
$44 = {
  next = 0x0,
  expiry_time = 0,
  last_refresh = 1140,
  ref = {
    refcount = {
      counter = 1
    }
  },
  flags = 5
}
crash> print *(struct cache_head *)0xffff8800bec7b940
$45 = {
  next = 0x0,
  expiry_time = 0,
  last_refresh = 1138,
  ref = {
    refcount = {
      counter = 2
    }
  },
  flags = 5
}
crash>




############################

--- a/net/sunrpc/cache.c	2013-03-05 12:55:28.000000000 +0100
+++ b/net/sunrpc/cache.c	2013-03-05 12:54:42.000000000 +0100
@@ -962,23 +962,33 @@ static int cache_release(struct inode *i
 
 static void cache_dequeue(struct cache_detail *detail, struct cache_head *ch)
 {
-	struct cache_queue *cq;
+	struct cache_queue *cq, *tmp;
+	struct cache_request *cr;
+	struct list_head dequeued;
+
+	INIT_LIST_HEAD(&dequeued);
 	spin_lock(&queue_lock);
-	list_for_each_entry(cq, &detail->queue, list)
+	list_for_each_entry_safe(cq, tmp, &detail->queue, list)
 		if (!cq->reader) {
-			struct cache_request *cr = container_of(cq, struct cache_request, q);
+			cr = container_of(cq, struct cache_request, q);
 			if (cr->item != ch)
 				continue;
+			if (test_bit(CACHE_PENDING, &ch->flags))
+				/* Lost a race and it is pending again */
+				break;
 			if (cr->readers != 0)
 				continue;
-			list_del(&cr->q.list);
-			spin_unlock(&queue_lock);
-			cache_put(cr->item, detail);
-			kfree(cr->buf);
-			kfree(cr);
-			return;
+			list_move(&cr->q.list, &dequeued);
 		}
 	spin_unlock(&queue_lock);
+
+	while (!list_empty(&dequeued)) {
+		cr = list_entry(dequeued.next, struct cache_request, q.list);
+		list_del(&cr->q.list);
+		cache_put(cr->item, detail);
+		kfree(cr->buf);
+		kfree(cr);
+	}
 }
 
 /*
@@ -1081,6 +1091,7 @@ int sunrpc_cache_pipe_upcall(struct cach
 	struct cache_request *crq;
 	char *bp;
 	int len;
+	int ret = 0;
 
 	if (atomic_read(&detail->readers) == 0 &&
 	    detail->last_close < get_seconds() - 30) {
@@ -1113,10 +1124,18 @@ int sunrpc_cache_pipe_upcall(struct cach
 	crq->len = PAGE_SIZE - len;
 	crq->readers = 0;
 	spin_lock(&queue_lock);
-	list_add_tail(&crq->q.list, &detail->queue);
+	if (test_bit(CACHE_PENDING, &h->flags))
+		list_add_tail(&crq->q.list, &detail->queue);
+	else
+		/* Lost a race, no longer PENDING, so don't enqueue */
+		ret = -EAGAIN;
 	spin_unlock(&queue_lock);
 	wake_up(&queue_wait);
-	return 0;
+	if (ret == -EAGAIN) {
+		kfree(buf);
+		kfree(crq);
+	}
+	return ret;
 }
 EXPORT_SYMBOL_GPL(sunrpc_cache_pipe_upcall);
 
############################

--- a/net/sunrpc/cache.c	2013-03-05 13:10:58.000000000 +0100
+++ b/net/sunrpc/cache.c	2013-03-05 12:54:42.000000000 +0100
@@ -233,19 +233,16 @@ int cache_check(struct cache_detail *det
 		if (!test_and_set_bit(CACHE_PENDING, &h->flags)) {
 			switch (cache_make_upcall(detail, h)) {
 			case -EINVAL:
-				clear_bit(CACHE_PENDING, &h->flags);
-				cache_revisit_request(h);
-				if (rv == -EAGAIN) {
+				write_lock(&detail->hash_lock);
+				if (rv) {
 					set_bit(CACHE_NEGATIVE, &h->flags);
 					cache_fresh_locked(h, monotonic_seconds()+CACHE_NEW_EXPIRY);
-					cache_fresh_unlocked(h, detail);
 					rv = -ENOENT;
 				}
-				break;
-
+				write_unlock(&detail->hash_lock);
+				/* FALLTHROUGH */
 			case -EAGAIN:
-				clear_bit(CACHE_PENDING, &h->flags);
-				cache_revisit_request(h);
+				cache_fresh_unlocked(h, detail);
 				break;
 			}
 		}
@@ -405,17 +402,12 @@ static int cache_clean(void)
 			    && ch->last_refresh >= current_detail->flush_time
 				)
 				continue;
-			if (test_and_clear_bit(CACHE_PENDING, &ch->flags))
-				cache_dequeue(current_detail, ch);
 
-			if (atomic_read(&ch->ref.refcount) == 1)
-				break;
-		}
-		if (ch) {
 			*cp = ch->next;
 			ch->next = NULL;
 			current_detail->entries--;
 			rv = 1;
+			break;
 		}
 		write_unlock(&current_detail->hash_lock);
 		d = current_detail;
@@ -423,7 +415,7 @@ static int cache_clean(void)
 			current_index ++;
 		spin_unlock(&cache_list_lock);
 		if (ch) {
-			cache_revisit_request(ch);
+			cache_fresh_unlocked(ch, d);
 			cache_put(ch, d);
 		}
 	} else
ÿôèº{.nÇ+?·?®??+%?Ëÿ±éÝ¶¥?wÿº{.nÇ+?·¥?{±þwìþ)í?æèw*jg¬±¨¶????Ý¢jÿ¾«þG«?éÿ¢¸¢·¦j:+v?¨?wèjØm¶?ÿþø¯ù®w¥þ?àþf£¢·h??â?úÿ?Ù¥