Re: [PATCH v6 00/31] kmemcg shrinkers

On Tue, May 14, 2013 at 11:48:05AM +1000, Dave Chinner wrote:
> On Mon, May 13, 2013 at 12:00:04PM +0400, Glauber Costa wrote:
> > On 05/13/2013 11:14 AM, Dave Chinner wrote:
> > > Now, the read-only workload is iterating through a cold-cache lookup
> > > workload of 50 million inodes - at roughly 150,000/s. It's a
> > > touch-once workload, so should be turning the cache over completely
> > > every 10 seconds. However, in the time it's taken for me to explain
> > > this:
> > > 
> > >  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
> > > 1954493 1764661  90%    1.12K  69831       28   2234592K xfs_inode
> > > 1643868 281962  17%    0.22K  45663       36    365304K xfs_ili   
> > > 
> > > Only 200k xfs_ili's have been freed. So the rate of reclaim of them
> > > is roughly 5k/s. Given the read-only nature of this workload, they
> > > should be gone from the cache in a few seconds. Another indication
> > > of problems here is the level of internal fragmentation of the
> > > xfs_ili slab. They should cycle out of the cache in LRU manner, just
> > > like inodes - the modify workload is a "touch once" workload as
> > > well, so there should be no internal fragmentation of the slab
> > > cache.
> > > 
> > 
> > Initial testing I have done indicates - although it does not undoubtedly
> > prove - that the problem may be with dentries, not inodes.
> 
> That tallies with the stats I'm seeing showing a significant
> difference in the balance of allocated vs "free" dentries. On a 3.9 kernel,
> there is little difference between them - dentries move quickly to the
> LRU and are considered free, while with this patchset they start the same
> but quickly diverge, with the free count dropping well away from
> the allocated count.

So, there's something early on going wrong in the patch set.  This
is from a tree at this patch in the series:

803d32a inode: convert inode lru list to generic lru list code.

Which is before the dentry cache is converted to the new LRU list
code. The slab caches look like this:

   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1894912 1610470  84%    0.06K  29608       64    118432K kmalloc-64
1894738 1660467  87%    1.12K  67696       28   2166272K xfs_inode
1892232 1633839  86%    0.22K  52562       36    420496K xfs_ili
1887962 1614100  85%    0.21K  51026       37    408208K dentry
 562744 562191  99%    0.55K  20098       28    321568K radix_tree_node

And:

$ cat /proc/sys/fs/dentry-state 
1702143 96055   45      0       0       0
$

Which reflects this:

struct dentry_stat_t {
        int nr_dentry;
        int nr_unused;
        int age_limit;          /* age in seconds */
        int want_pages;         /* pages requested by system */
        int dummy[2];
};

Which basically says we have 1.7 million allocated dentries, but only
100k dentries on the LRU lists.  So there's something wrong either
in the underlying linux-next tree, or the initial 3 dentry cache
patches are now buggy.
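
For anyone wanting to watch the same two counters, here is a minimal user-space sketch of pulling them out of a dentry-state line. The struct and function names (dentry_counts, parse_dentry_state, unused_percent) are made up for illustration and are not kernel interfaces; the only assumption taken from above is that the first two fields are nr_dentry and nr_unused.

```c
#include <stdio.h>

/* Hypothetical holder for the first two fields of struct dentry_stat_t. */
struct dentry_counts {
	long nr_dentry;	/* allocated dentries */
	long nr_unused;	/* dentries accounted to the LRU lists */
};

/* Parse one line of /proc/sys/fs/dentry-state; returns 0 on success. */
static int parse_dentry_state(const char *line, struct dentry_counts *out)
{
	if (sscanf(line, "%ld %ld", &out->nr_dentry, &out->nr_unused) != 2)
		return -1;
	return 0;
}

/* Percentage of allocated dentries that are accounted as unused. */
static double unused_percent(const struct dentry_counts *c)
{
	if (c->nr_dentry <= 0)
		return 0.0;
	return 100.0 * (double)c->nr_unused / (double)c->nr_dentry;
}
```

Fed the line above, this puts under 6% of the allocated dentries on the LRU. On a touch-once workload I'd expect that figure to sit close to 100%.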

<revert back to linux-next tree base>

<groan>

test-4 login: [   71.106361] XFS (vdc): Mounting Filesystem
[   71.130097] XFS (vdc): Ending clean mount
[   91.980679] fs_mark (4394) used greatest stack depth: 3048 bytes left
[   92.286173] fs_mark (4396) used greatest stack depth: 3032 bytes left
[   92.340949] fs_mark (4397) used greatest stack depth: 3024 bytes left
[  120.162200] lowmemorykiller: send sigkill to 2948 (rsyslogd), adj 0, size 209
[  122.518167] fs_mark (4434) used greatest stack depth: 2952 bytes left
[  127.213331] lowmemorykiller: send sigkill to 3421 (pmcd), adj 0, size 202
[  165.402109] lowmemorykiller: send sigkill to 3302 (cron), adj 0, size 94
[  165.435809] lowmemorykiller: send sigkill to 1 (init), adj 0, size 87
[  169.003846] fs_mark (4484) used greatest stack depth: 2720 bytes left
[  189.093392] lowmemorykiller: send sigkill to 1 (init), adj 0, size 86
[  195.153252] lowmemorykiller: send sigkill to 1 (init), adj 0, size 80
[  209.016457] lowmemorykiller: send sigkill to 1 (init), adj 0, size 86
[  219.431805] lowmemorykiller: send sigkill to 1 (init), adj 0, size 86

So, the lowmemory killer is fucked up in the linux-next tree, not by
this patchset. Before it killed pmcd, it looked like the dentry
counters were running as per 3.9.0. Reboot, try again:

[   79.304611] lowmemorykiller: send sigkill to 4593 (fs_mark), adj 0, size 2121
[  131.334226] lowmemorykiller: send sigkill to 4647 (find), adj 0, size 7658
[  131.762285] lowmemorykiller: send sigkill to 4645 (find), adj 0, size 7658
[  131.858137] lowmemorykiller: send sigkill to 4653 (find), adj 0, size 7658
[  131.982366] lowmemorykiller: send sigkill to 4655 (find), adj 0, size 7658
[  132.455610] lowmemorykiller: send sigkill to 4657 (find), adj 0, size 7658
[  132.983835] lowmemorykiller: send sigkill to 4659 (find), adj 0, size 7658
[  133.136868] lowmemorykiller: send sigkill to 4661 (find), adj 0, size 7658
[  133.762004] lowmemorykiller: send sigkill to 4665 (find), adj 0, size 7658
[  139.666345] lowmemorykiller: send sigkill to 4685 (rm), adj 0, size 8195
[  142.964679] lowmemorykiller: send sigkill to 4691 (rm), adj 0, size 8195
[  154.573456] lowmemorykiller: send sigkill to 4686 (rm), adj 0, size 8293

Right, I'm turning that crap off.

Ok, that's more like what I expect:

$ cat /proc/sys/fs/dentry-state 
937104  929728  45      0       0       0
$ cat /proc/sys/fs/dentry-state 
1124254 1116881 45      0       0       0
$ cat /proc/sys/fs/dentry-state 
1256143 1248768 45      0       0       0
$ cat /proc/sys/fs/dentry-state 
761321  753937  45      0       0       0
$ cat /proc/sys/fs/dentry-state 
177308  169925  45      0       0       0
$ cat /proc/sys/fs/dentry-state 
614756  607371  45      0       0       0
$ cat /proc/sys/fs/dentry-state 
848316  840932  45      0       0       0

The unused count tracks the allocated count very closely.

So it's patch 4 that is broken:

dcache: remove dentries from LRU before putting on dispose list

I've found the problem. dentry_kill() returns the current dentry if
it cannot lock the dentry->d_inode or the dentry->d_parent, and when
that happens try_prune_one_dentry() silently fails to prune the
dentry.  But, at this point, we've already removed the dentry from
both the LRU and the shrink list, and so it gets dropped on the
floor.
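
To illustrate the failure mode, here is a rough user-space model of it. This is not kernel code: struct item, try_kill() and drain() are hypothetical stand-ins for the dentry, dentry_kill() and the shrink list walk, and the "trylock failure" is just a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry; not the kernel structure. */
struct item {
	struct item *next;	/* link on the dispose list */
	int lock_failures;	/* how many kill attempts fail before success */
	bool freed;
};

/*
 * Stand-in for dentry_kill(): returns the item itself when the "trylock"
 * fails, or NULL once the item has been freed.
 */
static struct item *try_kill(struct item *it)
{
	if (it->lock_failures > 0) {
		it->lock_failures--;
		return it;
	}
	it->freed = true;
	return NULL;
}

/*
 * Drain a dispose list. With requeue == false a failed kill drops the item
 * (the leak); with requeue == true it goes back on the list for a retry.
 * Returns the number of items removed from the list but never freed.
 */
static int drain(struct item *head, bool requeue)
{
	int leaked = 0;

	while (head) {
		struct item *it = head;

		head = it->next;	/* already off the list at this point */
		it->next = NULL;

		if (try_kill(it) == NULL)
			continue;
		if (requeue) {
			it->next = head;	/* put it back at the head */
			head = it;
		} else {
			leaked++;
		}
	}
	return leaked;
}
```

With requeue disabled, any item whose first kill attempt fails is on neither list and is never freed, which is the leak described above. With requeue enabled it is retried until the kill succeeds.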

Patch 4 needs some work:

	- fix the above shrink list leak
	- fix the scope of the sb locking inside shrink_dcache_sb()
	- remove the readdition of dentry_lru_prune().

The reworked patch below does this.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

dcache: remove dentries from LRU before putting on dispose list

From: Dave Chinner <dchinner@xxxxxxxxxx>

One of the big problems with modifying the way the dcache shrinker
and LRU implementation works is that the LRU is abused in several
ways. One of these is shrink_dentry_list().

Basically, we can move a dentry off the LRU onto a different list
without doing any accounting changes, and then use dentry_lru_prune()
to remove it from whatever list it is now on to do the LRU
accounting at that point.

This makes it -really hard- to change the LRU implementation. The
use of the per-sb LRU lock serialises movement of the dentries
between the different lists and the removal of them, and this is the
only reason that it works. If we want to break up the dentry LRU
lock and lists into, say, per-node lists, we remove the only
serialisation that allows this lru list/dispose list abuse to work.

To make this work effectively, the dispose list has to be isolated
from the LRU list - dentries have to be removed from the LRU
*before* being placed on the dispose list. This means that the LRU
accounting and isolation is completed before disposal is started,
and that means we can change the LRU implementation freely in
future.

This means that dentries *must* be marked with DCACHE_SHRINK_LIST
when they are placed on the dispose list so that we don't think that
parent dentries found in try_prune_one_dentry() are on the LRU when
they are actually on the dispose list. This would result in
accounting the dentry to the LRU a second time. Hence
dentry_lru_del() has to handle the DCACHE_SHRINK_LIST case
differently because the dentry isn't on the LRU list.

[ v2: don't decrement nr unused twice, spotted by Sha Zhengju ]
[ v7: (dchinner)
- shrink list leaks dentries when inode/parent can't be locked in
  dentry_kill().
- fix the scope of the sb locking inside shrink_dcache_sb()
- remove the readdition of dentry_lru_prune(). ]

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>

---
 fs/dcache.c |   90 ++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 68 insertions(+), 22 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 795c15d..edaf462 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -315,7 +315,7 @@ static void dentry_unlink_inode(struct dentry * dentry)
 }
 
 /*
- * dentry_lru_(add|del|prune|move_tail) must be called with d_lock held.
+ * dentry_lru_(add|del|move_list) must be called with d_lock held.
  */
 static void dentry_lru_add(struct dentry *dentry)
 {
@@ -341,7 +341,8 @@ static void __dentry_lru_del(struct dentry *dentry)
  */
 static void dentry_lru_del(struct dentry *dentry)
 {
-	if (!list_empty(&dentry->d_lru)) {
+	if (!list_empty(&dentry->d_lru) &&
+	    !(dentry->d_flags & DCACHE_SHRINK_LIST)) {
 		spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 		__dentry_lru_del(dentry);
 		spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
@@ -350,13 +351,15 @@ static void dentry_lru_del(struct dentry *dentry)
 
 static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
 {
+	BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
+
 	spin_lock(&dentry->d_sb->s_dentry_lru_lock);
 	if (list_empty(&dentry->d_lru)) {
 		list_add_tail(&dentry->d_lru, list);
-		dentry->d_sb->s_nr_dentry_unused++;
-		this_cpu_inc(nr_dentry_unused);
 	} else {
 		list_move_tail(&dentry->d_lru, list);
+		dentry->d_sb->s_nr_dentry_unused--;
+		this_cpu_dec(nr_dentry_unused);
 	}
 	spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
 }
@@ -454,7 +457,8 @@ EXPORT_SYMBOL(d_drop);
  * If ref is non-zero, then decrement the refcount too.
  * Returns dentry requiring refcount drop, or NULL if we're done.
  */
-static inline struct dentry *dentry_kill(struct dentry *dentry, int ref)
+static inline struct dentry *
+dentry_kill(struct dentry *dentry, int ref, int unlock_on_failure)
 	__releases(dentry->d_lock)
 {
 	struct inode *inode;
@@ -463,8 +467,10 @@ static inline struct dentry *dentry_kill(struct dentry *dentry, int ref)
 	inode = dentry->d_inode;
 	if (inode && !spin_trylock(&inode->i_lock)) {
 relock:
-		spin_unlock(&dentry->d_lock);
-		cpu_relax();
+		if (unlock_on_failure) {
+			spin_unlock(&dentry->d_lock);
+			cpu_relax();
+		}
 		return dentry; /* try again with same dentry */
 	}
 	if (IS_ROOT(dentry))
@@ -551,7 +557,7 @@ repeat:
 	return;
 
 kill_it:
-	dentry = dentry_kill(dentry, 1);
+	dentry = dentry_kill(dentry, 1, 1);
 	if (dentry)
 		goto repeat;
 }
@@ -750,12 +756,12 @@ EXPORT_SYMBOL(d_prune_aliases);
  *
  * This may fail if locks cannot be acquired no problem, just try again.
  */
-static void try_prune_one_dentry(struct dentry *dentry)
+static struct dentry * try_prune_one_dentry(struct dentry *dentry)
 	__releases(dentry->d_lock)
 {
 	struct dentry *parent;
 
-	parent = dentry_kill(dentry, 0);
+	parent = dentry_kill(dentry, 0, 0);
 	/*
 	 * If dentry_kill returns NULL, we have nothing more to do.
 	 * if it returns the same dentry, trylocks failed. In either
@@ -767,9 +773,9 @@ static void try_prune_one_dentry(struct dentry *dentry)
 	 * fragmentation.
 	 */
 	if (!parent)
-		return;
+		return NULL;
 	if (parent == dentry)
-		return;
+		return dentry;
 
 	/* Prune ancestors. */
 	dentry = parent;
@@ -778,9 +784,9 @@ static void try_prune_one_dentry(struct dentry *dentry)
 		if (dentry->d_count > 1) {
 			dentry->d_count--;
 			spin_unlock(&dentry->d_lock);
-			return;
+			return NULL;
 		}
-		dentry = dentry_kill(dentry, 1);
+		dentry = dentry_kill(dentry, 1, 1);
 	}
 }
 
@@ -800,21 +806,31 @@ static void shrink_dentry_list(struct list_head *list)
 		}
 
 		/*
+		 * The dispose list is isolated and dentries are not accounted
+		 * to the LRU here, so we can simply remove it from the list
+		 * here regardless of whether it is referenced or not.
+		 */
+		list_del_init(&dentry->d_lru);
+		dentry->d_flags &= ~DCACHE_SHRINK_LIST;
+
+		/*
 		 * We found an inuse dentry which was not removed from
-		 * the LRU because of laziness during lookup.  Do not free
-		 * it - just keep it off the LRU list.
+		 * the LRU because of laziness during lookup. Do not free it.
 		 */
 		if (dentry->d_count) {
-			dentry_lru_del(dentry);
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
-
 		rcu_read_unlock();
 
-		try_prune_one_dentry(dentry);
+		dentry = try_prune_one_dentry(dentry);
 
 		rcu_read_lock();
+		if (dentry) {
+			dentry->d_flags |= DCACHE_SHRINK_LIST;
+			list_add(&dentry->d_lru, list);
+			spin_unlock(&dentry->d_lock);
+		}
 	}
 	rcu_read_unlock();
 }
@@ -855,8 +871,10 @@ relock:
 			list_move(&dentry->d_lru, &referenced);
 			spin_unlock(&dentry->d_lock);
 		} else {
-			list_move_tail(&dentry->d_lru, &tmp);
+			list_move(&dentry->d_lru, &tmp);
 			dentry->d_flags |= DCACHE_SHRINK_LIST;
+			this_cpu_dec(nr_dentry_unused);
+			sb->s_nr_dentry_unused--;
 			spin_unlock(&dentry->d_lock);
 			if (!--count)
 				break;
@@ -870,6 +888,27 @@ relock:
 	shrink_dentry_list(&tmp);
 }
 
+/*
+ * Mark all the dentries as being on the dispose list so we don't think they are
+ * still on the LRU if we try to kill them from ascending the parent chain in
+ * try_prune_one_dentry() rather than directly from the dispose list.
+ */
+static void
+shrink_dcache_list(
+	struct list_head *dispose)
+{
+	struct dentry *dentry;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(dentry, dispose, d_lru) {
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags |= DCACHE_SHRINK_LIST;
+		spin_unlock(&dentry->d_lock);
+	}
+	rcu_read_unlock();
+	shrink_dentry_list(dispose);
+}
+
 /**
  * shrink_dcache_sb - shrink dcache for a superblock
  * @sb: superblock
@@ -883,9 +922,16 @@ void shrink_dcache_sb(struct super_block *sb)
 
 	spin_lock(&sb->s_dentry_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
-		list_splice_init(&sb->s_dentry_lru, &tmp);
+		/*
+		 * account for removal here so we don't need to handle it later
+		 * even though the dentry is no longer on the lru list.
+		 */
+		list_splice_init(&sb->s_dentry_lru, &tmp);
+		this_cpu_sub(nr_dentry_unused, sb->s_nr_dentry_unused);
+		sb->s_nr_dentry_unused = 0;
 		spin_unlock(&sb->s_dentry_lru_lock);
-		shrink_dentry_list(&tmp);
+
+		shrink_dcache_list(&tmp);
 		spin_lock(&sb->s_dentry_lru_lock);
 	}
 	spin_unlock(&sb->s_dentry_lru_lock);