Re: VFS scalability git tree

On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > 
> > > Branch vfs-scale-working
> > 
> > With a production build (i.e. no lockdep, no xfs debug), I'll
> > run the same fs_mark parallel create/unlink workload to show
> > scalability as I ran here:
> > 
> > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> 
> I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> of a real disk (I don't have easy access to a good disk setup ATM, but
> I guess we're more interested in code above the block layer anyway).
> 
> Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> yours.
> 
> I found that performance is a little unstable, so I sync and echo 3 >
> drop_caches between each run. When it starts reclaiming memory, things
> get a bit more erratic (and XFS seemed to be almost livelocking for tens
> of seconds in inode reclaim).

So, about this XFS livelock-like behaviour: it looks like the vmstat
trace below, and happens periodically while running the above fs_mark
benchmark once it starts requiring reclaim of inodes:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
15  0   6900  31032    192 471852    0    0    28 183296 8520 46672  5 91  4  0
19  0   7044  22928    192 466712   96  144  1056 115586 8622 41695  3 96  1  0
19  0   7136  59884    192 471200  160   92  6768 34564  995  542  1 99 0  0
19  0   7244  17008    192 467860    0  104  2068 32953 1044  630  1 99 0  0
18  0   7244  43436    192 467324    0    0    12     0  817  405  0 100 0  0
18  0   7244  43684    192 467324    0    0     0     0  806  425  0 100 0  0
18  0   7244  43932    192 467324    0    0     0     0  808  403  0 100 0  0
18  0   7244  44924    192 467324    0    0     0     0  808  398  0 100 0  0
18  0   7244  45456    192 467324    0    0     0     0  809  409  0 100 0  0
18  0   7244  45472    192 467324    0    0     0     0  805  412  0 100 0  0
18  0   7244  46392    192 467324    0    0     0     0  807  401  0 100 0  0
18  0   7244  47012    192 467324    0    0     0     0  810  414  0 100 0  0
18  0   7244  47260    192 467324    0    0     0     0  806  396  0 100 0  0
18  0   7244  47752    192 467324    0    0     0     0  806  403  0 100 0  0
18  0   7244  48204    192 467324    0    0     0     0  810  409  0 100 0  0
18  0   7244  48608    192 467324    0    0     0     0  807  412  0 100 0  0
18  0   7244  48876    192 467324    0    0     0     0  805  406  0 100 0  0
18  0   7244  49000    192 467324    0    0     0     0  809  402  0 100 0  0
18  0   7244  49408    192 467324    0    0     0     0  807  396  0 100 0  0
18  0   7244  49908    192 467324    0    0     0     0  809  406  0 100 0  0
18  0   7244  50032    192 467324    0    0     0     0  805  404  0 100 0  0
18  0   7244  50032    192 467324    0    0     0     0  805  406  0 100 0  0
19  0   7244  73436    192 467324    0    0     0  6340  808  384  0 100 0  0
20  0   7244 490220    192 467324    0    0     0  8411  830  389  0 100 0  0
18  0   7244 620092    192 467324    0    0     0     4  809  435  0 100 0  0
18  0   7244 620344    192 467324    0    0     0     0  806  430  0 100 0  0
16  0   7244 682620    192 467324    0    0    44    80  890  326  0 100 0  0
12  0   7244 604464    192 479308   76    0 11716 73555 2242 14318  2 94 4  0
12  0   7244 556700    192 483488    0    0  4276 77680 6576 92285  1 97 2  0
17  0   7244 502508    192 485456    0    0  2092 98368 6308 91919  1 96 4  0
11  0   7244 416500    192 487116    0    0  1760 114844 7414 63025  2 96  2  0

Nothing much is happening except 100% system time for seconds at a
time (the length of the stall varies). This is on a ramdisk, so it
isn't waiting for IO.

During this time, lots of threads are contending on the per-AG
pag_ici_lock:

    60.37%         fs_mark  [kernel.kallsyms]   [k] __write_lock_failed
     4.30%         kswapd0  [kernel.kallsyms]   [k] __write_lock_failed
     3.70%         fs_mark  [kernel.kallsyms]   [k] try_wait_for_completion
     3.59%         fs_mark  [kernel.kallsyms]   [k] _raw_write_lock
     3.46%         kswapd1  [kernel.kallsyms]   [k] __write_lock_failed
                   |
                   --- __write_lock_failed
                      |
                      |--99.92%-- xfs_inode_ag_walk
                      |          xfs_inode_ag_iterator
                      |          xfs_reclaim_inode_shrink
                      |          shrink_slab
                      |          shrink_zone
                      |          balance_pgdat
                      |          kswapd
                      |          kthread
                      |          kernel_thread_helper
                       --0.08%-- [...]

     3.02%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock
     1.82%         fs_mark  [kernel.kallsyms]   [k] _xfs_buf_find
     1.16%         fs_mark  [kernel.kallsyms]   [k] memcpy
     0.86%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
     0.75%         fs_mark  [kernel.kallsyms]   [k] xfs_log_commit_cil
                   |
                   --- xfs_log_commit_cil
                       _xfs_trans_commit
                      |
                      |--60.00%-- xfs_remove
                      |          xfs_vn_unlink
                      |          vfs_unlink
                      |          do_unlinkat
                      |          sys_unlink
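
To make it more obvious what the profile is pointing at: the reclaim
walk currently takes the per-AG radix tree lock exclusively once per
inode, and the lock is dropped again inside execute() before any real
work is done. Roughly like this (a paraphrased sketch with a made-up
function name, not a verbatim copy of xfs_inode_ag_walk):

/*
 * Paraphrased sketch of the existing reclaim-side walk (not a
 * verbatim copy of xfs_inode_ag_walk): one exclusive pag_ici_lock
 * round trip per inode, with the lock dropped again inside execute()
 * before any real work happens.
 */
STATIC int
xfs_reclaim_walk_sketch(
	struct xfs_mount	*mp,
	struct xfs_perag	*pag,
	int			(*execute)(struct xfs_inode *ip,
					   struct xfs_perag *pag, int flags),
	int			flags,
	int			tag,
	int			*nr_to_scan)
{
	uint32_t		first_index = 0;
	int			last_error = 0;

	do {
		struct xfs_inode	*ip;
		int			error;

		/* one write lock acquisition ... */
		write_lock(&pag->pag_ici_lock);
		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
		if (!ip) {
			write_unlock(&pag->pag_ici_lock);
			break;
		}

		/*
		 * ... per inode: for reclaim, execute() is
		 * xfs_reclaim_inode(), which drops pag_ici_lock itself
		 * before doing the real work.
		 */
		error = execute(ip, pag, flags);
		if (error && error != EAGAIN)
			last_error = error;
	} while ((*nr_to_scan)--);

	return last_error;
}

With both kswapd threads and the fs_mark processes doing direct
reclaim all hammering the same few per-AG locks, very little work gets
done per acquisition, which is where the __write_lock_failed time
above is coming from.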

I'm not sure whether there was a long-running read locker in there
causing all the write lockers to fail, or whether they were just
running into one another. Anyway, I hacked up the following patch,
which seemed to improve that behaviour. I haven't run any throughput
numbers on it yet, but I can if you're interested (and if it turns out
not to be completely broken!).

Batch pag_ici_lock acquisition on the reclaim path, and also skip inodes
that appear to be busy to improve locking efficiency.

Index: source/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- source.orig/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:12:11.000000000 +1000
+++ source/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:58:59.000000000 +1000
@@ -87,6 +87,91 @@ xfs_inode_ag_lookup(
 	return ip;
 }
 
+#define RECLAIM_BATCH_SIZE	32
+STATIC int
+xfs_inode_ag_walk_reclaim(
+	struct xfs_mount	*mp,
+	struct xfs_perag	*pag,
+	int			(*execute)(struct xfs_inode *ip,
+					   struct xfs_perag *pag, int flags),
+	int			flags,
+	int			tag,
+	int			exclusive,
+	int			*nr_to_scan)
+{
+	uint32_t		first_index;
+	int			last_error = 0;
+	int			skipped;
+	xfs_inode_t		*batch[RECLAIM_BATCH_SIZE];
+	int			batchnr;
+	int			i;
+
+	BUG_ON(!exclusive);
+
+restart:
+	skipped = 0;
+	first_index = 0;
+next_batch:
+	batchnr = 0;
+	/* fill the batch */
+	write_lock(&pag->pag_ici_lock);
+	do {
+		xfs_inode_t	*ip;
+
+		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
+		if (!ip)
+			break;	
+		if (!(flags & SYNC_WAIT) &&
+				(!xfs_iflock_free(ip) ||
+				__xfs_iflags_test(ip, XFS_IRECLAIM)))
+			continue;
+
+		/*
+		 * The radix tree lock here protects a thread in xfs_iget from
+		 * racing with us starting reclaim on the inode.  Once we have
+		 * the XFS_IRECLAIM flag set it will not touch us.
+		 */
+		spin_lock(&ip->i_flags_lock);
+		ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
+			/* ignore as it is already under reclaim */
+			spin_unlock(&ip->i_flags_lock);
+			continue;
+		}
+		__xfs_iflags_set(ip, XFS_IRECLAIM);
+		spin_unlock(&ip->i_flags_lock);
+
+		batch[batchnr++] = ip;
+	} while ((*nr_to_scan)-- && batchnr < RECLAIM_BATCH_SIZE);
+	write_unlock(&pag->pag_ici_lock);
+
+	for (i = 0; i < batchnr; i++) {
+		int		error = 0;
+		xfs_inode_t	*ip = batch[i];
+
+		/* execute doesn't require pag->pag_ici_lock */
+		error = execute(ip, pag, flags);
+		if (error == EAGAIN) {
+			skipped++;
+			continue;
+		}
+		if (error)
+			last_error = error;
+
+		/* bail out if the filesystem is corrupted.  */
+		if (error == EFSCORRUPTED)
+			break;
+	}
+	if (batchnr == RECLAIM_BATCH_SIZE)
+		goto next_batch;
+
+	if (0 && skipped) {
+		delay(1);
+		goto restart;
+	}
+	return last_error;
+}
+
 STATIC int
 xfs_inode_ag_walk(
 	struct xfs_mount	*mp,
@@ -113,6 +198,7 @@ restart:
 			write_lock(&pag->pag_ici_lock);
 		else
 			read_lock(&pag->pag_ici_lock);
+
 		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
 		if (!ip) {
 			if (exclusive)
@@ -198,8 +284,12 @@ xfs_inode_ag_iterator(
 	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
 	ag = 0;
 	while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) {
-		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
-						exclusive, &nr);
+		if (tag == XFS_ICI_RECLAIM_TAG)
+			error = xfs_inode_ag_walk_reclaim(mp, pag, execute,
+						flags, tag, exclusive, &nr);
+		else
+			error = xfs_inode_ag_walk(mp, pag, execute,
+						flags, tag, exclusive, &nr);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
@@ -789,23 +879,6 @@ xfs_reclaim_inode(
 {
 	int	error = 0;
 
-	/*
-	 * The radix tree lock here protects a thread in xfs_iget from racing
-	 * with us starting reclaim on the inode.  Once we have the
-	 * XFS_IRECLAIM flag set it will not touch us.
-	 */
-	spin_lock(&ip->i_flags_lock);
-	ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
-		/* ignore as it is already under reclaim */
-		spin_unlock(&ip->i_flags_lock);
-		write_unlock(&pag->pag_ici_lock);
-		return 0;
-	}
-	__xfs_iflags_set(ip, XFS_IRECLAIM);
-	spin_unlock(&ip->i_flags_lock);
-	write_unlock(&pag->pag_ici_lock);
-
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
 		if (!(sync_mode & SYNC_WAIT))
Index: source/fs/xfs/xfs_inode.h
===================================================================
--- source.orig/fs/xfs/xfs_inode.h	2010-07-26 21:10:33.000000000 +1000
+++ source/fs/xfs/xfs_inode.h	2010-07-26 21:11:59.000000000 +1000
@@ -349,6 +349,11 @@ static inline int xfs_iflock_nowait(xfs_
 	return try_wait_for_completion(&ip->i_flush);
 }
 
+static inline int xfs_iflock_free(xfs_inode_t *ip)
+{
+	return completion_done(&ip->i_flush);
+}
+
 static inline void xfs_ifunlock(xfs_inode_t *ip)
 {
 	complete(&ip->i_flush);
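
A note on the new xfs_iflock_free() helper: completion_done() is only
a non-destructive peek at the i_flush completion, reporting whether
the flush "lock" currently looks free without taking it, whereas
try_wait_for_completion() in xfs_iflock_nowait() actually consumes the
completion, i.e. acquires the flush lock. A sketch of the difference
(hypothetical usage, not code from the patch above):

	if (xfs_iflock_free(ip)) {
		/*
		 * The flush lock looks free, but we do not hold it;
		 * somebody else may still take it before we do.  Good
		 * enough as a cheap filter while filling the batch
		 * under pag_ici_lock.
		 */
	}

	if (xfs_iflock_nowait(ip)) {
		/*
		 * We actually hold the flush lock now and must release
		 * it with xfs_ifunlock() when we're done.
		 */
		xfs_ifunlock(ip);
	}

So the xfs_iflock_free() check in the batch-fill loop is just an
optimisation to avoid queueing inodes whose flush lock is likely held
and which xfs_reclaim_inode() would skip anyway; the authoritative
checks still happen under the inode locks in the execute path.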