On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote: > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote: > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote: > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > > > > > > > Branch vfs-scale-working > > > > > > With a production build (i.e. no lockdep, no xfs debug), I'll > > > run the same fs_mark parallel create/unlink workload to show > > > scalability as I ran here: > > > > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html > > > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead > > of a real disk (I don't have easy access to a good disk setup ATM, but > > I guess we're more interested in code above the block layer anyway). > > > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as > > yours. > > > > I found that performance is a little unstable, so I sync and echo 3 > > > drop_caches between each run. When it starts reclaiming memory, things > > get a bit more erratic (and XFS seemed to be almost livelocking for tens > > of seconds in inode reclaim). > > So about this XFS livelock type thingy. It looks like this, and happens > periodically while running the above fs_mark benchmark requiring reclaim > of inodes: .... > Nothing much happening except 100% system time for seconds at a time > (length of time varies). This is on a ramdisk, so it isn't waiting > for IO. > > During this time, lots of things are contending on the lock: > > 60.37% fs_mark [kernel.kallsyms] [k] __write_lock_failed > 4.30% kswapd0 [kernel.kallsyms] [k] __write_lock_failed > 3.70% fs_mark [kernel.kallsyms] [k] try_wait_for_completion > 3.59% fs_mark [kernel.kallsyms] [k] _raw_write_lock > 3.46% kswapd1 [kernel.kallsyms] [k] __write_lock_failed > | > --- __write_lock_failed > | > |--99.92%-- xfs_inode_ag_walk > | xfs_inode_ag_iterator > | xfs_reclaim_inode_shrink > | shrink_slab > | shrink_zone > | balance_pgdat > | kswapd > | kthread > | kernel_thread_helper > --0.08%-- [...] > > 3.02% fs_mark [kernel.kallsyms] [k] _raw_spin_lock > 1.82% fs_mark [kernel.kallsyms] [k] _xfs_buf_find > 1.16% fs_mark [kernel.kallsyms] [k] memcpy > 0.86% fs_mark [kernel.kallsyms] [k] _raw_spin_lock_irqsave > 0.75% fs_mark [kernel.kallsyms] [k] xfs_log_commit_cil > | > --- xfs_log_commit_cil > _xfs_trans_commit > | > |--60.00%-- xfs_remove > | xfs_vn_unlink > | vfs_unlink > | do_unlinkat > | sys_unlink > > I'm not sure if there was a long-running read locker in there causing > all the write lockers to fail, or if they are just running into one > another. The longest hold is in the inode cluster writeback (xfs_iflush_cluster), but if there is no IO then I don't see how that would be a problem. I suspect that it might be caused by having several CPUs all trying to run the shrinker at the same time and them all starting at the same AG and therefore lockstepping and getting nothing done because they are all scanning the same inodes. Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to avoid this lockstepping. I've attached a patch below to do this - can you give it a try? > But anyway, I hacked the following patch which seemed to > improve that behaviour. I haven't run any throughput numbers on it yet, > but I could if you're interested (and it's not completely broken!) Batching is certainly something that I have been considering, but apart from the excessive scanning bug, the per-ag inode tree lookups hve not featured prominently in any profiling I've done, so it hasn't been a high priority. You patch looks like it will work fine, but I think it can be made a lot cleaner. I'll have a closer look at this once I get to the bottom of the dbench hang you are seeing.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx xfs: add an AG iterator start rotor From: Dave Chinner <dchinner@xxxxxxxxxx> To avoid multiple CPUs from executing inode cache shrinkers on the same AG all at the same time, make every shrinker call start on a different AG. This will mostly prevent concurrent shrinker calls from competing for the serialising pag_ici_lock and lock-stepping reclaim. Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> --- fs/xfs/linux-2.6/xfs_sync.c | 11 ++++++++++- fs/xfs/xfs_mount.h | 1 + 2 files changed, 11 insertions(+), 1 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c index dfcbd98..5322105 100644 --- a/fs/xfs/linux-2.6/xfs_sync.c +++ b/fs/xfs/linux-2.6/xfs_sync.c @@ -181,11 +181,14 @@ xfs_inode_ag_iterator( struct xfs_perag *pag; int error = 0; int last_error = 0; + xfs_agnumber_t start_ag; xfs_agnumber_t ag; int nr; + int looped = 0; nr = nr_to_scan ? *nr_to_scan : INT_MAX; - ag = 0; + start_ag = atomic_inc_return(&mp->m_agiter_rotor) & mp->m_sb.sb_agcount; + ag = start_ag; while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) { error = xfs_inode_ag_walk(mp, pag, execute, flags, tag, exclusive, &nr); @@ -197,6 +200,12 @@ xfs_inode_ag_iterator( } if (nr <= 0) break; + if (ag >= mp->m_sb.sb_agcount) { + looped = 1; + ag = 0; + } + if (ag >= start_ag && looped) + break; } if (nr_to_scan) *nr_to_scan = nr; diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 622da21..fae61bb 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -199,6 +199,7 @@ typedef struct xfs_mount { __int64_t m_update_flags; /* sb flags we need to update on the next remount,rw */ struct shrinker m_inode_shrink; /* inode reclaim shrinker */ + atomic_t m_agiter_rotor; /* ag iterator start rotor */ } xfs_mount_t; /* -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html