Re: [PATCH v6 1/2] sb: add a new writeback list for sync

Jan Kara <jack@xxxxxxx> · Thu, 21 Jan 2016 17:34:11 +0100

On Thu 21-01-16 10:22:57, Brian Foster wrote:
> On Thu, Jan 21, 2016 at 07:11:59AM +1100, Dave Chinner wrote:
> > On Wed, Jan 20, 2016 at 02:26:26PM +0100, Jan Kara wrote:
> > > On Tue 19-01-16 12:59:12, Brian Foster wrote:
> > > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > > 
> > > > wait_sb_inodes() currently does a walk of all inodes in the
> > > > filesystem to find dirty one to wait on during sync. This is highly
> > > > inefficient and wastes a lot of CPU when there are lots of clean
> > > > cached inodes that we don't need to wait on.
> > > > 
> > > > To avoid this "all inode" walk, we need to track inodes that are
> > > > currently under writeback that we need to wait for. We do this by
> > > > adding inodes to a writeback list on the sb when the mapping is
> > > > first tagged as having pages under writeback. wait_sb_inodes() can
> > > > then walk this list of "inodes under IO" and wait specifically just
> > > > for the inodes that the current sync(2) needs to wait for.
> > > > 
> > > > Define a couple helpers to add/remove an inode from the writeback
> > > > list and call them when the overall mapping is tagged for or cleared
> > > > from writeback. Update wait_sb_inodes() to walk only the inodes
> > > > under writeback due to the sync.
> > > 
> 
> Hi Jan, Dave,
> 
> > > The patch looks good.  Just one comment: This grows struct inode by two
> > > longs. Such a growth should be justified by measuring the improvements. So
> > > can you measure some numbers showing how much the patch helped? I think it
> > > would be interesting to see:
> > > 
> 
> Thanks.. indeed, I had run some simple tests that demonstrate the
> effectiveness of the change. I reran them recently against the latest
> version. Some results are appended to this mail.
> 
> Note that I don't have anything at the moment that demonstrates a
> notable improvement with rcu over the original spin lock approach. I can
> play with that a bit more, but that's not really the crux of the patch.
> 
> > > a) How much sync(2) speed has improved if there's not much to wait for.
> > 
> > Depends on the size of the inode cache when sync is run.  If it's
> > empty it's not noticable. When you have tens of millions of cached,
> > clean inodes the inode list traversal can takes tens of seconds.
> > This is the sort of problem Josef reported that FB were having...
> > 
> 
> FWIW, Ceph has indicated this is a pain point for them as well. The
> results at [0] below show the difference in sync time with a largely
> populated inode cache before and after this patch.
> 
> > > b) See whether parallel heavy stat(2) load which is rotating lots of inodes
> > > in inode cache sees some improvement when it doesn't have to contend with
> > > sync(2) on s_inode_list_lock. I believe Dave Chinner had some loads where
> > > the contention on s_inode_list_lock due to sync and rotation of inodes was
> > > pretty heavy.
> > 
> > Just my usual fsmark workloads - they have parallel find and
> > parallel ls -lR traversals over the created fileset. Even just
> > running sync during creation (because there are millions of cached
> > inodes, and ~250,000 inodes being instiated and reclaimed every
> > second) causes lock contention problems....
> > 
> 
> I ran a similar parallel (16x) fs_mark workload using '-S 4,' which
> incorporates a sync() per pass. Without this patch, this demonstrates a
> slow degradation as the inode cache grows. Results at [1].

Thanks for the results. I think it would be good if you incorporated them
in the changelog since other people will likely be asking similar
questions when seeing the inode is growing. Other than that feel free to
add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

								Honza
> 16xcpu, 32GB RAM x86-64 server
> Storage is LVM volumes on hw raid0.
> 
> [0] -- sync test w/ ~10m clean inode cache
> - 10TB pre-populated XFS fs, cache populated via parallel find/stat
>   workload
> 
> --- 4.4.0+
> 
> # cat /proc/slabinfo | grep xfs
> xfs_dqtrx              0      0    528   62    8 : tunables    0    0    0 : slabdata      0      0      0
> xfs_dquot              0      0    656   49    8 : tunables    0    0    0 : slabdata      0      0      0
> xfs_buf           496293 496893    640   51    8 : tunables    0    0    0 : slabdata   9743   9743      0
> xfs_icr                0      0    144   56    2 : tunables    0    0    0 : slabdata      0      0      0
> xfs_inode         10528071 10529150   1728   18    8 : tunables    0    0    0 : slabdata 584999 584999      0
> xfs_efd_item           0      0    400   40    4 : tunables    0    0    0 : slabdata      0      0      0
> xfs_da_state         544    544    480   34    4 : tunables    0    0    0 : slabdata     16     16      0
> xfs_btree_cur          0      0    208   39    2 : tunables    0    0    0 : slabdata      0      0      0
> 
> # time sync
> 
> real    0m7.322s
> user    0m0.000s
> sys     0m7.314s
> # time sync
> 
> real    0m7.299s
> user    0m0.000s
> sys     0m7.296s
> 
> --- 4.4.0+ w/ sync patch
> 
> # cat /proc/slabinfo | grep xfs
> xfs_dqtrx              0      0    528   62    8 : tunables    0    0    0 : slabdata      0      0      0
> xfs_dquot              0      0    656   49    8 : tunables    0    0    0 : slabdata      0      0      0
> xfs_buf           428214 428514    640   51    8 : tunables    0    0    0 : slabdata   8719   8719      0
> xfs_icr                0      0    144   56    2 : tunables    0    0    0 : slabdata      0      0      0
> xfs_inode         11054375 11054438   1728   18    8 : tunables    0    0    0 : slabdata 721323 721323      0
> xfs_efd_item           0      0    400   40    4 : tunables    0    0    0 : slabdata      0      0      0
> xfs_da_state         544    544    480   34    4 : tunables    0    0    0 : slabdata     16     16      0
> xfs_btree_cur          0      0    208   39    2 : tunables    0    0    0 : slabdata      0      0      0
> 
> # time sync
> 
> real    0m0.040s
> user    0m0.001s
> sys     0m0.003s
> # time sync
> 
> real    0m0.002s
> user    0m0.001s
> sys     0m0.002s
> 
> [1] -- fs_mark -D 1000 -S4 -n 1000 -d /mnt/0 ... -d /mnt/15 -L 32
> - 1TB XFS fs
> 
> --- 4.4.0+
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      2        16000        51200       3313.3           822514
>      2        32000        51200       3353.6           310268
>      2        48000        51200       3475.2           289941
>      2        64000        51200       3104.6           289993
>      2        80000        51200       2944.9           292124
>      2        96000        51200       3010.4           288042
>      3       112000        51200       2756.4           289761
>      3       128000        51200       2753.2           288096
>      3       144000        51200       2474.4           290797
>      3       160000        51200       2657.9           290898
>      3       176000        51200       2498.0           288247
>      3       192000        51200       2415.5           287329
>      3       208000        51200       2336.1           291113
>      3       224000        51200       2352.9           290103
>      3       240000        51200       2309.6           289580
>      3       256000        51200       2344.3           289828
>      3       272000        51200       2293.0           291282
>      3       288000        51200       2295.5           286538
>      4       304000        51200       2119.0           288906
>      4       320000        51200       2059.6           293605
>      4       336000        51200       2129.1           289825
>      4       352000        51200       1929.8           288186
>      4       368000        51200       1987.5           294596
>      4       384000        51200       1929.1           293528
>      4       400000        51200       1934.8           288138
>      4       416000        51200       1823.6           292318
>      4       432000        51200       1838.7           290890
>      4       448000        51200       1797.5           288816
>      4       464000        51200       1823.2           287190
>      4       480000        51200       1738.7           295745
>      4       496000        51200       1716.4           293821
>      5       512000        51200       1726.7           290445
> 
> --- 4.4.0+ w/ sync patch
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      2        16000        51200       3409.7           999579
>      2        32000        51200       3481.3           286877
>      2        48000        51200       3447.3           282743
>      2        64000        51200       3522.3           283400
>      2        80000        51200       3427.0           286360
>      2        96000        51200       3360.2           307219
>      3       112000        51200       3377.7           286625
>      3       128000        51200       3363.7           285929
>      3       144000        51200       3345.7           283138
>      3       160000        51200       3384.9           291081
>      3       176000        51200       3084.1           285265
>      3       192000        51200       3388.4           291439
>      3       208000        51200       3242.8           286332
>      3       224000        51200       3337.9           285006
>      3       240000        51200       3442.8           292109
>      3       256000        51200       3230.3           283432
>      3       272000        51200       3358.3           286996
>      3       288000        51200       3309.0           288058
>      4       304000        51200       3293.4           284309
>      4       320000        51200       3221.4           284476
>      4       336000        51200       3241.5           283968
>      4       352000        51200       3228.3           284354
>      4       368000        51200       3255.7           286072
>      4       384000        51200       3094.6           290240
>      4       400000        51200       3385.6           288158
>      4       416000        51200       3265.2           284387
>      4       432000        51200       3315.2           289656
>      4       448000        51200       3275.1           284562
>      4       464000        51200       3238.4           294976
>      4       480000        51200       3060.0           290088
>      4       496000        51200       3359.5           286949
>      5       512000        51200       3156.2           288126
> 
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html