Re: [PATCH v6 1/2] sb: add a new writeback list for sync

On Thu, Jan 21, 2016 at 07:11:59AM +1100, Dave Chinner wrote:
> On Wed, Jan 20, 2016 at 02:26:26PM +0100, Jan Kara wrote:
> > On Tue 19-01-16 12:59:12, Brian Foster wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > 
> > > wait_sb_inodes() currently does a walk of all inodes in the
> > > filesystem to find dirty ones to wait on during sync. This is highly
> > > inefficient and wastes a lot of CPU when there are lots of clean
> > > cached inodes that we don't need to wait on.
> > > 
> > > To avoid this "all inode" walk, we need to track inodes that are
> > > currently under writeback that we need to wait for. We do this by
> > > adding inodes to a writeback list on the sb when the mapping is
> > > first tagged as having pages under writeback. wait_sb_inodes() can
> > > then walk this list of "inodes under IO" and wait specifically just
> > > for the inodes that the current sync(2) needs to wait for.
> > > 
> > > Define a couple helpers to add/remove an inode from the writeback
> > > list and call them when the overall mapping is tagged for or cleared
> > > from writeback. Update wait_sb_inodes() to walk only the inodes
> > > under writeback due to the sync.
> > 

Hi Jan, Dave,

> > The patch looks good.  Just one comment: This grows struct inode by two
> > longs. Such a growth should be justified by measuring the improvements. So
> > can you measure some numbers showing how much the patch helped? I think it
> > would be interesting to see:
> > 

Thanks. Indeed, I had run some simple tests that demonstrate the
effectiveness of the change, and I reran them recently against the
latest version. The results are appended to this mail.

Note that I don't have anything at the moment that demonstrates a
notable improvement with RCU over the original spinlock approach. I can
play with that a bit more, but that's not really the crux of the patch.
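
For anyone skimming without the patch in front of them, the tagging side
boils down to a pair of small helpers that put the inode on a per-sb
writeback list when its mapping first gains the writeback tag and take
it off again when the tag clears. A rough sketch of the spinlock variant
(the field and helper names here are illustrative, and tracepoints and
other details are omitted):

/*
 * Sketch only: assumes the patch adds inode->i_wb_list plus a
 * sb->s_inodes_wb list protected by sb->s_inode_wblist_lock, which is
 * what accounts for the two-long growth of struct inode noted above.
 * The unlocked list_empty() check keeps the common already-listed case
 * cheap; the mark side rechecks under the lock, while the clear side
 * can rely on list_del_init() being a safe no-op for an unlisted entry.
 * irqsave is used because the callers update the writeback tag with
 * interrupts disabled.
 */
void sb_mark_inode_writeback(struct inode *inode)
{
        struct super_block *sb = inode->i_sb;
        unsigned long flags;

        if (list_empty(&inode->i_wb_list)) {
                spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
                if (list_empty(&inode->i_wb_list))
                        list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
                spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
        }
}

void sb_clear_inode_writeback(struct inode *inode)
{
        struct super_block *sb = inode->i_sb;
        unsigned long flags;

        if (!list_empty(&inode->i_wb_list)) {
                spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
                list_del_init(&inode->i_wb_list);
                spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
        }
}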

> > a) How much sync(2) speed has improved if there's not much to wait for.
> 
> Depends on the size of the inode cache when sync is run.  If it's
> empty it's not noticeable. When you have tens of millions of cached,
> clean inodes the inode list traversal can take tens of seconds.
> This is the sort of problem Josef reported that FB were having...
> 

FWIW, Ceph has indicated this is a pain point for them as well. The
results at [0] below show the difference in sync time with a largely
populated inode cache before and after this patch.
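
The wait side then becomes a walk of just that list rather than of every
cached inode on the sb, which is where that difference comes from. A
simplified sketch of the shape of the walk (again, the new field names
are as I assumed them above; the actual patch handles more details):

static void wait_sb_inodes(struct super_block *sb)
{
        LIST_HEAD(sync_list);

        mutex_lock(&sb->s_sync_lock);

        /*
         * Splice the under-writeback list onto a private list so that
         * inodes starting writeback after this point can't keep this
         * sync waiting forever. Hold rcu_read_lock() so an inode can't
         * be freed between finding it on the list and taking a
         * reference to it.
         */
        rcu_read_lock();
        spin_lock_irq(&sb->s_inode_wblist_lock);
        list_splice_init(&sb->s_inodes_wb, &sync_list);

        while (!list_empty(&sync_list)) {
                struct inode *inode = list_first_entry(&sync_list,
                                                struct inode, i_wb_list);
                struct address_space *mapping = inode->i_mapping;

                /*
                 * Put the inode back on the per-sb list as we go so
                 * writeback completion can still find and remove it,
                 * then skip it if writeback has already finished.
                 */
                list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);
                if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
                        continue;

                spin_unlock_irq(&sb->s_inode_wblist_lock);

                spin_lock(&inode->i_lock);
                if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
                        spin_unlock(&inode->i_lock);
                        spin_lock_irq(&sb->s_inode_wblist_lock);
                        continue;
                }
                __iget(inode);
                spin_unlock(&inode->i_lock);
                rcu_read_unlock();

                /* Wait only on this inode's in-flight writeback. */
                filemap_fdatawait_keep_errors(mapping);
                cond_resched();

                iput(inode);
                rcu_read_lock();
                spin_lock_irq(&sb->s_inode_wblist_lock);
        }
        spin_unlock_irq(&sb->s_inode_wblist_lock);
        rcu_read_unlock();
        mutex_unlock(&sb->s_sync_lock);
}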

> > b) See whether parallel heavy stat(2) load which is rotating lots of inodes
> > in inode cache sees some improvement when it doesn't have to contend with
> > sync(2) on s_inode_list_lock. I believe Dave Chinner had some loads where
> > the contention on s_inode_list_lock due to sync and rotation of inodes was
> > pretty heavy.
> 
> Just my usual fsmark workloads - they have parallel find and
> parallel ls -lR traversals over the created fileset. Even just
> running sync during creation (because there are millions of cached
> inodes, and ~250,000 inodes being instantiated and reclaimed every
> second) causes lock contention problems....
> 

I ran a similar parallel (16x) fs_mark workload using '-S 4', which
issues a sync() after each pass. Without this patch, throughput degrades
steadily as the inode cache grows. Results are at [1].

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx

16-CPU, 32GB RAM x86-64 server
Storage is LVM volumes on hardware raid0.

[0] -- sync test w/ ~10m clean inode cache
- 10TB pre-populated XFS fs, cache populated via parallel find/stat
  workload

--- 4.4.0+

# cat /proc/slabinfo | grep xfs
xfs_dqtrx              0      0    528   62    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_dquot              0      0    656   49    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_buf           496293 496893    640   51    8 : tunables    0    0    0 : slabdata   9743   9743      0
xfs_icr                0      0    144   56    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_inode         10528071 10529150   1728   18    8 : tunables    0    0    0 : slabdata 584999 584999      0
xfs_efd_item           0      0    400   40    4 : tunables    0    0    0 : slabdata      0      0      0
xfs_da_state         544    544    480   34    4 : tunables    0    0    0 : slabdata     16     16      0
xfs_btree_cur          0      0    208   39    2 : tunables    0    0    0 : slabdata      0      0      0

# time sync

real    0m7.322s
user    0m0.000s
sys     0m7.314s
# time sync

real    0m7.299s
user    0m0.000s
sys     0m7.296s

--- 4.4.0+ w/ sync patch

# cat /proc/slabinfo | grep xfs
xfs_dqtrx              0      0    528   62    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_dquot              0      0    656   49    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_buf           428214 428514    640   51    8 : tunables    0    0    0 : slabdata   8719   8719      0
xfs_icr                0      0    144   56    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_inode         11054375 11054438   1728   18    8 : tunables    0    0    0 : slabdata 721323 721323      0
xfs_efd_item           0      0    400   40    4 : tunables    0    0    0 : slabdata      0      0      0
xfs_da_state         544    544    480   34    4 : tunables    0    0    0 : slabdata     16     16      0
xfs_btree_cur          0      0    208   39    2 : tunables    0    0    0 : slabdata      0      0      0

# time sync

real    0m0.040s
user    0m0.001s
sys     0m0.003s
# time sync

real    0m0.002s
user    0m0.001s
sys     0m0.002s

[1] -- fs_mark -D 1000 -S4 -n 1000 -d /mnt/0 ... -d /mnt/15 -L 32
- 1TB XFS fs

--- 4.4.0+

FSUse%        Count         Size    Files/sec     App Overhead
     2        16000        51200       3313.3           822514
     2        32000        51200       3353.6           310268
     2        48000        51200       3475.2           289941
     2        64000        51200       3104.6           289993
     2        80000        51200       2944.9           292124
     2        96000        51200       3010.4           288042
     3       112000        51200       2756.4           289761
     3       128000        51200       2753.2           288096
     3       144000        51200       2474.4           290797
     3       160000        51200       2657.9           290898
     3       176000        51200       2498.0           288247
     3       192000        51200       2415.5           287329
     3       208000        51200       2336.1           291113
     3       224000        51200       2352.9           290103
     3       240000        51200       2309.6           289580
     3       256000        51200       2344.3           289828
     3       272000        51200       2293.0           291282
     3       288000        51200       2295.5           286538
     4       304000        51200       2119.0           288906
     4       320000        51200       2059.6           293605
     4       336000        51200       2129.1           289825
     4       352000        51200       1929.8           288186
     4       368000        51200       1987.5           294596
     4       384000        51200       1929.1           293528
     4       400000        51200       1934.8           288138
     4       416000        51200       1823.6           292318
     4       432000        51200       1838.7           290890
     4       448000        51200       1797.5           288816
     4       464000        51200       1823.2           287190
     4       480000        51200       1738.7           295745
     4       496000        51200       1716.4           293821
     5       512000        51200       1726.7           290445

--- 4.4.0+ w/ sync patch

FSUse%        Count         Size    Files/sec     App Overhead
     2        16000        51200       3409.7           999579
     2        32000        51200       3481.3           286877
     2        48000        51200       3447.3           282743
     2        64000        51200       3522.3           283400
     2        80000        51200       3427.0           286360
     2        96000        51200       3360.2           307219
     3       112000        51200       3377.7           286625
     3       128000        51200       3363.7           285929
     3       144000        51200       3345.7           283138
     3       160000        51200       3384.9           291081
     3       176000        51200       3084.1           285265
     3       192000        51200       3388.4           291439
     3       208000        51200       3242.8           286332
     3       224000        51200       3337.9           285006
     3       240000        51200       3442.8           292109
     3       256000        51200       3230.3           283432
     3       272000        51200       3358.3           286996
     3       288000        51200       3309.0           288058
     4       304000        51200       3293.4           284309
     4       320000        51200       3221.4           284476
     4       336000        51200       3241.5           283968
     4       352000        51200       3228.3           284354
     4       368000        51200       3255.7           286072
     4       384000        51200       3094.6           290240
     4       400000        51200       3385.6           288158
     4       416000        51200       3265.2           284387
     4       432000        51200       3315.2           289656
     4       448000        51200       3275.1           284562
     4       464000        51200       3238.4           294976
     4       480000        51200       3060.0           290088
     4       496000        51200       3359.5           286949
     5       512000        51200       3156.2           288126
