On Thu, Jan 21, 2016 at 07:11:59AM +1100, Dave Chinner wrote:
> On Wed, Jan 20, 2016 at 02:26:26PM +0100, Jan Kara wrote:
> > On Tue 19-01-16 12:59:12, Brian Foster wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > >
> > > wait_sb_inodes() currently does a walk of all inodes in the
> > > filesystem to find dirty ones to wait on during sync. This is highly
> > > inefficient and wastes a lot of CPU when there are lots of clean
> > > cached inodes that we don't need to wait on.
> > >
> > > To avoid this "all inode" walk, we need to track inodes that are
> > > currently under writeback that we need to wait for. We do this by
> > > adding inodes to a writeback list on the sb when the mapping is
> > > first tagged as having pages under writeback. wait_sb_inodes() can
> > > then walk this list of "inodes under IO" and wait specifically just
> > > for the inodes that the current sync(2) needs to wait for.
> > >
> > > Define a couple of helpers to add/remove an inode from the writeback
> > > list and call them when the overall mapping is tagged for or cleared
> > > from writeback. Update wait_sb_inodes() to walk only the inodes
> > > under writeback due to the sync.
> >

Hi Jan, Dave,

> > The patch looks good. Just one comment: This grows struct inode by two
> > longs. Such a growth should be justified by measuring the improvements. So
> > can you measure some numbers showing how much the patch helped? I think it
> > would be interesting to see:
> >

Thanks... indeed, I had run some simple tests that demonstrate the
effectiveness of the change, and I reran them recently against the
latest version. The results are appended to this mail. Note that I
don't have anything at the moment that demonstrates a notable
improvement with RCU over the original spin lock approach. I can play
with that a bit more, but that's not really the crux of the patch.

> > a) How much sync(2) speed has improved if there's not much to wait for.
>
> Depends on the size of the inode cache when sync is run. If it's
> empty it's not noticeable. When you have tens of millions of cached,
> clean inodes the inode list traversal can take tens of seconds.
> This is the sort of problem Josef reported that FB were having...
>

FWIW, Ceph has indicated this is a pain point for them as well. The
results at [0] below show the difference in sync time with a largely
populated inode cache before and after this patch.

> > b) See whether parallel heavy stat(2) load which is rotating lots of inodes
> > in inode cache sees some improvement when it doesn't have to contend with
> > sync(2) on s_inode_list_lock. I believe Dave Chinner had some loads where
> > the contention on s_inode_list_lock due to sync and rotation of inodes was
> > pretty heavy.
>
> Just my usual fsmark workloads - they have parallel find and
> parallel ls -lR traversals over the created fileset. Even just
> running sync during creation (because there are millions of cached
> inodes, and ~250,000 inodes being instantiated and reclaimed every
> second) causes lock contention problems....
>

I ran a similar parallel (16x) fs_mark workload using '-S 4', which
incorporates a sync() per pass. Without this patch, it shows a steady
degradation in throughput as the inode cache grows. Results at [1].

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
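
Before the numbers, a rough sketch of the mechanism the commit message
above describes, for anyone following along. This is illustration only,
not the patch itself: the field and function names (inode->i_wb_list,
sb->s_inodes_wb, sb->s_inode_wblist_lock, sb_mark_inode_writeback(),
sb_clear_inode_writeback()) are placeholders I'm assuming for the new
list/lock, not necessarily what the patch uses.

/*
 * Illustrative sketch only. Assumes the patch adds a list head to
 * struct inode (i_wb_list) and a list plus lock to struct super_block
 * (s_inodes_wb / s_inode_wblist_lock); all placeholder names.
 */
#include <linux/fs.h>
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/writeback.h>

/*
 * Add the inode to the sb writeback list when its mapping first gains
 * a page under writeback.
 */
static void sb_mark_inode_writeback(struct inode *inode)
{
        struct super_block *sb = inode->i_sb;
        unsigned long flags;

        if (list_empty(&inode->i_wb_list)) {
                spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
                if (list_empty(&inode->i_wb_list))
                        list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
                spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
        }
}

/* Remove it again once the last page under writeback completes. */
static void sb_clear_inode_writeback(struct inode *inode)
{
        struct super_block *sb = inode->i_sb;
        unsigned long flags;

        if (!list_empty(&inode->i_wb_list)) {
                spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
                list_del_init(&inode->i_wb_list);
                spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
        }
}

/*
 * wait_sb_inodes() then only visits inodes that actually have writeback
 * in flight, instead of every cached inode on the sb. Inode reference
 * counting against concurrent eviction is omitted from this sketch.
 */
static void wait_sb_inodes(struct super_block *sb)
{
        LIST_HEAD(sync_list);

        spin_lock_irq(&sb->s_inode_wblist_lock);
        list_splice_init(&sb->s_inodes_wb, &sync_list);

        while (!list_empty(&sync_list)) {
                struct inode *inode = list_first_entry(&sync_list,
                                        struct inode, i_wb_list);
                struct address_space *mapping = inode->i_mapping;

                /*
                 * Put the inode back on the sb list as we go so that
                 * writeback completion can still find and remove it.
                 */
                list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);

                if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
                        continue;

                spin_unlock_irq(&sb->s_inode_wblist_lock);
                filemap_fdatawait_keep_errors(mapping);
                cond_resched();
                spin_lock_irq(&sb->s_inode_wblist_lock);
        }
        spin_unlock_irq(&sb->s_inode_wblist_lock);
}

The list_empty() check outside the lock is just an optimization and is
rechecked under the lock, so an inode is only ever on the list once.
The effect is that sync(2) waits on exactly the set of inodes with
pages under writeback rather than walking (and locking) the entire
per-sb inode list, which is what the numbers below measure.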
16xcpu, 32GB RAM x86-64 server
Storage is LVM volumes on hw raid0.

[0] -- sync test w/ ~10m clean inode cache
- 10TB pre-populated XFS fs, cache populated via parallel find/stat workload

--- 4.4.0+

# cat /proc/slabinfo | grep xfs
xfs_dqtrx              0        0    528   62    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_dquot              0        0    656   49    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_buf           496293   496893    640   51    8 : tunables    0    0    0 : slabdata   9743   9743      0
xfs_icr                0        0    144   56    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_inode       10528071 10529150   1728   18    8 : tunables    0    0    0 : slabdata 584999 584999      0
xfs_efd_item           0        0    400   40    4 : tunables    0    0    0 : slabdata      0      0      0
xfs_da_state         544      544    480   34    4 : tunables    0    0    0 : slabdata     16     16      0
xfs_btree_cur          0        0    208   39    2 : tunables    0    0    0 : slabdata      0      0      0

# time sync

real    0m7.322s
user    0m0.000s
sys     0m7.314s

# time sync

real    0m7.299s
user    0m0.000s
sys     0m7.296s

--- 4.4.0+ w/ sync patch

# cat /proc/slabinfo | grep xfs
xfs_dqtrx              0        0    528   62    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_dquot              0        0    656   49    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_buf           428214   428514    640   51    8 : tunables    0    0    0 : slabdata   8719   8719      0
xfs_icr                0        0    144   56    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_inode       11054375 11054438   1728   18    8 : tunables    0    0    0 : slabdata 721323 721323      0
xfs_efd_item           0        0    400   40    4 : tunables    0    0    0 : slabdata      0      0      0
xfs_da_state         544      544    480   34    4 : tunables    0    0    0 : slabdata     16     16      0
xfs_btree_cur          0        0    208   39    2 : tunables    0    0    0 : slabdata      0      0      0

# time sync

real    0m0.040s
user    0m0.001s
sys     0m0.003s

# time sync

real    0m0.002s
user    0m0.001s
sys     0m0.002s

[1] -- fs_mark -D 1000 -S4 -n 1000 -d /mnt/0 ... -d /mnt/15 -L 32
- 1TB XFS fs

--- 4.4.0+

FSUse%        Count         Size    Files/sec     App Overhead
     2        16000        51200       3313.3           822514
     2        32000        51200       3353.6           310268
     2        48000        51200       3475.2           289941
     2        64000        51200       3104.6           289993
     2        80000        51200       2944.9           292124
     2        96000        51200       3010.4           288042
     3       112000        51200       2756.4           289761
     3       128000        51200       2753.2           288096
     3       144000        51200       2474.4           290797
     3       160000        51200       2657.9           290898
     3       176000        51200       2498.0           288247
     3       192000        51200       2415.5           287329
     3       208000        51200       2336.1           291113
     3       224000        51200       2352.9           290103
     3       240000        51200       2309.6           289580
     3       256000        51200       2344.3           289828
     3       272000        51200       2293.0           291282
     3       288000        51200       2295.5           286538
     4       304000        51200       2119.0           288906
     4       320000        51200       2059.6           293605
     4       336000        51200       2129.1           289825
     4       352000        51200       1929.8           288186
     4       368000        51200       1987.5           294596
     4       384000        51200       1929.1           293528
     4       400000        51200       1934.8           288138
     4       416000        51200       1823.6           292318
     4       432000        51200       1838.7           290890
     4       448000        51200       1797.5           288816
     4       464000        51200       1823.2           287190
     4       480000        51200       1738.7           295745
     4       496000        51200       1716.4           293821
     5       512000        51200       1726.7           290445

--- 4.4.0+ w/ sync patch

FSUse%        Count         Size    Files/sec     App Overhead
     2        16000        51200       3409.7           999579
     2        32000        51200       3481.3           286877
     2        48000        51200       3447.3           282743
     2        64000        51200       3522.3           283400
     2        80000        51200       3427.0           286360
     2        96000        51200       3360.2           307219
     3       112000        51200       3377.7           286625
     3       128000        51200       3363.7           285929
     3       144000        51200       3345.7           283138
     3       160000        51200       3384.9           291081
     3       176000        51200       3084.1           285265
     3       192000        51200       3388.4           291439
     3       208000        51200       3242.8           286332
     3       224000        51200       3337.9           285006
     3       240000        51200       3442.8           292109
     3       256000        51200       3230.3           283432
     3       272000        51200       3358.3           286996
     3       288000        51200       3309.0           288058
     4       304000        51200       3293.4           284309
     4       320000        51200       3221.4           284476
     4       336000        51200       3241.5           283968
     4       352000        51200       3228.3           284354
     4       368000        51200       3255.7           286072
     4       384000        51200       3094.6           290240
     4       400000        51200       3385.6           288158
     4       416000        51200       3265.2           284387
     4       432000        51200       3315.2           289656
     4       448000        51200       3275.1           284562
     4       464000        51200       3238.4           294976
     4       480000        51200       3060.0           290088
     4       496000        51200       3359.5           286949
     5       512000        51200       3156.2           288126
linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html