On Thu, Jan 21, 2016 at 05:34:11PM +0100, Jan Kara wrote:
> On Thu 21-01-16 10:22:57, Brian Foster wrote:
> > On Thu, Jan 21, 2016 at 07:11:59AM +1100, Dave Chinner wrote:
> > > On Wed, Jan 20, 2016 at 02:26:26PM +0100, Jan Kara wrote:
> > > > On Tue 19-01-16 12:59:12, Brian Foster wrote:
> > > > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > > > ...
> > > >
> >
> > Hi Jan, Dave,
> > ...
> > > > a) How much sync(2) speed has improved if there's not much to wait for.
> > >
> > > Depends on the size of the inode cache when sync is run. If it's
> > > empty it's not noticeable. When you have tens of millions of cached,
> > > clean inodes the inode list traversal can take tens of seconds.
> > > This is the sort of problem Josef reported that FB were having...
> > >
> > FWIW, Ceph has indicated this is a pain point for them as well. The
> > results at [0] below show the difference in sync time with a largely
> > populated inode cache before and after this patch.
> >
> > > > b) See whether parallel heavy stat(2) load which is rotating lots of inodes
> > > > in the inode cache sees some improvement when it doesn't have to contend with
> > > > sync(2) on s_inode_list_lock. I believe Dave Chinner had some loads where
> > > > the contention on s_inode_list_lock due to sync and rotation of inodes was
> > > > pretty heavy.
> > > Just my usual fsmark workloads - they have parallel find and
> > > parallel ls -lR traversals over the created fileset. Even just
> > > running sync during creation (because there are millions of cached
> > > inodes, and ~250,000 inodes being instantiated and reclaimed every
> > > second) causes lock contention problems....
> >
> > I ran a similar parallel (16x) fs_mark workload using '-S 4,' which
> > incorporates a sync() per pass. Without this patch, this demonstrates a
> > slow degradation as the inode cache grows. Results at [1].
> Thanks for the results. I think it would be good if you incorporated them
> in the changelog since other people will likely be asking similar
> questions when seeing the inode cache is growing. Other than that feel free
> to add:
>
> Reviewed-by: Jan Kara <jack@xxxxxxx>
>

No problem, thanks! Sure, I don't want to dump the raw results into the
commit log description to avoid making it too long, but I can reference
the core sync time impact. I've appended the following for now:

"With this change, filesystem sync times are significantly reduced for
filesystems with largely populated inode caches and otherwise no other
work to do. For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS
filesystem with a ~10m entry inode cache, sync times are reduced from
~7.3s to less than 0.1s when the filesystem is fully clean."

I'll repost in a day or so if I don't receive any other feedback.

Brian

> Honza
> > 16xcpu, 32GB RAM x86-64 server
> > Storage is LVM volumes on hw raid0.
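A note for anyone skimming the numbers below: the reason sync time tracks
the size of the inode cache is that, without this patch, the sync wait
path -- wait_sb_inodes() in fs/fs-writeback.c -- walks every cached inode
on the superblock under s_inode_list_lock, whether or not the inode has
anything to wait on. The following is only a simplified sketch of that
walk from my reading of the code, not the kernel source itself; the
_sketch name, the trimmed error handling and the plain filemap_fdatawait()
call are all illustrative:

#include <linux/fs.h>
#include <linux/sched.h>

/*
 * Sketch of the pre-patch per-superblock sync wait: it visits *every*
 * cached inode on the sb, clean or dirty, so the cost scales with the
 * size of the inode cache rather than the amount of outstanding writeback.
 */
static void wait_sb_inodes_sketch(struct super_block *sb)
{
	struct inode *inode, *old_inode = NULL;

	mutex_lock(&sb->s_sync_lock);
	spin_lock(&sb->s_inode_list_lock);

	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		struct address_space *mapping = inode->i_mapping;

		spin_lock(&inode->i_lock);
		if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) ||
		    mapping->nrpages == 0) {
			/*
			 * Clean inodes are skipped, but only after being
			 * visited while s_inode_list_lock is held, so on a
			 * clean cache the lock is held for the whole walk.
			 */
			spin_unlock(&inode->i_lock);
			continue;
		}

		/* pin the inode and drop the list lock while we wait */
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		iput(old_inode);
		old_inode = inode;

		/* wait for writeback already in flight on this mapping */
		filemap_fdatawait(mapping);

		cond_resched();

		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);
	iput(old_inode);
	mutex_unlock(&sb->s_sync_lock);
}

On a fully clean cache that is a pure O(cached inodes) traversal with
s_inode_list_lock held, which is where the ~7s sync times against a ~10m
entry xfs_inode slab come from; it is also the same lock a workload
instantiating and reclaiming inodes has to take to add and remove them
from s_inodes, hence the contention Dave describes above.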
> >
> > [0] -- sync test w/ ~10m clean inode cache
> > - 10TB pre-populated XFS fs, cache populated via parallel find/stat
> >   workload
> >
> > --- 4.4.0+
> >
> > # cat /proc/slabinfo | grep xfs
> > xfs_dqtrx              0      0    528   62    8 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_dquot              0      0    656   49    8 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_buf           496293 496893    640   51    8 : tunables    0    0    0 : slabdata   9743   9743      0
> > xfs_icr                0      0    144   56    2 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_inode       10528071 10529150   1728   18    8 : tunables    0    0    0 : slabdata 584999 584999      0
> > xfs_efd_item           0      0    400   40    4 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_da_state         544    544    480   34    4 : tunables    0    0    0 : slabdata     16     16      0
> > xfs_btree_cur          0      0    208   39    2 : tunables    0    0    0 : slabdata      0      0      0
> >
> > # time sync
> >
> > real    0m7.322s
> > user    0m0.000s
> > sys     0m7.314s
> > # time sync
> >
> > real    0m7.299s
> > user    0m0.000s
> > sys     0m7.296s
> >
> > --- 4.4.0+ w/ sync patch
> >
> > # cat /proc/slabinfo | grep xfs
> > xfs_dqtrx              0      0    528   62    8 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_dquot              0      0    656   49    8 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_buf           428214 428514    640   51    8 : tunables    0    0    0 : slabdata   8719   8719      0
> > xfs_icr                0      0    144   56    2 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_inode       11054375 11054438   1728   18    8 : tunables    0    0    0 : slabdata 721323 721323      0
> > xfs_efd_item           0      0    400   40    4 : tunables    0    0    0 : slabdata      0      0      0
> > xfs_da_state         544    544    480   34    4 : tunables    0    0    0 : slabdata     16     16      0
> > xfs_btree_cur          0      0    208   39    2 : tunables    0    0    0 : slabdata      0      0      0
> >
> > # time sync
> >
> > real    0m0.040s
> > user    0m0.001s
> > sys     0m0.003s
> > # time sync
> >
> > real    0m0.002s
> > user    0m0.001s
> > sys     0m0.002s
> >
> > [1] -- fs_mark -D 1000 -S4 -n 1000 -d /mnt/0 ... -d /mnt/15 -L 32
> > - 1TB XFS fs
> >
> > --- 4.4.0+
> >
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      2        16000        51200       3313.3           822514
> >      2        32000        51200       3353.6           310268
> >      2        48000        51200       3475.2           289941
> >      2        64000        51200       3104.6           289993
> >      2        80000        51200       2944.9           292124
> >      2        96000        51200       3010.4           288042
> >      3       112000        51200       2756.4           289761
> >      3       128000        51200       2753.2           288096
> >      3       144000        51200       2474.4           290797
> >      3       160000        51200       2657.9           290898
> >      3       176000        51200       2498.0           288247
> >      3       192000        51200       2415.5           287329
> >      3       208000        51200       2336.1           291113
> >      3       224000        51200       2352.9           290103
> >      3       240000        51200       2309.6           289580
> >      3       256000        51200       2344.3           289828
> >      3       272000        51200       2293.0           291282
> >      3       288000        51200       2295.5           286538
> >      4       304000        51200       2119.0           288906
> >      4       320000        51200       2059.6           293605
> >      4       336000        51200       2129.1           289825
> >      4       352000        51200       1929.8           288186
> >      4       368000        51200       1987.5           294596
> >      4       384000        51200       1929.1           293528
> >      4       400000        51200       1934.8           288138
> >      4       416000        51200       1823.6           292318
> >      4       432000        51200       1838.7           290890
> >      4       448000        51200       1797.5           288816
> >      4       464000        51200       1823.2           287190
> >      4       480000        51200       1738.7           295745
> >      4       496000        51200       1716.4           293821
> >      5       512000        51200       1726.7           290445
> >
> > --- 4.4.0+ w/ sync patch
> >
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      2        16000        51200       3409.7           999579
> >      2        32000        51200       3481.3           286877
> >      2        48000        51200       3447.3           282743
> >      2        64000        51200       3522.3           283400
> >      2        80000        51200       3427.0           286360
> >      2        96000        51200       3360.2           307219
> >      3       112000        51200       3377.7           286625
> >      3       128000        51200       3363.7           285929
> >      3       144000        51200       3345.7           283138
> >      3       160000        51200       3384.9           291081
> >      3       176000        51200       3084.1           285265
> >      3       192000        51200       3388.4           291439
> >      3       208000        51200       3242.8           286332
> >      3       224000        51200       3337.9           285006
> >      3       240000        51200       3442.8           292109
> >      3       256000        51200       3230.3           283432
> >      3       272000        51200       3358.3           286996
> >      3       288000        51200       3309.0           288058
> >      4       304000        51200       3293.4           284309
> >      4       320000        51200       3221.4           284476
> >      4       336000        51200       3241.5           283968
> >      4       352000        51200       3228.3           284354
> >      4       368000        51200       3255.7           286072
> >      4       384000        51200       3094.6           290240
> >      4       400000        51200       3385.6           288158
> >      4       416000        51200       3265.2           284387
> >      4       432000        51200       3315.2           289656
> >      4       448000        51200       3275.1           284562
> >      4       464000        51200       3238.4           294976
> >      4       480000        51200       3060.0           290088
> >      4       496000        51200       3359.5           286949
> >      5       512000        51200       3156.2           288126
> >
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html