On Thu, Jun 12, 2014 at 06:34:07PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> We've had plenty of requests for an asynchronous fsync over the past
> few years, and we've got the infrastructure there to do it. But
> nobody has wired it up to test it. The common request we get from
> userspace storage applications is to do a post-write pass over a set
> of files that were just written (i.e. bulk background fsync) for
> point-in-time checkpointing or flushing purposes.
>
> So, just to see if I could brute force an effective implementation,
> wire up aio_fsync, add a workqueue and push all the fsync calls off
> to the workqueue. The workqueue will allow parallel dispatch, switch
> execution if a fsync blocks for any reason, etc. Brute force and
> very effective....
>
> So, I hacked up fs_mark to enable fsync via the libaio io_fsync()
> interface to run some tests. The quick test is:
>
>	- write 10000 4k files into the cache
>	- run a post write open-fsync-close pass (sync mode 5)
>	- run 5 iterations
>	- run a single thread, then 4 threads.
>
> First I ran it on a 500TB sparse filesystem on a SSD.
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      0        10000         4096        599.1           153855
>      0        20000         4096        739.2           151228
>      0        30000         4096        672.2           152937
>      0        40000         4096        719.9           150615
>      0        50000         4096        708.4           154889
>
> real    1m13.121s
> user    0m0.825s
> sys     0m11.024s
>
> Runs at around 500 log forces a second and 1500 IOPS.
>
> Using io_fsync():
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      0        10000         4096       2700.5           130313
>      0        20000         4096       3938.8           133602
>      0        30000         4096       4608.7           107871
>      0        40000         4096       4768.4            82965
>      0        50000         4096       4615.0            89220
>
> real    0m12.691s
> user    0m0.460s
> sys     0m7.389s
>
> Runs at around 4,000 log forces a second and 4500 IOPS. Massive
> reduction in runtime through parallel dispatch of the fsync calls.
>
> Run the same workload, 4 threads at a time.
> Normal fsync:
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      0        40000         4096       2151.5           617010
>      0        80000         4096       1953.0           613470
>      0       120000         4096       1874.4           625027
>      0       160000         4096       1907.4           624319
>      0       200000         4096       1924.3           627567
>
> real    1m42.243s
> user    0m3.552s
> sys     0m49.118s
>
> Runs at ~2000 log forces/s and 3,500 IOPS.
>
> Using io_fsync():
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      0        40000         4096      11518.9           427666
>      0        80000         4096      15668.8           401661
>      0       120000         4096      15607.0           382279
>      0       160000         4096      14935.0           399097
>      0       200000         4096      15198.6           413965
>
> real    0m14.192s
> user    0m1.891s
> sys     0m30.136s
>
> Almost perfect scaling! ~15,000 log forces a second and ~20,000 IOPS.
>
> Now run the tests on a HW RAID0 of spinning disk:
>
> Threads         files/s    run time    log force/s    IOPS
>  1, fsync           800     1m 5.1s            800    1500
>  1, io_fsync       6000        8.4s           5000    5500
>  4, fsync          1800     1m47.1s           2200    3500
>  4, io_fsync      19000       10.3s          21000   26000
>
> Pretty much the same results. Spinning disks don't scale much
> further. The SSD can go a bit higher, with 8 threads generating
> a consistent 24,000 files/s, but at that point we're starting to see
> non-linear system CPU usage (probably lock contention in the log).
>
> But, regardless, there's a massive potential for speed gains for
> applications that need to do bulk fsync operations and don't need to
> care about the IO latency of individual fsync operations....
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---

That looks great. This is something that could be quite beneficial to
glusterfs, as a real world example. The replication mechanism does an
xattr dance across servers and required addition of fsync's into the
algorithm to ensure correctness in the case of failures. This had a
notable impact on performance.

We thought a bit about hooking up aio_fsync(), but more along the lines
of waiting for the log to force rather than forcing it explicitly, but
didn't really go anywhere with it. I didn't consider we'd get such a
benefit from simply dropping it into a workqueue.
:)

I do like Christoph's idea... perhaps create a generic_file_aio_fsync()
or some such?

Brian

>  fs/xfs/xfs_file.c  | 41 +++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_mount.h |  2 ++
>  fs/xfs/xfs_super.c |  9 +++++++++
>  3 files changed, 52 insertions(+)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 077bcc8..9cdecee 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -45,6 +45,7 @@
>  #include <linux/pagevec.h>
>
>  static const struct vm_operations_struct xfs_file_vm_ops;
> +struct workqueue_struct *xfs_aio_fsync_wq;
>
>  /*
>   * Locking primitives for read and write IO paths to ensure we consistently use
> @@ -228,6 +229,45 @@ xfs_file_fsync(
>  	return error;
>  }
>
> +struct xfs_afsync_args {
> +	struct work_struct work;
> +	struct kiocb *iocb;
> +	struct file *file;
> +	int datasync;
> +};
> +
> +STATIC void
> +xfs_file_aio_fsync_work(
> +	struct work_struct *work)
> +{
> +	struct xfs_afsync_args *args = container_of(work,
> +					struct xfs_afsync_args, work);
> +	int error;
> +
> +	error = xfs_file_fsync(args->file, 0, -1LL, args->datasync);
> +	aio_complete(args->iocb, error, 0);
> +	kmem_free(args);
> +}
> +
> +STATIC int
> +xfs_file_aio_fsync(
> +	struct kiocb *iocb,
> +	int datasync)
> +{
> +	struct xfs_afsync_args *args;
> +
> +	args = kmem_zalloc(sizeof(struct xfs_afsync_args), KM_SLEEP|KM_MAYFAIL);
> +	if (!args)
> +		return -ENOMEM;
> +
> +	INIT_WORK(&args->work, xfs_file_aio_fsync_work);
> +	args->iocb = iocb;
> +	args->file = iocb->ki_filp;
> +	args->datasync = datasync;
> +	queue_work(xfs_aio_fsync_wq, &args->work);
> +	return -EIOCBQUEUED;
> +}
> +
>  STATIC ssize_t
>  xfs_file_aio_read(
>  	struct kiocb *iocb,
> @@ -1475,6 +1515,7 @@ const struct file_operations xfs_file_operations = {
>  	.open = xfs_file_open,
>  	.release = xfs_file_release,
>  	.fsync = xfs_file_fsync,
> +	.aio_fsync = xfs_file_aio_fsync,
>  	.fallocate = xfs_file_fallocate,
>  };
>
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 7295a0b..dfcf37b 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -390,6 +390,8 @@ extern int xfs_dev_is_read_only(struct xfs_mount *, char *);
>
>  extern void xfs_set_low_space_thresholds(struct xfs_mount *);
>
> +extern struct workqueue_struct *xfs_aio_fsync_wq;
> +
>  #endif /* __KERNEL__ */
>
>  #endif /* __XFS_MOUNT_H__ */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index f2e5f8a..86d4923 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1718,12 +1718,21 @@ xfs_init_workqueues(void)
>  	if (!xfs_alloc_wq)
>  		return -ENOMEM;
>
> +	xfs_aio_fsync_wq = alloc_workqueue("xfsfsync", 0, 0);
> +	if (!xfs_aio_fsync_wq)
> +		goto destroy_alloc_wq;
> +
>  	return 0;
> +
> +destroy_alloc_wq:
> +	destroy_workqueue(xfs_alloc_wq);
> +	return -ENOMEM;
>  }
>
>  STATIC void
>  xfs_destroy_workqueues(void)
>  {
> +	destroy_workqueue(xfs_aio_fsync_wq);
>  	destroy_workqueue(xfs_alloc_wq);
>  }
>
> --
> 2.0.0
>
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs
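
For comparison, the parallel-dispatch idea can be approximated from
userspace without io_fsync(). What follows is only a rough sketch under
stated assumptions, not what the patch does: it uses POSIX threads
instead of a kernel workqueue, and the names bulk_fsync_job,
open_fsync_close(), bulk_fsync() and BULK_FSYNC_MAX_WORKERS are all
hypothetical. It runs the same kind of post-write open-fsync-close pass
as the fs_mark test, spread across a few worker threads, and records a
per-file result.

```c
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

#define BULK_FSYNC_MAX_WORKERS 64	/* arbitrary cap for this sketch */

/* Shared state for one bulk pass; all names here are hypothetical. */
struct bulk_fsync_job {
	const char *const *paths;	/* files to flush */
	int *results;			/* per-file result: 0 or an errno value */
	size_t nr_paths;
	size_t next;			/* next unclaimed index, guarded by lock */
	pthread_mutex_t lock;
};

/* Open, fsync and close one file; returns 0 or the errno of the failing call. */
static int open_fsync_close(const char *path)
{
	int fd = open(path, O_RDONLY);
	int error = 0;

	if (fd < 0)
		return errno;
	if (fsync(fd) < 0)
		error = errno;
	if (close(fd) < 0 && !error)
		error = errno;
	return error;
}

/* Worker: claim indices one at a time until the list is exhausted. */
static void *bulk_fsync_worker(void *arg)
{
	struct bulk_fsync_job *job = arg;

	for (;;) {
		size_t i;

		pthread_mutex_lock(&job->lock);
		i = job->next < job->nr_paths ? job->next++ : job->nr_paths;
		pthread_mutex_unlock(&job->lock);
		if (i >= job->nr_paths)
			return NULL;
		job->results[i] = open_fsync_close(job->paths[i]);
	}
}

/*
 * Flush every path using up to nr_workers extra threads and wait for all
 * of them. Returns the number of files that failed; results[i] holds the
 * errno value for paths[i], or 0 on success.
 */
size_t bulk_fsync(const char *const *paths, int *results, size_t nr_paths,
		  unsigned int nr_workers)
{
	struct bulk_fsync_job job = {
		.paths = paths, .results = results, .nr_paths = nr_paths,
	};
	pthread_t tids[BULK_FSYNC_MAX_WORKERS];
	unsigned int started = 0, t;
	size_t i, failed = 0;

	if (nr_workers > BULK_FSYNC_MAX_WORKERS)
		nr_workers = BULK_FSYNC_MAX_WORKERS;

	pthread_mutex_init(&job.lock, NULL);
	while (started < nr_workers &&
	       pthread_create(&tids[started], NULL, bulk_fsync_worker, &job) == 0)
		started++;

	/* The caller works too, so the pass completes even with no threads. */
	bulk_fsync_worker(&job);

	for (t = 0; t < started; t++)
		pthread_join(tids[t], NULL);
	pthread_mutex_destroy(&job.lock);

	for (i = 0; i < nr_paths; i++)
		failed += results[i] != 0;
	return failed;
}
```

A caller would pass the list of just-written files, e.g.
bulk_fsync(paths, results, nr_paths, 4), and then inspect results[] for
any non-zero errno. Unlike the io_fsync() path in the patch, each
in-flight fsync here ties up a userspace thread, so it makes no claim to
match the numbers quoted above.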