On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
> On 2014-06-15 20:00, Dave Chinner wrote:
> >On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
> >>On Fri, Jun 13, 2014 at 09:23:52AM -0700, Christoph Hellwig wrote:
> >>>On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote:
> >>>>On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
> >>>>>There doesn't really seem to be anything XFS specific here, so
> >>>>>instead of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC
> >>>>>in fs/aio.c based on the workqueue and ->fsync.
> >>>>
> >>>>I really don't know whether the other ->fsync methods in other
> >>>>filesystems can stand alone like that. I also don't have the
> >>>>time to test that it works properly on all filesystems right now.
> >>>
> >>>Of course they can, as shown by the various callers of
> >>>vfs_fsync_range, which is nothing but a small wrapper around
> >>>->fsync.
> >>
> >>Sure, but that's not getting 10,000 concurrent callers, is it? And
> >>some fsync methods require journal credits, and others serialise
> >>completely, and so on.
> >>
> >>Besides, putting an *unbound, highly concurrent* aio queue into the
> >>kernel for an operation that can serialise the entire filesystem
> >>seems like a pretty nasty user-level DOS vector to me.
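For reference, the wrapper in question really is minimal - from memory,
vfs_fsync_range() in fs/sync.c is approximately this (paraphrased, not
a verbatim quote of the mainline code):

	int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
			    int datasync)
	{
		/*
		 * No locking, no journal credits, no throttling here:
		 * all serialisation is internal to the individual
		 * filesystem's ->fsync method.
		 */
		if (!file->f_op->fsync)
			return -EINVAL;
		return file->f_op->fsync(file, start, end, datasync);
	}

IOWs, the wrapper itself imposes no concurrency control at all, so a
generic aio fsync built on top of it punts the "10,000 concurrent
callers" problem straight down to every filesystem's ->fsync
implementation.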
> >
> >FWIW, the non-linear system CPU overhead of a fs_mark test I've been
> >running isn't anything related to XFS. The async fsync workqueue
> >results in several thousand worker threads dispatching IO
> >concurrently across 16 CPUs:
> >
> >$ ps -ef | grep kworker | wc -l
> >4693
> >$
> >
> >Profiles from 3.15 + xfs for-next + xfs aio_fsync show:
> >
> >-  51.33%  [kernel]  [k] percpu_ida_alloc
> >   - percpu_ida_alloc
> >      + 85.73% blk_mq_wait_for_tags
> >      + 14.23% blk_mq_get_tag
> >-  14.25%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >   - _raw_spin_unlock_irqrestore
> >      - 66.26% virtio_queue_rq
> >           __blk_mq_run_hw_queue
> >         - 99.65% blk_mq_run_hw_queue
> >            + 99.47% blk_mq_insert_requests
> >            + 0.53% blk_mq_insert_request
> >.....
> >-   7.91%  [kernel]  [k] _raw_spin_unlock_irq
> >   - _raw_spin_unlock_irq
> >      - 69.59% __schedule
> >         - 86.49% schedule
> >            + 47.72% percpu_ida_alloc
> >            + 21.75% worker_thread
> >            + 19.12% schedule_timeout
> >....
> >      + 18.06% blk_mq_make_request
> >
> >Runtime:
> >
> >real    4m1.243s
> >user    0m47.724s
> >sys     11m56.724s
> >
> >Most of the excessive CPU usage is coming from the blk-mq layer, and
> >XFS is barely showing up in the profiles at all - the IDA tag
> >allocator is burning 8 CPUs at about 60,000 write IOPS....
> >
> >I know that the tag allocator has been rewritten, so I tested
> >against a current Linus kernel with the XFS aio-fsync patch. The
> >results are all over the place - from several sequential runs of
> >the same test (removing the files in between so each test starts
> >from an empty fs):
> >
> >Wall time    sys time     IOPS    files/s
> >4m58.151s    11m12.648s   30,000  13,500
> >4m35.075s    12m45.900s   45,000  15,000
> >3m10.665s    11m15.804s   65,000  21,000
> >3m27.384s    11m54.723s   85,000  20,000
> >3m59.574s    11m12.012s   50,000  16,500
> >4m12.704s    12m15.720s   50,000  17,000
> >
> >The 3.15 based kernel was pretty consistent around the 4m10 mark,
> >generally only +/-10s in runtime and not much change in system time.
> >The files/s rate reported by fs_mark doesn't vary that much, either.
> >So the new tag allocator seems to be no better in terms of IO
> >dispatch scalability, yet adds significant variability to IO
> >performance.
> >
> >What I noticed is a massive jump in context switch overhead: from
> >around 250,000/s to over 800,000/s, and the CPU profiles show that
> >this comes from the new tag allocator:
> >
> >-  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >   - _raw_spin_unlock_irqrestore
> >      - 58.22% prepare_to_wait
> >           100.00% bt_get
> >              blk_mq_get_tag
> >              __blk_mq_alloc_request
> >              blk_mq_map_request
> >              blk_sq_make_request
> >              generic_make_request
> >      - 22.51% virtio_queue_rq
> >           __blk_mq_run_hw_queue
> >....
> >-  21.56%  [kernel]  [k] _raw_spin_unlock_irq
> >   - _raw_spin_unlock_irq
> >      - 58.73% __schedule
> >         - 53.42% io_schedule
> >              99.88% bt_get
> >                 blk_mq_get_tag
> >                 __blk_mq_alloc_request
> >                 blk_mq_map_request
> >                 blk_sq_make_request
> >                 generic_make_request
> >         - 35.58% schedule
> >            + 49.31% worker_thread
> >            + 32.45% schedule_timeout
> >            + 10.35% _xfs_log_force_lsn
> >            + 3.10% xlog_cil_force_lsn
> >....
> >
> >The new blk-mq tag allocator is hammering the waitqueues and that's
> >generating a large amount of lock contention. It looks like the new
> >allocator replaces the CPU burnt doing work in the IDA allocator
> >with the same amount of CPU burnt on extra context switch
> >overhead....
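To illustrate where those cycles go: the slow path of the new tag
allocator is the classic waitqueue loop. This is a paraphrased sketch
of bt_get() from block/blk-mq-tag.c - simplified and from memory, so
the helper names and details are approximate, not verbatim kernel
code:

	static int bt_get_sketch(struct blk_mq_bitmap_tags *bt,
				 struct blk_mq_hw_ctx *hctx,
				 unsigned int *last_tag)
	{
		struct bt_wait_state *bs;
		DEFINE_WAIT(wait);
		int tag;

		/* fast path: try to grab a free tag from the bitmap */
		tag = __bt_get(hctx, bt, last_tag);
		if (tag != -1)
			return tag;

		bs = bt_wait_ptr(bt, hctx);
		do {
			/*
			 * Takes the waitqueue spinlock with interrupts
			 * disabled - this is the prepare_to_wait() hot
			 * spot in the profile above.
			 */
			prepare_to_wait(&bs->wait, &wait,
					TASK_UNINTERRUPTIBLE);

			/* retry in case a tag was freed in the window */
			tag = __bt_get(hctx, bt, last_tag);
			if (tag != -1)
				break;

			/* no tag available: context switch until woken */
			io_schedule();
		} while (1);

		finish_wait(&bs->wait, &wait);
		return tag;
	}

With thousands of threads parked in that loop on a handful of wait
queues, every freed tag wakes a waiter that takes the waitqueue lock
again in prepare_to_wait(), usually loses the race to reallocate the
tag, and goes straight back to sleep. That's the prepare_to_wait/
io_schedule signature dominating the profiles.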
> >
> >Oh, OH. Now I understand!
> >
> ># echo 4 > /sys/block/vdc/queue/nr_requests
> >
> ><run workload>
> >
> >-  80.56%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >   - _raw_spin_unlock_irqrestore
> >      - 98.49% prepare_to_wait
> >           bt_get
> >           blk_mq_get_tag
> >           __blk_mq_alloc_request
> >           blk_mq_map_request
> >           blk_sq_make_request
> >           generic_make_request
> >         + submit_bio
> >      + 1.07% finish_wait
> >+  13.63%  [kernel]  [k] _raw_spin_unlock_irq
> >...
> >
> >It's context switch bound at 800,000 context switches/s, burning all
> >16 CPUs waking up and going to sleep and doing very little real
> >work. How little real work? About 3000 IOPS for 2MB/s of IO. That
> >amount of IO should only take a single digit CPU percentage of one
> >CPU.
> 
> With thousands of threads? I think not. Sanely submitted, 3000 IOPS,
> correct, I would agree with you.
> 
> >This seems like bad behaviour to have on a congested block device,
> >even a high performance one....
> 
> That is pretty much the suck. How do I reproduce this (eg what are
> you running, and what are the xfs aio fsync patches)?

http://oss.sgi.com/pipermail/xfs/2014-June/036773.html

> Even if
> dispatching thousands of threads to do IO is a bad idea (it very
> much is), handling it gracefully is a must. I haven't seen any bad
> behavior from the new allocator; it seems to be well behaved (for
> most normal cases, anyway). I'd like to take a stab at ensuring this
> works, too.
> 
> If you tell me exactly what you are running, I'll reproduce and get
> this fixed up tomorrow.

Test case - take fs_mark from:

git://oss.sgi.com/dgc/fs_mark

Apply the patch for aio fsync support:

http://oss.sgi.com/pipermail/xfs/2014-June/036774.html

Run this test:

$ time ./fs_mark -D 10000 -S5 -n 50000 -s 4096 -L 5 -A \
	-d /mnt/scratch/0  -d /mnt/scratch/1  -d /mnt/scratch/2 \
	-d /mnt/scratch/3  -d /mnt/scratch/4  -d /mnt/scratch/5 \
	-d /mnt/scratch/6  -d /mnt/scratch/7  -d /mnt/scratch/8 \
	-d /mnt/scratch/9  -d /mnt/scratch/10 -d /mnt/scratch/11 \
	-d /mnt/scratch/12 -d /mnt/scratch/13 -d /mnt/scratch/14 \
	-d /mnt/scratch/15

Drop the "-A" if you want to use normal fsync (but then you won't see
the problem). Use an XFS filesystem that has at least 32 AGs (I'm
using a sparse 500TB fs image on a virtio device).

I'm also using the mkfs options "-m crc=1,finobt=1", but to use the
finobt option you'll need an mkfs built from the xfsprogs git tree.
It shouldn't make any difference to the result, though.

I'm running on a VM with 16 CPUs and 16GB RAM, using fakenuma=4.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx