On Tue, Dec 03, 2019 at 10:53:21AM +1100, Dave Chinner wrote:
> On Mon, Dec 02, 2019 at 02:45:42PM +0100, Vincent Guittot wrote:
> > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote:
> > > > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote:
> > > > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@xxxxxxxx> wrote:
> > > > > > --- a/fs/iomap/direct-io.c
> > > > > > +++ b/fs/iomap/direct-io.c
> > > > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct
> > > > > >  			WRITE_ONCE(dio->submit.waiter, NULL);
> > > > > >  			blk_wake_io_task(waiter);
> > > > > >  		} else if (dio->flags & IOMAP_DIO_WRITE) {
> > > > > > -			struct inode *inode = file_inode(dio->iocb->ki_filp);
> > > > > > -
> > > > > >  			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> > > > > > -			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> > > > > > +			schedule_work(&dio->aio.work);
> > > > >
> > > > > I'm not sure that this will make a real difference, because it ends up
> > > > > calling queue_work(system_wq, ...), and system_wq is bound as well, so
> > > > > the work will still be pinned to a CPU.
> > > > > Using system_unbound_wq should make a difference because it doesn't
> > > > > pin the work on a CPU:
> > > > > +			queue_work(system_unbound_wq, &dio->aio.work);
> > > >
> > > > Indeed. I just ran a quick test on my KVM guest, and it looks like the
> > > > following patch makes a difference:
> > > >
> > > > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > > > index 9329ced91f1d..2f4488b0ecec 100644
> > > > --- a/fs/direct-io.c
> > > > +++ b/fs/direct-io.c
> > > > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb)
> > > >  {
> > > >  	struct workqueue_struct *old;
> > > >  	struct workqueue_struct *wq = alloc_workqueue("dio/%s",
> > > > -						      WQ_MEM_RECLAIM, 0,
> > > > +						      WQ_MEM_RECLAIM |
> > > > +						      WQ_UNBOUND, 0,
> > > >  						      sb->s_id);
> > >
> > > That's not an answer to the user task migration issue.
> > >
> > > That is, all this patch does is trade user task migration when the
> > > CPU is busy for migrating all the queued work off the CPU so the
> > > user task does not get migrated. IOWs, this forces all the queued
> > > work to be migrated rather than the user task. IOWs, it does not
> > > address the issue we've exposed in the scheduler between tasks with
> > > competing CPU affinity scheduling requirements - it just hides the
> > > symptom.
> > >
> > > Maintaining CPU affinity across dispatch and completion work has
> > > been proven to be a significant performance win. Right throughout
> > > the IO stack we try to keep this submitter/completion affinity,
> > > and that's the whole point of using a bound wq in the first place:
> > > efficient delayed batch processing of work on the local CPU.
> >
> > Do you really want to target the same CPU? It looks like what you
> > really want is to target the same cache instead.
>
> Well, yes, ideally we want to target the same cache, but we can't do
> that with workqueues.
>
> However, the block layer already does that same-cache steering for
> its directed completions (see __blk_mq_complete_request()), so we
> are *already running in a "hot cache" CPU context* when we queue
> work. When we queue to the same CPU, we are simply maintaining the
> "cache-hot" context that we are already running in.

__blk_mq_complete_request() doesn't always complete the request on the
submission CPU; that only happens with a 1:1 queue mapping, or with an
N:1 mapping when nr_hw_queues < nr_nodes.
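For reference, the steering decision there is roughly the following (a
simplified sketch of that logic, not verbatim kernel code; the wrapper
name below is made up and the exact checks differ between kernel
versions):

/*
 * Simplified paraphrase of the completion-steering decision in the
 * multi-queue path of __blk_mq_complete_request(); illustrative only.
 */
static void sketch_steer_completion(struct request *rq)
{
	struct blk_mq_ctx *ctx = rq->mq_ctx;	/* sw ctx of the submission CPU */
	struct request_queue *q = rq->q;
	bool shared = false;
	int cpu;

	/* Same-CPU completion not requested: complete wherever the IRQ landed. */
	if (!test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags)) {
		q->mq_ops->complete(rq);
		return;
	}

	cpu = get_cpu();
	/* SAME_COMP only asks for the same CPU group: a shared cache is enough. */
	if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
		shared = cpus_share_cache(cpu, ctx->cpu);

	if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) {
		/* Bounce the completion to the submission CPU via an IPI. */
		rq->csd.func = __blk_mq_complete_request_remote;
		rq->csd.info = rq;
		rq->csd.flags = 0;
		smp_call_function_single_async(ctx->cpu, &rq->csd);
	} else {
		/* Same CPU, shared cache, or remote CPU offline: run it here. */
		q->mq_ops->complete(rq);
	}
	put_cpu();
}

So with the default flags, if the completion interrupt lands on a CPU
that merely shares a cache with the submission CPU, the request is
completed right there rather than being pushed back to the submitting
CPU itself.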
Also, the completion flag that is set by default is QUEUE_FLAG_SAME_COMP,
which only requires the completion CPU to share a cache with the
submission CPU:

#define QUEUE_FLAG_SAME_COMP	4	/* complete on same CPU-group */

Thanks,
Ming