On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote: > Hi Guys, > > It is found that single AIO thread is migrated crazely by scheduler, and > the migrate period can be < 10ms. Follows the test a): > > - run single job fio[1] for 30 seconds: > ./xfs_complete 512 > > - observe fio io thread migration via bcc trace[2], and the migration > times can reach 5k ~ 10K in above test. In this test, CPU utilization > is 30~40% on the CPU running fio IO thread. Using the default scheduler tunings: kernel.sched_wakeup_granularity_ns = 4000000 kernel.sched_min_granularity_ns = 3000000 I'm not seeing any migrations at all on a 16p x86-64 box. Even with the tunings you suggest: sysctl kernel.sched_min_granularity_ns=10000000 sysctl kernel.sched_wakeup_granularity_ns=15000000 There are no migrations at all. During the overwrite phase of the test, I'm seeing 66% fio usage, 28% system time and of that, only 3% of the CPU time is in the IO completion workqueue time. It's all running on the one CPU, and it's running at about 15,000 context switches per second.. I'm not seeing any CPU migrations at all, despite the workload being completely CPU bound on a single CPU. This is the typical output from top: Tasks: 262 total, 2 running, 260 sleeping, 0 stopped, 0 zombie %Cpu0 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 66.2 us, 27.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 6.0 si, 0.0 st %Cpu4 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu8 : 0.0 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu9 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu10 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu11 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu12 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu13 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu14 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu15 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 16005.3 total, 8737.7 free, 6762.5 used, 505.2 buff/cache MiB Swap: 486.3 total, 486.3 free, 0.0 used. 8640.3 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5554 root 20 0 672064 15356 1392 R 97.0 0.1 0:17.85 fio 1388 root 20 0 0 0 0 I 3.0 0.0 0:01.39 kworker/3:1-dio/sdb 5552 root 20 0 661364 275380 271992 S 0.7 1.7 0:15.88 fio 4891 dave 20 0 15572 5876 4600 S 0.3 0.0 0:00.56 sshd 4933 dave 20 0 11732 4276 3296 R 0.3 0.0 0:01.30 top i.e. it runs entirely on CPU 3 for the whole 30s measurement period. And the number of migrate task events: $ sudo trace-cmd start -e sched_migrate_task $ sudo ./run_test.sh ..... $ sudo trace-cmd show |grep fio kworker/1:1-1252 [001] d..2 2002.792659: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=5 dest_cpu=1 rcu_preempt-11 [011] d..2 2004.202624: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=1 dest_cpu=11 rcu_preempt-11 [006] d..2 2008.364481: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=11 dest_cpu=6 <idle>-0 [007] dNs2 2009.209480: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=6 dest_cpu=7 $ And top tells me these fio processes are consuming CPU: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5695 root 20 0 663872 7120 1324 R 97.7 0.0 0:03.48 fio 5689 root 20 0 661356 275512 272128 S 0.3 1.7 0:15.46 fio So, the fio process that got migrated 4 times in 30s is the fio process that isn't doing any of the work. There's no migration going on here at all on a vanilla kernel on a Debian userspace, with or without the suggested scheduler tunings. > - after applying the debug patch[3] to queue XFS completion work on > other CPU(not current CPU), the above crazy fio IO thread migration > can't be observed. > > IMO, it is normal for user to saturate aio thread, since this way may > save context switch. > > Guys, any idea on the crazy aio thread migration? It doesn't happen on x86-64 with a current TOT vanilla kernel and a debian userspace. What userspace are you testing with? > BTW, the tests are run on latest linus tree(5.4-rc7) in KVM guest, and the > fio test is created for simulating one real performance report which is > proved to be caused by frequent aio submission thread migration. What is the underlying hardware? I'm running in a 16p KVM guest on a 16p/32t x86-64 using 5.4-rc7, and I don't observe any significant CPU migration occurring at all from your test workload. > [3] test patch for queuing xfs completetion on other CPU > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index 1fc28c2da279..bdc007a57706 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -158,9 +158,14 @@ static void iomap_dio_bio_end_io(struct bio *bio) > blk_wake_io_task(waiter); > } else if (dio->flags & IOMAP_DIO_WRITE) { > struct inode *inode = file_inode(dio->iocb->ki_filp); > + unsigned cpu = cpumask_next(smp_processor_id(), > + cpu_online_mask); > + > + if (cpu >= nr_cpu_ids) > + cpu = 0; > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work); > } else { > iomap_dio_complete_work(&dio->aio.work); > } So, can you test the patch I sent you a while back that avoided using the completion workqueue for pure overwrites? Does that make your whacky scheduler behaviour go away, or does it still reschedule like crazy? I haven't posted that patch to the list because I can't prove it makes any difference to performance in any workload on any hardware I've tested it on. I just tested it on your workload, and it makes no different to behaviour or performance. It's definitely hitting the non-workqueue completion path: $ sudo trace-cmd show | grep overwrite |head -1 fio-4526 [009] ..s. 180.361089: iomap_dio_bio_end_io: overwrite completion $ sudo trace-cmd show | grep overwrite |wc -l 51295 $ so it's pretty clear that whatever is going on is not obviously a problem with workqueues or the way iomap does completions. Patch is attached below for you to test. You will need to pull commit 7684e2c4384d ("iomap: iomap that extends beyond EOF should be marked dirty") from the iomap for-next branch to ensure this patch functions correctly. Cheers, -Dave. -- Dave Chinner david@xxxxxxxxxxxxx iomap: don't defer completion of pure AIO+DIO overwrites From: Dave Chinner <dchinner@xxxxxxxxxx> We have a workload issuing 4k sequential AIO+DIOs in batches (yeah, really silly thing to do when you could just emit a single large IO) that is showing some interesting performance anomalies. i.e. massive performance regressions that go away when the workload is straced because it changes AIO completion scheduling and submission batch sizes.... Older kernels that don't have the iomap DIO code don't show the same regression, and they complete the AIO-DIO in the the bio completion context rather than deferring it to a workqueue like iomap does. Hence, for pure overwrites that don't require any metadata updates on IO completion, including file size, call the IO completion directly rather than deferring it to the completion workqueue. THe IOMAP_DIO_OVERWRITE flag is added to tell ->end_io implementations that they are running in bio completion context and that they should not have any update work to do. This leverages the iomap FUA optimisations that allow FUA writes to be issued when no metadata updates are required to be synced to disk during IO completion. Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> --- fs/iomap/direct-io.c | 26 ++++++++++++++++++++------ fs/xfs/xfs_file.c | 3 +++ include/linux/iomap.h | 8 ++++++++ 3 files changed, 31 insertions(+), 6 deletions(-) diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index 49bf9780e3ed..a89b29306794 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -159,8 +159,14 @@ static void iomap_dio_bio_end_io(struct bio *bio) } else if (dio->flags & IOMAP_DIO_WRITE) { struct inode *inode = file_inode(dio->iocb->ki_filp); - INIT_WORK(&dio->aio.work, iomap_dio_complete_work); - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); + if ((dio->flags & IOMAP_DIO_OVERWRITE) && + !inode->i_mapping->nrpages) { + trace_printk("overwrite completion\n"); + iomap_dio_complete_work(&dio->aio.work); + } else { + INIT_WORK(&dio->aio.work, iomap_dio_complete_work); + queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); + } } else { iomap_dio_complete_work(&dio->aio.work); } @@ -229,10 +235,12 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, * the underlying device supports FUA. This allows us to avoid * cache flushes on IO completion. */ - if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) && - (dio->flags & IOMAP_DIO_WRITE_FUA) && - blk_queue_fua(bdev_get_queue(iomap->bdev))) - use_fua = true; + if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY))) { + dio->flags |= IOMAP_DIO_OVERWRITE; + if ((dio->flags & IOMAP_DIO_WRITE_FUA) && + blk_queue_fua(bdev_get_queue(iomap->bdev))) + use_fua = true; + } } /* @@ -511,9 +519,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, /* * If all the writes we issued were FUA, we don't need to flush the * cache on IO completion. Clear the sync flag for this case. + * + * If we are doing an overwrite and need to sync and FUA cannot be used, + * clear the overwrite flag to ensure the completion is run via the + * workqueue rather than directly. */ if (dio->flags & IOMAP_DIO_WRITE_FUA) dio->flags &= ~IOMAP_DIO_NEED_SYNC; + if (dio->flags & IOMAP_DIO_NEED_SYNC) + dio->flags &= ~IOMAP_DIO_OVERWRITE; WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie); WRITE_ONCE(iocb->private, dio->submit.last_queue); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 525b29b99116..794dea2f1dc3 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -395,6 +395,9 @@ xfs_dio_write_end_io( */ XFS_STATS_ADD(ip->i_mount, xs_write_bytes, size); + if (flags & IOMAP_DIO_OVERWRITE) + return 0; + /* * We can allocate memory here while doing writeback on behalf of * memory reclaim. To avoid memory allocation deadlocks set the diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 8b09463dae0d..009c5969a2ef 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -249,6 +249,14 @@ int iomap_writepages(struct address_space *mapping, #define IOMAP_DIO_UNWRITTEN (1 << 0) /* covers unwritten extent(s) */ #define IOMAP_DIO_COW (1 << 1) /* covers COW extent(s) */ +/* + * IOMAP_DIO_OVERWRITE indicates a pure overwrite that requires no metadata + * updates on completion. It also indicates that the completion is running in + * the hardware IO completion context (e.g. softirq) rather than on a schedules + * workqueue. + */ +#define IOMAP_DIO_OVERWRITE (1 << 2) + struct iomap_dio_ops { int (*end_io)(struct kiocb *iocb, ssize_t size, int error, unsigned flags);