Re: single aio thread is migrated crazily by scheduler

Ming Lei <ming.lei@xxxxxxxxxx> · Fri, 15 Nov 2019 09:08:24 +0800

Hi Dave,

On Fri, Nov 15, 2019 at 10:54:15AM +1100, Dave Chinner wrote:
> On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote:
> > Hi Guys,
> > 
> > It is found that single AIO thread is migrated crazely by scheduler, and
> > the migrate period can be < 10ms. Follows the test a):
> > 
> > 	- run single job fio[1] for 30 seconds:
> > 	./xfs_complete 512
> > 	
> > 	- observe fio io thread migration via bcc trace[2], and the migration
> > 	times can reach 5k ~ 10K in above test. In this test, CPU utilization
> > 	is 30~40% on the CPU running fio IO thread.
> 
> Using the default scheduler tunings:
> 
> kernel.sched_wakeup_granularity_ns = 4000000
> kernel.sched_min_granularity_ns = 3000000
> 
> I'm not seeing any migrations at all on a 16p x86-64 box. Even with
> the tunings you suggest:
> 
> 	sysctl kernel.sched_min_granularity_ns=10000000
> 	sysctl kernel.sched_wakeup_granularity_ns=15000000
> 
> There are no migrations at all.

Looks I forget to pass $BS to the fio command line in the script posted,
please try the following script again and run './xfs_complete 512' first.

[1] xfs_complete: one fio script for running single job overwrite aio on XFS
#!/bin/bash

BS=$1
NJOBS=1
QD=128
DIR=/mnt/xfs
BATCH=1
VERIFY="sha3-512"

sysctl kernel.sched_wakeup_granularity_ns
sysctl kernel.sched_min_granularity_ns

rmmod scsi_debug;modprobe scsi_debug dev_size_mb=6144 ndelay=41000 dix=1 dif=2
DEV=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
DEV="/dev/"$DEV

mkfs.xfs -f $DEV
[ ! -d $DIR ] && mkdir -p $DIR
mount $DEV $DIR

fio --readwrite=randwrite --filesize=5g \
    --overwrite=1 \
    --filename=$DIR/fiofile \
    --runtime=30s --time_based \
    --ioengine=libaio --direct=1 --bs=$BS --iodepth=$QD \
    --iodepth_batch_submit=$BATCH \
    --iodepth_batch_complete_min=$BATCH \
    --numjobs=$NJOBS \
    --verify=$VERIFY \
    --name=/hana/fsperf/foo

umount $DEV
rmmod scsi_debug

When running './xfs_complete 512', lots of fio migration can be observed
via bcc trace:

/usr/share/bcc/tools/trace -C -t  't:sched:sched_migrate_task "%s/%d cpu %d->%d", args->comm,args->pid,args->orig_cpu,args->dest_cpu' | grep fio

...
69.13495 2   23      23      kworker/2:0     sched_migrate_task b'fio'/866 cpu 2->0
69.13513 0   13      13      kworker/0:1     sched_migrate_task b'fio'/866 cpu 0->2
69.86733 2   23      23      kworker/2:0     sched_migrate_task b'fio'/866 cpu 2->3
70.20730 3   134     134     kworker/3:1     sched_migrate_task b'fio'/866 cpu 3->0
70.21131 0   13      13      kworker/0:1     sched_migrate_task b'fio'/866 cpu 0->1
70.21733 1   112     112     kworker/1:1     sched_migrate_task b'fio'/866 cpu 1->2
70.29528 2   23      23      kworker/2:0     sched_migrate_task b'fio'/866 cpu 2->0
70.29769 0   13      13      kworker/0:1     sched_migrate_task b'fio'/866 cpu 0->2
70.36332 2   23      23      kworker/2:0     sched_migrate_task b'fio'/866 cpu 2->3
70.41924 3   134     134     kworker/3:1     sched_migrate_task b'fio'/866 cpu 3->2
70.78572 2   23      23      kworker/2:0     sched_migrate_task b'fio'/866 cpu 2->0
70.79061 0   13      13      kworker/0:1     sched_migrate_task b'fio'/866 cpu 0->2
70.91526 2   23      23      kworker/2:0     sched_migrate_task b'fio'/866 cpu 2->0
70.91607 0   13      13      kworker/0:1     sched_migrate_task b'fio'/866 cpu 0->2
...

In my test just done, the migration count is 12K in 30s fio running.
Sometimes the number can be quite less, but most of times, the number
is very big(> 5k).

Also attaches my kernel config.

> During the overwrite phase of the test, I'm seeing 66% fio usage,
> 28% system time and of that, only 3% of the CPU time is in the IO
> completion workqueue time. It's all running on the one CPU, and it's
> running at about 15,000 context switches per second.. I'm not seeing
> any CPU migrations at all, despite the workload being completely CPU
> bound on a single CPU.

In my test VM, in case of 4K block size(./xfs_complete 4k), when the
following kernel parameters are passed in:

 	sysctl kernel.sched_min_granularity_ns=10000000
 	sysctl kernel.sched_wakeup_granularity_ns=15000000

'top -H' shows that fio consumes ~100% CPU, and the fio IO thread is
migrated very frequently.

> 
> This is the typical output from top:
> 
> Tasks: 262 total,   2 running, 260 sleeping,   0 stopped,   0 zombie
> %Cpu0  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu2  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu3  : 66.2 us, 27.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  6.0 si,  0.0 st
> %Cpu4  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu8  :  0.0 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
> %Cpu9  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu11 :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu12 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu13 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu14 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu15 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> MiB Mem :  16005.3 total,   8737.7 free,   6762.5 used,    505.2 buff/cache
> MiB Swap:    486.3 total,    486.3 free,      0.0 used.   8640.3 avail Mem 
> 
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    5554 root      20   0  672064  15356   1392 R  97.0   0.1   0:17.85 fio
>    1388 root      20   0       0      0      0 I   3.0   0.0   0:01.39 kworker/3:1-dio/sdb
>    5552 root      20   0  661364 275380 271992 S   0.7   1.7   0:15.88 fio
>    4891 dave      20   0   15572   5876   4600 S   0.3   0.0   0:00.56 sshd
>    4933 dave      20   0   11732   4276   3296 R   0.3   0.0   0:01.30 top
> 
> i.e. it runs entirely on CPU 3 for the whole 30s measurement period.
> 
> And the number of migrate task events:
> 
> $ sudo trace-cmd start -e sched_migrate_task
> $ sudo ./run_test.sh
> .....
> $ sudo trace-cmd show |grep fio
>      kworker/1:1-1252  [001] d..2  2002.792659: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=5 dest_cpu=1
>      rcu_preempt-11    [011] d..2  2004.202624: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=1 dest_cpu=11
>      rcu_preempt-11    [006] d..2  2008.364481: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=11 dest_cpu=6
>           <idle>-0     [007] dNs2  2009.209480: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=6 dest_cpu=7
> $
> 
> And top tells me these fio processes are consuming CPU:
> 
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                  
>    5695 root      20   0  663872   7120   1324 R  97.7   0.0   0:03.48 fio                                                                                      
>    5689 root      20   0  661356 275512 272128 S   0.3   1.7   0:15.46 fio                                                                                      
> 
> So, the fio process that got migrated 4 times in 30s is the fio
> process that isn't doing any of the work.
> 
> There's no migration going on here at all on a vanilla kernel on
> a Debian userspace, with or without the suggested scheduler
> tunings.
> 
> > 	- after applying the debug patch[3] to queue XFS completion work on
> > 	other CPU(not current CPU), the above crazy fio IO thread migration
> > 	can't be observed.
> >
> > IMO, it is normal for user to saturate aio thread, since this way may
> > save context switch.
> > 
> > Guys, any idea on the crazy aio thread migration?
> 
> It doesn't happen on x86-64 with a current TOT vanilla kernel and
> a debian userspace.
> 
> What userspace are you testing with?

Fedora 29.

> 
> > BTW, the tests are run on latest linus tree(5.4-rc7) in KVM guest, and the
> > fio test is created for simulating one real performance report which is
> > proved to be caused by frequent aio submission thread migration.
> 
> What is the underlying hardware? I'm running in a 16p KVM guest on a
> 16p/32t x86-64 using 5.4-rc7, and I don't observe any significant
> CPU migration occurring at all from your test workload.

It is a KVM guest, which is running on my Lenova T460p Fedora 29 laptop,
and the host kernel is 5.2.18-100.fc29.x86_64, follows the guest info:

[root@ktest-01 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
Stepping:            3
CPU MHz:             2712.000
BogoMIPS:            5424.00
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmxp
[root@ktest-01 ~]# uname -a
Linux ktest-01 5.4.0-rc5-00367-gac8d20c84c47 #1532 SMP Thu Nov 14 19:13:25 CST 2019 x86_64 x86_64 x86_64x

> 
> > [3] test patch for queuing xfs completetion on other CPU
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 1fc28c2da279..bdc007a57706 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -158,9 +158,14 @@ static void iomap_dio_bio_end_io(struct bio *bio)
> >  			blk_wake_io_task(waiter);
> >  		} else if (dio->flags & IOMAP_DIO_WRITE) {
> >  			struct inode *inode = file_inode(dio->iocb->ki_filp);
> > +			unsigned cpu = cpumask_next(smp_processor_id(),
> > +					cpu_online_mask);
> > +
> > +			if (cpu >= nr_cpu_ids)
> > +				cpu = 0;
> >  
> >  			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> > -			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> > +			queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work);
> >  		} else {
> >  			iomap_dio_complete_work(&dio->aio.work);
> >  		}
> 
> So, can you test the patch I sent you a while back that avoided
> using the completion workqueue for pure overwrites? Does that make
> your whacky scheduler behaviour go away, or does it still reschedule
> like crazy?

Last time, I found that your patch V3 doesn't work as expected since the trace
message of 'overwrite completion' isn't observed when running the external
test case. Seems not see your further response after I reported it to you.

Will try this patch again.

> 
> I haven't posted that patch to the list because I can't prove it
> makes any difference to performance in any workload on any hardware
> I've tested it on. I just tested it on your workload, and it makes
> no different to behaviour or performance. It's definitely hitting
> the non-workqueue completion path:
> 
> $ sudo trace-cmd show | grep overwrite |head -1
>              fio-4526  [009] ..s.   180.361089: iomap_dio_bio_end_io: overwrite completion
> $ sudo trace-cmd show | grep overwrite |wc -l
> 51295
> $
> 
> so it's pretty clear that whatever is going on is not obviously a
> problem with workqueues or the way iomap does completions. Patch is
> attached below for you to test. You will need to pull commit
> 7684e2c4384d ("iomap: iomap that extends beyond EOF should be marked
> dirty") from the iomap for-next branch to ensure this patch
> functions correctly.
> 
> Cheers,
> 
> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
> iomap: don't defer completion of pure AIO+DIO overwrites
> 
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> We have a workload issuing 4k sequential AIO+DIOs in batches (yeah,
> really silly thing to do when you could just emit a single large
> IO) that is showing some interesting performance anomalies. i.e.
> massive performance regressions that go away when the workload is
> straced because it changes AIO completion scheduling and submission
> batch sizes....
> 
> Older kernels that don't have the iomap DIO code don't show the same
> regression, and they complete the AIO-DIO in the the bio completion
> context rather than deferring it to a workqueue like iomap does.
> 
> Hence, for pure overwrites that don't require any metadata updates
> on IO completion, including file size, call the IO completion
> directly rather than deferring it to the completion workqueue. THe
> IOMAP_DIO_OVERWRITE flag is added to tell ->end_io implementations
> that they are running in bio completion context and that they should
> not have any update work to do. This leverages the iomap FUA
> optimisations that allow FUA writes to be issued when no metadata
> updates are required to be synced to disk during IO completion.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/iomap/direct-io.c  | 26 ++++++++++++++++++++------
>  fs/xfs/xfs_file.c     |  3 +++
>  include/linux/iomap.h |  8 ++++++++
>  3 files changed, 31 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 49bf9780e3ed..a89b29306794 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -159,8 +159,14 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>  		} else if (dio->flags & IOMAP_DIO_WRITE) {
>  			struct inode *inode = file_inode(dio->iocb->ki_filp);
>  
> -			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> -			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> +			if ((dio->flags & IOMAP_DIO_OVERWRITE) &&
> +			     !inode->i_mapping->nrpages) {
> +				trace_printk("overwrite completion\n");
> +				iomap_dio_complete_work(&dio->aio.work);
> +			} else {
> +				INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> +				queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> +			}
>  		} else {
>  			iomap_dio_complete_work(&dio->aio.work);
>  		}
> @@ -229,10 +235,12 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		 * the underlying device supports FUA. This allows us to avoid
>  		 * cache flushes on IO completion.
>  		 */
> -		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
> -		    (dio->flags & IOMAP_DIO_WRITE_FUA) &&
> -		    blk_queue_fua(bdev_get_queue(iomap->bdev)))
> -			use_fua = true;
> +		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY))) {
> +			dio->flags |= IOMAP_DIO_OVERWRITE;
> +			if ((dio->flags & IOMAP_DIO_WRITE_FUA) &&
> +			    blk_queue_fua(bdev_get_queue(iomap->bdev)))
> +				use_fua = true;
> +		}
>  	}
>  
>  	/*
> @@ -511,9 +519,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	/*
>  	 * If all the writes we issued were FUA, we don't need to flush the
>  	 * cache on IO completion. Clear the sync flag for this case.
> +	 *
> +	 * If we are doing an overwrite and need to sync and FUA cannot be used,
> +	 * clear the overwrite flag to ensure the completion is run via the
> +	 * workqueue rather than directly.
>  	 */
>  	if (dio->flags & IOMAP_DIO_WRITE_FUA)
>  		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
> +	if (dio->flags & IOMAP_DIO_NEED_SYNC)
> +		dio->flags &= ~IOMAP_DIO_OVERWRITE;
>  
>  	WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie);
>  	WRITE_ONCE(iocb->private, dio->submit.last_queue);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 525b29b99116..794dea2f1dc3 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -395,6 +395,9 @@ xfs_dio_write_end_io(
>  	 */
>  	XFS_STATS_ADD(ip->i_mount, xs_write_bytes, size);
>  
> +	if (flags & IOMAP_DIO_OVERWRITE)
> +		return 0;
> +
>  	/*
>  	 * We can allocate memory here while doing writeback on behalf of
>  	 * memory reclaim.  To avoid memory allocation deadlocks set the
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 8b09463dae0d..009c5969a2ef 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -249,6 +249,14 @@ int iomap_writepages(struct address_space *mapping,
>  #define IOMAP_DIO_UNWRITTEN	(1 << 0)	/* covers unwritten extent(s) */
>  #define IOMAP_DIO_COW		(1 << 1)	/* covers COW extent(s) */
>  
> +/*
> + * IOMAP_DIO_OVERWRITE indicates a pure overwrite that requires no metadata
> + * updates on completion. It also indicates that the completion is running in
> + * the hardware IO completion context (e.g. softirq) rather than on a schedules
> + * workqueue.
> + */
> +#define IOMAP_DIO_OVERWRITE	(1 << 2)
> +
>  struct iomap_dio_ops {
>  	int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
>  		      unsigned flags);
> 

Just run a quick test several times after applying the above patch, and looks it
does make a big difference in test './xfs_complete 512' wrt. fio io thread migration.

thanks,
Ming
Attachment:
config.gz

Description: application/gzip