Hi Guys,

I found that a single AIO thread is migrated crazily by the scheduler, and
the migration period can be < 10ms. Test a) follows:

- run single job fio[1] for 30 seconds:
  ./xfs_complete 512

- observe fio IO thread migration via bcc trace[2]; the number of
  migrations can reach 5K ~ 10K in the above test, i.e. one migration
  every 3 ~ 6ms on average. In this test, CPU utilization is 30~40% on
  the CPU running the fio IO thread.

- after applying the debug patch[3] to queue the XFS completion work on
  another CPU (not the current CPU), the above crazy fio IO thread
  migration can't be observed.

A similar result can be observed in the following test b) too:

- set sched parameters:
	sysctl kernel.sched_min_granularity_ns=10000000
	sysctl kernel.sched_wakeup_granularity_ns=15000000

  which is usually done by 'tuned-adm profile network-throughput'

- run single job fio aio[1] for 30 seconds:
  ./xfs_complete 4k

- observe fio IO thread migration[2]; similar crazy migration can be
  observed too. In this test, CPU utilization is close to 100% on the
  CPU running the fio IO thread.

- the debug patch[3] still makes a big difference on this test wrt. fio
  IO thread migration.

For test b), I thought that load balance may be triggered when the single
fio IO thread takes ~100% of the CPU while XFS's queue_work() schedules
the WQ worker thread on the current CPU, since all other CPUs are idle.
When the fio IO thread is migrated to a new CPU, the same steps can be
repeated again.

But for test a), I have no idea why the fio IO thread is still migrated
so frequently, since the CPU isn't saturated at all.

IMO, it is normal for a user to saturate the aio thread, since this may
save context switches.

Guys, any idea on the crazy aio thread migration?

BTW, the tests are run on the latest linus tree (5.4-rc7) in a KVM guest,
and the fio test was created to simulate one real performance report,
which was proved to be caused by frequent aio submission thread migration.
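
For anyone reproducing the numbers, the migration count can be tallied
from saved trace output. The helper below is only a sketch: the function
name count_migrations and the log file are hypothetical, and it assumes
each event line contains the "<comm>/<pid> cpu <orig>-><dest>" text
produced by the format string in [2].

```shell
#!/bin/sh
# count_migrations COMM LOGFILE  (hypothetical helper, not part of bcc)
# Print how many sched_migrate_task events in LOGFILE belong to tasks
# named COMM. Assumes the message part of each line looks like
# "COMM/PID cpu ORIG->DEST", as in the trace format string from [2].
count_migrations() {
    # Anchor COMM at line start or after whitespace so that e.g. "fio"
    # does not match inside a longer task name.
    grep -c -E "(^|[[:space:]])$1/[0-9]+ cpu [0-9]+->[0-9]+" "$2"
}
```

Redirect the trace command's output to a file during the 30 second run,
then call "count_migrations fio <file>" once with and once without the
debug patch[3] to compare.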

[1] xfs_complete: one fio script for running single job overwrite aio on XFS

#!/bin/bash

# Block size is taken from the first argument, e.g. 512 or 4k
BS=$1
NJOBS=1
QD=128
DIR=/mnt/xfs
BATCH=1
VERIFY="sha3-512"

# Print the current scheduler granularity settings
sysctl kernel.sched_wakeup_granularity_ns
sysctl kernel.sched_min_granularity_ns

# Recreate the scsi_debug test device
rmmod scsi_debug;modprobe scsi_debug dev_size_mb=6144 ndelay=41000 dix=1 dif=2
DEV=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
DEV="/dev/"$DEV

mkfs.xfs -f $DEV
[ ! -d $DIR ] && mkdir -p $DIR
mount $DEV $DIR

fio --readwrite=randwrite --filesize=5g \
    --overwrite=1 \
    --filename=$DIR/fiofile \
    --runtime=30s --time_based \
    --ioengine=libaio --direct=1 --bs=$BS --iodepth=$QD \
    --iodepth_batch_submit=$BATCH \
    --iodepth_batch_complete_min=$BATCH \
    --numjobs=$NJOBS \
    --verify=$VERIFY \
    --name=/hana/fsperf/foo

umount $DEV
rmmod scsi_debug


[2] observe fio migration via bcc trace:

/usr/share/bcc/tools/trace -C -t 't:sched:sched_migrate_task "%s/%d cpu %d->%d", args->comm,args->pid,args->orig_cpu,args->dest_cpu' | grep fio


[3] test patch for queuing XFS completion on another CPU

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 1fc28c2da279..bdc007a57706 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -158,9 +158,14 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 			blk_wake_io_task(waiter);
 		} else if (dio->flags & IOMAP_DIO_WRITE) {
 			struct inode *inode = file_inode(dio->iocb->ki_filp);
+			unsigned cpu = cpumask_next(smp_processor_id(),
+					cpu_online_mask);
+
+			if (cpu >= nr_cpu_ids)
+				cpu = 0;
 
 			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
-			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
+			queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work);
 		} else {
 			iomap_dio_complete_work(&dio->aio.work);
 		}

Thanks,
Ming