On Tue, Dec 03, 2019 at 09:15:14PM +0800, Hillf Danton wrote:
> > IOWs, we are trying to ensure that we run the data IO completion on
> > the CPU that has that data hot in cache. When we are running
> > millions of IOs every second, this matters -a lot-. IRQ steering is
> > just a mechanism that is used to ensure completion processing hits
> > hot caches.
>
> Along the "CPU affinity" direction, a trade-off is made between CPU
> affinity and cache affinity before lb can bear the ca scheme.
> Completion works are queued in round robin on the CPUs that share
> cache with the submission CPU.
>
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -143,6 +143,42 @@ static inline void iomap_dio_set_error(s
>  	cmpxchg(&dio->error, 0, ret);
>  }
>
> +static DEFINE_PER_CPU(int, iomap_dio_bio_end_io_cnt);
> +static DEFINE_PER_CPU(int, iomap_dio_bio_end_io_cpu);
> +#define IOMAP_DIO_BIO_END_IO_BATCH 7
> +
> +static int iomap_dio_cpu_rr(void)
> +{
> +	int *io_cnt, *io_cpu;
> +	int cpu, this_cpu;
> +
> +	io_cnt = get_cpu_ptr(&iomap_dio_bio_end_io_cnt);
> +	io_cpu = this_cpu_ptr(&iomap_dio_bio_end_io_cpu);
> +	this_cpu = smp_processor_id();
> +
> +	if (!(*io_cnt & IOMAP_DIO_BIO_END_IO_BATCH)) {
> +		for (cpu = *io_cpu + 1; cpu < nr_cpu_id; cpu++)
> +			if (cpu == this_cpu ||
> +			    cpus_share_cache(cpu, this_cpu))
> +				goto update_cpu;
> +
> +		for (cpu = 0; cpu < *io_cpu; cpu++)
> +			if (cpu == this_cpu ||
> +			    cpus_share_cache(cpu, this_cpu))
> +				goto update_cpu;

Linear scans like this just don't scale. We can have thousands of
CPUs in a system and maybe only 8 cores that share a local cache.
And we can be completing millions of direct IO writes a second these
days. A linear scan of (thousands - 8) cpu ids every so often is
going to show up as long tail latency for the unfortunate IO that
has to scan those thousands of non-matching CPU IDs to find a
sibling, and we'll be doing that every handful of IOs that are
completed on every CPU.

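To put a rough number on that scan cost, here is a minimal userspace
sketch (not kernel code). It assumes a hypothetical machine with 4096
CPU ids whose cache domains are blocks of 8 consecutive ids;
share_cache() is only a stand-in for cpus_share_cache(), and NR_CPUS,
CACHE_DOMAIN_SIZE, next_sibling_scan() and next_sibling_direct() are
names made up for this illustration.

```c
#include <assert.h>

#define NR_CPUS           4096	/* assumed system size, illustration only */
#define CACHE_DOMAIN_SIZE 8	/* assumed number of CPUs sharing a cache */

/*
 * Hypothetical stand-in for cpus_share_cache(): CPUs are grouped into
 * blocks of CACHE_DOMAIN_SIZE consecutive ids.
 */
static int share_cache(int a, int b)
{
	return a / CACHE_DOMAIN_SIZE == b / CACHE_DOMAIN_SIZE;
}

/*
 * Same search order as the two loops in the patch: look above 'last',
 * then wrap around to 0. '*probes' counts the CPU ids examined.
 */
static int next_sibling_scan(int this_cpu, int last, unsigned long *probes)
{
	int cpu;

	for (cpu = last + 1; cpu < NR_CPUS; cpu++) {
		(*probes)++;
		if (share_cache(cpu, this_cpu))
			return cpu;
	}
	for (cpu = 0; cpu < last; cpu++) {
		(*probes)++;
		if (share_cache(cpu, this_cpu))
			return cpu;
	}
	return this_cpu;
}

/*
 * Constant-time alternative once the domain layout is known up front.
 * Only valid when 'last' is already inside this_cpu's cache domain.
 */
static int next_sibling_direct(int this_cpu, int last)
{
	int base = this_cpu - this_cpu % CACHE_DOMAIN_SIZE;

	assert(last >= base && last < base + CACHE_DOMAIN_SIZE);
	return base + (last - base + 1) % CACHE_DOMAIN_SIZE;
}
```

Under these assumptions, a call that has to wrap past the last sibling
in its domain examines NR_CPUS - CACHE_DOMAIN_SIZE + 1 ids before it
finds the next one, while the direct version does no scanning at all.
Real topologies are not guaranteed to be laid out in consecutive
blocks, so a kernel implementation would need a precomputed cpumask
rather than this arithmetic.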
> +
> +		cpu = this_cpu;
> +update_cpu:
> +		*io_cpu = cpu;
> +	}
> +
> +	(*io_cnt)++;
> +	cpu = *io_cpu;
> +	put_cpu_ptr(&iomap_dio_bio_end_io_cnt);
> +
> +	return cpu;
> +}
>
>  static void iomap_dio_bio_end_io(struct bio *bio)
>  {
>  	struct iomap_dio *dio = bio->bi_private;
> @@ -158,9 +194,10 @@ static void iomap_dio_bio_end_io(struct
>  			blk_wake_io_task(waiter);
>  	} else if (dio->flags & IOMAP_DIO_WRITE) {
>  		struct inode *inode = file_inode(dio->iocb->ki_filp);
> +		int cpu = iomap_dio_cpu_rr();

IMO, this sort of "limit work to sibling CPU cores" does not belong
in general code. We have *lots* of workqueues that need this
treatment, and it's not viable to add this sort of linear search
loop to every workqueue and place we queue work. Besides....

>
>  		INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> -		queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> +		queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work);

.... as I've stated before, this *does not solve the scheduler
problem*. All this does is move the problem to the target CPU
instead of seeing it on the local CPU.

If we really want to hack around the load balancer problems in this
way, then we need to add a new workqueue concurrency management type
with behaviour that lies between the default of bound and
WQ_UNBOUND.

WQ_UNBOUND limits scheduling to within a NUMA node - see
wq_update_unbound_numa() for how it sets up the cpumask attributes
it applies to its workers - but we need the work to be bound to
within the local cache domain rather than a NUMA node.

IOWs, set up the kworker task pool management structure with the
right attributes (e.g. cpu masks) to define the cache domains, add
all the hotplug code to make it work with CPU hotplug, then simply
apply those attributes to the kworker task that is selected to
execute the work. This allows the scheduler to migrate the kworker
away from the local run queue without interrupting the currently
scheduled task.
The cpumask the task is configured with limits the scheduler to
selecting the best CPU within the local cache domain, and we don't
have to bind work to CPUs to get CPU cache friendly work scheduling.
This also avoids the overhead of a per-queue_work_on() sibling CPU
calculation, and all that code wanting to use this functionality
needs to do is add a single flag at work queue init time (e.g.
WQ_CACHEBOUND).

IOWs, if the task migration behaviour cannot be easily fixed and so
we need work queue users to be more flexible about work placement,
then the solution needed here is "cpu cache local work queue
scheduling" implemented in the work queue infrastructure, not in
every workqueue user.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx