Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive

On Wed, May 06, 2020 at 04:07:27PM +0800, Ming Lei wrote:
> On Wed, May 06, 2020 at 08:28:03AM +0100, Will Deacon wrote:
> > On Wed, May 06, 2020 at 09:24:25AM +0800, Ming Lei wrote:
> > > On Tue, May 05, 2020 at 05:46:18PM +0200, Christoph Hellwig wrote:
> > > > On Thu, Apr 30, 2020 at 10:02:54PM +0800, Ming Lei wrote:
> > > > > BLK_MQ_S_INACTIVE is only set when the last cpu of this hctx is going
> > > > > offline, and blk_mq_hctx_notify_offline() is called from the cpu hotplug
> > > > > handler. So if any request of this hctx is submitted from somewhere, it
> > > > > has to be submitted from this last cpu. That is guaranteed by blk-mq's
> > > > > queue mapping.
> > > > > 
> > > > > In the case of direct issue, blk_mq_get_driver_tag() is basically run
> > > > > right after the request is allocated, which is why I mentioned that the
> > > > > chance of migration is very small.
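
For readers following along, the offline side described above looks roughly
like the sketch below: the hotplug callback marks the hctx inactive once its
last mapped CPU goes offline, then waits for requests that already own a tag
to drain. This is only an illustration of the design under discussion, not
the actual patch; the two helpers are made-up placeholders.

    /*
     * Illustrative sketch only, not the actual patch: mark the hctx
     * inactive when its last mapped CPU goes offline, then drain the
     * requests that already hold a driver tag.
     */
    static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
    {
            struct blk_mq_hw_ctx *hctx =
                    hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);

            /* Only the last online CPU mapped to this hctx matters. */
            if (!last_cpu_of_hctx_going_offline(hctx, cpu))     /* placeholder helper */
                    return 0;

            set_bit(BLK_MQ_S_INACTIVE, &hctx->state);

            /*
             * Pairs with the barrier on the submission side: publish the
             * INACTIVE bit before re-reading the tag bitmap.
             */
            smp_mb();

            drain_inflight_tags(hctx);                          /* placeholder helper */
            return 0;
    }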
> > > > 
> > > > "very small" does not cut it, it has to be zero.  And it seems the
> > > > new version still has this hack.
> > > 
> > > But smp_mb() is used for ordering the WRITE and READ, so it is correct.
> > > 
> > > barrier() is enough when process migration doesn't happen.
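
Concretely, the conditional barrier being defended is along the lines of the
sketch below: after the driver tag is acquired, re-check BLK_MQ_S_INACTIVE,
and only pay for a full smp_mb() when the task has migrated off the CPU the
request was allocated on. Again, a simplified illustration rather than the
patch itself; the function name is invented.

    /*
     * Simplified illustration, not the patch itself: order the tag WRITE
     * against the INACTIVE-bit READ.  barrier() suffices when we are
     * still on the allocation CPU; smp_mb() is needed after a migration.
     */
    static bool tag_still_usable(struct request *rq)            /* invented name */
    {
            if (rq->mq_ctx->cpu == raw_smp_processor_id())
                    barrier();      /* no migration: same-CPU ordering is enough */
            else
                    smp_mb();       /* migrated: pairs with smp_mb() in the offline path */

            /* hctx went inactive: give the tag back and let the request be
             * re-dispatched on a live hctx. */
            if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)))
                    return false;

            return true;
    }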
> > 
> > Without numbers I would just make the smp_mb() unconditional. Your
> > questionable optimisation trades that for a load of the CPU ID and a
> > conditional branch, which isn't obviously faster to me. It's also very
> 
> The CPU ID is just a percpu READ, and unlikely() has been used to
> optimize the conditional branch. And smp_mb() could cause a CPU stall, I
> guess, so it should be much slower than reading the CPU ID.

Percpu accesses aren't uniformly cheap across architectures.

> Let's look at the attached microbench [1]: the result shows that smp_mb() is
> 10+ times slower than smp_processor_id() with one conditional branch.

Nobody said anything about smp_mb() in a tight loop, so this is hardly
surprising. Throughput of barrier instructions will hit a ceiling fairly
quickly, but they don't have to cause stalls in general use. I would expect
the numbers to converge if you added some back-off to the loops (e.g.
ndelay() or something). But I was really hoping for some numbers from the
block layer itself, since that's what we actually care about.
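
To make the back-off point concrete: the attached microbench [1] is not
reproduced here, but the suggested change amounts to something like the loop
below, where the barrier is no longer issued back to back and its throughput
ceiling stops dominating the measurement. Loop shape, iteration count and
delay are assumptions.

    /*
     * Hypothetical benchmark loop with back-off added; not the attached
     * microbench.  Issuing smp_mb() back to back measures barrier
     * throughput, not what a barrier costs on a real I/O path.
     */
    static u64 bench_smp_mb_with_backoff(unsigned long iters)
    {
            u64 t0 = ktime_get_ns();
            unsigned long i;

            for (i = 0; i < iters; i++) {
                    smp_mb();
                    ndelay(100);    /* back off between barriers */
            }

            return ktime_get_ns() - t0;
    }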

> [    1.239951] test_foo: smp_mb 738701907 smp_id 62904315 result 0 overflow 5120
> 
> The microbench was run on a simple 8-core KVM guest, and the cpu is
> 'Model name:          Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz'.
> 
> The result is pretty stable across my 5 runs of VM boot.

Honestly, I get the impression that you're not particularly happy with me
putting in the effort to review your patches, so I'll leave it up to
Christoph as to whether he wants to predicate the concurrency design on
a hokey microbenchmark.

FWIW: I agree that the code should work as you have it in v10, I just think
it's unnecessarily complicated and fragile.

/me goes to review other things

Will


