On Tue, Apr 11, 2017 at 10:12:49AM +0800, Ming Lei wrote:

> On Fri, Apr 07, 2017 at 08:45:41AM -0600, Jens Axboe wrote:
> > On 04/06/2017 09:23 PM, Ming Lei wrote:
> > > Hi Jens,
> > >
> > > Thanks for your comment!
> > >
> > > On Thu, Apr 06, 2017 at 01:29:26PM -0600, Jens Axboe wrote:
> > >> On 04/06/2017 02:23 AM, Ming Lei wrote:
> > >>> On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
> > >>>> On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
> > >>>>> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
> > >>>>>> From: Omar Sandoval <osandov@xxxxxx>
> > >>>>>>
> > >>>>>> While dispatching requests, if we fail to get a driver tag, we mark the
> > >>>>>> hardware queue as waiting for a tag and put the requests on a
> > >>>>>> hctx->dispatch list to be run later when a driver tag is freed. However,
> > >>>>>> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> > >>>>>> queues if using a single-queue scheduler with a multiqueue device. If
> > >>>>>
> > >>>>> It can't perform well using an SQ scheduler on an MQ device, so I'm
> > >>>>> just curious why someone would want to do it this way, :-)
> > >>>>
> > >>>> I don't know why anyone would want to, but it has to work :) The only
> > >>>> reason we noticed this is because when the NBD device is created, it
> > >>>> only has a single queue, so we automatically assign mq-deadline to it.
> > >>>> Later, we update the number of queues, but it's still using mq-deadline.
> > >>>>
> > >>>>> I guess you mean that ops.mq.dispatch_request() may dispatch requests
> > >>>>> from other hardware queues in blk_mq_sched_dispatch_requests() instead
> > >>>>> of the current hctx.
> > >>>>
> > >>>> Yup, that's right. It's weird, and I talked to Jens about just forcing
> > >>>> the MQ device into an SQ mode when using an SQ scheduler, but this way
> > >>>> works fine more or less.
> > >>>
> > >>> Or just switch the elevator to the MQ default one when the device becomes
> > >>> MQ? Or let mq-deadline's .dispatch_request() just return reqs in the
> > >>> current hctx?
> > >>
> > >> No, that would be a really bad idea imho. First of all, I don't want
> > >> kernel driven scheduler changes. Secondly, the framework should work
> > >> with a non-direct mapping between hardware dispatch queues and
> > >> scheduling queues.
> > >>
> > >> While we could force a single queue usage to make that a 1:1 mapping
> > >> always, that loses big benefits on eg nbd, which uses multiple hardware
> > >> queues to up the bandwidth. Similarly on nvme, for example, we still
> > >> scale better with N submission queues and 1 scheduling queue compared to
> > >> having just 1 submission queue.
> > >
> > > That isn't what I meant. My 2nd point is to make mq-deadline's
> > > dispatch_request(hctx) just return requests mapped to the hw queue of
> > > 'hctx'; then we can avoid messing up blk-mq.c and blk-mq-sched.c.
> >
> > That would indeed be better. But let's assume that we have 4 hardware
> > queues, and we're asked to run queue X. But the FIFO dictates that the
> > first request that should run is on queue Y, since it's expired. What do
> > we do? Then we'd have to abort dispatching on queue X, and run queue Y
> > appropriately.
>
> The previous rule is to run queue Y on the CPUs (sw queues) mapped to
> queue Y, and that should still work by following this rule. But this
> patch is starting to break that rule, :-(
> >
> > This shuffle can happen all the time.
> > I think it'd be worthwhile to work towards the goal of improving
> > mq-deadline to deal with this, and subsequently cleaning up the
> > interface. It would be a cleaner implementation, though efficiency
> > might suffer further.
> >
> > >>>>> If that is true, it looks like an issue in the usage of the I/O
> > >>>>> scheduler, since the mq-deadline scheduler just queues requests in
> > >>>>> one per-request_queue linked list; for an MQ device, the scheduler
> > >>>>> queue should have been per-hctx.
> > >>>>
> > >>>> That's an option, but that's a different scheduling policy. Again, I
> > >>>> agree that it's strange, but it's reasonable behavior.
> > >>>
> > >>> IMO, the current mq-deadline isn't good/ready for MQ devices, and it
> > >>> doesn't make sense to use it for MQ.
> > >>
> > >> I don't think that's true at all. I do agree that it's somewhat quirky
> > >> since it does introduce scheduling dependencies between the hardware
> > >> queues, and we have to work at making that well understood and explicit,
> > >> so as not to introduce bugs due to that. But in reality, all multiqueue
> > >> hardware we are dealing with is mapped to a single resource. As such,
> > >> it makes a lot of sense to schedule it as such. Hence I don't think that
> > >> a single queue deadline approach is necessarily a bad idea for even fast
> > >> storage.
> > >
> > > When we map all MQ hw queues into one single deadline queue, it can't
> > > scale well. For example, I just ran a simple write test (libaio, dio,
> > > bs:4k, 4jobs) over one commodity NVMe device in a dual-socket (four
> > > cores) system:
> > >
> > > IO scheduler | CPU utilization
> > > -------------------------------
> > > none         | 60%
> > > -------------------------------
> > > mq-deadline  | 100%
> > > -------------------------------
> > >
> > > And IO throughput is basically the same in the two cases.
> >
> > That's a given; for low blocksize and high iops, we'll be hitting the
> > deadline lock pretty hard. We dispatch single requests at a time, to
> > optimize for rotational storage, since it improves merging a lot. If we
> > modified e->type->ops.mq.dispatch_request() to take a "dump this many
> > requests" parameter and ramped up the depth quickly, then we could
> > drastically reduce the overhead for your case. The initial version
> > dumped the whole thing. It had lower overhead on flash, but didn't reach
> > the performance of old deadline on !mq since we continually emptied the
> > queue and eliminated merging opportunities.
> >
> > blk_mq_sched_dispatch_requests() could add logic that brought back the
> > old behavior if NONROT is set.
>
> From my further testing, it looks like the issue is related to the single
> mq-deadline queue (including the single lock), which needs to be accessed
> from different NUMA nodes.
>
> The following tree implements a per-hctx mq-deadline queue, and basically
> makes q->last_merge, elevator_queue->hash and 'struct deadline_data'
> per-hctx. With these patches, the issue disappeared.

Sorry for missing the link:

https://github.com/ming1/linux/commits/my_v4.11-rcX

Thanks,
Ming
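
P.S. For anyone who wants to reproduce the CPU utilization comparison
quoted above, a fio invocation along these lines should approximate the
described workload. Only "libaio, dio, bs:4k, 4jobs on a commodity NVMe"
comes from the thread; the device path, access pattern, queue depth and
runtime below are guesses, so adjust them to your setup:

    # hypothetical reproduction of the quoted write test
    # WARNING: writes directly to the raw device and destroys its data
    fio --name=sched-cpu-test --filename=/dev/nvme0n1 \
        --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
        --numjobs=4 --iodepth=32 --runtime=60 --time_based \
        --group_reporting

    # switch the I/O scheduler between runs (as root) and compare CPU
    # utilization at similar throughput
    echo none > /sys/block/nvme0n1/queue/scheduler
    echo mq-deadline > /sys/block/nvme0n1/queue/scheduler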