On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote:
> On 1/18/18 7:32 PM, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
> >> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >>>> This is all very tiresome.
> >>>
> >>> Yes, this is tiresome. It is very annoying to me that others keep introducing so many regressions in such important parts of the kernel. It is also annoying to me that I get blamed if I report a regression instead of seeing that the regression gets fixed.
> >>
> >> I agree, it sucks that any change there introduces the regression. I'm fine with doing the delay insert again until a new patch is proven to be better.
> >
> > That way is still buggy, as I explained, since rerunning the queue before adding the request to hctx->dispatch_list isn't correct. Who can make sure the request is visible when __blk_mq_run_hw_queue() is called?
>
> That race basically doesn't exist for a 10ms gap.
>
> > Not to mention this way will cause a performance regression again.
>
> How so? It's _exactly_ the same as what you are proposing, except mine will potentially run the queue when it need not do so. But given that these are random 10ms queue kicks because we are screwed, it should not matter. The key point is that it should only happen if we have NO better options. If it's a frequently occurring event that we have to return BLK_STS_RESOURCE, then we need a way to register an event for when that condition clears. That event will then kick the necessary queue(s).

Please see queue_delayed_work_on(): hctx->run_work is shared by all scheduling, so once blk_mq_delay_run_hw_queue(100ms) returns, no new scheduling can make progress during that 100ms window.

> >> From the original topic of this email, we have conditions that can cause the driver to not be able to submit an IO. A set of those conditions can only happen if IO is in flight, and those cases we have covered just fine. Another set can potentially trigger without IO being in flight. These are cases where a non-device resource is unavailable at the time of submission. This might be the iommu running out of space, for instance, or it might be a memory allocation of some sort. For these cases, we don't get any notification when the shortage clears. All we can do is ensure that we restart operations at some point in the future. We're SOL at that point, but we have to ensure that we make forward progress.
> >
> > Right, it is a generic issue, not a DM-specific one; almost all drivers call kmalloc(GFP_ATOMIC) in the IO path.
>
> GFP_ATOMIC basically never fails, unless we are out of memory. The

I guess GFP_KERNEL may never fail, but GFP_ATOMIC failure is possible: it is mentioned[1] that there is code like the following in the mm allocation path, and OOM can happen too.

	if ((some randomly generated condition) && (request is atomic))
		return NULL;

[1] https://lwn.net/Articles/276731/

> exception is higher order allocations. If a driver has a higher order atomic allocation in its IO path, the device driver writer needs to be taken out behind the barn and shot. Simple as that. It will NEVER work well in a production environment. Witness the disaster that so many NIC driver writers have learned.
>
> This is NOT the case we care about here. It's resources that are more readily depleted because other devices are using them. If it's a high frequency or generally occurring event, then we simply must have a callback to restart the queue from that. The condition then becomes identical to device private starvation, the only difference being from where we restart the queue.
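Just to be sure we mean the same thing by a callback-driven restart: something along these lines, I guess? A completely untested sketch -- the my_dev/my_pool names and the notifier hook are made up for illustration; only blk_mq_run_hw_queues() and the BLK_STS_* return values are the real blk-mq interfaces.

#include <linux/blk-mq.h>

struct my_pool;					/* made-up shared (non-device) resource */
bool my_pool_try_get(struct my_pool *pool);
void my_dev_issue(struct request *rq);

struct my_dev {
	struct request_queue *q;
	struct my_pool *pool;
};

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
{
	struct my_dev *dev = hctx->queue->queuedata;

	/*
	 * Shared resource exhausted, and none of our own IO may be in
	 * flight, so a completion will never re-run this queue.  Back
	 * off and rely on the notifier below to kick us again.
	 */
	if (!my_pool_try_get(dev->pool))
		return BLK_STS_RESOURCE;

	my_dev_issue(bd->rq);
	return BLK_STS_OK;
}

/* Called by the resource owner once space frees up again. */
static void my_pool_available(void *data)
{
	struct my_dev *dev = data;

	blk_mq_run_hw_queues(dev->q, true);	/* async re-run */
}

With something like that in place, the blind delayed re-run is only the fallback for resources that genuinely give us no such notification.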
>
> > IMO, there is enough time for figuring out a generic solution before the 4.16 release.
>
> I would hope so, but the proposed solutions have not filled me with a lot of confidence in the end result so far.
>
> >> That last set of conditions better not be a common occurrence, since performance is down the toilet at that point. I don't want to introduce hot path code to rectify it. Have the driver return if that happens in a way that is DIFFERENT from needing a normal restart. The driver knows if this is a resource that will become available when IO completes on this device or not. If we get that return, we have a generic run-again delay.
> >
> > Now, most of the time neither NVMe nor SCSI returns BLK_STS_RESOURCE, and it should be only DM which returns BLK_STS_RESOURCE so often.
>
> Where does the dm STS_RESOURCE error usually come from - what exact resource are we running out of?

It is from blk_get_request() on the underlying queue; see multipath_clone_and_map().

Thanks,
Ming
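For anyone not familiar with that path, a stripped-down sketch of what happens there -- simplified from memory, not the exact upstream code (the real function also deals with path selection, pg_init and dying queues): dm-mpath clones the incoming request by allocating a request on the chosen underlying path's queue with GFP_ATOMIC, and when that queue has no free request the target asks for a requeue, which dm-rq reports back to blk-mq as BLK_STS_RESOURCE.

#include <linux/blkdev.h>
#include <linux/device-mapper.h>
#include <linux/err.h>

static int clone_and_map_sketch(struct request *rq,
				struct request_queue *path_q,
				struct request **clone_out)
{
	struct request *clone;

	/* dm-mpath allocates the clone on the underlying path's queue. */
	clone = blk_get_request(path_q, rq->cmd_flags | REQ_NOMERGE,
				GFP_ATOMIC);
	if (IS_ERR(clone)) {
		/*
		 * No free request on the underlying queue right now;
		 * dm-rq turns this requeue into BLK_STS_RESOURCE, which
		 * is what blk-mq then sees from dm_mq_queue_rq().
		 */
		return DM_MAPIO_REQUEUE;
	}

	*clone_out = clone;
	return DM_MAPIO_REMAPPED;
}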