On 02/28/2017 05:57 PM, Keith Busch wrote:
> On Tue, Feb 28, 2017 at 08:42:19AM +0100, Artur Paszkiewicz wrote:
>>
>> I'm observing the same thing when hibernating during mdraid resync on
>> nvme - it hangs in blk_mq_freeze_queue_wait() after "Disabling non-boot
>> CPUs ...".
>
> The patch guarantees forward progress for blk-mq's hot-cpu notifier on
> nvme request queues by failing all entered requests. It sounds like some
> part of your setup needs those requests to succeed in order to hibernate.
>
> If your mdraid uses a stacking request_queue that submits retries while
> it's request queue is entered, that may explain how you remain stuck
> at blk_mq_freeze_queue_wait.
>
>> This patch did not help but when I put nvme_wait_freeze()
>> right after nvme_start_freeze() it appeared to be working. Maybe the
>> difference here is that requests are submitted from a non-freezable
>> kernel thread (md sync_thread)?
>
> Wait freeze prior to quiescing the queue is ok when the controller is
> functioning, but it'd be impossible to complete a reset if the controller
> is in a failed or degraded state.
>
> We probably want to give those requests a chance to succeed, and I think
> we'd need to be able to timeout the freeze wait. Below are two patches
> I tested. Prior to these, the fio test would report IO errors from some
> of its jobs; no errors with these.

With these patches it works fine. I tested multiple iterations on 2
platforms and they were able to hibernate and resume without issues.

Thanks,
Artur