On Tue, May 19, 2020 at 11:27:50PM +0000, Anchal Agarwal wrote: > From: Munehisa Kamata <kamatam@xxxxxxxxxx> > > S4 power transition states are much different than xen > suspend/resume. Former is visible to the guest and frontend drivers should > be aware of the state transitions and should be able to take appropriate > actions when needed. In transition to S4 we need to make sure that at least > all the in-flight blkif requests get completed, since they probably contain > bits of the guest's memory image and that's not going to get saved any > other way. Hence, re-issuing of in-flight requests as in case of xen resume > will not work here. This is in contrast to xen-suspend where we need to > freeze with as little processing as possible to avoid dirtying RAM late in > the migration cycle and we know that in-flight data can wait. > > Add freeze, thaw and restore callbacks for PM suspend and hibernation > support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND > events, need to implement these xenbus_driver callbacks. The freeze handler > stops block-layer queue and disconnect the frontend from the backend while > freeing ring_info and associated resources. Before disconnecting from the > backend, we need to prevent any new IO from being queued and wait for existing > IO to complete. Freeze/unfreeze of the queues will guarantee that there are no > requests in use on the shared ring. However, for sanity we should check > state of the ring before disconnecting to make sure that there are no > outstanding requests to be processed on the ring. The restore handler > re-allocates ring_info, unquiesces and unfreezes the queue and re-connect to > the backend, so that rest of the kernel can continue to use the block device > transparently. > > Note:For older backends,if a backend doesn't have commit'12ea729645ace' > xen/blkback: unmap all persistent grants when frontend gets disconnected, > the frontend may see massive amount of grant table warning when freeing > resources. > [ 36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff) > [ 36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use! > > In this case, persistent grants would need to be disabled. > > [Anchal Changelog: Removed timeout/request during blkfront freeze. > Reworked the whole patch to work with blk-mq and incorporate upstream's > comments] Please tag versions using vX and it would be helpful if you could list the specific changes that you performed between versions. There where 3 RFC versions IIRC, and there's no log of the changes between them. > > Signed-off-by: Anchal Agarwal <anchalag@xxxxxxxxxx> > Signed-off-by: Munehisa Kamata <kamatam@xxxxxxxxxx> > --- > drivers/block/xen-blkfront.c | 122 +++++++++++++++++++++++++++++++++-- > 1 file changed, 115 insertions(+), 7 deletions(-) > > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c > index 3b889ea950c2..464863ed7093 100644 > --- a/drivers/block/xen-blkfront.c > +++ b/drivers/block/xen-blkfront.c > @@ -48,6 +48,8 @@ > #include <linux/list.h> > #include <linux/workqueue.h> > #include <linux/sched/mm.h> > +#include <linux/completion.h> > +#include <linux/delay.h> > > #include <xen/xen.h> > #include <xen/xenbus.h> > @@ -80,6 +82,8 @@ enum blkif_state { > BLKIF_STATE_DISCONNECTED, > BLKIF_STATE_CONNECTED, > BLKIF_STATE_SUSPENDED, > + BLKIF_STATE_FREEZING, > + BLKIF_STATE_FROZEN Nit: adding a terminating ',' would prevent further additions from having to modify this line. > }; > > struct grant { > @@ -219,6 +223,7 @@ struct blkfront_info > struct list_head requests; > struct bio_list bio_list; > struct list_head info_list; > + struct completion wait_backend_disconnected; > }; > > static unsigned int nr_minors; > @@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size, > info->sector_size = sector_size; > info->physical_sector_size = physical_sector_size; > blkif_set_queue_limits(info); > + init_completion(&info->wait_backend_disconnected); > > return 0; > } > @@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset) > case XEN_SCSI_DISK5_MAJOR: > case XEN_SCSI_DISK6_MAJOR: > case XEN_SCSI_DISK7_MAJOR: > - *offset = (*minor / PARTS_PER_DISK) + > + *offset = (*minor / PARTS_PER_DISK) + > ((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) + > EMULATED_SD_DISK_NAME_OFFSET; > *minor = *minor + > @@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset) > case XEN_SCSI_DISK13_MAJOR: > case XEN_SCSI_DISK14_MAJOR: > case XEN_SCSI_DISK15_MAJOR: > - *offset = (*minor / PARTS_PER_DISK) + > + *offset = (*minor / PARTS_PER_DISK) + Unrelated changes, please split to a pre-patch. > ((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) + > EMULATED_SD_DISK_NAME_OFFSET; > *minor = *minor + > @@ -1353,6 +1359,8 @@ static void blkif_free(struct blkfront_info *info, int suspend) > unsigned int i; > struct blkfront_ring_info *rinfo; > > + if (info->connected == BLKIF_STATE_FREEZING) > + goto free_rings; > /* Prevent new requests being issued until we fix things up. */ > info->connected = suspend ? > BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED; > @@ -1360,6 +1368,7 @@ static void blkif_free(struct blkfront_info *info, int suspend) > if (info->rq) > blk_mq_stop_hw_queues(info->rq); > > +free_rings: > for_each_rinfo(info, rinfo, i) > blkif_free_ring(rinfo); > > @@ -1563,8 +1572,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id) > struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id; > struct blkfront_info *info = rinfo->dev_info; > > - if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) > - return IRQ_HANDLED; > + if (unlikely(info->connected != BLKIF_STATE_CONNECTED > + && info->connected != BLKIF_STATE_FREEZING)){ Extra tab and missing space between '){'. Also my preference would be for the && to go at the end of the previous line, like it's done elsewhere in the file. > + return IRQ_HANDLED; > + } > > spin_lock_irqsave(&rinfo->ring_lock, flags); > again: > @@ -2027,6 +2038,7 @@ static int blkif_recover(struct blkfront_info *info) > unsigned int segs; > struct blkfront_ring_info *rinfo; > > + bool frozen = info->connected == BLKIF_STATE_FROZEN; Please put this together with the rest of the variable definitions, and leave the empty line as a split between variable definitions and code. I've already requested this on RFC v3 but you seem to have dropped some of the requests I've made there. > blkfront_gather_backend_features(info); > /* Reset limits changed by blk_mq_update_nr_hw_queues(). */ > blkif_set_queue_limits(info); > @@ -2048,6 +2060,9 @@ static int blkif_recover(struct blkfront_info *info) > kick_pending_request_queues(rinfo); > } > > + if (frozen) > + return 0; > + > list_for_each_entry_safe(req, n, &info->requests, queuelist) { > /* Requeue pending requests (flush or discard) */ > list_del_init(&req->queuelist); > @@ -2364,6 +2379,7 @@ static void blkfront_connect(struct blkfront_info *info) > > return; > case BLKIF_STATE_SUSPENDED: > + case BLKIF_STATE_FROZEN: > /* > * If we are recovering from suspension, we need to wait > * for the backend to announce it's features before > @@ -2481,12 +2497,36 @@ static void blkback_changed(struct xenbus_device *dev, > break; > > case XenbusStateClosed: > - if (dev->state == XenbusStateClosed) > + if (dev->state == XenbusStateClosed) { > + if (info->connected == BLKIF_STATE_FREEZING) { > + blkif_free(info, 0); > + info->connected = BLKIF_STATE_FROZEN; > + complete(&info->wait_backend_disconnected); > + break; There's no need for the break here, you can rely on the break below. > + } > + > break; > + } > + > + /* > + * We may somehow receive backend's Closed again while thawing > + * or restoring and it causes thawing or restoring to fail. > + * Ignore such unexpected state regardless of the backend state. > + */ > + if (info->connected == BLKIF_STATE_FROZEN) { I think you can join this with the previous dev->state == XenbusStateClosed? Also, won't the device be in the Closed state already if it's in state frozen? > + dev_dbg(&dev->dev, > + "ignore the backend's Closed state: %s", > + dev->nodename); > + break; > + } > /* fall through */ > case XenbusStateClosing: > - if (info) > - blkfront_closing(info); > + if (info) { > + if (info->connected == BLKIF_STATE_FREEZING) > + xenbus_frontend_closed(dev); > + else > + blkfront_closing(info); > + } > break; > } > } > @@ -2630,6 +2670,71 @@ static void blkif_release(struct gendisk *disk, fmode_t mode) > mutex_unlock(&blkfront_mutex); > } > > +static int blkfront_freeze(struct xenbus_device *dev) > +{ > + unsigned int i; > + struct blkfront_info *info = dev_get_drvdata(&dev->dev); > + struct blkfront_ring_info *rinfo; > + /* This would be reasonable timeout as used in xenbus_dev_shutdown() */ > + unsigned int timeout = 5 * HZ; > + unsigned long flags; > + int err = 0; > + > + info->connected = BLKIF_STATE_FREEZING; > + > + blk_mq_freeze_queue(info->rq); > + blk_mq_quiesce_queue(info->rq); > + > + for_each_rinfo(info, rinfo, i) { > + /* No more gnttab callback work. */ > + gnttab_cancel_free_callback(&rinfo->callback); > + /* Flush gnttab callback work. Must be done with no locks held. */ > + flush_work(&rinfo->work); > + } > + > + for_each_rinfo(info, rinfo, i) { > + spin_lock_irqsave(&rinfo->ring_lock, flags); > + if (RING_FULL(&rinfo->ring) > + || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) { '||' should go at the end of the previous line. > + xenbus_dev_error(dev, err, "Hibernation Failed. > + The ring is still busy"); > + info->connected = BLKIF_STATE_CONNECTED; > + spin_unlock_irqrestore(&rinfo->ring_lock, flags); You need to unfreeze the queues here, or else the device will be in a blocked state AFAICT. > + return -EBUSY; > + } > + spin_unlock_irqrestore(&rinfo->ring_lock, flags); > + } This block has indentation all messed up. > + /* Kick the backend to disconnect */ > + xenbus_switch_state(dev, XenbusStateClosing); > + > + /* > + * We don't want to move forward before the frontend is diconnected > + * from the backend cleanly. > + */ > + timeout = wait_for_completion_timeout(&info->wait_backend_disconnected, > + timeout); > + if (!timeout) { > + err = -EBUSY; Note err is only used here, and I think could just be dropped. > + xenbus_dev_error(dev, err, "Freezing timed out;" > + "the device may become inconsistent state"); Leaving the device in this state is quite bad, as it's in a closed state and with the queues frozen. You should make an attempt to restore things to a working state. > + } > + > + return err; > +} > + > +static int blkfront_restore(struct xenbus_device *dev) > +{ > + struct blkfront_info *info = dev_get_drvdata(&dev->dev); > + int err = 0; > + > + err = talk_to_blkback(dev, info); > + blk_mq_unquiesce_queue(info->rq); > + blk_mq_unfreeze_queue(info->rq); > + if (!err) > + blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings); Bad indentation. Also shouldn't you first update the queues and then unfreeze them? Thanks, Roger.