On 6/16/2023 1:06 AM, Tian, Kevin wrote:
From: Brett Creeley <brett.creeley@xxxxxxx>
Sent: Saturday, June 3, 2023 6:03 AM
Add live migration support via the VFIO subsystem. The migration
implementation aligns with the definition from uapi/vfio.h and uses
the pds_core PF's adminq for device configuration.
The ability to suspend, resume, and transfer VF device state data is
included along with the required admin queue command structures and
implementations.
PDS_LM_CMD_SUSPEND and PDS_LM_CMD_SUSPEND_STATUS are added to
support the VF device suspend operation.
PDS_LM_CMD_RESUME is added to support the VF device resume operation.
PDS_LM_CMD_STATUS is added to determine the exact size of the VF
device state data.
PDS_LM_CMD_SAVE is added to get the VF device state data.
PDS_LM_CMD_RESTORE is added to restore the VF device with the
previously saved data from PDS_LM_CMD_SAVE.
PDS_LM_CMD_HOST_VF_STATUS is added to notify the device when
a migration is in/not-in progress from the host's perspective.
Is 'the device' here referring to the PF or the VF?
The 'device' here is referring to the DSC/firmware, not the function. I
will clarify the wording here. Thanks.
and how would the device use this information?
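For reference, the HOST_VF_STATUS notification itself is just a one-shot
wrapped adminq command; a minimal sketch, with the union member and enum
names assumed from later hunks in this series rather than confirmed here:

static void
pds_vfio_send_host_vf_lm_status_cmd(struct pds_vfio_pci_device *pds_vfio,
				    enum pds_lm_host_vf_status vf_status)
{
	union pds_core_adminq_cmd cmd = {
		.lm_host_vf_status = {
			.opcode = PDS_LM_CMD_HOST_VF_STATUS,
			.vf_id = cpu_to_le16(pds_vfio->vf_id),
			.status = vf_status,
		},
	};
	union pds_core_adminq_comp comp = {};

	/* Best effort: the notification is advisory, so an error here is
	 * only worth logging, not failing the state transition */
	if (pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp, 0))
		dev_warn(pds_vfio_to_dev(pds_vfio),
			 "vf%u: HOST_VF_STATUS failed\n", pds_vfio->vf_id);
}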
+
+static int pds_vfio_client_adminq_cmd(struct pds_vfio_pci_device *pds_vfio,
+ union pds_core_adminq_cmd *req,
+ size_t req_len,
+ union pds_core_adminq_comp *resp,
+ u64 flags)
+{
+ union pds_core_adminq_cmd cmd = {};
+ size_t cp_len;
+ int err;
+
+ /* Wrap the client request */
+ cmd.client_request.opcode = PDS_AQ_CMD_CLIENT_CMD;
+ cmd.client_request.client_id = cpu_to_le16(pds_vfio->client_id);
+ cp_len = min_t(size_t, req_len, sizeof(cmd.client_request.client_cmd));
'req_len' is kind of redundant. It looks like all the callers use sizeof(req).
It does a memcpy based on the minimum of req_len and the size of the
wrapped client command buffer.
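For reference, the rest of the wrapper then copies that min-sized payload
into the wrapped request and posts it on the PF's adminq. A sketch of the
tail, assuming pds_core's pdsc_adminq_post() and a pds_vfio->pdsc handle
(the handle name is not confirmed by this hunk):

	/* Copy the client command into the wrapper; anything beyond the
	 * wrapper's payload size is truncated via cp_len */
	memcpy(cmd.client_request.client_cmd, req, cp_len);

	/* Post on the pds_core PF's adminq and wait for the completion */
	err = pdsc_adminq_post(pds_vfio->pdsc, &cmd, resp,
			       !!(flags & PDS_AQ_FLAG_FASTPOLL));
	if (err)
		dev_dbg(pds_vfio_to_dev(pds_vfio),
			"client adminq cmd err: %pe\n", ERR_PTR(err));

	return err;
}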
+static int
+pds_vfio_suspend_wait_device_cmd(struct pds_vfio_pci_device *pds_vfio)
+{
+ union pds_core_adminq_cmd cmd = {
+ .lm_suspend_status = {
+ .opcode = PDS_LM_CMD_SUSPEND_STATUS,
+ .vf_id = cpu_to_le16(pds_vfio->vf_id),
+ },
+ };
+ struct device *dev = pds_vfio_to_dev(pds_vfio);
+ union pds_core_adminq_comp comp = {};
+ unsigned long time_limit;
+ unsigned long time_start;
+ unsigned long time_done;
+ int err;
+
+ time_start = jiffies;
+ time_limit = time_start + HZ * SUSPEND_TIMEOUT_S;
+ do {
+ err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
+ &comp, PDS_AQ_FLAG_FASTPOLL);
+ if (err != -EAGAIN)
+ break;
+
+ msleep(SUSPEND_CHECK_INTERVAL_MS);
+ } while (time_before(jiffies, time_limit));
pds_vfio_client_adminq_cmd() has exactly the same mechanism
with a 5s timeout and 1ms poll interval when FASTPOLL is set.
Probably you could introduce another flag to indicate retry on
-EAGAIN and then handle it fully in pds_vfio_client_adminq_cmd()?
That's the entire purpose of this command. It uses
pds_vfio_client_adminq_cmd() to poll the SUSPEND_STATUS. IMHO adding the
polling mechanism to pds_vfio_client_adminq_cmd() and triggering it with a
flag just adds complexity without offering any benefit.
I plan to keep this function as is to separate the functionality. Thanks.
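For completeness, the tail of the wait function presumably uses the
declared time_start/time_done pair to turn a loop timeout into an error;
a sketch of how that would typically look:

	time_done = jiffies;
	dev_dbg(dev, "vf%u: suspend comp received in %u msecs\n",
		pds_vfio->vf_id, jiffies_to_msecs(time_done - time_start));

	/* If the loop above exited because time ran out rather than because
	 * the device finished suspending, report a timeout */
	if (time_after_eq(time_done, time_limit)) {
		dev_err(dev, "vf%u: suspend comp timeout\n", pds_vfio->vf_id);
		err = -ETIMEDOUT;
	}

	return err;
}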
+int pds_vfio_suspend_device_cmd(struct pds_vfio_pci_device *pds_vfio)
+{
+ union pds_core_adminq_cmd cmd = {
+ .lm_suspend = {
+ .opcode = PDS_LM_CMD_SUSPEND,
+ .vf_id = cpu_to_le16(pds_vfio->vf_id),
+ },
+ };
+ struct device *dev = pds_vfio_to_dev(pds_vfio);
+ union pds_core_adminq_comp comp = {};
+ int err;
+
+ dev_dbg(dev, "vf%u: Suspend device\n", pds_vfio->vf_id);
+
+ err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
+ PDS_AQ_FLAG_FASTPOLL);
+ if (err) {
+ dev_err(dev, "vf%u: Suspend failed: %pe\n", pds_vfio->vf_id,
+ ERR_PTR(err));
+ return err;
+ }
+
+ return pds_vfio_suspend_wait_device_cmd(pds_vfio);
+}
The logic in this function is very confusing.
PDS_LM_CMD_SUSPEND has a completion record:
+struct pds_lm_suspend_comp {
+ u8 status;
+ u8 rsvd;
+ __le16 comp_index;
+ union {
+ __le64 state_size;
+ u8 rsvd2[11];
+ } __packed;
+ u8 color;
+};
Presumably this function can look at the completion record to know whether
the suspend request succeeds.
Why do you require another wait_device step to query the suspend status?
The driver sends the initial suspend request to tell the DSC/firmware to
suspend the VF's data/control path. The DSC/firmware will ack/nack the
suspend request in the completion.
Then the driver polls the DSC/firmware to find when the VF's
data/control path has been fully suspended. When the DSC/firmware isn't
done suspending yet it will return -EAGAIN. Otherwise it will return
success/failure.
I will add some comments clarifying these details.
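Something along these lines, perhaps:

	/*
	 * The PDS_LM_CMD_SUSPEND completion only acks that the DSC/firmware
	 * has accepted the request to suspend the VF's data/control path.
	 * The driver then polls PDS_LM_CMD_SUSPEND_STATUS, which returns
	 * -EAGAIN until the suspend has fully completed, and success/failure
	 * once it has.
	 */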
And I have another question: is it correct to hard-code the 5s timeout in
the kernel w/o any input from the VMM? Note the guest has been stopped
at this point, so the 5s timeout will very likely kill any reasonable SLA
which CSPs try hard to meet.
This gives the device a max of 5 seconds to suspend the VF's
data/control path.
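For reference, the constants this implies (values taken from the 5s
timeout and 1ms poll interval discussed in this thread):

#define SUSPEND_TIMEOUT_S		5
#define SUSPEND_CHECK_INTERVAL_MS	1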
Ideally the VMM has an estimate of how long a VM can be paused based on
the SLA, the to-be-migrated state size, the available network bandwidth,
etc., and that hint should be passed to the kernel so any state transition
which may violate that expectation can fail quickly, breaking the
migration process and putting the VM back into the running state.
For QEMU there is a downtime-limit parameter that is used as you
mentioned, but it does not account for how long it takes the device to
STOP (i.e. suspend the data/control path).
Jason/Shameer, is there similar concern in mlx/hisilicon drivers?
+
+int pds_vfio_resume_device_cmd(struct pds_vfio_pci_device *pds_vfio)
+{
+ union pds_core_adminq_cmd cmd = {
+ .lm_resume = {
+ .opcode = PDS_LM_CMD_RESUME,
+ .vf_id = cpu_to_le16(pds_vfio->vf_id),
+ },
+ };
+ struct device *dev = pds_vfio_to_dev(pds_vfio);
+ union pds_core_adminq_comp comp = {};
+
+ dev_dbg(dev, "vf%u: Resume device\n", pds_vfio->vf_id);
+
+ return pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
+ 0);
'resume' also happens in the blackout phase when the guest is not running.
So presumably FASTPOLL should be set, otherwise the max 256ms
poll interval (PDSC_ADMINQ_MAX_POLL_INTERVAL) is really inefficient.
Yeah, this is a good catch. I think setting fast_poll = true would be a
good idea here. Thanks.
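A minimal sketch of the fix (the same call, just with the fast-poll flag
set):

	/* Resume also runs during the blackout phase, so fast poll */
	return pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd), &comp,
					  PDS_AQ_FLAG_FASTPOLL);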
+
+ if (cur == VFIO_DEVICE_STATE_RUNNING && next == VFIO_DEVICE_STATE_RUNNING_P2P) {
+ pds_vfio_send_host_vf_lm_status_cmd(pds_vfio,
+ PDS_LM_STA_IN_PROGRESS);
+ err = pds_vfio_suspend_device_cmd(pds_vfio);
+ if (err)
+ return ERR_PTR(err);
+
+ return NULL;
+ }
+
+ if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_RUNNING) {
+ err = pds_vfio_resume_device_cmd(pds_vfio);
+ if (err)
+ return ERR_PTR(err);
+
+ pds_vfio_send_host_vf_lm_status_cmd(pds_vfio,
+ PDS_LM_STA_NONE);
+ return NULL;
+ }
+
+ if (cur == VFIO_DEVICE_STATE_STOP && next == VFIO_DEVICE_STATE_RUNNING_P2P)
+ return NULL;
+
+ if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next == VFIO_DEVICE_STATE_STOP)
+ return NULL;
I'm not sure whether P2P is actually supported here. By definition
P2P means the device is stopped but still responds to p2p requests
from other devices. If you look at the mlx example, it uses different
cmds between RUNNING->RUNNING_P2P and RUNNING_P2P->STOP.
But in your case it seems you simply move what is required in STOP
into P2P. Probably you can just remove the support of P2P like
hisilicon does.
In a previous review it was mentioned that P2P is more or less supported,
and this is how we are able to support it. Ideally we would not set the
P2P feature and would just implement the standard STOP/RUNNING states.
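If P2P were dropped, the change would essentially be to stop advertising
VFIO_MIGRATION_P2P and move the suspend/resume work to the RUNNING <-> STOP
arcs; a sketch, assuming migration_flags is set on the embedded vfio_device
as in the other variant drivers:

	/* Advertise only the base migration capability, like hisilicon;
	 * suspend/resume would then happen on the RUNNING <-> STOP arcs */
	pds_vfio->vfio_coredev.vdev.migration_flags = VFIO_MIGRATION_STOP_COPY;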
+
+/**
+ * struct pds_lm_comp - generic command completion
+ * @status: Status of the command (enum pds_core_status_code)
+ * @rsvd: Structure padding to 16 Bytes
+ */
+struct pds_lm_comp {
+ u8 status;
+ u8 rsvd[15];
+};
Not used. It looks like most comp structures are defined without a user,
except struct pds_lm_status_comp.
I will look into this. Thanks.