The documentation describes the details of the NVMe hardware extension to support VFIO live migration. Signed-off-by: Lei Rao <lei.rao@xxxxxxxxx> Signed-off-by: Yadong Li <yadong.li@xxxxxxxxx> Signed-off-by: Chaitanya Kulkarni <kch@xxxxxxxxxx> Reviewed-by: Eddie Dong <eddie.dong@xxxxxxxxx> Reviewed-by: Hang Yuan <hang.yuan@xxxxxxxxx> --- drivers/vfio/pci/nvme/nvme.txt | 278 +++++++++++++++++++++++++++++++++ 1 file changed, 278 insertions(+) create mode 100644 drivers/vfio/pci/nvme/nvme.txt diff --git a/drivers/vfio/pci/nvme/nvme.txt b/drivers/vfio/pci/nvme/nvme.txt new file mode 100644 index 000000000000..eadcf2082eed --- /dev/null +++ b/drivers/vfio/pci/nvme/nvme.txt @@ -0,0 +1,278 @@ +=========================== +NVMe Live Migration Support +=========================== + +Introduction +------------ +To support live migration, NVMe device designs its own implementation, +including five new specific admin commands and a capability flag in +the vendor-specific field in the identify controller data structure to +support VF's live migration usage. Software can use these live migration +admin commands to get device migration state data size, save and load the +data, suspend and resume the given VF device. They are submitted by software +to the NVMe PF device's admin queue and ignored if placed in the VF device's +admin queue. This is due to the NVMe VF device being passed to the virtual +machine in the virtualization scenario. So VF device's admin queue is not +available for the hypervisor to submit VF device live migration commands. +The capability flag in the identify controller data structure can be used by +software to detect if the NVMe device supports live migration. The following +chapters introduce the detailed format of the commands and the capability flag. + +Definition of opcode for live migration commands +------------------------------------------------ + ++---------------------------+-----------+-----------+------------+ +| | | | | +| Opcode by Field | | | | +| | | | | ++--------+---------+--------+ | | | +| | | | Combined | Namespace | | +| 07 | 06:02 | 01:00 | Opcode | Identifier| Command | +| | | | | used | | ++--------+---------+--------+ | | | +|Generic | Function| Data | | | | +|command | |Transfer| | | | ++--------+---------+--------+-----------+-----------+------------+ +| | +| Vendor SpecificOpcode | ++--------+---------+--------+-----------+-----------+------------+ +| | | | | | Query the | +| 1b | 10001 | 00 | 0xC4 | | data size | ++--------+---------+--------+-----------+-----------+------------+ +| | | | | | Suspend the| +| 1b | 10010 | 00 | 0xC8 | | VF | ++--------+---------+--------+-----------+-----------+------------+ +| | | | | | Resume the | +| 1b | 10011 | 00 | 0xCC | | VF | ++--------+---------+--------+-----------+-----------+------------+ +| | | | | | Save the | +| 1b | 10100 | 10 | 0xD2 | |device data | ++--------+---------+--------+-----------+-----------+------------+ +| | | | | | Load the | +| 1b | 10101 | 01 | 0xD5 | |device data | ++--------+---------+--------+-----------+-----------+------------+ + +Definition of QUERY_DATA_SIZE command +------------------------------------- + ++---------+------------------------------------------------------------------------------------+ +| | | +| Bytes | Description | +| | | ++---------+------------------------------------------------------------------------------------+ +| | | +| | | +| | +-----------+--------------------------------------------------------------------+ | +| | | Bits |Description | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 07:00 |Opcode(OPC):set to 0xC4 to indicate a qeury command | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for more details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| 03:00 | | 13:10 |Reserved | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 15:14 |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 31:16 |Command Identifier(CID) | | +| | +-----------+--------------------------------------------------------------------+ | +| | | +| | | ++---------+------------------------------------------------------------------------------------+ +| 39:04 | Reserved | ++---------+------------------------------------------------------------------------------------+ +| 41:40 | VF index: means which VF controller internal data size to query | ++---------+------------------------------------------------------------------------------------+ +| 63:42 | Reserved | ++---------+------------------------------------------------------------------------------------+ + +The QUERY_DATA_SIZE command is used to query the NVMe VF internal data size for live migration. +When the NVMe firmware receives the command, it will return the size of NVMe VF internal +data. The data size depends on how many IO queues are created. + +Definition of SUSPEND command +----------------------------- + ++---------+------------------------------------------------------------------------------------+ +| | | +| Bytes | Description | +| | | ++---------+------------------------------------------------------------------------------------+ +| | | +| | | +| | +-----------+--------------------------------------------------------------------+ | +| | | Bits |Description | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 07:00 |Opcode(OPC):set to 0xC8 to indicate a suspend command | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 09:08 |Fused Operation(FUSE):Please see NVMe specification for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| 03:00 | | 13:10 |Reserved | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 31:16 |Command Identifier(CID) | | +| | +-----------+--------------------------------------------------------------------+ | +| | | +| | | ++---------+------------------------------------------------------------------------------------+ +| 39:04 | Reserved | ++---------+------------------------------------------------------------------------------------+ +| 41:40 | VF index: means which VF controller to suspend | ++---------+------------------------------------------------------------------------------------+ +| 63:42 | Reserved | ++---------+------------------------------------------------------------------------------------+ + +The SUSPEND command is used to suspend the NVMe VF controller. When the NVMe firmware receives +this command, it will suspend the NVMe VF controller. + +Definition of RESUME command +---------------------------- + ++---------+------------------------------------------------------------------------------------+ +| | | +| Bytes | Description | +| | | ++---------+------------------------------------------------------------------------------------+ +| | | +| | | +| | +-----------+--------------------------------------------------------------------+ | +| | | Bits |Description | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 07:00 |Opcode(OPC):set to 0xCC to indicate a resume command | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| 03:00 | | 13:10 |Reserved | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 31:16 |Command Identifier(CID) | | +| | +-----------+--------------------------------------------------------------------+ | +| | | +| | | ++---------+------------------------------------------------------------------------------------+ +| 39:04 | Reserved | ++---------+------------------------------------------------------------------------------------+ +| 41:40 | VF index: means which VF controller to resume | ++---------+------------------------------------------------------------------------------------+ +| 63:42 | Reserved | ++---------+------------------------------------------------------------------------------------+ + +The RESUME command is used to resume the NVMe VF controller. When firmware receives this command, +it will restart the NVMe VF controller. + +Definition of SAVE_DATA command +-------------------------- + ++---------+------------------------------------------------------------------------------------+ +| | | +| Bytes | Description | +| | | ++---------+------------------------------------------------------------------------------------+ +| | | +| | | +| | +-----------+--------------------------------------------------------------------+ | +| | | Bits |Description | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 07:00 |Opcode(OPC):set to 0xD2 to indicate a save command | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| 03:00 | | 13:10 |Reserved | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 31:16 |Command Identifier(CID) | | +| | +-----------+--------------------------------------------------------------------+ | +| | | +| | | ++---------+------------------------------------------------------------------------------------+ +| 23:04 | Reserved | ++---------+------------------------------------------------------------------------------------+ +| 31:24 | PRP Entry1:the first PRP entry for the commmand or a PRP List Pointer | ++---------+------------------------------------------------------------------------------------+ +| 39:32 | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)| ++---------+------------------------------------------------------------------------------------+ +| 41:40 | VF index: means which VF controller internal data to save | ++---------+------------------------------------------------------------------------------------+ +| 63:42 | Reserved | ++---------+------------------------------------------------------------------------------------+ + +The SAVE_DATA command is used to save the NVMe VF internal data for live migration. When firmware +receives this command, it will save the admin queue states, save some registers, drain IO SQs +and CQs, save every IO queue state, disable the VF controller, and transfer all data to the +host memory through DMA. + +Definition of LOAD_DATA command +-------------------------- + ++---------+------------------------------------------------------------------------------------+ +| | | +| Bytes | Description | +| | | ++---------+------------------------------------------------------------------------------------+ +| | | +| | | +| | +-----------+--------------------------------------------------------------------+ | +| | | Bits |Description | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 07:00 |Opcode(OPC):set to 0xD5 to indicate a load command | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| 03:00 | | 13:10 |Reserved | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 15:14 |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 31:16 |Command Identifier(CID) | | +| | +-----------+--------------------------------------------------------------------+ | +| | | +| | | ++---------+------------------------------------------------------------------------------------+ +| 23:04 | Reserved | ++---------+------------------------------------------------------------------------------------+ +| 31:24 | PRP Entry1:the first PRP entry for the commmand or a PRP List Pointer | ++---------+------------------------------------------------------------------------------------+ +| 39:32 | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)| ++---------+------------------------------------------------------------------------------------+ +| 41:40 | VF index: means which VF controller internal data to load | ++---------+------------------------------------------------------------------------------------+ +| 47:44 | Size: means the size of the device's internal data to be loaded | ++---------+------------------------------------------------------------------------------------+ +| 63:48 | Reserved | ++---------+------------------------------------------------------------------------------------+ + +The LOAD_DATA command is used to restore the NVMe VF internal data. When firmware receives this +command, it will read the device internal's data from the host memory through DMA, restore the +admin queue states and some registers, and restore every IO queue state. + +Extensions of the vendor-specific field in the identify controller data structure +--------------------------------------------------------------------------------- + ++---------+------+------+------+-------------------------------+ +| | | | | | +| Bytes | I/O |Admin | Disc | Description | +| | | | | | ++---------+------+------+------+-------------------------------+ +| | | | | | +| 01:00 | M | M | R | PCI Vendor ID(VID) | ++---------+------+------+------+-------------------------------+ +| | | | | | +| 03:02 | M | M | R | PCI Subsytem Vendor ID(SSVID) | ++---------+------+------+------+-------------------------------+ +| | | | | | +| ... | ... | ... | ... | ... | ++---------+------+------+------+-------------------------------+ +| | | | | | +| 3072 | O | O | O | Live Migration Support | ++---------+------+------+------+-------------------------------+ +| | | | | | +|4095:3073| O | O | O | Vendor Specific | ++---------+------+------+------+-------------------------------+ + +According to NVMe specification, the bytes from 3072 to 4095 are vendor-specific fields. +NVMe device uses the 3072 bytes in the identify controller data structure to indicate +whether live migration is supported. 0x0 means live migration is not supported. 0x01 means +live migration is supported, and other values are reserved. + +[1] https://nvmexpress.org/wp-content/uploads/NVMe-NVM-Express-2.0a-2021.07.26-Ratified.pdf -- 2.34.1