This patchset provides GVT vGPU with device state control and interfaces to
get/set device data.

Design of device state control and interfaces to get/set device data
====================================================================

CODE STRUCTURES
---------------
/* Device State region type and sub-type */
#define VFIO_REGION_TYPE_DEVICE_STATE                     (1 << 1)
#define VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL              (1)
#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG      (2)
#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY      (3)
#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP (4)

#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

#define VFIO_DEVICE_STATE_RUNNING 0
#define VFIO_DEVICE_STATE_STOP 1
#define VFIO_DEVICE_STATE_LOGGING 2

#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2

struct vfio_device_state_ctl {
        __u32 version;          /* ro, version of device control interface */
        __u32 device_state;     /* wo, VFIO device state */
        __u32 caps;             /* ro */
        struct {
                __u32 action;   /* wo, GET_BUFFER or SET_BUFFER */
                __u64 size;     /* rw, total size of device config */
        } device_config;
        struct {
                __u32 action;   /* wo, GET_BUFFER or SET_BUFFER */
                __u64 size;     /* rw, total size of device memory */
                __u64 pos;      /* chunk offset in total buffer of device memory */
        } device_memory;
        struct {
                __u64 start_addr; /* wo */
                __u64 page_nr;    /* wo */
        } system_memory;
};

DEVICE DATA
-----------
A VFIO device's data can be divided into 3 categories: device config,
device memory and system memory dirty pages.

Device Config: data such as MMIOs, page tables...
        Every device is supposed to possess device config data.
        Usually the size of device config data is small (no bigger than 10M),
        and it needs to be loaded in a certain strict order.
        Therefore no dirty data logging is enabled for device config and it
        must be got/set as a whole.

Device Memory: the device's internal memory, standalone and outside system
        memory. It is usually very big.
        Not all devices have device memory. IGD, for example, only uses
        system memory and has no device memory.

System Memory Dirty Pages: a device can produce dirty pages in system
        memory.

DEVICE STATE REGIONS
--------------------
A VFIO device driver needs to register two mandatory regions and optionally
another two regions if it plans to support device state management.
So, there are up to four regions in total.
One is the control region (region CTL), which is accessed via read/write
system calls from user space; the remaining three are data regions, which are
mmaped into user space and can be accessed in the same way as memory.
(If a data region fails to be mmaped into user space, accessing it through
read/write system calls from user space is also valid.)

1. region CTL:
        Mandatory.
        This is a control region.
        Its layout is defined in struct vfio_device_state_ctl.
        Reading from this region gets the version, capabilities and data
        size of the device state interfaces.
        Writing to this region sets the device state and data size, and
        chooses which interface to use, i.e. one of
        "get device config buffer", "set device config buffer",
        "get device memory buffer", "set device memory buffer",
        "get system memory dirty bitmap".
2. region DEVICE_CONFIG:
        Mandatory.
        This is a data region that holds device config data.
        It can be mmaped into user space.
3. region DEVICE_MEMORY:
        Optional.
        This is a data region that holds device memory data.
        It can be mmaped into user space.
4. region DIRTY_BITMAP:
        Optional.
        This is a data region that holds the bitmap of dirty pages in system
        memory that a VFIO device produces.
        It can be mmaped into user space.

DEVICE STATES
-------------
Four states are defined for a VFIO device:
        RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
They can be set by writing to the device_state field of the
vfio_device_state_ctl region.
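
As an illustration of the paragraph above, a minimal user-space sketch of
composing a device_state value and writing it to region CTL might look like
the following. It assumes that the combined states are encoded by OR-ing
VFIO_DEVICE_STATE_LOGGING into RUNNING or STOP (this cover letter does not
spell out the encoding), that ctl_offset is the region offset reported by
VFIO_DEVICE_GET_REGION_INFO, and that device_state directly follows the 32-bit
version field as in struct vfio_device_state_ctl. make_device_state(),
set_device_state() and struct ctl_head are hypothetical names used only for
this sketch; they are not part of the patchset.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Values copied from the CODE STRUCTURES section. */
#define VFIO_DEVICE_STATE_RUNNING 0
#define VFIO_DEVICE_STATE_STOP    1
#define VFIO_DEVICE_STATE_LOGGING 2

/*
 * Mirror of the leading fields of struct vfio_device_state_ctl, used here
 * only to compute the byte offset of device_state inside region CTL.
 */
struct ctl_head {
        uint32_t version;
        uint32_t device_state;
        uint32_t caps;
};

/*
 * Hypothetical helper: build one of the four states.
 * Assumption: LOGGING is OR-ed into RUNNING or STOP.
 */
static uint32_t make_device_state(int stopped, int logging)
{
        uint32_t state = stopped ? VFIO_DEVICE_STATE_STOP
                                 : VFIO_DEVICE_STATE_RUNNING;

        if (logging)
                state |= VFIO_DEVICE_STATE_LOGGING;
        return state;
}

/*
 * Hypothetical helper: write a state to the device_state field of region CTL
 * through the VFIO device fd. Returns 0 on success, -1 on failure.
 */
static int set_device_state(int device_fd, off_t ctl_offset, uint32_t state)
{
        off_t field = ctl_offset +
                      (off_t)offsetof(struct ctl_head, device_state);
        ssize_t n;

        /* Only the four combinations described above are meaningful. */
        assert(state <= (VFIO_DEVICE_STATE_STOP | VFIO_DEVICE_STATE_LOGGING));

        n = pwrite(device_fd, &state, sizeof(state), field);
        return n == (ssize_t)sizeof(state) ? 0 : -1;
}
```

Under these assumptions, entering RUNNING & LOGGING would be
set_device_state(fd, ctl_offset, make_device_state(0, 1)).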

LOGGING is a special state that CANNOT exist independently. It must be
set alongside state RUNNING or STOP, i.e.
        RUNNING & LOGGING, STOP & LOGGING
It is used for dirty data logging of both device memory and system memory.

LOGGING only impacts device/system memory. Device config should always be
accessible and return the whole config snapshot regardless of LOGGING state.

Typical state transition flows for VFIO devices are:
(a) RUNNING --> RUNNING & LOGGING --> STOP & LOGGING --> STOP
(b) RUNNING --> STOP --> RUNNING

RUNNING: In this state, a VFIO device is active and ready to receive
        commands from the device driver.
        The interfaces "get device config buffer", "get device config size",
        "get device memory buffer", "get device memory size" return the whole
        data snapshot.
        "get system memory dirty bitmap" returns an empty bitmap.
        It is the default state that a VFIO device enters initially.

STOP: In this state, a VFIO device is deactivated and does not interact with
        the device driver.
        "get device config buffer", "get device config size",
        "get device memory buffer", "get device memory size" return the whole
        data snapshot.
        "get system memory dirty bitmap" returns an empty bitmap.

RUNNING & LOGGING: In this state, a VFIO device is active.
        "get device config buffer", "get device config size" return the whole
        snapshot of device config.
        "get device memory buffer", "get device memory size",
        "get system memory dirty bitmap" return the dirty data since the last
        call to those interfaces.

STOP & LOGGING: In this state, the VFIO device is deactivated.
        "get device config buffer", "get device config size" return the whole
        snapshot of device config.
        "get device memory buffer", "get device memory size",
        "get system memory dirty bitmap" return the dirty data since the last
        call to those interfaces.

Note:
The reason why RUNNING is the default state is that a device's active state
must not depend on the device state interface.
It is possible that region vfio_device_state_ctl fails to get registered.
In that condition, a device needs to be in active state by default.

DEVICE DATA CAPS
----------------
Device Config capability is on by default; there is no need to set this cap.

For devices that have device memory, it is required to expose the
DEVICE_MEMORY capability.
#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1

For devices producing dirty pages in system memory, it is required to
expose cap SYSTEM_MEMORY in order to get the dirty bitmap for a certain range
of system memory.
#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

See section "DEVICE STATE INTERFACES" for the "get caps" interface used to get
device data caps from userspace VFIO.

DEVICE STATE INTERFACES
-----------------------
1. get version
   (1) user space calls the read system call on the "version" field of region
       CTL.
   (2) VFIO driver writes the version number of the device state interfaces
       to the "version" field of region CTL.

2. get caps
   (1) user space calls the read system call on the "caps" field of region CTL.
   (2) if a VFIO device has huge device memory, VFIO driver reports
       VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY in the "caps" field of region CTL.
       if a VFIO device produces dirty pages in system memory, VFIO driver
       reports VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY in the "caps" field of
       region CTL.

3. set device state
   (1) user space calls the write system call on the "device_state" field of
       region CTL.
   (2) device state transitions as:

       RUNNING -- start dirty data logging --> RUNNING & LOGGING
       RUNNING -- deactivate --> STOP
       RUNNING -- deactivate & start dirty data logging --> STOP & LOGGING
       RUNNING & LOGGING -- stop dirty data logging --> RUNNING
       RUNNING & LOGGING -- deactivate --> STOP & LOGGING
       RUNNING & LOGGING -- deactivate & stop dirty data logging --> STOP
       STOP -- activate --> RUNNING
       STOP -- start dirty data logging --> STOP & LOGGING
       STOP -- activate & start dirty data logging --> RUNNING & LOGGING
       STOP & LOGGING -- stop dirty data logging --> STOP
       STOP & LOGGING -- activate --> RUNNING & LOGGING
       STOP & LOGGING -- activate & stop dirty data logging --> RUNNING

4. get device config size
   (1) user space calls the read system call on the "device_config.size" field
       of region CTL for the total size of the device config snapshot.
   (2) VFIO driver writes the device config data's total size to the
       "device_config.size" field of region CTL.

5. set device config size
   (1) user space calls the write system call.
       total size of device config snapshot --> "device_config.size" field
       of region CTL.
   (2) VFIO driver reads the device config data's total size from the
       "device_config.size" field of region CTL.

6. get device config buffer
   (1) user space calls the write system call.
       "GET_BUFFER" --> "device_config.action" field of region CTL.
   (2) VFIO driver
       a. gets the whole snapshot of device config
       b. writes the whole device config snapshot to region DEVICE_CONFIG.
   (3) user space reads the whole device config snapshot from region
       DEVICE_CONFIG.

7. set device config buffer
   (1) user space writes the whole device config data to region
       DEVICE_CONFIG.
   (2) user space calls the write system call.
       "SET_BUFFER" --> "device_config.action" field of region CTL.
   (3) VFIO driver loads the whole device config from region DEVICE_CONFIG.

8. get device memory size
   (1) user space calls the read system call on the "device_memory.size" field
       of region CTL for the device memory size.
   (2) VFIO driver
       a. gets the device memory snapshot (in state RUNNING or STOP), or
          gets the device memory dirty data (in state RUNNING & LOGGING or
          state STOP & LOGGING)
       b. writes the size to the "device_memory.size" field of region CTL.

9. set device memory size
   (1) user space calls the write system call on the "device_memory.size"
       field of region CTL to set the total size of the device memory
       snapshot.
   (2) VFIO driver reads the device memory's size from the
       "device_memory.size" field of region CTL.

10. get device memory buffer
   (1) user space calls the write system call.
       pos --> "device_memory.pos" field of region CTL,
       "GET_BUFFER" --> "device_memory.action" field of region CTL.
       (pos must be 0 or a multiple of the length of region DEVICE_MEMORY).
   (2) VFIO driver writes the N'th chunk of device memory snapshot/dirty data
       to region DEVICE_MEMORY.
       (N equals pos/(region length of DEVICE_MEMORY))
   (3) user space reads the N'th chunk of device memory snapshot/dirty data
       from region DEVICE_MEMORY.

11. set device memory buffer
   (1) user space writes the N'th chunk of device memory snapshot/dirty data
       to region DEVICE_MEMORY.
       (N equals pos/(region length of DEVICE_MEMORY))
   (2) user space writes pos to the "device_memory.pos" field and writes
       "SET_BUFFER" to the "device_memory.action" field of region CTL.
   (3) VFIO driver loads the N'th chunk of device memory snapshot/dirty data
       from region DEVICE_MEMORY.

12. get system memory dirty bitmap
   (1) user space calls the write system call to specify the range of system
       memory in which to query dirty pages.
       system memory's start address --> "system_memory.start_addr" field of
       region CTL,
       system memory's page count --> "system_memory.page_nr" field of
       region CTL.
   (2) if the device is not in state RUNNING & LOGGING or STOP & LOGGING,
       VFIO driver returns an empty bitmap; otherwise,
       VFIO driver checks page_nr;
       if it is larger than the size that region DIRTY_BITMAP can support,
       an error is returned; if not,
       VFIO driver returns a bitmap specifying the dirty pages that the
       device has produced in this range of system memory since the last
       query.
   (3) user space reads back the dirty bitmap from region DIRTY_BITMAP.

EXAMPLE USAGE
-------------
Take live migration of a VFIO device as an example of using those device state
interfaces.
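
Before walking through the migration flows, the chunked transfer used by
"get device memory buffer" and "set device memory buffer" above can be
sketched as plain arithmetic on the size reported in "device_memory.size" and
the length of region DEVICE_MEMORY. This is only a sketch: the helper names
chunk_count(), chunk_pos() and chunk_len() are hypothetical, and it assumes
that the final chunk may be shorter than the region when the total size is
not a multiple of the region length, which this cover letter does not state
explicitly.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of GET_BUFFER/SET_BUFFER rounds needed to move total_size bytes
 * through a DEVICE_MEMORY region that is region_len bytes long.
 */
static uint64_t chunk_count(uint64_t total_size, uint64_t region_len)
{
        assert(region_len > 0);
        return total_size / region_len + (total_size % region_len != 0);
}

/*
 * Value to write to "device_memory.pos" for the N'th chunk. It is always 0
 * or a multiple of the region length, as the interface requires.
 */
static uint64_t chunk_pos(uint64_t n, uint64_t region_len)
{
        return n * region_len;
}

/*
 * Number of valid bytes in the chunk that starts at pos.
 * Assumption: only the last chunk may be shorter than the region.
 */
static uint64_t chunk_len(uint64_t total_size, uint64_t region_len,
                          uint64_t pos)
{
        uint64_t left;

        assert(region_len > 0);
        assert(pos % region_len == 0);
        assert(pos < total_size);

        left = total_size - pos;
        return left < region_len ? left : region_len;
}
```

For a 10M device memory snapshot and a 4M DEVICE_MEMORY region, user space
would run three rounds under these assumptions, with pos set to 0, 4M and 8M,
and the last round would carry 2M of data.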

Live migration save path:

(QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)

MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
 |
MIGRATION_STATUS_SAVE_SETUP
 |
 .save_setup callback -->
 get device memory size (whole snapshot size)
 get device memory buffer (whole snapshot data)
 set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
 |
MIGRATION_STATUS_ACTIVE
 |
 .save_live_pending callback --> get device memory size (dirty data)
 .save_live_iteration callback --> get device memory buffer (dirty data)
 .log_sync callback --> get system memory dirty bitmap
 |
(vcpu stops) --> set device state -->
 VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
 |
.save_live_complete_precopy callback -->
 get device memory size (dirty data)
 get device memory buffer (dirty data)
 get device config size (whole snapshot size)
 get device config buffer (whole snapshot data)
 |
.save_cleanup callback --> set device state --> VFIO_DEVICE_STATE_STOP
MIGRATION_STATUS_COMPLETED

MIGRATION_STATUS_CANCELLED or
MIGRATION_STATUS_FAILED
 |
(vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING

Live migration load path:

(QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)

MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
 |
(vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
 |
MIGRATION_STATUS_ACTIVE
 |
.load state callback -->
 set device memory size, set device memory buffer,
 set device config size, set device config buffer
 |
(vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
 |
MIGRATION_STATUS_COMPLETED

Patch Organization
==================
The first 6 patches let a vGPU view its base ggtt address as starting from 0.
Before a vGPU submits workloads to HW, its workloads are trapped and their
commands are scanned and patched to start from the base address of the ggtt
partition assigned to the vGPU.

The latter two patches implement the VFIO device states interfaces.
Patch 7 implements loading device config data from a vGPU and restoring
device config data into a vGPU through GVT's internal interface
intel_gvt_save_restore.
Patch 8 exposes the device states interfaces to userspace VFIO through VFIO
regions of type VFIO_REGION_TYPE_DEVICE_STATE. Through those regions, user
space VFIO can get/set a device's state and data.

Yan Zhao (2):
  drm/i915/gvt: vGPU device config data save/restore interface
  drm/i915/gvt: VFIO device states interfaces

Yulei Zhang (6):
  drm/i915/gvt: Apply g2h adjust for GTT mmio access
  drm/i915/gvt: Apply g2h adjustment during fence mmio access
  drm/i915/gvt: Patch the gma in gpu commands during command parser
  drm/i915/gvt: Retrieve the guest gm base address from PVINFO
  drm/i915/gvt: Align the guest gm aperture start offset for live
    migration
  drm/i915/gvt: Apply g2h adjustment to buffer start gma for dmabuf

 drivers/gpu/drm/i915/gvt/Makefile      |   2 +-
 drivers/gpu/drm/i915/gvt/aperture_gm.c |   6 +-
 drivers/gpu/drm/i915/gvt/cfg_space.c   |   3 +-
 drivers/gpu/drm/i915/gvt/cmd_parser.c  |  31 +-
 drivers/gpu/drm/i915/gvt/dmabuf.c      |   3 +
 drivers/gpu/drm/i915/gvt/execlist.c    |   2 +-
 drivers/gpu/drm/i915/gvt/gtt.c         |  25 +-
 drivers/gpu/drm/i915/gvt/gtt.h         |   3 +
 drivers/gpu/drm/i915/gvt/gvt.c         |   1 +
 drivers/gpu/drm/i915/gvt/gvt.h         |  48 +-
 drivers/gpu/drm/i915/gvt/kvmgt.c       | 414 +++++++++++-
 drivers/gpu/drm/i915/gvt/migrate.c     | 863 +++++++++++++++++++++++++
 drivers/gpu/drm/i915/gvt/migrate.h     |  97 +++
 drivers/gpu/drm/i915/gvt/mmio.c        |  13 +
 drivers/gpu/drm/i915/gvt/mmio.h        |   1 +
 drivers/gpu/drm/i915/gvt/vgpu.c        |  11 +-
 include/uapi/linux/vfio.h              |  38 ++
 17 files changed, 1511 insertions(+), 50 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gvt/migrate.c
 create mode 100644 drivers/gpu/drm/i915/gvt/migrate.h

-- 
2.17.1