> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Tuesday, June 20, 2023 8:31 PM
>
> On Tue, Jun 20, 2023 at 02:02:44AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Monday, June 19, 2023 8:47 PM
> > >
> > > On Fri, Jun 16, 2023 at 08:06:21AM +0000, Tian, Kevin wrote:
> > >
> > > > Ideally the VMM has an estimation how long a VM can be paused based on
> > > > SLA, to-be-migrated state size, available network bandwidth, etc. and that
> > > > hint should be passed to the kernel so any state transition which may violate
> > > > that expectation can fail quickly to break the migration process and put the
> > > > VM back to the running state.
> > > >
> > > > Jason/Shameer, is there similar concern in mlx/hisilicon drivers?
> > >
> > > It is handled through the vfio_device_feature_mig_data_size mechanism..
> >
> > that is only for estimation of copied data.
> >
> > IMHO the stop time when the VM is paused includes both the time of
> > stopping the device and the time of migrating the VM state.
> >
> > For a software-emulated device the time of stopping the device is negligible.
> >
> > But certainly for assigned device the worst-case hard-coded 5s timeout as
> > done in this patch will kill whatever reasonable 'VM dead time' SLA (usually
> > in milliseconds) which CSPs try to meet purely based on the size of copied
> > data.
>
> There is not alot that can be done here, the stop time cannot be
> predicted in advance on these devices - the system relies on the
> device having a reasonable time window.

What are the criteria for 'reasonable'? How do CSPs judge that such a device
can guarantee a *reliable* reasonable window so live migration can be enabled
in the production environment?

I'm afraid that we are hiding a non-deterministic factor in the current protocol.
Looking at the mlx5 case, which has an even larger timeout:

	[MLX5_TO_CMD_MS] = 60000,

>
> > Wouldn't a user-specified stop-device timeout be required to at least allow
> > breaking migration early according to the desired SLA?
>
> Not really, the device is going to still execute the stop regardless
> of the timeout, and when it does the VM will be broken.
>
> With a FW approach like this it is pretty stuck, we need the FW to
> remain in sync as the highest priority.

This makes some sense. But I still don't think it's a good situation when
the user has ZERO knowledge about the non-negligible time in the stopping
path...

>
> > > We want new devices to get their architecture right, they need to
> > > support P2P. Didn't we talk about this already and Brett was going to
> > > fix it?
> >
> > Looks it's not fixed since RUNNING_P2P->STOP is a nop in this patch.
>
> That could be OK, it needs a comment explaining why it is OK
>

Yes, a comment is welcome. Having RUNNING_P2P->STOP as a nop kind of suggests
that the device has already been fully stopped in RUNNING_P2P, so that it
meets the definition of the STOP state. But then it violates the definition
of RUNNING_P2P.
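
To put rough numbers on the stop-time concern raised earlier in this mail,
here is a minimal userspace sketch in C. It is only an illustration: the
helper names, the 64 MiB state size, the ~10 Gbit/s link and the 300 ms SLA
are assumptions of mine, not values taken from the patch or from any VFIO
uAPI. Only the 5000 ms and 60000 ms timeouts come from the discussion above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel or VFIO interface): worst-case VM
 * dead time, modelled as the device stop timeout plus the time needed
 * to copy the migration state over the link.
 */
static uint64_t worst_case_downtime_ms(uint64_t stop_timeout_ms,
				       uint64_t state_bytes,
				       uint64_t bytes_per_ms)
{
	uint64_t copy_ms;

	if (bytes_per_ms == 0)
		return UINT64_MAX;	/* no bandwidth: never completes */

	/* Round the copy time up to a whole millisecond. */
	copy_ms = state_bytes / bytes_per_ms +
		  (state_bytes % bytes_per_ms != 0);

	return stop_timeout_ms + copy_ms;
}

/* Hypothetical check a VMM could make before pausing the VM. */
static bool fits_sla(uint64_t downtime_ms, uint64_t sla_ms)
{
	return downtime_ms <= sla_ms;
}
```

With the assumed inputs (64 MiB of state at 1250000 bytes per ms), the copy
alone takes 54 ms and fits a 300 ms budget. The worst case becomes 5054 ms
with the 5 s timeout in this patch and 60054 ms with the 60000 ms mlx5
command timeout quoted above, so the stop timeout dominates an estimate
that is otherwise based purely on the size of copied data.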