> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Wednesday, June 21, 2023 9:27 PM
>
> On Wed, Jun 21, 2023 at 06:49:12AM +0000, Tian, Kevin wrote:
>
> > What is the criteria for 'reasonable'? How does CSPs judge that such
> > device can guarantee a *reliable* reasonable window so live migration
> > can be enabled in the production environment?
>
> The CSP needs to work with the device vendor to understand how it fits
> into their system, I don't see how we can externalize this kind of
> detail in a general way.
>
> > I'm afraid that we are hiding a non-deterministic factor in current protocol.
>
> Yes
>
> > But still I don't think it's a good situation where the user has ZERO
> > knowledge about the non-negligible time in the stopping path...
>
> In any sane device design this will be a small period of time. These
> timeouts should be to protect against a device that has gone wild.
>

Any example of how 'small' it will be (e.g. <1ms)?

Should we define a *reasonable* threshold in the VFIO community which
any new variant driver should provide information to judge against?

If the worst-case stop time (assuming the device doesn't go wild) may
exceed the threshold, then it's time to consider whether a new interface
is required to communicate such a constraint to userspace.

The reason why I keep discussing it is that IMHO achieving negligible
stop time is a very challenging task for many accelerators. E.g. IDXD
can be stopped only after completing all the pending requests. While it
allows software to configure the max pending work size (and a reasonable
setting could meet both the migration SLA and the performance SLA), the
worst-case draining latency could be in the tens of milliseconds, which
cannot be ignored by the VMM.

Or do you think it's still better left to the CSP working with the
device vendor even in this case, given that the worst-case latency could
be affected by many factors and hence is not something which a kernel
driver can accurately estimate?

Thanks
Kevin