> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Tuesday, June 27, 2023 2:14 AM
>
> On Mon, Jun 26, 2023 at 07:31:31AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Wednesday, June 21, 2023 9:27 PM
> > >
> > > On Wed, Jun 21, 2023 at 06:49:12AM +0000, Tian, Kevin wrote:
> > >
> > > > What is the criteria for 'reasonable'? How does CSPs judge that such
> > > > device can guarantee a *reliable* reasonable window so live migration
> > > > can be enabled in the production environment?
> > >
> > > The CSP needs to work with the device vendor to understand how it fits
> > > into their system, I don't see how we can externalize this kind of
> > > detail in a general way.
> > >
> > > > I'm afraid that we are hiding a non-deterministic factor in current
> > > > protocol.
> > >
> > > Yes
> > >
> > > > But still I don't think it's a good situation where the user has ZERO
> > > > knowledge about the non-negligible time in the stopping path...
> > >
> > > In any sane device design this will be a small period of time. These
> > > timeouts should be to protect against a device that has gone wild.
> > >
> >
> > Any example how 'small' it will be (e.g. <1ms)?
>
> Not personally..
>
> > Should we define a *reasonable* threshold in VFIO community which
> > any new variant driver should provide information to judge against?
>
> Ah, I think we are just too new to get into such details. I think we
> need some real world experience to see if this is really an issue.
>
> > The reason why I keep discussing it is that IMHO achieving negligible
> > stop time is a very challenging task for many accelerators. e.g. IDXD
> > can be stopped only after completing all the pending requests. While
> > it allows software to configure the max pending work size (and a
> > reasonable setting could meet both migration SLA and performance
> > SLA) the worst-case draining latency could be in 10's milliseconds which
> > cannot be ignored by the VMM.
>
> Well, what would you report here if you had the opportunity to report
> something? Some big number? Then what?
>
> > Or do you think it's still better left to CSP working with the device vendor
> > even in this case, given the worst-case latency could be affected by
> > many factors hence not something which a kernel driver can accurately
> > estimate?
>
> This is my fear, that it is so complicated that reducing it to any
> sort of cross-vendor data is not feasible. At least I'd like to see
> someone experiment with what information would be useful to qemu
> before we add kernel ABI..
>

OK. make sense.
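
For reference, the draining concern quoted above can be put into rough numbers. This is only a back-of-envelope sketch: the function names, the linear model (pending work size times per-request completion time), and every number below are hypothetical and are not taken from the IDXD driver, VFIO, or QEMU.

```python
# Hypothetical back-of-envelope model; not an IDXD, VFIO, or QEMU API.

def worst_case_drain_ms(max_pending: int, per_request_us: float) -> float:
    """Upper bound on stop latency, assuming every pending request must
    complete before the device stops and requests complete one at a time."""
    return max_pending * per_request_us / 1000.0


def fits_downtime_budget(drain_ms: float, other_stop_ms: float,
                         budget_ms: float) -> bool:
    """True if draining plus the rest of the stop phase fits the budget."""
    return drain_ms + other_stop_ms <= budget_ms


# Made-up inputs: 128 pending requests, 250 us each in the worst case.
drain = worst_case_drain_ms(max_pending=128, per_request_us=250.0)
print(f"worst-case drain: {drain:.1f} ms")
print("fits 300 ms budget:",
      fits_downtime_budget(drain, other_stop_ms=200.0, budget_ms=300.0))
print("fits 220 ms budget:",
      fits_downtime_budget(drain, other_stop_ms=200.0, budget_ms=220.0))
```

Even in this toy model the result hinges on the per-request time, which depends on the workload. That is the same reason given in the thread for why a kernel driver may not be able to estimate it accurately.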