Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device

Max Gurtovoy <mgurtovoy@xxxxxxxxxx> · Sun, 11 Dec 2022 16:51:02 +0200

On 12/11/2022 3:21 PM, Rao, Lei wrote:

On 12/11/2022 8:05 PM, Max Gurtovoy wrote:

On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
use.
Beware: the secondary function might just as well be a physical
function.  In fact one of the major customers for "smart" multifunction
nvme devices prefers multi-PF devices over SR-IOV VFs (and all the
symmetric dual ported devices are multi-PF as well).

So this isn't really about a VF life cycle, but how to manage live
migration, especially on the receive / restore side.  And restoring
the entire controller state is extremely invasive and can't be done
on a controller that is in any classic form live.  In fact a lot
of the state is subsystem-wide, so without some kind of virtualization
of the subsystem it is impossible to actually restore the state.

Oh, great!

I read this subsystem virtualization proposal of yours after I sent 
my proposal for subsystem virtualization in patch 1/5 thread.
I guess this means that this is the right way to go.
Let's continue brainstorming this idea. I think this can be the way to 
migrate NVMe controllers in a standard way.

To cycle back to the hardware that is posted here, I'm really confused
how it actually has any chance to work and no one has even tried
to explain how it is supposed to work.

I guess that in a vendor-specific implementation you can assume some of
the things that we are now discussing how to standardize.

Yes, as I wrote in the cover letter, this is a reference implementation
to start a discussion and help drive standardization efforts, but this
series works well for Intel IPU NVMe. As Jason said, there are two use
cases: shared medium and local medium. I think the live migration of the
local medium is complicated due to the large amount of user data that
needs to be migrated. I don't have a good idea to deal with this
situation. But for Intel IPU NVMe, each VF can connect to remote storage
via the NVMF protocol to achieve storage offloading. This is the shared
medium. In this case, we don't need to migrate the user data, which will
significantly simplify the work of live migration.

I don't think that medium migration should be part of the spec. We can
state explicitly that it is out of scope.

The whole idea of live migration is to have a short downtime, and I
don't think we can guarantee a short downtime if we need to copy a few
terabytes through the network.
If the media copy takes a few seconds, there is no point in doing a live
migration with a few milliseconds of downtime. Just do a regular
migration of the VM.
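
To put rough numbers on the downtime argument above, here is a minimal
sketch of the ideal transfer time for a local medium. The sizes and link
speeds are illustrative assumptions, not figures from this thread, and
the function name is hypothetical. Protocol overhead and storage
throughput are ignored, so real copies would be slower.

```python
def transfer_seconds(size_bytes: float, link_gbps: float) -> float:
    """Ideal time to move size_bytes over a link of link_gbps.

    Assumes the link is fully used and ignores protocol overhead.
    """
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9)


TB = 1e12  # decimal terabyte, in bytes

# Assumed example points: (medium size in TB, link speed in Gb/s).
for size_tb, gbps in [(1, 25), (2, 100), (4, 100)]:
    secs = transfer_seconds(size_tb * TB, gbps)
    print(f"{size_tb} TB over {gbps} Gb/s: at least {secs:.0f} s")
```

Even under these optimistic assumptions the copy takes minutes, which
is orders of magnitude above a millisecond-scale downtime budget.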

The series tries to solve the problem of live migration of a shared
medium. But it still lacks dirty page tracking and P2P support; we are
also developing these features.

About the nvme device state: as described in my document, the VF state
includes the VF CSR registers, the state of every IO queue pair, and the
AdminQ state. During the implementation, I found that the device state
data is small per VF. So, I decided to use the admin queue of the
primary controller to send the live migration commands to save and
restore the VF states, like MLX5.

I think and hope we all agree that the AdminQ of the controlling NVMe 
function will be used to migrate the controlled NVMe function.
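
As an illustration of why the per-VF state can be small, here is a
minimal sketch of the three pieces listed above (CSR registers, AdminQ
state, per-IO-queue-pair state) packed into one blob that a save command
could return and a restore command could take back. Every name, field
and layout below is a hypothetical assumption; neither the series nor
any spec defines this format.

```python
import struct
from dataclasses import dataclass, field
from typing import List

# Hypothetical wire layout, little-endian, no padding.
HDR_FMT = "<IH"  # CSR length in bytes, number of queues (admin first)
QP_FMT = "<5H"   # qid, sq_head, sq_tail, cq_head, cq_tail


@dataclass
class QueuePairState:
    # Assumed minimal per-queue state; a real device would carry more.
    qid: int
    sq_head: int
    sq_tail: int
    cq_head: int
    cq_tail: int


@dataclass
class VFState:
    csr: bytes                    # raw snapshot of the VF CSR registers
    admin_q: QueuePairState       # AdminQ state
    io_qps: List[QueuePairState] = field(default_factory=list)

    def pack(self) -> bytes:
        """Serialize to the blob a hypothetical 'save' would return."""
        queues = [self.admin_q] + self.io_qps
        out = struct.pack(HDR_FMT, len(self.csr), len(queues)) + self.csr
        for q in queues:
            out += struct.pack(QP_FMT, q.qid, q.sq_head, q.sq_tail,
                               q.cq_head, q.cq_tail)
        return out

    @classmethod
    def unpack(cls, blob: bytes) -> "VFState":
        """Rebuild the state from the blob given to a 'restore'."""
        csr_len, nq = struct.unpack_from(HDR_FMT, blob, 0)
        off = struct.calcsize(HDR_FMT)
        csr = blob[off:off + csr_len]
        off += csr_len
        queues = []
        for _ in range(nq):
            queues.append(QueuePairState(*struct.unpack_from(QP_FMT, blob, off)))
            off += struct.calcsize(QP_FMT)
        return cls(csr=csr, admin_q=queues[0], io_qps=queues[1:])


# Example: an assumed 64-byte CSR snapshot, one AdminQ and four IO queues.
state = VFState(
    csr=bytes(64),
    admin_q=QueuePairState(0, 3, 3, 3, 3),
    io_qps=[QueuePairState(i, 10, 12, 9, 10) for i in range(1, 5)],
)
blob = state.pack()
restored = VFState.unpack(blob)
print(len(blob), restored == state)
```

The round trip only shows that the state is self-contained and compact;
how the controlling function quiesces the controlled function before
saving is outside this sketch.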

Which document are you referring to?

Thanks,
Lei