On 12/11/2024 23:01, Alex Williamson wrote:
On Tue, 12 Nov 2024 10:37:27 +0200
Yishai Hadas <yishaih@xxxxxxxxxx> wrote:
diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
new file mode 100644
index 000000000000..3d5eaa1cbcdb
--- /dev/null
+++ b/drivers/vfio/pci/virtio/migrate.c
...
+static int virtiovf_add_migration_pages(struct virtiovf_data_buffer *buf,
+					 unsigned int npages)
+{
+	unsigned int to_alloc = npages;
+	struct page **page_list;
+	unsigned long filled;
+	unsigned int to_fill;
+	int ret;
+
+	to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
+	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL_ACCOUNT);
+	if (!page_list)
+		return -ENOMEM;
+
+	do {
+		filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_fill,
+						page_list);
+		if (!filled) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		to_alloc -= filled;
+		ret = sg_alloc_append_table_from_pages(&buf->table, page_list,
+				filled, 0, filled << PAGE_SHIFT, UINT_MAX,
+				SG_MAX_SINGLE_ALLOC, GFP_KERNEL_ACCOUNT);
+
+		if (ret)
+			goto err;
Cleanup here has me a bit concerned. I see this pattern in
mlx5vf_add_migration_pages() as well. IIUC, if we hit the previous
ENOMEM condition we simply free page_list and return error because any
pages we've already added to the sg table will be freed in the next
function below. But what happens here? It looks like we've allocated
a bunch of pages, failed to add them to the sg table, and we drop them
on the floor. Am I wrong?
You are right, thanks for pointing that out.
In V4 we may have the extra chunk below [1].
[1]
index 70fae7de11e9..aaada7abfffb 100644
--- a/drivers/vfio/pci/virtio/migrate.c
+++ b/drivers/vfio/pci/virtio/migrate.c
@@ -69,6 +69,7 @@ static int virtiovf_add_migration_pages(struct virtiovf_data_buffer *buf,
 	unsigned long filled;
 	unsigned int to_fill;
 	int ret;
+	int i;
 
 	to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
 	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL_ACCOUNT);
@@ -88,7 +89,7 @@ static int virtiovf_add_migration_pages(struct virtiovf_data_buffer *buf,
 				SG_MAX_SINGLE_ALLOC, GFP_KERNEL_ACCOUNT);
 
 		if (ret)
-			goto err;
+			goto err_append;
 		buf->allocated_length += filled * PAGE_SIZE;
 		/* clean input for another bulk allocation */
 		memset(page_list, 0, filled * sizeof(*page_list));
@@ -99,6 +100,9 @@ static int virtiovf_add_migration_pages(struct virtiovf_data_buffer *buf,
 	kvfree(page_list);
 	return 0;
 
+err_append:
+	for (i = filled - 1; i >= 0; i--)
+		__free_page(page_list[i]);
 err:
 	kvfree(page_list);
 	return ret;
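To make the end result easier to follow, with the chunk above applied the
function should read roughly as below (reassembled here from the quoted
hunks; whitespace approximated, not a verbatim copy of the V4 code):

static int virtiovf_add_migration_pages(struct virtiovf_data_buffer *buf,
					unsigned int npages)
{
	unsigned int to_alloc = npages;
	struct page **page_list;
	unsigned long filled;
	unsigned int to_fill;
	int ret;
	int i;

	to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL_ACCOUNT);
	if (!page_list)
		return -ENOMEM;

	do {
		filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_fill,
						page_list);
		if (!filled) {
			/* Pages already appended to the sg table are
			 * freed later by the buffer free path. */
			ret = -ENOMEM;
			goto err;
		}
		to_alloc -= filled;
		ret = sg_alloc_append_table_from_pages(&buf->table, page_list,
				filled, 0, filled << PAGE_SHIFT, UINT_MAX,
				SG_MAX_SINGLE_ALLOC, GFP_KERNEL_ACCOUNT);
		if (ret)
			goto err_append;
		buf->allocated_length += filled * PAGE_SIZE;
		/* clean input for another bulk allocation */
		memset(page_list, 0, filled * sizeof(*page_list));
		to_fill = min_t(unsigned int, to_alloc,
				PAGE_SIZE / sizeof(*page_list));
	} while (to_alloc > 0);

	kvfree(page_list);
	return 0;

err_append:
	/* These pages never made it into the sg table; free them here. */
	for (i = filled - 1; i >= 0; i--)
		__free_page(page_list[i]);
err:
	kvfree(page_list);
	return ret;
}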
Regarding mlx5vf_add_migration_pages(), I may need to send a matching
patch. However, that area is about to be reworked as part of the DMA
API series from Leon, so a fix might not apply to upcoming kernels,
only to older ones. I may hold off on mlx5 for the moment and see
when/what to send here.
Note:
While I was on that, enforcing/testing the error flows, I fixed another
unwind path in virtiovf_pci_save/resume_device_data(): we should not
kfree() the migf at the 'end' label, since it is already freed as part
of fput(migf->filp) -> virtiovf_release_file().
This will be part of V4, and I'll send a matching fix for mlx5.
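In other words, something along these lines (a minimal sketch of the
pattern only, assuming the usual anon_inode_getfile() setup; the
virtiovf_save_fops and virtiovf_read_device_state() names are
illustrative, not the actual code):

static struct virtiovf_migration_file *
virtiovf_pci_save_device_data(struct virtiovf_pci_core_device *virtvdev)
{
	struct virtiovf_migration_file *migf;
	int ret;

	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
	if (!migf)
		return ERR_PTR(-ENOMEM);

	migf->filp = anon_inode_getfile("virtiovf_mig", &virtiovf_save_fops,
					migf, O_RDONLY);
	if (IS_ERR(migf->filp)) {
		ret = PTR_ERR(migf->filp);
		/* No file exists yet, so a plain kfree() is still correct. */
		kfree(migf);
		return ERR_PTR(ret);
	}

	/* Hypothetical helper standing in for the actual state read. */
	ret = virtiovf_read_device_state(virtvdev, migf);
	if (ret)
		goto end;

	return migf;

end:
	/*
	 * Do not kfree(migf) here: fput() drops the last reference and
	 * virtiovf_release_file() frees migf, so freeing it here as
	 * well would be a double free.
	 */
	fput(migf->filp);
	return ERR_PTR(ret);
}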
+		buf->allocated_length += filled * PAGE_SIZE;
+		/* clean input for another bulk allocation */
+		memset(page_list, 0, filled * sizeof(*page_list));
+		to_fill = min_t(unsigned int, to_alloc,
+				PAGE_SIZE / sizeof(*page_list));
+	} while (to_alloc > 0);
+
+	kvfree(page_list);
+	return 0;
+
+err:
+	kvfree(page_list);
+	return ret;
If ret were initialized to zero it looks like we wouldn't need
duplicate code for the success path, but that might change if we need
more unwind for the above. Thanks,
Yes, given the extra unwind flow above, this will stay as it is now.
Thanks,
Yishai