The patch titled Subject: mm: migrate: try again if THP split is failed due to page refcnt has been added to the -mm mm-unstable branch. Its filename is mm-migrate-try-again-if-thp-split-is-failed-due-to-page-refcnt.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-migrate-try-again-if-thp-split-is-failed-due-to-page-refcnt.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> Subject: mm: migrate: try again if THP split is failed due to page refcnt Date: Fri, 21 Oct 2022 18:16:24 +0800 When creating a virtual machine, we will use memfd_create() to get a file descriptor which can be used to create shared memory mappings using mmap(). The mmap() will set the MAP_POPULATE flag to allocate physical pages for the virtual machine. When allocating physical pages for the guest, the host can fall back to allocate some CMA pages for the guest when over half of the zone's free memory is in the CMA area. In guest os, when the application wants to do some data transaction with DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and create IOMMU mappings for the DMA pages. However, when calling VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will fail to longterm-pin sometimes. After some invetigation, we found the pages used to do DMA mapping can contain some CMA pages, and these CMA pages will cause a possible failure of the longterm-pin, due to failure to migrate the CMA pages. The reason for migration failure may be a temporary reference count or memory allocation failure. So that will cause the VFIO_IOMMU_MAP_DMA ioctl to return an error, which makes the application fail to start. I observed one migration failure case (which is not easy to reproduce) is that, the 'thp_migration_fail' count is 1 and the 'thp_split_page_failed' count is also 1. That means when migrating a THP which is in a CMA area, but cannot allocate a new THP due to memory fragmentation, it will split the THP. However THP split also fails, probably the reason is a temporary reference count of this THP. And the temporary reference count can be caused by dropping page caches (I observed the drop caches operation in the system), but we cannot drop the shmem page caches because they are already dirty at that time. Especially for THP split failure, which is caused by temporary reference count, we can try again to mitigate the failure of migration in this case according to previous discussion [1]. [1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@xxxxxxxxxx/ Link: https://lkml.kernel.org/r/88831f1764c8fbd5b5fdad27cd5ae3d2ca796e44.1666335603.git.baolin.wang@xxxxxxxxxxxxxxxxx Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> Cc: Alistair Popple <apopple@xxxxxxxxxx> Cc: David Hildenbrand <david@xxxxxxxxxx> Cc: "Huang, Ying" <ying.huang@xxxxxxxxx> Cc: Yang Shi <shy828301@xxxxxxxxx> Cc: Zi Yan <ziy@xxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- mm/huge_memory.c | 4 ++-- mm/migrate.c | 19 ++++++++++++++++--- 2 files changed, 18 insertions(+), 5 deletions(-) --- a/mm/huge_memory.c~mm-migrate-try-again-if-thp-split-is-failed-due-to-page-refcnt +++ a/mm/huge_memory.c @@ -2666,7 +2666,7 @@ int split_huge_page_to_list(struct page * split PMDs */ if (!can_split_folio(folio, &extra_pins)) { - ret = -EBUSY; + ret = -EAGAIN; goto out_unlock; } @@ -2716,7 +2716,7 @@ fail: xas_unlock(&xas); local_irq_enable(); remap_page(folio, folio_nr_pages(folio)); - ret = -EBUSY; + ret = -EAGAIN; } out_unlock: --- a/mm/migrate.c~mm-migrate-try-again-if-thp-split-is-failed-due-to-page-refcnt +++ a/mm/migrate.c @@ -1506,9 +1506,22 @@ thp_subpage_migration: if (is_thp) { nr_thp_failed++; /* THP NUMA faulting doesn't split THP to retry. */ - if (!nosplit && !try_split_thp(page, &thp_split_pages)) { - nr_thp_split++; - break; + if (!nosplit) { + int ret = try_split_thp(page, &thp_split_pages); + + if (!ret) { + nr_thp_split++; + break; + } else if (reason == MR_LONGTERM_PIN && + ret == -EAGAIN) { + /* + * Try again to split THP to mitigate + * the failure of longterm pinning. + */ + thp_retry++; + nr_retry_pages += nr_subpages; + break; + } } } else if (!no_subpage_counting) { nr_failed++; _ Patches currently in -mm which might be from baolin.wang@xxxxxxxxxxxxxxxxx are mm-migrate-fix-return-value-if-all-subpages-of-thps-are-migrated-successfully.patch mm-migrate-try-again-if-thp-split-is-failed-due-to-page-refcnt.patch