Naoya, would appreciate any comments you have as this seems to be related to your changes to add huge page migration support. On 1/14/20 5:07 PM, Mike Kravetz wrote: > 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings? > - Is that leftover from the the days when huge page migration was not > supported? > - Is it just because huge page migration is more likely to fail than > base page migration. According to man page, mbind was added to v2.6.7 kernel. git history does not go back that far, so I first looked at v2.6.12. Definitions from include/linux/mempolicy.h ------------------------------------------ /* Policies */ #define MPOL_DEFAULT 0 #define MPOL_PREFERRED 1 #define MPOL_BIND 2 #define MPOL_INTERLEAVE 3 /* Flags for mbind */ #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ Note the MPOL_MF_STRICT would simply verify current page placement. At this time, hugetlb pages were not part of this verification. >From v2.6.12 routine check_range() ... for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) { if (!vma->vm_next && vma->vm_end < end) return ERR_PTR(-EFAULT); if (prev && prev->vm_end < vma->vm_start) return ERR_PTR(-EFAULT); if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) { err = verify_pages(vma->vm_mm, vma->vm_start, vma->vm_end, nodes); if (err) { first = ERR_PTR(err); break; } } prev = vma; } ... man page history does not go back this far. The first mbind man page has some things that are interesting: 1) It DOES NOT have the note saying MPOL_MF_STRICT is ignored on huge page mappings (even though the code does this). 2) Support for huge page policy was added with 2.6.16 3) For MPOL_MF_STRICT, in 2.6.16 or later the kernel will also try to move pages to the requested node with this flag. Next look at v2.6.16 kernel ========================== This kernel introduces the MPOL_MF_MOVE* flags #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */ #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */ Once again, the up front check_range() routine will skip hugetlb vma's. Note that check_range() will also isolate pages. So since hugetlb vma's are skipped, no hugetlb pages will be isolated. Since no hugetlb pages are isolated, none will be on the list to be migrate. Therefore, hugetlb pages are effectively ignored for the new MPOL_MF_MOVE* flags. This makes sense as huge page migration support was not added until the v3.12 kernel. Note that at when MPOL_MF_MOVE* flags were added to the mbind man page, the statement "MPOL_MF_STRICT is ignored on huge page mappings right now." was added. This was later changed to "MPOL_MF_STRICT is ignored on huge page mappings." Next look at v3.12 kernel ========================= The v3.12 code looks similiar to today's code. In the verify/isolation phase, v3.12 routine queue_pages_hugetlb_pmd_range() looks very similiar to queue_pages_hugetlb(). Neither will cause errors at this point in the process. And, hugetlb pages with a mapcount == 1 will potentially be isolated and added to the list of pages to be migrated. In addition, if the subsequent call to migrate_pages() fails to migrate ANY page, we return -EIO if MPOL_MF_STRICT is set. This is true even if the only page(s) that failed to migrate were hugetlb pages. It should also be noted that no mbind man page updates were made WRT hugetlb pages after migration support was added. Summary ======= It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page mappings." is left over from the original mbind implementation. When the huge page migration support was added, I can not be sure if ignoring MPOL_MF_STRICT for huge pages during the verify/isolation phase was intentional. It seems like it was as the return value from isolate_huge_page() is ignored. What should we do? ================== 1) Nothing more than optimizations by Li Xinhai. Behavior that could be seen as conflicting with man page has existed since v3.12 and I am not aware of any complaints. 2) In addition to optimizations by Li Xinhai, modify code to truly ignore MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do after a failure of migrate_pages(). We could simply traverse the list of pages that were not migrated looking for any non-hugetlb page. 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings." and modify code accordingly. My suggestion would be for 1 or 2. Thoughts? -- Mike Kravetz