On Wed, Aug 4, 2021 at 1:28 AM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>
> On Mon, 2 Aug 2021, Yang Shi wrote:
> > On Sat, Jul 31, 2021 at 10:22 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> > > On Fri, 30 Jul 2021, Yang Shi wrote:
> > > > On Fri, Jul 30, 2021 at 12:42 AM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> > > > >
> > > > > Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> > > > > that a consistent set of checks can be applied, even when the inode is
> > > > > accessed through read/write syscalls (with NULL vma) instead of mmaps
> > > > > (the index argument is seldom of interest, but required by mount option
> > > > > "huge=within_size").  Clean up and rearrange the checks a little.
> > > > >
> > > > > This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> > > > > were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
> > > > > still true that khugepaged's collapse_file() at that point wants a small
> > > > > page, the race that might allocate it a huge page is too unlikely to be
> > > > > worth optimizing against (we are there *because* there was at least one
> > > > > small page in the way), and handled by a later PageTransCompound check.
> > > >
> > > > Yes, it seems too unlikely. But if it happens the PageTransCompound
> > > > check may not be good enough since the page allocated by
> > > > shmem_getpage() may be charged to the wrong memcg (root memcg). And it
> > > > won't be replaced by a newly allocated huge page so the wrong charge
> > > > can't be undone.
> > >
> > > Good point on the memcg charge: I hadn't thought of that.  Of course
> > > it's not specific to SGP_CACHE versus SGP_NOHUGE (this patch), but I
> > > admit that a huge mischarge is hugely worse than a small mischarge.
> >
> > The small page could be collapsed to a huge page sooner or later, so
> > the mischarge may be transient. But a huge page can't be replaced.
> >
> You're right, if all goes well, the mischarged small page could be
> collapsed to a correctly charged huge page sooner or later (but all
> may not go well), whereas the mischarged huge page is stuck there.
>
> >
> > >
> > > We could fix it by making shmem_getpage_gfp() non-static, and pointing
> > > to the vma (hence its mm, hence its memcg) here, couldn't we?  Easily
> > > done, but I don't really want to make shmem_getpage_gfp() public just
> > > for this, for two reasons.
> > >
> > > One is that the huge race is just so unlikely; and a mischarge to root
> > > is not the end of the world, so long as it's not reproducible.  It can
> > > only happen on the very first page of the huge extent, and the prior
> >
> > OK, if so the mischarge is not as bad as what I thought in the first place.
> >
> > > "Stop if extent has been truncated" check makes sure there was one
> > > entry in the extent at that point: so the race with hole-punch can only
> > > occur after we xas_unlock_irq(&xas) immediately before shmem_getpage()
> > > looks up the page in the tree (and I say hole-punch not truncate,
> > > because shmem_getpage()'s i_size check will reject when truncated).
> > > I don't doubt that it could happen, but stand by not optimizing against.
> >
> > I agree the race is so unlikely and it may not be worth optimizing
> > against it right now, but a note or a comment may be worthwhile.
>
> Thanks, but despite us agreeing that the race is too unlikely to be worth
> optimizing against, it does still nag at me ever since you questioned it:
> silly, but I can't quite be convinced by my own dismissals.
>
> I do still want to get rid of SGP_HUGE and SGP_NOHUGE, clearing up those
> huge allocation decisions remains the intention; but now think to add
> SGP_NOALLOC for collapse_file() in place of SGP_NOHUGE or SGP_CACHE -
> to rule out that possibility of mischarge after racing hole-punch,
> no matter whether it's huge or small.
> If any such race occurs,
> collapse_file() should just give up.
>
> This being the "Stupid me" SGP_READ idea, except that of course would
> not work: because half the point of that block in collapse_file() is
> to initialize the !Uptodate pages, whereas SGP_READ avoids doing so.
>
> There is, of course, the danger that in fixing this unlikely mischarge,
> I've got the code wrong and am introducing a bug: here's what a 17/16
> would look like, though it will be better inserted early.  I got sick
> of all the "if (page "s, and was glad of the opportunity to fix that
> outdated "bring it back from swap" comment - swap got done above.
>
> What do you think?  Should I add this in or leave it out?

Thanks for continuing to investigate this. The patch looks good to me. I
think we could go this way. Just a nit below.

>
> Thanks,
> Hugh
>
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -108,6 +108,7 @@ extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
>  /* Flag allocation requirements to shmem_getpage */
>  enum sgp_type {
>  	SGP_READ,	/* don't exceed i_size, don't allocate page */
> +	SGP_NOALLOC,	/* like SGP_READ, but do use fallocated page */

The comment looks misleading: it seems SGP_NOALLOC does clear the
Uptodate flag but SGP_READ doesn't. Or is it fine not to distinguish
this difference?

>  	SGP_CACHE,	/* don't exceed i_size, may allocate page */
>  	SGP_WRITE,	/* may exceed i_size, may allocate !Uptodate page */
>  	SGP_FALLOC,	/* like SGP_WRITE, but make existing page Uptodate */
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1721,7 +1721,7 @@ static void collapse_file(struct mm_struct *mm,
>  				xas_unlock_irq(&xas);
>  				/* swap in or instantiate fallocated page */
>  				if (shmem_getpage(mapping->host, index, &page,
> -						  SGP_CACHE)) {
> +						  SGP_NOALLOC)) {
>  					result = SCAN_FAIL;
>  					goto xa_unlocked;
>  				}
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1903,26 +1903,27 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  		return error;
>  	}
>
> -	if (page)
> +	if (page) {
>  		hindex = page->index;
> -	if (page && sgp == SGP_WRITE)
> -		mark_page_accessed(page);
> -
> -	/* fallocated page? */
> -	if (page && !PageUptodate(page)) {
> +		if (sgp == SGP_WRITE)
> +			mark_page_accessed(page);
> +		if (PageUptodate(page))
> +			goto out;
> +		/* fallocated page */
>  		if (sgp != SGP_READ)
>  			goto clear;
>  		unlock_page(page);
>  		put_page(page);
> -		page = NULL;
> -		hindex = index;
>  	}
> -	if (page || sgp == SGP_READ)
> -		goto out;
> +
> +	*pagep = NULL;
> +	if (sgp == SGP_READ)
> +		return 0;
> +	if (sgp == SGP_NOALLOC)
> +		return -ENOENT;
>
>  	/*
> -	 * Fast cache lookup did not find it:
> -	 * bring it back from swap or allocate.
> +	 * Fast cache lookup and swap lookup did not find it: allocate.
>  	 */
>
>  	if (vma && userfaultfd_missing(vma)) {
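
To make the nit above concrete, here is a small userspace sketch of how the quoted mm/shmem.c hunk appears to dispatch on sgp once the cache and swap lookups are done. It is only a toy model, not kernel code: enum sgp_type mirrors the patch, but slot_state, outcome, sgp_outcome() and sgp_model_selfcheck() are hypothetical names made up for illustration, and whatever happens at the "clear" label lies outside the quoted hunk, so it is modeled only as "takes the clear path".

```c
#include <assert.h>

/* Mirrors enum sgp_type as it would look with the patch applied. */
enum sgp_type { SGP_READ, SGP_NOALLOC, SGP_CACHE, SGP_WRITE, SGP_FALLOC };

/* Hypothetical: what the cache/swap lookup found at this index. */
enum slot_state {
	SLOT_HOLE,		/* no page found */
	SLOT_FALLOCATED,	/* page found, but !Uptodate */
	SLOT_UPTODATE		/* page found and Uptodate */
};

/* Hypothetical labels for the branches in the quoted hunk. */
enum outcome {
	RETURN_PAGE,		/* "goto out" with the existing page */
	TAKE_CLEAR_PATH,	/* "goto clear" (label not shown in the hunk) */
	RETURN_NULL,		/* *pagep = NULL; return 0 */
	FAIL_ENOENT,		/* *pagep = NULL; return -ENOENT */
	ALLOCATE		/* fall through to the allocation code */
};

/* Follows the order of the checks in the patched shmem_getpage_gfp(). */
enum outcome sgp_outcome(enum sgp_type sgp, enum slot_state slot)
{
	if (slot != SLOT_HOLE) {
		if (slot == SLOT_UPTODATE)
			return RETURN_PAGE;
		/* fallocated page: only SGP_READ declines to use it */
		if (sgp != SGP_READ)
			return TAKE_CLEAR_PATH;
		/* SGP_READ unlocks and drops the page, then falls through */
	}

	if (sgp == SGP_READ)
		return RETURN_NULL;
	if (sgp == SGP_NOALLOC)
		return FAIL_ENOENT;
	return ALLOCATE;
}

/* The two cases the discussion turns on. */
void sgp_model_selfcheck(void)
{
	/* Hole-punch raced with collapse_file(): give up, no allocation. */
	assert(sgp_outcome(SGP_NOALLOC, SLOT_HOLE) == FAIL_ENOENT);
	/* Fallocated page: SGP_NOALLOC uses it, SGP_READ does not. */
	assert(sgp_outcome(SGP_NOALLOC, SLOT_FALLOCATED) == TAKE_CLEAR_PATH);
	assert(sgp_outcome(SGP_READ, SLOT_FALLOCATED) == RETURN_NULL);
}
```

In this model SGP_NOALLOC differs from SGP_READ in two places: it takes the clear path for a fallocated page, and it fails with -ENOENT on a hole instead of returning success with a NULL page. Both differences may be worth reflecting in the enum comment.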