Re: [PATCH RFC v2 1/2] mm: Don't skip swap entry even if zap_details specified

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jan 12, 2022 at 09:18:09PM +0800, Peter Xu wrote:
> On Sat, Jan 08, 2022 at 05:19:04PM -0800, Hugh Dickins wrote:
> > On Mon, 15 Nov 2021, Peter Xu wrote:
> > 
> > > This check existed since the 1st git commit of Linux repository, but at that
> > > time there's no page migration yet so I think it's okay.
> > 
> > //git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
> > helps with the history.  Yes, the check was okay back then,
> > but a lot of changes have come in since: I'll tell more of those below.
> 
> Thanks for looking at this.  By the way, the link is greatly helpful. It's
> always good to be able to read into the history.
> 
> > 
> > You are absolutely right to clean this up and fix the bugs that
> > have crept in, but I think the patch itself is not quite right yet.
> 
> Do you mean the pmd path on checking mapping?  If so I agree, and I'll add that
> in v2 (even if as you pointed out that shouldn't be a real bug, iiuc, as you
> analyzed below).
> 
> Let me know if I missed anything else..
> 
> > 
> > > 
> > > With page migration enabled, it should logically be possible that we zap some
> > > shmem pages during migration.
> > 
> > Yes.
> > 
> > > When that happens, IIUC the old code could have
> > > the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
> > > without decreasing the counters for the migrating entries.  I have no unit test
> > > to prove it as I don't know an easy way to trigger this condition, though.
> > 
> > In the no-details case, yes, it does look like that. I ought to try
> > and reproduce that, but responding to mails seems more important.
> 
> Please let me know if you know how to reproduce it, since I don't know yet a
> real reproducer.
> 
> What I can do, though, is if with below patch applied to current linux master:
> 
> =============
> diff --git a/mm/memory.c b/mm/memory.c
> index 8f1de811a1dc..51fe02a22ea9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1383,8 +1383,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                 }
>  
>                 /* If details->check_mapping, we leave swap entries. */
> -               if (unlikely(details))
> +               if (unlikely(details)) {
> +                       WARN_ON_ONCE(is_migration_entry(entry));
>                         continue;
> +               }
>  
>                 if (!non_swap_entry(entry))
>                         rss[MM_SWAPENTS]--;
> =============
> 
> Then I can easily trigger it if with the help of a test program attached
> (zap-migrate.c).

(Attaching for real; also copy Matthew)

> 
> IMHO it won't really reproduce because it seems all relevant shmem operations
> (e.g. hole punch, unmap..) will take the page lock so it won't really race with
> migration (which needs the page lock too), as mentioned in previous cover
> letters when I was still working on the old versions of this.  But it could be
> possible that I missed something.
> 
> While the WARN_ON_ONCE() can trigger only because of the fast path in hole
> punching we have the pre-unmap without lock:
> 
> 		if ((u64)unmap_end > (u64)unmap_start)
> 			unmap_mapping_range(mapping, unmap_start,
> 					    1 + unmap_end - unmap_start, 0);
> 		shmem_truncate_range(inode, offset, offset + len - 1);
> 
> But that will be fixed up right afterwards in shmem_truncate_range(), afaict,
> which is with-lock.  Hence we have a small window to at least trigger the
> warning above, even if it won't really mess up the accounting finally.
> 
> > 
> > > 
> > > Besides, the optimization itself is already confusing IMHO to me in a few points:
> > 
> > It is confusing and unnecessary and wrong, I agree.
> > 
> > > 
> > >   - The wording "skip swap entries" is confusing, because we're not skipping all
> > >     swap entries - we handle device private/exclusive pages before that.
> > 
> > I'm entirely ignorant of device pages, so cannot comment on your 2/2,
> > but of course it's good if you can bring the cases closer together.
> > 
> > > 
> > >   - The skip behavior is enabled as long as zap_details pointer passed over.
> > >     It's very hard to figure that out for a new zap caller because it's unclear
> > >     why we should skip swap entries when we have zap_details specified.
> > 
> > History below will clarify that.
> > 
> > > 
> > >   - With modern systems, especially performance critical use cases, swap
> > >     entries should be rare, so I doubt the usefulness of this optimization
> > >     since it should be on a slow path anyway.
> > > 
> > >   - It is not aligned with what we do with huge pmd swap entries, where in
> > >     zap_huge_pmd() we'll do the accounting unconditionally.
> > 
> > The patch below does not align with what's done in zap_huge_pmd() either;
> > but I think zap_huge_pmd() is okay without "details" because its only swap
> > entries are migration entries, and we do not use huge pages when COWing
> > from file huge pages.
> > 
> > > 
> > > This patch drops that trick, so we handle swap ptes coherently.  Meanwhile we
> > > should do the same mapping check upon migration entries too.
> > > 
> > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > > ---
> > >  mm/memory.c | 6 ++----
> > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8f1de811a1dc..e454f3c6aeb9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > >  			continue;
> > >  		}
> > >  
> > > -		/* If details->check_mapping, we leave swap entries. */
> > > -		if (unlikely(details))
> > > -			continue;
> > > -
> > 
> > Good.
> > 
> > >  		if (!non_swap_entry(entry))
> > >  			rss[MM_SWAPENTS]--;
> > >  		else if (is_migration_entry(entry)) {
> > >  			struct page *page;
> > >  
> > >  			page = pfn_swap_entry_to_page(entry);
> > > +			if (unlikely(zap_skip_check_mapping(details, page)))
> > > +				continue;
> > 
> > Good; though I don't think it's worth an "unlikely", and I say again,
> 
> Sure, I'll drop this "unlikely".  Meanwhile, shall we drop all the rest of the
> "unlikely" too around this check?
> 
> > more forcefully, that "zap_skip_check_mapping" is a terrible name.
> > 
> > David suggested naming its inversion should_zap_page(), yes, that's
> > much better; I'd toyed with do_not_zap_page() and zap_skip_page(),
> > but should_zap_page() for the inversion sounds good to me.
> 
> Ah sure, I'll rename it to should_zap_page() in a new patch before this.
> 
> > 
> > And I'm pleased to see in one of Matthew's diffs that he intends to
> > bring zap_details and zap_skip_check_mapping() back into mm/memory.c
> > from include/linux/mm.h.
> 
> Yeah it's only used in memory.c.  I put it there initially because I wanted
> zap_details user can reference what's that check mapping is all about, but
> maybe that's not necessary.
> 
> If you or Matthew could provide a link to that patch, I'll be happy to pick
> that up too with this series.  Or I can also do nothing assuming Matthew will
> handle it elsewhere.
> 
> > 
> > >  			rss[mm_counter(page)]--;
> > >  		}
> > >  		if (unlikely(!free_swap_and_cache(entry)))
> > > -- 
> > > 2.32.0
> > 
> > History.  zap_details came in 2.6.6, and it was mostly to manage
> > truncation on the non-linear vmas we had at that time (remap_file_pages
> > syscall packing non-sequential offsets into a single vma, with pte_file
> > entries), where index could not be deduced from position of pte in vma:
> > truncation range had to be passed down in zap_details; and an madvise
> > needed it too, so it could not be private to mm/memory.c then.
> > 
> > But at the same time, I added the even_cows argument to
> > unmap_mapping_range(), to distinguish truncation (which POSIX requires
> > to unmap even COWed pages) from invalidation (for page cache coherence,
> > which shouldn't touch private COWs).  However, there appear to be no
> > users of that in 2.6.6, though I wouldn't have added that complication
> > just for the fun of confusing everyone: best guess would be that there
> > was parallel development, and the use for !even_cows got removed in
> > the very same release that it was being added.
> > 
> > (PageAnon was brand new in 2.6.6: maybe it could have been used instead
> > of comparing details->check_mapping, or maybe there's some other case
> > I've forgotten which actually needs the exact mapping check.)
> > 
> > Eventually a few drivers came to use unmap_shared_mapping_range(),
> > the !even_cows caller; but more to the point, hole-punching came in,
> > and I felt very strongly that hole-punching on a file had no right
> > to delete private user data.  So then !even_cows became useful.
> > 
> > IIRC, I've seen Linus say he *detests* zap_details.  It had much better
> > justification in the days of non-linear, and I was sad to add
> > single_page to it quite recently; but hope that can go away later
> > (when try_to_unmap_one() is properly extended to THPs).
> > 
> > Now, here's what may clear up a lot of the confusion.
> > The 2.6.7 zap_pte_range() got a line at the head of zap_pte_range()
> > 	if (details && !details->check_mapping && !details->nonlinear_vma)
> > 		details = NULL;
> > which paired with the
> > 		/*
> > 		 * If details->check_mapping, we leave swap entries;
> > 		 * if details->nonlinear_vma, we leave file entries.
> > 		 */
> > 		if (unlikely(details))
> > 			continue;
> > lower down.  I haven't followed up the actual commits, but ChangeLog-2.6.7
> > implies that 2.6.6 had a "details = NULL;" placed elsewhere but buggily.
> > In 2.6.12 it moved from zap_pte_range() to unmap_page_range().
> > It was intended, not so much to optimize, as to simplify the flow;
> > but in fact has caused all this confusion.
> > 
> > When Kirill discontinued non-linear mapping support in 4.0 (no tears
> > were shed and) the nonlinear_vma comment above got deleted, leaving
> > just the then more puzzling check_mapping comment.
> > 
> > Then in 4.6 the "details = NULL;" seems to have got deleted as part of
> > aac453635549 ("mm, oom: introduce oom reaper"), which added some new
> > details (ignore_dirty and check_swap_entries); which got reverted in
> > 4.11, but apparently the "details = NULL;" not restored.
> > 
> > My stamina is suddenly exhausted, without actually pinpointing a commit
> > for "Fixes:" in your eventual cleanup.  Sorry, I've had enough!
> 
> Yeah it's in most cases a pain for digging all these trivial details, thanks
> for digging already most of it out of the mist.
> 
> That's really what I hope this patch can help: not only because the uffd work
> will rely on it, but also on resolving this early (we do use some wordings like
> "technical debt" sometimes, I think it's the same here but different form) when
> the above "history.git" is still functional so we can still reference.
> 
> With your help and the history.git I can try a better commit message because
> obviously some of the contents needs amending (it's not a pure optimization at
> all), but I assume the patch content will be mostly the same, with the tweaks
> applied.
> 
> Per stated so far I don't know any real reproducer so maybe it's not a real
> issue in any production system?  Maybe that'll make it a bit easier, because
> then we don't strongly require a Fixes (which could be another hard question to
> answer besides the issue itself).
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu
#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <fcntl.h>
#include <stdio.h>
#include <numaif.h>
#include <unistd.h>
#include <sys/mman.h>
#include <assert.h>
#include <stdlib.h>
#include <pthread.h>
#include <semaphore.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>

#define BSIZE (1UL << 28)       /* 64MB */
#define DELAY 30                /* Something bigger than 10 */

int page_size, npages, *status, *nodes;
pthread_t thread;
char *buffer;
void **pages;
sem_t sem;
char str[4096];
int shmem_fd;

void *evictor_thread(void *data)
{
    int i, ret;

    printf("evictor created\n");

    for (i = 0; i < npages; i++) {
        sem_wait(&sem);
        usleep(random() % DELAY);
        ret = fallocate(shmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                        i * page_size, page_size);
        assert(ret == 0);
    }

    printf("evictor quitted\n");

    return NULL;
}

void do_init(void)
{
    int i, ret;

    srand(time(NULL));

    sem_init(&sem, 0, 0);
    page_size = getpagesize();
    npages = BSIZE / page_size;

    shmem_fd = memfd_create("zap-migrate", 0);
    assert(shmem_fd >= 0);

    ret = ftruncate(shmem_fd, BSIZE);
    assert(ret == 0);

    buffer = mmap(NULL, BSIZE, PROT_READ | PROT_WRITE,
                  MAP_SHARED, shmem_fd, 0);
    assert(buffer != MAP_FAILED);

    pages = calloc(npages, sizeof(void *));
    status = calloc(npages, sizeof(int));
    nodes = calloc(npages, sizeof(int));
    assert(pages && status && nodes);

    for (i = 0; i < npages; i++) {
        *(buffer + page_size * i) = i;
    }

    ret = pthread_create(&thread, NULL, evictor_thread, NULL);
    assert(ret == 0);
}

void do_quit(void)
{
    free(pages);
    free(status);
    free(nodes);
    munmap(buffer, BSIZE);
    close(shmem_fd);
}

void move_all_to_node(int node)
{
    long ret;
    int i;

    for (i = 0; i < npages; i++) {
        pages[i] = buffer + page_size * i;
        nodes[i] = node;
        status[i] = 0;
    }

    ret = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
    assert(ret == 0);

    printf("%s node=%d\n", __func__, node);
}

void move_each_to_node(int node)
{
    long ret;
    int i;

    for (i = 0; i < npages; i++) {
        pages[0] = buffer + page_size * i;
        nodes[0] = node;
        status[0] = -1;

        /* Kick the evictor */
        sem_post(&sem);
        usleep(10);
        ret = move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
        assert(ret == 0);
    }

    printf("%s node=%d\n", __func__, node);
}

void check(void)
{
    pid_t pid = getpid();
    int fd, ret;
    char *p, *end;

    sprintf(str, "/proc/%d/smaps_rollup", pid);
    printf("Dump file: %s\n==========\n", str);
    fd = open(str, O_RDONLY);
    assert(fd >= 0);

    while (1) {
        ret = read(fd, str, sizeof(str)-1);
        assert(ret >= 0);
        if (ret == 0)
            break;
    }

    p = strstr(str, "Pss_Shmem:");
    end = strchr(p, '\n');
    *end = 0;
    printf("%s\n", p);

    close(fd);
}

void main(void)
{
    do_init();
    move_all_to_node(0);
    move_each_to_node(1);
    /* Make sure madvise() all ran */
    pthread_join(thread, NULL);
    /* Should be "0KB" */
    check();
    do_quit();
}

[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux