On Wed, Jul 10, 2024 at 6:47 PM zhiguojiang <justinjiang@xxxxxxxx> wrote:
>
>
>
> On 2024/7/10 12:44, Barry Song wrote:
> >
> > On Wed, Jul 10, 2024 at 4:04 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >> On 10.07.24 06:02, Barry Song wrote:
> >>> On Wed, Jul 10, 2024 at 3:59 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>> On 10.07.24 05:32, Barry Song wrote:
> >>>>> On Wed, Jul 10, 2024 at 9:23 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >>>>>> On Tue, 9 Jul 2024 20:31:15 +0800 Zhiguo Jiang <justinjiang@xxxxxxxx> wrote:
> >>>>>>
> >>>>>>> Releasing a non-shared anonymous folio mapped solely by an exiting
> >>>>>>> process may go through two flows: 1) the anonymous folio is first
> >>>>>>> swapped out to swap space and transformed into a swp_entry in
> >>>>>>> shrink_folio_list; 2) the swp_entry is then released in the process
> >>>>>>> exit flow. This results in a high CPU load for releasing a
> >>>>>>> non-shared anonymous folio mapped solely by an exiting process.
> >>>>>>>
> >>>>>>> When low system memory coincides with an exiting process, this is
> >>>>>>> likely to happen, because the non-shared anonymous folio mapped
> >>>>>>> solely by the exiting process may be reclaimed by shrink_folio_list.
> >>>>>>>
> >>>>>>> With this patch, shrink skips the non-shared anonymous folio mapped
> >>>>>>> solely by an exiting process, and the folio is instead released
> >>>>>>> directly in the process exit flow, which saves swap-out time and
> >>>>>>> alleviates the load of process exit.
> >>>>>> It would be helpful to provide some before-and-after runtime
> >>>>>> measurements, please. It's a performance optimization so please let's
> >>>>>> see what effect it has.
> >>>>> Hi Andrew,
> >>>>>
> >>>>> This was something I was curious about too, so I created a small test program
> >>>>> that allocates and continuously writes to 256MB of memory. Using QEMU, I set
> >>>>> up a small machine with only 300MB of RAM to trigger kswapd.
> >>>>>
> >>>>> qemu-system-aarch64 -M virt,gic-version=3,mte=off -nographic \
> >>>>>         -smp cpus=4 -cpu max \
> >>>>>         -m 300M -kernel arch/arm64/boot/Image
> >>>>>
> >>>>> The test program will be randomly terminated by its subprocess to trigger
> >>>>> the use case of this patch.
> >>>>>
> >>>>> #include <stdio.h>
> >>>>> #include <stdlib.h>
> >>>>> #include <unistd.h>
> >>>>> #include <string.h>
> >>>>> #include <sys/types.h>
> >>>>> #include <sys/wait.h>
> >>>>> #include <time.h>
> >>>>> #include <signal.h>
> >>>>>
> >>>>> #define MEMORY_SIZE (256 * 1024 * 1024)
> >>>>>
> >>>>> unsigned char *memory;
> >>>>>
> >>>>> void allocate_and_write_memory()
> >>>>> {
> >>>>>         memory = (unsigned char *)malloc(MEMORY_SIZE);
> >>>>>         if (memory == NULL) {
> >>>>>                 perror("malloc");
> >>>>>                 exit(EXIT_FAILURE);
> >>>>>         }
> >>>>>
> >>>>>         while (1)
> >>>>>                 memset(memory, 0x11, MEMORY_SIZE);
> >>>>> }
> >>>>>
> >>>>> int main()
> >>>>> {
> >>>>>         pid_t pid;
> >>>>>         srand(time(NULL));
> >>>>>
> >>>>>         pid = fork();
> >>>>>
> >>>>>         if (pid < 0) {
> >>>>>                 perror("fork");
> >>>>>                 exit(EXIT_FAILURE);
> >>>>>         }
> >>>>>
> >>>>>         if (pid == 0) {
> >>>>>                 int delay = (rand() % 10000) + 10000;
> >>>>>                 usleep(delay * 1000);
> >>>>>
> >>>>>                 /* kill the parent while it is busy swapping */
> >>>>>                 kill(getppid(), SIGKILL);
> >>>>>                 _exit(0);
> >>>>>         } else {
> >>>>>                 allocate_and_write_memory();
> >>>>>
> >>>>>                 wait(NULL);
> >>>>>
> >>>>>                 free(memory);
> >>>>>         }
> >>>>>
> >>>>>         return 0;
> >>>>> }
> >>>>>
> >>>>> I tracked the number of folios that could be redundantly
> >>>>> swapped out by adding a simple counter as shown below:
> >>>>>
> >>>>> @@ -879,6 +880,9 @@ static bool folio_referenced_one(struct folio *folio,
> >>>>>                      check_stable_address_space(vma->vm_mm)) &&
> >>>>>                      folio_test_swapbacked(folio) &&
> >>>>>                      !folio_likely_mapped_shared(folio)) {
> >>>>> +                    static long i, size;
> >>>>> +                    size += folio_size(folio);
> >>>>> +                    pr_err("index: %ld skipped folio:%lx total size:%ld\n", i++, (unsigned long)folio, size);
> >>>>>                      pra->referenced = -1;
> >>>>>                      page_vma_mapped_walk_done(&pvmw);
> >>>>>                      return false;
> >>>>>
> >>>>>
> >>>>> This is what I have observed:
> >>>>>
> >>>>> / # /home/barry/develop/linux/skip_swap_out_test
> >>>>> [ 82.925645] index: 0 skipped folio:fffffdffc0425400 total size:65536
> >>>>> [ 82.925960] index: 1 skipped folio:fffffdffc0425800 total size:131072
> >>>>> [ 82.927524] index: 2 skipped folio:fffffdffc0425c00 total size:196608
> >>>>> [ 82.928649] index: 3 skipped folio:fffffdffc0426000 total size:262144
> >>>>> [ 82.929383] index: 4 skipped folio:fffffdffc0426400 total size:327680
> >>>>> [ 82.929995] index: 5 skipped folio:fffffdffc0426800 total size:393216
> >>>>> ...
> >>>>> [ 88.469130] index: 6112 skipped folio:fffffdffc0390080 total size:97230848
> >>>>> [ 88.469966] index: 6113 skipped folio:fffffdffc038d000 total size:97296384
> >>>>> [ 89.023414] index: 6114 skipped folio:fffffdffc0366cc0 total size:97300480
> >>>>>
> >>>>> I observed that this patch effectively skipped 6114 folios (either 4KB or 64KB
> >>>>> mTHP), potentially reducing the swap-out by up to 92MB (97,300,480 bytes) during
> >>>>> the process exit.
> >>>>>
> >>>>> Despite the numerous mistakes Zhiguo made in sending this patch, it is still
> >>>>> quite valuable. Please consider pulling his v9 into the mm tree for testing.
> >>>> BTW, we dropped the folio_test_anon() check, but what about shmem? They
> >>>> also do __folio_set_swapbacked()?
> >>> My point is that the purpose is skipping redundant swap-out; if shmem is
> >>> singly mapped, it could also be skipped.
> >> But they won't necessarily get *freed* when unmapping them. They might
> >> just continue living in tmpfs, where some other process might just map
> >> them later?
> >>
> > You're correct. I overlooked this aspect, focusing on swap and thinking of shmem
> > solely in terms of swap.
> >
> >> IMHO, there is a big difference here between anon and shmem. (well,
> >> anon_shmem would actually be different :) )
> > Even though anon_shmem behaves similarly to anonymous memory when
> > releasing memory, it doesn't seem worth the added complexity?
> >
> > So unfortunately it seems Zhiguo still needs a v10 to bring folio_test_anon()
> > back? Sorry for my bad, Zhiguo.
> If the folio_test_anon(folio) && folio_test_swapbacked(folio) condition is
> used, does that mean the folio is definitely anonymous rather than shmem?
> And does folio_likely_mapped_shared() then need to be removed?

No, shared memory (shmem) isn't necessarily shared, and private anonymous
memory isn't necessarily unshared. There is no direct relationship between
them. In the case of a fork, your private anonymous folio can be shared by
two or more processes before CoW.

> >
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >>
> > Thanks
> > Barry
> Thanks
> Zhiguo
>
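[Editor's note: to make the fork/CoW point above concrete, here is a minimal
userspace sketch. It is purely illustrative and not part of the patch or the
thread; the program and its helper name pagemap_entry() are hypothetical. It
shows that a MAP_PRIVATE anonymous page remains backed by the same physical
page in parent and child until one of them writes to it, by comparing the
frame recorded in /proc/self/pagemap for the same virtual address in both
processes. The PFN bits (bits 0-54 of each entry) are only exposed to
privileged users, so run it as root to see matching frame numbers.]

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read the /proc/self/pagemap entry for a virtual address. Bits 0-54 hold
 * the PFN, but only when the caller has CAP_SYS_ADMIN; otherwise they read
 * back as zero. */
static uint64_t pagemap_entry(void *addr)
{
	uint64_t entry = 0;
	long page_size = sysconf(_SC_PAGESIZE);
	off_t offset = ((uintptr_t)addr / page_size) * sizeof(uint64_t);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0) {
		perror("open pagemap");
		exit(EXIT_FAILURE);
	}
	if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
		perror("pread pagemap");
		exit(EXIT_FAILURE);
	}
	close(fd);
	return entry;
}

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	/* Private anonymous mapping, the same kind of memory the patch targets. */
	char *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pid_t pid;

	if (p == MAP_FAILED) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}
	memset(p, 0x11, page_size);	/* fault the page in before forking */

	pid = fork();
	if (pid < 0) {
		perror("fork");
		exit(EXIT_FAILURE);
	}
	if (pid == 0) {
		/* The child only reads, so no CoW copy has been made yet and
		 * the parent's physical page is still mapped here as well. */
		printf("child  pfn: 0x%llx\n",
		       (unsigned long long)(pagemap_entry(p) & ((1ULL << 55) - 1)));
		_exit(0);
	}
	printf("parent pfn: 0x%llx\n",
	       (unsigned long long)(pagemap_entry(p) & ((1ULL << 55) - 1)));
	wait(NULL);
	return 0;
}

Run as root, both lines print the same frame number; once either side writes
to the page, CoW gives it a private copy and the numbers diverge. This is
why !folio_likely_mapped_shared() still matters even for plain anonymous
memory, and why folio_test_anon() alone cannot stand in for it.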