On Wed, Jul 10, 2024 at 4:04 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 10.07.24 06:02, Barry Song wrote:
> > On Wed, Jul 10, 2024 at 3:59 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >> On 10.07.24 05:32, Barry Song wrote:
> >>> On Wed, Jul 10, 2024 at 9:23 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On Tue, 9 Jul 2024 20:31:15 +0800 Zhiguo Jiang <justinjiang@xxxxxxxx> wrote:
> >>>>
> >>>>> Releasing a non-shared anonymous folio mapped solely by an exiting
> >>>>> process may go through two flows: 1) the anonymous folio is first
> >>>>> swapped out into swapspace and turned into a swp_entry in
> >>>>> shrink_folio_list; 2) the swp_entry is then released in the process
> >>>>> exiting flow. This results in a high cpu load when releasing a
> >>>>> non-shared anonymous folio mapped solely by an exiting process.
> >>>>>
> >>>>> This is likely to happen when low system memory and an exiting
> >>>>> process coexist, because the non-shared anonymous folio mapped solely
> >>>>> by the exiting process may then be reclaimed by shrink_folio_list.
> >>>>>
> >>>>> With this patch, shrink skips the non-shared anonymous folio solely
> >>>>> mapped by an exiting process; the folio is instead released directly
> >>>>> in the process exiting flow, which saves swap-out time and alleviates
> >>>>> the load of process exiting.
> >>>>
> >>>> It would be helpful to provide some before-and-after runtime
> >>>> measurements, please. It's a performance optimization so please let's
> >>>> see what effect it has.
> >>>
> >>> Hi Andrew,
> >>>
> >>> This was something I was curious about too, so I created a small test program
> >>> that allocates and continuously writes to 256MB of memory. Using QEMU, I set
> >>> up a small machine with only 300MB of RAM to trigger kswapd.
> >>>
> >>> qemu-system-aarch64 -M virt,gic-version=3,mte=off -nographic \
> >>>         -smp cpus=4 -cpu max \
> >>>         -m 300M -kernel arch/arm64/boot/Image
> >>>
> >>> The test program will be randomly terminated by its subprocess to trigger
> >>> the use case of this patch.
> >>>
> >>> #include <stdio.h>
> >>> #include <stdlib.h>
> >>> #include <unistd.h>
> >>> #include <string.h>
> >>> #include <sys/types.h>
> >>> #include <sys/wait.h>
> >>> #include <time.h>
> >>> #include <signal.h>
> >>>
> >>> #define MEMORY_SIZE (256 * 1024 * 1024)
> >>>
> >>> unsigned char *memory;
> >>>
> >>> void allocate_and_write_memory()
> >>> {
> >>>         memory = (unsigned char *)malloc(MEMORY_SIZE);
> >>>         if (memory == NULL) {
> >>>                 perror("malloc");
> >>>                 exit(EXIT_FAILURE);
> >>>         }
> >>>
> >>>         /* keep dirtying the memory so it stays under reclaim pressure */
> >>>         while (1)
> >>>                 memset(memory, 0x11, MEMORY_SIZE);
> >>> }
> >>>
> >>> int main()
> >>> {
> >>>         pid_t pid;
> >>>
> >>>         srand(time(NULL));
> >>>
> >>>         pid = fork();
> >>>         if (pid < 0) {
> >>>                 perror("fork");
> >>>                 exit(EXIT_FAILURE);
> >>>         }
> >>>
> >>>         if (pid == 0) {
> >>>                 /* sleep 10-20s, then kill the parent while it is busy swapping */
> >>>                 int delay = (rand() % 10000) + 10000;
> >>>
> >>>                 usleep(delay * 1000);
> >>>                 kill(getppid(), SIGKILL);
> >>>                 _exit(0);
> >>>         } else {
> >>>                 /* never returns; the child SIGKILLs us mid-loop */
> >>>                 allocate_and_write_memory();
> >>>
> >>>                 wait(NULL);
> >>>                 free(memory);
> >>>         }
> >>>
> >>>         return 0;
> >>> }
> >>>
> >>> I tracked the number of folios that could be redundantly
> >>> swapped out by adding a simple counter as shown below:
> >>>
> >>> @@ -879,6 +880,9 @@ static bool folio_referenced_one(struct folio *folio,
> >>>                      check_stable_address_space(vma->vm_mm)) &&
> >>>                     folio_test_swapbacked(folio) &&
> >>>                     !folio_likely_mapped_shared(folio)) {
> >>> +                       static long i, size;
> >>> +                       size += folio_size(folio);
> >>> +                       pr_err("index: %ld skipped folio:%lx total size:%ld\n", i++, (unsigned long)folio, size);
> >>>                         pra->referenced = -1;
> >>>                         page_vma_mapped_walk_done(&pvmw);
> >>>                         return false;
> >>>
> >>> This is what I have observed:
> >>>
> >>> / # /home/barry/develop/linux/skip_swap_out_test
> >>> [   82.925645] index: 0 skipped folio:fffffdffc0425400 total size:65536
> >>> [   82.925960] index: 1 skipped folio:fffffdffc0425800 total size:131072
> >>> [   82.927524] index: 2 skipped folio:fffffdffc0425c00 total size:196608
> >>> [   82.928649] index: 3 skipped folio:fffffdffc0426000 total size:262144
> >>> [   82.929383] index: 4 skipped folio:fffffdffc0426400 total size:327680
> >>> [   82.929995] index: 5 skipped folio:fffffdffc0426800 total size:393216
> >>> ...
> >>> [   88.469130] index: 6112 skipped folio:fffffdffc0390080 total size:97230848
> >>> [   88.469966] index: 6113 skipped folio:fffffdffc038d000 total size:97296384
> >>> [   89.023414] index: 6114 skipped folio:fffffdffc0366cc0 total size:97300480
> >>>
> >>> I observed that this patch effectively skipped 6114 folios (either 4KB or
> >>> 64KB mTHP), potentially avoiding up to 92MB (97,300,480 bytes) of swap-out
> >>> during the process exit.
> >>>
> >>> Despite the numerous mistakes Zhiguo made in sending this patch, it is still
> >>> quite valuable. Please consider pulling his v9 into the mm tree for testing.
> >>
> >> BTW, we dropped the folio_test_anon() check, but what about shmem? They
> >> also do __folio_set_swapbacked()?
> >
> > My point is that the purpose is skipping redundant swap-out; if a shmem
> > folio is singly mapped, it could also be skipped.
>
> But they won't necessarily get *freed* when unmapping them. They might
> just continue living in tmpfs, where some other process might just map
> them later?

You're correct. I overlooked this aspect, focusing on swap and thinking of
shmem solely in terms of swap.

> IMHO, there is a big difference here between anon and shmem. (well,
> anon_shmem would actually be different :) )

Even though anon_shmem behaves similarly to anonymous memory when memory is
released, handling it here doesn't seem worth the added complexity. So
unfortunately it seems Zhiguo still needs a v10 to take folio_test_anon()
back? Sorry, my bad, Zhiguo.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
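
PS: for anyone following along, below is a rough sketch of how the check in
folio_referenced_one() might look with the folio_test_anon() test brought
back, as discussed above. This is illustration only, not an actual v10 patch;
the mm_users condition on the first line is an assumption about surrounding
context that the snippet earlier in this thread shows only in part.

	/*
	 * Sketch only: let reclaim skip a swapbacked, sole-mapped *anonymous*
	 * folio whose owner is exiting, so the exit path can free it directly
	 * instead of swapping it out first. folio_test_anon() keeps shmem
	 * folios out of this path: they live on in tmpfs after unmap and may
	 * be mapped again by another process later.
	 */
	if ((!atomic_read(&vma->vm_mm->mm_users) ||
	     check_stable_address_space(vma->vm_mm)) &&
	    folio_test_anon(folio) && folio_test_swapbacked(folio) &&
	    !folio_likely_mapped_shared(folio)) {
		pra->referenced = -1;
		page_vma_mapped_walk_done(&pvmw);
		return false;
	}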