On Wed, Apr 15, 2020 at 09:32:47AM +0200, Andrea Righi wrote: > On Wed, Apr 15, 2020 at 10:37:00AM +0800, Huang, Ying wrote: > > Andrea Righi <andrea.righi@xxxxxxxxxxxxx> writes: > > > > > On Tue, Apr 14, 2020 at 09:31:24AM +0800, Huang, Ying wrote: > > >> Andrea Righi <andrea.righi@xxxxxxxxxxxxx> writes: > > >> > > >> > On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote: > > >> >> Andrea Righi <andrea.righi@xxxxxxxxxxxxx> writes: > > >> >> > > >> >> [snip] > > >> >> > > >> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c > > >> >> > index ebed37bbf7a3..c71abc8df304 100644 > > >> >> > --- a/mm/swap_state.c > > >> >> > +++ b/mm/swap_state.c > > >> >> > @@ -20,6 +20,7 @@ > > >> >> > #include <linux/migrate.h> > > >> >> > #include <linux/vmalloc.h> > > >> >> > #include <linux/swap_slots.h> > > >> >> > +#include <linux/oom.h> > > >> >> > #include <linux/huge_mm.h> > > >> >> > > > >> >> > #include <asm/pgtable.h> > > >> >> > @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset) > > >> >> > max_pages = 1 << READ_ONCE(page_cluster); > > >> >> > if (max_pages <= 1) > > >> >> > return 1; > > >> >> > + /* > > >> >> > + * If current task is using too much memory or swapoff is running > > >> >> > + * simply use the max readahead size. Since we likely want to load a > > >> >> > + * lot of pages back into memory, using a fixed-size max readhaead can > > >> >> > + * give better performance in this case. > > >> >> > + */ > > >> >> > + if (oom_task_origin(current)) > > >> >> > + return max_pages; > > >> >> > > > >> >> > hits = atomic_xchg(&swapin_readahead_hits, 0); > > >> >> > pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, > > >> >> > > >> >> Thinks this again. If my understanding were correct, the accessing > > >> >> pattern during swapoff is sequential, why swap readahead doesn't work? > > >> >> If so, can you root cause that firstly? > > >> > > > >> > Theoretically if the pattern is sequential the current heuristic should > > >> > already select a big readahead size, but apparently it's not doing that. > > >> > > > >> > I'll repeat my tests tracing the readahead size during swapoff to see > > >> > exactly what's going on here. > > >> > > >> I haven't verify it. It may be helpful to call lookup_swap_cache() > > >> before swapin_readahead() in unuse_pte_range(). The theory behind it is > > >> to update the swap readahead statistics via lookup_swap_cache(). > > > > > > I did more tests trying to collect some useful information. > > > > > > In particular I've been focusing at tracing the distribution of the > > > values returned by swapin_nr_pages() in different scenarios. > > > > > > To do so I made swapin_nr_pages() trace-able and I used the following > > > bcc command to measure the distrubution of the returned values: > > > > > > # argdist-bpfcc -c -C 'r::swapin_nr_pages(unsigned long offset):unsigned long:$retval' > > > > > > I've collected this metric in the following scenarios: > > > - 5.6 vanilla > > > - 5.6 + lookup_swap_cache() before swapin_readahead() in > > > unuse_pte_range() > > > - 5.6 + atomic_inc(&swapin_readahead_hits) before swapin_readahead() > > > in unuse_pte_range() > > > - 5.6 + swapin_readahead_hits=last_readahead_pages (in the atomic way) > > > before swapin_readahead() in unuse_pte_range() > > > > > > Each kernel has been tested both with swappiness=0 and swappiness=60. > > > Results are pretty much identical changing the swappiness, so I'm just > > > reporting the default case here (w/ swappiness=60). > > > > > > Result > > > ====== > > > > > > = swapoff performance (elapsed time) = > > > > > > vanilla 22.09s > > > lookup_swap_cache() 23.87s > > > hits++ 16.10s > > > hits=last_ra_pages 8.81s > > > > > > = swapin_nr_pages() $retval distribution = > > > > > > 5.6 vanilla: > > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval > > > COUNT EVENT > > > 36948 $retval = 8 > > > 44151 $retval = 4 > > > 49290 $retval = 1 > > > 527771 $retval = 2 > > > > > > 5.6 lookup_swap_cache() before swapin_readahead(): > > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval > > > COUNT EVENT > > > 13093 $retval = 1 > > > 56703 $retval = 8 > > > 123067 $retval = 2 > > > 366118 $retval = 4 > > > > > > 5.6 atomic_inc(&swapin_readahead_hits) before swapin_readahead(): > > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval > > > COUNT EVENT > > > 2589 $retval = 1 > > > 8016 $retval = 2 > > > 40021 $retval = 8 > > > 566038 $retval = 4 > > > > > > 5.6 swapin_readahead_hits=last_readahead_pages before swapin_readahead(): > > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval > > > COUNT EVENT > > > 785 $retval = 2 > > > 1072 $retval = 1 > > > 21844 $retval = 4 > > > 644168 $retval = 8 > > > > > > In the vanilla case, the readahead heuristic seems to choose 2 pages > > > most of the time. This is because we are not properly considering the > > > hits (hits are always 0 in the swapoff code path) and, as you correctly > > > pointed out, we can fix this by calling lookup_swap_cache() in > > > unuse_pte_range() before calling swapin_readahead(). > > > > > > With this change the distribution of the readahead size moves more > > > toward 4 pages, but we still have some 2s. That looks good, however it > > > doesn't seem to speed up swapoff very much... maybe because calling > > > lookup_swap_cache() introduces a small overhead? (still need to > > > investigate about this theory). > > > > > > In the next test I've tried to always increment hits by 1 before calling > > > swapin_readahead() in unuse_pte_range(). This is basically cheating, > > > because I'm faking the hit ratio, forcing the heuristic to use a larger > > > readahead size; in fact, the readahead size moves even more toward 4 > > > pages and swapoff performance are a little better now. > > > > > > Pushing even more the "cheating" I can pretend that the previous > > > readahead was all hits (swapin_readahead_hits=last_readahead_pages), so > > > I'm forcing the heuristic to move toward the max size and keep using it. > > > The result here is pretty much identical to my fixed-size patch, because > > > swapin_nr_pages() returns the max readahead size pretty much all the > > > time during swapoff (8 pages or, more in general, vm.page-cluster). > > > > > > Personally I don't like very much forcing the heuristic in this way, > > > it'd be nice if it would just work by accounting the proper hit ratio > > > (so just by adding lookup_swap_cache() as you correctly suggested), but > > > this solution doesn't seem to improve performance in reality. For this > > > reason I still think we should consider the swapoff scenario like a > > > special one and somehow bypass the readahead heuristic and always return > > > the max readahead size. > > > > > > Looking at the hits of the previous step in the swapoff case just > > > doesn't work, because we may have some misses, but they will become hits > > > very soon, since we are reading all the swapped out pages back into > > > memory. This is why using the max readahead size gives better > > > swapoff performance. > > > > > > What do you think? > > > > >From your description, it appears that you are using cluster readahead > > instead of vma readahead. Can you verify this via, > > > > # cat /sys/kernel/mm/swap/vma_ra_enabled > > # cat /sys/kernel/mm/swap/vma_ra_enabled > true > > However, it's still using the cluster readahead because I'm using a > swap file and nr_rotate_swap=1, so, according to the following, it's > never using the vma readahead: > > static inline bool swap_use_vma_readahead(void) > { > return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap); > } > > I'll investigate more on this, I think there's no reason to prevent the > usage of vma readahead if the underlying device is non-rotational. Few more details about this. Even if I have vma_ra_enabled (in /sys/kernel/mm/swap/vma_ra_enabled) it looks like my virtio_blk device is considered rotational by default: $ cat /sys/block/vda/queue/rotational 1 Therefore the vma readahead is not used. If I change this to "rotational" then the vma readahead is used; this is confirmed by the fact that swapin_nr_pages isn't called anymore: r::swapin_nr_pages(unsigned long offset):unsigned long:$retval COUNT EVENT 13 $retval = 1 18 $retval = 2 23 $retval = 4 29 $retval = 8 swapoff time: 16.44s In terms of swapoff performance vma readahead works better than the vanilla cluster readahead (~21s), but the "improved" cluster readahead (with lookup_swap_cache() in unuse_pte_range()) still gives slightly better results (~12s). -Andrea