On 12/11/2015 04:09 PM, Christian Borntraeger wrote:
If a user has more than one swap disk with different priorities, the swap code will fill up the high-priority disk until the last block is used. The swap code will continue to scan the first disk even when it is already filling the 2nd or 3rd disk. We have seen kswapd running at 100% CPU, with the majority of hits in the scanning code of scan_swap_map, even for non-rotational disks, when this happens.

For example, with 3 disks

  disk1  99.9%
  disk2  10%
  disk3  0%

it will scan the bitmap of disk1 (and as the disk is full, the cluster optimization does not trigger) for every page that will likely go to disk2 anyway. By doing a first scan that only uses up to 98%, we force the swap code to use the 2nd disk slightly earlier, but it reduces kswapd CPU usage significantly. The 2nd scan will then allow filling the remaining 2%, again starting with the highest-priority disk.

The code does not affect cases with all the same swap priorities, unless all disks are about 98% full.

There is one issue with this approach: if there is a mix of same and different priorities, the code will loop too often due to the requeue, so an idea for a better fix is welcome.

Signed-off-by: Christian Borntraeger <borntraeger@xxxxxxxxxx>
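The two-pass selection described above can be sketched in user space as follows. This is only an illustration under stated assumptions, not the kernel code: struct swap_dev, nearly_full() and pick_swap_dev() are hypothetical names, the devices are assumed to be already sorted by descending priority, and the 98% test multiplies before comparing, which is not the expression used in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct swap_info_struct; not the kernel type. */
struct swap_dev {
	unsigned long pages;       /* total usable pages */
	unsigned long inuse_pages; /* pages currently allocated */
};

/* True when more than 98% of the device is in use. */
static bool nearly_full(const struct swap_dev *d)
{
	return (unsigned long long)d->inuse_pages * 100 >
	       (unsigned long long)d->pages * 98;
}

/*
 * devs[] is assumed to be sorted by descending priority.
 * Returns the index of the device to allocate from, or -1 if all are full.
 */
static int pick_swap_dev(const struct swap_dev *devs, size_t n)
{
	for (int pass = 0; pass < 2; pass++) {
		for (size_t i = 0; i < n; i++) {
			const struct swap_dev *d = &devs[i];

			if (d->inuse_pages >= d->pages)
				continue; /* completely full */
			if (pass == 0 && nearly_full(d))
				continue; /* first pass: keep the last 2% for later */
			return (int)i;
		}
	}
	return -1;
}
```

With the disk1/disk2/disk3 example from the description, the first pass skips disk1 and returns disk2 without disk1's bitmap being scanned. Only when every disk is above 98% does the second pass fall back to the highest-priority disk that still has free pages.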
IMHO you should resend and CC the relevant people directly (e.g. via ./scripts/get_maintainer.pl), or this might simply get lost on the high-volume mailing lists.
Note that I'm not familiar with this code. But my first thought would be to put a cache with batch-refill/free in front of the bitmap. During the "first" round, only consider si's with enough free space to satisfy the whole batch-refill.
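One possible reading of this idea, as a user-space sketch only: every name here (BATCH, struct swap_dev, struct slot_cache, refill_from(), alloc_slot()) is hypothetical, the byte map is a simplified stand-in for the kernel's swap map, devices are assumed sorted by descending priority, and locking and the batched free path are left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define BATCH 4 /* hypothetical batch size */

/* Simplified stand-in for a swap device; not the kernel type. */
struct swap_dev {
	unsigned char *map; /* 0 = free, nonzero = in use */
	size_t pages;
	size_t free_pages;
};

/* Small cache of pre-claimed slots sitting in front of the map scan. */
struct slot_cache {
	int dev;             /* index of the device the slots came from */
	size_t slots[BATCH];
	size_t count;
};

/* A single scan of d->map claims up to `want` free slots. */
static size_t refill_from(struct swap_dev *d, size_t *out, size_t want)
{
	size_t got = 0;

	for (size_t off = 0; off < d->pages && got < want; off++) {
		if (d->map[off] == 0) {
			d->map[off] = 1;
			out[got++] = off;
		}
	}
	d->free_pages -= got;
	return got;
}

/* Returns true and fills *dev and *off, or false if all swap is used up. */
static bool alloc_slot(struct slot_cache *c, struct swap_dev *devs, size_t n,
		       int *dev, size_t *off)
{
	/* Pass 0 wants a whole batch; pass 1 accepts any free slot. */
	for (int pass = 0; pass < 2 && c->count == 0; pass++) {
		for (size_t i = 0; i < n; i++) {
			size_t need = (pass == 0) ? BATCH : 1;

			if (devs[i].free_pages < need)
				continue;
			c->count = refill_from(&devs[i], c->slots, BATCH);
			c->dev = (int)i;
			break;
		}
	}
	if (c->count == 0)
		return false;
	*dev = c->dev;
	*off = c->slots[--c->count];
	return true;
}
```

In this sketch the map of a device is scanned once per BATCH allocations rather than once per page, and a nearly full high-priority device is only touched after no device can satisfy a whole batch.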
---
 mm/swapfile.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5887731..d3817cf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -640,6 +640,7 @@ swp_entry_t get_swap_page(void)
 {
 	struct swap_info_struct *si, *next;
 	pgoff_t offset;
+	bool first = true;
 
 	if (atomic_long_read(&nr_swap_pages) <= 0)
 		goto noswap;
@@ -653,6 +654,12 @@ start_over:
 		plist_requeue(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
 		spin_lock(&si->lock);
+		/* at 98% usage lets try the other swaps */
+		if (first && si->inuse_pages / 98 * 100 > si->pages) {
+			spin_lock(&swap_avail_lock);
+			spin_unlock(&si->lock);
+			goto nextsi;
+		}
 		if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
 			spin_lock(&swap_avail_lock);
 			if (plist_node_empty(&si->avail_list)) {
@@ -692,6 +699,10 @@ nextsi:
 		if (plist_node_empty(&next->avail_list))
 			goto start_over;
 	}
+	if (first) {
+		first = false;
+		goto start_over;
+	}
 
 	spin_unlock(&swap_avail_lock);
 
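A side note on the threshold expression in the patch: in C, inuse_pages / 98 is an integer division, so the quotient is truncated before the multiplication by 100. The sketch below isolates that arithmetic in user space. The function names are hypothetical, and the multiply-first variant is only one possible alternative, not something proposed in this thread.

```c
#include <stdbool.h>

/* The check as written in the patch: the division truncates first. */
static bool over_98_patch(unsigned long inuse, unsigned long pages)
{
	return inuse / 98 * 100 > pages;
}

/* Hypothetical alternative: multiply first, in 64 bits, so nothing is lost. */
static bool over_98_exact(unsigned long inuse, unsigned long pages)
{
	return (unsigned long long)inuse * 100 >
	       (unsigned long long)pages * 98;
}
```

For a device of 1,000,000 pages the patch form first becomes true at 980,098 pages in use, so on large devices the difference is negligible. For a device of 1,000 pages it never becomes true, because the left-hand side only changes in steps of 100 for every 98 pages in use.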