On Mon, 16 Apr 2012, Mel Gorman wrote:

> Changelog since V8
> o Rebase to 3.4-rc2
> o Use page flag instead of slab fields to keep structures the same size
> o Properly detect allocations from softirq context that use PF_MEMALLOC
> o Ensure kswapd does not sleep while processes are throttled
> o Do not accidentally throttle !__GFP_FS processes indefinitely
>
> Changelog since V7
> o Rebase to 3.3-rc2
> o Take greater care propagating page->pfmemalloc to skb
> o Propagate pfmemalloc from netdev_alloc_page to skb where possible
> o Release RCU lock properly on preempt kernel
>
> Changelog since V6
> o Rebase to 3.1-rc8
> o Use wake_up instead of wake_up_interruptible()
> o Do not throttle kernel threads
> o Avoid a potential race between kswapd going to sleep and processes
>   being throttled
>
> Changelog since V5
> o Rebase to 3.1-rc5
>
> Changelog since V4
> o Update comment clarifying what protocols can be used (Michal)
> o Rebase to 3.0-rc3
>
> Changelog since V3
> o Propagate pfmemalloc from packet fragment pages to skb (Neil)
> o Rebase to 3.0-rc2
>
> Changelog since V2
> o Document that __GFP_NOMEMALLOC overrides __GFP_MEMALLOC (Neil)
> o Use wait_event_interruptible (Neil)
> o Use !! when casting to bool to avoid any possibility of type
>   truncation (Neil)
> o Nicer logic when using skb_pfmemalloc_protocol (Neil)
>
> Changelog since V1
> o Rebase on top of mmotm
> o Use atomic_t for memalloc_socks (David Miller)
> o Remove use of sk_memalloc_socks in vmscan (Neil Brown)
> o Check throttle within prepare_to_wait (Neil Brown)
> o Add statistics on throttling instead of printk
>
> When a user or administrator requires swap for their application, they
> create a swap partition or file, format it with mkswap and activate it
> with swapon. Swap over the network is considered as an option in
> diskless systems.
> The two likely scenarios are blade servers used as part of a cluster,
> where the form factor or maintenance costs do not allow the use of
> disks, and thin clients.
>
> The Linux Terminal Server Project recommends the use of the
> Network Block Device (NBD) for swap according to the manual at
> https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
> There is also documentation and tutorials on how to set up swap over
> NBD at places like
> https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
> The nbd-client documentation also covers the use of NBD as swap.
> Despite this, the fact is that a machine using NBD for swap can
> deadlock within minutes if swap is used intensively. This patch series
> addresses the problem.
>
> The core issue is that network block devices do not use mempools like
> normal block devices do. As the host cannot control where it receives
> packets from, it cannot reliably work out in advance how much memory
> it might need. Some years ago, Peter Zijlstra developed a series of
> patches that supported swap over NFS that at least one distribution
> is carrying within its kernels. This patch series borrows very heavily
> from Peter's work to support swapping over NBD as a pre-requisite to
> supporting swap-over-NFS. The bulk of the complexity is concerned with
> preserving memory that is allocated from the PFMEMALLOC reserves for
> use by the network layer, which is needed for both NBD and NFS.
>
> Patch 1 serialises access to min_free_kbytes. It's not strictly needed
>	by this series but as the series cares about watermarks in
>	general, it's a harmless fix. It could be merged independently
>	and may be if CMA is merged in advance.
>
> Patch 2 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
>	preserve access to pages allocated under low memory situations
>	to callers that are freeing memory.
> Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
>	reserves without setting PFMEMALLOC.
>
> Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
>	for later use by network packet processing.
>
> Patch 5 ignores memory policies when ALLOC_NO_WATERMARKS is set.
>
> Patches 6-13 allow network processing to use PFMEMALLOC reserves when
>	the socket has been marked as being used by the VM to clean
>	pages. If packets are received and stored in pages that were
>	allocated under low-memory situations and are unrelated to the
>	VM, the packets are dropped.
>
> Patch 11 reintroduces __netdev_alloc_page, which the networking
>	folk may object to but is needed in some cases to propagate
>	pfmemalloc from a newly allocated page to an skb. If there is a
>	strong objection, this patch can be dropped with the impact
>	being that swap-over-network will be slower in some cases but
>	should not fail.
>
> Patch 13 is a micro-optimisation to avoid a function call in the
>	common case.
>
> Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
>	PFMEMALLOC reserves if necessary.
>
> Patch 15 notes that it is still possible for the PFMEMALLOC reserves
>	to be depleted. To prevent this, direct reclaimers get throttled
>	on a waitqueue if 50% of the PFMEMALLOC reserves are depleted.
>	It is expected that kswapd and the direct reclaimers already
>	running will clean enough pages for the low watermark to be
>	reached and the throttled processes will be woken up.
>
> Patch 16 adds a statistic to track how often processes get throttled.
>
> Some basic performance testing was run using kernel builds, netperf
> on loopback for UDP and TCP, hackbench (pipes and sockets), iozone
> and sysbench. Each of them was expected to use the sl*b allocators
> reasonably heavily but there did not appear to be significant
> performance variances.
>
> For testing swap-over-NBD, a machine was booted with 2G of RAM with a
> swapfile backed by NBD.
> 8*NUM_CPU processes were started that create anonymous memory
> mappings and read them linearly in a loop. The total size of the
> mappings was 4*PHYSICAL_MEMORY to use swap heavily under memory
> pressure.
>
> Without the patches and using SLUB, the machine locks up within
> minutes; with them applied, the test runs to completion. With SLAB,
> the story is different, as an unpatched kernel runs to completion.
> However, the patched kernel completed the test 40% faster.
>
>                                         3.4.0-rc2    3.4.0-rc2
>                                      vanilla-slab      swapnbd
> Sys Time Running Test (seconds)             87.90        73.45
> User+Sys Time Running Test (seconds)        91.93        76.91
> Total Elapsed Time (seconds)              4174.37      2953.96

I have tested these with an artificial swap benchmark and with a large
project compile on a beagle board. They work great for me. My tests
only used this set via swap over NFS, so it probably wasn't very
thorough coverage.

Tested-by: Eric B Munson <emunson@xxxxxxxxx>