On Fri, 2019-07-12 at 17:28 +0200, Greg KH wrote: > On Fri, Jul 12, 2019 at 04:18:06PM +0200, Max Kellermann wrote: > > This reverts commit be4c2d4723a4a637f0d1b4f7c66447141a4b3564. > > > > That commit caused a severe memory leak in nfs_readdir_make_qstr(). > > > > When listing a directory with more than 100 files (this is how many > > struct nfs_cache_array_entry elements fit in one 4kB page), all > > allocated file name strings past those 100 leak. > > > > The root of the leakage is that those string pointers are managed > > in > > pages which are never linked into the page cache. > > > > fs/nfs/dir.c puts pages into the page cache by calling > > read_cache_page(); the callback function nfs_readdir_filler() will > > then fill the given page struct which was passed to it, which is > > already linked in the page cache (by do_read_cache_page() calling > > add_to_page_cache_lru()). > > > > Commit be4c2d4723a4 added another (local) array of allocated pages, > > to > > be filled with more data, instead of discarding excess items > > received > > from the NFS server. Those additional pages can be used by the > > next > > nfs_readdir_filler() call (from within the same nfs_readdir() > > call). > > > > The leak happens when some of those additional pages are never used > > (copied to the page cache using copy_highpage()). The pages will > > be > > freed by nfs_readdir_free_pages(), but their contents will > > not. The > > commit did not invoke nfs_readdir_clear_array() (and doing so would > > have been dangerous, because it did not track which of those pages > > were already copied to the page cache, risking double free bugs). > > > > How to reproduce the leak: > > > > - Use a kernel with CONFIG_SLUB_DEBUG_ON. > > > > - Create a directory on a NFS mount with more than 100 files with > > names long enough to use the "kmalloc-32" slab (so we can easily > > look up the allocation counts): > > > > for i in `seq 110`; do touch ${i}_0123456789abcdef; done > > > > - Drop all caches: > > > > echo 3 >/proc/sys/vm/drop_caches > > > > - Check the allocation counter: > > > > grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls > > 30564391 nfs_readdir_add_to_array+0x73/0xd0 > > age=534558/4791307/6540952 pid=370-1048386 cpus=0-47 nodes=0-1 > > > > - Request a directory listing and check the allocation counters > > again: > > > > ls > > [...] > > grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls > > 30564511 nfs_readdir_add_to_array+0x73/0xd0 > > age=207/4792999/6542663 pid=370-1048386 cpus=0-47 nodes=0-1 > > > > There are now 120 new allocations. > > > > - Drop all caches and check the counters again: > > > > echo 3 >/proc/sys/vm/drop_caches > > grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls > > 30564401 nfs_readdir_add_to_array+0x73/0xd0 > > age=735/4793524/6543176 pid=370-1048386 cpus=0-47 nodes=0-1 > > > > 110 allocations are gone, but 10 have leaked and will never be > > freed. > > > > Unhelpfully, those allocations are explicitly excluded from > > KMEMLEAK, > > that's why my initial attempts with KMEMLEAK were not successful: > > > > /* > > * Avoid a kmemleak false positive. The pointer to the name is > > stored > > * in a page cache page which kmemleak does not scan. > > */ > > kmemleak_not_leak(string->name); > > > > It would be possible to solve this bug without reverting the whole > > commit: > > > > - keep track of which pages were not used, and call > > nfs_readdir_clear_array() on them, or > > - manually link those pages into the page cache > > > > But for now I have decided to just revert the commit, because the > > real > > fix would require complex considerations, risking more dangerous > > (crash) bugs, which may seem unsuitable for the stable branches. > > > > Signed-off-by: Max Kellermann <mk@xxxxxxxxxx> > > --- > > fs/nfs/dir.c | 90 ++++--------------------------------------- > > ---- > > fs/nfs/internal.h | 3 +- > > 2 files changed, 7 insertions(+), 86 deletions(-) > > No cc: stable@vger on this patch to get it backported? > I've added one. I've also backed out the 3 fixes for other problems with the same patch that were in the linux-next tree. (St. Stephen Rothwell please forgive me, for I have sinned...) -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx