On Fri, Jul 12, 2019 at 04:18:06PM +0200, Max Kellermann wrote: > This reverts commit be4c2d4723a4a637f0d1b4f7c66447141a4b3564. > > That commit caused a severe memory leak in nfs_readdir_make_qstr(). > > When listing a directory with more than 100 files (this is how many > struct nfs_cache_array_entry elements fit in one 4kB page), all > allocated file name strings past those 100 leak. > > The root of the leakage is that those string pointers are managed in > pages which are never linked into the page cache. > > fs/nfs/dir.c puts pages into the page cache by calling > read_cache_page(); the callback function nfs_readdir_filler() will > then fill the given page struct which was passed to it, which is > already linked in the page cache (by do_read_cache_page() calling > add_to_page_cache_lru()). > > Commit be4c2d4723a4 added another (local) array of allocated pages, to > be filled with more data, instead of discarding excess items received > from the NFS server. Those additional pages can be used by the next > nfs_readdir_filler() call (from within the same nfs_readdir() call). > > The leak happens when some of those additional pages are never used > (copied to the page cache using copy_highpage()). The pages will be > freed by nfs_readdir_free_pages(), but their contents will not. The > commit did not invoke nfs_readdir_clear_array() (and doing so would > have been dangerous, because it did not track which of those pages > were already copied to the page cache, risking double free bugs). > > How to reproduce the leak: > > - Use a kernel with CONFIG_SLUB_DEBUG_ON. > > - Create a directory on a NFS mount with more than 100 files with > names long enough to use the "kmalloc-32" slab (so we can easily > look up the allocation counts): > > for i in `seq 110`; do touch ${i}_0123456789abcdef; done > > - Drop all caches: > > echo 3 >/proc/sys/vm/drop_caches > > - Check the allocation counter: > > grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls > 30564391 nfs_readdir_add_to_array+0x73/0xd0 age=534558/4791307/6540952 pid=370-1048386 cpus=0-47 nodes=0-1 > > - Request a directory listing and check the allocation counters again: > > ls > [...] > grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls > 30564511 nfs_readdir_add_to_array+0x73/0xd0 age=207/4792999/6542663 pid=370-1048386 cpus=0-47 nodes=0-1 > > There are now 120 new allocations. > > - Drop all caches and check the counters again: > > echo 3 >/proc/sys/vm/drop_caches > grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls > 30564401 nfs_readdir_add_to_array+0x73/0xd0 age=735/4793524/6543176 pid=370-1048386 cpus=0-47 nodes=0-1 > > 110 allocations are gone, but 10 have leaked and will never be freed. > > Unhelpfully, those allocations are explicitly excluded from KMEMLEAK, > that's why my initial attempts with KMEMLEAK were not successful: > > /* > * Avoid a kmemleak false positive. The pointer to the name is stored > * in a page cache page which kmemleak does not scan. > */ > kmemleak_not_leak(string->name); > > It would be possible to solve this bug without reverting the whole > commit: > > - keep track of which pages were not used, and call > nfs_readdir_clear_array() on them, or > - manually link those pages into the page cache > > But for now I have decided to just revert the commit, because the real > fix would require complex considerations, risking more dangerous > (crash) bugs, which may seem unsuitable for the stable branches. > > Signed-off-by: Max Kellermann <mk@xxxxxxxxxx> > --- > fs/nfs/dir.c | 90 ++++------------------------------------------- > fs/nfs/internal.h | 3 +- > 2 files changed, 7 insertions(+), 86 deletions(-) No cc: stable@vger on this patch to get it backported? thanks, greg k-h