On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet <edumazet@xxxxxxxxxx> wrote:

> On Tue, May 5, 2020 at 4:54 AM SeongJae Park <sjpark@xxxxxxxxxx> wrote:
> >
> > CC-ing stable@xxxxxxxxxxxxxxx and adding some more explanations.
> >
> > On Tue, 5 May 2020 10:10:33 +0200 SeongJae Park <sjpark@xxxxxxxxxx> wrote:
> >
> > > From: SeongJae Park <sjpark@xxxxxxxxx>
> > >
> > > The commit 6d7855c54e1e ("sockfs: switch to ->free_inode()") made the
> > > deallocation of 'socket_alloc' to be done asynchronously using RCU, as
> > > same to 'sock.wq'. And the following commit 333f7909a857 ("coallocate
> > > socket_wq with socket itself") made those to have same life cycle.
> > >
> > > The changes made the code much more simple, but also made 'socket_alloc'
> > > live longer than before. For the reason, user programs intensively
> > > repeating allocations and deallocations of sockets could cause memory
> > > pressure on recent kernels.
> >
> > I found this problem on a production virtual machine utilizing 4GB memory while
> > running lebench[1]. The 'poll big' test of lebench opens 1000 sockets, polls
> > and closes those. This test is repeated 10,000 times. Therefore it should
> > consume only 1000 'socket_alloc' objects at once. As size of socket_alloc is
> > about 800 Bytes, it's only 800 KiB. However, on the recent kernels, it could
> > consume up to 10,000,000 objects (about 8 GiB). On the test machine, I
> > confirmed it consuming about 4GB of the system memory and results in OOM.
> >
> > [1] https://github.com/LinuxPerfStudy/LEBench
>
> To be fair, I have not backported Al patches to Google production
> kernels, nor I have tried this benchmark.
>
> Why do we have 10,000,000 objects around ? Could this be because of
> some RCU problem ?

Mainly because of a long RCU grace period, as you guessed. I have no idea how
the grace period became so long in this case.

As my test machine was a virtual machine instance, I guess a problem like RCU
reader preemption[1] might have affected this.
[1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf

>
> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?

No. Neither the old kernel that predates Al's patches nor the recent kernel
with Al's patches reverted reproduced the problem.


Thanks,
SeongJae Park

>
> Thanks.
>
> >
> > >
> > > To avoid the problem, this commit reverts the changes.
> >
> > I also tried to make fixup rather than reverts, but I couldn't easily find
> > simple fixup. As the commits 6d7855c54e1e and 333f7909a857 were for code
> > refactoring rather than performance optimization, I thought introducing complex
> > fixup for this problem would make no sense. Meanwhile, the memory pressure
> > regression could affect real machines. To this end, I decided to quickly
> > revert the commits first and consider better refactoring later.
> >
> >
> > Thanks,
> > SeongJae Park
> >
> > >
> > > SeongJae Park (2):
> > >   Revert "coallocate socket_wq with socket itself"
> > >   Revert "sockfs: switch to ->free_inode()"
> > >
> > >  drivers/net/tap.c      |  5 +++--
> > >  drivers/net/tun.c      |  8 +++++---
> > >  include/linux/if_tap.h |  1 +
> > >  include/linux/net.h    |  4 ++--
> > >  include/net/sock.h     |  4 ++--
> > >  net/core/sock.c        |  2 +-
> > >  net/socket.c           | 23 ++++++++++++++++-------
> > >  7 files changed, 30 insertions(+), 17 deletions(-)
> > >
> > > --
> > > 2.17.1
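
For reference, the memory figures quoted in this thread can be reproduced with a small toy model. This is only a sketch, not kernel code: the function name and the `grace_iters` parameter are hypothetical, and the model simply assumes that objects closed in one test iteration are reclaimed a fixed number of iterations later. The socket count, repetition count, and object size are the approximate numbers from the report.

```python
from collections import deque

SOCKETS_PER_ITERATION = 1_000  # sockets opened per 'poll big' iteration (from the report)
ITERATIONS = 10_000            # repetitions of the test (from the report)
OBJECT_SIZE_BYTES = 800        # approximate size of one 'socket_alloc' (from the report)


def peak_live_objects(sockets_per_iter, iterations, grace_iters):
    """Return the peak number of unreclaimed objects in a toy model.

    Hypothetical model, not actual RCU behavior: objects closed in
    iteration j are reclaimed at the start of the first iteration i
    with i - j >= grace_iters.
    """
    pending = deque()  # (iteration closed, object count), oldest first
    live = 0
    peak = 0
    for i in range(iterations):
        # Reclaim every batch whose modeled grace period has elapsed.
        while pending and i - pending[0][0] >= grace_iters:
            live -= pending.popleft()[1]
        # Open this iteration's sockets and record the high-water mark.
        live += sockets_per_iter
        peak = max(peak, live)
        # Close them: the memory is queued for deferred reclaim.
        pending.append((i, sockets_per_iter))
    return peak


# Prompt reclaim: only one iteration's objects exist at a time.
prompt_peak = peak_live_objects(SOCKETS_PER_ITERATION, ITERATIONS, grace_iters=0)
# No reclaim before the run ends: every object ever allocated is still held.
deferred_peak = peak_live_objects(SOCKETS_PER_ITERATION, ITERATIONS, grace_iters=ITERATIONS)

print(prompt_peak * OBJECT_SIZE_BYTES)    # 800000 bytes, the "about 800 KiB" case
print(deferred_peak * OBJECT_SIZE_BYTES)  # 8000000000 bytes, the "about 8 GiB" case
```

Under these assumptions the footprint grows linearly with the length of the modeled grace period, which matches the observation that a long grace period on a 4GB machine can end in OOM well before the upper bound is reached.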