Hi Jeff-

> On Jul 5, 2022, at 2:48 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Tue, 2022-07-05 at 14:50 +0000, Chuck Lever III wrote:
>> Hello Dai -
>>
>> I agree that tackling resource management is indeed an appropriate
>> next step for courteous server. Thanks for taking this on!
>>
>> More comments are inline.
>>
>>
>>> On Jul 4, 2022, at 3:05 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
>>>
>>> Currently the idle timeout for a courtesy client is fixed at 1 day.
>>> If lots of courtesy clients remain in the system, they can cause a
>>> memory shortage that affects the operation of other modules in the
>>> kernel. This problem can be observed by running the pynfs nfs4.0
>>> CID5 test in a loop. Eventually the system runs out of memory and
>>> rpc.gssd fails to add a new watch:
>>>
>>> rpc.gssd[3851]: ERROR: inotify_add_watch failed for nfsd4_cb/clnt6c2e:
>>> No space left on device
>>>
>>> and alloc_inode also fails with out of memory:
>>>
>>> Call Trace:
>>> <TASK>
>>> dump_stack_lvl+0x33/0x42
>>> dump_header+0x4a/0x1ed
>>> oom_kill_process+0x80/0x10d
>>> out_of_memory+0x237/0x25f
>>> __alloc_pages_slowpath.constprop.0+0x617/0x7b6
>>> __alloc_pages+0x132/0x1e3
>>> alloc_slab_page+0x15/0x33
>>> allocate_slab+0x78/0x1ab
>>> ? alloc_inode+0x38/0x8d
>>> ___slab_alloc+0x2af/0x373
>>> ? alloc_inode+0x38/0x8d
>>> ? slab_pre_alloc_hook.constprop.0+0x9f/0x158
>>> ? alloc_inode+0x38/0x8d
>>> __slab_alloc.constprop.0+0x1c/0x24
>>> kmem_cache_alloc_lru+0x8c/0x142
>>> alloc_inode+0x38/0x8d
>>> iget_locked+0x60/0x126
>>> kernfs_get_inode+0x18/0x105
>>> kernfs_iop_lookup+0x6d/0xbc
>>> __lookup_slow+0xb7/0xf9
>>> lookup_slow+0x3a/0x52
>>> walk_component+0x90/0x100
>>> ? inode_permission+0x87/0x128
>>> link_path_walk.part.0.constprop.0+0x266/0x2ea
>>> ? path_init+0x101/0x2f2
>>> path_lookupat+0x4c/0xfa
>>> filename_lookup+0x63/0xd7
>>> ? getname_flags+0x32/0x17a
>>> ? kmem_cache_alloc+0x11f/0x144
>>> ? getname_flags+0x16c/0x17a
>>> user_path_at_empty+0x37/0x4b
>>> do_readlinkat+0x61/0x102
>>> __x64_sys_readlinkat+0x18/0x1b
>>> do_syscall_64+0x57/0x72
>>> entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>
>> These details are a little distracting. IMO you can summarize
>> the above with just this:
>>
>>>> Currently the idle timeout for a courtesy client is fixed at 1 day.
>>>> If lots of courtesy clients remain in the system, they can cause a
>>>> memory shortage. This problem can be observed by running the pynfs
>>>> nfs4.0 CID5 test in a loop.
>>
>>
>> Now I'm going to comment in reverse order here. To add context
>> for others on-list, when we designed courteous server, we had
>> assumed that eventually a shrinker would be used to garbage
>> collect courtesy clients. Dai has found some issues with that
>> approach:
>>
>>> The shrinker method was evaluated and found to be unsuitable
>>> for this problem for these reasons:
>>>
>>> . destroying the NFSv4 client in the shrinker context can cause
>>> a deadlock, since nfsd_file_put calls into the underlying FS
>>> code and we have no control over what it will do, as seen in
>>> this stack trace:
>>
>> [ ... stack trace snipped ... ]
>>
>> I think I always had in mind that only the laundromat would be
>> responsible for harvesting courtesy clients. A shrinker might
>> trigger that activity, but as you point out, a deadlock is pretty
>> likely if the shrinker itself had to do the harvesting.
>>
>>
>>> . destroying the NFSv4 client has significant overhead due to
>>> the upcall to user space to remove the client record, which
>>> might access the storage device. There is a potential deadlock
>>> if the storage subsystem needs to allocate memory.
>>
>> The issue is that harvesting a courtesy client will involve
>> an upcall to nfsdcltracker, and that will result in I/O that
>> updates the tracker's database. Very likely this will require
>> further allocation of memory, and thus it could deadlock the
>> system.
>>
>> Now this might also be all the demonstration that we need
>> that managing courtesy resources cannot be done using the
>> system's shrinker facility -- expiring a client can never
>> be done when there is a direct reclaim waiting on it. I'm
>> interested in other opinions on that. Neil? Bruce? Trond?
>>
>
> That is potentially an ugly problem, but if you hit it then you really
> are running the host at the redline.

Exactly. I'm just not sure how much we can do to keep a system
stable once it is pushed to that point, so I don't think we should
be optimizing for that state. My concern is whether larger systems
can be pushed to that state by drive-by DoS attacks.

> Do you need to "shrink" synchronously? The scan_objects routine is
> supposed to return the number of entries freed. We could (in principle)
> always return 0, and wake up the laundromat to do the "real" shrinking.
> It might not help out as much with direct reclaim, but it might still
> help.

I suggested that as well. IIRC Dai said it still doesn't keep the
server from toppling over. I would like more information about what
the final straw is and whether a "return 0 and kick the laundromat"
shrinker still provides some benefit (a rough sketch of that approach
is appended below).

>>> . due to the overhead associated with removing client records,
>>> there is a limit of 128 clients to be trimmed in each
>>> laundromat run. This is done to prevent the laundromat from
>>> spending so long destroying clients that it misses performing
>>> its other tasks in a timely manner.
>>>
>>> . the laundromat is scheduled to run sooner if there are more
>>> courtesy clients that need to be destroyed.
>>
>> Both of these last two changes seem sensible. Can they be
>> broken out so they can be applied immediately?
>>
>
> I forget... is there a hard (or soft) cap on the number of courtesy
> clients that can be in play at a time? Adding such a cap might be
> another option if we're concerned about this.

The current cap is that courtesy clients stay around for no longer
than 24 hours. The server doesn't cap the number of courtesy clients,
though it could impose a limit based on the physical memory size of
the host, as we do with other resources. That is also imperfect, but
it might be better than nothing.

--
Chuck Lever
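
[Appended for context: a minimal sketch of the "return 0 and kick the
laundromat" shrinker idea Jeff and Chuck discuss above. This is not
the patch under review, only an illustration of the shape such a
shrinker might take. nfsd_courtesy_client_count() and
nfsd_kick_laundromat() are made-up placeholder helpers standing in
for "count the courtesy clients" and "schedule the laundromat to run
now"; the struct shrinker callbacks, DEFAULT_SEEKS, and
register_shrinker() are existing kernel interfaces.]

/*
 * Illustrative sketch only -- not the patch under discussion.  The
 * shrinker never expires clients in reclaim context (where the
 * nfsd_file_put and nfsdcltracker-upcall deadlocks described above
 * can bite); it only reports memory pressure and pokes the
 * laundromat, which expires courtesy clients from its own workqueue.
 */
#include <linux/shrinker.h>

static unsigned long
nfsd_courtesy_shrinker_count(struct shrinker *shrink,
			     struct shrink_control *sc)
{
	/* Placeholder helper: how many courtesy clients exist now. */
	unsigned long count = nfsd_courtesy_client_count();

	if (count)
		/* Placeholder helper: e.g. mod_delayed_work(..., 0)
		 * on the laundromat's delayed work. */
		nfsd_kick_laundromat();
	return count;
}

static unsigned long
nfsd_courtesy_shrinker_scan(struct shrinker *shrink,
			    struct shrink_control *sc)
{
	/*
	 * Expiring a client needs memory (cltrack upcall, filesystem
	 * I/O), so never do it here where we may be in direct reclaim.
	 * Returning 0 (or SHRINK_STOP) tells reclaim that nothing was
	 * freed; the laundromat does the real work later.
	 */
	return 0;
}

static struct shrinker nfsd_courtesy_shrinker = {
	.count_objects	= nfsd_courtesy_shrinker_count,
	.scan_objects	= nfsd_courtesy_shrinker_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* Registered once at nfsd startup, e.g.:
 *	register_shrinker(&nfsd_courtesy_shrinker);
 * (single-argument form; newer kernels also take a name argument)
 */

Whether this actually helps is exactly the open question in the
thread: it avoids the reclaim-context deadlocks, but direct reclaim
gets no immediate relief, so the laundromat still has to keep up with
the rate at which courtesy clients accumulate.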