Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes: > On Wed, Feb 12, 2020 at 10:37:52PM -0600, Eric W. Biederman wrote: > >> I think I have an alternate idea that could work. Add some extra code >> into proc_task_readdir, that would look for dentries that no longer >> point to tasks and d_invalidate them. With the same logic probably >> being called from a few more places as well like proc_pid_readdir, >> proc_task_lookup, and proc_pid_lookup. >> >> We could even optimize it and have a process died flag we set in the >> superblock. >> >> That would would batch up the freeing work until the next time someone >> reads from proc in a way that would create more dentries. So it would >> prevent dentries from reaped zombies from growing without bound. >> >> Hmm. Given the existence of proc_fill_cache it would really be a good >> idea if readdir and lookup performed some of the freeing work as well. >> As on readdir we always populate the dcache for all of the directory >> entries. > > First of all, that won't do a damn thing when nobody is accessing > given superblock. What's more, readdir in root of that procfs instance > is not enough - you need it in task/ of group leader. It should give a rough bound on the number of stale dentries a superblock can have. The same basic concept has been used very successfully in many incremental garbage collectors. In those malloc (or the equivalent) does a finite amount of garbage collection work to roughly balance out the amount of memory allocated. I am proposing something similar for proc instances. Further if no one is accessing a superblock we don't have a problem either. > What I don't understand is the insistence on getting those dentries > via dcache lookups. _IF_ we are willing to live with cacheline > contention (on ->d_lock of root dentry, if nothing else), why not > do the following: No insistence from this side. I was not seeing atomic_inc_not_zero(sb->s_active) from rcu context as option earlier. But it is an option. > * put all dentries of such directories ([0-9]* and [0-9]*/task/*) > into a list anchored in task_struct; have non-counting reference to > task_struct stored in them (might simplify part of get_proc_task() users, > BTW - avoids pid-to-task_struct lookups if we have a dentry and not just > the inode; many callers do) > * have ->d_release() remove from it (protecting per-task_struct lock > nested outside of all ->d_lock) > * on exit: > lock the (per-task_struct) list > while list is non-empty > pick the first dentry > remove from the list > sb = dentry->d_sb > try to bump sb->s_active (if non-zero, that is). > if failed > continue // move on to the next one - nothing to do here > grab ->d_lock > res = handle_it(dentry, &temp_list) > drop ->d_lock > unlock the list > if (!list_empty(&temp_list)) > shrink_dentry_list(&temp_list) > if (res) > d_invalidate(dentry) > dput(dentry) > deactivate_super(sb) > lock the list > unlock the list > > handle_it(dentry, temp_list) // ->d_lock held; that one should be in dcache.c > if ->d_count is negative // unlikely > return 0; > if ->d_count is positive, > increment ->d_count > return 1; > // OK, it's still alive, but ->d_count is 0 > __d_drop // equivalent of d_invalidate in this case > if not on a shrink list // otherwise it's not our headache > if on lru list > d_lru_del > d_shrink_add dentry to temp_list > return 0; > > And yeah, that'll dirty ->s_active for each procfs superblock that > has dentry for our process present in dcache. On exit()... I would thread the whole thing through the proc_inode instead of coming up with a new allocation per dentry so an extra memory allocation isn't needed. We already have i_dentry. So going from the vfs_inode to the dentry is trivial. But truthfully I don't like proc_flush_task. The problem is that proc_flush_task is a layering violation and magic code that pretty much no one understands. We have some very weird cases where dput or d_invalidate wound up triggering ext3 code. It has been fixed for a long time now, but it wasy crazy weird unexpected stuff. Al your logic above just feels very clever, and like many pieces of the kernel have to know how other pieces of the kernel work. If we can find something stupid and simple that also solves the problem I would be much happier. Than anyone could understand and fix it if something goes wrong. Eric