On 9/26/23, John Stoffel <john@xxxxxxxxxxx> wrote: >>>>>> "Mateusz" == Mateusz Guzik <mjguzik@xxxxxxxxx> writes: > >> Failed opens (mostly ENOENT) legitimately happen a lot, for example here >> are stats from stracing kernel build for few seconds (strace -fc make): > >> % time seconds usecs/call calls errors syscall >> ------ ----------- ----------- --------- --------- ------------------ >> 0.76 0.076233 5 15040 3688 openat > >> (this is tons of header files tried in different paths) > >> Apart from a rare corner case where the file object is fully constructed >> and we need to abort, there is a lot of overhead which can be avoided. > >> Most notably delegation of freeing to task_work, which comes with an >> enormous cost (see 021a160abf62 ("fs: use __fput_sync in close(2)" for >> an example). > >> Benched with will-it-scale with a custom testcase based on >> tests/open1.c: >> [snip] >> while (1) { >> int fd = open("/tmp/nonexistent", O_RDONLY); >> assert(fd == -1); > >> (*iterations)++; >> } >> [/snip] > >> Sapphire Rapids, one worker in single-threaded case (ops/s): >> before: 1950013 >> after: 2914973 (+49%) > > > So what are the times in a multi-threaded case? Just wondering what > happens if you have a bunch of makes or other jobs like that all > running at once. > On my kernel they heavily bottleneck on apparmor, I already mailed the author: https://lore.kernel.org/all/CAGudoHFfG7mARwSqcoLNwV81-KX4Bici5FQHjoNG4f9m83oLyg@xxxxxxxxxxxxxx/ maybe i'll hack up the fix When running without that LSM and on *stock* kernel it heavily bottlenecks somewhere in bowels of SLUB + RCU. Without LSM and with the patch it scales almost perfectly, as one would expect. I don't have numbers nor perf output handy. > >> Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx> >> --- >> fs/file_table.c | 39 +++++++++++++++++++++++++++++++++++++++ >> fs/namei.c | 2 +- >> include/linux/file.h | 1 + >> 3 files changed, 41 insertions(+), 1 deletion(-) > >> diff --git a/fs/file_table.c b/fs/file_table.c >> index ee21b3da9d08..320dc1f9aa0e 100644 >> --- a/fs/file_table.c >> +++ b/fs/file_table.c >> @@ -82,6 +82,16 @@ static inline void file_free(struct file *f) >> call_rcu(&f->f_rcuhead, file_free_rcu); >> } > >> +static inline void file_free_badopen(struct file *f) >> +{ >> + BUG_ON(f->f_mode & (FMODE_BACKING | FMODE_OPENED)); > > eww... what a BUG_ON() here? This seems *way* overkill to crash the > system here, and you don't even check if f exists first as well, since > I assume the caller checks it or already knows it? > > Why not just return an error here and keep going? What happens if you do? > The only caller already checked these flags, so I think BUGing out is prudent. > >> + security_file_free(f); >> + put_cred(f->f_cred); >> + if (likely(!(f->f_mode & FMODE_NOACCOUNT))) >> + percpu_counter_dec(&nr_files); >> + kmem_cache_free(filp_cachep, f); >> +} >> + >> /* >> * Return the total number of open files in the system >> */ >> @@ -468,6 +478,35 @@ void __fput_sync(struct file *file) >> EXPORT_SYMBOL(fput); >> EXPORT_SYMBOL(__fput_sync); > >> +/* >> + * Clean up after failing to open (e.g., open(2) returns with -ENOENT). >> + * >> + * This represents opportunities to shave on work in the common case >> compared >> + * to the usual fput: >> + * 1. vast majority of the time FMODE_OPENED is not set, meaning there is >> no >> + * need to delegate to task_work >> + * 2. if the above holds then we are guaranteed we have the only >> reference with >> + * nobody else seeing the file, thus no need to use atomics to release >> it >> + * 3. then there is no need to delegate freeing to RCU >> + */ >> +void fput_badopen(struct file *file) >> +{ >> + if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) { >> + fput(file); >> + return; >> + } >> + >> + if (WARN_ON(atomic_long_read(&file->f_count) != 1)) { >> + fput(file); >> + return; >> + } >> + >> + /* zero out the ref count to appease possible asserts */ >> + atomic_long_set(&file->f_count, 0); >> + file_free_badopen(file); >> +} >> +EXPORT_SYMBOL(fput_badopen); >> + >> void __init files_init(void) >> { >> filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, >> diff --git a/fs/namei.c b/fs/namei.c >> index 567ee547492b..67579fe30b28 100644 >> --- a/fs/namei.c >> +++ b/fs/namei.c >> @@ -3802,7 +3802,7 @@ static struct file *path_openat(struct nameidata >> *nd, >> WARN_ON(1); >> error = -EINVAL; >> } >> - fput(file); >> + fput_badopen(file); >> if (error == -EOPENSTALE) { >> if (flags & LOOKUP_RCU) >> error = -ECHILD; >> diff --git a/include/linux/file.h b/include/linux/file.h >> index 6e9099d29343..96300e27d9a8 100644 >> --- a/include/linux/file.h >> +++ b/include/linux/file.h >> @@ -15,6 +15,7 @@ >> struct file; > >> extern void fput(struct file *); >> +extern void fput_badopen(struct file *); > >> struct file_operations; >> struct task_struct; >> -- >> 2.39.2 > > -- Mateusz Guzik <mjguzik gmail.com>