On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote:
> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@xxxxxxx> wrote:
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about putting implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> >  	unsigned int maxbit = maxfd / BITS_PER_LONG;
> >  	unsigned int bitbit = start / BITS_PER_LONG;
> >
> > +	/*
> > +	 * Optimistically search the first long of the open_fds bitmap. It
> > +	 * saves us from loading full_fds_bits into cache in the common case
> > +	 * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > +	 * a good chance there's a bit free in there.
> > +	 */
> > +	if (start < BITS_PER_LONG) {
> > +		unsigned int bit;
> > +
> > +		bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > +		if (bit < BITS_PER_LONG)
> > +			return bit;
> > +	}
> > +
> >  	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> >  	if (bitbit >= maxfd)
> >  		return maxfd;
> >
> > Plus your optimizations with likely / unlikely. This way the code flow in
> > alloc_fd() stays more readable, we avoid loading the first open_fds long
> > into cache if it is full, and we should get all the performance benefits?
> >
>
> Huh.
>
> So when I read the patch previously I assumed this is testing the bit
> word for the map containing next_fd (whatever it is), avoiding looking
> at the higher level bitmap and inlining the op (instead of calling the
> fully fledged func for bit scans).
>
> I did not mentally register this is in fact only checking for the
> beginning of the range of the entire thing. So apologies from my end
> as based on my feedback some work was done and I'm going to ask to
> further redo it.
>
> blogbench spawns 100 or so workers, say total fd count hovers just
> above 100. say this lines up with about half of more cases in practice
> for that benchmark.
>
> Even so, that's a benchmark-specific optimization. A busy web server
> can have literally tens of thousands of fds open (and this is a pretty
> mundane case), making the 0-63 range not particularly interesting.
>
> That aside I think the patchset is in the wrong order -- first patch
> tries to not look at the higher level bitmap, while second reduces
> stores made there. This makes it quite unclear how much is it worth to
> reduce looking there if atomics are conditional.
>
> So here is what I propose in terms of the patches:
> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
> call avoidance; no measurements done vs stock kernel for some effort
> saving, just denote in the commit message there is less work under the
> lock and treat it as baseline
> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
> 3. open_fds check within the range containing fd, avoiding higher
> level bitmap if a free slot is found. this should not result in any
> func calls if successful; benchmarked against the above
>
> Optionally the bitmap routines can grow variants which always inline
> and are used here. If so that would probably land between 1 and 2 on
> the list.
>
> You noted you know about blogbench bugs and have them fixed.
> Would be good to post a link to a pull request or some other spot
> for a reference.
>
> I'll be best if the vfs folk comment on what they want here.

Optimizing only the start < BITS_PER_LONG case seems less desirable than
making it work for arbitrary next_fd. IMHO, it'll also be easier to follow
if all allocations go through the same logic.
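
Roughly what I have in mind is the sketch below -- untested and only
illustrative (the wordend local and exact placement are mine), and it
assumes max_fds stays a multiple of BITS_PER_LONG as it is today. It probes
the open_fds word that holds start before ever touching full_fds_bits,
which should also cover the "range containing fd" part of point 3 above
(though it still calls the generic bit-scan helper rather than open-coding
the single-word check):

static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
	unsigned int maxfd = fdt->max_fds;
	unsigned int maxbit = maxfd / BITS_PER_LONG;
	unsigned int bitbit = start / BITS_PER_LONG;

	/*
	 * Optimistically scan only the open_fds word containing start
	 * (i.e. files->next_fd). If it has a free slot we are done and
	 * full_fds_bits is never brought into the cache, no matter how
	 * large next_fd has grown.
	 */
	if (start < maxfd) {
		unsigned int wordend = (bitbit + 1) * BITS_PER_LONG;
		unsigned int bit;

		bit = find_next_zero_bit(fdt->open_fds, wordend, start);
		if (bit < wordend)
			return bit;
	}

	/* Slow path: use the second-level bitmap to skip over full words. */
	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
	if (bitbit >= maxfd)
		return maxfd;
	if (bitbit > start)
		start = bitbit;
	return find_next_zero_bit(fdt->open_fds, maxfd, start);
}

Whether the extra branch pays off outside of blogbench would of course need
the same benchmarking as the rest of the series.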
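
For completeness, what I read into "conditional higher level bitmap clear"
in point 2 is something like the below in __clear_open_fd() -- again just a
sketch of the idea, not necessarily what was submitted:

static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
	__clear_bit(fd, fdt->open_fds);
	fd /= BITS_PER_LONG;
	/*
	 * Only dirty the full_fds_bits cacheline when the word really was
	 * marked full; most of the time the bit is already clear.
	 */
	if (test_bit(fd, fdt->full_fds_bits))
		__clear_bit(fd, fdt->full_fds_bits);
}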