On Fri, 2015-10-30 at 14:50 -0700, Linus Torvalds wrote:
> On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > On Fri, Oct 30, 2015 at 2:02 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Your variant has 1:64 ratio; obviously better than now, but we can actually
> >> do 1:bits-per-cacheline quite easily.
> >
> > Ok, but in that case you end up needing a counter for each cacheline
> > too (to count how many bits, in order to know when to say "cacheline
> > is entirely full").
>
> So here's a largely untested version of my "one bit per word"
> approach. It seems to work, but looking at it, I'm unhappy with a few
> things:
>
>  - using kmalloc() for the .full_fds_bits[] array is simple, but
> disgusting, since 99% of all programs just have a single word.
>
>    I know I talked about just adding the allocation to the same one
> that allocates the bitmaps themselves, but I got lazy and didn't do
> it. Especially since that code seems to try fairly hard to make the
> allocations nice powers of two, according to the comments. That may
> actually matter from an allocation standpoint.
>
>  - Maybe we could just use that "full_fds_bits_init" field for when a
> single word is sufficient, and avoid the kmalloc that way?

At least make sure the allocation uses a cache line, so that multiple processes
do not share same cache line for this possibly hot field

	fdt->full_fds_bits = kzalloc(max_t(size_t,
					   L1_CACHE_BYTES,
					   BITBIT_SIZE(nr)),
				     GFP_KERNEL);

>
> Anyway. This is a pretty simple patch, and I actually think that we
> could just get rid of the "next_fd" logic entirely with this. That
> would make this *patch* be more complicated, but it would make the
> resulting *code* be simpler.
>
> Hmm? Want to play with this? Eric, what does this do to your test-case?

Excellent results so far Linus, 500 % increase, thanks a lot !

Tested using 16 threads, 8 on Socket0, 8 on Socket1

Before patch :

# ulimit -n 12000000
# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 636870

After patch :

taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 3845134

(6 times better)

Your patch out-performs the O_FD_FASTALLOC one on this particular test
by ~ 9 % :

taskset ff0ff ./opensock -t 16 -n 10000000 -l 10 -f
count=10000000 (check/increase ulimit -n)
total = 3505252

If I raise to 48 threads, the FAST_ALLOC wins by 5 % (3752087 instead
of 3546666).

Oh, and 48 threads without any patches : 383027

-> program runs one order of magnitude faster, congrats !
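
For readers following the thread, the "one bit per word" idea quoted above
can be pictured as a two-level bitmap: one summary bit per word of the
open-fd bitmap, set only when that word is completely full, so the search
for a free descriptor can skip 64 fds at a time. What follows is a minimal
userspace sketch under that assumption, not the patch under discussion.
Only the name full_fds_bits comes from the thread; struct fd_bitmap and all
the helper functions are hypothetical, and the sketch is limited to a
single summary word with no locking and no resizing.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
/* One summary word covers BITS_PER_WORD bitmap words (illustrative limit). */
#define MAX_FDS (BITS_PER_WORD * BITS_PER_WORD)

/* Hypothetical structure; only "full_fds_bits" is a name from the thread. */
struct fd_bitmap {
	unsigned long open_fds[BITS_PER_WORD]; /* one bit per descriptor */
	unsigned long full_fds_bits;           /* bit w set: open_fds[w] is full */
};

void fd_bitmap_init(struct fd_bitmap *m)
{
	memset(m, 0, sizeof(*m));
}

void fd_set_open(struct fd_bitmap *m, size_t fd)
{
	size_t w = fd / BITS_PER_WORD;

	assert(fd < MAX_FDS);
	m->open_fds[w] |= 1UL << (fd % BITS_PER_WORD);
	/* Mark the word in the summary once its last free bit is taken. */
	if (m->open_fds[w] == ~0UL)
		m->full_fds_bits |= 1UL << w;
}

void fd_set_closed(struct fd_bitmap *m, size_t fd)
{
	size_t w = fd / BITS_PER_WORD;

	assert(fd < MAX_FDS);
	m->open_fds[w] &= ~(1UL << (fd % BITS_PER_WORD));
	/* The word has at least one free bit again. */
	m->full_fds_bits &= ~(1UL << w);
}

/* Index of the lowest clear bit, or BITS_PER_WORD if every bit is set. */
size_t first_zero_bit(unsigned long word)
{
	size_t i;

	for (i = 0; i < BITS_PER_WORD; i++)
		if (!(word & (1UL << i)))
			return i;
	return BITS_PER_WORD;
}

/* Lowest free descriptor, or MAX_FDS if the whole map is full. */
size_t fd_find_free(const struct fd_bitmap *m)
{
	/* Level 1: skip every word that the summary reports as full. */
	size_t w = first_zero_bit(m->full_fds_bits);

	if (w == BITS_PER_WORD)
		return MAX_FDS;
	/* Level 2: this word is guaranteed to contain a clear bit. */
	return w * BITS_PER_WORD + first_zero_bit(m->open_fds[w]);
}
```

The summary bit needs no per-word counter because "full" is simply the
word comparing equal to all ones, which is the 1:64 ratio mentioned in the
quoted exchange. A real implementation would presumably use the kernel's
own bit-search helpers rather than the open-coded loop shown here.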