Hello Jens,

On Sun, 2022-01-09 at 19:43 -0700, Jens Axboe wrote:
> On 1/9/22 7:38 PM, Ming Lei wrote:
> > On Sun, Jan 09, 2022 at 06:54:21PM -0700, Jens Axboe wrote:
> > > On 1/9/22 6:50 PM, Ming Lei wrote:
> > > > Only the last sbitmap_word can have a different depth; all the
> > > > others must have the same depth of 1U << sb->shift, so it is not
> > > > necessary to store it in sbitmap_word, and it can be retrieved
> > > > easily and efficiently by adding one internal helper,
> > > > __map_depth(sb, index).
> > > >
> > > > No performance effect was seen when running an iops test on
> > > > null_blk.
> > > >
> > > > This way saves one cacheline (usually 64 bytes) per
> > > > sbitmap_word.
> > >
> > > We probably want to kill the ____cacheline_aligned_in_smp from
> > > 'word' as well.
> >
> > But sbitmap_deferred_clear_bit() is called in the put fast path, so
> > the cacheline would become shared with the get path, and I guess
> > that isn't expected.
>
> Just from 'word', not from 'cleared'. They will still be in separate
> cache lines, but it usually doesn't make sense to have the leading
> member marked as cacheline aligned; that's a whole-struct property at
> that point.

While discussing this - is there any data on how many separate cache
lines (for either "word" or "cleared") are beneficial for performance?

For bitmap sizes between 4 and 512 bits (on x86_64), the code generates
layouts with 4-8 cache lines, but above that, the number of cache lines
grows linearly with the bitmap size. I am wondering whether we should
consider utilizing more of the allocated memory once a certain number
of separate cache lines is exceeded, by accessing additional words in
the existing cache lines.

Could you comment on that?

Thanks,
Martin
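
P.S.: For reference, my reading of the layout under discussion is
roughly the sketch below. It is based on the patch description rather
than the exact code in the series, so the comments and the remark about
sbitmap_calculate_shift() reflect my own understanding:

    struct sbitmap_word {
            /* free bits handed out by the get path */
            unsigned long word;

            /* deferred-cleared bits, set by the put fast path; kept on
             * a separate cache line so puts don't bounce the get line
             */
            unsigned long cleared ____cacheline_aligned_in_smp;
    } ____cacheline_aligned_in_smp;

    /*
     * Every word except the last covers a full 1U << sb->shift bits,
     * so the per-word depth no longer needs to be stored:
     */
    static inline unsigned int __map_depth(const struct sbitmap *sb,
                                           int index)
    {
            if (index == sb->map_nr - 1)
                    return sb->depth - (index << sb->shift);
            return 1U << sb->shift;
    }

If I read sbitmap_calculate_shift() correctly, the shift is only
reduced for small depths, so beyond that every additional BITS_PER_LONG
bits adds another sbitmap_word - and with it another pair of sparsely
used cache lines - which is what the question above is aiming at.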