On Mon, 17 Jan 2022, Zou Mingzhe wrote: > 在 2022/1/8 上午4:20, bcache@xxxxxxxxxxxxxxxxxx 写道: > > On Fri, 7 Jan 2022, Coly Li wrote: > > On 1/7/22 4:21 PM, mingzhe.zou@xxxxxxxxxxxx wrote: > > From: Zou Mingzhe <mingzhe.zou@xxxxxxxxxxxx> > > When attaching a cached device (a.k.a backing device) to a cache > device, bch_sectors_dirty_init() is called to count dirty sectors > and stripes (see what bcache_dev_sectors_dirty_add() does) on the > cache device. > > When bcache_dev_sectors_dirty_add() is called, set_bit(stripe, > d->full_dirty_stripes) or clear_bit(stripe, d->full_dirty_stripes) > operation will always be performed. In full_dirty_stripes, each 1bit > represents stripe_size (8192) sectors (512B), so 1bit=4MB (8192*512), > and each CPU cache line=64B=512bit=2048MB. When 20 threads process > a cached disk with 100G dirty data, a single thread processes about > 23M at a time, and 20 threads total 460M. These full_dirty_stripes > bits corresponding to the 460M data is likely to fall in the same CPU > cache line. When one of these threads performs a set_bit or clear_bit > operation, the same CPU cache line of other threads will become invalid > and must read the full_dirty_stripes from the main memory again. Compared > with single thread, the time of a bcache_dev_sectors_dirty_add() > call is increased by about 50 times in our test (100G dirty data, > 20 threads, bcache_dev_sectors_dirty_add() is called more than > 20 million times). > > This patch tries to test_bit before set_bit or clear_bit operation. > Therefore, a lot of force set and clear operations will be avoided, > and most of bcache_dev_sectors_dirty_add() calls will only read CPU > cache line. > > Hi Mingzhe, good catch on this. I'm curious: how much time does it save > attaching a cache? > > Our test, 100G dirty data: > > register caching register backing > before coly multithread patch about 11 seconds about 3 seconds > after coly multithread patch about 0.8 seconds about 2 seconds > (0-39 online cpus, so 20 threads work) > this patch about 0.8 seconds about 0.3 seconds > (0-39 online cpus, so 20 threads work) Wow, almost 10x faster registering the backing device! -- Eric Wheeler > > Mingzhe > > > Signed-off-by: Mingzhe Zou <mingzhe.zou@xxxxxxxxxxxx> > > Added into my testing queue. Thanks. > > Coly Li > > --- > drivers/md/bcache/writeback.c | 11 +++++++---- > 1 file changed, 7 insertions(+), 4 deletions(-) > > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c > index 68e75c692dd4..4afe22875d4f 100644 > --- a/drivers/md/bcache/writeback.c > +++ b/drivers/md/bcache/writeback.c > @@ -596,10 +596,13 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, > unsigned int inode, > > sectors_dirty = atomic_add_return(s, > d->stripe_sectors_dirty + stripe); > - if (sectors_dirty == d->stripe_size) > - set_bit(stripe, d->full_dirty_stripes); > - else > - clear_bit(stripe, d->full_dirty_stripes); > + if (sectors_dirty == d->stripe_size) { > + if (!test_bit(stripe, d->full_dirty_stripes)) > + set_bit(stripe, d->full_dirty_stripes); > + } else { > + if (test_bit(stripe, d->full_dirty_stripes)) > + clear_bit(stripe, d->full_dirty_stripes); > + } > > nr_sectors -= s; > stripe_offset = 0; > > > > >