Re: [PATCH v2] bcache: allow allocator to invalidate bucket in gc

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Hi Coly,

> Can I know in which kernel version you tested the patch?

I tested in both Linux kernels 5.10 and 6.1.

> I didn’t observe an obvious performance advantage from this patch.

This patch doesn't improve bcache performance per se. Instead, it eliminates the IO stall that occurs when GC runs and bch_allocator_thread() blocks waiting for GC to finish:

/*
 * We've run out of free buckets, we need to find some buckets
 * we can invalidate. First, invalidate them in memory and add
 * them to the free_inc list:
 */
retry_invalidate:
	allocator_wait(ca, ca->set->gc_mark_valid &&   <--------
		       !ca->invalidate_needs_gc);
	invalidate_buckets(ca);

From what you showed, it looks like your rebase is good. As you already noticed, the original patch was based on a 4.x kernel, so the bucket traversal in btree.c needs to be adapted for 5.x and 6.x kernels. I attached the patch rebased onto the 6.9 HEAD for your reference.

But to observe the IO stall before the patch, please test with a read-write workload so GC happens often enough (a read-only or read-mostly workload doesn't show the problem). I used the "fio" utility to generate a random read-write workload as follows.

# Pre-generate a 900GB test file
$ truncate -s 900G test

# Run random read-write workload for 1 hour
$ fio --time_based --runtime=3600s --ramp_time=2s --ioengine=libaio --name=latency_test --filename=test --bs=8k --iodepth=1 --size=900G --readwrite=randrw --verify=0 --write_lat_log=lat --log_avg_msec=1000 --log_max_value=1

We include the flags "--write_lat_log=lat --log_avg_msec=1000 --log_max_value=1" so fio will dump the second-by-second max latency into a log file at the end of the test, letting us see when a stall happens and for how long:

E.g.

$ more lat_lat.1.log
(format: <time-ms>,<max-latency-ns>,,,)
...
777000, 5155548, 0, 0, 0
778000, 105551, 1, 0, 0
802615, 24276019570, 0, 0, 0 <---- stalls for 24s with no IO possible
802615, 82134, 1, 0, 0
804000, 9944554, 0, 0, 0
805000, 7424638, 1, 0, 0
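To avoid eyeballing the log, the stalls can also be pulled out with a few lines of Python. This is just a rough sketch; the 1-second threshold and the field layout are my assumptions based on the excerpt above:

```python
# Scan a fio latency log (from --write_lat_log) for IO stalls.
# Assumed line format: <time-ms>, <max-latency-ns>, <r/w>, <bs>, <offset>

def find_stalls(lines, threshold_ns=1_000_000_000):
    """Return (time_ms, latency_s) pairs whose max latency exceeds the threshold."""
    stalls = []
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        time_ms, lat_ns = int(fields[0]), int(fields[1])
        if lat_ns >= threshold_ns:
            stalls.append((time_ms, lat_ns / 1e9))
    return stalls

# Example: print every stall longer than 1 second
# for t, s in find_stalls(open("lat_lat.1.log")):
#     print(f"stall at t={t} ms: {s:.1f} s")
```

For the excerpt above, this flags only the 24-second entry at t=802615 ms.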

I used a 375 GB local SSD (cache device) and 1 TB of network-attached storage (backing device). In the 1-hour run, GC starts about 10 minutes in and then recurs at roughly 5-minute intervals. The stall duration ranges from a few seconds at the beginning to close to 40 seconds towards the end. Only about 1/2 to 2/3 of the cache is used by the end.

Note that this patch doesn't shorten GC either. Instead, it just keeps GC from blocking the allocator thread: at the start of GC it quickly sweeps the buckets and marks the reclaimable ones, so the allocator can proceed while GC continues its actual job.

We are eagerly looking forward to this patch being merged in the coming merge window, which is expected to open in a week or two.

Thanks
Robert


On Fri, May 3, 2024 at 11:28 AM Coly Li <colyli@xxxxxxx> wrote:


> 2024年5月4日 02:23,Coly Li <colyli@xxxxxxx> 写道:
>
>
>
>> 2024年4月11日 14:44,Robert Pang <robertpang@xxxxxxxxxx> 写道:
>>
>> Hi Coly,
>>
>> Thank you for submitting it in the next merge window. This patch is
>> very critical because the long IO stall measured in tens of seconds
>> every hour is a serious issue making bcache unusable when it happens.
>> So we look forward to this patch.
>>
>> Speaking of this GC issue, we gathered the bcache btree GC stats after
>> our fio benchmark on a 375GB SSD cache device with 256kB bucket size:
>>
>> $ grep . /sys/fs/bcache/31c945a7-d96c-499b-945c-d76a1ab0beda/internal/btree_gc_*
>> /sys/fs/bcache/31c945a7-d96c-499b-945c-d76a1ab0beda/internal/btree_gc_average_duration_ms:45293
>> /sys/fs/bcache/31c945a7-d96c-499b-945c-d76a1ab0beda/internal/btree_gc_average_frequency_sec:286
>> /sys/fs/bcache/31c945a7-d96c-499b-945c-d76a1ab0beda/internal/btree_gc_last_sec:212
>> /sys/fs/bcache/31c945a7-d96c-499b-945c-d76a1ab0beda/internal/btree_gc_max_duration_ms:61986
>> $ more /sys/fs/bcache/31c945a7-d96c-499b-945c-d76a1ab0beda/internal/btree_nodes
>> 5876
>>
>> However, fio directly on the SSD device itself shows pretty good performance:
>>
>> Read IOPS 14,100 (110MiB/s)
>> Write IOPS 42,200 (330MiB/s)
>> Latency: 106.64 microseconds
>>
>> Can you shed some light on why GC takes so long (avg 45 seconds) given
>> the SSD speed? And is there any way or setting to reduce the GC time
>> or lower the GC frequency?
>>
>> One interesting thing we observed is when the SSD is encrypted via
>> dm-crypt, the GC time is shortened ~80% to be under 10 seconds. Is it
>> possible that GC writes the blocks one-by-one synchronously, and
>> dm-crypt's internal queuing and buffering mitigates the GC IO latency?
>
> Hi Robert,
>
> Can I know in which kernel version you tested the patch?
>

Sorry, I missed a bit more information here.

> I did a patch rebase and applied it on Linux v6.9. With a 4TB SSD as the cache device, I didn’t observe an obvious performance advantage from this patch.

When I didn’t see an obvious performance advantage, the testing was on a 512G Intel Optane memory (with the pmem driver) as the cache device.


> And occasionally I observed a bit more GC time. It might be from my rebase modification in bch_btree_gc_finish().

And for the above situation, it was on a 4TB NVMe SSD.


I guess it may have been caused by my improper patch rebase. Once Dongsheng posts a new version for the latest upstream kernel bcache code, I will test the patch again.


Thanks.

Coly Li
---
 drivers/md/bcache/alloc.c  | 11 +++++------
 drivers/md/bcache/bcache.h |  1 +
 drivers/md/bcache/btree.c  | 11 +++++++++--
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index ce13c272c387..982b36d12907 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -129,12 +129,11 @@ static inline bool can_inc_bucket_gen(struct bucket *b)
 
 bool bch_can_invalidate_bucket(struct cache *ca, struct bucket *b)
 {
-	BUG_ON(!ca->set->gc_mark_valid);
-
-	return (!GC_MARK(b) ||
+	return ((b->reclaimable_in_gc || ca->set->gc_mark_valid) &&
+		((!GC_MARK(b) ||
 		GC_MARK(b) == GC_MARK_RECLAIMABLE) &&
 		!atomic_read(&b->pin) &&
-		can_inc_bucket_gen(b);
+		can_inc_bucket_gen(b)));
 }
 
 void __bch_invalidate_one_bucket(struct cache *ca, struct bucket *b)
@@ -148,6 +147,7 @@ void __bch_invalidate_one_bucket(struct cache *ca, struct bucket *b)
 	bch_inc_gen(ca, b);
 	b->prio = INITIAL_PRIO;
 	atomic_inc(&b->pin);
+	b->reclaimable_in_gc = 0;
 }
 
 static void bch_invalidate_one_bucket(struct cache *ca, struct bucket *b)
@@ -352,8 +352,7 @@ static int bch_allocator_thread(void *arg)
 		 */
 
 retry_invalidate:
-		allocator_wait(ca, ca->set->gc_mark_valid &&
-			       !ca->invalidate_needs_gc);
+		allocator_wait(ca, !ca->invalidate_needs_gc);
 		invalidate_buckets(ca);
 
 		/*
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 4e6afa89921f..1d33e40d26ea 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -200,6 +200,7 @@ struct bucket {
 	uint8_t		gen;
 	uint8_t		last_gc; /* Most out of date gen in the btree */
 	uint16_t	gc_mark; /* Bitfield used by GC. See below for field */
+	uint16_t	reclaimable_in_gc:1;
 };
 
 /*
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 196cdacce38f..ded55958782d 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1740,18 +1740,21 @@ static void btree_gc_start(struct cache_set *c)
 
 	mutex_lock(&c->bucket_lock);
 
-	c->gc_mark_valid = 0;
 	c->gc_done = ZERO_KEY;
 
 	ca = c->cache;
 	for_each_bucket(b, ca) {
 		b->last_gc = b->gen;
+		if (bch_can_invalidate_bucket(ca, b))
+			b->reclaimable_in_gc = 1;
+
 		if (!atomic_read(&b->pin)) {
 			SET_GC_MARK(b, 0);
 			SET_GC_SECTORS_USED(b, 0);
 		}
 	}
 
+	c->gc_mark_valid = 0;
 	mutex_unlock(&c->bucket_lock);
 }
 
@@ -1768,6 +1771,11 @@ static void bch_btree_gc_finish(struct cache_set *c)
 	c->gc_mark_valid = 1;
 	c->need_gc	= 0;
 
+	ca = c->cache;
+	for_each_bucket(b, ca)
+	    if (b->reclaimable_in_gc)
+		b->reclaimable_in_gc = 0;
+
 	for (i = 0; i < KEY_PTRS(&c->uuid_bucket); i++)
 		SET_GC_MARK(PTR_BUCKET(c, &c->uuid_bucket, i),
 			    GC_MARK_METADATA);
@@ -1795,7 +1803,6 @@ static void bch_btree_gc_finish(struct cache_set *c)
 
 	c->avail_nbuckets = 0;
 
-	ca = c->cache;
 	ca->invalidate_needs_gc = 0;
 
 	for (k = ca->sb.d; k < ca->sb.d + ca->sb.keys; k++)
-- 
