Re: [PATCH v3] bcache: dynamic incremental gc

Zou Mingzhe <mingzhe.zou@xxxxxxxxxxxx> · Fri, 20 May 2022 16:22:38 +0800

On 2022/5/12 21:41, Coly Li wrote:
On 5/11/22 3:39 PM, mingzhe.zou@xxxxxxxxxxxx wrote:
From: ZouMingzhe <mingzhe.zou@xxxxxxxxxxxx>

Currently, GC tries to finish in no more than 100 passes, handling
at least 100 nodes per pass; the maximum number of nodes per pass
is not limited.

```
static size_t btree_gc_min_nodes(struct cache_set *c)
{
         ......
         min_nodes = c->gc_stats.nodes / MAX_GC_TIMES;
         if (min_nodes < MIN_GC_NODES)
                 min_nodes = MIN_GC_NODES;

         return min_nodes;
}
```

According to our test data, when NVMe is used as the cache,
it takes about 1ms for GC to handle each node (4k block size and
512k bucket size). With at least 100 nodes per pass, this means
the latency during GC is at least 100ms. During GC, IO performance
drops by half or more.
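As a rough illustration of that 100ms floor (a userspace sketch, not part of the patch), the clamp in btree_gc_min_nodes() can be mirrored and multiplied by a per-node cost. The constant values come from the description above; gc_min_nodes() and gc_min_pass_ms() are hypothetical helper names used only here.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GC_TIMES 100 /* "no more than 100 times" */
#define MIN_GC_NODES 100 /* "at least 100 nodes each time" */

/* Userspace mirror of the clamp in btree_gc_min_nodes(). */
static size_t gc_min_nodes(size_t total_nodes)
{
	size_t min_nodes = total_nodes / MAX_GC_TIMES;

	if (min_nodes < MIN_GC_NODES)
		min_nodes = MIN_GC_NODES;

	return min_nodes;
}

/* Lower bound on the length of one GC pass, in milliseconds. */
static size_t gc_min_pass_ms(size_t total_nodes, size_t per_node_ms)
{
	assert(per_node_ms > 0);
	return gc_min_nodes(total_nodes) * per_node_ms;
}
```

With the measured 1ms per node, a tree of 5,000 nodes is clamped to 100 nodes per pass (100ms), while a tree of 50,000 nodes needs 500 nodes per pass (500ms), since there is no upper limit.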

I want to optimize IOPS and latency under high pressure.
This patch tracks the inflight peak. When the IO depth reaches
that maximum, GC processes only a few (10) nodes, then sleeps
immediately so that these requests can be handled.
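One way to picture this (a minimal sketch under assumptions, not the code in the patch) is to shrink the per-pass batch as the inflight count approaches the recorded peak. The linear scaling, the helper name gc_nodes_for_load(), and the constant name GC_NODES_AT_PEAK are all hypothetical; only the value 10 comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define GC_NODES_AT_PEAK 10 /* the "very few (10)" nodes under full load */

/*
 * Hypothetical policy: scale the batch linearly from base_nodes (idle)
 * down to GC_NODES_AT_PEAK (inflight has reached the recorded peak).
 */
static size_t gc_nodes_for_load(size_t base_nodes, size_t inflight,
				size_t peak)
{
	assert(base_nodes >= GC_NODES_AT_PEAK);

	/* No load, or no peak recorded yet: use the normal batch. */
	if (peak == 0 || inflight == 0)
		return base_nodes;

	/* At or above the peak: do the minimum and yield quickly. */
	if (inflight >= peak)
		return GC_NODES_AT_PEAK;

	return base_nodes -
	       (base_nodes - GC_NODES_AT_PEAK) * inflight / peak;
}
```

For example, with a normal batch of 100 nodes and a recorded peak of 64, an inflight count of 32 gives 55 nodes, and an inflight count of 64 gives 10.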

bch_bucket_alloc() may wait for bch_allocator_thread() to
wake it up, and bch_allocator_thread() needs to wait for gc
to complete, in which case gc needs to end quickly. So v3 adds
bucket_alloc_inflight to cache_set.

```
long bch_bucket_alloc(struct cache *ca, unsigned int reserve, bool wait)
{
         ......
         do {
                 prepare_to_wait(&ca->set->bucket_wait, &w,
                                 TASK_UNINTERRUPTIBLE);

                 mutex_unlock(&ca->set->bucket_lock);
                 schedule();
                 mutex_lock(&ca->set->bucket_lock);
         } while (!fifo_pop(&ca->free[RESERVE_NONE], r) &&
                  !fifo_pop(&ca->free[reserve], r));
         ......
}

static int bch_allocator_thread(void *arg)
{
    ......
    allocator_wait(ca, bch_allocator_push(ca, bucket));
    wake_up(&ca->set->btree_cache_wait);
    wake_up(&ca->set->bucket_wait);
    ......
}

static void bch_btree_gc(struct cache_set *c)
{
    ......
    bch_btree_gc_finish(c);
    wake_up_allocators(c);
    ......
}
```

With this patch applied, each GC pass may process only a few nodes,
so GC would last a long time if it slept 100ms after every pass.
Therefore, the sleep time should be calculated dynamically based on
gc_cost.
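To make the idea concrete (a sketch only, the patch may compute this differently), the sleep could follow the measured cost of the last pass and be clamped to a range. The rule "sleep as long as the pass took", the 1ms floor, and the names below are assumptions; the 100ms ceiling is the fixed sleep mentioned above.

```c
#define GC_SLEEP_MS_MAX 100 /* the fixed 100ms sleep used today */
#define GC_SLEEP_MS_MIN 1   /* hypothetical floor */

/*
 * Hypothetical rule: sleep about as long as the last pass took,
 * so a short pass is followed by a short sleep.
 */
static unsigned int gc_sleep_ms(unsigned int pass_cost_ms)
{
	if (pass_cost_ms < GC_SLEEP_MS_MIN)
		return GC_SLEEP_MS_MIN;
	if (pass_cost_ms > GC_SLEEP_MS_MAX)
		return GC_SLEEP_MS_MAX;

	return pass_cost_ms;
}
```

Under this rule, a 10-node pass costing about 10ms would sleep about 10ms instead of 100ms, so the total GC time grows far less when passes are small.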

At the same time, I added some cost statistics in gc_stat,
hoping to provide useful information for future work.

Hi Mingzhe,

At first glance, I feel this change may delay the small GC
periods and eventually result in one large GC period, which is not
expected.

But it is possible that my feeling is incorrect. Do you have detailed
performance numbers for both I/O latency and GC period? Then I can
better understand this effort.

BTW, I will add this patch to my testing set and try it myself.

Thanks.

Coly Li

Hi Coly,

First, your feeling is right. Second, I have some performance numbers
from before and after this patch.
Since the mailing list does not accept attachments, I put them in a gist.

Please visit the page for details:
https://gist.github.com/zoumingzhe/69a353e7c6fffe43142c2f42b94a67b5
mingzhe

[snipped]