On 4/8/22 3:22 PM, Zou Mingzhe wrote:
On 2022/4/7 23:53, Coly Li wrote:
On 4/1/22 8:27 PM, mingzhe.zou@xxxxxxxxxxxx wrote:
From: ZouMingzhe <mingzhe.zou@xxxxxxxxxxxx>
We got a kernel crash: "list_add corruption. next->prev should be
prev (ffff9c801bc01210), but was ffff9c77b688237c.
(next=ffffae586d8afe68)."
crash> struct list_head 0xffff9c801bc01210
struct list_head {
  next = 0xffffae586d8afe68,
  prev = 0xffffae586d8afe68
}
crash> struct list_head 0xffff9c77b688237c
struct list_head {
  next = 0x0,
  prev = 0x0
}
crash> struct list_head 0xffffae586d8afe68
struct list_head struct: invalid kernel virtual address: ffffae586d8afe68  type: "gdb_readmem_callback"
Cannot access memory at address 0xffffae586d8afe68
[230469.019492] Call Trace:
[230469.032041] prepare_to_wait+0x8a/0xb0
[230469.044363] ? bch_btree_keys_free+0x6c/0xc0 [escache]
[230469.056533] mca_cannibalize_lock+0x72/0x90 [escache]
[230469.068788] mca_alloc+0x2ae/0x450 [escache]
[230469.080790] bch_btree_node_get+0x136/0x2d0 [escache]
[230469.092681] bch_btree_check_thread+0x1e1/0x260 [escache]
[230469.104382] ? finish_wait+0x80/0x80
[230469.115884] ? bch_btree_check_recurse+0x1a0/0x1a0 [escache]
[230469.127259] kthread+0x112/0x130
[230469.138448] ? kthread_flush_work_fn+0x10/0x10
[230469.149477] ret_from_fork+0x35/0x40
bch_btree_check_thread() and bch_dirty_init_thread() may call
mca_cannibalize() to cannibalize other cached btree nodes. Only
one thread can do this at a time, so the ops of the other threads
are added to the btree_cache_wait list.

We must call finish_wait() to remove the op from btree_cache_wait
before freeing its memory. Otherwise, the list will be corrupted.
We should also call bch_cannibalize_unlock() to release the
btree_cache_alloc_lock and wake up the other waiters.
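
For illustration only, here is a minimal sketch of the cleanup described
above (not the literal diff; it assumes the cache_set pointer c and the
per-thread btree_op op are local variables in the checking/init thread):

	/*
	 * Run this before the checking/init thread returns, while op
	 * (and the embedded op.wait) is still valid memory.
	 * finish_wait() unlinks op.wait from c->btree_cache_wait, so
	 * the wait list never points at freed stack memory, and
	 * bch_cannibalize_unlock() drops btree_cache_alloc_lock and
	 * wakes up the remaining waiters on c->btree_cache_wait.
	 */
	finish_wait(&c->btree_cache_wait, &op.wait);
	bch_cannibalize_unlock(c);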
Signed-off-by: Mingzhe Zou <mingzhe.zou@xxxxxxxxxxxx>
Thank you for this fix, it is really cool to find such a defect in the
cannibalize lock/unlock path. It took me a little time to understand how
it happens, hence the late reply.

I feel the root cause is not where you patched in bch_btree_check() and
bch_root_node_dirty_init(); something is really fishy in how
mca_cannibalize_lock() is used:
843 static int mca_cannibalize_lock(struct cache_set *c, struct btree_op *op)
844 {
845         spin_lock(&c->btree_cannibalize_lock);
846         if (likely(c->btree_cache_alloc_lock == NULL)) {
847                 c->btree_cache_alloc_lock = current;
848         } else if (c->btree_cache_alloc_lock != current) {
849                 if (op)
850                         prepare_to_wait(&c->btree_cache_wait, &op->wait,
851                                         TASK_UNINTERRUPTIBLE);
852                 spin_unlock(&c->btree_cannibalize_lock);
853                 return -EINTR;
854         }
855         spin_unlock(&c->btree_cannibalize_lock);
856
857         return 0;
858 }
In lines 849-851, if taking the cannibalize lock fails, the current
op->wait is inserted into c->btree_cache_wait. Then at line 853, -EINTR
is returned to indicate that the caller should retry. But it seems no
caller checks whether the return value is -EINTR and handles it
properly.
Your patch should work, but I feel the underlying issue in
mca_cannibalize_lock() is not solved yet. Maybe we should work on
handling the -EINTR returned from mca_cannibalize_lock(), IMHO.
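
For illustration, a sketch of what such handling could look like,
following the same retry-on--EINTR pattern that the existing btree_root()
macro uses (c, op, k and level are assumed to be in scope; this is only a
sketch, not a proposed final form):

	struct btree *b;

	do {
		/*
		 * mca_alloc() propagates -EINTR from
		 * mca_cannibalize_lock(): it means current was queued on
		 * c->btree_cache_wait and will be woken up by
		 * bch_cannibalize_unlock(), so sleep and retry.
		 */
		b = mca_alloc(c, op, k, level);
		if (IS_ERR(b) && PTR_ERR(b) == -EINTR)
			schedule();
	} while (IS_ERR(b) && PTR_ERR(b) == -EINTR);

	/* op->wait may still be linked into btree_cache_wait: unlink it. */
	finish_wait(&c->btree_cache_wait, &op->wait);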
Patch 2 handles the return value.
BTW, when you observed the panic, what was the hardware configuration:
- CPU cores
CPU 0-39, 40 CPUs in total
- Memory size
memory status:
crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  32919429     125.6 GB         ----
         FREE    638133       2.4 GB    1% of TOTAL MEM
         USED  32281296     123.1 GB   98% of TOTAL MEM
       SHARED   1353791       5.2 GB    4% of TOTAL MEM
      BUFFERS    131366     513.1 MB    0% of TOTAL MEM
       CACHED   2022521       7.7 GB    6% of TOTAL MEM
         SLAB    590919       2.3 GB    1% of TOTAL MEM

   TOTAL HUGE         0            0         ----
    HUGE FREE         0            0    0% of TOTAL HUGE

   TOTAL SWAP         0            0         ----
    SWAP USED         0            0    0% of TOTAL SWAP
    SWAP FREE         0            0    0% of TOTAL SWAP

 COMMIT LIMIT  16459714      62.8 GB         ----
    COMMITTED  67485109     257.4 GB  410% of TOTAL LIMIT
- Cache size
cache disk 460G
- Number of keys on the btree root node
c->root->keys->set->data info:
crash> btree 0xffff9c6bd873cc00|grep data
data = 0xffff9c6bda6c0000
data = 0xffff9c6bda6cd000
data = 0x0
data = 0x0
data = {
data = {
crash> bset 0xffff9c6bda6c0000
struct bset {
  csum = 4228267359687445853,
  magic = 15660900678624291974,
  seq = 15025931623832980119,
  version = 1,
  keys = 6621,
  {
    start = 0xffff9c6bda6c0020,
    d = 0xffff9c6bda6c0020
  }
}

crash> bset 0xffff9c6bda6cd000
struct bset {
  csum = 38040912,
  magic = 15660900678624291974,
  seq = 15025931623832980119,
  version = 0,
  keys = 0,
  {
    start = 0xffff9c6bda6cd020,
    d = 0xffff9c6bda6cd020
  }
}
Thanks for the information. It seems a corner case of the cannibalize
lock code is triggered by the parallel btree checking during boot time.
Nice catch.
Coly Li