[RFC PATCH v2 00/16] bcache: fix journal no-space deadlock

Coly Li <colyli@xxxxxxx> · Sat, 20 Apr 2019 00:04:53 +0800

The journal no-space deadlock was first noticed when the kernel
reported several kthreads or kworkers hanging for a long time. The
cause was a deadlock that happens when no journal space is available
for incoming journal requests.

In the v1 RFC series, I thought the journal no-space deadlock came from
two conditions, which turned out not to be true. After long testing and
debugging, I realized the journal deadlock is the result of a series of
problems hidden in the current code.

I have made progress in the v2 series, and all known problems related
to the journal no-space deadlock are fixed. I no longer observe the
journal deadlock or the related I/O hang warnings.

Unfortunately we cannot apply this whole series at this moment, because
after fixing the journal no-space deadlock, I found a race in dirty
btree node flushing. Besides normal dirty btree node flushing, when
there is no journal space, btree_flush_write() is called to write out
the oldest dirty btree node. Once the oldest dirty btree node is written
from memory to the cache device, its associated journal reference is
released. This operation is necessary to reclaim the oldest busy journal
bucket when the journal buckets have no space left.
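
To make the relationship between dirty nodes, journal references and
reclaim concrete, here is a minimal user-space sketch. It is not bcache
code: the toy_* names, the four-entry journal and the rule that a dirty
node pins only the first journal entry that dirtied it are assumptions
made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NR_JOURNAL_ENTRIES 4	/* hypothetical: tiny journal ring */
#define NR_NODES 3		/* hypothetical: tiny btree node cache */

struct toy_journal_entry {
	int pin;		/* number of dirty nodes still needing this entry */
};

struct toy_node {
	bool dirty;
	int journal_idx;	/* journal entry pinned by this node, -1 if clean */
};

struct toy_cache {
	struct toy_journal_entry journal[NR_JOURNAL_ENTRIES];
	int oldest;		/* ring index of the oldest in-use entry */
	int used;		/* number of in-use entries */
	struct toy_node nodes[NR_NODES];
};

static void toy_cache_init(struct toy_cache *c)
{
	memset(c, 0, sizeof(*c));
	for (int i = 0; i < NR_NODES; i++)
		c->nodes[i].journal_idx = -1;
}

/* Take the next free journal entry, or return -1 when the ring is full. */
static int toy_journal_alloc(struct toy_cache *c)
{
	if (c->used == NR_JOURNAL_ENTRIES)
		return -1;
	int idx = (c->oldest + c->used) % NR_JOURNAL_ENTRIES;
	c->journal[idx].pin = 0;
	c->used++;
	return idx;
}

/* Free entries from the old end of the ring while nothing pins them. */
static void toy_journal_reclaim(struct toy_cache *c)
{
	while (c->used > 0 && c->journal[c->oldest].pin == 0) {
		c->oldest = (c->oldest + 1) % NR_JOURNAL_ENTRIES;
		c->used--;
	}
}

/* Write out the dirty node pinning the oldest entry, then reclaim. */
static bool toy_flush_oldest(struct toy_cache *c)
{
	struct toy_node *best = NULL;
	int best_age = NR_JOURNAL_ENTRIES;
	bool wrote = false;

	for (int i = 0; i < NR_NODES; i++) {
		struct toy_node *n = &c->nodes[i];
		if (!n->dirty)
			continue;
		int age = (n->journal_idx - c->oldest + NR_JOURNAL_ENTRIES) %
			  NR_JOURNAL_ENTRIES;
		if (age < best_age) {
			best = n;
			best_age = age;
		}
	}
	if (best) {
		assert(c->journal[best->journal_idx].pin > 0);
		c->journal[best->journal_idx].pin--;	/* drop the journal reference */
		best->dirty = false;			/* node is now "on disk" */
		best->journal_idx = -1;
		wrote = true;
	}
	toy_journal_reclaim(c);
	return wrote;
}

/* Journal an update to a node; flush the oldest dirty node if the ring is full. */
static int toy_journal_add(struct toy_cache *c, int node_id)
{
	int idx = toy_journal_alloc(c);

	if (idx < 0) {
		toy_flush_oldest(c);
		idx = toy_journal_alloc(c);
		if (idx < 0)
			return -1;	/* still no space */
	}
	struct toy_node *n = &c->nodes[node_id];
	if (!n->dirty) {
		n->dirty = true;
		n->journal_idx = idx;
		c->journal[idx].pin++;
	}
	return idx;
}
```

In this toy model, filling the ring and then calling toy_journal_add()
once more forces toy_flush_oldest() to write the node pinning the oldest
entry, which is what lets the ring advance and the new entry be
allocated.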

The problem is that when building the c->flush_btree heap, the dirty
btree nodes from for_each_cached_btree() are not protected or
referenced. So there is a race: after the heap c->flush_btree is built
and before the oldest node is selected from the heap, the oldest node
may already have been written in the normal code path, and its memory
released or reused.
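
The window can be illustrated with a small single-threaded user-space
model. This is only a sketch under stated assumptions, not bcache code
and not a proposed fix: the toy_* names and the per-slot 'gen' counter
are hypothetical (the real struct btree has no such field), and the toy
pool is static memory, so reading a recycled slot here is harmless
whereas in the kernel it is not.

```c
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 3	/* hypothetical: tiny node pool */

struct toy_btree_node {
	unsigned int gen;		/* hypothetical: bumped when the slot is released */
	bool dirty;
	unsigned int journal_seq;	/* smaller means an older journal entry is pinned */
};

/* What the unprotected snapshot remembers about a node. */
struct toy_candidate {
	struct toy_btree_node *node;
	unsigned int gen_seen;
	unsigned int seq_seen;
};

/* Step 1: collect dirty nodes without locking or referencing them. */
static size_t toy_collect(struct toy_btree_node *pool, struct toy_candidate *out)
{
	size_t n = 0;

	for (size_t i = 0; i < POOL_SIZE; i++) {
		if (!pool[i].dirty)
			continue;
		out[n].node = &pool[i];
		out[n].gen_seen = pool[i].gen;
		out[n].seq_seen = pool[i].journal_seq;
		n++;
	}
	return n;
}

/* The normal code path: the node is written and its slot is released. */
static void toy_write_and_release(struct toy_btree_node *b)
{
	b->dirty = false;
	b->gen++;
}

/*
 * Step 2: pick the oldest candidate from the snapshot. The snapshot may be
 * stale by now; the generation check makes that visible in this model.
 */
static struct toy_btree_node *toy_pick_oldest(const struct toy_candidate *c,
					      size_t n, bool *stale)
{
	const struct toy_candidate *best = NULL;

	*stale = false;
	for (size_t i = 0; i < n; i++)
		if (!best || c[i].seq_seen < best->seq_seen)
			best = &c[i];
	if (!best)
		return NULL;
	if (best->node->gen != best->gen_seen || !best->node->dirty) {
		*stale = true;	/* node changed between step 1 and step 2 */
		return NULL;
	}
	return best->node;
}
```

If toy_write_and_release() runs on the oldest node between
toy_collect() and toy_pick_oldest(), the pick reports a stale snapshot
instead of returning the node. Without some equivalent protection, the
picker would go on to use a node that no longer exists.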

From my testing, a kernel panic triggered by a wild pointer dereference
or an unpaired mutex_lock/unlock can be observed in btree_flush_write().
This is because the selected btree node was already written and
released, so btree_flush_write() references an invalid memory object.

So far I don't have a good idea for fixing this race without hurting I/O
performance, and IMHO a bcache I/O hang caused by the journal is still
better than a kernel panic. Therefore, until the race in dirty btree
node writing is fixed, I won't apply the whole series.

But there are still some helpful, non-major fixes which can go
upstream now, to shrink the remaining patch set and avoid huge changes
in a single kernel merge.

The patch 'bcache: acquire c->journal.lock in bch_btree_leaf_dirty()' in
the v1 series was removed from the v2 series. I still feel it is a
problem to access journal pipo without any protection, but that fix is
limited and I need to think about a more carefully considered way to
fix it.

Any review comments or suggestions are warmly welcome.

Thanks in advance for your help.

Coly Li
---

Coly Li (16):
  bcache: move definition of 'int ret' out of macro read_bucket()
  bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim()
  bcache: reload journal key information during journal replay
  bcache: fix journal deadlock during journal replay
  bcache: reserve space for journal_meta() in run time
  bcache: add failure check to run_cache_set() for journal replay
  bcache: add comments for kobj release callback routine
  bcache: return error immediately in bch_journal_replay()
  bcache: add error check for calling register_bdev()
  bcache: Add comments for blkdev_put() in registration code path
  bcache: add comments for closure_fn to be called in closure_queue()
  bcache: add pendings_cleanup to stop pending bcache device
  bcache: fix fifo index swapping condition in btree_flush_write()
  bcache: try to flush btree nodes as many as possible
  bcache: improve bcache_reboot()
  bcache: introduce spinlock_t flush_write_lock in struct journal

 drivers/md/bcache/journal.c | 312 ++++++++++++++++++++++++++++++++++++++++----
 drivers/md/bcache/journal.h |   8 +-
 drivers/md/bcache/super.c   | 112 ++++++++++++++--
 3 files changed, 393 insertions(+), 39 deletions(-)

-- 
2.16.4