On 2019/6/27 7:47 PM, Coly Li wrote:
> On 2019/6/27 5:16 PM, Yaowei Bai wrote:
>> On Fri, Jun 14, 2019 at 09:13:35PM +0800, Coly Li wrote:
>>> There is a race between mca_reap(), btree_node_free() and the journal
>>> code btree_flush_write(), which results in a very rare and strange
>>> deadlock or panic that is very hard to reproduce.
>>>
>>> Let me explain how the race happens. In btree_flush_write() one btree
>>> node with the oldest journal pin is selected, then it is flushed to
>>> the cache device; the select-and-flush is a two-step operation.
>>> Between these two steps, the following may happen inside the race
>>> window,
>>> - The selected btree node was reaped by mca_reap() and allocated to
>>>   another requester as a different btree node.
>>> - The selected btree node was flushed and released by the mca shrink
>>>   callback bch_mca_scan().
>>> When btree_flush_write() tries to flush the selected btree node, it
>>> first takes b->write_lock by mutex_lock(). If the race happens and
>>> the memory of the selected btree node has been allocated to another
>>> btree node whose write_lock is held already, a deadlock very
>>> probably happens here. A worse case is that the memory of the
>>> selected btree node has been released; then any reference to this
>>> btree node (e.g. b->write_lock) will trigger a NULL pointer
>>> dereference panic.
>>>
>>> This race was introduced in commit cafe56359144 ("bcache: A block
>>> layer cache"), and enlarged by commit c4dc2497d50d ("bcache: fix
>>> high CPU occupancy during journal"), which selects 128 btree nodes
>>> and flushes them one by one over quite a long time period.
>>>
>>> Such a race was not easy to reproduce before. On a Lenovo SR650
>>> server with 48 Xeon cores, with 1 NVMe SSD configured as the cache
>>> device and an MD raid0 device assembled from 3 NVMe SSDs as the
>>> backing device, this race can be observed around once every 10,000
>>> calls of btree_flush_write(). Both the deadlock and the kernel panic
>>> happened as an aftermath of the race.
>>>
>>> The idea of the fix is to add a btree flag BTREE_NODE_journal_flush.
>>> It is set when btree nodes are selected, and cleared after the btree
>>> nodes are flushed. Then when mca_reap() encounters a btree node with
>>> this bit set, the btree node will be skipped. Since mca_reap() only
>>> reaps btree nodes without the BTREE_NODE_journal_flush flag set,
>>> such a race is avoided.
>>>
>>> One corner case should be noticed, that is btree_node_free(). It
>>> might be called in some error handling code paths. For example the
>>> following code piece from btree_split(),
>>> 2149 err_free2:
>>> 2150 	bkey_put(b->c, &n2->key);
>>> 2151 	btree_node_free(n2);
>>> 2152 	rw_unlock(true, n2);
>>> 2153 err_free1:
>>> 2154 	bkey_put(b->c, &n1->key);
>>> 2155 	btree_node_free(n1);
>>> 2156 	rw_unlock(true, n1);
>>> At lines 2151 and 2155, the btree nodes n2 and n1 are released
>>> without going through mca_reap(), so BTREE_NODE_journal_flush also
>>> needs to be checked here. If btree_node_free() is called directly in
>>> such an error handling path, and the selected btree node has the
>>> BTREE_NODE_journal_flush bit set, just wait for 1 jiffy and retry.
>>> In this case the btree node won't be skipped; btree_node_free()
>>> retries until the BTREE_NODE_journal_flush bit is cleared, then
>>> frees the btree node memory (see the sketch after the list below).
>>>
>>> Waiting for 1 jiffy inside btree_node_free() does not hurt
>>> performance much here, for the following reasons,
>>> - The error handling code path is not frequently executed, and the
>>>   race inside this cold path should be very rare. If the very rare
>>>   race happens in the cold code path, waiting 1 jiffy is acceptable.
>>> - If btree_node_free() is called inside mca_reap(), it means the
>>>   BTREE_NODE_journal_flush bit was checked already, so no wait will
>>>   happen here.
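To make the flag life cycle above concrete, below is a minimal sketch.
It assumes the flag lives in b->flags and is accessed with the kernel's
generic set_bit()/test_bit()/clear_bit() helpers; the locking and the
return conventions are simplified for illustration and are not the
actual patch code:

	/*
	 * Illustrative sketch only, not the actual patch. Assumes the
	 * flag is stored in b->flags; locking is simplified.
	 */

	/* btree_flush_write(): own the selected node while flushing */
	set_bit(BTREE_NODE_journal_flush, &b->flags);
	mutex_lock(&b->write_lock);
	/* ... write the selected btree node to the cache device ... */
	mutex_unlock(&b->write_lock);
	clear_bit(BTREE_NODE_journal_flush, &b->flags);

	/* mca_reap(): skip nodes the journal code is still flushing */
	if (test_bit(BTREE_NODE_journal_flush, &b->flags))
		return -ENOMEM;	/* caller moves on to another node */

	/*
	 * btree_node_free() on an error path: do not skip, wait and
	 * retry until the journal code clears the bit, then free the
	 * btree node memory.
	 */
	while (test_bit(BTREE_NODE_journal_flush, &b->flags))
		schedule_timeout_interruptible(1);	/* wait 1 jiffy */

The key point is that mca_reap() treats the bit as "skip and try
another node", while btree_node_free() treats it as "wait until the
bit clears", matching the two behaviors described above.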
>>> Besides the above fix, the way of selecting btree nodes to flush is
>>> also changed in this patch. Let me explain what changes in this
>>> patch.
>>
>> Then this change should be split into another patch. :)
>
> Hi Bai,
>
> Sure, it makes sense. I also realize splitting it into two patches
> may be helpful for long-term kernel maintainers to backport patches
> without breaking KABI.
>
> I will send a two-patch version in the for-5.3 submission.

I just realized the KABI breakage is unavoidable, but splitting this
patch into two still makes sense; the performance optimization should
not go into the race fix.

Thanks.

--
Coly Li