Re: bcache: workqueue lockup

On 2018/8/8 7:52 PM, Stefan Priebe - Profihost AG wrote:
> Hi,
> On 07.08.2018 at 16:35, Coly Li wrote:
>> On 2018/8/7 3:41 AM, Stefan Priebe - Profihost AG wrote:
>>> On 06.08.2018 at 16:21, Coly Li wrote:
>>>> On 2018/8/6 9:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi Coly,
>>>>> On 06.08.2018 at 15:06, Coly Li wrote:
>>>>>> On 2018/8/6 2:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>> Hi Coly,
>>>>>>>
>>>>>>> while running the SLES15 kernel I observed a workqueue lockup and a
>>>>>>> totally crashed system today.
>>>>>>>
>>>>>>> The dmesg output is about 3.5MB, but it seems to just repeat the
>>>>>>> bch_data_insert_keys message.
>>>>>>>
>>>>>>
>>>>>> Hi Stefan,
>>>>>>
>>>>>> Thanks for your information!
>>>>>>
>>>>>> Could you please give me any hint on how to reproduce it? Even if it
>>>>>> is not stably reproducible, a detailed procedure may help me a lot.
>>>>>
>>>>> I'm sorry, but I can't reproduce it. It just happens out of nowhere
>>>>> in our Ceph production cluster.
>>>>
>>>> I see. Could you please share some configuration information? E.g.:
>>>
>>> sure.
>>>
>>>> - How many CPU cores
>>> 12
>>>
>>>> - How much physical memory
>>> 64GB
>>>
>>>> - How large the SSD is, and whether it is NVMe or SATA
>>> bcache cache size: 250GB SATA SSD
>>>
>>>> - How many SSDs
>>> 1x
>>>
>>>> - How large (and how many) the backing hard drives are
>>> 2x 1TB
>>>
>>>> I will try to simulate a similar workload with fio and see how lucky I am.
>>>
>>> Thanks!
>>>
>>> Generally, the workload in that timeframe was mostly reads, fsync
>>> inside guests, and fstrim.
>>
>> Hi Stefan,
>>
>> From your information, I suspect this was a journal-related deadlock.
>>
>> If too many small I/Os make the btree inside bcache grow too fast, and
>> in turn exhaust the journal space, a deadlock-like hang can happen.
>>
>> Junhui tried to fix it by increasing the journal slot size, but the
>> root cause is not fixed yet. The journal operation in bcache is not
>> atomic: a btree node write goes into the journal first, and is then
>> inserted into the btree node by journal replay. If the btree node has
>> to be split during the journal replay, the split metadata needs to go
>> into the journal first; if the journal space is already exhausted, a
>> deadlock may happen.
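>>
>> To make the failure mode concrete, here is a toy user-space model of
>> it (this is not actual bcache code; the slot pool and the names are
>> made up for illustration):
>>
>> 	/* toy model: every journaled insert takes one slot, and a node
>> 	 * split triggered by that insert needs a second one */
>> 	#include <stdio.h>
>>
>> 	#define JOURNAL_SLOTS 3
>>
>> 	static int free_slots = JOURNAL_SLOTS;
>>
>> 	static int take_slot(void)
>> 	{
>> 		if (free_slots == 0)
>> 			return 0;	/* caller must wait for reclaim */
>> 		free_slots--;
>> 		return 1;
>> 	}
>>
>> 	int main(void)
>> 	{
>> 		for (int i = 0; ; i++) {
>> 			if (!take_slot()) {
>> 				printf("insert %d: journal full -> hang\n", i);
>> 				break;
>> 			}
>> 			/* the insert splits the node during replay; the
>> 			 * split metadata needs its own journal slot, but
>> 			 * the pool may already be empty by now */
>> 			if (!take_slot()) {
>> 				printf("insert %d: split starves -> hang\n", i);
>> 				break;
>> 			}
>> 		}
>> 		return 0;
>> 	}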
>>
>> A real fix is to make the bcache journal operation atomic, which means
>> (a rough sketch follows this list):
>> 1. Reserve the estimated journal slots before a journal I/O.
>> 2. If the reservation succeeds, go ahead; if it fails, wait and try
>> again.
>> 3. If journal replay results in a btree split, the journal slot for
>> the new metadata is already reserved in the journal and can never fail.
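>>
>> Continuing the toy model above, the reservation discipline would look
>> roughly like this (again with made-up names, not real bcache code):
>>
>> 	#include <stdbool.h>
>> 	#include <stdio.h>
>>
>> 	#define JOURNAL_SLOTS	3
>> 	#define WORST_CASE	2	/* the write plus a possible split */
>>
>> 	static int free_slots = JOURNAL_SLOTS;
>>
>> 	/* step 1: all-or-nothing reservation before any journal I/O */
>> 	static bool reserve(int n)
>> 	{
>> 		if (free_slots < n)
>> 			return false;	/* step 2: caller waits, retries */
>> 		free_slots -= n;
>> 		return true;
>> 	}
>>
>> 	int main(void)
>> 	{
>> 		for (int i = 0; i < 4; i++) {
>> 			if (!reserve(WORST_CASE)) {
>> 				printf("insert %d: wait for reclaim\n", i);
>> 				free_slots = JOURNAL_SLOTS; /* reclaim ran */
>> 				continue;
>> 			}
>> 			/* step 3: a split can no longer fail here */
>> 			printf("insert %d: journaled safely\n", i);
>> 			free_slots++;	/* no split happened: return slot */
>> 		}
>> 		return 0;
>> 	}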
>>
>> This fix is not simple, and I am currently working on other fixes (4Kn
>> hard drives and big endian...). If no one else helps on the fix, it may
>> be a while before I can focus on it.
>>
>> Because you mentioned fstrim happened in your guests: if the backing
>> device of bcache supports DISCARD/TRIM, bcache will also invalidate the
>> fstrim range in its internal btree, which may generate more btree
>> metadata I/O. That is why I guess it might be related to the journal.
>>
>> Hmm, how about I compose a patch to display the number of free journal
>> slots? If such an issue happens again and you can still access sysfs,
>> let's check whether this is a journal issue. Maybe I am wrong, but it
>> is worth a try.
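>>
>> For reference, the sysfs hook for such a patch could have roughly this
>> shape; note that journal.free_slots here is an assumed counter for
>> illustration, not an existing bcache member:
>>
>> 	/* hypothetical sketch, not against any real tree */
>> 	static ssize_t journal_free_show(struct kobject *kobj,
>> 					 struct kobj_attribute *attr,
>> 					 char *buf)
>> 	{
>> 		struct cache_set *c =
>> 			container_of(kobj, struct cache_set, kobj);
>>
>> 		/* free_slots is an assumed field for illustration */
>> 		return snprintf(buf, PAGE_SIZE, "%u\n",
>> 				c->journal.free_slots);
>> 	}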
> 
> I don't believe the journal was full - the workload at that time (02:00
> AM) is mostly read-only plus file deletes and truncates, and the journal
> is pretty big at 250GB. Ceph handles fstrim inside guests as truncates
> and file deletes outside the guest. So the real workload for bcache was:
> - reads (backup time)
> - file deletes (xfs)
> - file truncates (xfs)

Hi Stefan,

I guess maybe we talked about different journals. I should have said
explicitly: the bcache journal. Sorry for misleading you.

There is a bcache journal too, which is around 500MB in size and used
for metadata. If there are too many metadata operations, it can fill up
quite easily.

Coly Li