Re: bcache: workqueue lockup

Hi,
On 09.08.2018 at 08:37, Coly Li wrote:
> On 2018/8/8 7:52 PM, Stefan Priebe - Profihost AG wrote:
>> Hi,
>> On 07.08.2018 at 16:35, Coly Li wrote:
>>> On 2018/8/7 3:41 AM, Stefan Priebe - Profihost AG wrote:
>>>> On 06.08.2018 at 16:21, Coly Li wrote:
>>>>> On 2018/8/6 9:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>>> Hi Coly,
>>>>>> On 06.08.2018 at 15:06, Coly Li wrote:
>>>>>>> On 2018/8/6 2:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>> Hi Coly,
>>>>>>>>
>>>>>>>> while running the SLES15 kernel i observed a workqueue lockup and a
>>>>>>>> totally crashed system today.
>>>>>>>>
>>>>>>>> dmesg output is about 3,5mb but it seems it just repeats the
>>>>>>>> bch_data_insert_keys msg.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Stefan,
>>>>>>>
>>>>>>> Thanks for your information!
>>>>>>>
>>>>>>> Could you please give me any hint on how to reproduce it? Even if it is
>>>>>>> not stably reproducible, a detailed procedure may help me a lot.
>>>>>>
>>>>>> I'm sorry, but I can't reproduce it. It happens out of nowhere in
>>>>>> our Ceph production cluster.
>>>>>
>>>>> I see. Could you please share the configuration information? E.g.
>>>>
>>>> sure.
>>>>
>>>>> - How many CPU cores
>>>> 12
>>>>
>>>>> - How much physical memory
>>>> 64GB
>>>>
>>>>> - How large the SSD is, NVMe or SATA
>>>> bcache cache size: 250GB SATA SSD
>>>>
>>>>> - How many SSDs
>>>> 1x
>>>>
>>>>> - How large (many) the backing hard drives are
>>>> 2x 1TB
>>>>
>>>>> I will try to simulate a similar workload with fio and see how lucky I am.
>>>>
>>>> Thanks!
>>>>
>>>> Generally, the workload in that timeframe was mostly reads, fsync inside
>>>> guests, and fstrim.
>>>
>>> Hi Stefan,
>>>
>>> From your information, I suspect this was a journal-related deadlock.
>>>
>>> If too many small I/Os make the btree inside bcache grow too fast, and
>>> in turn exhaust the journal space, a deadlock-like hang can happen.
>>>
>>> Junhui tried to fix it by increasing the journal slot size, but the root
>>> cause is not fixed yet. The journal operation in bcache is not atomic,
>>> which means a btree insert goes into the journal first and is then
>>> applied to the btree node by journal replay. If the btree node has to be
>>> split during the journal replay, the split metadata needs to go into the
>>> journal first; if the journal space is already exhausted, a deadlock may
>>> happen.
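>>>
>>> To make that concrete, here is a toy userspace model of the cycle (purely
>>> illustrative C, not bcache code; the function names and the slot count
>>> below are made up for this sketch):
>>>
>>> #include <stdbool.h>
>>> #include <stdio.h>
>>>
>>> #define JOURNAL_SLOTS 4
>>>
>>> static int journal_free = JOURNAL_SLOTS;
>>>
>>> /* Appending anything to the journal needs a free slot. */
>>> static bool journal_append(const char *what)
>>> {
>>> 	if (journal_free == 0)
>>> 		return false;		/* caller has to wait */
>>> 	journal_free--;
>>> 	printf("journalled %s, %d slots left\n", what, journal_free);
>>> 	return true;
>>> }
>>>
>>> /* Freeing journal space means flushing dirty btree nodes first... */
>>> static void journal_reclaim(void)
>>> {
>>> 	/*
>>> 	 * ...but a flush may split a node, and the split's new keys must be
>>> 	 * journalled as well.  With journal_free already at 0 this append
>>> 	 * fails too: reclaim waits on the journal while the journal waits
>>> 	 * on reclaim -- the deadlock-like hang described above.
>>> 	 */
>>> 	if (!journal_append("split keys"))
>>> 		printf("reclaim stuck: no journal space for the split\n");
>>> }
>>>
>>> int main(void)
>>> {
>>> 	while (journal_append("small insert"))	/* many small I/Os */
>>> 		;
>>> 	journal_reclaim();			/* cannot make progress */
>>> 	return 0;
>>> }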
>>>
>>> A real fix is to make the bcache journal operation atomic, that means
>>> (a rough sketch follows below):
>>> 1. Reserve the estimated journal slots before a journal I/O.
>>> 2. If the reservation succeeds, go ahead; if it fails, wait and try again.
>>> 3. If journal replay results in a btree split, the journal slots for the
>>> new metadata are already reserved in the journal, so this can never fail.
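>>>
>>> Continuing the toy model above (again only a sketch with made-up names),
>>> the reservation would look roughly like this: the worst case, i.e. the
>>> keys themselves plus whatever a split could add, is reserved up front.
>>>
>>> /* Steps 1 + 2: reserve the worst case before any journal I/O; if the
>>>  * reservation fails, the caller waits and retries. */
>>> static bool journal_reserve(int worst_case_slots)
>>> {
>>> 	if (journal_free < worst_case_slots)
>>> 		return false;
>>> 	journal_free -= worst_case_slots;
>>> 	return true;
>>> }
>>>
>>> /* Step 3 becomes safe: a split during replay only consumes slots that
>>>  * were already reserved, so it can never block on a full journal.
>>>  * Whatever the insert did not need is handed back afterwards. */
>>> static void journal_release(int unused_slots)
>>> {
>>> 	journal_free += unused_slots;
>>> }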
>>>
>>> This fix is not simple, and I am currently working on other fixes (4Kn
>>> hard drives and big endian...). If no one else helps with the fix, it
>>> would be a while before I can focus on it.
>>>
>>> Because you mentioned that fstrim happened in your guests: if the backing
>>> device of bcache supports DISCARD/TRIM, bcache will also invalidate the
>>> fstrim range in its internal btree, which may generate more btree metadata
>>> I/O. Therefore I guess it might be related to the journal.
>>>
>>> Hmm, how about I compose a patch to display the number of free journal
>>> slots? If such an issue happens again and you can still access sysfs,
>>> let's check and see whether this is a journal issue. Maybe I am wrong,
>>> but it's worth a try.
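>>>
>>> Just to show the shape of it (the attribute and the counter below are
>>> placeholders, not existing bcache identifiers; the real patch would export
>>> whatever the journal code actually tracks, read under the journal lock),
>>> something like a new read-only attribute in drivers/md/bcache/sysfs.c:
>>>
>>> 	read_attribute(journal_free_entries);	/* placeholder name */
>>>
>>> plus, in the existing SHOW(__bch_cache_set) handler:
>>>
>>> 	sysfs_print(journal_free_entries, c->journal.free_entries);
>>>
>>> and the matching &sysfs_journal_free_entries entry in the cache set
>>> attribute array.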
>>
>> I don't believe the journal was full - the workload at that time (02:00
>> AM) is mostly read-only plus deleting and truncating files, and the
>> journal is pretty big at 250GB. Ceph turns fstrim inside the guests into
>> truncates and file deletes outside the guest. So the real workload for
>> bcache was:
>> - reads (backup time)
>> - deleting files (xfs)
>> - truncating files (xfs)
> 
> Hi Stefan,
> 
> I guess maybe we were talking about different journals. I should have said
> explicitly: the bcache journal, sorry for misleading you.
>
> There is also a bcache journal, which is around 500MB in size and holds
> metadata. If there are too many metadata operations, it is quite easy to
> fill it up completely.

Ah OK, perfect. The only problem is that I wasn't able to connect to the
server anymore when this happened, so I'm unable to get data from sysfs.
Would it be possible to add the value to the printk line we already get?
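
Something like this is what I mean, just as a sketch - I don't know the
right place in the code or the real field name, you would know better:

	pr_err("bcache: journal full, free entries %u\n",
	       c->journal.free_entries);	/* field name is my guess */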

Greets,
Stefan

> Coly Li
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


