On 2018/8/8 7:52 PM, Stefan Priebe - Profihost AG wrote:
> Hi,
>
> On 07.08.2018 at 16:35, Coly Li wrote:
>> On 2018/8/7 3:41 AM, Stefan Priebe - Profihost AG wrote:
>>> On 06.08.2018 at 16:21, Coly Li wrote:
>>>> On 2018/8/6 9:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi Coly,
>>>>>
>>>>> On 06.08.2018 at 15:06, Coly Li wrote:
>>>>>> On 2018/8/6 2:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>> Hi Coly,
>>>>>>>
>>>>>>> While running the SLES15 kernel I observed a workqueue lockup and
>>>>>>> a totally crashed system today.
>>>>>>>
>>>>>>> The dmesg output is about 3.5 MB, but it seems to just repeat the
>>>>>>> bch_data_insert_keys message.
>>>>>>
>>>>>> Hi Stefan,
>>>>>>
>>>>>> Thanks for your information!
>>>>>>
>>>>>> Could you please give me any hint on how to reproduce it? Even if
>>>>>> it is not stably reproducible, a detailed procedure may help me a
>>>>>> lot.
>>>>>
>>>>> I'm sorry, but I can't reproduce it. It happens out of nothing in
>>>>> our Ceph production cluster.
>>>>
>>>> I see. Could you please share the configuration information? E.g.:
>>>
>>> Sure.
>>>
>>>> - How many CPU cores?
>>> 12
>>>
>>>> - How much physical memory?
>>> 64GB
>>>
>>>> - How large is the SSD, NVMe or SATA?
>>> bcache cache size: 250GB SATA SSD
>>>
>>>> - How many SSDs?
>>> 1x
>>>
>>>> - How large (and how many) are the backing hard drives?
>>> 2x 1TB
>>>
>>>> I will try to simulate a similar workload with fio and see how lucky
>>>> I am.
>>>
>>> Thanks!
>>>
>>> Generally the workload in that timeframe was mostly reads, fsync
>>> inside the guests, and fstrim.
>>
>> Hi Stefan,
>>
>> From your information, I suspect this was a journal-related deadlock.
>>
>> If too many small I/Os make the btree inside bcache grow too fast, and
>> in turn exhaust the journal space, a deadlock-like hang can happen.
>>
>> Junhui tried to fix it by increasing the journal slot size, but the
>> root cause is not fixed yet. The journal operation in bcache is not
>> atomic. That means a btree node write goes into the journal first, and
>> is then inserted into the btree node by journal replay. If the btree
>> node has to be split during the journal replay, the split metadata
>> needs to go into the journal first; if the journal space is already
>> exhausted, a deadlock may happen.
>>
>> A real fix is to make the bcache journal operation atomic, that means:
>> 1. Reserve the estimated journal slots before a journal I/O.
>> 2. If the reservation succeeds, go ahead; if it fails, wait and try
>>    again.
>> 3. If journal replay results in a btree split, the journal slots for
>>    the new metadata are already reserved in the journal and can never
>>    fail.
>>
>> This fix is not simple, and I am currently working on other fixes (4Kn
>> hard drives and big endian...). If no one else helps with the fix, it
>> will be a while before I can focus on it.
>>
>> Because you mentioned fstrim happened in your guests: if the backing
>> device of bcache supports DISCARD/TRIM, bcache will also invalidate
>> the fstrim range in its internal btree, which may generate more btree
>> metadata I/O. Therefore I guess it might be related to the journal.
>>
>> Hmm, how about I compose a patch to display the number of free journal
>> slots? If such an issue happens again and you can still access sysfs,
>> we can check whether this is a journal issue. Maybe I am wrong, but it
>> is worth a try.
>
> I don't believe the journal was full - the workload at that time (02:00
> AM) was mostly read-only plus deleting / truncating files, and the
> journal is pretty big at 250GB.
> Ceph handles fstrim inside the guests as truncates and file deletes
> outside the guest. So the real workload for bcache was:
> - reads (backup time)
> - file deletes (xfs)
> - file truncates (xfs)

Hi Stefan,

I guess we talked about different journals. I should say it explicitly:
the bcache journal - sorry for misleading you.

There is a bcache journal too, which is around 500MB in size and used
for metadata. If there are too many metadata operations, it is quite
easy to fill it up completely.

Coly Li
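
P.S. To make the reserve-before-write idea from steps 1-3 above a bit
more concrete, here is a rough user-space sketch of the concept. This is
not bcache code; all names and slot counts below are made up, and real
kernel code would use wait queues and the journal's own bookkeeping
rather than pthreads:

/*
 * Illustrative sketch only: reserve journal slots for the worst case
 * before writing, so a later btree split can never block on a full
 * journal. Not bcache code; names and numbers are invented.
 */
#include <pthread.h>
#include <stdio.h>

#define JOURNAL_SLOTS 128                 /* pretend journal capacity */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  have_space = PTHREAD_COND_INITIALIZER;
static int free_slots = JOURNAL_SLOTS;

/* Steps 1 and 2: reserve the estimated slots before any journal I/O;
 * if there is not enough space, wait and try again. */
static void journal_reserve(int estimated)
{
	pthread_mutex_lock(&lock);
	while (free_slots < estimated)
		pthread_cond_wait(&have_space, &lock);
	free_slots -= estimated;
	pthread_mutex_unlock(&lock);
}

/* Called when journal entries are reclaimed, or when a reservation
 * turned out to be larger than actually needed. */
static void journal_release(int slots)
{
	pthread_mutex_lock(&lock);
	free_slots += slots;
	pthread_cond_broadcast(&have_space);
	pthread_mutex_unlock(&lock);
}

/* Step 3: the worst case (key plus split metadata) is reserved up
 * front, so writing the split metadata can never deadlock on the
 * journal being full. */
static void insert_key(int splits)
{
	int reserved = 3;                 /* key + two split metadata entries */
	int used = splits ? 3 : 1;

	journal_reserve(reserved);
	/* ... write the key, and the split metadata if the node splits ... */
	journal_release(reserved - used); /* hand back the unused slots */
	printf("inserted, used %d slot(s), %d free\n", used, free_slots);
}

int main(void)
{
	insert_key(0);          /* plain insert */
	insert_key(1);          /* insert that triggers a split */
	journal_release(4);     /* pretend reclaim frees the used entries */
	printf("free slots after reclaim: %d\n", free_slots);
	return 0;
}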