Hi,

On 07.08.2018 at 16:35, Coly Li wrote:
> On 2018/8/7 3:41 AM, Stefan Priebe - Profihost AG wrote:
>> On 06.08.2018 at 16:21, Coly Li wrote:
>>> On 2018/8/6 9:33 PM, Stefan Priebe - Profihost AG wrote:
>>>> Hi Coly,
>>>>
>>>> On 06.08.2018 at 15:06, Coly Li wrote:
>>>>> On 2018/8/6 2:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>>> Hi Coly,
>>>>>>
>>>>>> while running the SLES15 kernel I observed a workqueue lockup and a
>>>>>> totally crashed system today.
>>>>>>
>>>>>> The dmesg output is about 3.5 MB, but it seems to just repeat the
>>>>>> bch_data_insert_keys message.
>>>>>
>>>>> Hi Stefan,
>>>>>
>>>>> Thanks for your information!
>>>>>
>>>>> Could you please give me a hint on how to reproduce it? Even if it is
>>>>> not stably reproducible, a detailed procedure would help me a lot.
>>>>
>>>> I'm sorry, but I can't reproduce it. It happens out of nothing in
>>>> our Ceph production cluster.
>>>
>>> I see. Could you please share the configuration information? E.g.:
>>
>> Sure.
>>
>>> - How many CPU cores
>> 12
>>
>>> - How much physical memory
>> 64GB
>>
>>> - How large the SSD is, NVMe or SATA
>> bcache cache size: 250GB SATA SSD
>>
>>> - How many SSDs
>> 1x
>>
>>> - How large (and how many) the backing hard drives are
>> 2x 1TB
>>
>>> I will try to simulate a similar workload with fio and see how lucky I am.
>>
>> Thanks!
>>
>> Generally, the workload in that timeframe was mostly reads, fsync inside
>> the guests, and fstrim.
>
> Hi Stefan,
>
> From your information, I suspect this was a journal-related deadlock.
>
> If there is too much small I/O, the btree inside bcache grows too fast
> and in turn exhausts the journal space, and then a dead-lock-like hang
> can happen.
>
> Junhui tried to fix it by increasing the journal slot size, but the root
> cause is not fixed yet. The journal operation in bcache is not atomic:
> a btree node write goes into the journal first and is then inserted into
> the btree node by journal replay. If the btree node has to be split
> during the journal replay, the split metadata needs to go into the
> journal first; if the journal space is already exhausted, a deadlock may
> happen.
>
> A real fix is to make the bcache journal operation atomic, that means:
> 1. Reserve the estimated journal slots before a journal I/O.
> 2. If the reservation succeeds, go ahead; if it fails, wait and try again.
> 3. If journal replay results in a btree split, the journal slots for the
>    new metadata are already reserved in the journal and can never fail.
>
> This fix is not simple, and I am currently working on other fixes (4Kn
> hard drives and big endian...). If no one else helps with the fix, it
> will be a while before I can focus on it.
>
> Because you mentioned that fstrim happened in your guests: if the backing
> device of bcache supports DISCARD/TRIM, bcache will also invalidate the
> fstrim range in its internal btree, which may generate more btree
> metadata I/O. Therefore I guess it might be related to the journal.
>
> Hmm, how about I compose a patch to display the number of free journal
> slots? If such an issue happens again and you can still access sysfs,
> let's check and see whether this is a journal issue. Maybe I am wrong,
> but it's good to try.

I don't believe the journal was full - the workload at that time (02:00 AM)
is mostly read-only plus file deletes and truncates, and the journal is
pretty big with 250GB. Ceph handles fstrim inside the guests as truncates
and file deletes outside the guest.

So the real workload for bcache was:
- reads (backup time)
- file deletes (xfs)
- file truncates (xfs)

Greets,
Stefan

> Thanks.
>
> Coly Li
>
>>>>>> The beginning is:
>>>>>>
>>>>>> 2018-08-06 02:08:06 BUG: workqueue lockup - pool cpus=1 node=0 flags=0x1 nice=0 stuck for 51s!
>>>>>> 2018-08-06 02:08:06 pending: memcg_kmem_cache_create_func
>>>>>> 2018-08-06 02:08:06 delayed: memcg_kmem_cache_create_func
>>>>>> 2018-08-06 02:08:06 workqueue bcache: flags=0x8
>>>>>> 2018-08-06 02:08:06 pwq 22: cpus=11 node=0 flags=0x0 nice=0 active=1/256
>>>>>> 2018-08-06 02:08:06 in-flight: 1764369:bch_data_insert_keys [bcache]
>>>>>> 2018-08-06 02:08:06 pwq 18: cpus=9 node=0 flags=0x1 nice=0 active=256/256 MAYDAY
>>>>>> 2018-08-06 02:08:06 in-flight: 1765894:bch_data_insert_keys [bcache],
>>>>>> 1765908:bch_data_insert_keys [bcache], 1765931:bch_data_insert_keys [bcache],
>>>>>> 1765984:bch_data_insert_keys [bcache], 1765815:bch_data_insert_keys [bcache],
>>>>>> 1765893:bch_data_insert_keys [bcache], 1765981:bch_data_insert_keys [bcache],
>>>>>> 1765875:bch_data_insert_keys [bcache], 1765963:bch_data_insert_keys [bcache],
>>>>>> 1765960:bch_data_insert_keys [bcache], 1765889:bch_data_insert_keys [bcache],
>>>>>> 1765989:bch_data_insert_keys [bcache], 1765897:bch_data_insert_keys [bcache],
>>>>>> 1765911:bch_data_insert_keys [bcache], 1765924:bch_data_insert_keys [bcache],
>>>>>> 1765808:bch_data_insert_keys [bcache], 1765879:bch_data_insert_keys [bcache],
>>>>>> 1765948:bch_data_insert_keys [bcache], 1765970:bch_data_insert_keys [bcache],
>>>>>> 1765859:bch_data_insert_keys [bcache], 1765884:bch_data_insert_keys [bcache]
>>>>>> 2018-08-06 02:08:06 , 1765952:bch_data_insert_keys [bcache],
>>>>>> 1765990:bch_data_insert_keys [bcache], 1765817:bch_data_insert_keys [bcache],
>>>>>> 1765858:bch_data_insert_keys [bcache], 1765928:bch_data_insert_keys [bcache],
>>>>>> 1765936:bch_data_insert_keys [bcache], 1762396:bch_data_insert_keys [bcache],
>>>>>> 1765831:bch_data_insert_keys [bcache], 1765847:bch_data_insert_keys [bcache],
>>>>>> 1765895:bch_data_insert_keys [bcache], 1765925:bch_data_insert_keys [bcache],
>>>>>> 1765967:bch_data_insert_keys [bcache], 1765798:bch_data_insert_keys [bcache],
>>>>>> 1765827:bch_data_insert_keys [bcache], 1765857:bch_data_insert_keys [bcache],
>>>>>> 1765979:bch_data_insert_keys [bcache], 1765809:bch_data_insert_keys [bcache],
>>>>>> 1765856:bch_data_insert_keys [bcache], 1765878:bch_data_insert_keys [bcache],
>>>>>> 1765918:bch_data_insert_keys [bcache], 1765934:bch_data_insert_keys [bcache]
>>>>>> 2018-08-06 02:08:06 , 1765982:bch_data_insert_keys [bcache],
>>>>>> 1765813:bch_data_insert_keys [bcache], 1765883:bch_data_insert_keys [bcache],
>>>>>> 1765993:bch_data_insert_keys [bcache], 1765834:bch_data_insert_keys [bcache],
>>>>>> 1765920:bch_data_insert_keys [bcache], 1765962:bch_data_insert_keys [bcache],
>>>>>> 1765788:bch_data_insert_keys [bcache], 1765882:bch_data_insert_keys [bcache],
>>>>>> 1765942:bch_data_insert_keys [bcache], 1765825:bch_data_insert_keys [bcache],
>>>>>> 1765854:bch_data_insert_keys [bcache], 1765902:bch_data_insert_keys [bcache],
>>>>>> 1765838:bch_data_insert_keys [bcache], 1765868:bch_data_insert_keys [bcache],
>>>>>> 1765932:bch_data_insert_keys [bcache], 1765944:bch_data_insert_keys [bcache],
>>>>>> 1765975:bch_data_insert_keys [bcache], 1765983:bch_data_insert_keys [bcache],
>>>>>> 1765810:bch_data_insert_keys [bcache], 1765863:bch_data_insert_keys [bcache]
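
PS: just to make sure I understood the proposed fix correctly - the
three-step reservation idea quoted above would, in a very simplified
userspace model, look roughly like the sketch below. This is only an
illustration with made-up names (journal_model, journal_reserve,
journal_consume, the slot numbers), not actual bcache code:

/*
 * Illustrative userspace model of the "reserve before you write" idea
 * described above -- this is NOT bcache code. All names and numbers
 * here are invented for the example.
 */
#include <stdbool.h>
#include <stdio.h>

#define JOURNAL_SLOTS 128          /* pretend journal capacity */

struct journal_model {
	int free_slots;            /* slots not yet promised to a writer */
};

/*
 * Steps 1 and 2 of the proposal: estimate the worst case (the insert
 * itself plus a possible btree node split) and reserve that many slots
 * up front. If the reservation fails, the caller backs off and retries
 * instead of starting a journal write it cannot finish.
 */
static bool journal_reserve(struct journal_model *j, int estimated_slots)
{
	if (j->free_slots < estimated_slots)
		return false;      /* caller must wait and try again */
	j->free_slots -= estimated_slots;
	return true;
}

/*
 * Step 3: a split during replay only consumes slots that were already
 * reserved, so it can never block on an exhausted journal. Any
 * over-estimate is returned to the free pool here.
 */
static void journal_consume(struct journal_model *j, int reserved, int used)
{
	j->free_slots += reserved - used;
}

int main(void)
{
	struct journal_model j = { .free_slots = JOURNAL_SLOTS };
	int estimate = 4;          /* insert + possible split, worst case */

	if (journal_reserve(&j, estimate)) {
		/* ... journal write and btree insert would happen here ... */
		journal_consume(&j, estimate, 2);  /* only 2 slots really used */
		printf("free journal slots left: %d\n", j.free_slots);
	} else {
		printf("journal full, back off and retry later\n");
	}
	return 0;
}

The essential property is that a writer never starts a journal write it
might not be able to finish: the worst case is reserved up front, and any
over-estimate is given back afterwards.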
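And if your patch ends up exposing the free slot count somewhere under
/sys/fs/bcache/, I could run a trivial poller like the one below during
the 02:00 backup window and correlate it with the lockup messages. The
attribute path here is purely hypothetical (the patch does not exist
yet); I would adapt it to whatever the patch actually creates:

/*
 * Tiny poller for the free-journal-slot counter Coly offers to expose.
 * The sysfs path below is only a guess; adjust it to the attribute the
 * patch actually creates.
 */
#include <stdio.h>
#include <unistd.h>

/* assumed path -- replace <set-uuid> with the real cache set UUID */
#define JOURNAL_FREE_ATTR "/sys/fs/bcache/<set-uuid>/internal/journal_free"

int main(void)
{
	char buf[64];

	for (;;) {
		FILE *f = fopen(JOURNAL_FREE_ATTR, "r");

		if (!f) {
			perror(JOURNAL_FREE_ATTR);
			return 1;
		}
		if (fgets(buf, sizeof(buf), f))
			printf("free journal slots: %s", buf);
		fclose(f);
		sleep(5);          /* sample every few seconds */
	}
}

If the value drops to 0 right before the bch_data_insert_keys pile-up,
that would support the journal-exhaustion theory; if it stays high, the
problem is probably elsewhere.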