On 21.08.2018 at 11:55, Coly Li wrote:
> On 2018/8/21 5:46 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Coly,
>>
>> Today I had again the following deadlock where the whole system gets stuck and can't recover:
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666200] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x1 nice=0 stuck for 35s!
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666228] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x1 nice=0 stuck for 47s!
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666248] BUG: workqueue lockup - pool cpus=11 node=0 flags=0x1 nice=0 stuck for 47s!
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666267] Showing busy workqueues and worker pools:
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666279] workqueue events: flags=0x0
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666289] pwq 22: cpus=11 node=0 flags=0x1 nice=0 active=4/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666305] pending: mlx5e_rx_am_work [mlx5_core], kernfs_notify_workfn, mlx5e_rx_am_work [mlx5_core], free_work
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666346] pwq 18: cpus=9 node=0 flags=0x1 nice=0 active=3/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666362] pending: free_work, mlx5e_rx_am_work [mlx5_core], mlx5e_rx_am_work [mlx5_core]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666390] pwq 8: cpus=4 node=0 flags=0x1 nice=0 active=8/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666405] pending: journal_write [bcache], mlx5e_rx_am_work [mlx5_core], mlx5e_rx_am_work [mlx5_core], free_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666468] pwq 4: cpus=2 node=0 flags=0x1 nice=0 active=8/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666501] pending: mlx5e_rx_am_work [mlx5_core], mlx5e_rx_am_work [mlx5_core], rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, vmpressure_work_fn, free_work
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666600] workqueue bcache: flags=0x8
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666610] pwq 22: cpus=11 node=0 flags=0x1 nice=0 active=9/256 MAYDAY
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666626] in-flight: 375565:bch_data_insert_keys [bcache], 375566:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666653] pending: bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666710] pwq 18: cpus=9 node=0 flags=0x1 nice=0 active=12/256 MAYDAY
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666726] in-flight: 375501:bch_data_insert_keys [bcache], 373349:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666753] pending: bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache], bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666837] pwq 16: cpus=8 node=0 flags=0x0 nice=0 active=1/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666854] in-flight: 375294:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666873] pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=44/256
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.666889] in-flight: 376496:bch_data_insert_keys [bcache], 376501:bch_data_insert_keys [bcache], 376520:bch_data_insert_keys [bcache], 376503:bch_data_insert_keys [bcache], 376492:bch_data_insert_keys [bcache], 376516:bch_data_insert_keys [bcache], 376526:bch_data_insert_keys [bcache], 376504:bch_data_insert_keys [bcache], 376524:bch_data_insert_keys [bcache], 376497:bch_data_insert_keys [bcache], 376507:bch_data_insert_keys [bcache], 376523:bch_data_insert_keys [bcache], 376522:bch_data_insert_keys [bcache], 376510:bch_data_insert_keys [bcache], 376498:bch_data_insert_keys [bcache], 376518:bch_data_insert_keys [bcache], 376506:bch_data_insert_keys [bcache], 376493:bch_data_insert_keys [bcache], 376513:bch_data_insert_keys [bcache], 376530:bch_data_insert_keys [bcache], 376499:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.669998] , 376519:bch_data_insert_keys [bcache], 376512:bch_data_insert_keys [bcache], 376532:bch_data_insert_keys [bcache], 376490:bch_data_insert_keys [bcache], 376527:bch_data_insert_keys [bcache], 376514:bch_data_insert_keys [bcache], 376491:bch_data_insert_keys [bcache], 376500:bch_data_insert_keys [bcache], 376502:bch_data_insert_keys [bcache], 376508:bch_data_insert_keys [bcache], 376525:bch_data_insert_keys [bcache], 376494:bch_data_insert_keys [bcache], 376509:bch_data_insert_keys [bcache], 376528:bch_data_insert_keys [bcache], 376517:bch_data_insert_keys [bcache], 376521:bch_data_insert_keys [bcache], 376495:bch_data_insert_keys [bcache], 376515:bch_data_insert_keys [bcache], 376531:bch_data_insert_keys [bcache], 376505:bch_data_insert_keys [bcache], 376529:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.674767] , 376533:bch_data_insert_keys [bcache], 376511:bch_data_insert_keys [bcache]
>> Aug 21 00:28:36 cloud1-1472 kernel: [1150884.675501] pwq 8: cpus=4 node=0 flags=0x1 nice=0 active=256/256
>> Aug 21 00:28:37 cloud1-1472 kernel: [1150884.676303] in-flight: 377020:bch_data_insert_keys [bcache], 375498:bch_data_insert_keys [bcache], 377015:bch_data_insert_keys [bcache], 377019:bch_data_insert_keys [bcache], 377010:bch_data_insert_keys [bcache], 377004:bch_data_insert_keys [bcache], 377016:bch_data_insert_keys [bcache], 377012:bch_data_insert_keys [bcache], 377007:bch_data_insert_keys [bcache], 377014:bch_data_insert_keys [bcache], 377017:bch_data_insert_keys [bcache], 377013:bch_data_insert_keys [bcache], 377018:bch_data_insert_keys [bcache], 375496:bch_data_insert_keys [bcache], 377005:bch_data_insert_keys [bcache], 377009:bch_data_insert_keys [bcache], 362(RESCUER):bch_data_insert_keys [bcache] bch_data_insert_keys [bcache] bch_data_insert_keys [bcache] bch_data_insert_keys [bcache] bch_data_insert_keys [bcache] bch_data_insert_keys [bcache]
>> kern
>
> Hi Stefan,
>
> Thanks for the informative log. I see journal write and key insert keywords in the log, so I keep suspecting a potential locking issue in the bcache journal code.
>
> This issue is on my todo list, and I am looking at it now.

OK, great, and thanks. I hope you can fix it.
>
> Coly Li
>
>
>>
>> On 09.08.2018 at 12:22, Coly Li wrote:
>>> On 2018/8/9 6:12 PM, Stefan Priebe - Profihost AG wrote:
>>>> Hi,
>>>> On 09.08.2018 at 08:37, Coly Li wrote:
>>>>> On 2018/8/8 7:52 PM, Stefan Priebe - Profihost AG wrote:
>>>>>> Hi,
>>>>>> On 07.08.2018 at 16:35, Coly Li wrote:
>>>>>>> On 2018/8/7 3:41 AM, Stefan Priebe - Profihost AG wrote:
>>>>>>>> On 06.08.2018 at 16:21, Coly Li wrote:
>>>>>>>>> On 2018/8/6 9:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> Hi Coly,
>>>>>>>>>> On 06.08.2018 at 15:06, Coly Li wrote:
>>>>>>>>>>> On 2018/8/6 2:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>> Hi Coly,
>>>>>>>>>>>>
>>>>>>>>>>>> While running the SLES15 kernel I observed a workqueue lockup and a totally crashed system today.
>>>>>>>>>>>>
>>>>>>>>>>>> The dmesg output is about 3.5 MB, but it seems it just repeats the bch_data_insert_keys msg.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Stefan,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your information!
>>>>>>>>>>>
>>>>>>>>>>> Could you please give me any hint on how to reproduce it? Even if it is not stably reproducible, a detailed procedure may help me a lot.
>>>>>>>>>>
>>>>>>>>>> I'm sorry, but I can't reproduce it. It just happens out of nowhere in our Ceph production cluster.
>>>>>>>>>
>>>>>>>>> I see. Could you please share the configuration information? E.g.
>>>>>>>>
>>>>>>>> Sure.
>>>>>>>>
>>>>>>>>> - How many CPU cores
>>>>>>>> 12
>>>>>>>>
>>>>>>>>> - How much physical memory
>>>>>>>> 64GB
>>>>>>>>
>>>>>>>>> - How large the SSD is, NVMe or SATA
>>>>>>>> bcache cache size: 250GB SATA SSD
>>>>>>>>
>>>>>>>>> - How many SSDs
>>>>>>>> 1x
>>>>>>>>
>>>>>>>>> - How large (and how many) the backing hard drives are
>>>>>>>> 2x 1TB
>>>>>>>>
>>>>>>>>> I will try to simulate a similar workload with fio and see how lucky I am.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Generally the workload in that timeframe was mostly reads, fsync inside the guests, and fstrim.
>>>>>>>
>>>>>>> Hi Stefan,
>>>>>>>
>>>>>>> From your information, I suspect this was a journal-related deadlock.
>>>>>>>
>>>>>>> If there are too many small I/Os, they make the btree inside bcache grow too fast and in turn exhaust the journal space, and then a deadlock-like hang can happen.
>>>>>>>
>>>>>>> Junhui tried to fix it by increasing the journal slot size, but the root cause is not fixed yet. The journal operation in bcache is not atomic: a btree node write goes into the journal first, then is inserted into the btree node by journal replay. If the btree node has to be split during the journal replay, the split metadata needs to go into the journal first; if the journal space is already exhausted, a deadlock may happen.
>>>>>>>
>>>>>>> A real fix is to make the bcache journal operation atomic, that means:
>>>>>>> 1. Reserve the estimated journal slots before a journal I/O.
>>>>>>> 2. If the reservation succeeds, go ahead; if it fails, wait and try again.
>>>>>>> 3. If journal replay results in a btree split, the journal slot for the new metadata is already reserved in the journal and never fails.
>>>>>>>
>>>>>>> This fix is not simple, and I am currently working on other fixes (4Kn hard drives and big endian...). If no one else helps with the fix, it will be a while before I can focus on it.
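(As a rough illustration only of the reserve-then-write idea in steps 1-3 above, here is a minimal user-space C sketch. Every name in it - struct journal, journal_reserve(), journal_release(), SPLIT_RESERVE, insert_keys() - is invented for this example and is not the actual bcache journal code.)

#include <pthread.h>
#include <stddef.h>

struct journal {
	pthread_mutex_t lock;
	pthread_cond_t  space_freed;
	size_t          free_slots;   /* journal slots not yet promised to anyone */
};

static struct journal demo_journal = {
	.lock        = PTHREAD_MUTEX_INITIALIZER,
	.space_freed = PTHREAD_COND_INITIALIZER,
	.free_slots  = 128,           /* made-up journal capacity */
};

/* Step 1 + 2: block until the estimated number of slots is reserved. */
static void journal_reserve(struct journal *j, size_t nr_slots)
{
	pthread_mutex_lock(&j->lock);
	while (j->free_slots < nr_slots)          /* step 2: wait and retry */
		pthread_cond_wait(&j->space_freed, &j->lock);
	j->free_slots -= nr_slots;                /* reservation is now guaranteed */
	pthread_mutex_unlock(&j->lock);
}

/* Called when journal entries are written back and their slots are reclaimed. */
static void journal_release(struct journal *j, size_t nr_slots)
{
	pthread_mutex_lock(&j->lock);
	j->free_slots += nr_slots;
	pthread_cond_broadcast(&j->space_freed);
	pthread_mutex_unlock(&j->lock);
}

/*
 * Step 3: a key insert reserves its worst case up front, including the extra
 * slots a btree node split could need, so the later write/replay path never
 * has to block on a full journal half way through.
 */
#define SPLIT_RESERVE 2                           /* made-up worst-case split overhead */

static void insert_keys(struct journal *j, size_t slots_for_keys)
{
	size_t total = slots_for_keys + SPLIT_RESERVE;

	journal_reserve(j, total);
	/* ... write the journal entry and update the btree here ... */
	journal_release(j, total);
}

int main(void)
{
	insert_keys(&demo_journal, 4);   /* e.g. a small key insert needing 4 slots */
	return 0;
}

The point of the sketch is only the ordering: the reservation either succeeds completely before any journal write starts, or the caller waits, so no step in the middle can ever block on a full journal.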
>>>>>>>
>>>>>>> Because you mentioned fstrim happening in your guests: if the backing device of bcache supports DISCARD/TRIM, bcache will also invalidate the fstrim range in its internal btree, which may generate more btree metadata I/O. Therefore I guess it might be related to the journal.
>>>>>>>
>>>>>>> Hmm, how about I compose a patch to display the number of free journal slots? If next time such an issue happens and you can still access sysfs, let's check and see whether this is a journal issue. Maybe I am wrong, but it's good to try.
>>>>>>
>>>>>> I don't believe the journal was full - the workload at that time (02:00 AM) is mostly read-only plus deleting/truncating files, and the journal is pretty big with 250GB. Ceph handles fstrim inside guests as truncates and file deletes outside the guest. So the real workload for bcache was:
>>>>>> - read (backup time)
>>>>>> - delete files (xfs)
>>>>>> - truncate files (xfs)
>>>>>
>>>>> Hi Stefan,
>>>>>
>>>>> I guess maybe we talked about different journals. I should say explicitly: the bcache journal, sorry for misleading you.
>>>>>
>>>>> There is a bcache journal too, which is around 500MB in size for metadata. If there are too many metadata operations, it is quite easy to fill it up.
>>>>
>>>> Ah OK, perfect. The only problem is that I wasn't able to connect to the server anymore when this happened, so I'm unable to get data from sysfs. Would it be possible to add the value to the printk line we already get?
>>>
>>> Hi Stefan,
>>>
>>> I see. Now I will try to compose some patches to display the information, and also try to reduce the possibility of such a locking issue. Give me some time.
>>>
>>> Maybe rebooting the server, reloading bcache and waiting for quite a long time may give a chance to escape from such an awkward situation. Sorry that I have no idea how to fix such a deadlock on a running system...
>>>
>>> Coly Li
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>
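(Again purely an illustrative sketch, not Coly's actual patch: exposing a free-journal-slot counter through sysfs could, in principle, look like the plain kobject attribute below. journal_free_slots(), the attribute name and the kobject path are invented; a real bcache patch would instead read the live counter from the cache set's journal and hook into bcache's existing drivers/md/bcache/sysfs.c machinery.)

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

/* Stand-in for reading the real free-slot count from the journal. */
static unsigned int journal_free_slots(void)
{
	return 0;	/* placeholder value */
}

static ssize_t journal_free_slots_show(struct kobject *kobj,
				       struct kobj_attribute *attr, char *buf)
{
	return sprintf(buf, "%u\n", journal_free_slots());
}

static struct kobj_attribute journal_free_slots_attr =
	__ATTR(journal_free_slots, 0444, journal_free_slots_show, NULL);

static struct kobject *demo_kobj;

static int __init journal_slots_demo_init(void)
{
	/* Would show up as /sys/kernel/bcache_journal_demo/journal_free_slots */
	demo_kobj = kobject_create_and_add("bcache_journal_demo", kernel_kobj);
	if (!demo_kobj)
		return -ENOMEM;
	return sysfs_create_file(demo_kobj, &journal_free_slots_attr.attr);
}

static void __exit journal_slots_demo_exit(void)
{
	kobject_put(demo_kobj);
}

module_init(journal_slots_demo_init);
module_exit(journal_slots_demo_exit);
MODULE_LICENSE("GPL");

A read-only attribute like this would at least let an operator poll the counter (e.g. with cat) while the system is still responsive, which is the kind of observability Coly proposes above; when the box is already hung, as Stefan notes, a printk-based report would be the more useful variant.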