> On Mon, Jul 17, 2017 at 01:24:23AM +0000, Liang, Kan wrote:
> > Hi Don & Thomas,
> >
> > Sorry for the late response. We just finished the tests for all proposed
> > patches.
> >
> > There are three proposed patches so far.
> > Patch 1: The patch as above which speed up the hrtimer.
> > Patch 2: Thomas's first proposal.
> > https://patchwork.kernel.org/patch/9803033/
> > https://patchwork.kernel.org/patch/9805903/
> > Patch 3: my original proposal which increase the NMI watchdog timeout
> > by 3X https://patchwork.kernel.org/patch/9802053/
> >
> > According to our test, only patch 3 works well.
> > The other two patches will hang the system eventually.
> > For patch 1, the system hang after running our test case for ~1 hour.
> > For patch 2, the system hang in running the overnight test.
> > There is no error message shown when the system hang. So I don't know
> > the root cause yet.
>
> Hi Kan,
>
> Thanks for the feedback. Odd that the different patches had different results.
> What is more odd to me is the hang. I thought these were all false lockups
> that prematurely panic'd and rebooted the box.
>
> Is the machine configured to panic on hardlockup and reboot? Perhaps
> kdump is enabled to store the console log for review upon reboot?
>
> It almost implies that a hardlockup did happen but isn't being detected until
> later??
>
> >
> > BTW: We set 1 to watchdog_thresh when we did the test.
> > It's believed that can speed up the failure.
>
> Sure, you/they look for 1 second hangs instead of 10 second ones. But with
> patch3 it is more like 3 seconds'ish vs 30 second'ish.
>
> As Thomas asked, I would also be interested in the way the test works. The
> hang doesn't make sense.
>

Hi Don and Thomas,

Sorry for the late response.

We have confirmed that the hard lockup seen with the "speed up the hrtimer"
patch is actually a separate issue. Tim has already proposed a patch to fix it.
Here is his patch.
https://lkml.org/lkml/2017/8/14/1000

The patch that speeds up the hrtimer (https://lkml.org/lkml/2017/6/26/685)
is a good fix for the spurious hard lockups.

Tested-by: Kan Liang <kan.liang@xxxxxxxxx>

Please consider merging it into both the mainline and stable trees.

Thanks,
Kan