On Wed, May 29, 2019 at 06:41:46PM -0600, Alan Post wrote:
> On Fri, May 24, 2019 at 11:31:55AM -0600, Alan Post wrote:
> > On Tue, May 21, 2019 at 03:46:03PM +0000, Trond Myklebust wrote:
> > > Have you tried upgrading to 4.19.44? There is a fix that went in not
> > > too long ago that deals with a request leak that can cause stack traces
> > > like the above that wait forever.
> > >
> >
> > Following up on this.  I have set aside a rack of machines and put
> > Linux 4.19.44 on them.  They ran jobs overnight and will do the
> > same over the long weekend (Memorial Day in the US).  Given the
> > error rate (both over time and over submitted jobs) we see across
> > the cluster, this will be enough time to draw a conclusion as to
> > whether 4.19.44 exhibits this hang.
> >
>
> In the six days I've run Linux 4.19.44 on a single rack, I've seen
> no occurrences of this hang.  Given the incident rate for this
> issue across the cluster over the same period of time, I would have
> expected to see one or two incidents on the rack running 4.19.44.
>
> This is promising--I'm going to deploy 4.19.44 to another rack
> by the end of the day Friday May 31st and hope for more of the
> same.
>
[snip]
>
> I'll report back no later than next week.
>

As far as I'm concerned the problem I've reported here is resolved.
I have seen no evidence of this issue on any Linux 4.19.44 kernel,
on either the rack I originally set aside or on the second rack the
same kernel was deployed to.

In addition, we began rolling out the upstream Linux 4.19.37 I
mentioned.  The total incident rate across the cluster has trended
down in near lockstep with that deployment, and none of those
systems have shown any evidence of this hang either.

It even happened in a tremendously satisfying way: late last week we
went through a multi-day period of zero occurrences of this issue
anywhere in the cluster, including on kernel versions where it
should have been happening.  That news was *too* good--everything I
understand about the issue suggested it should have been occurring
less frequently but still occurring.  Therefore, expecting a
regression to the mean, I calculated what our incident rate should
be given our balance of kernel versions, socialized those numbers
around here, and waited for the sampling period to close.  (We have
significant day-over-day load variance and by comparison little
week-over-week load variance.)

Monday, when I revisited the problem, not only had the incident rate
regressed to the rate I expected, it had done so 'perfectly.'  The
actual incident count matched my 'best guess' inside the range of
possible values for the entire sampling period, including the
anomalous no-incident period.

We've got operations work yet to do to put this issue behind us, but
as best as I can tell the work that remains is a 'simple matter of
effort.'

Thank you Trond.  If I can be of any help, please reach out.

-A
-- 
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: adp@xxxxxxxxx
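For the curious, the expected-incident arithmetic mentioned above is easy to
reproduce.  Below is a minimal Python sketch of that kind of calculation under
a Poisson model; the kernel versions' per-machine rates, machine counts, and
sampling window are made-up placeholders rather than the actual cluster
figures, and the Poisson assumption is my own, not anything stated in the
thread.

import math

# Hypothetical fleet mix: kernel version -> (machine count, incidents per
# machine-day).  All numbers here are illustrative placeholders, not real
# cluster figures.
fleet = {
    "4.19.26": (120, 0.004),  # older kernel that still exhibits the hang
    "4.19.37": (80, 0.004),   # upstream kernel mid-rollout
    "4.19.44": (40, 0.0),     # patched kernel, no hangs observed
}

days = 7  # length of the sampling period

# Expected incidents = sum over kernels of machines * rate * days,
# treating incidents as independent events (a Poisson model).
mean = sum(count * rate * days for count, rate in fleet.values())

def poisson_interval(mu, coverage=0.95):
    """Return (lo, hi): counts bounding a central `coverage` interval
    of a Poisson(mu) distribution, computed from the pmf directly."""
    alpha = 1.0 - coverage
    # Build the pmf with the recurrence P(k) = P(k-1) * mu / k
    # until essentially all probability mass is accounted for.
    p, pmf, total, k = math.exp(-mu), [], 0.0, 0
    while total < 1 - 1e-9:
        pmf.append(p)
        total += p
        k += 1
        p *= mu / k
    # lo: smallest count keeping at most alpha/2 probability below it.
    cum, lo = 0.0, 0
    while cum + pmf[lo] < alpha / 2:
        cum += pmf[lo]
        lo += 1
    # hi: largest count keeping at most alpha/2 probability above it.
    cum, hi = 0.0, len(pmf) - 1
    while cum + pmf[hi] < alpha / 2:
        cum += pmf[hi]
        hi -= 1
    return lo, hi

lo, hi = poisson_interval(mean)
print(f"expected incidents over {days} days: {mean:.1f} "
      f"(95% of outcomes fall between {lo} and {hi})")

With placeholder numbers like these, the takeaway is only that any observed
count inside the printed range--including a short run of zero-incident
days--is consistent with the expected rate, which is the regression-to-the-
mean argument made above.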