On Fri, 24 May 2019 at 13:32, Alan Post <adp@xxxxxxxxx> wrote:
>
> On Tue, May 21, 2019 at 03:46:03PM +0000, Trond Myklebust wrote:
> > Have you tried upgrading to 4.19.44? There is a fix that went in not
> > too long ago that deals with a request leak that can cause stack traces
> > like the above that wait forever.
> >
>
> Following up on this. I have set aside a rack of machines and put
> Linux 4.19.44 on them. They ran jobs overnight and will do the
> same over the long weekend (Memorial Day in the US). Given the
> error rate (both over time and over submitted jobs) we see across
> the cluster, this will be enough time to draw a conclusion as to
> whether 4.19.44 exhibits this hang.
>
> Other than stack traces, what kind of information could I collect
> that would be helpful for debugging or describing more precisely
> what is happening to these hosts? I'd like to exit from the condition
> of trying different kernels (as you no doubt saw in my initial message
> I've done a lot of it) and enter the condition of debugging or
> reproducing the problem.
>
> I'll report back early next week and appreciate your feedback,
>

Perhaps the output from 'cat /sys/kernel/debug/rpc_clnt/*/tasks'?

Thanks
  Trond
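
A minimal sketch of how that output could be captured with a timestamp on an affected host. It assumes debugfs is mounted at /sys/kernel/debug and that the shell runs as root; the function name dump_rpc_tasks and the optional directory argument are illustrative only and not part of any NFS tooling.

```shell
#!/bin/sh
# Hypothetical helper: print the task list of every RPC client.
# Assumption: debugfs is mounted at /sys/kernel/debug and we run as root.
# An alternative base directory may be passed as $1.
dump_rpc_tasks() {
    base="${1:-/sys/kernel/debug/rpc_clnt}"

    # Timestamp the snapshot so successive dumps can be compared.
    date -u '+# snapshot %Y-%m-%dT%H:%M:%SZ'

    for f in "$base"/*/tasks; do
        # Skip the unexpanded glob when no client directories exist.
        [ -e "$f" ] || continue
        # Label each file, since plain 'cat' on the glob would merge
        # all clients into one unlabelled stream.
        echo "== $f"
        cat "$f"
    done
}
```

Calling dump_rpc_tasks while a job is hung, and again a minute or so later, should show whether the same entries stay stuck between snapshots; redirecting each call to a file of your choosing keeps the history.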