Jon Pruente wrote:
> On Wed, May 22, 2019 at 10:02 AM mark <m.roth@xxxxxxxxx> wrote:
>
>> That seems unlikely. For one, I've seen that... but I *always* see
>> entries in the log about the oom-killer being invoked. For another, this
>> isn't a compute node, it's *only* a fileserver, serving projects, home
>> directories, and backups (home-grown b/u, uses rsync), and backups
>> don't start until well after midnight, and as we're business-hours only,
>> there was less usage, and it does have 256G RAM....
>>
> I have two servers that would lock up like this occasionally, and if I
> let them sit at the console long enough sometimes they would give a login
> prompt. It took a lot of time and frustration (these are prod servers)
> but I tracked it down to a problem in the XFS driver, as it never occurred
> on the systems with EXT4 filesystems. The XFS driver would hang,
> preventing writes to the filesystem. I could identify exactly when that
> happened as all system logging would suddenly stop at the same second.
> Then OOMKiller would come in and start killing off processes but that
> wouldn't be in the logs on disk because the file system couldn't write.
> I rolled the servers back to a 5xx series kernel and the issue didn't
> resurface. I recently let them boot the newer 9xx series kernels and I'm
> hoping the XFS issue is fixed.

I have no idea if that's it... and the cluster nodes that would have it
happen, a few years ago, were ext4.

Crap - I just went to look on the system that died, and from sar, I see
that it died between 18:10 and 18:20, and we found it unresponsive when I
got in at 09:00. I'd think that was enough time to print something.

        mark

_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
https://lists.centos.org/mailman/listinfo/centos