Matt Garman wrote:
> Is anyone on the list using kerberized-nfs on any kind of scale?

We use it here. I don't think I'm an expert - my manager is - but let me
think about your issues.

<snip>

> Just to give a little insight into our issues: we have an
> in-house-developed compute job dispatching system. Say a user has
> hundreds of analysis jobs he wants to run; he submits them to a
> central master process, which in turn dispatches them to a "farm" of >100
> compute nodes. All of these nodes have two different krb5p NFS
> mounts, to which the jobs read and write. So while the users can
> technically log in directly to the compute nodes, in practice they
> never do. The logins are only "implicit" when the job dispatching
> system does a behind-the-scenes ssh to kick off these processes.

I would strongly recommend that you look into slurm. It's being used
here at both large and small scale, and it exists for exactly that
purpose.

> Just to give some "flavor" to the kinds of issues we're facing, what
> tends to crop up is one of three things:
>
> (1) Random crashes. These are full-on kernel trace dumps followed by
> an automatic reboot. This was really bad under CentOS 5; a random
> kernel upgrade magically fixed it. It happens almost never under
> CentOS 6, but fairly frequently under CentOS 7. (We're completely
> off CentOS 5 now, BTW.)

That may well be a separate issue.

> (2) Permission denied issues. I have user Kerberos tickets
> configured for 70 days, but there is clearly some kind of
> undocumented kernel caching going on. Looking at the Kerberos server
> logs, it looks like it *could* be a performance issue, as I see
> hundreds of ticket requests within the same second when someone
> tries to launch a lot of jobs. Many of these fail with "permission
> denied", but if they immediately retry, it works. Related to this, I
> have been unable to figure out what creates and deletes the
> /tmp/krb5cc_uid_random files.

Are they asking for *new* credentials each time? They should only be
doing one kinit. (There's a quick way to check in the P.S. below.)

> (3) Kerberized NFS shares getting "stuck" for one or more users. We
> have another monitoring app (in-house developed) that, among other
> things, makes periodic checks of these NFS mounts. It does so by
> forking and running a simple "ls" command, to ensure that the mounts
> are alive and well. Sometimes the "ls" command gets stuck to the
> point where it can't even be killed via "kill -9"; only a reboot
> fixes it. But the mount is only stuck for the user running the
> monitoring app. Or sometimes the monitoring app is fine, but an
> actual user's processes get stuck in "D" state (in top, meaning
> waiting on I/O), while everyone else's jobs (and access to the
> kerberized NFS shares) are OK.

And there's nothing in the logs, correct? Have you tried attaching
strace to one of those processes, to see if you can get a clue as to
what's happening? (Sketch in the P.P.S. below.)

<snip>

mark
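
P.S. On the credential question: here's roughly how I'd check whether
each job launch is getting a fresh cache instead of reusing one. This
is only a sketch - "user@EXAMPLE.COM" is a placeholder principal, and
the 70-day lifetime only sticks if your KDC policy actually allows it:

    # on a compute node, as the affected user
    klist                  # which cache is in use, and the ticket lifetimes
    ls -l /tmp/krb5cc_*    # one krb5cc_<uid> per user is what you want; a
                           # pile of krb5cc_<uid>_<random> files often means
                           # each ssh login (sshd/pam) creates its own cache
    # one kinit up front, then point every job at that same cache:
    kinit -l 70d user@EXAMPLE.COM
    export KRB5CCNAME=FILE:/tmp/krb5cc_$(id -u)

If hundreds of jobs each trigger their own ticket request, that would
match the burst you're seeing in the KDC logs.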
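
P.P.S. On the stuck "ls" processes: "kill -9" can't remove a process in
uninterruptible sleep ("D" state), which is why only a reboot clears it.
Something along these lines might at least show where it's blocked -
<PID> is whatever hung process you've found, and the last two commands
need root:

    strace -f -tt -p <PID>   # if strace hangs just attaching, the process
                             # is stuck inside the kernel, not in userspace
    cat /proc/<PID>/stack    # kernel-side stack; anything in the sunrpc or
                             # gss layers points at the mount or rpc.gssd
    echo w > /proc/sysrq-trigger   # dump all blocked (D-state) tasks to dmesg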