On Fri, 2016-04-08 at 13:37 +0200, Marcel De Boer wrote:
> Hi!
>
> > > Whatever the problem is it isn't access to either of these two
> > > variables or the lists they may represent.
> > >
> > > They are both local variables of the mount_mount() function and so
> > > cannot be accessed simultaneously by any other function.
>
> Too bad... that means my changes probably just mixed up the timing
> enough to avoid the problem.
>
> > Btw, there has been no actual RHEL release of revision 115.
> >
> > Only 113 in RHEL-6.7 and (probably) revision 122 will be RHEL-6.8.
> > So I wonder what else went into revision 115.
> <...>
> > We probably shouldn't work with revision 122 yet so maybe we should
> > work with revision 113, not sure about that though.
>
> Ah wait... because it was just for local testing, I also changed the
> patchlevel so yum wouldn't complain. Judging from the build machine
> history, it actually is -113. Postponing writing this mail for too
> long made me forget too much...
>
> > I'm not sure I could reproduce this because I have a stress test
> > (used for RHEL) that uses (IIRC) 8 concurrent threads to test mount
> > concurrency and to test for mount to expire races.
> >
> > The maps used are somewhat more complex than what you have here so
> > perhaps I missed this point with that test.
>
> The configuration for the server uses indirect maps from the local
> filesystem. All other machines get a slightly different config through
> NIS.
>
> > However, I've recently written another RHEL test (based on this
> > test) that uses a simple indirect map with the 8 concurrent threads
> > to try and duplicate a different problem.
> >
> > I would have thought this test would expose this sort of problem but
> > after (I can't actually remember the longest run) about three days
> > of continuous running I didn't see any problems.
> >
> > Granted it was a different scenario to yours though.
>
> Of course it also looks timing-related, so there's no telling in
> exactly which configuration it'll pop up. For the machine I used for
> testing (not the same hardware as the server), the issue already
> disappeared when I locally rebuilt the same RPM as the one that was
> already installed.
>
> I already noticed changes in the frequency when I changed the versions
> of supporting packages (libtirpc) or ran it in the foreground or with
> debugging.
>
> > So I think we need to narrow down where this is occurring.
> >
> > To start with I'd add mutexes around just the parse_location() and
> > prune_host_list() functions and then if that also resolves the
> > problem drill down from there.
>
> I'll see if I can do that next week (even though the server is busy,
> it's not a disaster if it happens, but I prefer to be around to
> unwedge it.)

I can help with that by providing patches.

To start with, the change here is quite conservative; the next one
would be much less so (and quite a bit more difficult to write).

The idea is much like a kernel bisect: narrow the search quickly.

>
> Thanks!
>
> Kind regards,
> Marcel de Boer