2006/3/7, Marc Grimme <grimme@xxxxxxx>:
> Sebastien,
> On Tuesday 07 March 2006 12:35, Sébastien DIDIER wrote:
> > 2006/3/7, Marc Grimme <grimme@xxxxxxx>:
> > > Hi,
> > > to debug you could use strace. E.g. executing strace -p 14970 will
> > > probably show you that the process is waiting for a lock, as the ps
> > > output already does. My first guess would be that you use apache
> > > with php and sessions.
> >
> > Thanks. But strace doesn't output anything and became Ctrl-C immune.
> > It needs a SIGKILL to exit and the traced process stays in T state.
> > It seems that it doesn't manage to get the last system call, where
> > the process is in D state.
> Hmm, sounds like I've heard that already. If you trace the root httpd
> with -f and -t and look out for long gaps between timestamps, you'll
> probably find processes waiting for locks. The D state is a good
> indicator (ps ax | grep " D " and look at the pids). Do the pids of the
> D processes change from time to time or do they stay the same?

Marc,

All the blocked processes have had the same pids since the beginning of
this issue (22 hours by now).

> > > If so, the phplib uses flocks for locking the session-ids. Normally
> > > one process locks a session. If another process comes along to get
> > > an flock on that session, it has to wait until the earlier flock is
> > > released. It very often happens that the other process gets that
> > > flock when the client and session are not available any more. Then
> > > the flock is held until the apache process times out.
> >
> > I don't think it is session related because I store session files
> > outside the GFS mount point (in /tmp) and I run a load balancer based
> > on the source address (to always send requests to the same server
> > and thus keep sessions).
> Yes, I agree. Sessions get lost if the node fails, right?

Yes. That may be a problem for some apps... But it is easier (and more
efficient) than storing session data in SQL.
> >
> > But we are using mysql query caching (with some libraries like
> > AdoDb) inside the GFS mount point. Do you think it could be the
> > cache files which are dead-locked?
> It depends on how those files are locked and how and when the locks
> are set and released. If a lock is set at apache-child fork time and
> released at process termination, then yes, that could happen. If only
> accesses to the data of those files are protected with flocks, then it
> should perform quite well.
>
> Is that query caching part of perl-adodb or is it implemented by
> yourselves?

It appears that we are using a very common PHP AdoDB abstract class
without any change in the code. When I run "lsof -p" on each blocked
process on the two nodes, each one has exactly the same file open:

apache 23327 www-data 10r REG 253,0 2128 5053927 /home/sites/website/web/queryCache/ca/adodb_cad1702c2e5d18a71d765e95bf55ea3b.cache (deleted)

>
> Have a look and play with strace and watch out for long durations and
> the syscalls concerned. I would expect you to end up with
> flock timeouts.
>
> Hope that helps,
> regards Marc.
>
> > > We have made a patch for better locking with php which you can
> > > find on http://www.open-sharedroot.org in the downloads section.
> > > Hope that helps
> > > Regards Marc.
> > >
> > > On Tuesday 07 March 2006 11:50, Sébastien DIDIER wrote:
> > > > Hi,
> > > >
> > > > I'm running a two-node GFS cluster which hosts web sites. The GFS
> > > > partition is on an iSCSI device and, for now, I'm using manual
> > > > fencing.
> > > >
> > > > Today, I got 5 httpd processes on both nodes which got stuck in
> > > > IO blocking state. I suspected a GFS filesystem corruption but I
> > > > haven't got any output from the kernel. I ran a fsck two days ago
> > > > after a power failure.
> > > >
> > > > Here's the wait state of the processes (same on the other node):
> > > >
> > > > # ps -o pid,tt,user,fname,wchan -C apache
> > > >   PID TT       USER     COMMAND  WCHAN
> > > >  4426 ?        root     apache   -
> > > > 14970 ?        www-data apache   glock_wait_internal
> > > > 15103 ?        www-data apache   glock_wait_internal
> > > > 16780 ?        www-data apache   glock_wait_internal
> > > > 16959 ?        www-data apache   glock_wait_internal
> > > > 14936 ?        www-data apache   finish_stop
> > > > 12859 ?        www-data apache   -
> > > > 13005 ?        www-data apache   -
> > > > 13311 ?        www-data apache   semtimedop
> > > > 13390 ?        www-data apache   semtimedop
> > > >
> > > > How can I debug this problem further? And how can I recover my
> > > > httpd processes without a reboot?
> > > >
> > > > Many thanks for your help.
> > > >
> > > > Regards,
> > > > Sébastien DIDIER
> > > >
> > > > --
> > > >
> > > > Linux-cluster@xxxxxxxxxx
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > >
> > > --
> > > Gruss / Regards,
> > >
> > > Marc Grimme
> > > Phone: +49-89 121 409-54
> > > http://www.atix.de/ http://www.open-sharedroot.org/
> > >
> > > **
> > > ATIX - Ges. fuer Informationstechnologie und Consulting mbH
> > > Einsteinstr. 10 - 85716 Unterschleissheim - Germany
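
For illustration only, here is a rough Python sketch of the locking pattern Marc describes above: take the flock only around the actual data access, and give up after a timeout instead of waiting forever. It is not how AdoDB or the phplib patch does it (the thread does not say), and the names flock_with_timeout, write_cache and read_cache are made up for this example. It shows the pattern only and is not a fix for processes that are already blocked in the kernel.

```python
import contextlib
import errno
import fcntl
import time


@contextlib.contextmanager
def flock_with_timeout(fh, mode, timeout=5.0, poll=0.05):
    """Hold an flock on fh only for the duration of the with-block.

    Hypothetical helper: polls with LOCK_NB so the caller never sleeps
    inside flock() and can give up after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.flock(fh, mode | fcntl.LOCK_NB)
            break
        except OSError as exc:
            # EAGAIN/EACCES mean "someone else holds a conflicting lock"
            if exc.errno not in (errno.EAGAIN, errno.EACCES):
                raise
            if time.monotonic() >= deadline:
                raise TimeoutError("no flock within %.1fs" % timeout)
            time.sleep(poll)
    try:
        yield fh
    finally:
        # Release right after the data access, not at process exit
        fcntl.flock(fh, fcntl.LOCK_UN)


def write_cache(path, data, timeout=5.0):
    # Append mode creates the file without truncating it before we lock
    with open(path, "ab") as fh:
        with flock_with_timeout(fh, fcntl.LOCK_EX, timeout):
            fh.truncate(0)
            fh.write(data)
            fh.flush()  # push the data out while the lock is still held


def read_cache(path, timeout=5.0):
    with open(path, "rb") as fh:
        with flock_with_timeout(fh, fcntl.LOCK_SH, timeout):
            return fh.read()
```

With this shape a reader that cannot get the lock raises TimeoutError after the given time, and the caller can fall back to running the query uncached rather than sitting on the lock for the lifetime of the apache child.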