Wendy,

Unfortunately, our customer has (for the time being) moved their PHP
sessions off the GFS filesystem because of the instability. Our GFS
performance has returned to normal, but our customer expects us to fix
GFS so that they can keep the PHP sessions on GFS.

I'm *attempting* to reproduce the behavior on a lab GFS cluster. Assuming
I can do this, I will send straces of the issue as it occurs.

Is Red Hat aware of any issues with GFS and flock syscalls? Regarding the
U7 kernel suggestion you made previously: is it going to help with the
flock issue, or is it strictly for keeping the number of cached locks
down?

Britt

-----Original Message-----
From: Wendy Cheng [mailto:wcheng@xxxxxxxxxx]
Sent: Thursday, March 09, 2006 2:33 PM
To: linux clustering
Cc: Stanley, Jon; Treece, Britt
Subject: Re: GFS load average and locking

Marc Grimme wrote:

>Although the strace does not show the output I know of, the problem
>description sounds like deja vu.
>We had loads of problems with having sessions on GFS and httpds ending up
>in "D" state for some time (at high load times we had ServerLimit httpds in
>D per node, which ended up in the service not being available).
>As I posted already, we think it is because of the "bad" locking of sessions
>with php (as php sessions are on gfs and strace showed those timeouts with
>the session files). When you issue a "session_start", or whatever that
>function is called, the session file is locked via an flock syscall. That
>lock is held until you end the session, which is implicitly done when the tcp
>connection to the client is ended. Now comes another http process (on
>whatever node) and calls a "session start" and tries an flock on that session
>while another process already holds that lock.
>The process might end up in
>the seen timeouts (30-60 secs), which (as far as I remember) relate to the
>timeout of the tcp connection defined in httpd.conf or some timeout in
>php.ini (there is an explanation for this but I cannot remember it ;-) ).
>Nevertheless, in our scenario the problem was the "bad" session handling by
>php. We have made a patch for the phplib where you can disable the locking,
>or just do the locking implicitly, and therefore keep consistency, while
>session data is read or written. We could make apache work as expected, and
>we haven't seen any "D" processes for a year now.
>Oh yes, the patch can be found at
>www.opensharedroot.org in the download section.
>
>Besides: you will never encounter this on a local filesystem or nfs, as nfs
>does not support flocks and silently ignores them.

Hi,

This does look like the problem description sent out by the savvis.net
folks during our off-list email exchanges. However, without actually
looking at the thread traces (when they are in D state), it is difficult
to be sure. One way to obtain the exact thread trace is to use the "crash"
tool to do a back trace (e.g. "bt <pid>"; you need the kernel debuginfo
RPM though).

Britt, do let us know whether this php patch helps, and/or use the crash
command to obtain the thread trace output.

On the other hand, I don't understand how a local (non-cluster)
filesystem can be immune from this problem?

-- Wendy

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
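
For anyone trying to reproduce the contention Marc describes, here is a
minimal sketch in Python, not PHP's actual session handler. It assumes a
Linux host with Python's standard fcntl module; the file name sess_example
and the helper names are hypothetical. Two separate opens of the same
"session" file stand in for two httpd processes. The second locker uses
LOCK_NB so it fails at once, where a plain blocking flock would wait until
the first holder released the lock.

```python
import fcntl
import os
import tempfile


def try_session_lock(path):
    """Open path and try a non-blocking exclusive flock.

    Returns the open file descriptor on success, or None if another
    open file description already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Lock is held elsewhere; without LOCK_NB this call would block.
        os.close(fd)
        return None
    return fd


def release_session_lock(fd):
    """Drop the flock and close the descriptor (the "end of session")."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


def demo():
    """Return (first_got_lock, second_was_refused, third_got_lock)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sess_example")  # hypothetical session file
        first = try_session_lock(path)    # "request 1" starts its session
        second = try_session_lock(path)   # "request 2" hits the held lock
        release_session_lock(first)       # request 1 ends its session
        third = try_session_lock(path)    # a later request can now lock
        release_session_lock(third)
        return first is not None, second is None, third is not None


if __name__ == "__main__":
    print(demo())
```

Run against a file on the lab GFS mount instead of a temporary directory
(and from two nodes rather than one process), the same pattern may help
narrow down whether the stalls follow the flock calls themselves.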