Hi,

On Wednesday 08 March 2006 19:54, Stanley, Jon wrote:
> I have a 7 node GFS cluster, plus 3 lock servers (RH AS3U5, GULM
> locking) that do not mount the filesystem. I have a problem whereby the
> load average on the system is extremely high (occasionally
> astronomical), eventually leading to a complete site outage via
> inability to access the shared filesystem. I have a couple of questions
> about the innards of GFS that I would be most grateful for someone to
> answer:
>
> The application is written in PHP, and the PHP sessioning is handled via
> the GFS filesystem as well, if that's important.
>
> 1) I notice that I have a lot of processes in uninterruptible sleep.
> When I attached strace to one of these processes, I found it doing
> nothing for a period of ~30-60 seconds. An excerpt of the strace
> (using -r) follows:
>
>   0.001224 stat64("/media/files/global/2/6/26c4f61c69117d55b352ce328babbff4.jpg",
>            {st_mode=S_IFREG|0644, st_size=9072, ...}) = 0
>   0.000251 open("/media/files/global/2/6/26c4f61c69117d55b352ce328babbff4.jpg",
>            O_RDONLY) = 5
>   0.000108 mmap2(NULL, 9072, PROT_READ, MAP_PRIVATE, 5, 0) = 0xaf381000
>   0.000069 writev(4, [{"HTTP/1.1 200 OK\r\nDate: Wed, 08 M"..., 318},
>            {"\377\330\377\340\0\20JFIF\0\1\2\0\0d\0d\0\0\377\354\0\21"..., 9072}],
>            2) = 9390
>   0.000630 close(5) = 0
>   0.000049 munmap(0xaf381000, 9072) = 0
>   0.000052 rt_sigaction(SIGUSR1, {0x81ef474, [],
>            SA_RESTORER|SA_INTERRUPT, 0x1b2eb8}, {SIG_IGN}, 8) = 0
>   0.000068 read(4, 0xa239b3c, 4096) = ? ERESTARTSYS (To be restarted)
>   6.546891 --- SIGALRM (Alarm clock) @ 0 (0) ---
>   0.000119 close(4) = 0
>
> What it looks like is that it hangs in read() for a period of time, thus
> leading to the uninterruptible sleep. This particular example was 6
> seconds, but the time seems to be variable. The particular file in this
> instance is not large, only 9k.

Although the strace does not show exactly the output I know of, the
problem description sounds like deja vu. We had loads of problems with
keeping sessions on GFS: httpds ended up in "D" state for some time (at
high-load times we had ServerLimit httpds in "D" per node, which ended
up in the service not being available).

As I posted already, we think it is because of the "bad" locking of
sessions by PHP (our PHP sessions are on GFS as well, and strace showed
the same timeouts on the session files). When you issue session_start()
(or whatever that function is called), the session file is locked via an
flock() syscall. That lock is held until you end the session, which is
done implicitly when the TCP connection to the client is closed. Now
another httpd process (on whatever node) comes along, calls
session_start(), and tries an flock() on that session file while the
first process still holds the lock. That process can end up in the
timeouts you are seeing (30-60 secs), which, as far as I remember,
relate to the TCP connection timeout defined in httpd.conf or some
timeout in php.ini - there is an explanation for this, but I cannot
remember it ;-).

In any case, in our scenario the problem was the "bad" session handling
by PHP. We made a patch for phplib where you can disable the locking, or
do the locking only implicitly while session data is read or written and
therefore keep consistency. That made Apache work as expected, and we
have not seen any "D" processes for a year now. Oh yes, the patch can be
found at www.opensharedroot.org in the download section.

Besides: you will never encounter this on a local filesystem, nor on
NFS, as NFS does not support flock() and silently ignores it.
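For illustration, here is a minimal sketch of the usual workaround in
plain PHP (this is not our phplib patch, and the $_SESSION keys are
made-up examples): read what the request needs from the session, then
call session_write_close() so the lock on the session file is dropped
before the long-running part of the request.

  <?php
  // Minimal sketch, assuming PHP's standard file-based session handler:
  // fetch what is needed, then release the session lock right away.
  session_start();                      // locks the session file
  $user = isset($_SESSION['user']) ? $_SESSION['user'] : null;  // example key
  session_write_close();                // writes the data and drops the lock

  // ... long-running work (e.g. serving a file) runs without holding
  // the session lock, so session_start() on other nodes is not blocked ...

  // If the session must be updated afterwards, reopen it briefly
  // (before any output is sent, as session_start() may resend the cookie):
  session_start();
  $_SESSION['last_seen'] = time();      // example key
  session_write_close();
  ?>

That keeps the lock held only for the few milliseconds of the actual
read/write instead of the whole request, which already avoids most of
the contention described above.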
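And purely as an illustration of the idea behind our patch (not the
patch itself - that lives in phplib, not in PHP's native session layer):
with session_set_save_handler() you can swap in a handler that does no
locking at all. The path and function names below are assumptions, and
something like this is only safe if last-writer-wins session semantics
are acceptable for your application.

  <?php
  // Hypothetical no-locking session handler - a sketch only, NOT the
  // phplib patch. PHP's built-in files handler flock()s the session
  // file for the whole request; this one never locks it.
  $sess_path = '/var/lib/php/session';   // assumption: session dir on GFS

  function sess_open($path, $name) { return true; }
  function sess_close()            { return true; }

  function sess_read($id) {
      global $sess_path;
      $file = "$sess_path/sess_$id";
      return file_exists($file) ? file_get_contents($file) : '';
  }

  function sess_write($id, $data) {
      global $sess_path;
      $fp = fopen("$sess_path/sess_$id", 'w');   // note: no flock() here
      if (!$fp) return false;
      fwrite($fp, $data);
      fclose($fp);
      return true;
  }

  function sess_destroy($id) {
      global $sess_path;
      @unlink("$sess_path/sess_$id");
      return true;
  }

  function sess_gc($maxlifetime) {
      global $sess_path;
      foreach (glob("$sess_path/sess_*") as $f)
          if (filemtime($f) + $maxlifetime < time()) @unlink($f);
      return true;
  }

  session_set_save_handler('sess_open', 'sess_close', 'sess_read',
                           'sess_write', 'sess_destroy', 'sess_gc');
  session_start();
  ?>

Whether you drop the locking completely or, like our patch, only lock
around the actual read and write, the point is the same: the flock()
must not be held for the lifetime of the whole request.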
Hope that helps, and let us know about problems.

Regards, Marc.

> I've never seen ERESTARTSYS before, and some googling tells me that it's
> basically telling the kernel to interrupt the current syscall in order
> to handle a signal (SIGALRM in this case, whose function I'm not sure
> of). I could be *way* off base here - I'm not a programmer by any
> stretch of the imagination.
>
> 2) The locking statistics seem to be a huge mystery. The lock total
> doesn't seem to correspond to the number of open files that I have (I
> hope!). Here's the output of 'cat /proc/gulm/lockspace' - I can't
> imagine that I have 300,000+ files open on this system at this point.
> When are the locks released, or is this even an indication of how many
> locks are active at the current time? What does the 'pending' number
> mean?
>
> [svadmin@s259830hz1sl01 gulm]$ cat lockspace
>
> lock counts:
>   total:   369822
>   unl:     176518
>   exl:     1555
>   shd:     191501
>   dfr:     0
>   pending: 5
>   lvbs:    2000
>   lops:    21467433
>
> [svadmin@s259830hz1sl01 gulm]$
>
> Thanks for any help that anyone can provide on this!
>
> -Jon

--
Gruss / Regards,

Marc Grimme
Phone: +49-89 121 409-54
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster