Hi,

On Wednesday 08 March 2006 19:54, Stanley, Jon wrote:
> I have a 7 node GFS cluster, plus 3 lock servers (RH AS3U5, GULM
> locking) that do not mount the filesystem. I have a problem whereby the
> load average on the system is extremely high (occasionally
> astronomical), eventually leading to a complete site outage via
> inability to access the shared filesystem. I have a couple of questions
> about the innards of GFS that I would be most grateful for someone to
> answer:
>
> The application is written in PHP, and the PHP sessioning is handled via
> the GFS filesystem as well, if that's important.
>
> 1) I notice that I have a lot of processes in uninterruptible sleep.
> When I attached strace to one of these processes, I found it doing
> nothing for a period of ~30-60 seconds. An excerpt of the strace
> (using -r) follows:
>
>   0.001224 stat64("/media/files/global/2/6/26c4f61c69117d55b352ce328babbff4.jpg",
>            {st_mode=S_IFREG|0644, st_size=9072, ...}) = 0
>   0.000251 open("/media/files/global/2/6/26c4f61c69117d55b352ce328babbff4.jpg",
>            O_RDONLY) = 5
>   0.000108 mmap2(NULL, 9072, PROT_READ, MAP_PRIVATE, 5, 0) = 0xaf381000
>   0.000069 writev(4, [{"HTTP/1.1 200 OK\r\nDate: Wed, 08 M"..., 318},
>            {"\377\330\377\340\0\20JFIF\0\1\2\0\0d\0d\0\0\377\354\0\21"..., 9072}],
>            2) = 9390
>   0.000630 close(5) = 0
>   0.000049 munmap(0xaf381000, 9072) = 0
>   0.000052 rt_sigaction(SIGUSR1, {0x81ef474, [],
>            SA_RESTORER|SA_INTERRUPT, 0x1b2eb8}, {SIG_IGN}, 8) = 0
>   0.000068 read(4, 0xa239b3c, 4096) = ? ERESTARTSYS (To be restarted)
>   6.546891 --- SIGALRM (Alarm clock) @ 0 (0) ---
>   0.000119 close(4) = 0
>
> What it looks like is that it hangs in read() for a period of time, thus
> leading to the uninterruptible sleep. This particular example was 6
> seconds, but the time seems to be variable. The particular file in this
> instance is not large, only 9k.

Although the strace does not show exactly the output I know of, the
problem description sounds like deja vu. We had loads of problems with
keeping sessions on GFS: httpds ended up in "D" state for some time (at
high-load times we had ServerLimit httpds in "D" per node, which ended
up in the service not being available).

As I posted already, we think it is because of the "bad" locking of
sessions by PHP (our PHP sessions are on GFS as well, and strace showed
the same timeouts on the session files). When you issue session_start()
(or whatever that function is called), the session file is locked via an
flock() syscall. That lock is held until you end the session, which is
done implicitly when the TCP connection to the client is closed. Now
another httpd process (on whatever node) comes along, calls
session_start(), and tries an flock() on that session file while the
first process still holds the lock. That process can end up in the
timeouts you are seeing (30-60 secs), which, as far as I remember,
relate to the TCP connection timeout defined in httpd.conf or some
timeout in php.ini - there is an explanation for this, but I cannot
remember it ;-).

In any case, in our scenario the problem was the "bad" session handling
by PHP. We made a patch for phplib where you can disable the locking, or
do the locking only implicitly while session data is read or written and
therefore keep consistency. That made Apache work as expected, and we
have not seen any "D" processes for a year now. Oh yes, the patch can be
found at www.opensharedroot.org in the download section.

Besides: you will never encounter this on a local filesystem, nor on
NFS, as NFS does not support flock() and silently ignores it.
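For illustration, here is a minimal sketch of the usual workaround in
plain PHP (this is not our phplib patch, and the $_SESSION keys are
made-up examples): read what the request needs from the session, then
call session_write_close() so the lock on the session file is dropped
before the long-running part of the request.

  <?php
  // Minimal sketch, assuming PHP's standard file-based session handler:
  // fetch what is needed, then release the session lock right away.
  session_start();                      // locks the session file
  $user = isset($_SESSION['user']) ? $_SESSION['user'] : null;  // example key
  session_write_close();                // writes the data and drops the lock

  // ... long-running work (e.g. serving a file) runs without holding
  // the session lock, so session_start() on other nodes is not blocked ...

  // If the session must be updated afterwards, reopen it briefly
  // (before any output is sent, as session_start() may resend the cookie):
  session_start();
  $_SESSION['last_seen'] = time();      // example key
  session_write_close();
  ?>

That keeps the lock held only for the few milliseconds of the actual
read/write instead of the whole request, which already avoids most of
the contention described above.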
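And purely as an illustration of the idea behind our patch (not the
patch itself - that lives in phplib, not in PHP's native session layer):
with session_set_save_handler() you can swap in a handler that does no
locking at all. The path and function names below are assumptions, and
something like this is only safe if last-writer-wins session semantics
are acceptable for your application.

  <?php
  // Hypothetical no-locking session handler - a sketch only, NOT the
  // phplib patch. PHP's built-in files handler flock()s the session
  // file for the whole request; this one never locks it.
  $sess_path = '/var/lib/php/session';   // assumption: session dir on GFS

  function sess_open($path, $name) { return true; }
  function sess_close()            { return true; }

  function sess_read($id) {
      global $sess_path;
      $file = "$sess_path/sess_$id";
      return file_exists($file) ? file_get_contents($file) : '';
  }

  function sess_write($id, $data) {
      global $sess_path;
      $fp = fopen("$sess_path/sess_$id", 'w');   // note: no flock() here
      if (!$fp) return false;
      fwrite($fp, $data);
      fclose($fp);
      return true;
  }

  function sess_destroy($id) {
      global $sess_path;
      @unlink("$sess_path/sess_$id");
      return true;
  }

  function sess_gc($maxlifetime) {
      global $sess_path;
      foreach (glob("$sess_path/sess_*") as $f)
          if (filemtime($f) + $maxlifetime < time()) @unlink($f);
      return true;
  }

  session_set_save_handler('sess_open', 'sess_close', 'sess_read',
                           'sess_write', 'sess_destroy', 'sess_gc');
  session_start();
  ?>

Whether you drop the locking completely or, like our patch, only lock
around the actual read and write, the point is the same: the flock()
must not be held for the lifetime of the whole request.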
Hope that helps, and let us know about problems.

Regards, Marc.

> I've never seen ERESTARTSYS before, and some googling tells me that it's
> basically telling the kernel to interrupt the current syscall in order
> to handle a signal (SIGALRM in this case, whose function I'm not sure
> of). I could be *way* off base here - I'm not a programmer by any
> stretch of the imagination.
>
> 2) The locking statistics seem to be a huge mystery. The lock total
> doesn't seem to correspond to the number of open files that I have (I
> hope!). Here's the output of 'cat /proc/gulm/lockspace' - I can't
> imagine that I have 300,000+ files open on this system at this point.
> When are the locks released, or is this even an indication of how many
> locks are active at the current time? What does the 'pending' number
> mean?
>
> [svadmin@s259830hz1sl01 gulm]$ cat lockspace
>
> lock counts:
>   total:   369822
>   unl:     176518
>   exl:     1555
>   shd:     191501
>   dfr:     0
>   pending: 5
>   lvbs:    2000
>   lops:    21467433
>
> [svadmin@s259830hz1sl01 gulm]$
>
> Thanks for any help that anyone can provide on this!
>
> -Jon

--
Gruss / Regards,

Marc Grimme
Phone: +49-89 121 409-54
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster