Every time we start our rendering applications on our gluster volumes,
the load starts climbing. At first we thought it was our application,
but the application itself appears to be locked up (more precisely,
blocked waiting on something). Top shows no active processes (i.e. the
load should be close to 0). After killing the application, the load
continues to climb until we terminate and restart the glusterfs process.
Glusterfs itself is not busy at all: an strace shows it just sitting in
epoll_wait. Since no process is using any CPU, the problem seems to be
in the kernel.
load average: 14.99, 14.93, 14.20
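Linux counts tasks in uninterruptible sleep (state D) toward the load
average, which could explain a high load with idle CPUs. A minimal
sketch of how the blocked tasks might be listed from /proc, assuming
Python is available on the instance. The helper names parse_stat_line
and blocked_tasks are made up for illustration, and only processes are
scanned, not the individual threads under /proc/<pid>/task:

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was not parseable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If the rendering processes show up here while glusterfs does not, that
would be consistent with them being stuck inside a kernel call rather
than in user space.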
Before we had this problem, we were getting consistent kernel panics.
Applying the fix from
http://www.nabble.com/-fuse-devel--Kernel-oops-in-fuse_send_readpages()-t1374092.html
resolved those. We're stuck with the 2.6.16 kernel on Amazon's EC2.
Fuse is version 2.6.3. We've disabled all performance optimizations out
of desperation to get something working.
Is there anything I can look for to track this down?
Thanks,
Erik Osterman
# Server config
volume brick0
  type storage/posix
  option directory /mnt/glusterfs/brick0
end-volume

volume server
  type protocol/server
  subvolumes brick0
  option transport-type tcp/server
  option bind-address 0.0.0.0
  option listen-port 6996
  option client-volume-filename /etc/glusterfs/client.vol
  option auth.ip.brick0.allow *
end-volume
# Client config
volume ip0
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.253.59.65
  option remote-port 6996
  option remote-subvolume brick0
end-volume

volume ip1
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.253.58.240
  option remote-port 6996
  option remote-subvolume brick0
end-volume

volume ip2
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.253.58.239
  option remote-port 6996
  option remote-subvolume brick0
end-volume

volume afr
  type cluster/afr
  subvolumes ip0 ip1 ip2
  option replicate *:2
end-volume

volume ip
  type cluster/unify
  subvolumes afr
  option scheduler rr
  option rr.limits.min-free-disk 2GB
end-volume