not sure how to troubleshoot SMB / CIFS overload when using GlusterFS

rushian85 at gmail.com (Ken Randall) · Sun, 17 Jul 2011 19:56:57 -0500

I'll try to keep this brief.  I've been testing GlusterFS for the last month
or so.  My production setup will be more complex than what I'm listing below,
but I've whittled things down to the point where the setup below still
reproduces the problem.

I'm running GlusterFS 3.2.2 on two CentOS 5.6 boxes in a replicated volume.
I am connecting to it with a Windows Server 2008 R2 box over an SMB share.
Basically, the web app portion runs locally on the Windows box, but content
(e.g. HTML templates, images, CSS files, JS, etc.) is being pulled from the
Gluster volume.

I've performed a fair degree of load testing on the setup so far, scaling up
the load to nearly four times what our normal production environment sees in
primetime, and it seems to handle it fine.  We run tens of thousands of
websites, so it is significant that the setup can handle that load.

However, a different suite of tests includes a Page of Death, which
contains tens of thousands of image references on a single page.  All I have
to do is load that page for a few seconds, and it will grind my web server's
SMB connection to a near complete standstill.  I can close the browser after
just a few seconds, and it still takes several minutes for the web server to
respond to any requests at all.  Connecting to the share over Explorer is
extremely slow from that same machine.  (I can connect to that same share,
which exports the exact same GlusterFS mount, from another machine, and it
is just fine.  Similarly, accessing the Gluster mount on the Linux boxes
shows zero problems at all; it's as happy to respond to requests as ever.)
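
A page of this shape can be approximated with a short script.  The following
is a minimal sketch in Python, not the actual test page: the function name,
the image URL pattern, and the default count are all assumptions made up for
illustration.

```python
from html import escape


def build_page_of_death(n_images=20000, base_url="/images/img"):
    """Return an HTML page that references n_images distinct image files.

    Hypothetical stand-in for the real test page: the URL pattern
    (base_url + index + ".png") is an assumption, not the real layout.
    """
    safe_base = escape(base_url, quote=True)
    tags = [
        '<img src="{}{}.png" alt="">'.format(safe_base, i)
        for i in range(n_images)
    ]
    return (
        "<!DOCTYPE html>\n<html><head><title>Page of Death</title></head>\n"
        "<body>\n" + "\n".join(tags) + "\n</body></html>\n"
    )
```

Writing the returned string to a file that the web app serves, with the
referenced images living on the SMB share, should give a comparable burst of
small file requests.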

Even if I scale it out to a swath of web servers, loading that single page,
one time, for just a few seconds will freeze every single web server, making
every website on the system inaccessible.

You may be asking, why am I asking here instead of on a Samba group, or even
a Windows group?  Here's why:  My control is that I have a Windows file
server that I can swap in Gluster's place, and I'm able to load that page
without it blinking an eye (it actually becomes a test of the computer that
the browser is on).  It does not affect any of the web servers in the
slightest.  My second control is that I have exported the raw Gluster data
directory as an SMB share (with the same exact Samba configuration as the
Gluster one), and it performs just as well as the Windows file server.  I
can load the Page of Death with no consequence.
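
To put numbers on the difference between the three back ends, something like
the following could be run from the Windows box against each share in turn.
It is only a sketch under stated assumptions: the function names are
hypothetical, and it measures sequential stat-plus-read latency from a single
client, which is a much gentler access pattern than a browser pulling
thousands of images through the web server at once.

```python
import os
import statistics
import time


def time_file_reads(root, limit=1000):
    """Read up to `limit` files under `root`; return per-file latencies (s)."""
    latencies = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if len(latencies) >= limit:
                return latencies
            path = os.path.join(dirpath, name)
            start = time.perf_counter()
            os.stat(path)                 # metadata lookup
            with open(path, "rb") as fh:  # open plus full read
                fh.read()
            latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return (count, median, max) for the latencies; zeros if empty."""
    if not latencies:
        return (0, 0.0, 0.0)
    return (len(latencies), statistics.median(latencies), max(latencies))
```

Pointing `root` at the Gluster-backed share, the raw-directory share, and the
Windows file server share, and comparing the summaries, would show whether
the slowdown is visible at the level of individual file operations.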

I've pushed IO-threads all the way to the maximum of 64 without any benefit.
I can't see anything noteworthy in the Gluster or Samba logs, but I may not
know what to look for.

Thank you to anybody who can point me in the right direction.  I am hoping I
don't have to dive into Wireshark or tcpdump territory, but I'm open if you
can guide the way!  ;)

Ken