not sure how to troubleshoot SMB / CIFS overload when using GlusterFS

rushian85 at gmail.com (Ken Randall) · Mon, 18 Jul 2011 11:00:15 -0500

Whit,

Genius!

This morning I set out to remove as many variables as possible and whittle
the repro case down to its minimum.  I've become pretty good at
debugging memory dumps on the Windows side over the years, and even
inspected the web processes.  Nothing looked out of the ordinary there, just
a bunch of threads waiting to get file attribute data from the Gluster
share.

So then, to follow your lead, I reduced the Page of Death down from
thousands of images to just five.  I tried accessing the page, and boom,
everything's frozen for minutes.  Interesting.  So I reduced it to one
image, accessed the page, and boom, everything's dead instantly.  That one
image is a file that doesn't exist.

So now, knowing that GlusterFS is kicking into overdrive fretting about a
file it can't find, I decided to eliminate the web server altogether.  I
opened up Windows Explorer, and typed in a directory that didn't exist, and
sure enough, I'm unable to navigate through the share in another Explorer
window until it finally responds again a minute later.  I think the Page of
Death was exhibiting such a massive death (i.e. only able to respond again
upwards of five minutes later) because it was systematically trying to
access several files that don't exist, and each one it can't find causes
the SMB connection to hang for close to a minute.

I feel like this is major progress toward pinpointing the problem
for a possible resolution.  Here are some additional details that may help:

The GlusterFS directory in question, /storage, has about 80,000 subdirs in
it.  As such, I'm using ext4 to overcome the subdir limitations of ext3.
The non-existent image file that is able to cause everything to freeze
has the path /storage/thisdirdoesntexist/images/blah.gif, where
"thisdirdoesntexist" would sit in /storage alongside those 80,000 real
subdirs.  I know it's a pretty laborious thing for Gluster to piece
together a directory listing, and combined with Joseph's recognition of the
flood of "getdents", does it seem reasonable that Gluster or Samba is
freezing because it's for some reason generating a subdir listing of
/storage whenever it can't find one of its subdirs?

As another test, if I access a file inside a non-existent subdir of a dir
that only has five subdirs, nothing freezes.
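
A minimal sketch for timing this comparison from a script instead of judging
it by feel in Explorer.  It assumes Python is available on the client (or on
the server, run against the FUSE mount); the function name is made up for
illustration and nothing here is specific to Gluster or Samba.

```python
import os
import time

def time_missing_lookup(path):
    """Time a single stat() on path; return (elapsed_seconds, exists)."""
    start = time.perf_counter()
    try:
        os.stat(path)
        exists = True
    except OSError:
        # Covers "no such file or directory" as well as other lookup errors.
        exists = False
    return time.perf_counter() - start, exists
```

Calling it once on a missing path under the 80,000-subdir directory and once
under the five-subdir directory, first through the SMB share and then
directly on the Gluster mount, might help show whether the delay already
appears below Samba or only through it.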

So the freezing seems to be a function of the number of subdirectories that
are siblings of the first part of the path that doesn't exist, if that makes
sense.  So in /this/is/a/long/path, if "is" doesn't exist, then Samba will
generate a list of subdirs under "/this".  And if "/this" has 100,000
immediate subdirs under it, then you're about to experience a world of hurt.
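
The hypothesis above can be sketched as follows.  This only illustrates the
suspected behavior and is not Samba's or Gluster's actual code: it walks a
path to the first component that doesn't exist and counts the immediate
subdirs of its parent, which is the listing that would have to be generated.
Both function names are hypothetical, and POSIX-style paths are assumed.

```python
import os

def first_missing_component(path):
    """Return (parent, missing_name) for the first path component that
    does not exist, or None if the whole path exists."""
    path = os.path.abspath(path)
    parts = [p for p in path.split(os.sep) if p]
    current = os.sep  # assumption: POSIX-style absolute path
    for name in parts:
        candidate = os.path.join(current, name)
        if not os.path.exists(candidate):
            return current, name
        current = candidate
    return None

def sibling_dir_count(parent):
    """Count immediate subdirectories of parent (one full directory scan)."""
    with os.scandir(parent) as entries:
        return sum(1 for e in entries if e.is_dir(follow_symlinks=False))
```

For /storage/thisdirdoesntexist/images/blah.gif this would report "/storage"
as the parent, and sibling_dir_count("/storage") gives a rough measure of how
expensive that one failed lookup might be.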

I read somewhere that FUSE's implementation of readdir() is a blocking
operation.  If true, the above explanation, plus FUSE's readdir(), are to
blame.

And I am therefore up a creek.  It is not feasible to restrict the system to
only a few subdirs at any given level to prevent the lockup.  Unless
somebody, after reading this novel, has some ideas for me to try.  =)  Any
magical ways to get FUSE not to block, or any trickery on Samba's side?

Ken

On Sun, Jul 17, 2011 at 10:29 PM, Whit Blauvelt
<whit.gluster at transpect.com> wrote:

> On Sun, Jul 17, 2011 at 10:19:00PM -0500, Ken Randall wrote:
>
> > (The no such file or directory part is expected since some of the image
> > references don't exist.)
>
> Wild guess on that: Gluster may work harder at files it doesn't find than
> files it finds. It's going to look on one side or the other of the
> replicated file at first, and if it finds the file deliver it. But if it
> doesn't find the file, wouldn't it then check the other side of the
> replicated storage to make sure this wasn't a replication error?
>
> Might be interesting to run a version of the test where all the images
> referenced do exist, to see if it's the missing files that are driving up
> the CPU cycles.
>
> Whit
>
>