Re: [Linux-cluster] samba on top of GFS

Thanks for your help, guys. Sorry it has taken me a while to get back to you. I have been trying to come up with intelligent things to add, but I have been stymied by what appears to be an ever-changing target. Details below.

Here's what I think I know:

1. Samba does indeed crash when running with "oplocks = no", even on a single node of a two-node GFS cluster where the other node is not doing much of anything. So, Christopher, the answer to your question is that it seems to be _purely_ Samba and GFS interacting that sets up the crash.
2. Samba sometimes crashes in such a way that a "kill -9" will not eliminate the crashed processes (see the sketch after this list). In this situation, attempting to start Samba again creates a [predictably] unstable situation in which kernel oopses are evident.
3. Often Samba simply starts to fail without outright crashing, which renders both heartbeat monitoring and fencing rather useless. On the client end this shows up as the ability to access a share and browse, but severely degraded performance when accessing individual files and frequent crashes of Windows Explorer. On the server end, more and more smbd processes start up but the old ones don't die...
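For what it's worth, here is how I have been spotting the unkillable processes. An smbd stuck in uninterruptible sleep ("D" state, usually blocked inside the kernel/GFS) ignores every signal, including -9, so a quick diagnostic along these lines shows which ones are actually stuck:

    # list smbd processes with their state; "D" means uninterruptible
    # sleep inside the kernel, which no signal (not even kill -9) can clear
    ps -eo pid,stat,wchan:20,cmd | grep '[s]mbd'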


Here are some observations. <disclaimer> Some of these could be completely bogus because I am extrapolating from insufficient facts. </disclaimer>

1. Crashes are not necessarily caused by having multiple accesses to the same file. In one situation, having 7 computers read/write to the same directory (but never the same file) appears to have caused a complete crash of a server running Samba on top of GFS.
2. Crashes could be load-related. I seem to be far more likely to see a crash with 50 concurrent users than with 5. Since having many users increases the chances of a situation like #1 above, I can see a possible correlation, but this would not explain all crashes.
3. Larger files and fuller directories experience severe performance degradation in the Samba/GFS scenario. Simply right-clicking on a networked file that is a few hundred megabytes can take minutes to pop up a menu (assuming it doesn't crash Windows Explorer first). See the smbstatus note after this list.
4. Crashes occur on quota-enforced and non-quota-enforced systems with no discernible difference, although access to files on quota-enforced systems might be slower.
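When the slowdown in #3 hits, I have been trying to catch it in the act with smbstatus, which at least shows how many smbd processes are still holding connections and locks for clients that are long gone:

    # -L shows current byte-range locks; -p just lists the smbd processes
    smbstatus -L
    smbstatus -p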


I experimented with a lot of different settings in smb.conf (a sketch of the relevant section is below), and while the crashes were sometimes different, I could not come up with a good mapping of configs to crash causes. I have been very frustrated because every time I think I've discovered something, a new crash happens that appears to prove me wrong.

I need to create a separate test environment where I can set up idealized crash conditions in order to give this list more credible data, because my current environment has a lot of simultaneous access from multiple users whose actions aren't easily monitorable or consistent. For instance, when Bob can't access a server drive from Windows computer one, he simply logs into computer two and then three hoping to get a different result. The server crashes, but did it crash because Bob didn't wait long enough on the first access and escalated a fixable problem into a crash? Hard to tell...

I'm hoping someone out there has seen something similar or can shed some light.
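For reference, this is roughly the share setup I have been varying. The share name and path below are placeholders, and the oplock lines are the ones I keep toggling:

    [global]
        log level = 3

    [shared]                      # placeholder share name
        path = /mnt/gfs/shared    # placeholder GFS mount point
        read only = no
        oplocks = no              # also tried yes
        level2 oplocks = no
        kernel oplocks = no       # also tried yes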


Anyway, my system details:
I'm still running Samba 3.0.7 on top of kernel 2.6.8-1.521.

I tried updating to the newest CVS releases but ran into compile errors and haven't had time to try again, so the GFS build is still from mid-September. I'd like to try the fixes that Patrick and David posted, but first I am going to try to compile cleanly against a 2.6.9 kernel (roughly the steps sketched below).
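For the record, the sequence I have in mind is roughly the following; the configure flag for pointing the cluster tree at a kernel source directory is from memory and may not be exact, so treat this as a sketch rather than a recipe:

    # build and install a vanilla 2.6.9 kernel first
    cd /usr/src/linux-2.6.9
    make oldconfig && make && make modules_install install

    # then rebuild the cluster/GFS tree from CVS against it
    cd /usr/src/cluster
    cvs update
    ./configure --kernel_src=/usr/src/linux-2.6.9   # flag name from memory
    make && make install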

Error details:
I tried turning the log level up to 3. I get the following fairly often:
Nov 15 17:10:32 clu2 smbd[5624]:   Error writing 5 bytes to client. -1. (Connection reset by peer)
Nov 15 17:10:32 clu2 smbd[5624]: [2004/11/15 17:10:32, 0] lib/util_sock.c:send_smb(647)
Nov 15 17:10:32 clu2 smbd[5624]:   write_socket: Error writing 5 bytes to socket 24: ERRNO = Connection reset by peer
Nov 15 17:10:32 clu2 smbd[5624]: [2004/11/15 17:10:32, 0] lib/util_sock.c:write_socket(455)
Nov 15 17:10:32 clu2 smbd[5624]:   write_socket_data: write failure. Error = Connection reset by peer
Nov 15 17:10:32 clu2 smbd[5624]: [2004/11/15 17:10:32, 0] lib/util_sock.c:write_socket_data(430)

Otherwise I just see runaway processes that won't die, or I get a fence event with no apparent log entry leading up to it.

-alan

On Mon, 1 Nov 2004, Christopher R. Hertel wrote:

> On Mon, Nov 01, 2004 at 12:30:47PM -0800, Alan Wood wrote:
>> I am running a cluster with GFS-formatted file systems mounted on multiple
>> nodes.  What I was hoping to do was to set up one node running httpd to be
>> my webserver and another node running samba to share the same data
>> internally.
>> What I am getting when running that is instability.
>
> Yeah. This is a known problem. The reason is that Samba must maintain a
> great deal of metadata internally. This works well enough with multiple
> Samba processes running on a single machine dealing (more or less)
> directly with the filesystem.
>
> The problem is that Samba must keep track of translations between Posix
> and Windows metadata, locking semantics, file sharing mode semantics, etc.
>
> I had assumed that this would only be a problem if Samba was running on
> multiple machines all GFS-sharing the same back-end block storage.  Your
> report suggests that there's more to the interaction between Samba and GFS
> than I had anticipated.  Interesting...
>
>> The samba serving node
>> keeps crashing.  I have heartbeat set up so that failover happens to the
>> webserver node, at which point the system apparently behaves well.
>
> Which kind of failover? Do you start Samba on the webserver node? It
> would be interesting to know if the two run well together on the same
> node, but fail on separate nodes.
>
>> After reading a few articles on the list it seemed to me that the problem
>> might be samba using oplocks or some other caching mechanism that breaks
>> synchronization.
>
> Yeah... that was my next question...
>
>> I tried turning oplocks=off in my smb.conf file, but that
>> made the system unusably slow (over 3 minutes to right-click on a two-meg
>> file).
>
> Curious.
>
> ...but did it fix the other problems?
>
> I'd really love to work with someone to figure all this out.  (Hint hint.)
> :)
>
>> I am also not sure that is the extent of the problem, as I seem to be able
>> to re-create the crash simply by accessing the same file on multiple
>> clients just via samba (which locking should be able to handle).
>
> Should be...
>
>> If the
>> problem were merely that the remote node and the samba node were both
>> accessing an oplocked file I could understand, but that doesn't always seem
>> to be the case.
>
> There's more here than I can figure out just from the description. It'd
> take some digging along-side someone who knows GFS.
>
>> has anyone had any success running the same type of setup?  I am also
>> serving nfs on the samba server, though with very little load there.
>
> Is there any overlap in the files they're serving?
>
>> below is the syslog output of a crash.  I'm running 2.6.8-1.521smp with a
>> GFS CVS dump from mid-september.
>> -alan
>
> Wish I could be more help...
>
> Chris -)-----



