Here's what I think I know:
1. Samba does indeed crash when running with "oplocks = no" AND on a single node of a two-node GFS cluster where the other node is not doing much of anything. So, Christopher, the answer to your questions is that it seems to be _purely_ samba and gfs interacting that is setting up a crash.
2. Samba sometimes crashes in such a way that a "kill -9" will not eliminate the crashed processes. In this situation attempting to start samba again creates a [predictably] unstable situation in which kernel oopses are evident.
3. Often samba simply starts to fail without outright crashing. This renders both heartbeat monitoring and fencing rather useless. On the client end this is evidenced by the ability to access a share and browse, but severely degraded performance when accessing individual files and frequent crashes of Windows Explorer. On the server end, more and more smb processes start up but old ones don't die...
Here are some observations. <disclaimer> Some of these could be completely bogus because I am extrapolating on insufficient facts </disclaimer>.
1. Crashes are not necessarily caused by having multiple accesses to the same file. In one situation having 7 computers read/write to the same directory (but never the same file) appears to have caused a complete crash of a server running samba on top of GFS.
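One way to put numbers on points 2 and 3 is to snapshot the smbd processes and their states at intervals. The sketch below is only an illustration: it assumes a Linux `ps` that accepts `-eo pid,stat,comm`, and it assumes (not confirmed by anything above) that the processes surviving "kill -9" are sitting in uninterruptible sleep, which `ps` reports as state D. The function names are made up for this example.

```python
import subprocess
from collections import Counter


def parse_ps(ps_output, name="smbd"):
    """Return (pid, state) pairs for processes whose command is `name`.

    Expects the text printed by `ps -eo pid,stat,comm` (header row first).
    """
    procs = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[2].strip() == name:
            procs.append((int(fields[0]), fields[1]))
    return procs


def summarize(procs):
    """Count processes by the first letter of their state (R, S, D, Z, ...)."""
    return Counter(state[0] for _, state in procs)


def live_snapshot():
    """Run ps on the local machine and return its raw output (Linux only)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `summarize(parse_ps(live_snapshot()))` once a minute from cron would show whether the smbd count keeps growing and whether any of them are stuck in state D before a fence event.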
2. Crashes could be load related. I seem to be exponentially more likely to see a crash with 50 concurrent users than with 5. Since having many users increases the chances of a situation like #1 above, I can see a possible correlation, but this would not explain all crashes.
3. Larger files and fuller directories experience severe performance degradation in the samba/gfs scenario. Simply right-clicking on a networked file that is a few hundred megabytes can take minutes to pop up a menu (assuming it doesn't crash Windows Explorer first).
4. Crashes occur on quota-enforced and non-quota-enforced systems with no discernible difference. However, access to files on quota-ed systems might be slower.
I experimented with a lot of different settings in the smb.conf, and while the crashes were sometimes different I could not come up with a good mapping of configs<=>crash causes. I have been very frustrated because every time I think I've discovered something, a new crash happens that appears to prove me wrong. I need to create a separate test environment where I can set up idealized crash conditions in order to give this list some more credible data, because my current environment has a lot of simultaneous access from multiple users whose actions aren't easily monitorable or consistent. For instance, if bob can't access a server drive on windows computer one, he simply logs into computer two and then three hoping to get a different result. The server crashes, but did it crash because bob didn't wait long enough on the first access and exacerbated a fixable problem into a crash? Hard to tell...
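For the separate test environment, something along these lines could stand in for the "several clients in one directory, never the same file" case from observation 1. It is a rough sketch under stated assumptions: threads on one machine only approximate separate client computers, the file names and default sizes are invented for the example, and the directory passed in would have to be the samba share mounted on a test client.

```python
import os
import threading


def worker(directory, index, rounds, size):
    """Repeatedly write and read back one private file inside `directory`."""
    path = os.path.join(directory, f"loadtest_{index}.dat")
    payload = bytes([index % 256]) * size
    for _ in range(rounds):
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data out instead of leaving it cached
        with open(path, "rb") as f:
            if f.read() != payload:
                raise IOError(f"read-back mismatch in {path}")


def run_load(directory, clients=7, rounds=20, size=64 * 1024):
    """Start `clients` threads, each using its own file; return a list of errors."""
    errors = []

    def guarded(i):
        try:
            worker(directory, i, rounds, size)
        except Exception as exc:  # record the failure and let other workers continue
            errors.append((i, repr(exc)))

    threads = [threading.Thread(target=guarded, args=(i,)) for i in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

Running the same script with `clients=5` and then `clients=50` against an otherwise idle share would be one way to check the load theory without real users in the mix.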
hoping someone out there has seen something similar or can shed some light.
anyway, my system details: I'm still running samba 3.0.7 on top of kernel 2.6.8-1.521
I tried updating to the newest CVS releases but ran into compile errors and haven't had time to try again. so the GFS build is still mid-september. I'd like to try the fixes that Patrick and David posted but think I am going to try to compile cleanly with a 2.6.9 kernel.
error details: I tried turning the loglevel up to 3. I get the following fairly often:
Nov 15 17:10:32 clu2 smbd[5624]: Error writing 5 bytes to client. -1. (Connection reset by peer)
Nov 15 17:10:32 clu2 smbd[5624]: [2004/11/15 17:10:32, 0] lib/util_sock.c:send_smb(647)
Nov 15 17:10:32 clu2 smbd[5624]: write_socket: Error writing 5 bytes to socket 24: ERRNO = Connection reset by peer
Nov 15 17:10:32 clu2 smbd[5624]: [2004/11/15 17:10:32, 0] lib/util_sock.c:write_socket(455)
Nov 15 17:10:32 clu2 smbd[5624]: write_socket_data: write failure. Error = Connection reset by peer
Nov 15 17:10:32 clu2 smbd[5624]: [2004/11/15 17:10:32, 0] lib/util_sock.c:write_socket_data(430)
otherwise I just see runaway processes that won't die or I get a fence event with no apparent log entry leading to it.
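To see whether the "Connection reset by peer" errors cluster around particular smbd processes, the syslog lines can be tallied per pid. This is a minimal sketch that assumes the syslog layout shown above (timestamp, host, `smbd[pid]:`, message); the function name and the regular expression are mine and would need adjusting for a different syslog format.

```python
import re
from collections import Counter

# Matches lines such as:
#   Nov 15 17:10:32 clu2 smbd[5624]: write_socket_data: write failure. ...
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"smbd\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)


def reset_counts(lines):
    """Count 'Connection reset by peer' messages per smbd pid."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and "Connection reset by peer" in match.group("msg"):
            counts[int(match.group("pid"))] += 1
    return counts
```

Feeding it the lines of the system log (for example `reset_counts(open(path))`, where `path` is wherever syslog writes on that node) gives a per-process count that can be lined up against the time of a fence event.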
-alan
On Mon, 1 Nov 2004, Christopher R. Hertel wrote:
On Mon, Nov 01, 2004 at 12:30:47PM -0800, Alan Wood wrote:
I am running a cluster with GFS-formatted file systems mounted on multiple nodes. What I was hoping to do was to set up one node running httpd to be my webserver and another node running samba to share the same data internally. What I am getting when running that is instability.
Yeah. This is a known problem. The reason is that Samba must maintain a great deal of metadata internally. This works well enough with multiple Samba processes running on a single machine dealing (more or less) directly with the filesystem.
The problem is that Samba must keep track of translations between Posix and Windows metadata, locking semantics, file sharing mode semantics, etc.
I had assumed that this would only be a problem if Samba was running on multiple machines all GFS-sharing the same back-end block storage. Your report suggests that there's more to the interaction between Samba and GFS than I had anticipated. Interesting...
The samba serving node keeps crashing. I have heartbeat set up so that failover happens to the webserver node, at which point the system apparently behaves well.
Which kind of failover? Do you start Samba on the webserver node? It would be interesting to know if the two run well together on the same node, but fail on separate nodes.
After reading a few articles on the list it seemed to me that the problem might be samba using oplocks or some other caching mechanism that breaks synchronization.
Yeah... that was my next question...
I tried turning oplocks=off in my smb.conf file, but that made the system unusably slow (over 3 minutes to right-click on a two-meg file).
Curious.
...but did it fix the other problems?
I'd really love to work with someone to figure all this out. (Hint hint.) :)
I am also not sure that is the extent of the problem, as I seem to be able to re-create the crash simply by accessing the same file on multiple clients just via samba (which locking should be able to handle).
Should be...
If the problem were merely that the remote node and the samba node were both accessing an oplocked file I could understand, but that doesn't always seem to be the case.
There's more here than I can figure out just from the description. It'd take some digging along-side someone who knows GFS.
has anyone had any success running the same type of setup? I am also serving nfs on the samba server, though with very little load there.
Is there any overlap in the files they're serving?
below is the syslog output of a crash. I'm running 2.6.8-1.521smp with a GFS CVS dump from mid-september. -alan
Wish I could be more help...
Chris -)-----