It's a replicated volume, but only one client, with a single process, was writing to the cluster, so I don't understand how you could have a split-brain.

The other issue is that while making a tar of the static files on the replicated volume, I kept getting errors from tar that the file changed as we read it. This was content I had copied *to* the cluster, and only one client node was acting on it at a time, so there is no chance anyone or anything was updating the files. And this error was coming up every 6 to 10 files.

All three nodes were part of a Linux-HA NFS cluster that worked flawlessly for weeks, so I feel pretty confident it's not the environment.

I understand the hang could be unrelated, but the two things above cause me concern. Previously, when I worked with 3.2.6, I had a lot of problems with split-brains, "No end-point connected" errors, etc., so I gave up on Gluster. Seeing the behavior above in a test environment makes me wonder. What could cause this in a closed dev environment?

sean

On 06/17/2012 03:42 AM, Brian Candler wrote:
> On Sat, Jun 16, 2012 at 04:47:51PM -0400, Sean Fulton wrote:
>> 1) The split-brain message is strange because there are only two
>> server nodes and 1 client node which has mounted the volume via NFS
>> on a floating IP. This was done to guarantee that only one node gets
>> written to at any point in time, so there is zero chance that two
>> nodes were updated simultaneously.
> Are you using a distributed volume, or a replicated volume? Writes to a
> replicated volume go to both nodes.
>
>> [586898.273283] INFO: task flush-0:45:633954 blocked for more than 120 seconds.
>> [586898.273290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [586898.273295] flush-0:45 D ffff8806037592d0 0 633954 20 0x00000000
>> [586898.273304] ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c 0000000000000000
>> [586898.273312] ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80 ffff88000d1ebbf0
>> [586898.273319] ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8 ffff88000d1ebfd8
>> [586898.273326] Call Trace:
>> [586898.273335] [<ffffffff81054444>] ? find_busiest_group+0x244/0xb20
>> [586898.273343] [<ffffffff811ab050>] ? inode_wait+0x0/0x20
>> [586898.273349] [<ffffffff811ab05e>] inode_wait+0xe/0x20
> Are you using XFS by any chance?
>
> I started with XFS, because that was what the gluster docs recommend, but
> eventually gave up on it. I can replicate that sort of kernel lockup on a
> 24-disk MD array within a short space of time - without gluster, just by
> throwing four bonnie++ processes at it.
>
> The same tests run with either ext4 or btrfs do not hang, at least not
> during two days of continuous testing.
>
> Of course, any kernel problem cannot be the fault of glusterfs, since
> glusterfs runs entirely in userland.
>
> Regards,
>
> Brian.

-- 
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203
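
[A minimal diagnostic sketch for the "file changed as we read it" errors, assuming Python is available on the client node and that /mnt/gluster stands in for the actual NFS mount point (both are placeholders, not from the thread): it hashes every file on the mount twice, a couple of seconds apart, and reports any file whose size, mtime, or checksum differs between the two passes. If nothing is reported while tar still complains, that would point at the stat information returned through the mount fluctuating rather than at anything actually rewriting the data.]

#!/usr/bin/env python
# Sketch: walk a tree on the gluster/NFS mount, hash every file twice with a
# short pause in between, and report any file whose size, mtime, or checksum
# differs between the two passes -- the same condition that makes tar complain
# "file changed as we read it".
import hashlib
import os
import time

MOUNT = "/mnt/gluster"   # assumption: replace with the client's real NFS mount point
PAUSE = 2                # seconds between the two passes

def snapshot(root):
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
                h = hashlib.md5()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                state[path] = (st.st_size, st.st_mtime, h.hexdigest())
            except (IOError, OSError):
                state[path] = None   # unreadable or vanished mid-walk
    return state

first = snapshot(MOUNT)
time.sleep(PAUSE)
second = snapshot(MOUNT)

for path, before in sorted(first.items()):
    after = second.get(path)
    if before != after:
        print("CHANGED: %s  %r -> %r" % (path, before, after))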