On 6 Mar 2013, at 09:02, Todd Stansell <todd at stansell.org> wrote:

> In our recent testing, we saw all kinds of weird problems while testing
> rebuilding a failed brick in the same 2 node replicate cluster. Several times
> we had to kill off all gluster processes and restart things from scratch to
> get the two sides talking correctly again (where both sides thought they were
> happily talking to the other side, but self-heal wasn't doing anything). We'd
> run a full heal or stat some files and they wouldn't replicate back to the
> other side. After restarting the processes (not just glusterd, but all of the
> glusterfs ones too), things would start working. Once things were running and
> the nodes were properly replicating, it appeared to flow both ways nicely.

Thanks. I've managed to fix it by deleting every bit of gluster I could find, reinstalling, and copying all the data back on. I also saw that bug report of NFS hangs with 3.3.1, so I downgraded to 3.3.0 (which also meant switching to the older PPA).

It would be good to have a definitive list of where gluster puts everything - after uninstalling and deleting everything I could find, it still clearly had some info about my old config. I did this after unmounting the volume and stopping all gluster services:

aptitude remove glusterfs-client glusterfs-server
aptitude purge glusterfs-client glusterfs-server
rm -rf /etc/glusterfs
rm -rf /var/log/glusterfs
rm -rf /var/lib/glusterfs
rm -rf /usr/lib/glusterfs
rm -rf /var/shared   (my gluster storage area)

and yet when I reinstalled, I saw entries in the logs mentioning the brick storage area from my old installation - I've no idea where that info was lurking for it to find it again.

Incidentally, that also caused a bug of sorts. When starting glusterd, it hung, filling /var/log/glusterfs/etc-glusterfs-glusterd.vol.log with repeats of this:

[2013-03-05 19:24:21.137209] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.3.0
[2013-03-05 19:24:21.137393] E [glusterfsd.c:1296:glusterfs_pidfile_setup] 0-glusterfsd: pidfile /var/run/glusterd.pid lock error (Resource temporarily unavailable)

I couldn't find anything on Google relating to this error, but it seems it's caused when gluster can't find its storage area. In my case, creating /var/shared fixed this problem. I've no idea why it would report that as an issue with the pid file, but hopefully this will help someone else.

Marcus

-- 
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info at hand CRM solutions
marcus at synchromedia.co.uk | http://www.synchromedia.co.uk/
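
P.S. A guess at where the old config might have been lurking: the list above removes /var/lib/glusterfs, but glusterd keeps its volume and peer state under /var/lib/glusterd, and brick directories themselves carry gluster metadata (extended attributes plus a .glusterfs directory). I haven't verified every path on 3.3.0, but a more thorough cleanup would look roughly like this (substitute your own brick path for /var/shared):

# after stopping all gluster processes and unmounting the volume
rm -rf /var/lib/glusterd          # glusterd's volume and peer state
rm -rf /etc/glusterfs /var/log/glusterfs
# only needed if you want to reuse a brick directory rather than delete it:
setfattr -x trusted.glusterfs.volume-id /var/shared
setfattr -x trusted.gfid /var/shared
rm -rf /var/shared/.glusterfs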
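
P.P.S. On the pidfile message: "Resource temporarily unavailable" is the EAGAIN error from taking the lock on /var/run/glusterd.pid, which usually just means some other glusterd process still held the pidfile locked. If anyone hits the same thing, a couple of generic checks (nothing gluster-specific, only a suggestion):

pgrep -fl gluster                  # list any leftover glusterd/glusterfsd processes
fuser -v /var/run/glusterd.pid     # show which process, if any, has the pidfile open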