Running the following script on an unimportant tree in the cluster this
weekend as a test. So far, so good, and it appears to be doing what I want.
Thanks again Pete for the recommendation.

-Dan

#!/bin/bash

# Force a fresh copy of every file under the mount: rename the original to a
# unique temporary name, copy it back into place preserving attributes, then
# remove the temporary copy.

BR='-------------------------'

UUID=$(/usr/bin/uuidgen)
if [ "$UUID" == "" ]
then
        echo "UUID is null"
        exit 1
fi

find "/mnt/blah/" -type f | while IFS= read -r FILE
do
        DNAME=$(dirname "${FILE}")
        FNAME=$(basename "${FILE}")

        cd "${DNAME}"
        if (( $? > 0 ))
        then
                echo "Bad cd operation"
                exit 1
        fi

        pwd

        # Rename the original out of the way...
        mv -v "${FNAME}" "${FNAME}.${UUID}"
        if (( $? > 0 ))
        then
                echo "Bad mv operation"
                exit 1
        fi

        # ...copy it back, preserving mode/ownership/timestamps...
        cp -pv "${FNAME}.${UUID}" "${FNAME}"
        if (( $? > 0 ))
        then
                echo "Bad cp operation"
                exit 1
        fi

        # ...and remove the temporary copy.
        rm -fv "${FNAME}.${UUID}"
        if (( $? > 0 ))
        then
                echo "Bad rm operation"
                exit 1
        fi

        echo "${BR}"
done

On 11 April 2013 22:41, Daniel Mons <daemons at kanuka.com.au> wrote:
> Hi Pete,
>
> Thanks for that link. I'm going to try this en masse on an unimportant
> directory over the weekend.
>
> -Dan
>
>
> On 11 April 2013 01:41, Pete Smith <pete at realisestudio.com> wrote:
>> Hi Dan
>>
>> I've come up against this recently whilst trying to delete large amounts
>> of files from our cluster.
>>
>> I'm resolving it with the method from
>> http://comments.gmane.org/gmane.comp.file-systems.gluster.user/1917
>>
>> With Fabric as a helping hand, it's not too tedious.
>>
>> Not sure about the level of glustershd compatibility, but it's working
>> for me.
>>
>> HTH
>>
>> Pete
>> --
>>
>>
>> On 10 April 2013 11:44, Daniel Mons <daemons at kanuka.com.au> wrote:
>>>
>>> Our production GlusterFS 3.3.1GA setup is a 3x2 distribute-replicate,
>>> with 100TB usable for staff. This is one of 4 identical GlusterFS
>>> clusters we're running.
>>>
>>> Very early in the life of our production Gluster rollout, we ran
>>> Netatalk 2.X to share files with MacOSX clients (due to slow negative
>>> lookup on CIFS/Samba for those pesky resource fork files in MacOSX's
>>> Finder). Netatalk 2.X wrote its CNID_DB files back to Gluster, which
>>> caused enormous IO, locking up many nodes at a time (lots of "hung
>>> task" errors in dmesg/syslog).
>>>
>>> We've since moved to Netatalk 3.X, which puts its CNID_DB files
>>> elsewhere (we put them on local SSD RAID), and the lockups have
>>> vanished. However, our split-brain files number in the tens of
>>> thousands due to those previous lockups, and they aren't always
>>> predictable (i.e. it's not always the case that brick0 is "good" and
>>> brick1 is "bad"). Manually fixing the files is far too time consuming.
>>>
>>> I've written a rudimentary script that trawls
>>> /var/log/glusterfs/glustershd.log for split-brain GFIDs, tracks each
>>> one down on the matching pair of bricks, and decides via a few rules
>>> (size tends to be a good indicator for us, as bigger files tend to be
>>> more recent ones) which is the "good" file. This works for about 80%
>>> of files, which will dramatically reduce the amount of data we have to
>>> check manually.
>>>
>>> My question is: what should I do from here? Options are:
>>>
>>> Option 1) Delete the file from the "bad" brick
>>>
>>> Option 2) rsync the file from the "good" brick to the "bad" brick
>>> with the -aX flags (preserve everything, including trusted.afr.$server
>>> and trusted.gfid xattrs)
>>>
>>> Option 3) rsync the file from "good" to "bad", and then setfattr -x
>>> trusted.* on the bad brick.
>>>
>>> Which of these is considered the better (more glustershd-compatible)
>>> option? Or alternatively, is there something else that's preferred?
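For reference, a rough sketch of what option 3 could look like when run on
the "bad" brick's server. The remote host "goodserver", brick path
"/data/brick0" and volume name "gv0" are placeholders, and the exact
trusted.afr.* attribute names depend on how the volume is defined, so treat
this as illustrative rather than a tested recipe:

# Sketch only: overwrite the bad copy with the good one, then drop the AFR
# changelog and GFID xattrs on the bad brick so self-heal can regenerate them.
# "goodserver", "/data/brick0" and volume name "gv0" are placeholders.
FILE="some/path/relative/to/the/brick"
rsync -aX "goodserver:/data/brick0/${FILE}" "/data/brick0/${FILE}"
# setfattr does not expand trusted.*, so each attribute is removed by name.
setfattr -x trusted.afr.gv0-client-0 "/data/brick0/${FILE}"
setfattr -x trusted.afr.gv0-client-1 "/data/brick0/${FILE}"
setfattr -x trusted.gfid "/data/brick0/${FILE}"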
>>>
>>> Normally I'd just test this on our backup gluster, however as it was
>>> never running Netatalk, it has no split-brain problems, so I can't
>>> test the functionality.
>>>
>>> Thanks for any insight provided,
>>>
>>> -Dan
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>
>>
>>
>>
>> --
>> Pete Smith
>> DevOp/System Administrator
>> Realise Studio
>> 12/13 Poland Street, London W1F 8QB
>> T. +44 (0)20 7165 9644
>>
>> realisestudio.com
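On the detection side, a rough sketch of the GFID-trawling step Dan
describes — pulling split-brain GFIDs out of glustershd.log and resolving
each one to its path on a brick via the .glusterfs hard-link tree. The brick
path is a placeholder and the exact log wording varies between 3.3.x
releases, so this is a starting point rather than the actual script in use:

#!/bin/bash
# Sketch only: list GFIDs flagged as split-brain by the self-heal daemon and
# map each to its real path on this brick. BRICK is a placeholder.
BRICK="/data/brick0"

grep 'split-brain' /var/log/glusterfs/glustershd.log \
        | grep -o 'gfid:[0-9a-f-]\{36\}' \
        | sed 's/^gfid://' | sort -u \
        | while IFS= read -r GFID
do
        # Regular files are hard-linked under .glusterfs/aa/bb/<gfid>, so any
        # path on the brick sharing that inode is the file itself.
        LINK="${BRICK}/.glusterfs/${GFID:0:2}/${GFID:2:2}/${GFID}"
        [ -e "${LINK}" ] || continue
        find "${BRICK}" -path "${BRICK}/.glusterfs" -prune -o \
                -samefile "${LINK}" -print
done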