I see. Thanks a tonne for the thorough explanation! :) I can see that our setup would be vulnerable here, because the logger on one server is generally not aware of the state of the replica on the other server. So it is possible that the log files were renamed before heal had a chance to kick in.

Could I also ask for the bug ID (should there be one) against which you are coding up the fix, so that we can get a notification once the fix is in?

Also, as an aside, is O_DIRECT supposed to prevent this from occurring, if one is willing to accept the performance hit?

Thanks again,
Anirban
From: Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>;
To: Anirban Ghoshal <chalcogen_eg_oxygen@xxxxxxxxx>; <gluster-users@xxxxxxxxxxx>;
Subject: Re: Split-brain seen with [0 0] pending matrix and io-cache page errors
Sent: Sun, Oct 19, 2014 9:01:58 AM
On 10/19/2014 01:36 PM, Anirban Ghoshal wrote:
I am working on the fix.

RCA:
0) Let's say the file 'abc.log' is opened for writing on replica pair (brick-0, brick-1).
1) brick-0 goes down.
2) abc.log is renamed to abc.log.1.
3) brick-0 comes back up.
4) Re-open on the old abc.log happens from the mount to brick-0.
5) Self-heal kicks in, deletes the old abc.log, and creates and syncs abc.log.1.
6) But the mount is still writing to the deleted 'old abc.log' on brick-0, so the abc.log.1 file on brick-0 remains at the same size while the abc.log.1 file keeps increasing on brick-1. This leads to a size-mismatch split-brain on abc.log.1.

The race is between steps 4) and 5). If 5) happens before 4), no split-brain will be observed.

Work-around:
0) Take a backup of the good abc.log.1 file from brick-1. (Just being paranoid.)

Do either of the following two steps to make sure the stale file that is open is closed:
1-a) Take the brick process with the bad file down using kill -9 <brick-pid> (brick-0 in my example).
1-b) Introduce a temporary disconnect between the mount and brick-0.
(I would choose 1-a.)

2) Remove the bad file (abc.log.1) and its gfid-backend-file from brick-0.
3) Bring the brick back up (gluster volume start <volname> force) or restore the connection, and let it heal by doing 'stat' on the file abc.log.1 on the mount.

This bug has existed since 2012, from the first time I implemented rename/hard-link self-heal. It is difficult to re-create; I have to put break-points at several places in the process to hit the race.

Pranith
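
For anyone who wants to see the effect in steps 4) to 6) without a GlusterFS setup, here is a minimal local sketch. It is only an analogy on an ordinary POSIX filesystem, not GlusterFS code: the function name simulate_stale_fd is made up for illustration, the "heal" is just an unlink followed by creating abc.log.1, and it assumes a platform where an open file can be unlinked (Linux and similar). It shows that writes through a descriptor opened before the delete keep going to the deleted file, so the healed abc.log.1 never grows.

```python
import os
import tempfile


def simulate_stale_fd(workdir):
    """Local analogy of RCA steps 4)-6); hypothetical helper, not GlusterFS code."""
    old_path = os.path.join(workdir, "abc.log")
    healed_path = os.path.join(workdir, "abc.log.1")

    # Step 4: the writer (standing in for the mount) re-opens the old name.
    writer = open(old_path, "ab", buffering=0)
    writer.write(b"before heal\n")

    # Step 5: stand-in for self-heal: delete the old name, create abc.log.1
    # with the content known at heal time.
    os.unlink(old_path)
    with open(healed_path, "wb") as healed:
        healed.write(b"before heal\n")
    size_after_heal = os.path.getsize(healed_path)

    # Step 6: the writer keeps using its stale descriptor, which still
    # points at the deleted file, not at abc.log.1.
    for _ in range(3):
        writer.write(b"after heal\n")
    stale_size = os.fstat(writer.fileno()).st_size
    writer.close()

    # abc.log.1 has not grown; only the deleted file did.
    return size_after_heal, os.path.getsize(healed_path), stale_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(simulate_stale_fd(d))
```

The first two returned sizes are equal (the healed file is stuck at its heal-time size) while the third keeps growing, which corresponds to abc.log.1 on brick-0 falling behind the copy on brick-1. Closing the stale descriptor, as in work-around step 1-a) or 1-b), is what stops the writes from going to the deleted file.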
_______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx http://supercolony.gluster.org/mailman/listinfo/gluster-users