On 10/19/2014 06:05 PM, Anirban Ghoshal wrote:
I see. Thanks a tonne for the thorough explanation! :) I can see that our setup would be vulnerable here, because the logger on one server is not generally aware of the state of the replica on the other server. So it is possible that the log files may have been renamed before heal had a chance to kick in.
Could I also ask you for the bug ID (should there be one) against which you are coding up the fix, so that we can get a notification once it is passed?
This bug was reported by Red Hat QE and has been cloned upstream. I copied the relevant content so you would understand the context:
https://bugzilla.redhat.com/show_bug.cgi?id=1154491
Pranith
Also, as an aside, is O_DIRECT supposed to prevent this from occurring if one were to make allowance for the performance hit?
Unfortunately no :-(. As far as I understand, that was the only work-around.
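A minimal local sketch of why the open flags make no difference (file names are made up; the writer's descriptor stays bound to the same inode across a rename, and O_DIRECT only bypasses the page cache rather than making the writer re-resolve the path):

    exec 3>>abc.log        # a writer holds an fd open on abc.log
    mv abc.log abc.log.1   # log rotation renames the file
    echo "entry" >&3       # the write still lands in the inode now named abc.log.1
    exec 3>&-              # only a close and re-open would pick up a fresh abc.log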
Pranith
On 10/19/2014 01:36 PM, Anirban Ghoshal wrote:
It is possible, yes, because these are actually a kind of log file. I suppose, like in other logging frameworks, these files can remain open for a considerable period and then get renamed to support log-rotate semantics. That said, I might need to check with the team that actually manages the logging framework to be sure; I only take care of the file-system stuff. I can tell you for sure on Monday.
If it is the same race that you mention, is there a fix for it?
Thanks,
Anirban
I am working on the fix.
RCA:
0) Let's say the file 'abc.log' is opened for writing on the replica pair (brick-0, brick-1).
1) brick-0 goes down.
2) abc.log is renamed to abc.log.1.
3) brick-0 comes back up.
4) A re-open of the old abc.log happens from the mount to brick-0.
5) Self-heal kicks in, deletes the old abc.log, and creates and syncs abc.log.1.
6) But the mount is still writing to the deleted old abc.log on brick-0, so abc.log.1 on brick-0 remains at the same size while abc.log.1 on brick-1 keeps increasing. This leads to a size-mismatch split-brain on abc.log.1.
The race is between steps 4) and 5). If 5) happens before 4), no split-brain will be observed.
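A quick way to check whether a brick is stuck in this state (the brick path below is a placeholder; the brick PID can be read from 'gluster volume status <volname>'):

    # on the server hosting brick-0, look for a deleted-but-open file in the brick's fd table
    ls -l /proc/<brick-pid>/fd | grep deleted
    # an entry such as '... -> /bricks/brick0/testvol/abc.log (deleted)' means the brick
    # is still writing to the unlinked file, i.e. step 4) won the race against step 5)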
Work-around:
0) Take a backup of the good abc.log.1 file from brick-1 (just being paranoid).
Do either of the following two steps to make sure the stale file that is still open gets closed:
1-a) Take the brick process with the bad file down using kill -9 <brick-pid> (in my example, brick-0).
1-b) Introduce a temporary disconnect between the mount and brick-0.
(I would choose 1-a.)
2) Remove the bad file (abc.log.1) and its gfid backend file from brick-0.
3) Bring the brick back up (gluster volume start <volname> force) / restore the connection, and let it heal by doing a 'stat' on the file abc.log.1 from the mount. A command-level sketch of these steps follows below.
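Roughly, with placeholder brick paths and mount point (the hard link under .glusterfs sits in a directory named after the first two byte-pairs of the file's gfid):

    # 0) on brick-1's server: keep a copy of the good file
    cp -a /bricks/brick1/testvol/abc.log.1 /root/abc.log.1.bak

    # 1-a) on brick-0's server: take the brick process with the bad file down
    kill -9 <brick-pid>

    # 2) on brick-0's server: read the gfid, then remove the file and its
    #    .glusterfs hard link (path built from the gfid printed above)
    getfattr -n trusted.gfid -e hex /bricks/brick0/testvol/abc.log.1
    rm /bricks/brick0/testvol/abc.log.1
    rm /bricks/brick0/testvol/.glusterfs/<xx>/<yy>/<full-gfid>

    # 3) bring the brick back and trigger the heal from the mount
    gluster volume start testvol force
    stat /mnt/testvol/abc.log.1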
This bug has existed since 2012, from the first time I implemented rename/hard-link self-heal. It is difficult to re-create; I have to put break-points at several places in the process to hit the race.
Pranith
On 10/18/2014 04:36 PM, Anirban Ghoshal wrote:
Hi,
Yes, they do, and considerably. I'd forgotten to mention that in my last email. Their mtimes, however, as far as I could tell on the separate servers, seemed to coincide.
Thanks,
Anirban
Are these files always open? And is it possible that the file could have been renamed when one of the bricks was offline? I know of a race which can introduce this one; just trying to find out if it is the same case.
Pranith
Hi,
Could you see if the size of the file mismatches?
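One way to check is to stat the file directly on each brick back-end rather than through the mount (the brick export path is a placeholder):

    # run on each server against its own brick, not against the mount
    stat -c '%s %y' <brick-export-path>/SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log
    # prints size and mtime; a size difference between the bricks while 'heal info
    # split-brain' shows nothing would fit the rename race described above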
Pranith
On 10/18/2014 04:20 AM, Anirban Ghoshal wrote:
Hi everyone,
I have this really confusing split-brain here that's bothering me. I am running glusterfs 3.4.2 over linux 2.6.34. I have a replica 2 volume 'testvol'. It seems I cannot read/stat/edit the file in question, and `gluster volume heal testvol info split-brain` shows nothing. Here are the logs from the fuse-mount for the volume:
[2014-09-29 07:53:02.867111] W [fuse-bridge.c:1172:fuse_err_cbk] 0-glusterfs-fuse: 4560969: FLUSH() ERR => -1 (Input/output error)
[2014-09-29 07:54:16.007799] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8529d20 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.007854] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561103: READ => -1 (Input/output error)
[2014-09-29 07:54:16.008018] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8607ee0 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.008056] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561104: READ => -1 (Input/output error)
[2014-09-29 07:54:16.008233] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8066f30 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.008269] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561105: READ => -1 (Input/output error)
[2014-09-29 07:54:16.008800] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c860bcf0 & waitq = 0x7fd5c863b1f0
[2014-09-29 07:54:16.008839] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561107: READ => -1 (Input/output error)
[2014-09-29 07:54:16.009365] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c85fd120 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.009413] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561109: READ => -1 (Input/output error)
[2014-09-29 07:54:16.040549] W [afr-open.c:213:afr_open] 0-testvol-replicate-0: failed to open as split brain seen, returning EIO
[2014-09-29 07:54:16.040594] W [fuse-bridge.c:915:fuse_fd_cbk] 0-glusterfs-fuse: 4561142: OPEN() /SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log => -1 (Input/output error)
Could somebody please give me some clue on where to begin? I checked the xattrs on /SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log and it seems the changelogs are [0, 0] on both replicas, and the gfids match.
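For reference, the check I ran was along these lines, against the file on each brick back-end (brick export path is a placeholder; for a volume named testvol the AFR changelogs are the trusted.afr.testvol-client-0 and trusted.afr.testvol-client-1 xattrs):

    getfattr -d -m . -e hex <brick-export-path>/SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log
    # trusted.afr.testvol-client-* hold the pending changelog counters (all zero here on
    # both bricks); trusted.gfid should be identical on the two bricks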
Thank you very much for any help on this.
Anirban
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users