Re: Split-brain seen with [0 0] pending matrix and io-cache page errors

Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> · Mon, 20 Oct 2014 09:38:05 +0530



    On 10/19/2014 06:05 PM, Anirban Ghoshal wrote:

    
              I see. Thanks a tonne for the thorough explanation!
                :) I can see that our setup would be vulnerable here
                because the logger on one server is not generally aware
                of the state of the replica on the other server. So, it
                is possible that the log files may have been renamed
                before heal had a chance to kick in. 

                
                Could I also request you for the bug ID (should there be
                one) against which you are coding up the fix, so that we
                could get a notification once it is passed?

              
    This bug was reported by Red Hat QE and has been cloned upstream. I
    copied the relevant content there so you can understand the context:

    https://bugzilla.redhat.com/show_bug.cgi?id=1154491

    
    Pranith

    
                Also, as an aside, is O_DIRECT supposed to prevent this
                from occurring if one were to make allowance for the
                performance hit? 

              
    Unfortunately no :-(. As far as I understand, that was the only
    work-around.

    
    Pranith

    
                Thanks again,

                Anirban
            
          
               From: Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>;
               To: Anirban Ghoshal <chalcogen_eg_oxygen@xxxxxxxxx>; <gluster-users@xxxxxxxxxxx>;
               Subject: Re: Split-brain seen with [0 0] pending matrix and io-cache page errors
               Sent: Sun, Oct 19, 2014 9:01:58 AM

            
                    On 10/19/2014 01:36 PM, Anirban Ghoshal wrote:

                    
                              It is possible, yes, because these
                                are actually a kind of log file. I
                                suppose that, as with other logging
                                frameworks, these files can remain open
                                for a considerable period, and then get
                                renamed to support log-rotate semantics.
                                

                                That said, I might need to check with
                                the team that actually manages the
                                logging framework to be sure. I only
                                take care of the file-system stuff. I
                                can tell you for sure Monday. 

                                
                                If it is the same race that you mention,
                                is there a fix for it?

                                
                                Thanks,

                                Anirban
                            
                          
                    I am working on the fix.

                    
                    RCA:

                    0) Let's say the file 'abc.log' is opened for writing
                    on replica pair (brick-0, brick-1).

                    1) brick-0 went down

                    2) abc.log is renamed to abc.log.1

                    3) brick-0 comes back up

                    4) re-open on old abc.log happens from mount to
                    brick-0

                    5) self-heal kicks in and deletes old abc.log and
                    creates and syncs abc.log.1

                    6) But the mount is still writing to the deleted
                    'old abc.log' on brick-0, so abc.log.1 on brick-0
                    stays the same size while abc.log.1 on brick-1 keeps
                    growing. This leads to a size-mismatch split-brain
                    on abc.log.1.

                    
                    The race is between steps 4) and 5). If 5) happens
                    before 4), no split-brain will be observed.
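
                    The core of step 6) is that an open file descriptor
                    follows the file itself, not its name. The sketch
                    below is a local-filesystem analogy of steps 4) to
                    6), not GlusterFS code: the temporary directory
                    standing in for brick-0 and the function name
                    simulate_stale_fd are made up for illustration, and
                    it assumes a POSIX system where an open file can be
                    unlinked.

```python
import os
import tempfile


def simulate_stale_fd(payload: bytes = b"new log line\n") -> dict:
    """Local analogy of steps 4) to 6); this is not GlusterFS code."""
    # The temporary directory is a hypothetical stand-in for brick-0.
    with tempfile.TemporaryDirectory() as brick0:
        old_path = os.path.join(brick0, "abc.log")
        healed_path = os.path.join(brick0, "abc.log.1")

        # Step 4: the writer (re-)opens the old name on brick-0.
        fd = os.open(old_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            # Step 5: "self-heal" deletes old abc.log and creates abc.log.1.
            os.unlink(old_path)
            with open(healed_path, "wb") as healed:
                healed.write(b"synced content\n")
            size_before = os.stat(healed_path).st_size

            # Step 6: the writer keeps appending through the stale descriptor,
            # so the bytes land in the unlinked file, not in abc.log.1.
            os.write(fd, payload)

            return {
                "healed_size_before": size_before,
                "healed_size_after": os.stat(healed_path).st_size,
                "stale_fd_size": os.fstat(fd).st_size,
            }
        finally:
            os.close(fd)
```

                    Calling simulate_stale_fd() should show the size of
                    abc.log.1 unchanged while the unlinked file keeps
                    growing, which mirrors the size mismatch between
                    brick-0 and brick-1.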

                    
                    Work-around:

                    
                    0) Take backup of good abc.log.1 file from brick-1.
                    (Just being paranoid)

                    
                    Do either of the following two steps to make sure the
                    stale open file is closed:

                    1-a) Take down the brick process that holds the bad
                    file using kill -9 <brick-pid> (brick-0 in my example).

                    1-b) Introduce a temporary disconnect between mount
                    and brick-0.

                    (I would choose 1-a)

                    2) Remove the bad file (abc.log.1) and its
                    gfid-backend-file from brick-0.

                    3) Bring the brick back up (gluster volume start
                    <volname> force), or restore the connection, and
                    let it heal by doing 'stat' on the file abc.log.1 on
                    the mount.
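
                    For step 2), the gfid-backend-file is the hard link
                    that the brick keeps under its .glusterfs directory,
                    named after the file's gfid. A small sketch of how
                    that path can be derived, assuming the usual
                    .glusterfs/<first two hex digits>/<next two>/<gfid>
                    layout and a gfid value as printed by
                    'getfattr -n trusted.gfid -e hex <file>' on the
                    brick. The helper name and the brick root in the
                    usage note are illustrative, not taken from this
                    thread.

```python
import os
import uuid


def gfid_backend_path(brick_root: str, gfid_hex: str) -> str:
    """Map a trusted.gfid hex value (with or without 0x) to its .glusterfs path.

    Assumes the layout <brick>/.glusterfs/<aa>/<bb>/<canonical-uuid>.
    """
    digits = gfid_hex.strip()
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    # uuid.UUID raises ValueError unless it is given exactly 32 hex digits.
    gfid = str(uuid.UUID(hex=digits))  # canonical 8-4-4-4-12 form
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)
```

                    For example, with a hypothetical brick root
                    /bricks/b0, gfid_backend_path("/bricks/b0", value)
                    gives the second path to remove alongside
                    abc.log.1. Double-check both paths by hand before
                    deleting anything on the brick.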

                    
                    This bug has existed since 2012, when I first
                    implemented rename/hard-link self-heal. It is
                    difficult to re-create: I have to put break-points
                    at several places in the process to hit the race.

                    
                    Pranith
                    

                                  Thanks,

                                  Anirban
                              
                            
                                 From: Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>;
                                 To: Anirban Ghoshal <chalcogen_eg_oxygen@xxxxxxxxx>; <gluster-users@xxxxxxxxxxx>;
                                 Subject: Re: Split-brain seen with [0 0] pending matrix and io-cache page errors
                                 Sent: Sun, Oct 19, 2014 5:42:24 AM

                              
                                      On
                                        10/18/2014 04:36 PM, Anirban
                                        Ghoshal wrote:

                                      
                                                Hi,

                                                  
                                                  Yes, they do, and considerably. I'd forgotten to
                                                  mention that in my last email. Their mtimes,
                                                  however, as far as I could tell on separate
                                                  servers, seemed to coincide.

                                                  
                                                  Thanks,

                                                  Anirban
                                              
                                            
                                      Are these files always open? And
                                      is it possible that the file was
                                      renamed while one of the bricks
                                      was offline? I know of a race that
                                      can cause this kind of split-brain.
                                      Just trying to find out whether it
                                      is the same case.

                                      
                                      Pranith
                                      

                                                   From: Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>;
                                                   To: Anirban Ghoshal <chalcogen_eg_oxygen@xxxxxxxxx>; gluster-users@xxxxxxxxxxx <gluster-users@xxxxxxxxxxx>;
                                                   Subject: Re: Split-brain seen with [0 0] pending matrix and io-cache page errors
                                                   Sent: Sat, Oct 18, 2014 12:26:08 AM

                                                
                                                        hi,

                                                        Could you check whether the size of the file
                                                        mismatches?

                                                        
                                                        Pranith

                                                        
                                                          On 10/18/2014 04:20 AM, Anirban Ghoshal wrote:

                                                          
                                                          Hi everyone,

                                                          I have this really confusing split-brain here that's bothering me.
                                                          I am running glusterfs 3.4.2 over Linux 2.6.34. I have a replica 2
                                                          volume 'testvol' that is It seems I cannot read/stat/edit the file
                                                          in question, and `gluster volume heal testvol info split-brain`
                                                          shows nothing. Here are the logs from the fuse-mount for the volume:
                                                          

                                                          [2014-09-29 07:53:02.867111] W [fuse-bridge.c:1172:fuse_err_cbk] 0-glusterfs-fuse: 4560969: FLUSH() ERR => -1 (Input/output error)
                                                          [2014-09-29 07:54:16.007799] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8529d20 & waitq = 0x7fd5c8067d40
                                                          [2014-09-29 07:54:16.007854] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561103: READ => -1 (Input/output error)
                                                          [2014-09-29 07:54:16.008018] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8607ee0 & waitq = 0x7fd5c8067d40
                                                          [2014-09-29 07:54:16.008056] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561104: READ => -1 (Input/output error)
                                                          [2014-09-29 07:54:16.008233] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8066f30 & waitq = 0x7fd5c8067d40
                                                          [2014-09-29 07:54:16.008269] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561105: READ => -1 (Input/output error)
                                                          [2014-09-29 07:54:16.008800] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c860bcf0 & waitq = 0x7fd5c863b1f0
                                                          [2014-09-29 07:54:16.008839] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561107: READ => -1 (Input/output error)
                                                          [2014-09-29 07:54:16.009365] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c85fd120 & waitq = 0x7fd5c8067d40
                                                          [2014-09-29 07:54:16.009413] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561109: READ => -1 (Input/output error)
                                                          [2014-09-29 07:54:16.040549] W [afr-open.c:213:afr_open] 0-testvol-replicate-0: failed to open as split brain seen, returning EIO
                                                          [2014-09-29 07:54:16.040594] W [fuse-bridge.c:915:fuse_fd_cbk] 0-glusterfs-fuse: 4561142: OPEN() /SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log => -1 (Input/output error)

                                                          
                                                          Could somebody please give me some clue on where to begin? I checked
                                                          the xattrs on
                                                          /SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log
                                                          and it seems the changelogs are [0, 0] on both replicas, and the
                                                          gfids match.

                                                          Thank you very much for any help on this.
                                                          Anirban
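
                                                          For reading the changelog xattrs mentioned above, here is a minimal
                                                          sketch. It assumes the usual AFR encoding, in which each
                                                          trusted.afr.<volume>-client-<N> value is 12 bytes holding three
                                                          big-endian 32-bit pending counters (data, metadata, entry). The
                                                          function name is illustrative and is not part of any gluster tool.

```python
import struct


def decode_afr_changelog(hex_value: str) -> dict:
    """Split a trusted.afr.* value (hex, with or without 0x) into its counters.

    Assumes 12 bytes: data, metadata and entry pending counts, big-endian.
    """
    digits = hex_value.strip()
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes (three 32-bit counters)")
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}
```

                                                          An all-zero value decodes to zero pending operations of every kind,
                                                          which is what a [0, 0] changelog on both replicas corresponds to.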
                                                          

                                                          _______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users
                                                          
                                                        