Re: I/O error on replicated volume

Mohammed Rafi K C <rkavunga@xxxxxxxxxx> · Fri, 27 Mar 2015 13:54:39 +0530



    On 03/27/2015 11:04 AM, Jonathan Heese
      wrote:

    
      On Mar 27, 2015, at 1:29 AM, "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>
        wrote:

        
          When we change the transport from x to y, it should reflect in
          all the vol files. But unfortunately, the volume set command
          failed to change in nfs server, (of course it is a bug).  I
          had clearly mentioned in my previous mails, that changing the
          volume files using the volume set command is not recommended,
          i suggested this, just to check whether tcp work fine or not.

          
          The reason why you are getting rdma connection error is
          because , now bricks are running through tcp, so the brick
          process will be listening on socket port. But nfs-server asked
          for an rdma connection, so they are trying to connect from
          rdma port to tcp port. Obviously the connection will be
          rejected.

        
      Okay, thanks for the thorough explanation there.
      

      Now that we know that TCP does function without the original
        I/O errors (from the start of this thread), how do you suggest
        that I proceed?
      

      Do I have to wait for a subsequent release to rid myself of
        this bug?
    
    
    I will make sure to fix the bug. Also we can expect that rdma
    patches will be merged soon.  After rdma bug's are fixed in 3.5.x ,
    you can test and switch to rdma.

    
      Would it be feasible for me to switch from RDMA to TCP in a
        more permanent fashion (maybe wipe the cluster and start over?)?
    
    
    Either you can manually edit nfs-volfile, to change transport to
    "option transport-type tcp", in all the places, then restarting nfs
    will solve the problem. Or else and if possible you can start a
    fresh cluster running on tcp. 

    
    Rafi

    
      Thanks.
      

      Regards,
      Jon Heese
      

        Regards

          Rafi KC

           
          On 03/27/2015 12:28 AM, Jonathan
            Heese wrote:

          
              Rafi,
              

              Here is my nfs-server.vol file:
              

              [root@duke ~]# cat /var/lib/glusterd/nfs/nfs-server.vol

                volume gluster_disk-client-0

                    type protocol/client

                    option send-gids true

                    option password 562ab460-7754-4b5a-82e6-18ed6c130786

                    option username ad5d5754-cf02-4b96-9f85-ff3129ae0405

                    option transport-type rdma

                    option remote-subvolume /bricks/brick1

                    option remote-host duke-ib

                end-volume

                
                volume gluster_disk-client-1

                    type protocol/client

                    option send-gids true

                    option password 562ab460-7754-4b5a-82e6-18ed6c130786

                    option username ad5d5754-cf02-4b96-9f85-ff3129ae0405

                    option transport-type rdma

                    option remote-subvolume /bricks/brick1

                    option remote-host duchess-ib

                end-volume

                
                volume gluster_disk-replicate-0

                    type cluster/replicate

                    subvolumes gluster_disk-client-0
                gluster_disk-client-1

                end-volume

                
                volume gluster_disk-dht

                    type cluster/distribute

                    subvolumes gluster_disk-replicate-0

                end-volume

                
                volume gluster_disk-write-behind

                    type performance/write-behind

                    subvolumes gluster_disk-dht

                end-volume

                
                volume gluster_disk

                    type debug/io-stats

                    option count-fop-hits off

                    option latency-measurement off

                    subvolumes gluster_disk-write-behind

                end-volume

                
                volume nfs-server

                    type nfs/server

                    option nfs3.gluster_disk.volume-id
                2307a5a8-641e-44f4-8eaf-7cc2b704aafd

                    option rpc-auth.addr.gluster_disk.allow *

                    option nfs.drc off

                    option nfs.nlm on

                    option nfs.dynamic-volumes on

                    subvolumes gluster_disk

                end-volume

                
              I see that "transport-type rdma" is listed a couple
                times here, but "gluster volume info" indicates that the
                volume is using the tcp transport:
              

              [root@duke ~]# gluster volume info gluster_disk

                
                Volume Name: gluster_disk

                Type: Replicate

                Volume ID: 2307a5a8-641e-44f4-8eaf-7cc2b704aafd

                Status: Started

                Number of Bricks: 1 x 2 = 2

                Transport-type: tcp

                Bricks:

                Brick1: duke-ib:/bricks/brick1

                Brick2: duchess-ib:/bricks/brick1

                Options Reconfigured:

                config.transport: tcp

                
              Please let me know if you need any further information
                from me to determine how to correct this discrepancy.
              

              Also, I feel compelled to ask: Since the TCP
                connections are going over the InfiniBand connections
                between the two Gluster servers (based on the hostnames
                which are pointed to the IB IPs via hosts files), are
                there any (significant) drawbacks to using TCP instead
                of RDMA here?  Thanks.
              

              Regards,
              Jon Heese

              
                From: Mohammed
                    Rafi K C
                    <rkavunga@xxxxxxxxxx>

                    Sent: Monday, March 23, 2015 3:29 AM

                    To: Jonathan Heese

                    Cc: gluster-users

                    Subject: Re:  I/O error on
                    replicated volume
                   
                
                  On 03/23/2015 11:28 AM,
                    Jonathan Heese wrote:

                  
                    On Mar 23, 2015, at 1:20 AM, "Mohammed Rafi K
                      C" <rkavunga@xxxxxxxxxx>
                      wrote:

                      
                        On 03/21/2015 07:49
                          PM, Jonathan Heese wrote:

                        
                            Mohamed,
                            

                            I have completed the steps you suggested
                              (unmount all, stop the volume, set the
                              config.transport to tcp, start the volume,
                              mount, etc.), and the behavior has indeed
                              changed.
                            

                            [root@duke ~]# gluster volume info

                              
                              Volume Name: gluster_disk

                              Type: Replicate

                              Volume ID:
                              2307a5a8-641e-44f4-8eaf-7cc2b704aafd

                              Status: Started

                              Number of Bricks: 1 x 2 = 2

                              Transport-type: tcp

                              Bricks:

                              Brick1: duke-ib:/bricks/brick1

                              Brick2: duchess-ib:/bricks/brick1

                              Options Reconfigured:

                              config.transport: tcp
                            

                              [root@duke ~]# gluster volume status

                              Status of volume: gluster_disk

                              Gluster
                              process                                        
                              Port    Online  Pid

------------------------------------------------------------------------------

                              Brick
                              duke-ib:/bricks/brick1                           
                              49152   Y       16362

                              Brick
                              duchess-ib:/bricks/brick1                        
                              49152   Y       14155

                              NFS Server on
                              localhost                                
                              2049    Y       16374

                              Self-heal Daemon on
                              localhost                          
                              N/A     Y       16381

                              NFS Server on
                              duchess-ib                               
                              2049    Y       14167

                              Self-heal Daemon on
                              duchess-ib                         
                              N/A     Y       14174

                              
                              Task Status of Volume gluster_disk

------------------------------------------------------------------------------

                              There are no active volume tasks

                              
                            I am no longer seeing the I/O errors
                              during prolonged periods of write I/O that
                              I was seeing when the transport was set to
                              rdma. However, I am seeing this message on
                              both nodes every 3 seconds (almost
                              exactly):
                            

                            ==> /var/log/glusterfs/nfs.log <==

                              [2015-03-21 14:17:40.379719] W
                              [rdma.c:1076:gf_rdma_cm_event_handler]
                              0-gluster_disk-client-1: cma event
                              RDMA_CM_EVENT_REJECTED, error 8
                              (me:10.10.10.1:1023 peer:10.10.10.2:49152)

                            
                            Is this something to worry about? 
                          
                        
                        If you are not using nfs to export the volumes,
                        there is nothing to worry. 

                      
                    I'm using the native glusterfs FUSE component to
                    mount the volume locally on both servers -- I assume
                    that you're referring to the standard NFS protocol
                    stuff, which I'm not using here.
                    

                    Incidentally, I would like to keep my logs from
                      filling up with junk if possible.  Is there
                      something I can do to get rid of these (useless?)
                      error messages?

                    
                  If i understand correctly, you are getting this
                  enormous log message from nfs log only, all other logs
                  and everything are fine now, right ? If that is the
                  case, and you are not at all using nfs for exporting
                  the volume, as  a workaround you can disable nfs for
                  your volume or cluster. (gluster v set nfs.disable
                  on). This will turnoff your gluster nfs server, and
                  you will no longer get those log messages.

                  
                                Any idea why there are rdma pieces in
                                  play when I've set my transport to
                                  tcp?
                              
                            
                            there should not be any piece of rdma,if
                            possible, can you paste the volfile for nfs
                            server. You can find the volfile in
                            /var/lib/glusterd/nfs/nfs-server.vol or
                            /usr/local/var/lib/glusterd/nfs/nfs-server.vol

                          
                        I will get this for you when I can.
                           Thanks.
                      
                    
                  If you can make it, that will be great help to
                  understand the problem.

                  
                  Rafi KC

                  
                        Regards,
                        Jon Heese
                        

                          Rafi KC

                            
                                The actual I/O appears to be handled
                                  properly and I've seen no further
                                  errors in the testing I've done so
                                  far.
                                

                                Thanks.

                                
                                Regards,
                                Jon Heese
                                

                                  From:
                                      
gluster-users-bounces@xxxxxxxxxxx 
<gluster-users-bounces@xxxxxxxxxxx> on behalf of Jonathan
                                      Heese 
                                        <jheese@xxxxxxxxx>

                                      Sent: Friday, March 20,
                                      2015 7:04 AM

                                      To: Mohammed Rafi K C

                                      Cc: gluster-users

                                      Subject: Re:
                                       I/O error on
                                      replicated volume
                                     
                                  
                                    Mohammed,
                                    

                                    Thanks very much for the reply.
                                       I will try that and report back.

                                      
                                      Regards,
                                      Jon Heese
                                    
                                    
                                      On Mar 20, 2015, at 3:26 AM,
                                      "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>
                                      wrote:

                                      
                                        On
                                          03/19/2015 10:16 PM, Jonathan
                                          Heese wrote:

                                        
                                            Hello
                                                  all,
                                             
                                            Does
                                                anyone else have any
                                                further suggestions for
                                                troubleshooting this?
                                             
                                            To
                                                sum up: I have a 2 node
                                                2 brick replicated
                                                volume, which holds a
                                                handful of iSCSI image
                                                files which are mounted
                                                and served up by tgtd
                                                (CentOS 6) to a handful
                                                of devices on a
                                                dedicated iSCSI
                                                network.  The most
                                                important iSCSI clients
                                                (initiators) are four
                                                VMware ESXi 5.5 hosts
                                                that use the iSCSI
                                                volumes as backing for
                                                their datastores for
                                                virtual machine storage.
                                             
                                            After
                                                a few minutes of
                                                sustained writing to the
                                                volume, I am seeing a
                                                massive flood (over 1500
                                                per second at times) of
                                                this error in
                                                /var/log/glusterfs/mnt-gluster-disk.log:
                                            [2015-03-16
                                                02:24:07.582801] W
                                                [fuse-bridge.c:2242:fuse_writev_cbk]
                                                0-glusterfs-fuse:
                                                635358: WRITE => -1
                                                (Input/output error)
                                             
                                            When
                                                this happens, the ESXi
                                                box fails its write
                                                operation and returns an
                                                error to the effect of
                                                “Unable to write data to
                                                datastore”.  I don’t see
                                                anything else in the
                                                supporting logs to
                                                explain the root cause
                                                of the i/o errors.
                                             
                                            Any
                                                and all suggestions are
                                                appreciated.  Thanks.
                                             
                                          
                                        From the mount logs, i assume
                                        that your volume transport type
                                        is rdma. There are some known
                                        issues for rdma in 3.5.3, and
                                        the patch for to address those
                                        issues are already send to
                                        upstream [1]. From the logs, I'm
                                        not sure and it is hard to tell
                                        you whether this problem is
                                        something related to rdma
                                        transport or not. To make sure
                                        that the tcp transport is works
                                        well in this scenario, if
                                        possible can you try to
                                        reproduce the same using tcp
                                        type volumes. You can change the
                                        transport type of volume by
                                        doing the following step ( not
                                        recommended in normal use case).

                                        
                                        1) unmount every client

                                        2) stop the volume

                                        3) run gluster volume set
                                        volname config.transport tcp

                                        4) start the volume again

                                        5) mount the clients

                                        
                                        [1] : 
                                          http://goo.gl/2PTL61

                                        
                                        Regards

                                        Rafi KC

                                        
                                              Jon
                                                    Heese

                                                Systems
                                                    Engineer

                                                INetU
                                                    Managed Hosting

                                                P:
                                                  610.266.7441 x 261

                                                F:
                                                  610.266.7434

                                                www.inetu.net
                                              **
                                                    This message
                                                    contains
                                                    confidential
                                                    information, which
                                                    also may be
                                                    privileged, and is
                                                    intended only for
                                                    the person(s)
                                                    addressed above. Any
                                                    unauthorized use,
                                                    distribution,
                                                    copying or
                                                    disclosure of
                                                    confidential and/or
                                                    privileged
                                                    information is
                                                    strictly prohibited.
                                                    If you have received
                                                    this communication
                                                    in error, please
                                                    erase all copies of
                                                    the message and its
                                                    attachments and
                                                    notify the sender
                                                    immediately via
                                                    reply e-mail. **
                                            
                                             
                                                From:
                                                    Jonathan Heese
                                                    

                                                    Sent:
                                                    Tuesday, March 17,
                                                    2015 12:36 PM

                                                    To:
                                                    'Ravishankar N'; 
gluster-users@xxxxxxxxxxx

                                                    Subject: RE:
                                                     I/O
                                                    error on replicated
                                                    volume
                                              
                                            
                                            Ravi,
                                             
                                            The
                                                last lines in the mount
                                                log before the massive
                                                vomit of I/O errors are
                                                from 22 minutes prior,
                                                and seem innocuous to
                                                me:
                                             
                                            [2015-03-16
                                                01:37:07.126340] E
                                                [client-handshake.c:1760:client_query_portmap_cbk]
                                                0-gluster_disk-client-0:
                                                failed to get the port
                                                number for remote
                                                subvolume. Please run
                                                'gluster volume status'
                                                on server to see if
                                                brick process is
                                                running.
                                            [2015-03-16
                                                01:37:07.126587] W
                                                [rdma.c:4273:gf_rdma_disconnect]
                                                (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f)
                                                [0x7fd9c557bccf]
                                                (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)
                                                [0x7fd9c557a995]
                                                (-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)

                                                [0x7fd9c0d8fb9a])))
                                                0-gluster_disk-client-0:
                                                disconnect called
                                                (peer:10.10.10.1:24008)
                                            [2015-03-16
                                                01:37:07.126687] E
                                                [client-handshake.c:1760:client_query_portmap_cbk]
                                                0-gluster_disk-client-1:
                                                failed to get the port
                                                number for remote
                                                subvolume. Please run
                                                'gluster volume status'
                                                on server to see if
                                                brick process is
                                                running.
                                            [2015-03-16
                                                01:37:07.126737] W
                                                [rdma.c:4273:gf_rdma_disconnect]
                                                (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f)
                                                [0x7fd9c557bccf]
                                                (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)
                                                [0x7fd9c557a995]
                                                (-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)

                                                [0x7fd9c0d8fb9a])))
                                                0-gluster_disk-client-1:
                                                disconnect called
                                                (peer:10.10.10.2:24008)
                                            [2015-03-16
                                                01:37:10.730165] I
                                                [rpc-clnt.c:1729:rpc_clnt_reconfig]
                                                0-gluster_disk-client-0:
                                                changing port to 49152
                                                (from 0)
                                            [2015-03-16
                                                01:37:10.730276] W
                                                [rdma.c:4273:gf_rdma_disconnect]
                                                (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f)
                                                [0x7fd9c557bccf]
                                                (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)
                                                [0x7fd9c557a995]
                                                (-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)

                                                [0x7fd9c0d8fb9a])))
                                                0-gluster_disk-client-0:
                                                disconnect called
                                                (peer:10.10.10.1:24008)
                                            [2015-03-16
                                                01:37:10.739500] I
                                                [rpc-clnt.c:1729:rpc_clnt_reconfig]
                                                0-gluster_disk-client-1:
                                                changing port to 49152
                                                (from 0)
                                            [2015-03-16
                                                01:37:10.739560] W
                                                [rdma.c:4273:gf_rdma_disconnect]
                                                (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f)
                                                [0x7fd9c557bccf]
                                                (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)
                                                [0x7fd9c557a995]
                                                (-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)

                                                [0x7fd9c0d8fb9a])))
                                                0-gluster_disk-client-1:
                                                disconnect called
                                                (peer:10.10.10.2:24008)
                                            [2015-03-16
                                                01:37:10.741883] I
                                                [client-handshake.c:1677:select_server_supported_programs]
                                                0-gluster_disk-client-0:
                                                Using Program GlusterFS
                                                3.3, Num (1298437),
                                                Version (330)
                                            [2015-03-16
                                                01:37:10.744524] I
                                                [client-handshake.c:1462:client_setvolume_cbk]
                                                0-gluster_disk-client-0:
                                                Connected to
                                                10.10.10.1:49152,
                                                attached to remote
                                                volume '/bricks/brick1'.
                                            [2015-03-16
                                                01:37:10.744537] I
                                                [client-handshake.c:1474:client_setvolume_cbk]
                                                0-gluster_disk-client-0:
                                                Server and Client
                                                lk-version numbers are
                                                not same, reopening the
                                                fds
                                            [2015-03-16
                                                01:37:10.744566] I
                                                [afr-common.c:4267:afr_notify]
                                                0-gluster_disk-replicate-0:
                                                Subvolume
                                                'gluster_disk-client-0'
                                                came back up; going
                                                online.
                                            [2015-03-16
                                                01:37:10.744627] I
                                                [client-handshake.c:450:client_set_lk_version_cbk]
                                                0-gluster_disk-client-0:
                                                Server lk version = 1
                                            [2015-03-16
                                                01:37:10.753037] I
                                                [client-handshake.c:1677:select_server_supported_programs]
                                                0-gluster_disk-client-1:
                                                Using Program GlusterFS
                                                3.3, Num (1298437),
                                                Version (330)
                                            [2015-03-16
                                                01:37:10.755657] I
                                                [client-handshake.c:1462:client_setvolume_cbk]
                                                0-gluster_disk-client-1:
                                                Connected to
                                                10.10.10.2:49152,
                                                attached to remote
                                                volume '/bricks/brick1'.
                                            [2015-03-16
                                                01:37:10.755676] I
                                                [client-handshake.c:1474:client_setvolume_cbk]
                                                0-gluster_disk-client-1:
                                                Server and Client
                                                lk-version numbers are
                                                not same, reopening the
                                                fds
                                            [2015-03-16
                                                01:37:10.761945] I
                                                [fuse-bridge.c:5016:fuse_graph_setup]
                                                0-fuse: switched to
                                                graph 0
                                            [2015-03-16
                                                01:37:10.762144] I
                                                [client-handshake.c:450:client_set_lk_version_cbk]
                                                0-gluster_disk-client-1:
                                                Server lk version = 1
                                            [2015-03-16
                                                  01:37:10.762279] I
                                                [fuse-bridge.c:3953:fuse_init]
                                                0-glusterfs-fuse: FUSE
                                                inited with protocol
                                                versions: glusterfs 7.22
                                                kernel 7.14
                                            [2015-03-16
                                                  01:59:26.098670] W
                                                [fuse-bridge.c:2242:fuse_writev_cbk]
                                                0-glusterfs-fuse:
                                                292084: WRITE => -1
                                                (Input/output error)
                                            …
                                             
                                            I’ve
                                                seen no indication of
                                                split-brain on any files
                                                at any point in this
                                                (ever since downdating
                                                from 3.6.2 to 3.5.3,
                                                which is when this
                                                particular issue
                                                started):
                                            [root@duke
                                                gfapi-module-for-linux-target-driver-]#
                                                gluster v heal
                                                gluster_disk info
                                            Brick
duke.jonheese.local:/bricks/brick1/
                                            Number
                                                of entries: 0
                                             
                                            Brick
duchess.jonheese.local:/bricks/brick1/
                                            Number
                                                of entries: 0
                                             
                                            Thanks.
                                             
                                            
                                              Jon
                                                    Heese

                                                Systems
                                                    Engineer

                                                INetU
                                                    Managed Hosting

                                                P:
                                                  610.266.7441 x 261

                                                F:
                                                  610.266.7434

                                                www.inetu.net
                                              **
                                                    This message
                                                    contains
                                                    confidential
                                                    information, which
                                                    also may be
                                                    privileged, and is
                                                    intended only for
                                                    the person(s)
                                                    addressed above. Any
                                                    unauthorized use,
                                                    distribution,
                                                    copying or
                                                    disclosure of
                                                    confidential and/or
                                                    privileged
                                                    information is
                                                    strictly prohibited.
                                                    If you have received
                                                    this communication
                                                    in error, please
                                                    erase all copies of
                                                    the message and its
                                                    attachments and
                                                    notify the sender
                                                    immediately via
                                                    reply e-mail. **
                                            
                                             
                                                From:
                                                    Ravishankar N [mailto:ravishankar@xxxxxxxxxx]
                                                    

                                                    Sent:
                                                    Tuesday, March 17,
                                                    2015 12:35 AM

                                                    To: Jonathan
                                                    Heese; gluster-users@xxxxxxxxxxx

                                                    Subject: Re:
                                                     I/O
                                                    error on replicated
                                                    volume
                                              
                                            
                                              On
                                                03/17/2015 02:14 AM,
                                                Jonathan Heese wrote:
                                            
                                            
                                                  Hello,

                                                      
                                                      So I resolved my
                                                      previous issue
                                                      with split-brains
                                                      and the lack of
                                                      self-healing by
                                                      dropping my
                                                      installed
                                                      glusterfs*
                                                      packages from
                                                      3.6.2 to 3.5.3,
                                                      but now I've
                                                      picked up a new
                                                      issue, which
                                                      actually makes
                                                      normal use of the
                                                      volume practically
                                                      impossible.

                                                      
                                                      A little
                                                      background for
                                                      those not already
                                                      paying close
                                                      attention:

                                                      I have a 2 node 2
                                                      brick replicating
                                                      volume whose
                                                      purpose in life is
                                                      to hold iSCSI
                                                      target files,
                                                      primarily for use
                                                      to provide
                                                      datastores to a
                                                      VMware ESXi
                                                      cluster.  The plan
                                                      is to put a
                                                      handful of image
                                                      files on the
                                                      Gluster volume,
                                                      mount them locally
                                                      on both Gluster
                                                      nodes, and run
                                                      tgtd on both,
                                                      pointed to the
                                                      image files on the
                                                      mounted gluster
                                                      volume. Then the
                                                      ESXi boxes will
                                                      use multipath
                                                      (active/passive)
                                                      iSCSI to connect
                                                      to the nodes, with
                                                      automatic failover
                                                      in case of planned
                                                      or unplanned
                                                      downtime of the
                                                      Gluster nodes.

                                                      
                                                      In my most recent
                                                      round of testing
                                                      with 3.5.3, I'm
                                                      seeing a massive
                                                      failure to write
                                                      data to the volume
                                                      after about 5-10
                                                      minutes, so I've
                                                      simplified the
                                                      scenario a bit (to
                                                      minimize the
                                                      variables) to:
                                                      both Gluster nodes
                                                      up, only one node
                                                      (duke) mounted and
                                                      running tgtd, and
                                                      just regular
                                                      (single path)
                                                      iSCSI from a
                                                      single ESXi
                                                      server.

                                                      
                                                      About 5-10 minutes
                                                      into migration a
                                                      VM onto the test
                                                      datastore,
                                                      /var/log/messages
                                                      on duke gets
                                                      blasted with a ton
                                                      of messages
                                                      exactly like this:
                                                  Mar
                                                    15 22:24:06 duke
                                                    tgtd:
                                                    bs_rdwr_request(180)
                                                    io error 0x1781e00
                                                    2a -1 512 22971904,
                                                    Input/output error
                                                   
                                                  And
                                                    /var/log/glusterfs/mnt-gluster_disk.log
                                                    gets blased with a
                                                    ton of messages
                                                    exactly like this:
                                                  [2015-03-16
                                                    02:24:07.572279] W
                                                    [fuse-bridge.c:2242:fuse_writev_cbk]
                                                    0-glusterfs-fuse:
                                                    635299: WRITE =>
                                                    -1 (Input/output
                                                    error)
                                                   
                                                
                                                Are there any messages
                                                in the mount log from
                                                AFR about split-brain
                                                just before the above
                                                line appears?

                                                Does `gluster v heal
                                                <VOLNAME> info`
                                                show any files?
                                                Performing I/O on files
                                                that are in split-brain
                                                fail with EIO.

                                                
                                                -Ravi

                                                
                                                  And
                                                    the write operation
                                                    from VMware's side
                                                    fails as soon as
                                                    these messages
                                                    start.
                                                   
                                                  I
                                                    don't see any other
                                                    errors (in the log
                                                    files I know of)
                                                    indicating the root
                                                    cause of these i/o
                                                    errors.  I'm sure
                                                    that this is not
                                                    enough information
                                                    to tell what's going
                                                    on, but can anyone
                                                    help me figure out
                                                    what to look at next
                                                    to figure this out?
                                                   
                                                  I've
                                                    also considered
                                                    using Dan
                                                    Lambright's libgfapi
                                                    gluster module for
                                                    tgtd (or something
                                                    similar) to avoid
                                                    going through FUSE,
                                                    but I'm not sure
                                                    whether that would
                                                    be irrelevant to
                                                    this problem, since
                                                    I'm not 100% sure if
                                                    it lies in FUSE or
                                                    elsewhere.
                                                   
                                                  Thanks!
                                                   
                                                  Jon
                                                        Heese

                                                    Systems
                                                        Engineer

                                                    INetU
                                                        Managed Hosting

                                                    P:
                                                      610.266.7441 x 261

                                                    F:
                                                      610.266.7434

                                                    www.inetu.net
                                                  **
                                                        This message
                                                        contains
                                                        confidential
                                                        information,
                                                        which also may
                                                        be privileged,
                                                        and is intended
                                                        only for the
                                                        person(s)
                                                        addressed above.
                                                        Any unauthorized
                                                        use,
                                                        distribution,
                                                        copying or
                                                        disclosure of
                                                        confidential
                                                        and/or
                                                        privileged
                                                        information is
                                                        strictly
                                                        prohibited. If
                                                        you have
                                                        received this
                                                        communication in
                                                        error, please
                                                        erase all copies
                                                        of the message
                                                        and its
                                                        attachments and
                                                        notify the
                                                        sender
                                                        immediately via
                                                        reply e-mail. **
                                                   
                                                
                                              _______________________________________________
                                              Gluster-users mailing list
                                              Gluster-users@xxxxxxxxxxx
                                              http://www.gluster.org/mailman/listinfo/gluster-users
                                            
                                             
                                          _______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
                                        
                                        
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users