Re: libgfapi failover problem on replica bricks

Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> · Mon, 01 Sep 2014 13:57:27 +0530



    On 09/01/2014 12:56 PM, Roman wrote:

    
      Hmm, I don't know how, but both VMs survived the second server outage :) I still haven't seen any message about healing completion anywhere :)
    
    Healing can be performed by:

    1) Mount process (/path/to/mount/log/)

    2) Self-heal daemons on either of the bricks
    (/var/log/glusterfs/glustershd.log)

    
    Check whether there are any messages in either of these logs.
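
A minimal sketch of that check, assuming heal-related log lines contain the word "heal" (the exact message text varies between GlusterFS versions, so the pattern is only a starting point). The function names are hypothetical; the mount log path has to be filled in for the client in question.

```python
import re

def heal_lines(lines, pattern="heal"):
    """Return the lines that match `pattern`, case-insensitively."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [ln.rstrip("\n") for ln in lines if rx.search(ln)]

def scan(path, pattern="heal"):
    """Scan one log file; an unreadable or missing file yields an empty list."""
    try:
        with open(path, errors="replace") as fh:
            return heal_lines(fh, pattern)
    except OSError:
        return []

if __name__ == "__main__":
    # The self-heal daemon log named above; add the mount log of the client too.
    for log in ["/var/log/glusterfs/glustershd.log"]:
        for line in scan(log):
            print(f"{log}: {line}")
```

Running it on each brick host and on the machine that holds the mount covers both places where healing can happen.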

    
    Pranith

    
        2014-09-01 10:13 GMT+03:00 Roman <romeo.r@xxxxxxxxx>:

          
            The mount is on the proxmox machine. 
              

              Here are the logs from disconnection until reconnection:
              

                [2014-09-01 06:19:38.059383] W [socket.c:522:__socket_rwv] 0-glusterfs: readv on 10.250.0.1:24007 failed (Connection timed out)
                [2014-09-01 06:19:40.338393] W [socket.c:522:__socket_rwv] 0-HA-2TB-TT-Proxmox-cluster-client-0: readv on 10.250.0.1:49159 failed (Connection timed out)
                [2014-09-01 06:19:40.338447] I [client.c:2229:client_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-client-0: disconnected from 10.250.0.1:49159. Client process will keep trying to connect to glusterd until brick's port is available
                [2014-09-01 06:19:49.196768] E [socket.c:2161:socket_connect_finish] 0-glusterfs: connection to 10.250.0.1:24007 failed (No route to host)
                [2014-09-01 06:20:05.565444] E [socket.c:2161:socket_connect_finish] 0-HA-2TB-TT-Proxmox-cluster-client-0: connection to 10.250.0.1:24007 failed (No route to host)
                [2014-09-01 06:23:26.607180] I [rpc-clnt.c:1729:rpc_clnt_reconfig] 0-HA-2TB-TT-Proxmox-cluster-client-0: changing port to 49159 (from 0)
                [2014-09-01 06:23:26.608032] I [client-handshake.c:1677:select_server_supported_programs] 0-HA-2TB-TT-Proxmox-cluster-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
                [2014-09-01 06:23:26.608395] I [client-handshake.c:1462:client_setvolume_cbk] 0-HA-2TB-TT-Proxmox-cluster-client-0: Connected to 10.250.0.1:49159, attached to remote volume '/exports/HA-2TB-TT-Proxmox-cluster/2TB'.
                [2014-09-01 06:23:26.608420] I [client-handshake.c:1474:client_setvolume_cbk] 0-HA-2TB-TT-Proxmox-cluster-client-0: Server and Client lk-version numbers are not same, reopening the fds
                [2014-09-01 06:23:26.608606] I [client-handshake.c:450:client_set_lk_version_cbk] 0-HA-2TB-TT-Proxmox-cluster-client-0: Server lk version = 1
                [2014-09-01 06:23:40.604979] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
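
The length of the outage as seen by the client can be read off the log above. As a rough sketch, the snippet below takes the first "disconnected from" line and the next "Connected to" line and subtracts their timestamps. The two sample lines are unwrapped copies of lines from this log; the function names are hypothetical, and matching on those two phrases is an assumption based only on the messages shown here.

```python
import re
from datetime import datetime

# Unwrapped copies of two client-log lines quoted in this thread.
SAMPLE = """\
[2014-09-01 06:19:40.338447] I [client.c:2229:client_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-client-0: disconnected from 10.250.0.1:49159. Client process will keep trying to connect to glusterd until brick's port is available
[2014-09-01 06:23:26.608395] I [client-handshake.c:1462:client_setvolume_cbk] 0-HA-2TB-TT-Proxmox-cluster-client-0: Connected to 10.250.0.1:49159, attached to remote volume '/exports/HA-2TB-TT-Proxmox-cluster/2TB'.
"""

# Leading "[YYYY-MM-DD HH:MM:SS.ffffff]" timestamp of a log line.
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\]")

def timestamp(line):
    """Parse the leading timestamp, or return None if the line has none."""
    m = TS.match(line.strip())
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f") if m else None

def outage_seconds(text):
    """Seconds between the first disconnect and the following reconnect."""
    down = None
    for line in text.splitlines():
        ts = timestamp(line)
        if ts is None:
            continue
        if down is None and "disconnected from" in line:
            down = ts
        elif down is not None and "Connected to" in line:
            return (ts - down).total_seconds()
    return None  # no complete disconnect/reconnect pair found

print(f"brick unreachable for about {outage_seconds(SAMPLE):.1f} s")
```

For the two lines above this comes to roughly 226 seconds, i.e. a little under four minutes between losing the brick and reattaching to it.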
              
              
              Now there is no healing traffic either. I could try disconnecting the second server now to see whether it fails over. I don't really believe it will :(
              

              Here are some logs from the stor1 server (the one I disconnected):
              
                root@stor1:~# cat /var/log/glusterfs/bricks/exports-HA-2TB-TT-Proxmox-cluster-2TB.log
                [2014-09-01 06:19:26.403323] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom pve1-298005-2014/08/28-19:41:19:7269-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:19:26.403399] I [server-helpers.c:289:do_fd_cleanup] 0-HA-2TB-TT-Proxmox-cluster-server: fd cleanup on /images/112/vm-112-disk-1.raw
                [2014-09-01 06:19:26.403486] I [client_t.c:417:gf_client_unref] 0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection pve1-298005-2014/08/28-19:41:19:7269-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:19:29.475318] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom stor2-22775-2014/08/28-19:26:34:786262-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:19:29.475373] I [client_t.c:417:gf_client_unref] 0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection stor2-22775-2014/08/28-19:26:34:786262-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:19:36.963318] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom stor2-22777-2014/08/28-19:26:34:791148-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:19:36.963373] I [client_t.c:417:gf_client_unref] 0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection stor2-22777-2014/08/28-19:26:34:791148-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:19:40.419298] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom pve1-289547-2014/08/28-19:27:22:605477-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:19:40.419355] I [client_t.c:417:gf_client_unref] 0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection pve1-289547-2014/08/28-19:27:22:605477-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:19:42.531310] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom sisemon-141844-2014/08/28-19:27:19:824141-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:19:42.531368] I [client_t.c:417:gf_client_unref] 0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection sisemon-141844-2014/08/28-19:27:19:824141-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:23:25.088518] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from sisemon-141844-2014/08/28-19:27:19:824141-HA-2TB-TT-Proxmox-cluster-client-0-0-1 (version: 3.5.2)
                [2014-09-01 06:23:25.532734] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from stor2-22775-2014/08/28-19:26:34:786262-HA-2TB-TT-Proxmox-cluster-client-0-0-1 (version: 3.5.2)
                [2014-09-01 06:23:26.608074] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from pve1-289547-2014/08/28-19:27:22:605477-HA-2TB-TT-Proxmox-cluster-client-0-0-1 (version: 3.5.2)
                [2014-09-01 06:23:27.187556] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from pve1-298005-2014/08/28-19:41:19:7269-HA-2TB-TT-Proxmox-cluster-client-0-0-1 (version: 3.5.2)
                [2014-09-01 06:23:27.213890] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from stor2-22777-2014/08/28-19:26:34:791148-HA-2TB-TT-Proxmox-cluster-client-0-0-1 (version: 3.5.2)
                [2014-09-01 06:23:31.222654] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from pve1-494566-2014/08/29-01:00:13:257498-HA-2TB-TT-Proxmox-cluster-client-0-0-1 (version: 3.5.2)
                [2014-09-01 06:23:52.591365] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom pve1-494566-2014/08/29-01:00:13:257498-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:23:52.591447] W [inodelk.c:392:pl_inodelk_log_cleanup] 0-HA-2TB-TT-Proxmox-cluster-server: releasing lock on 14f70955-5e1e-4499-b66b-52cd50892315 held by {client=0x7f2494001ed0, pid=0 lk-owner=bc3ddbdbae7f0000}
                [2014-09-01 06:23:52.591568] I [server-helpers.c:289:do_fd_cleanup] 0-HA-2TB-TT-Proxmox-cluster-server: fd cleanup on /images/124/vm-124-disk-1.qcow2
                [2014-09-01 06:23:52.591679] I [client_t.c:417:gf_client_unref] 0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection pve1-494566-2014/08/29-01:00:13:257498-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:23:58.709444] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from stor1-3975-2014/09/01-06:23:58:673930-HA-2TB-TT-Proxmox-cluster-client-0-0-0 (version: 3.5.2)
                [2014-09-01 06:24:00.741542] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom stor1-3975-2014/09/01-06:23:58:673930-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:24:00.741598] I [client_t.c:417:gf_client_unref] 0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection stor1-3975-2014/09/01-06:23:58:673930-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:30:06.010819] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from stor1-4030-2014/09/01-06:30:05:976735-HA-2TB-TT-Proxmox-cluster-client-0-0-0 (version: 3.5.2)
                [2014-09-01 06:30:08.056059] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom stor1-4030-2014/09/01-06:30:05:976735-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:30:08.056127] I [client_t.c:417:gf_client_unref] 0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection stor1-4030-2014/09/01-06:30:05:976735-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:36:54.307743] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from stor1-4077-2014/09/01-06:36:54:289911-HA-2TB-TT-Proxmox-cluster-client-0-0-0 (version: 3.5.2)
                [2014-09-01 06:36:56.340078] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom stor1-4077-2014/09/01-06:36:54:289911-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:36:56.340122] I [client_t.c:417:gf_client_unref] 0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection stor1-4077-2014/09/01-06:36:54:289911-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:46:53.601517] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from stor2-6891-2014/09/01-06:46:53:583529-HA-2TB-TT-Proxmox-cluster-client-0-0-0 (version: 3.5.2)
                [2014-09-01 06:46:55.624705] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom stor2-6891-2014/09/01-06:46:53:583529-HA-2TB-TT-Proxmox-cluster-client-0-0-0
                [2014-09-01 06:46:55.624793] I [client_t.c:417:gf_client_unref] 0-HA-2TB-TT-Proxmox-cluster-server: Shutting down connection stor2-6891-2014/09/01-06:46:53:583529-HA-2TB-TT-Proxmox-cluster-client-0-0-0
              
              
              The last two lines are pretty unclear. Why did it disconnect?
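
One way to look at this is to measure how long each client stayed connected to the brick. The sketch below pairs every "accepted client from" entry with the later "disconnecting connectionfrom" entry for the same client id and prints the lifetime. It does not explain the cause; it only helps to tell clients that connect for a couple of seconds from long-running mounts. The sample lines are unwrapped copies from the brick log above, and the regular expression and function name are assumptions made for this illustration.

```python
import re
from datetime import datetime

# Unwrapped copies of four brick-log lines quoted in this thread.
SAMPLE = """\
[2014-09-01 06:23:58.709444] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from stor1-3975-2014/09/01-06:23:58:673930-HA-2TB-TT-Proxmox-cluster-client-0-0-0 (version: 3.5.2)
[2014-09-01 06:24:00.741542] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom stor1-3975-2014/09/01-06:23:58:673930-HA-2TB-TT-Proxmox-cluster-client-0-0-0
[2014-09-01 06:46:53.601517] I [server-handshake.c:575:server_setvolume] 0-HA-2TB-TT-Proxmox-cluster-server: accepted client from stor2-6891-2014/09/01-06:46:53:583529-HA-2TB-TT-Proxmox-cluster-client-0-0-0 (version: 3.5.2)
[2014-09-01 06:46:55.624705] I [server.c:520:server_rpc_notify] 0-HA-2TB-TT-Proxmox-cluster-server: disconnecting connectionfrom stor2-6891-2014/09/01-06:46:53:583529-HA-2TB-TT-Proxmox-cluster-client-0-0-0
"""

# Timestamp, event phrase and client id; "connectionfrom" is spelled as in the log.
LINE = re.compile(
    r"^\[(?P<ts>[\d-]+ [\d:.]+)\].*?"
    r"(?P<event>accepted client from|disconnecting connection ?from)\s+"
    r"(?P<client>\S+)"
)

def connection_lifetimes(text):
    """Map client id -> seconds between its accept and its disconnect."""
    opened, lifetimes = {}, {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        client = m.group("client")
        if m.group("event").startswith("accepted"):
            opened[client] = ts
        elif client in opened:
            lifetimes[client] = (ts - opened.pop(client)).total_seconds()
    return lifetimes

for client, seconds in connection_lifetimes(SAMPLE).items():
    print(f"{seconds:5.2f} s  {client}")
```

Both sample clients were connected for about two seconds. Note that the client ids of the reconnecting mounts change their last digit (`-0-0-0` before the outage, `-0-0-1` after it), so this simple pairing will not match those entries.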
              

              2014-09-01 9:41 GMT+03:00 Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>:
                
                  
                          On 09/01/2014 12:08 PM, Roman wrote:

                          
                            Well, as far as I can tell, the VMs are not much affected by the healing process. At least the munin server, which runs with a pretty high load (the load average rarely goes below 0.9 :) ), had no problems. To create some more load I copied a 590 MB file on the VM's disk. It took 22 seconds, which is about 27 MB/s or 214 Mbit/s.
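
The arithmetic behind those figures, as a small sketch (it assumes 1 MB = 10^6 bytes and 8 bits per byte; the function name is made up for the example):

```python
def throughput(size_mb, seconds):
    """Return (MB/s, Mbit/s) for copying `size_mb` megabytes in `seconds`."""
    mb_per_s = size_mb / seconds
    return mb_per_s, mb_per_s * 8  # 8 bits per byte

mb_s, mbit_s = throughput(590, 22)
print(f"{mb_s:.1f} MB/s = {mbit_s:.1f} Mbit/s")  # prints "26.8 MB/s = 214.5 Mbit/s"
```

That is roughly a fifth of what a 1 Gbit/s link can carry in theory.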
                               

                              The servers are connected via a 10 Gbit network. The Proxmox client is connected to the servers through a separate 1 Gbit interface. We are thinking of moving it to 10 Gbit as well.

                                Here is some heal info, which is pretty confusing.

                              Right after the first server restored its connection, it was pretty clear:
                              

                                root@stor1:~# gluster volume heal HA-2TB-TT-Proxmox-cluster info
                                Brick stor1:/exports/HA-2TB-TT-Proxmox-cluster/2TB/
                                /images/124/vm-124-disk-1.qcow2 - Possibly undergoing heal
                                Number of entries: 1

                                Brick stor2:/exports/HA-2TB-TT-Proxmox-cluster/2TB/
                                /images/124/vm-124-disk-1.qcow2 - Possibly undergoing heal
                                /images/112/vm-112-disk-1.raw - Possibly undergoing heal
                                Number of entries: 2
                              
                              
                              Some time later it says:

                                root@stor1:~# gluster volume heal HA-2TB-TT-Proxmox-cluster info
                                Brick stor1:/exports/HA-2TB-TT-Proxmox-cluster/2TB/
                                Number of entries: 0

                                Brick stor2:/exports/HA-2TB-TT-Proxmox-cluster/2TB/
                                Number of entries: 0
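
To follow the heal over time, the `heal ... info` output can be reduced to a per-brick count of pending entries. The sketch below parses text in the layout shown in this message (a "Brick" line, optional entry lines, then "Number of entries: N"); other GlusterFS versions may print it differently, and the function name is hypothetical. The sample is the earlier of the two outputs, unwrapped.

```python
SAMPLE = """\
Brick stor1:/exports/HA-2TB-TT-Proxmox-cluster/2TB/
/images/124/vm-124-disk-1.qcow2 - Possibly undergoing heal
Number of entries: 1

Brick stor2:/exports/HA-2TB-TT-Proxmox-cluster/2TB/
/images/124/vm-124-disk-1.qcow2 - Possibly undergoing heal
/images/112/vm-112-disk-1.raw - Possibly undergoing heal
Number of entries: 2
"""

def parse_heal_info(text):
    """Return {brick: {"entries": [...], "count": int or None}}."""
    bricks, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Brick "):
            current = line[len("Brick "):]
            bricks[current] = {"entries": [], "count": None}
        elif current is None:
            continue  # ignore anything before the first brick header
        elif line.startswith("Number of entries:"):
            bricks[current]["count"] = int(line.split(":", 1)[1])
        else:
            bricks[current]["entries"].append(line)
    return bricks

info = parse_heal_info(SAMPLE)
for brick, data in info.items():
    print(f"{brick}: {data['count']} pending")
print("total pending:", sum(d["count"] or 0 for d in info.values()))
```

A total of zero only means that no entries are listed as pending at that moment; it is not a completion message, so the logs are still the place to confirm that a heal finished.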
                              
                              
                              Meanwhile I can still see traffic between the servers, and there were still no messages about the healing process completing.
                            
                          
                        On which machine do we have the mount?

                            
                            Pranith
                        
                          
                                2014-08-29 10:02 GMT+03:00 Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>:

                                  
                                     Wow, this is
                                      great news! Thanks a lot for
                                      sharing the results :-). Did you
                                      get a chance to test the
                                      performance of the applications in
                                      the vm during self-heal?

                                      May I know more about your use
                                      case? i.e. How many vms and what
                                      is the avg size of each vm etc?

                                          
                                          Pranith
                                      
                                        
                                          On 08/28/2014 11:27 PM, Roman wrote:

                                            Here are the results.
                                              1. I still have a problem with log rotation: logs are being written to the .log.1 file, not the .log file. Any hints on how to fix this?
                                              2. The healing logs are much better now; I can see the success message.
                                              3. Both volumes, with HD off and with HD on, synced successfully. The volume with HD on synced much faster.
                                              4. Both VMs on the volumes survived the outage, even though new files were added and deleted during the outage.

                                              So replication works well with HD both on and off for volumes hosting VMs, and it is even faster with HD on. I still need to solve the logging issue.

                                              It seems we could start production storage from this moment :) The whole company will use it, some volumes distributed and some replicated. Thanks for a great product.
                                            
                                            
                                              2014-08-27 16:03 GMT+03:00 Roman <romeo.r@xxxxxxxxx>:

                                                  Installed the new packages. I will run some tests tomorrow. Thanks.

                                                    2014-08-27 14:10 GMT+03:00 Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>:
                                                      
                                                        
                                                          On 08/27/2014 04:38 PM, Kaleb KEITHLEY wrote:

                                                          On 08/27/2014 03:09 AM, Humble Chirammal wrote:

                                                          ----- Original Message -----
                                                          | From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
                                                          | To: "Humble Chirammal" <hchiramm@xxxxxxxxxx>
                                                          | Cc: "Roman" <romeo.r@xxxxxxxxx>, gluster-users@xxxxxxxxxxx, "Niels de Vos" <ndevos@xxxxxxxxxx>
                                                          | Sent: Wednesday, August 27, 2014 12:34:22 PM
                                                          | Subject: Re: libgfapi failover problem on replica bricks
                                                          |
                                                          |
                                                          | On 08/27/2014 12:24 PM, Roman wrote:
                                                          | > root@stor1:~# ls -l /usr/sbin/glfsheal
                                                          | > ls: cannot access /usr/sbin/glfsheal: No such file or directory
                                                          | > Seems like not.
                                                          | Humble,
                                                          |       Seems like the binary is still not packaged?

                                                          Checking with Kaleb on this.

                                                          ...

                                                          | >>> |
                                                          | >>> | Humble/Niels,
                                                          | >>> |     Do we have debs available for 3.5.2? In 3.5.1 there was packaging
                                                          | >>> | issue where /usr/bin/glfsheal is not packaged along with the deb. I
                                                          | >>> | think that should be fixed now as well?
                                                          | >>> |
                                                          | >>> Pranith,
                                                          | >>>
                                                          | >>> The 3.5.2 packages for debian is not available yet. We
                                                          | >>> are co-ordinating internally to get it processed.
                                                          | >>> I will update the list once its available.
                                                          | >>>
                                                          | >>> --Humble

                                                          glfsheal isn't in our 3.5.2-1 DPKGs either. We (meaning I) started with the 3.5.1 packaging bits from Semiosis. Perhaps he fixed 3.5.1 after giving me his bits.

                                                          I'll fix it and spin 3.5.2-2 DPKGs.

                                                          That is great Kaleb. Please notify semiosis as well in case he is yet to fix it.

                                                          Pranith

                                                          -- 

                                                          Kaleb
                                                          
                                                        -- 

                                                        Best regards,

                                                        Roman. 
                                                
                                              
      
    
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users