Re: glusterfs under high load failing?

On 10/13/2014 10:03 PM, Roman wrote:
Hmm,
seems like another strange issue? I've seen this before and had to restart the volume to get my free space back.
root@glstor-cli:/srv/nfs/HA-WIN-TT-1T# ls -l
total 943718400
-rw-r--r-- 1 root root 966367641600 Oct 13 16:55 disk
root@glstor-cli:/srv/nfs/HA-WIN-TT-1T# rm disk
root@glstor-cli:/srv/nfs/HA-WIN-TT-1T# df -h
Filesystem                                              Size  Used Avail Use% Mounted on
rootfs                                                  282G  1.1G  266G   1% /
udev                                                     10M     0   10M   0% /dev
tmpfs                                                   1.4G  228K  1.4G   1% /run
/dev/disk/by-uuid/c62ee3c0-c0e5-44af-b0cd-7cb3fbcc0fba  282G  1.1G  266G   1% /
tmpfs                                                   5.0M     0  5.0M   0% /run/lock
tmpfs                                                   5.2G     0  5.2G   0% /run/shm
stor1:HA-WIN-TT-1T                                     1008G  901G   57G  95% /srv/nfs/HA-WIN-TT-1T

The file is gone, but the used size is still 901G.
Both servers show the same.
Do I really have to restart the volume to fix that?
IMO this can happen if there is an fd leak; the open-fd count is the only thing that a volume restart changes. How do you reproduce the bug?
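If it helps, the open fds on the bricks can be checked without restarting the volume (a quick sketch, assuming the stock gluster CLI and the volume name from this thread):

gluster volume status HA-WIN-TT-1T fd
gluster volume statedump HA-WIN-TT-1T

The statedump files (by default under /var/run/gluster/ on the brick nodes) list the fds each brick process still holds open.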

Pranith

2014-10-13 19:30 GMT+03:00 Roman <romeo.r@xxxxxxxxx>:
Sure.
I'll let it run overnight.

2014-10-13 19:19 GMT+03:00 Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>:
Hi Roman,
     Do you think we can run this test again? This time, could you enable profiling with 'gluster volume profile <volname> start', run the same test, and then provide the output of 'gluster volume profile <volname> info' along with the logs?
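For reference, the sequence would look something like this, run against the volume under test:

gluster volume profile <volname> start
# ... run the dd test ...
gluster volume profile <volname> info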

Pranith

On 10/13/2014 09:45 PM, Roman wrote:
Sure!

root@stor1:~# gluster volume info

Volume Name: HA-2TB-TT-Proxmox-cluster
Type: Replicate
Volume ID: 66e38bde-c5fa-4ce2-be6e-6b2adeaa16c2
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: stor1:/exports/HA-2TB-TT-Proxmox-cluster/2TB
Brick2: stor2:/exports/HA-2TB-TT-Proxmox-cluster/2TB
Options Reconfigured:
nfs.disable: 0
network.ping-timeout: 10

Volume Name: HA-WIN-TT-1T
Type: Replicate
Volume ID: 2937ac01-4cba-44a8-8ff8-0161b67f8ee4
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: stor1:/exports/NFS-WIN/1T
Brick2: stor2:/exports/NFS-WIN/1T
Options Reconfigured:
nfs.disable: 1
network.ping-timeout: 10



2014-10-13 19:09 GMT+03:00 Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx>:
Could you give your 'gluster volume info' output?

Pranith

On 10/13/2014 09:36 PM, Roman wrote:
Hi,

I've got this kind of setup (the servers run a replicated volume):


@ 10G backend
gluster storage1
gluster storage2
gluster client1

@ 1G backend
other gluster clients

The servers have HW RAID5 with SAS disks.

So today I decided to create a 900 GB file for an iSCSI target, located on a separate GlusterFS volume, using dd (just a dummy file filled with zeros, bs=1G count=900).
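(The command was roughly the following, reconstructed from the description above; the output path is assumed to be the mount point shown earlier in the thread.)

dd if=/dev/zero of=/srv/nfs/HA-WIN-TT-1T/disk bs=1G count=900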
First of all, the process took quite a long time; the write speed was 130 MB/s (the client port was 2 Gbps, the server ports were running at 1 Gbps).
Then it reported something like "endpoint is not connected" and all of my VMs on the other volume started to give me I/O errors.
Server load was around 4.6 (12 cores in total).

Maybe it was due to the ping timeout of 2 seconds, so I raised it to 10 seconds.
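(That is the network.ping-timeout option already visible in the volume info above; it would have been changed with something like the following.)

gluster volume set HA-2TB-TT-Proxmox-cluster network.ping-timeout 10
gluster volume set HA-WIN-TT-1T network.ping-timeout 10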

Also, during the dd image creation, the VMs very often reported that their disks were slow, like:

WARNINGs: Read IO Wait time is -0.02 (outside range [0:1]).

Is 130 MB/s the maximum bandwidth for all of the volumes in total? Is that why we would need 10G backends?

The HW RAID's local speed is 300 MB/s, so it should not be the issue. Any ideas or maybe any advice?


Maybe someone has an optimized sysctl.conf for a 10G backend?

Mine is pretty simple, just what can be found by googling.
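For what it's worth, a commonly quoted starting point for 10 GbE TCP tuning looks something like the following (illustrative values only, not taken from this setup, and worth benchmarking before and after):

# /etc/sysctl.conf - larger socket buffers for 10 GbE
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# queue more incoming packets before dropping
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_window_scaling = 1

Apply with 'sysctl -p' after editing /etc/sysctl.conf.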


Just to mention: those VMs were connected using a separate 1 Gbps interface, which means they should not be affected by the client with the 10G backend.


The logs are pretty useless; they just say this during the outage:


[2014-10-13 12:09:18.392910] W [client-handshake.c:276:client_ping_cbk] 0-HA-2TB-TT-Proxmox-cluster-client-0: timer must have expired

[2014-10-13 12:10:08.389708] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-HA-2TB-TT-Proxmox-cluster-client-0: server 10.250.0.1:49159 has not responded in the last 2 seconds, disconnecting.

[2014-10-13 12:10:08.390312] W [client-handshake.c:276:client_ping_cbk] 0-HA-2TB-TT-Proxmox-cluster-client-0: timer must have expired

So I decided to set the timeout a bit higher.

So it seems to me that under high load GlusterFS is not usable? 130 MB/s is not so much that it should cause timeouts or make the system so slow that the VMs suffer.

Of course, after the disconnection the healing process started, but since the VMs had lost the connection to both servers it was pretty useless; they could not run anymore. And by the way, when you load the server with such a huge job (a 900 GB dd), the healing process goes very slowly :)
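(For what it's worth, heal progress can at least be monitored with the standard command, using the affected volume name from the logs above:)

gluster volume heal HA-2TB-TT-Proxmox-cluster info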



--
Best regards,
Roman.






--
Best regards,
Roman.

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users
