Hi,
I've got this kind of setup (the servers run a replicated volume):

@ 10G backend:
  gluster storage1
  gluster storage2
  gluster client1

@ 1G backend:
  other gluster clients
The servers have hardware RAID5 with SAS disks.
So today I decided to create a 900 GB file for an iSCSI target
that will live on a separate GlusterFS volume, using dd (just a
dummy file filled with zeros, bs=1G count=900).
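Roughly this command (the mount point and file name here are just
placeholders):

    dd if=/dev/zero of=/mnt/<iscsi-volume>/target.img bs=1G count=900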
First of all, the process took quite a lot of time: the write
speed was 130 MB/s (the client port was 2 Gbps, the server ports
were running at 1 Gbps).
Then it reported something like "endpoint is not connected" and
all of my VMs on the other volume started giving me IO errors.
Server load was around 4.6 (12 cores in total).
Maybe it was due to the ping timeout of 2 seconds, so I've set it
a bit higher, 10 seconds.
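If I'm turning the right knob, that should be network.ping-timeout,
so something like (volume name left out here):

    gluster volume set <volname> network.ping-timeout 10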
Also, during the dd image creation the VMs very often reported
that their disks were slow, with warnings like:
WARNING: Read IO Wait time is -0.02 (outside range [0:1]).
Is 130 MB/s the maximum bandwidth for all of the volumes in
total? Then why would we need 10G backends at all? The local HW
RAID speed is 300 MB/s, so that should not be the bottleneck.
Any ideas or maybe some advice?
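If my math is right, 1 Gbps is roughly 125 MB/s on the wire, and
with a replica 2 volume the client sends every write to both
servers at once, so 130 MB/s per copy would already fill the
2 Gbps client port (and each 1 Gbps server port). But that is
just my own guess, please correct me if I'm wrong.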
Does anyone have an optimized sysctl.conf for a 10G backend?
Mine is pretty simple, the kind of thing you find by googling.
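For what it's worth, it's roughly the usual TCP buffer tuning
that turns up when googling, along these lines (values are just
examples):

    # bigger socket buffers for 10GbE
    net.core.rmem_max = 67108864
    net.core.wmem_max = 67108864
    net.ipv4.tcp_rmem = 4096 87380 33554432
    net.ipv4.tcp_wmem = 4096 65536 33554432
    # longer device backlog queue
    net.core.netdev_max_backlog = 30000
    # helps when path MTU / jumbo frames are involved
    net.ipv4.tcp_mtu_probing = 1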
Just to mention: those VMs were connected using a separate
1 Gbps interface, which means they should not be affected by the
client on the 10G backend.
The logs are pretty useless, they just say this during the
outage:
[2014-10-13 12:09:18.392910] W [client-handshake.c:276:client_ping_cbk] 0-HA-2TB-TT-Proxmox-cluster-client-0: timer must have expired
[2014-10-13 12:10:08.389708] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-HA-2TB-TT-Proxmox-cluster-client-0: server 10.250.0.1:49159 has not responded in the last 2 seconds, disconnecting.
[2014-10-13 12:10:08.390312] W [client-handshake.c:276:client_ping_cbk] 0-HA-2TB-TT-Proxmox-cluster-client-0: timer must have expired
So I decided to set the timeout a bit higher.
So it seems to me that under high load GlusterFS is not usable?
130 MB/s is not that much to be getting timeouts, or to make the
system so slow that the VMs feel bad.
Of course, after the disconnection the healing process started,
but since the VMs had lost the connection to both servers it was
pretty useless, they could not run anymore. And by the way, when
you load the server with such a huge job (a dd of 900 GB), the
healing process goes really slowly :)
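For reference, I was following the heal progress with the usual
command, something like:

    gluster volume heal <volname> info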
--
Best regards,
Roman.