Re: Unreasonably poor performance of replicated volumes

Joe Julian <joe@xxxxxxxxxxxxxxxx> · Sat, 14 Apr 2018 09:19:04 -0700



    A jumbo ethernet frame can be 9000 bytes. The ethernet framing
    overhead is at least 38 bytes, and the minimum TCP/IP header size
    is 40 bytes; combined, that's 78 bytes or about 0.87% of the jumbo
    frame. Gluster's RPC also adds a few bytes (not sure how many and
    don't have time to test at the moment but for the sake of argument
    we'll just say 20 bytes) but, all together, it's about 99%
    efficient. If you write 20 bytes to a file (for an extreme example)
    then you'll have your 20 bytes+RPC header+TCP/IP header+ethernet
    header; 98 bytes of headers for 20 bytes of data, 118 bytes in
    all. Those headers being about 83% of the frame means that your
    packet is only about 17% efficient. That's per replica so if you
    have a replica 3 that's three individual frames with 98 bytes of
    headers each to write the same 20 bytes of data. Those go out to
    the three servers and wait for their response. So you have a
    network round trip + a tiny bit of latency for stacking the three
    frames in the kernel + disk write latency. That's a lot of overhead
    and, for any networked storage, cannot ever be as fast as writing
    to a local disk.
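
For anyone who wants to redo the header arithmetic with their own numbers, here is a minimal Python sketch. It takes the 38-byte ethernet and 40-byte TCP/IP figures from the paragraph above; the 20-byte RPC header is only the for-the-sake-of-argument guess, not a measured value, and the function name is made up for illustration.

```python
def wire_efficiency(payload_bytes, ethernet_overhead=38, tcp_ip_header=40,
                    rpc_header=20):
    """Return (header_bytes, payload fraction of all bytes on the wire).

    rpc_header=20 is an assumed placeholder, not a measured Gluster value.
    """
    headers = ethernet_overhead + tcp_ip_header + rpc_header
    return headers, payload_bytes / (payload_bytes + headers)


JUMBO_FRAME = 9000  # bytes, treated here as the total frame size

# Extreme case: a 20-byte write.
headers, eff_small = wire_efficiency(20)
# Best case: the payload fills whatever the headers leave of a jumbo frame.
_, eff_jumbo = wire_efficiency(JUMBO_FRAME - headers)

print(f"20-byte write: {headers} header bytes, {eff_small:.1%} payload")
print(f"full jumbo frame: {eff_jumbo:.1%} payload")
print(f"replica 3, one 20-byte write: {3 * headers} header bytes in total")
```

Changing `payload_bytes` to 4096 or 524288 gives a rough feel for why the larger block size fares better, although real writes are split and batched in ways this does not model.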

    
    The question, however, is does it need to be? Do you care if a
    single thread is slower in a clustered environment than it would be
    on a local RAID stack? In good clustered engineering your workload
    will be handled by multiple threads over a cluster of workers.
    Overall, you have more threads than you could have on a single
    machine. This allows servicing a greater overall workload than you
    could without a cluster. I refer to that as comparing apples to
    orchards (1).
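
To put rough numbers on the apples-to-orchards point, here is an idealised Python model. Every figure in it (latencies, writer counts, the capacity ceiling) is a made-up assumption and not a measurement from this thread, and the two function names are hypothetical. It assumes each writer waits for one write to be acknowledged before sending the next.

```python
def single_writer_mb_s(block_bytes, round_trip_us, disk_write_us):
    """Throughput of one writer that waits for every write to be acknowledged."""
    seconds_per_write = (round_trip_us + disk_write_us) / 1_000_000
    return block_bytes / seconds_per_write / 1_000_000


def cluster_mb_s(writers, per_writer_mb_s, capacity_mb_s):
    """Aggregate throughput: grows with writers until shared capacity is hit."""
    return min(writers * per_writer_mb_s, capacity_mb_s)


# Hypothetical figures, chosen only to show the shape of the curve.
one = single_writer_mb_s(block_bytes=4096, round_trip_us=100, disk_write_us=100)
for writers in (1, 8, 64, 512):
    total = cluster_mb_s(writers, one, capacity_mb_s=2000)
    print(f"{writers:4d} writers: {total:8.1f} MB/s")
```

A single writer is latency-bound no matter how fast the link is; the aggregate only approaches the hardware limit once enough writers are running in parallel.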

    
    On 04/13/18 10:58, Anastasia Belyaeva wrote:

    
      Thanks a lot for your reply!
        

        You guessed it right though - mailing lists, various
          blogs, documentation, videos and even source code at this
          point. Changing some of the options does make performance
          slightly better, but nothing particularly groundbreaking.

        
        So, if I understand you correctly, no one has yet managed
          to get acceptable performance (relative to underlying hardware
          capabilities) with smaller block sizes? Is there an
          explanation for this?
        

        2018-04-13 1:57 GMT+03:00 Vlad Kopylov <vladkopy@xxxxxxxxx>:

          
                Guess you went through user lists and tried
                  something like this already http://lists.gluster.org/pipermail/gluster-users/2018-April/033811.html

                
                I have the exact same setup, and below is as far as it went
                after months of trial and error.

              
              We all have somewhat the same setup and the same issue with this -
              you can find posts like yours on a daily basis.

            
              On Wed, Apr 11, 2018 at 3:03 PM, Anastasia Belyaeva <anastasia.blv@xxxxxxxxx> wrote:

                
                  Hello everybody!
                    

                    I have 3 gluster servers (gluster 3.12.6, CentOS 7.2;
                      those are actually virtual machines located on 3
                      separate physical XenServer 7.1 servers).
                    

                    They are all connected via an InfiniBand network.
                      Iperf3 shows around 23 Gbit/s network bandwidth
                      between any two of them.
                    

                    Each server has 3 HDDs put into a stripe*3 thin pool
                      (LVM2) with a logical volume created on top of it,
                      formatted with XFS.
                      Gluster top reports the following throughput:
                    

                      root@fsnode2 ~ $ gluster volume top r3vol write-perf bs 4096 count 524288 list-cnt 0
                      Brick: fsnode2.ibnet:/data/glusterfs/r3vol/brick1/brick
                      Throughput 631.82 MBps time 3.3989 secs
                      Brick: fsnode6.ibnet:/data/glusterfs/r3vol/brick1/brick
                      Throughput 566.96 MBps time 3.7877 secs
                      Brick: fsnode4.ibnet:/data/glusterfs/r3vol/brick1/brick
                      Throughput 546.65 MBps time 3.9285 secs
                    
                    
                      root@fsnode2 ~ $ gluster volume top r2vol write-perf bs 4096 count 524288 list-cnt 0
                      Brick: fsnode2.ibnet:/data/glusterfs/r2vol/brick1/brick
                      Throughput 539.60 MBps time 3.9798 secs
                      Brick: fsnode4.ibnet:/data/glusterfs/r2vol/brick1/brick
                      Throughput 580.07 MBps time 3.7021 secs
                    
                    
                    There are two pure replicated volumes ('replica 2'
                      and 'replica 3'). The 'replica 2' volume is for
                      testing purposes only.
                    
                      Volume Name: r2vol
                      Type: Replicate
                      Volume ID: 4748d0c0-6bef-40d5-b1ec-d30e10cfddd9
                      Status: Started
                      Snapshot Count: 0
                      Number of Bricks: 1 x 2 = 2
                      Transport-type: tcp
                      Bricks:
                      Brick1: fsnode2.ibnet:/data/glusterfs/r2vol/brick1/brick
                      Brick2: fsnode4.ibnet:/data/glusterfs/r2vol/brick1/brick
                      Options Reconfigured:
                      nfs.disable: on

                      
                      Volume Name: r3vol
                      Type: Replicate
                      Volume ID: b0f64c28-57e1-4b9d-946b-26ed6b499f29
                      Status: Started
                      Snapshot Count: 0
                      Number of Bricks: 1 x 3 = 3
                      Transport-type: tcp
                      Bricks:
                      Brick1: fsnode2.ibnet:/data/glusterfs/r3vol/brick1/brick
                      Brick2: fsnode4.ibnet:/data/glusterfs/r3vol/brick1/brick
                      Brick3: fsnode6.ibnet:/data/glusterfs/r3vol/brick1/brick
                      Options Reconfigured:
                      nfs.disable: on
                    
                    
                    The client also runs gluster 3.12.6, on a CentOS 7.3
                      virtual machine, using a FUSE mount:
                    
                      root@centos7u3-nogdesktop2 ~ $ mount |grep gluster
                      gluster-host.ibnet:/r2vol on /mnt/gluster/r2 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
                      gluster-host.ibnet:/r3vol on /mnt/gluster/r3 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
                    
                    
                    The problem is that there is a
                      significant performance loss with smaller block
                      sizes. For example: 
                    

                    4K block size
                    [replica 3 volume]
                    
                      root@centos7u3-nogdesktop2 ~ $ dd if=/dev/zero of=/mnt/gluster/r3/file$RANDOM bs=4096 count=262144
                      262144+0 records in
                      262144+0 records out
                      1073741824 bytes (1.1 GB) copied, 11.2207 s, 95.7 MB/s
                    
                    
                    [replica 2 volume]

                    
                      root@centos7u3-nogdesktop2 ~ $ dd if=/dev/zero of=/mnt/gluster/r2/file$RANDOM bs=4096 count=262144
                      262144+0 records in
                      262144+0 records out
                      1073741824 bytes (1.1 GB) copied, 12.0149 s, 89.4 MB/s
                    
                    
                    512K block size

                      
                    [replica 3 volume]

                      
                      root@centos7u3-nogdesktop2 ~ $ dd if=/dev/zero of=/mnt/gluster/r3/file$RANDOM bs=512K count=2048
                      2048+0 records in
                      2048+0 records out
                      1073741824 bytes (1.1 GB) copied, 5.27207 s, 204 MB/s
                    
                    
                    [replica 2 volume]

                    
                      root@centos7u3-nogdesktop2 ~ $ dd if=/dev/zero of=/mnt/gluster/r2/file$RANDOM bs=512K count=2048
                      2048+0 records in
                      2048+0 records out
                      1073741824 bytes (1.1 GB) copied, 4.22321 s, 254 MB/s
                    
                    
                    With a bigger block size it's still not where I
                      expect it to be, but at least it starts to make
                      some sense.
                    

                    I've been trying to solve this for a very long
                      time with no luck.
                    I've already tried both kernel tuning
                      (different 'tuned' profiles and the ones
                      recommended in the "Linux Kernel Tuning" section)
                      and tweaking gluster volume options, including
                      write-behind/flush-behind/write-behind-window-size.
                    The latter, to my surprise, didn't make any
                      difference. At first I thought it was a
                      buffering issue, but it turns out it does buffer
                      writes, just not very efficiently (well, at least
                      that's what it looks like in the gluster profile
                      output):
                    

                      root@fsnode2 ~ $ gluster volume profile r3vol info clear
                      ...
                      Cleared stats.
                      

                        root@centos7u3-nogdesktop2 ~ $ dd if=/dev/zero of=/mnt/gluster/r3/file$RANDOM bs=4096 count=262144
                        262144+0 records in
                        262144+0 records out
                        1073741824 bytes (1.1 GB) copied, 10.9743 s, 97.8 MB/s
                      
                       
                        root@fsnode2 ~ $ gluster volume profile r3vol info
                          Brick: fsnode2.ibnet:/data/glusterfs/r3vol/brick1/brick
                          -------------------------------------------------------
                          Cumulative Stats:
                             Block Size:               4096b+                8192b+               16384b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 1576                  4173                 19605

                             Block Size:              32768b+               65536b+              131072b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 7777                  1847                   657
                           %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
                           ---------   -----------   -----------   -----------   ------------        ----
                                0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
                                0.00      18.00 us      18.00 us      18.00 us              1      STATFS
                                0.00      20.50 us      11.00 us      30.00 us              2       FLUSH
                                0.00      22.50 us      17.00 us      28.00 us              2    FINODELK
                                0.01      76.50 us      65.00 us      88.00 us              2    FXATTROP
                                0.01     177.00 us     177.00 us     177.00 us              1      CREATE
                                0.02      56.14 us      23.00 us     128.00 us              7      LOOKUP
                                0.02     259.00 us      20.00 us     498.00 us              2     ENTRYLK
                               99.94      59.23 us      17.00 us   10914.00 us          35635       WRITE

                              Duration: 38 seconds
                             Data Read: 0 bytes
                          Data Written: 1073741824 bytes

                          Interval 0 Stats:
                             Block Size:               4096b+                8192b+               16384b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 1576                  4173                 19605

                             Block Size:              32768b+               65536b+              131072b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 7777                  1847                   657
                           %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
                           ---------   -----------   -----------   -----------   ------------        ----
                                0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
                                0.00      18.00 us      18.00 us      18.00 us              1      STATFS
                                0.00      20.50 us      11.00 us      30.00 us              2       FLUSH
                                0.00      22.50 us      17.00 us      28.00 us              2    FINODELK
                                0.01      76.50 us      65.00 us      88.00 us              2    FXATTROP
                                0.01     177.00 us     177.00 us     177.00 us              1      CREATE
                                0.02      56.14 us      23.00 us     128.00 us              7      LOOKUP
                                0.02     259.00 us      20.00 us     498.00 us              2     ENTRYLK
                               99.94      59.23 us      17.00 us   10914.00 us          35635       WRITE

                              Duration: 38 seconds
                             Data Read: 0 bytes
                          Data Written: 1073741824 bytes

                          Brick: fsnode6.ibnet:/data/glusterfs/r3vol/brick1/brick
                          -------------------------------------------------------
                          Cumulative Stats:
                             Block Size:               4096b+                8192b+               16384b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 1576                  4173                 19605

                             Block Size:              32768b+               65536b+              131072b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 7777                  1847                   657
                           %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
                           ---------   -----------   -----------   -----------   ------------        ----
                                0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
                                0.00      33.00 us      33.00 us      33.00 us              1      STATFS
                                0.00      22.50 us      13.00 us      32.00 us              2     ENTRYLK
                                0.00      32.00 us      26.00 us      38.00 us              2       FLUSH
                                0.01      47.50 us      16.00 us      79.00 us              2    FINODELK
                                0.01     157.00 us     157.00 us     157.00 us              1      CREATE
                                0.01      92.00 us      70.00 us     114.00 us              2    FXATTROP
                                0.03      72.57 us      39.00 us     121.00 us              7      LOOKUP
                               99.94      47.97 us      15.00 us    1598.00 us          35635       WRITE

                              Duration: 38 seconds
                             Data Read: 0 bytes
                          Data Written: 1073741824 bytes

                          Interval 0 Stats:
                             Block Size:               4096b+                8192b+               16384b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 1576                  4173                 19605

                             Block Size:              32768b+               65536b+              131072b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 7777                  1847                   657
                           %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
                           ---------   -----------   -----------   -----------   ------------        ----
                                0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
                                0.00      33.00 us      33.00 us      33.00 us              1      STATFS
                                0.00      22.50 us      13.00 us      32.00 us              2     ENTRYLK
                                0.00      32.00 us      26.00 us      38.00 us              2       FLUSH
                                0.01      47.50 us      16.00 us      79.00 us              2    FINODELK
                                0.01     157.00 us     157.00 us     157.00 us              1      CREATE
                                0.01      92.00 us      70.00 us     114.00 us              2    FXATTROP
                                0.03      72.57 us      39.00 us     121.00 us              7      LOOKUP
                               99.94      47.97 us      15.00 us    1598.00 us          35635       WRITE

                              Duration: 38 seconds
                             Data Read: 0 bytes
                          Data Written: 1073741824 bytes

                          Brick: fsnode4.ibnet:/data/glusterfs/r3vol/brick1/brick
                          -------------------------------------------------------
                          Cumulative Stats:
                             Block Size:               4096b+                8192b+               16384b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 1576                  4173                 19605

                             Block Size:              32768b+               65536b+              131072b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 7777                  1847                   657
                           %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
                           ---------   -----------   -----------   -----------   ------------        ----
                                0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
                                0.00      58.00 us      58.00 us      58.00 us              1      STATFS
                                0.00      38.00 us      38.00 us      38.00 us              2     ENTRYLK
                                0.01      59.00 us      32.00 us      86.00 us              2       FLUSH
                                0.01      81.00 us      33.00 us     129.00 us              2    FINODELK
                                0.01      91.50 us      73.00 us     110.00 us              2    FXATTROP
                                0.01     239.00 us     239.00 us     239.00 us              1      CREATE
                                0.04     103.14 us      63.00 us     210.00 us              7      LOOKUP
                               99.92      52.99 us      16.00 us   11289.00 us          35635       WRITE

                              Duration: 38 seconds
                             Data Read: 0 bytes
                          Data Written: 1073741824 bytes

                          Interval 0 Stats:
                             Block Size:               4096b+                8192b+               16384b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 1576                  4173                 19605

                             Block Size:              32768b+               65536b+              131072b+
                           No. of Reads:                    0                     0                     0
                          No. of Writes:                 7777                  1847                   657
                           %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
                           ---------   -----------   -----------   -----------   ------------        ----
                                0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
                                0.00      58.00 us      58.00 us      58.00 us              1      STATFS
                                0.00      38.00 us      38.00 us      38.00 us              2     ENTRYLK
                                0.01      59.00 us      32.00 us      86.00 us              2       FLUSH
                                0.01      81.00 us      33.00 us     129.00 us              2    FINODELK
                                0.01      91.50 us      73.00 us     110.00 us              2    FXATTROP
                                0.01     239.00 us     239.00 us     239.00 us              1      CREATE
                                0.04     103.14 us      63.00 us     210.00 us              7      LOOKUP
                               99.92      52.99 us      16.00 us   11289.00 us          35635       WRITE

                              Duration: 38 seconds
                             Data Read: 0 bytes
                          Data Written: 1073741824 bytes
                      
                    
                    At this point I've officially run out of ideas
                      about where to look next. So any help, suggestions
                      or pointers are highly appreciated!
                    
                        
                          -- 
                          Best regards,
                          Anastasia Belyaeva
                                    
                                    
                  _______________________________________________

                  Gluster-users mailing list

                  Gluster-users@xxxxxxxxxxx

                  http://lists.gluster.org/mailman/listinfo/gluster-users

                
        -- 

        
                  Best regards,
                  Anastasia Belyaeva
                
                
                With respect,
                Anastasia Belyaeva

                