Re: High apply latency

Hi Jakub,

On 05/02/2018 at 12:26, Jakub Jaszewski wrote:
Hi Frederic, 

Many thanks for your contribution to the topic!

I've just set logging level 20 for filestore via

ceph tell osd.0 config set debug_filestore 20

but so far I have found nothing matching the keyword 'split' in /var/log/ceph/ceph-osd.0.log.


So, if you're running Ceph > 12.2.1, that means splitting is not happening. Did you check during writes? Did you check the other OSDs' logs?
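For a quick check across all OSDs on a node (assuming the default log locations), something like this would do:

grep -i split /var/log/ceph/ceph-osd.*.log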

Actually, splitting should not happen now that you've increased the filestore_merge_threshold and filestore_split_multiple values.


I've also run your script across the cluster nodes; results as follows:

id=3, pool=volumes, objects=10454548, avg=160.28
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.2344
id=3, pool=volumes, objects=10454548, avg=159.22
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.9994
id=3, pool=volumes, objects=10454548, avg=159.843
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7435
id=3, pool=volumes, objects=10454548, avg=159.695
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.0579
id=3, pool=volumes, objects=10454548, avg=160.594
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7757
id=3, pool=volumes, objects=10454548, avg=160.099
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=33.8517
id=3, pool=volumes, objects=10454548, avg=159.912
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=37.5698
id=3, pool=volumes, objects=10454548, avg=159.407
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.4991
id=3, pool=volumes, objects=10454548, avg=160.075
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.481

Looks like there is nothing to be handled by split, am I right? But what about merging? The avg is less than 40, so should the directory structure be reduced now?

It should, I guess. But then you'd see blocked requests on every object deletion. If you do, you might want to set filestore_merge_threshold to -40 (a negative value) so that merging no longer happens.
Splitting would still happen above 5120 files per subdirectory.
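If you go that way, something along these lines should apply it at runtime (a sketch; you'd also want the value in ceph.conf so it survives restarts):

ceph tell osd.* injectargs '--filestore_merge_threshold=-40'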

    "filestore_merge_threshold": "40",
    "filestore_split_multiple": "8",
    "filestore_split_rand_factor": "20",

May I ask for a link to documentation where I can read more about the OSDs' underlying directory structure?

I'm not aware of any related documentation.

Do you still observe slow or blocked requests now that you've increased filestore_merge_threshold and filestore_split_multiple?

Regards,

Frédéric.


And I just noticed these log entries in /var/log/ceph/ceph-osd.0.log:

2018-02-05 11:22:03.346400 7f3cc94fe700  0 -- 10.212.14.11:6818/4702 >> 10.212.14.17:6802/82845 conn(0xe254cca800 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 27 vs existing csq=27 existing_state=STATE_STANDBY
2018-02-05 11:22:03.346583 7f3cc94fe700  0 -- 10.212.14.11:6818/4702 >> 10.212.14.17:6802/82845 conn(0xe254cca800 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 28 vs existing csq=27 existing_state=STATE_STANDBY


Many thanks!
Jakub

On Mon, Feb 5, 2018 at 9:56 AM, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:

Hi,

In addition, starting with Luminous 12.2.1 (RHCS 3), splitting ops should be logged with the default debug settings: https://github.com/ceph/ceph/blob/v12.2.1/src/os/filestore/HashIndex.cc#L320
There's also an RFE for merging to be logged, as splitting now is: https://bugzilla.redhat.com/show_bug.cgi?id=1523532

Regards,

Frédéric.


On 02/02/2018 at 17:00, Frédéric Nass wrote:

Hi,

Split and merge operations happen during writes only, splitting on file creation and merging on file deletion.

As you don't see any blocked requests during reads, I would guess your issue happens during splitting. Now that you've increased filestore_merge_threshold and filestore_split_multiple, you shouldn't expect any splitting to happen anytime soon, nor any merging, unless your workload consists of writing a huge number of files and then removing them.

You should check how many files there are in each of the lower directories of pool 20's PGs. This would help confirm that the blocked requests come with the splitting.
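For a quick look at pool 20 on one node, a rough one-liner in the same spirit as the script below:

for dir in $(find /var/lib/ceph/osd/ceph-*/current/20.*_head -type d) ; do echo "$(find $dir -maxdepth 1 -type f | wc -l) $dir" ; done | sort -rn | head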

We now use the script below (on one of the OSD nodes) to get an average number of files in some PGs of each pool, and we run it every 5 minutes with Munin to graph the values.
This way, we can anticipate the splitting, and even provoke it during off hours with a rados bench (using a prefix):

#!/bin/bash

# Report, for each pool, the average number of files per subdirectory
# across a sample of this node's PG directories.

for pool in $(ceph osd pool ls detail | grep -v -E 'snap|^$' | cut -d " " -f2-3 | sed -e 's/ /|/g') ; do

    pool_id=$(echo $pool | cut -d "|" -f1)
    pool_name=$(echo $pool | cut -d "|" -f2 | sed -e "s/'//g")
    nbr_objects=$(rados df | grep "$pool_name " | awk '{print $3}')

    # Skip pools that have no PG directory on this node
    if ls -d /var/lib/ceph/osd/ceph-*/current/$pool_id.*_head > /dev/null 2>&1 ; then
        # For each of the last 5 PG directories: count the files in every
        # non-empty subdirectory, average these counts per PG, then average
        # the per-PG values.
        avg=$(for pg_dir in $(ls -d /var/lib/ceph/osd/ceph-*/current/$pool_id.*_head | tail -5) ; do
            find $pg_dir -type d -print0 | while read -d '' -r dir ; do
                files=$(find "$dir" -maxdepth 1 -type f | wc -l)
                printf "$files in $dir\n"
            done | grep -v '^0 ' | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }'
        done | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }')
    else
        avg=0
    fi

    echo "id=$pool_id, pool=$pool_name, objects=$nbr_objects, avg=$avg"

done
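If you don't use Munin, a plain cron entry keeps the history just as well (a sketch; the script path and log file are only examples):

*/5 * * * * root /usr/local/bin/pg_files_avg.sh >> /var/log/pg_files_avg.log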

With 40/8 values, subdirectory splitting will happen when the number of files in a subdirectory goes over 5120, and merging when it goes below 40. My assumption is that merging can also happen to intermediate directories when they go below 40 files, but I'm not sure about that.
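For reference, here is where the 5120 comes from, as I understand the filestore splitting rule:

# a subdirectory splits when its file count exceeds
# filestore_split_multiple * abs(filestore_merge_threshold) * 16
echo $(( 8 * 40 * 16 ))    # prints 5120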

If you ever need to provoke splitting on a pool, you can do this with a rados bench (don't forget the prefix so you can easily remove the objects afterwards if it turns out that some were left in the pool).
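A sketch of such a run, assuming your rados version has --no-cleanup and the cleanup command's --prefix option:

# write objects without automatic cleanup so the new directory layout persists
rados bench -p ${pool_name} 600 write --no-cleanup
# later, remove the benchmark objects by their default prefix
rados -p ${pool_name} cleanup --prefix benchmark_data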
You can also do the splitting offline if you need it:

systemctl stop ceph-osd@${osd_num}
ceph-osd -i ${osd_num} --flush-journal
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_num} --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log --op apply-layout-settings --pool ${pool_name}
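Once the tool completes, you would presumably start the OSD again:

systemctl start ceph-osd@${osd_num}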

Regards,

Frederic.


On 02/02/2018 at 09:55, Jakub Jaszewski wrote:
Hi,

So I have changed the merge & split settings to

filestore_merge_threshold = 40
filestore_split_multiple = 8

and restarted all OSDs, host by host.
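I set them in the [osd] section of ceph.conf, which as I understand it is needed for the values to persist across restarts:

[osd]
filestore_merge_threshold = 40
filestore_split_multiple = 8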

Let me ask a question: although the pool default.rgw.buckets.data that was affected prior to the above change now has higher write bandwidth, it is very erratic. Writes to the other pools (both EC and replicated types) are erratic too; before the change, writes to replicated pools were much more stable.
Reads from the pools look fine and stable.

Is this the result of the mentioned change? Is the PG directory structure updating, or ...?

[...]

Regards
Jakub

On Thu, Feb 1, 2018 at 3:33 PM, Jakub Jaszewski <jaszewski.jakub@xxxxxxxxx> wrote:
Regarding split & merge, I have the default values
filestore_merge_threshold = 10
filestore_split_multiple = 2

According to https://bugzilla.redhat.com/show_bug.cgi?id=1219974, the recommended values are

filestore_merge_threshold = 40
filestore_split_multiple = 8

Is this something I can easily change back to the defaults, or to lower values than proposed, in case of further performance degradation?

I did tests of 4 pools: 2 replicated pools (x3) and 2 EC pools (k=6, m=3).

The pool with the lowest bandwidth has a PG directory structure like:
├── 20.115s1_head
│   └── DIR_5
│       └── DIR_1
│           ├── DIR_1
│           │   ├── DIR_0
│           │   ├── DIR_1
│           │   ├── DIR_2
│           │   │   ├── DIR_0
│           │   │   ├── DIR_1
│           │   │   ├── DIR_2
│           │   │   ├── DIR_3
│           │   │   ├── DIR_4
│           │   │   ├── DIR_5
│           │   │   ├── DIR_6
│           │   │   ├── DIR_7
│           │   │   ├── DIR_8
│           │   │   ├── DIR_9
│           │   │   ├── DIR_A
│           │   │   ├── DIR_B
│           │   │   ├── DIR_C
│           │   │   ├── DIR_D
│           │   │   ├── DIR_E
│           │   │   └── DIR_F 


Test results

# rados bench -p default.rgw.buckets.data 10 write 
hints = 1
Maintaining 16 concurrent writes of 4194432 bytes to objects of size 4194432 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180679
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       129       113   451.975   452.014   0.0376714    0.128027
    2      16       209       193   385.964    320.01    0.119609    0.138517
    3      16       235       219   291.974   104.003   0.0337624     0.13731
    4      16       235       219   218.981         0           -     0.13731
    5      16       266       250   199.983   62.0019    0.111673    0.238424
    6      16       317       301   200.649   204.006   0.0340569    0.298489
    7      16       396       380   217.124    316.01   0.0379956    0.283458
    8      16       444       428   213.981   192.006   0.0304383    0.274193
    9      16       485       469   208.426   164.005    0.391956    0.283421
   10      16       496       480   191.983   44.0013    0.104497    0.292074
   11      16       497       481   174.894   4.00012    0.999985    0.293545
   12      16       497       481    160.32         0           -    0.293545
   13      16       497       481   147.987         0           -    0.293545
   14      16       497       481   137.417         0           -    0.293545
Total time run:         14.493353
Total writes made:      497
Write size:             4194432
Object size:            4194432
Bandwidth (MB/sec):     137.171
Stddev Bandwidth:       147.001
Max bandwidth (MB/sec): 452.014
Min bandwidth (MB/sec): 0
Average IOPS:           34
Stddev IOPS:            36
Max IOPS:               113
Min IOPS:               0
Average Latency(s):     0.464281
Stddev Latency(s):      1.09388
Max latency(s):         6.3723
Min latency(s):         0.023835
Cleaning up (deleting benchmark objects)
Removed 497 objects
Clean up completed and total clean up time :10.622382
#


# rados bench -p benchmark_erasure_coded 10 write 
hints = 1
Maintaining 16 concurrent writes of 4202496 bytes to objects of size 4202496 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180807
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       424       408   1635.11   1635.19   0.0490434   0.0379616
    2      16       828       812   1627.03   1619.16   0.0616501   0.0388467
    3      16      1258      1242   1659.06   1723.36   0.0304412   0.0384537
    4      16      1659      1643   1646.03   1607.13   0.0155402   0.0387351
    5      16      2053      2037   1632.61   1579.08   0.0453354   0.0390236
    6      16      2455      2439      1629   1611.14   0.0485313   0.0392376
    7      16      2649      2633   1507.34   777.516   0.0148972   0.0393161
    8      16      2858      2842   1423.61   837.633   0.0157639   0.0449088
    9      16      3245      3229   1437.75   1551.02   0.0200845   0.0444847
   10      16      3629      3613   1447.85      1539   0.0654451   0.0441569
Total time run:         10.229591
Total writes made:      3630
Write size:             4202496
Object size:            4202496
Bandwidth (MB/sec):     1422.18
Stddev Bandwidth:       341.609
Max bandwidth (MB/sec): 1723.36
Min bandwidth (MB/sec): 777.516
Average IOPS:           354
Stddev IOPS:            85
Max IOPS:               430
Min IOPS:               194
Average Latency(s):     0.0448612
Stddev Latency(s):      0.0712224
Max latency(s):         1.08353
Min latency(s):         0.0134629
Cleaning up (deleting benchmark objects)
Removed 3630 objects
Clean up completed and total clean up time :2.321669



# rados bench -p volumes 10 write 
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180651
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       336       320   1279.89      1280   0.0309006   0.0472524
    2      16       653       637   1273.84      1268   0.0465151   0.0495701
    3      16       956       940   1253.17      1212   0.0337327   0.0504146
    4      16      1256      1240   1239.85      1200   0.0177263   0.0509145
    5      16      1555      1539   1231.05      1196   0.0364991   0.0516724
    6      16      1868      1852   1234.51      1252   0.0260964   0.0510236
    7      16      2211      2195   1254.13      1372    0.040738    0.050847
    8      16      2493      2477   1238.35      1128   0.0228582   0.0514979
    9      16      2838      2822   1254.07      1380   0.0265224   0.0508641
   10      16      3116      3100   1239.85      1112   0.0160151   0.0513104
Total time run:         10.192091
Total writes made:      3117
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1223.3
Stddev Bandwidth:       89.9383
Max bandwidth (MB/sec): 1380
Min bandwidth (MB/sec): 1112
Average IOPS:           305
Stddev IOPS:            22
Max IOPS:               345
Min IOPS:               278
Average Latency(s):     0.0518144
Stddev Latency(s):      0.0529575
Max latency(s):         0.663523
Min latency(s):         0.0122169
Cleaning up (deleting benchmark objects)
Removed 3117 objects
Clean up completed and total clean up time :0.212296



# rados bench -p benchmark_replicated  10 write 
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180779
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       309       293   1171.94      1172   0.0233267   0.0508877
    2      16       632       616   1231.87      1292   0.0258237    0.049612
    3      16       959       943   1257.19      1308   0.0335615   0.0483464
    4      16      1276      1260   1259.85      1268    0.031461   0.0504689
    5      16      1643      1627   1301.44      1468   0.0274032   0.0489651
    6      16      1991      1975   1316.51      1392   0.0408116   0.0483596
    7      16      2328      2312   1320.98      1348   0.0242298    0.048175
    8      16      2677      2661   1330.33      1396    0.097513    0.047962
    9      16      3042      3026   1344.72      1460   0.0196724   0.0474078
   10      16      3384      3368   1347.03      1368   0.0426199   0.0472573
Total time run:         10.482871
Total writes made:      3384
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1291.25
Stddev Bandwidth:       90.4861
Max bandwidth (MB/sec): 1468
Min bandwidth (MB/sec): 1172
Average IOPS:           322
Stddev IOPS:            22
Max IOPS:               367
Min IOPS:               293
Average Latency(s):     0.048763
Stddev Latency(s):      0.0547666
Max latency(s):         0.938211
Min latency(s):         0.0121556
Cleaning up (deleting benchmark objects)
Removed 3384 objects
Clean up completed and total clean up time :0.239684
#



Luis, why did you advise against increasing pg_num/pgp_num? I'm wondering which option is better: increasing pg_num, or increasing filestore_merge_threshold and filestore_split_multiple?

Thanks
Jakub


On Thu, Feb 1, 2018 at 9:38 AM, Jaroslaw Owsiewski <jaroslaw.owsiewski@xxxxxxxxxx> wrote:
Hi,

maybe "split is on the floor"?

Regards
-- 
Jarek










_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
