Re: Commit and Apply latency on nautilus

Sasha Litvak <alexander.v.litvak@xxxxxxxxx> · Tue, 1 Oct 2019 21:46:15 -0500

All,

Thank you for your suggestions.  During last night's test, at least one drive on one node was power-on reset by the controller, which caused a couple of OSDs on that node to assert / time out.  I am testing and updating the usual suspects on this node, and after that on the whole cluster: kernel, controller firmware, SSD firmware.  All of these have updates available.  Dell mentioned a possible crash on bionic during high throughput, but none of it is clear and simple.  I would like to eliminate firmware/drivers first, especially if there is a bug causing a crash under load.  I will then proceed with Mokhtar's and Robert's suggestions.
If anyone has any more suggestions please share on this thread as it may help someone else later on.

Best,   

On Tue, Oct 1, 2019 at 2:56 PM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:

    Some suggestions:
    - Monitor raw resources such as CPU %util, raw disk %util/busy, and
      raw disk IOPS.
    - Instead of running a mix of workloads at this stage, narrow it
      down first, for example to rbd random writes with a 4k block size,
      then change one parameter at a time, for example the block size.
      See how your cluster performs and what resource loads you get
      step by step. Latency at 4M will not be the same as at 4k.
    - I would also run fio tests on the raw Nytro 1551 devices,
      including sync writes.
    - I would not recommend increasing readahead for random I/O.
    - I do not recommend making RAID0.

    /Maged
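The "narrow it down, then change one parameter at a time" approach above could be scripted roughly as follows. This is only a minimal sketch under stated assumptions: the device path /dev/rbd0, the 60-second runtime, the queue depth of 1, and the helper name fio_cmd are illustrative and do not come from the thread. By default the script only prints the fio command lines; it runs them only when RUN=1 is set, and running a write test against a raw device destroys the data on it.

```shell
#!/bin/sh
# Sketch: hold everything constant and vary only the block size.
# DEV, RUNTIME and fio_cmd are illustrative assumptions, not from the thread.
DEV="${DEV:-/dev/rbd0}"
RUNTIME="${RUNTIME:-60}"

# Print one fio command line for a random-write test at the given block size.
# A second argument of "sync" adds --sync=1 (O_SYNC writes), which is the
# variant one might use when testing a raw SSD.
fio_cmd() {
    bs="$1"
    extra=""
    if [ "${2:-}" = "sync" ]; then
        extra=" --sync=1"
    fi
    echo "fio --name=randwrite-$bs --filename=$DEV --ioengine=libaio --direct=1 --rw=randwrite --bs=$bs --iodepth=1 --numjobs=1 --runtime=$RUNTIME --time_based --group_reporting$extra"
}

for bs in 4k 16k 128k 4M; do
    cmd="$(fio_cmd "$bs")"
    if [ "${RUN:-0}" = "1" ]; then
        # WARNING: this overwrites data on $DEV.
        sh -c "$cmd"
    else
        # Dry run: show what would be executed.
        echo "$cmd"
    fi
done
```

For the raw-device sync-write test, something like `DEV=/dev/sdX` together with `fio_cmd 4k sync` would print the corresponding command line; /dev/sdX is a placeholder for a drive that holds no data you need.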

    On 01/10/2019 02:12, Sasha Litvak wrote:

      At this point, I have run out of ideas.  I changed nr_requests
      from 128 to 1024 and readahead from 128 to 4096, and tuned the
      nodes to performance-throughput.  However, I still get high
      latency during benchmark testing.  I attempted to disable the
      cache on the SSDs with

      for i in {a..f}; do hdparm -W 0 -A 0 /dev/sd$i; done

      and I don't think it made things any better.  I have H740
      and H730 controllers with the drives in HBA mode.

      Other than converting them one by one to RAID0, I am not
      sure what else I can try.

      Any suggestions?
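When changing queue settings like the ones above, it can help to record the current values before and after each step. The sketch below assumes the settings were made through the standard Linux sysfs files nr_requests and read_ahead_kb (the thread does not say which interface was used); the SYSFS_ROOT variable and the show_queue helper are illustrative names, and the device list sda..sdf simply mirrors the hdparm loop.

```shell
#!/bin/sh
# Sketch: print block-queue settings so each tuning step can be recorded.
# SYSFS_ROOT and show_queue are illustrative names; the sysfs files
# queue/nr_requests and queue/read_ahead_kb are standard on Linux.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"

show_queue() {
    dev="$1"
    q="$SYSFS_ROOT/$dev/queue"
    if [ ! -d "$q" ]; then
        echo "$dev: no queue directory at $q" >&2
        return 1
    fi
    echo "$dev nr_requests=$(cat "$q/nr_requests") read_ahead_kb=$(cat "$q/read_ahead_kb")"
}

# Devices that do not exist on this host are reported and skipped.
for d in sda sdb sdc sdd sde sdf; do
    show_queue "$d" || true
done
```

A value would then be changed as root with, for example, `echo 1024 > /sys/block/sda/queue/nr_requests`, one device and one parameter at a time so that any change in latency can be attributed to it.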

        On Mon, Sep 30, 2019 at 2:45 PM Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:

          BTW: commit and apply latency are the exact same thing since
          BlueStore, so don't bother looking at both.

          In fact you should mostly be looking at the op_*_latency
          counters.

          Paul

          -- 
          Paul Emmerich

          Looking for help with your Ceph cluster? Contact us at https://croit.io

          croit GmbH
          Freseniusstr. 31h
          81247 München
          www.croit.io
          Tel: +49 89 1896585 90

          On Mon, Sep 30, 2019 at 8:46 PM Sasha Litvak
          <alexander.v.litvak@xxxxxxxxx> wrote:
          >
          > In my case, I am using premade Prometheus-sourced dashboards in Grafana.
          >
          > For individual latency, the queries look like this:
          >
          > irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"$osd"}[1m]) / on (ceph_daemon) irate(ceph_osd_op_r_latency_count[1m])
          > irate(ceph_osd_op_w_latency_sum{ceph_daemon=~"$osd"}[1m]) / on (ceph_daemon) irate(ceph_osd_op_w_latency_count[1m])
          >
          > The other ones use
          >
          > ceph_osd_commit_latency_ms
          > ceph_osd_apply_latency_ms
          >
          > and graph their distribution over time.
          >
          > Also, average OSD op latency:
          >
          > avg(rate(ceph_osd_op_r_latency_sum{cluster="$cluster"}[5m]) / rate(ceph_osd_op_r_latency_count{cluster="$cluster"}[5m]) >= 0)
          > avg(rate(ceph_osd_op_w_latency_sum{cluster="$cluster"}[5m]) / rate(ceph_osd_op_w_latency_count{cluster="$cluster"}[5m]) >= 0)
          >
          > Average OSD apply + commit latency:
          >
          > avg(ceph_osd_apply_latency_ms{cluster="$cluster"})
          > avg(ceph_osd_commit_latency_ms{cluster="$cluster"})
          >
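The sum/count queries quoted above divide the growth of a latency sum by the growth of the matching op count, which gives the average latency of the operations that completed in that window. As a rough worked example of that arithmetic, the sketch below does the same with two samples of a counter pair. The helper name interval_latency_ms and the sample values are made up for illustration, and it assumes the sum is reported in seconds.

```shell
#!/bin/sh
# Sketch: average op latency over an interval from two samples of a
# (latency_sum, latency_count) counter pair, i.e. delta(sum) / delta(count).
# interval_latency_ms is an illustrative helper, not a Ceph or Prometheus tool.
# Arguments: sum1 count1 sum2 count2 (sums assumed to be in seconds).
interval_latency_ms() {
    awk -v s1="$1" -v c1="$2" -v s2="$3" -v c2="$4" 'BEGIN {
        dc = c2 - c1
        if (dc <= 0) { print "nan"; exit }   # no ops completed in the interval
        printf "%.3f\n", (s2 - s1) / dc * 1000
    }'
}

# Made-up samples: 0.5 s of accumulated latency across 2000 ops.
interval_latency_ms 10.0 40000 10.5 42000   # prints 0.250
```

An idle OSD has no new operations in the window, so the division is undefined; the `>= 0` comparison in the cluster-wide queries appears to serve the same purpose as the "nan" branch here, dropping such results before they are averaged.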

          > On Mon, Sep 30, 2019 at 11:13 AM Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> wrote:
          >>
          >> What parameters exactly are you using? I want to do a similar test on
          >> Luminous before I upgrade to Nautilus. I have quite a lot (74+):
          >>
          >> type_instance=Osd.opBeforeDequeueOpLat
          >> type_instance=Osd.opBeforeQueueOpLat
          >> type_instance=Osd.opLatency
          >> type_instance=Osd.opPrepareLatency
          >> type_instance=Osd.opProcessLatency
          >> type_instance=Osd.opRLatency
          >> type_instance=Osd.opRPrepareLatency
          >> type_instance=Osd.opRProcessLatency
          >> type_instance=Osd.opRwLatency
          >> type_instance=Osd.opRwPrepareLatency
          >> type_instance=Osd.opRwProcessLatency
          >> type_instance=Osd.opWLatency
          >> type_instance=Osd.opWPrepareLatency
          >> type_instance=Osd.opWProcessLatency
          >> type_instance=Osd.subopLatency
          >> type_instance=Osd.subopWLatency
          >> ...
          >> ...
          >>
          >> -----Original Message-----
          >> From: Alex Litvak [mailto:alexander.v.litvak@xxxxxxxxx]
          >> Sent: Sunday, 29 September 2019 13:06
          >> To: ceph-users@xxxxxxxxxxxxxx
          >> Cc: ceph-devel@xxxxxxxxxxxxxxx
          >> Subject:  Commit and Apply latency on nautilus
          >>
          >> Hello everyone,
          >>
          >> I am running a number of parallel benchmark tests against a cluster
          >> that should be ready to go to production.
          >> I enabled Prometheus to monitor various information, and while the cluster
          >> stays healthy through the tests with no errors or slow requests,
          >> I noticed apply / commit latency jumping between 40 and 600 ms on
          >> multiple SSDs.  At the same time, op_read and op_write are on average
          >> below 0.25 ms in the worst case scenario.
          >>
          >> I am running Nautilus 14.2.2, all BlueStore, no separate NVMe devices
          >> for WAL/DB, 6 SSDs per node (Dell PowerEdge R440) with all drives Seagate
          >> Nytro 1551, OSDs spread across 6 nodes, running in
          >> containers.  Each node has plenty of RAM, with utilization ~ 25 GB during
          >> the benchmark runs.
          >>
          >> Here are the benchmarks being run from 6 client systems in parallel,
          >> repeating the test for each block size in <4k,16k,128k,4M>.
          >>
          >> On an rbd-mapped partition local to each client:
          >>
          >> fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw
          >> --bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300
          >> --group_reporting --time_based --rwmixread=70
          >>
          >> On a mounted cephfs volume, with each client storing its test file(s) in its own
          >> sub-directory:
          >>
          >> fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw
          >> --bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300
          >> --group_reporting --time_based --rwmixread=70
          >>
          >> dbench -t 30 30
          >>
          >> Could you please let me know whether such a huge jump in apply
          >> and commit latency is justified in my case, and whether I can do
          >> anything to improve / fix it.  Below is some additional cluster info.
          >>
          >> Thank you,
          >>
          >> root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph osd df
          >> ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE VAR  PGS STATUS
          >>  6   ssd 1.74609  1.00000 1.7 TiB  93 GiB  92 GiB 240 MiB  784 MiB 1.7 TiB 5.21 0.90  44     up
          >> 12   ssd 1.74609  1.00000 1.7 TiB  98 GiB  97 GiB 118 MiB  906 MiB 1.7 TiB 5.47 0.95  40     up
          >> 18   ssd 1.74609  1.00000 1.7 TiB 102 GiB 101 GiB 123 MiB  901 MiB 1.6 TiB 5.73 0.99  47     up
          >> 24   ssd 3.49219  1.00000 3.5 TiB 222 GiB 221 GiB 134 MiB  890 MiB 3.3 TiB 6.20 1.07  96     up
          >> 30   ssd 3.49219  1.00000 3.5 TiB 213 GiB 212 GiB 151 MiB  873 MiB 3.3 TiB 5.95 1.03  93     up
          >> 35   ssd 3.49219  1.00000 3.5 TiB 203 GiB 202 GiB 301 MiB  723 MiB 3.3 TiB 5.67 0.98 100     up
          >>  5   ssd 1.74609  1.00000 1.7 TiB 103 GiB 102 GiB 123 MiB  901 MiB 1.6 TiB 5.78 1.00  49     up
          >> 11   ssd 1.74609  1.00000 1.7 TiB 109 GiB 108 GiB  63 MiB  961 MiB 1.6 TiB 6.09 1.05  46     up
          >> 17   ssd 1.74609  1.00000 1.7 TiB 104 GiB 103 GiB 205 MiB  819 MiB 1.6 TiB 5.81 1.01  50     up
          >> 23   ssd 3.49219  1.00000 3.5 TiB 210 GiB 209 GiB 168 MiB  856 MiB 3.3 TiB 5.86 1.01  86     up
          >> 29   ssd 3.49219  1.00000 3.5 TiB 204 GiB 203 GiB 272 MiB  752 MiB 3.3 TiB 5.69 0.98  92     up
          >> 34   ssd 3.49219  1.00000 3.5 TiB 198 GiB 197 GiB 295 MiB  729 MiB 3.3 TiB 5.54 0.96  85     up
          >>  4   ssd 1.74609  1.00000 1.7 TiB 119 GiB 118 GiB  16 KiB 1024 MiB 1.6 TiB 6.67 1.15  50     up
          >> 10   ssd 1.74609  1.00000 1.7 TiB  95 GiB  94 GiB 183 MiB  841 MiB 1.7 TiB 5.31 0.92  46     up
          >> 16   ssd 1.74609  1.00000 1.7 TiB 102 GiB 101 GiB 122 MiB  902 MiB 1.6 TiB 5.72 0.99  50     up
          >> 22   ssd 3.49219  1.00000 3.5 TiB 218 GiB 217 GiB 109 MiB  915 MiB 3.3 TiB 6.11 1.06  91     up
          >> 28   ssd 3.49219  1.00000 3.5 TiB 198 GiB 197 GiB 343 MiB  681 MiB 3.3 TiB 5.54 0.96  95     up
          >> 33   ssd 3.49219  1.00000 3.5 TiB 198 GiB 196 GiB 297 MiB 1019 MiB 3.3 TiB 5.53 0.96  85     up
          >>  1   ssd 1.74609  1.00000 1.7 TiB 101 GiB 100 GiB 222 MiB  802 MiB 1.6 TiB 5.63 0.97  49     up
          >>  7   ssd 1.74609  1.00000 1.7 TiB 102 GiB 101 GiB 153 MiB  871 MiB 1.6 TiB 5.69 0.99  46     up
          >> 13   ssd 1.74609  1.00000 1.7 TiB 106 GiB 105 GiB  67 MiB  957 MiB 1.6 TiB 5.96 1.03  42     up
          >> 19   ssd 3.49219  1.00000 3.5 TiB 206 GiB 205 GiB 179 MiB  845 MiB 3.3 TiB 5.77 1.00  83     up
          >> 25   ssd 3.49219  1.00000 3.5 TiB 195 GiB 194 GiB 352 MiB  672 MiB 3.3 TiB 5.45 0.94  97     up
          >> 31   ssd 3.49219  1.00000 3.5 TiB 201 GiB 200 GiB 305 MiB  719 MiB 3.3 TiB 5.62 0.97  90     up
          >>  0   ssd 1.74609  1.00000 1.7 TiB 110 GiB 109 GiB  29 MiB  995 MiB 1.6 TiB 6.14 1.06  43     up
          >>  3   ssd 1.74609  1.00000 1.7 TiB 109 GiB 108 GiB  28 MiB  996 MiB 1.6 TiB 6.07 1.05  41     up
          >>  9   ssd 1.74609  1.00000 1.7 TiB 103 GiB 102 GiB 149 MiB  875 MiB 1.6 TiB 5.76 1.00  52     up
          >> 15   ssd 3.49219  1.00000 3.5 TiB 209 GiB 208 GiB 253 MiB  771 MiB 3.3 TiB 5.83 1.01  98     up
          >> 21   ssd 3.49219  1.00000 3.5 TiB 199 GiB 198 GiB 302 MiB  722 MiB 3.3 TiB 5.56 0.96  90     up
          >> 27   ssd 3.49219  1.00000 3.5 TiB 208 GiB 207 GiB 226 MiB  798 MiB 3.3 TiB 5.81 1.00  95     up
          >>  2   ssd 1.74609  1.00000 1.7 TiB  96 GiB  95 GiB 158 MiB  866 MiB 1.7 TiB 5.35 0.93  45     up
          >>  8   ssd 1.74609  1.00000 1.7 TiB 106 GiB 105 GiB 132 MiB  892 MiB 1.6 TiB 5.91 1.02  50     up
          >> 14   ssd 1.74609  1.00000 1.7 TiB  96 GiB  95 GiB 180 MiB  844 MiB 1.7 TiB 5.35 0.92  46     up
          >> 20   ssd 3.49219  1.00000 3.5 TiB 221 GiB 220 GiB 156 MiB  868 MiB 3.3 TiB 6.18 1.07 101     up
          >> 26   ssd 3.49219  1.00000 3.5 TiB 206 GiB 205 GiB 332 MiB  692 MiB 3.3 TiB 5.76 1.00  92     up
          >> 32   ssd 3.49219  1.00000 3.5 TiB 221 GiB 220 GiB  88 MiB  936 MiB 3.3 TiB 6.18 1.07  91     up
          >>                     TOTAL  94 TiB 5.5 TiB 5.4 TiB 6.4 GiB   30 GiB  89 TiB 5.78
          >> MIN/MAX VAR: 0.90/1.15  STDDEV: 0.30
          >>
          >> root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph -s
          >>    cluster:
          >>      id:     9b4468b7-5bf2-4964-8aec-4b2f4bee87ad
          >>      health: HEALTH_OK
          >>
          >>    services:
          >>      mon: 3 daemons, quorum storage2n1-la,storage2n2-la,storage2n3-la (age 9w)
          >>      mgr: storage2n2-la(active, since 9w), standbys: storage2n1-la, storage2n3-la
          >>      mds: cephfs:1 {0=storage2n6-la=up:active} 1 up:standby-replay 1 up:standby
          >>      osd: 36 osds: 36 up (since 9w), 36 in (since 9w)
          >>
          >>    data:
          >>      pools:   3 pools, 832 pgs
          >>      objects: 4.18M objects, 1.8 TiB
          >>      usage:   5.5 TiB used, 89 TiB / 94 TiB avail
          >>      pgs:     832 active+clean
          >>
          >>    io:
          >>      client:   852 B/s rd, 15 KiB/s wr, 4 op/s rd, 2 op/s wr
          >>
          >> _______________________________________________
          >> ceph-users mailing list
          >> ceph-users@xxxxxxxxxxxxxx
          >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com