Re: slow requests break performance

Hello,

On Thu, 02 Feb 2017 10:24:53 +0100 Eugen Block wrote:

> Hi,
> 
> thank you very much for your answer! I'm not sure I get all your  
> points, but I'll try to dig deeper.
> 
I'll repeat myself and say that looking at your nodes with atop during a
benchmark and slow request situation should fire up the old light bulb. 

> > IOPS are more likely to be your bottleneck  
> 
> Can you give me an example of reasonable numbers for ops/s, especially  
> for our small cluster with 1 Gb/s? I have no idea what could be  
> considered as normal.
> 
Expected IOPS?
Far too many variables, in both the HW/design and the testing methodology. 

But it will be a lot less than you expect, with your network only coming
into play for sync writes; small async 4K I/Os will saturate
your disks and CPUs (in that order; the reverse for SSD/NVMe based OSDs). 

If you had a cluster consisting of 3 identical nodes, each with 10 HDDs
nominally rated at 150 IOPS, a standard Ceph cluster would give you a
theoretical maximum of a mere 750 IOPS: 1500 (per node) * 3 (nodes) / 3 (replica) / 2 (journal).
In reality it's likely to be less due to FS and other overhead.
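
For illustration, here is that back-of-the-envelope math as a minimal
Python sketch; the 3 nodes / 10 HDDs / 150 IOPS figures are just the
hypothetical example above, not your actual cluster:

# Rough ceiling for client write IOPS on a filestore cluster with
# collocated journals: raw disk IOPS divided by the replica count
# and again by the journal double-write on the same HDDs.
nodes, hdds_per_node, iops_per_hdd = 3, 10, 150   # hypothetical example
replicas, journal_penalty = 3, 2

raw_iops = nodes * hdds_per_node * iops_per_hdd   # 4500 raw IOPS
print(raw_iops / replicas / journal_penalty)      # -> 750.0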

> >> It's a 1 Gb/s network.
> >>  
> > And there's your "problem".
> > I read this as single link "1Gb/s" network  
> 
> Just a note on that: we are using LACP.
> 
So actually 2Gb/s per node?
Due to the hashing nature of LACP you may of course not see any benefits
with a single client test.
But without thoroughly analyzing things and identifying the bottlenecks (is
the network really saturated, or are the OSDs/HDDs redlined?) this is
guesswork again.

> Currently, we are preparing to modify our ceph cluster to have only
> three nodes instead of four, and replace the LVM-based OSDs with
> dedicated HDDs (one large OSD per HDD (2 TB)). Would this increase the
> performance with the same network, or could it be even worse?

Worse, due to the reduced number of nodes. See above.

> Or is it  
> simply impossible to have an acceptable performance with a 1 Gb/s  
> network? Any thoughts on that would help me a lot.
> 
Reasonable performance is something that's in the eye of the beholder.

If this were for a backup/archive solution with the client(s) also on
a 1Gb/s network it might be just fine, if you had a sufficient number of
nodes (6 comes to mind) and/or additional cluster network links.

But you're trying to run random VMs on top of this storage, so IOPS again.
Putting your journals on an SSD would be an immediate way to improve
things, provided the HDDs are saturated. 
Such an SSD would need sufficient endurance (you have to determine that, but
something like a DC S3610 comes to mind), but not massive speed with this
kind of network.
And if you can afford only one per node, it would also become a nice SPoF.
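
To put a rough number on that, the same back-of-the-envelope sketch with
the journal double-write taken off the HDDs (still the hypothetical
3 x 10 x 150 IOPS figures from above, and assuming the SSD journal fully
absorbs the extra write):

raw_iops = 3 * 10 * 150         # nodes * HDDs per node * ~150 IOPS per HDD
replicas = 3
print(raw_iops / replicas / 2)  # collocated journals: ~750 IOPS ceiling
print(raw_iops / replicas / 1)  # journals on SSD: roughly doubles to ~1500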

Christian

> Eugen
> 
> Zitat von Christian Balzer <chibi@xxxxxxx>:
> 
> > Hello,
> >
> > On Wed, 01 Feb 2017 15:16:15 +0100 Eugen Block wrote:
> >  
> >> > You've told us absolutely nothing about your cluster  
> >>
> >> You're right, I'll try to provide as much information as possible.
> >>
> >> Please note that we have kind of a "special" layout... The HDDs on
> >> ndesan01 are in a RAID6, which is bcached by two SSDs in a RAID1:
> >>  
> >
> > I've dabbled in RAID-backed OSDs with mixed success, but your setup
> > basically tries to do the IOPS for 5 OSDs on what essentially boils down
> > to a single HDD (due to the RAID6).
> > A HW cache or maybe your bcache setup will help with this of course, but
> > once they are saturated and/or your IO patterns have competing reads and
> > writes, it (performance) will essentially go to hell.
> >
> > I have no (successful and recent) experience with bcache, so this may or
> > may not be a factor, too.
> >
> > Again, monitoring things with atop during a benchmark (large window, at
> > least 5s frequency) on your nodes and especially this one will show you
> > where the bottlenecks are very quickly.
> >  
> >> ---cut here---
> >> ndesan01: # cat /proc/mdstat
> >> Personalities : [raid1] [raid6] [raid5] [raid4]  
> >> md126 : active raid6 sdd1[3] sde1[4] sdf1[5] sda1[0] sdc1[2] sdb1[1]
> >>        3906521088 blocks super 1.2 level 6, 512k chunk, algorithm 2
> >> [6/6] [UUUUUU]
> >>        bitmap: 6/8 pages [24KB], 65536KB chunk
> >>
> >> md127 : active raid1 sdh1[1] sdg1[0]
> >>        195228864 blocks super 1.2 [2/2] [UU]
> >>        bitmap: 2/2 pages [8KB], 65536KB chunk
> >>
> >> unused devices: <none>
> >> ndesan01: # pvs
> >>    PV           VG       Fmt  Attr PSize PFree   
> >>    /dev/bcache0 ndesan01 lvm2 a--  3.64t 427.46g
> >> ndesan01: # lvs | grep osd
> >>    cephosd0             ndesan01 -wi-ao---- 250.00g
> >>                                                     
> >>    cephosd1             ndesan01 -wi-ao---- 250.00g
> >>                                                     
> >>    cephosd2             ndesan01 -wi-ao---- 250.00g
> >>                                                     
> >>    cephosd3             ndesan01 -wi-ao---- 250.00g
> >>                                                     
> >>    cephosd4             ndesan01 -wi-ao---- 250.00g
> >> ---cut here---
> >>
> >> ndesan02 has the same layout. nde32 is not bcached
> >>
> >> nde32:~ #  lvs | grep osd
> >>    cephosd0             ndesan02     -wi-ao---- 250,00g
> >>    cephosd1             ndesan02     -wi-ao---- 250,00g
> >>    cephosd2             ndesan02     -wi-ao---- 250,00g
> >>    cephosd3             ndesan02     -wi-ao---- 250,00g
> >>    cephosd4             ndesan02     -wi-ao---- 250,00g
> >>
> >> ndesan03 is a dedicated ceph node and is also bcached:
> >>
> >> ---cut here---
> >> ndesan03:~ #  df -hl | grep osd
> >> /dev/bcache0                     932G  286G  646G  31% /var/ceph/osd0
> >> /dev/bcache1                     932G  288G  644G  31% /var/ceph/osd1
> >> /dev/bcache2                     932G  298G  634G  33% /var/ceph/osd2
> >> /dev/bcache3                     932G  289G  644G  31% /var/ceph/osd3
> >> /dev/bcache4                     932G  285G  647G  31% /var/ceph/osd4
> >> ---cut here---
> >>
> >> This mixed setup is due to the pre-existing infrastructure on these
> >> servers, we planned to migrate the environment step by step into ceph.
> >>
> >> ceph@ndesan01:~/ceph-deploy>  ceph --version
> >> ceph version 0.94.7-84-g8e6f430
> >>  
> > Not that it matters much, but you may as well go to 0.94.9, aka the
> > latest Hammer.
> >  
> >> All 4 nodes have the same OS:
> >> ceph@ndesan01:~/ceph-deploy> cat /etc/os-release
> >> NAME="openSUSE Leap"
> >> VERSION="42.1"
> >>
> >> ndesan01, ndesan02 and nde32 have the same kernel:
> >> ceph@ndesan01:~/ceph-deploy>  uname -r
> >> 4.1.13-5-default
> >>  
> > The 4.1 to 4.4 kernels have various issues, of course SUSE might have
> > backported fixes.
> > It often pays to run the latest (4.8, 4.9) kernels with Ceph.
> > But again, not the cause of your problem.
> >  
> >> ndesan03:~ #  uname -r
> >> 4.1.12-1-default
> >>
> >> It's a 1 Gb/s network.
> >>  
> > And there's your "problem".
> > I read this as single link "1Gb/s" network, which makes your benchmark
> > numbers totally on par, as I suggested.
> >
> > Congestion at individual OSDs (an unlucky roll of the dice that has one
> > OSD being primary for one IO and secondary for other concurrent ones) will
> > not help, of course.
> >  
> >> ndesan01: 2 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (each 6 cores)
> >>
> >> ndesan02: 2 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (each 6 cores)
> >>  
> > If you're doing 4K sync/direct randwrites you'll notice that faster cores
> > will be better than more cores, but for bandwidth related tests it doesn't
> > matter. Of course that's against SSD/NVMe backed OSDs or at least ones
> > with SSD journals.
> >
> > Everything below looks nominal. The lower memory on one node will make
> > it inherently slower, and thus the overall bottleneck in an otherwise
> > identical cluster, due to fewer pagecache hits and thus more real HDD
> > accesses.
> >
> > Christian  
> >> nde32: 2 x Intel(R) Xeon(R) CPU E5410  @ 2.33GHz (each 4 cores)
> >>
> >> ndesan03: 1 x Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz (4 cores)
> >>
> >> ceph@ndesan01:~/ceph-deploy> ceph osd tree
> >> ID WEIGHT  TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >> -1 8.14980 root default
> >> -2 1.19995     host ndesan01
> >>   0 0.23999         osd.0          up  1.00000          1.00000
> >>   1 0.23999         osd.1          up  1.00000          1.00000
> >>   2 0.23999         osd.2          up  1.00000          1.00000
> >> 13 0.23999         osd.13         up  1.00000          1.00000
> >> 19 0.23999         osd.19         up  1.00000          1.00000
> >> -3 1.19995     host ndesan02
> >>   3 0.23999         osd.3          up  1.00000          1.00000
> >>   4 0.23999         osd.4          up  1.00000          1.00000
> >>   5 0.23999         osd.5          up  1.00000          1.00000
> >>   9 0.23999         osd.9          up  1.00000          1.00000
> >> 10 0.23999         osd.10         up  1.00000          1.00000
> >> -4 1.19995     host nde32
> >>   6 0.23999         osd.6          up  1.00000          1.00000
> >>   7 0.23999         osd.7          up  1.00000          1.00000
> >>   8 0.23999         osd.8          up  1.00000          1.00000
> >> 12 0.23999         osd.12         up  1.00000          1.00000
> >> 11 0.23999         osd.11         up  1.00000          1.00000
> >> -5 4.54994     host ndesan03
> >> 14 0.90999         osd.14         up  1.00000          1.00000
> >> 15 0.90999         osd.15         up  1.00000          1.00000
> >> 16 0.90999         osd.16         up  1.00000          1.00000
> >> 17 0.90999         osd.17         up  1.00000          1.00000
> >> 18 0.90999         osd.18         up  1.00000          1.00000
> >>
> >> ceph@ndesan01:~/ceph-deploy> ceph osd df
> >> ID WEIGHT  REWEIGHT SIZE  USE   AVAIL  %USE  VAR
> >>   0 0.23999  1.00000  249G  210G 39851M 84.43 1.59
> >>   1 0.23999  1.00000  249G  206G 44424M 82.64 1.55
> >>   2 0.23999  1.00000  249G  200G 50084M 80.43 1.51
> >> 13 0.23999  1.00000  249G  200G 50520M 80.26 1.51
> >> 19 0.23999  1.00000  249G  187G 63837M 75.05 1.41
> >>   3 0.23999  1.00000  249G  205G 45945M 82.04 1.54
> >>   4 0.23999  1.00000  249G  205G 45841M 82.08 1.54
> >>   5 0.23999  1.00000  249G  206G 44026M 82.79 1.56
> >>   9 0.23999  1.00000  249G  198G 52348M 79.54 1.50
> >> 10 0.23999  1.00000  249G  189G 62027M 75.76 1.42
> >>   6 0.23999  1.00000  249G  191G 59968M 76.56 1.44
> >>   7 0.23999  1.00000  249G  197G 53410M 79.13 1.49
> >>   8 0.23999  1.00000  249G  207G 43856M 82.86 1.56
> >> 12 0.23999  1.00000  249G  199G 51695M 79.80 1.50
> >> 11 0.23999  1.00000  249G  216G 33766M 86.80 1.63
> >> 14 0.90999  1.00000  931G  285G   645G 30.70 0.58
> >> 15 0.90999  1.00000  931G  287G   643G 30.88 0.58
> >> 16 0.90999  1.00000  931G  297G   633G 32.01 0.60
> >> 17 0.90999  1.00000  931G  288G   643G 30.93 0.58
> >> 18 0.90999  1.00000  931G  284G   646G 30.60 0.58
> >>                TOTAL 8403G 4468G  3935G 53.17
> >> MIN/MAX VAR: 0.58/1.63  STDDEV: 26.41
> >>
> >> I know that all OSDs should have the same size, but that wasn't my
> >> decision ;-) And it will be changed in the near future.
> >>
> >> Some information on RAM:
> >>
> >> ---cut here---
> >> ceph@ndesan01:~/ceph-deploy> free -g
> >>               total       used       free     shared    buffers     cached
> >> Mem:            62         61          0          0          3         50
> >> -/+ buffers/cache:          8         53
> >> Swap:            0          0          0
> >>
> >> ndesan02:~ # free -g
> >>               total       used       free     shared    buffers     cached
> >> Mem:            62         61          0          3          2         50
> >> -/+ buffers/cache:          9         53
> >> Swap:            0          0          0
> >>
> >> nde32:~ # free -g
> >>               total       used       free     shared    buffers     cached
> >> Mem:            39         38          0          0          0         29
> >> -/+ buffers/cache:          8         30
> >> Swap:            1          1          0
> >>
> >> ndesan03:~ # free -g
> >>               total       used       free     shared    buffers     cached
> >> Mem:            62         62          0          0          0         48
> >> -/+ buffers/cache:         13         49
> >> Swap:            0          0          0
> >> ---cut here---
> >>  
> >> > Since when does Xen support Ceph?  
> >>
> >> I'm not sure since when exactly. We started testing Ceph as our
> >> OpenStack storage backend last year in April or May, according to my
> >> email history it's supported since libvirt 1.3.2 (we had a bug filed
> >> for this issue).
> >>
> >> ceph@ndesan01:~/ceph-deploy> cat ceph.conf
> >> [global]
> >> fsid = 655cb05a-435a-41ba-83d9-8549f7c36167
> >> mon_initial_members = ndesan01, ndesan02, nde32
> >> mon_host = 192.168.160.15,192.168.160.16,192.168.160.17
> >> public_network = 192.168.160.0/24
> >> osd_pool_default_size = 3
> >> osd pool default min size = 2
> >> osd_crush_chooseleaf_type = 1
> >> mon_pg_warn_max_per_osd = 0
> >> auth_cluster_required = cephx
> >> auth_service_required = cephx
> >> auth_client_required = cephx
> >> filestore_xattr_use_omap = true
> >> filestore max inline xattr size = 512
> >> filestore max inline xattrs = 2
> >> rbd_default_features = 15
> >> mon data avail warn = 15
> >> mon data avail crit = 5
> >> osd_disk_thread_ioprio_class = "idle"
> >> osd_disk_thread_ioprio_priority = 7
> >> osd scrub begin hour = 0
> >> osd scrub end hour = 7
> >> osd_scrub_sleep = 0.1
> >> osd_deep_scrub_interval = 2419200
> >>
> >> [mon]
> >> mon osd allow primary affinity = true
> >>
> >> We changed the deep-scrub interval because we use a cronjob to execute
> >> deep scrubs in the evening, to avoid performance impact during working
> >> hours.
> >>
> >> Please let me know if I missed anything. I really appreciate you
> >> looking into this.
> >>
> >> Regards,
> >> Eugen
> >>
> >>
> >> Zitat von Christian Balzer <chibi@xxxxxxx>:
> >>  
> >> > Hello,
> >> >
> >> > On Wed, 01 Feb 2017 11:43:02 +0100 Eugen Block wrote:
> >> >  
> >> >> Hi,
> >> >>
> >> >> I haven't tracked the slow requests yet, but I ran some performance
> >> >> tests. Although I'm not an expert, I believe the results are quite
> >> >> unsatisfying.
> >> >>  
> >> > You've told us absolutely nothing about your cluster that would be
> >> > relevant to determine if your results are actually to be expected or not.
> >> >
> >> > Anything, and I mean anything is relevant, from OS, kernel, Ceph versions
> >> > to a detailed description of ALL HW in there, CPU, RAM, SSD, HDD, network,
> >> > topology, etc.
> >> >  
> >> >> I ran a couple of rados bench tests in a pool, with different
> >> >> replication sizes (1 to 3). The tests were executed on one of the ceph
> >> >> nodes, which also is a monitor.
> >> >>  
> >> > Benchmarks, including rados bench run from a cluster node that also hosts
> >> > OSDs will give you skewed results.
> >> >  
> >> >> Results for size 1:
> >> >>
> >> >> Total time run:         20.428947
> >> >> Total writes made:      639
> >> >> Write size:             4194304
> >> >> Bandwidth (MB/sec):     125.117
> >> >>  
> >> > This and the following results would be on par with a smallish, 1Gb/s
> >> > network HDD based cluster.
> >> > But we don't know if that's what you're running or not...
> >> >  
> >> >> Results for size 2:
> >> >>
> >> >> Total time run:         20.741990
> >> >> Total writes made:      454
> >> >> Write size:             4194304
> >> >> Bandwidth (MB/sec):     87.552
> >> >>
> >> >> Still looks nice, at least we would be totally fine with this throughput.
> >> >>  
> >> > As mentioned countless times here, throughput is most likely your least
> >> > worry with Ceph, IOPS are more likely to be your bottleneck.
> >> >  
> >> >> Results for size 3:
> >> >>
> >> >> Total time run:         22.934567
> >> >> Total writes made:      252
> >> >> Write size:             4194304
> >> >> Bandwidth (MB/sec):     43.951
> >> >>
> >> >> I guess this is still ok for a replication size of 3. But the real
> >> >> problem seems to be the client throughput.
> >> >>
> >> >> Running fio on a Xen-VM inside rbd:
> >> >>  
> >> > Since when does Xen support Ceph?
> >> > As in, how are you running a Xen VM that uses Ceph?
> >> >  
> >> >> Run status group 0 (all jobs):
> >> >>    WRITE: io=1024.0MB, aggrb=21776KB/s, minb=21776KB/s,
> >> >> maxb=21776KB/s, mint=48151msec, maxt=48151msec
> >> >>
> >> >> And this is one of the better results, we often have values of 12MB/s
> >> >> or even lower, e.g. on a KVM-VM:
> >> >>
> >> >> Run status group 0 (all jobs):
> >> >>    WRITE: io=512000KB, aggrb=6425KB/s, minb=6425KB/s, maxb=6425KB/s,
> >> >> mint=79683msec, maxt=79683msec
> >> >>  
> >> >
> >> > Fio results w/o the full command line parameters are basically worthless,
> >> > I'm afraid.
> >> >  
> >> This is unacceptable!
> >> >>
> >> >> I use almost all defaults in our ceph.conf,  
> >> >
> >> > What did you change that could be relevant then?
> >> >  
> >> >> I've been reading
> >> >> everything I could find on Google, but I don't understand why the
> >> >> clients are so slow. Is it a valid assumption, that the slow requests
> >> >> could be caused by our slow clients?
> >> >>  
> >> >
> >> > Slow requests typically happen when your cluster is not capable of
> >> > keeping up.
> >> >  
> >> >> One of the things I tried was
> >> >>
> >> >> vm1:~ # echo 4096 > /sys/block/xvda/queue/read_ahead_kb
> >> >>  
> >> > Only affects reads.
> >> >  
> >> >> but this had no effect. I tried it with different numbers of PGs, but
> >> >> that also didn't have any effect.
> >> >>  
> >> > Again, w/o knowing your cluster and these numbers, that information is of
> >> > little value.
> >> >  
> >> >> What else can I do to increase the performance of my instances or to
> >> >> analyse the bottleneck?
> >> >>  
> >> > Understand your HW/cluster.
> >> >
> >> > Monitor the nodes with atop etc, use collectd and graphite, etc, to keep
> >> > records of what happened when and to visualize things.
> >> >
> >> > Christian
> >> >  
> >> >> Regards,
> >> >> Eugen
> >> >>
> >> >>
> >> >> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:
> >> >>  
> >> >> > Check the latency figures in a "perf dump". High numbers in a
> >> >> > particular area may help you nail it.
> >> >> >
> >> >> > I suspect though, that it may come down to enabling debug logging and
> >> >> > tracking a slow request through the logs.
> >> >> >
> >> >> > On Thu, Jan 12, 2017 at 8:41 PM, Eugen Block <eblock@xxxxxx> wrote:  
> >> >> >> Hi,
> >> >> >>  
> >> >> >>> Looking at the output of dump_historic_ops and dump_ops_in_flight  
> >> >> >>
> >> >> >>
> >> >> >> I waited for new slow request messages and dumped the historic_ops into a
> >> >> >> file. The reporting OSD shows lots of "waiting for rw locks" messages and a
> >> >> >> duration of more than 30 secs:
> >> >> >>
> >> >> >>              "age": 366.044746,
> >> >> >>             "duration": 32.491506,
> >> >> >>             "type_data": [
> >> >> >>                 "commit sent; apply or cleanup",
> >> >> >>                 {
> >> >> >>                     "client": "client.9664429",
> >> >> >>                     "tid": 130439910
> >> >> >>                 },
> >> >> >>                 [
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:15.227649",
> >> >> >>                         "event": "initiated"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:15.232310",
> >> >> >>                         "event": "reached_pg"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:15.232341",
> >> >> >>                         "event": "waiting for rw locks"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:15.268819",
> >> >> >>                         "event": "reached_pg"
> >> >> >>                     },
> >> >> >> [
> >> >> >> .
> >> >> >> .
> >> >> >> .
> >> >> >> ]
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:45.515055",
> >> >> >>                         "event": "waiting for rw locks"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:46.921095",
> >> >> >>                         "event": "reached_pg"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:46.921157",
> >> >> >>                         "event": "started"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:46.921342",
> >> >> >>                         "event": "waiting for subops from 9,15"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:46.921724",
> >> >> >>                         "event": "commit_queued_for_journal_write"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:46.922186",
> >> >> >>                         "event": "write_thread_in_journal_buffer"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:46.931103",
> >> >> >>                         "event": "sub_op_commit_rec"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:46.968730",
> >> >> >>                         "event": "sub_op_commit_rec"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:47.717770",
> >> >> >>                         "event": "journaled_completion_queued"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:47.718280",
> >> >> >>                         "event": "op_commit"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:47.718359",
> >> >> >>                         "event": "commit_sent"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:47.718890",
> >> >> >>                         "event": "op_applied"
> >> >> >>                     },
> >> >> >>                     {
> >> >> >>                         "time": "2017-01-12 10:38:47.719154",
> >> >> >>                         "event": "done"
> >> >> >>                     }
> >> >> >>
> >> >> >>
> >> >> >> There were about 70 events "waiting for rw locks", I truncated the output.
> >> >> >> Based on the message "waiting for subops from 9,15" I also dumped the
> >> >> >> historic_ops for these two OSDs.
> >> >> >>
> >> >> >> Duration on OSD.9
> >> >> >>
> >> >> >>             "initiated_at": "2017-01-12 10:38:29.258221",
> >> >> >>             "age": 54.069919,
> >> >> >>             "duration": 20.831568,
> >> >> >>
> >> >> >> Duration on OSD.15
> >> >> >>
> >> >> >>             "initiated_at": "2017-01-12 10:38:23.695098",
> >> >> >>             "age": 112.118210,
> >> >> >>             "duration": 26.452526,
> >> >> >>
> >> >> >> They also contain many "waiting for rw locks" messages, but not as many as
> >> >> >> the dump from the reporting OSD.
> >> >> >> To me it seems that because two OSDs take a lot of time to process their
> >> >> >> requests (only slightly less than 30 secs), it sums up to more than 30 secs
> >> >> >> on the reporting (primary?) OSD. Is the reporting OSD always the primary?
> >> >> >>
> >> >> >> How can I debug this further? I searched the web for "waiting for rw locks",
> >> >> >> I also found Wido's blog [1] about my exact problem, but I'm not sure how to
> >> >> >> continue. Our admin says our network should be fine, but what can I do to
> >> >> >> rule that out?
> >> >> >>
> >> >> >> I don't think I have provided information about our cluster yet:
> >> >> >>
> >> >> >> 4 nodes, 3 mons, 20 OSDs on
> >> >> >> ceph version 0.94.7-84-g8e6f430 (8e6f430683e4d8293e31fd4eb6cb09be96960cfa)
> >> >> >>
> >> >> >> [1] https://blog.widodh.nl/2016/01/slow-requests-with-ceph-waiting-for-rw-locks/
> >> >> >>
> >> >> >> Thanks!
> >> >> >> Eugen
> >> >> >>
> >> >> >>
> >> >> >> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:
> >> >> >>
> >> >> >>  
> >> >> >>> On Thu, Jan 12, 2017 at 2:19 AM, Eugen Block <eblock@xxxxxx> wrote:  
> >> >> >>>>
> >> >> >>>> Hi,
> >> >> >>>>
> >> >> >>>> I simply grepped for "slow request" in ceph.log. What exactly do you mean
> >> >> >>>> by "effective OSD"?
> >> >> >>>>
> >> >> >>>> If I have this log line:
> >> >> >>>> 2017-01-11 [...] osd.16 [...] cluster [WRN] slow request 32.868141
> >> >> >>>> seconds
> >> >> >>>> old, received at 2017-01-11 [...]    
> >> ack+ondisk+write+known_if_redirected  
> >> >> >>>> e12440) currently waiting for subops from 0,12
> >> >> >>>>
> >> >> >>>> I assumed that osd.16 is the one causing problems.  
> >> >> >>>
> >> >> >>>
> >> >> >>> 0 and/or 12 in this case. 16 is just waiting. Looking at the output of
> >> >> >>> dump_historic_ops and dump_ops_in_flight on the implicated osds may
> >> >> >>> help to narrow down what the issue is.
> >> >> >>>  
> >> >> >>>> But now that you mention
> >> >> >>>> the subops, I only noticed them yesterday, but didn't have the time yet
> >> >> >>>> to investigate further. I'll have a look into the subops messages and report
> >> >> >>>> back.
> >> >> >>>>
> >> >> >>>> Thanks!
> >> >> >>>> Eugen
> >> >> >>>>
> >> >> >>>> Zitat von Burkhard Linke
> >> >> >>>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>:
> >> >> >>>>
> >> >> >>>>  
> >> >> >>>>> Hi,
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>> just for clarity:
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>> Did you parse the slow request messages and use the effective OSD in the
> >> >> >>>>> statistics? Some messages may refer to other OSDs, e.g. "waiting for sub op
> >> >> >>>>> on OSD X,Y". The reporting OSD is not the root cause in that case, but one
> >> >> >>>>> of the mentioned OSDs (and I'm currently not aware of a method to determine
> >> >> >>>>> which of the two OSDs is the cause in case of 3 replicas...).
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>> Regards,
> >> >> >>>>>
> >> >> >>>>> Burkhard
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>> _______________________________________________
> >> >> >>>>> ceph-users mailing list
> >> >> >>>>> ceph-users@xxxxxxxxxxxxxx
> >> >> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> --
> >> >> >>>> Eugen Block                             voice   : +49-40-559 51 75
> >> >> >>>> NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
> >> >> >>>> Postfach 61 03 15
> >> >> >>>> D-22423 Hamburg                         e-mail  : eblock@xxxxxx
> >> >> >>>>
> >> >> >>>>         Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >> >> >>>>           Sitz und Registergericht: Hamburg, HRB 90934
> >> >> >>>>                   Vorstand: Jens-U. Mozdzen
> >> >> >>>>                    USt-IdNr. DE 814 013 983
> >> >> >>>>
> >> >> >>>> _______________________________________________
> >> >> >>>> ceph-users mailing list
> >> >> >>>> ceph-users@xxxxxxxxxxxxxx
> >> >> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> --
> >> >> >>> Cheers,
> >> >> >>> Brad  
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Eugen Block                             voice   : +49-40-559 51 75
> >> >> >> NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
> >> >> >> Postfach 61 03 15
> >> >> >> D-22423 Hamburg                         e-mail  : eblock@xxxxxx
> >> >> >>
> >> >> >>         Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >> >> >>           Sitz und Registergericht: Hamburg, HRB 90934
> >> >> >>                   Vorstand: Jens-U. Mozdzen
> >> >> >>                    USt-IdNr. DE 814 013 983
> >> >> >>  
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Cheers,
> >> >> > Brad  
> >> >>
> >> >>
> >> >>  
> >> >
> >> >
> >> > --
> >> > Christian Balzer        Network/Systems Engineer
> >> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> >> > http://www.gol.com/  
> >>
> >>
> >>  
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/  
> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



