Hello, On Thu, 02 Feb 2017 10:24:53 +0100 Eugen Block wrote: > Hi, > > thank you very much for your answer! I'm not sure I get all your > points, but I'll try to dig deeper. > I'll repeat myself and say that looking at your nodes with atop during a benchmark or a slow-request situation should fire up the old light bulb. > > IOPS are more likely to be your bottleneck > > Can you give me an example of reasonable numbers for ops/s, especially > for our small cluster with 1 Gb/s? I have no idea what could be > considered as normal. > Expected IOPS? Far too many variables, in both the HW/design and the testing methodology. But it will be a lot less than you expect, with your network only coming into play for sync writes; async small 4K I/Os will saturate your disks and CPUs (in that order, reversed for SSD/NVMe based OSDs). If you had a cluster consisting of 3 identical nodes, each with 10 HDDs nominally rated at 150 IOPS, your standard Ceph cluster would give you a theoretical maximum of a mere 750 IOPS: 1500(IOPS per node)*3(nodes)/3(replica)/2(journal). In reality it's likely to be less due to FS and other overhead. > >> It's a 1 Gb/s network. > >> > > And there's your "problem". > > I read this as single link "1Gb/s" network > > Just a note on that: we are using LACP. > So actually 2Gb/s per node? Due to the hashing nature of LACP you may of course not see any benefits with a single-client test. But w/o you thoroughly analyzing things and identifying bottlenecks (is the network really saturated or are the OSDs/HDDs redlined?) this is guesswork again. > Currently, we are preparing to modify our ceph cluster to have only > three nodes instead of four, and replace the LVM based OSDs with > dedicated HDDs (one large OSD per HDD (2 TB)), would this increase the > performance with the same network or could it be even worse? Worse, due to the reduced number of nodes. See above. > Or is it > simply impossible to have an acceptable performance with a 1 Gb/s > network? Any thoughts on that would help me a lot. > Reasonable performance is something that's in the eye of the beholder. If this were for a backup/archive solution with the client(s) also on a 1Gb/s network it might be just fine if you had a sufficient number of nodes (6 comes to mind) and/or additional cluster network links. But you're trying to run random VMs on top of this storage, so IOPS again. Putting your journals on an SSD would be an immediate way to improve things, provided the HDDs are saturated. Such an SSD would need sufficient endurance (you have to determine that, but something like a DC S3610 comes to mind), but not massive speed with this kind of network. And if you can afford only one per node, it would also become a nice SPoF. Christian > Eugen > > Zitat von Christian Balzer <chibi@xxxxxxx>: > > > Hello, > > > > On Wed, 01 Feb 2017 15:16:15 +0100 Eugen Block wrote: > > > >> > You've told us absolutely nothing about your cluster > >> > >> You're right, I'll try to provide as much information as possible. > >> > >> Please note that we have kind of a "special" layout... The HDDs on > >> ndesan01 are in a RAID6, which is bcached by two SSDs in a RAID1: > >> > > > > I've dabbled in RAID backed OSDs with mixed success, but your setup > > basically tries to do the IOPS for 5 OSDs on what essentially boils down > > to a single HDD (due to the RAID6).
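To spell out the 750 IOPS estimate from further up (a minimal back-of-the-envelope sketch; the 150 IOPS per HDD is only a nominal figure, and the divisor of 2 assumes filestore journals collocated on the same spinners):

---cut here---
# theoretical small-write ceiling: 3 nodes x 10 HDDs x 150 IOPS each,
# divided by 3 for replication and by 2 for the collocated journal write
echo $(( 3 * 10 * 150 / 3 / 2 ))    # -> 750
---cut here---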
> > A HW cache or maybe your bcache setup will help with this of course, but > > once they are saturated and/or your IO patterns have competing reads and > > writes, it (performance) will essentially go to hell. > > > > I have no (successful and recent) experience with bcache, so this may or > > may not be a factor, too. > > > > Again, monitoring things with atop during a benchmark (large window, at > > least 5s frequency) on your nodes and especially this one will show you > > where the bottlenecks are very quickly. > > > >> ---cut here--- > >> ndesan01: # cat /proc/mdstat > >> Personalities : [raid1] [raid6] [raid5] [raid4] > >> md126 : active raid6 sdd1[3] sde1[4] sdf1[5] sda1[0] sdc1[2] sdb1[1] > >> 3906521088 blocks super 1.2 level 6, 512k chunk, algorithm 2 > >> [6/6] [UUUUUU] > >> bitmap: 6/8 pages [24KB], 65536KB chunk > >> > >> md127 : active raid1 sdh1[1] sdg1[0] > >> 195228864 blocks super 1.2 [2/2] [UU] > >> bitmap: 2/2 pages [8KB], 65536KB chunk > >> > >> unused devices: <none> > >> ndesan01: # pvs > >> PV VG Fmt Attr PSize PFree > >> /dev/bcache0 ndesan01 lvm2 a-- 3.64t 427.46g > >> ndesan01: # lvs | grep osd > >> cephosd0 ndesan01 -wi-ao---- 250.00g > >> > >> cephosd1 ndesan01 -wi-ao---- 250.00g > >> > >> cephosd2 ndesan01 -wi-ao---- 250.00g > >> > >> cephosd3 ndesan01 -wi-ao---- 250.00g > >> > >> cephosd4 ndesan01 -wi-ao---- 250.00g > >> ---cut here--- > >> > >> ndesan02 has the same layout. nde32 is not bcached > >> > >> nde32:~ # lvs | grep osd > >> cephosd0 ndesan02 -wi-ao---- 250,00g > >> cephosd1 ndesan02 -wi-ao---- 250,00g > >> cephosd2 ndesan02 -wi-ao---- 250,00g > >> cephosd3 ndesan02 -wi-ao---- 250,00g > >> cephosd4 ndesan02 -wi-ao---- 250,00g > >> > >> ndesan03 is a dedicated ceph node and is also bcached: > >> > >> ---cut here--- > >> ndesan03:~ # df -hl | grep osd > >> /dev/bcache0 932G 286G 646G 31% /var/ceph/osd0 > >> /dev/bcache1 932G 288G 644G 31% /var/ceph/osd1 > >> /dev/bcache2 932G 298G 634G 33% /var/ceph/osd2 > >> /dev/bcache3 932G 289G 644G 31% /var/ceph/osd3 > >> /dev/bcache4 932G 285G 647G 31% /var/ceph/osd4 > >> ---cut here--- > >> > >> This mixed setup is due to the pre-existing infrastructure on these > >> servers, we planned to migrate the environment step by step into ceph. > >> > >> ceph@ndesan01:~/ceph-deploy> ceph --version > >> ceph version 0.94.7-84-g8e6f430 > >> > > Not that it matters much, but since you may as well go to 0.94.9 aka the > > latest Hammer. > > > >> All 4 nodes have the same OS: > >> ceph@ndesan01:~/ceph-deploy> cat /etc/os-release > >> NAME="openSUSE Leap" > >> VERSION="42.1" > >> > >> ndesan01, ndesan02 and nde32 have the same kernel: > >> ceph@ndesan01:~/ceph-deploy> uname -r > >> 4.1.13-5-default > >> > > The 4.1 to 4.4 kernels have various issues, of course SUSE might have > > backported fixes. > > It often pays to run the latest (4.8, 4.9) kernels with Ceph. > > But again, not the cause of your problem. > > > >> ndesan03:~ # uname -r > >> 4.1.12-1-default > >> > >> It's a 1 Gb/s network. > >> > > And there's your "problem". > > I read this as single link "1Gb/s" network, which makes your benchmark > > numbers totally on par, as I suggested. > > > > Congestions by individual OSDs (unlucky roll of the dice that has one > > OSD being primary for one IO and secondary for other concurrent ones) will > > not help of course. 
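To make the atop suggestion above concrete, something along these lines would do; the pool name, runtime and thread count below are only placeholders:

---cut here---
# on every OSD node, in its own terminal: 5 second samples, watch which
# resource (DSK, NET, CPU) saturates first while the benchmark runs
atop 5

# meanwhile, from a client machine (not an OSD node, to avoid skewed results):
rados bench -p rbd 60 write -t 16           # default 4MB objects, bandwidth-oriented
rados bench -p rbd 60 write -b 4096 -t 16   # 4K writes, shows the IOPS limit
---cut here---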
> > > >> ndesan01: 2 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (each 6 cores) > >> > >> ndesan02: 2 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (each 6 cores) > >> > > If you're doing 4K sync/direct randwrites you'll notice that faster cores > > will be better than more cores, but for bandwidth related tests it doesn't > > matter. Of course that's against SSD/NVMe backed OSDs or at least ones > > with SSD journals. > > > > Everything below looks nominal, the lower memory on one node will make > > this one inherently slower and thus the overall bottleneck in an otherwise > > identical cluster due to less pagecache hits and thus more real HDD > > accesses. > > > > Christian > >> nde32: 2 x Intel(R) Xeon(R) CPU E5410 @ 2.33GHz (each 4 cores) > >> > >> ndesan03: 1 x Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz (4 cores) > >> > >> ceph@ndesan01:~/ceph-deploy> ceph osd tree > >> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY > >> -1 8.14980 root default > >> -2 1.19995 host ndesan01 > >> 0 0.23999 osd.0 up 1.00000 1.00000 > >> 1 0.23999 osd.1 up 1.00000 1.00000 > >> 2 0.23999 osd.2 up 1.00000 1.00000 > >> 13 0.23999 osd.13 up 1.00000 1.00000 > >> 19 0.23999 osd.19 up 1.00000 1.00000 > >> -3 1.19995 host ndesan02 > >> 3 0.23999 osd.3 up 1.00000 1.00000 > >> 4 0.23999 osd.4 up 1.00000 1.00000 > >> 5 0.23999 osd.5 up 1.00000 1.00000 > >> 9 0.23999 osd.9 up 1.00000 1.00000 > >> 10 0.23999 osd.10 up 1.00000 1.00000 > >> -4 1.19995 host nde32 > >> 6 0.23999 osd.6 up 1.00000 1.00000 > >> 7 0.23999 osd.7 up 1.00000 1.00000 > >> 8 0.23999 osd.8 up 1.00000 1.00000 > >> 12 0.23999 osd.12 up 1.00000 1.00000 > >> 11 0.23999 osd.11 up 1.00000 1.00000 > >> -5 4.54994 host ndesan03 > >> 14 0.90999 osd.14 up 1.00000 1.00000 > >> 15 0.90999 osd.15 up 1.00000 1.00000 > >> 16 0.90999 osd.16 up 1.00000 1.00000 > >> 17 0.90999 osd.17 up 1.00000 1.00000 > >> 18 0.90999 osd.18 up 1.00000 1.00000 > >> > >> ceph@ndesan01:~/ceph-deploy> ceph osd df > >> ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR > >> 0 0.23999 1.00000 249G 210G 39851M 84.43 1.59 > >> 1 0.23999 1.00000 249G 206G 44424M 82.64 1.55 > >> 2 0.23999 1.00000 249G 200G 50084M 80.43 1.51 > >> 13 0.23999 1.00000 249G 200G 50520M 80.26 1.51 > >> 19 0.23999 1.00000 249G 187G 63837M 75.05 1.41 > >> 3 0.23999 1.00000 249G 205G 45945M 82.04 1.54 > >> 4 0.23999 1.00000 249G 205G 45841M 82.08 1.54 > >> 5 0.23999 1.00000 249G 206G 44026M 82.79 1.56 > >> 9 0.23999 1.00000 249G 198G 52348M 79.54 1.50 > >> 10 0.23999 1.00000 249G 189G 62027M 75.76 1.42 > >> 6 0.23999 1.00000 249G 191G 59968M 76.56 1.44 > >> 7 0.23999 1.00000 249G 197G 53410M 79.13 1.49 > >> 8 0.23999 1.00000 249G 207G 43856M 82.86 1.56 > >> 12 0.23999 1.00000 249G 199G 51695M 79.80 1.50 > >> 11 0.23999 1.00000 249G 216G 33766M 86.80 1.63 > >> 14 0.90999 1.00000 931G 285G 645G 30.70 0.58 > >> 15 0.90999 1.00000 931G 287G 643G 30.88 0.58 > >> 16 0.90999 1.00000 931G 297G 633G 32.01 0.60 > >> 17 0.90999 1.00000 931G 288G 643G 30.93 0.58 > >> 18 0.90999 1.00000 931G 284G 646G 30.60 0.58 > >> TOTAL 8403G 4468G 3935G 53.17 > >> MIN/MAX VAR: 0.58/1.63 STDDEV: 26.41 > >> > >> I know that all OSDs should have the same size, but that wasn't my > >> decision ;-) And it will be changed in the near future. 
> >> > >> Some information on RAM: > >> > >> ---cut here--- > >> ceph@ndesan01:~/ceph-deploy> free -g > >> total used free shared buffers cached > >> Mem: 62 61 0 0 3 50 > >> -/+ buffers/cache: 8 53 > >> Swap: 0 0 0 > >> > >> ndesan02:~ # free -g > >> total used free shared buffers cached > >> Mem: 62 61 0 3 2 50 > >> -/+ buffers/cache: 9 53 > >> Swap: 0 0 0 > >> > >> nde32:~ # free -g > >> total used free shared buffers cached > >> Mem: 39 38 0 0 0 29 > >> -/+ buffers/cache: 8 30 > >> Swap: 1 1 0 > >> > >> ndesan03:~ # free -g > >> total used free shared buffers cached > >> Mem: 62 62 0 0 0 48 > >> -/+ buffers/cache: 13 49 > >> Swap: 0 0 0 > >> ---cut here--- > >> > >> > Since when does Xen support Ceph? > >> > >> I'm not sure since when exactly. We started testing Ceph as our > >> OpenStack storage backend last year in April or May, according to my > >> email history it's supported since libvirt 1.3.2 (we had a bug filed > >> for this issue). > >> > >> ceph@ndesan01:~/ceph-deploy> cat ceph.conf > >> [global] > >> fsid = 655cb05a-435a-41ba-83d9-8549f7c36167 > >> mon_initial_members = ndesan01, ndesan02, nde32 > >> mon_host = 192.168.160.15,192.168.160.16,192.168.160.17 > >> public_network = 192.168.160.0/24 > >> osd_pool_default_size = 3 > >> osd pool default min size = 2 > >> osd_crush_chooseleaf_type = 1 > >> mon_pg_warn_max_per_osd = 0 > >> auth_cluster_required = cephx > >> auth_service_required = cephx > >> auth_client_required = cephx > >> filestore_xattr_use_omap = true > >> filestore max inline xattr size = 512 > >> filestore max inline xattrs = 2 > >> rbd_default_features = 15 > >> mon data avail warn = 15 > >> mon data avail crit = 5 > >> osd_disk_thread_ioprio_class = "idle" > >> osd_disk_thread_ioprio_priority = 7 > >> osd scrub begin hour = 0 > >> osd scrub end hour = 7 > >> osd_scrub_sleep = 0.1 > >> osd_deep_scrub_interval = 2419200 > >> > >> [mon] > >> mon osd allow primary affinity = true > >> > >> We changed the deep-scrub interval because we use a cronjob to execute > >> deep scrubs in the evening, to avoid performance impact during working > >> hours. > >> > >> Please let me know if I missed anything. I really appreciate you > >> looking into this. > >> > >> Regards, > >> Eugen > >> > >> > >> Zitat von Christian Balzer <chibi@xxxxxxx>: > >> > >> > Hello, > >> > > >> > On Wed, 01 Feb 2017 11:43:02 +0100 Eugen Block wrote: > >> > > >> >> Hi, > >> >> > >> >> I haven't tracked the slow requests yet, but I ran some performance > >> >> tests. Although I'm not an expert, I believe the results are quite > >> >> unsatisfying. > >> >> > >> > You've told us absolutely nothing about your cluster that would be > >> > relevant to determine if your results are actually to be expected or not. > >> > > >> > Anything, and I mean anything is relevant, from OS, kernel, Ceph versions > >> > to a detailed description of ALL HW in there, CPU, RAM, SSD, HDD, network, > >> > topology, etc. > >> > > >> >> I ran a couple of rados bench tests in a pool, with different > >> >> replication sizes (1 to 3). The tests were executed on one of the ceph > >> >> nodes, which also is a monitor. > >> >> > >> > Benchmarks, including rados bench run from a cluster node that also hosts > >> > OSDs will give you skewed results. 
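On that note, and regarding the fio numbers quoted below: without the full command line they can't really be compared. A 4K direct/sync randwrite run from inside a guest might look like this sketch (test file, size and runtime are placeholders):

---cut here---
# worst-case small synchronous random writes, issued from within the VM
fio --name=randwrite --filename=/root/fio.test --size=1G \
    --rw=randwrite --bs=4k --direct=1 --sync=1 \
    --iodepth=1 --numjobs=1 --runtime=60 --time_based
---cut here---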
> >> > > >> >> Results for size 1: > >> >> > >> >> Total time run: 20.428947 > >> >> Total writes made: 639 > >> >> Write size: 4194304 > >> >> Bandwidth (MB/sec): 125.117 > >> >> > >> > This and the following results would be on par with a smallish, 1Gb/s > >> > network HDD based cluster. > >> > But we don't know if that's what you're running or not... > >> > > >> >> Results for size 2: > >> >> > >> >> Total time run: 20.741990 > >> >> Total writes made: 454 > >> >> Write size: 4194304 > >> >> Bandwidth (MB/sec): 87.552 > >> >> > >> >> Still looks nice, at least we would be totally fine with this throughput. > >> >> > >> > As mentioned countless times here, throughput is most likely your least > >> > worry with Ceph, IOPS are more likely to be your bottleneck. > >> > > >> >> Results for size 3: > >> >> > >> >> Total time run: 22.934567 > >> >> Total writes made: 252 > >> >> Write size: 4194304 > >> >> Bandwidth (MB/sec): 43.951 > >> >> > >> >> I guess this is still ok for a replication size of 3. But the real > >> >> problem seems to be the client throughput. > >> >> > >> >> Running fio on a Xen-VM inside rbd: > >> >> > >> > Since when does Xen support Ceph? > >> > As in, how are you running a Xen VM that uses Ceph? > >> > > >> >> Run status group 0 (all jobs): > >> >> WRITE: io=1024.0MB, aggrb=21776KB/s, minb=21776KB/s, > >> >> maxb=21776KB/s, mint=48151msec, maxt=48151msec > >> >> > >> >> And this is one of the better results, we often have values of 12MB/s > >> >> or even lower, e.g. on a KVM-VM: > >> >> > >> >> Run status group 0 (all jobs): > >> >> WRITE: io=512000KB, aggrb=6425KB/s, minb=6425KB/s, maxb=6425KB/s, > >> >> mint=79683msec, maxt=79683msec > >> >> > >> > > >> > Fio results w/o the full command line parameters are basically worthless, > >> > I'm afraid. > >> > > >> >> This unacceptable! > >> >> > >> >> I use almost all defaults in our ceph.conf, > >> > > >> > What did you change that could be relevant then? > >> > > >> >> I've been reading > >> >> everything I could find on Google, but I don't understand why the > >> >> clients are so slow. Is it a valid assumption, that the slow requests > >> >> could be caused by our slow clients? > >> >> > >> > > >> > Slow requests are caused typically when your cluster is not capable of > >> > keeping up. > >> > > >> >> One of the things I tried was > >> >> > >> >> vm1:~ # echo 4096 > /sys/block/xvda/queue/read_ahead_kb > >> >> > >> > Only affects reads. > >> > > >> >> but this had no effect. I tried it with different numbers of PGs, but > >> >> that also didn't have any effect. > >> >> > >> > Again, w/o knowing your cluster and these numbers, that information is of > >> > little value. > >> > > >> >> What else can I do to increase the performance of my instances or to > >> >> analyse the bottleneck? > >> >> > >> > Understand your HW/cluster. > >> > > >> > Monitor the nodes with atop etc, use collectd and graphite, etc, to keep > >> > records of what happened when and to visualize things. > >> > > >> > Christian > >> > > >> >> Regards, > >> >> Eugen > >> >> > >> >> > >> >> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>: > >> >> > >> >> > Check the latency figures in a "perf dump". High numbers in a > >> >> > particular area may help you nail it. > >> >> > > >> >> > I suspect though, that it may come down to enabling debug logging and > >> >> > tracking a slow request through the logs. 
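For reference, those dumps come from the OSD admin socket and have to be queried on the host that runs the OSD in question; osd.16 and the default socket path below are just examples:

---cut here---
ceph daemon osd.16 perf dump            # latency counters
ceph daemon osd.16 dump_ops_in_flight   # ops currently blocked
ceph daemon osd.16 dump_historic_ops    # recently completed (slow) ops

# the same via the socket path, if "ceph daemon" is not available
ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok dump_historic_ops
---cut here---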
> >> >> > > >> >> > On Thu, Jan 12, 2017 at 8:41 PM, Eugen Block <eblock@xxxxxx> wrote: > >> >> >> Hi, > >> >> >> > >> >> >>> Looking at the output of dump_historic_ops and dump_ops_in_flight > >> >> >> > >> >> >> > >> >> >> I waited for new slow request messages and dumped the > >> historic_ops into a > >> >> >> file. The reporting OSD shows lots of "waiting for rw locks" > >> >> messages and a > >> >> >> duration of more than 30 secs: > >> >> >> > >> >> >> "age": 366.044746, > >> >> >> "duration": 32.491506, > >> >> >> "type_data": [ > >> >> >> "commit sent; apply or cleanup", > >> >> >> { > >> >> >> "client": "client.9664429", > >> >> >> "tid": 130439910 > >> >> >> }, > >> >> >> [ > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:15.227649", > >> >> >> "event": "initiated" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:15.232310", > >> >> >> "event": "reached_pg" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:15.232341", > >> >> >> "event": "waiting for rw locks" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:15.268819", > >> >> >> "event": "reached_pg" > >> >> >> }, > >> >> >> [ > >> >> >> . > >> >> >> . > >> >> >> . > >> >> >> ] > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:45.515055", > >> >> >> "event": "waiting for rw locks" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:46.921095", > >> >> >> "event": "reached_pg" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:46.921157", > >> >> >> "event": "started" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:46.921342", > >> >> >> "event": "waiting for subops from 9,15" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:46.921724", > >> >> >> "event": "commit_queued_for_journal_write" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:46.922186", > >> >> >> "event": "write_thread_in_journal_buffer" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:46.931103", > >> >> >> "event": "sub_op_commit_rec" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:46.968730", > >> >> >> "event": "sub_op_commit_rec" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:47.717770", > >> >> >> "event": "journaled_completion_queued" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:47.718280", > >> >> >> "event": "op_commit" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:47.718359", > >> >> >> "event": "commit_sent" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:47.718890", > >> >> >> "event": "op_applied" > >> >> >> }, > >> >> >> { > >> >> >> "time": "2017-01-12 10:38:47.719154", > >> >> >> "event": "done" > >> >> >> } > >> >> >> > >> >> >> > >> >> >> There were about 70 events "waiting for rw locks", I truncated > >> >> the output. > >> >> >> Based on the message "waiting for subops from 9,15" I also dumped the > >> >> >> historic_ops for these two OSDs. > >> >> >> > >> >> >> Duration on OSD.9 > >> >> >> > >> >> >> "initiated_at": "2017-01-12 10:38:29.258221", > >> >> >> "age": 54.069919, > >> >> >> "duration": 20.831568, > >> >> >> > >> >> >> Duration on OSD.15 > >> >> >> > >> >> >> "initiated_at": "2017-01-12 10:38:23.695098", > >> >> >> "age": 112.118210, > >> >> >> "duration": 26.452526, > >> >> >> > >> >> >> They also contain many "waiting for rw locks" messages, but not > >> >> as much as > >> >> >> the dump from the reporting OSD. 
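A quick, non-conclusive way to check whether the network plays into those "waiting for rw locks" delays is to look for errors, drops and retransmits on the OSD nodes; the interface and bond names below are placeholders:

---cut here---
ip -s link show eth0                     # per-interface error/drop counters
ethtool -S eth0 | egrep -i 'err|drop'    # NIC statistics, if the driver exposes them
cat /proc/net/bonding/bond0              # LACP bond and slave state
netstat -s | grep -i retrans             # TCP retransmissions
---cut here---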
> >> >> >> To me it seems that because two OSDs take a lot of time to > >> process their > >> >> >> requests (only slightly less than 30 secs), it sums up to more > >> >> than 30 secs > >> >> >> on the reporting (primary?) OSD. Is the reporting OSD always > >> the primary? > >> >> >> > >> >> >> How can I debug this further? I searched the web for "waiting > >> >> for rw locks", > >> >> >> I also found Wido's blog [1] about my exact problem, but I'm not > >> >> sure how to > >> >> >> continue. Our admin says our network should be fine, but what > >> can I do to > >> >> >> rule that out? > >> >> >> > >> >> >> I don't think I have provided information about our cluster yet: > >> >> >> > >> >> >> 4 nodes, 3 mons, 20 OSDs on > >> >> >> ceph version 0.94.7-84-g8e6f430 > >> >> (8e6f430683e4d8293e31fd4eb6cb09be96960cfa) > >> >> >> > >> >> >> [1] > >> >> >> > >> >> > >> https://blog.widodh.nl/2016/01/slow-requests-with-ceph-waiting-for-rw-locks/ > >> >> >> > >> >> >> Thanks! > >> >> >> Eugen > >> >> >> > >> >> >> > >> >> >> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>: > >> >> >> > >> >> >> > >> >> >>> On Thu, Jan 12, 2017 at 2:19 AM, Eugen Block <eblock@xxxxxx> wrote: > >> >> >>>> > >> >> >>>> Hi, > >> >> >>>> > >> >> >>>> I simply grepped for "slow request" in ceph.log. What exactly > >> >> do you mean > >> >> >>>> by > >> >> >>>> "effective OSD"? > >> >> >>>> > >> >> >>>> If I have this log line: > >> >> >>>> 2017-01-11 [...] osd.16 [...] cluster [WRN] slow request 32.868141 > >> >> >>>> seconds > >> >> >>>> old, received at 2017-01-11 [...] > >> ack+ondisk+write+known_if_redirected > >> >> >>>> e12440) currently waiting for subops from 0,12 > >> >> >>>> > >> >> >>>> I assumed that osd.16 is the one causing problems. > >> >> >>> > >> >> >>> > >> >> >>> 0 and/or 12 in this case. 16 is just waiting. Looking at the > >> output of > >> >> >>> dump_historic_ops and dump_ops_in_flight on the implicated osds may > >> >> >>> help to narrow down what the issue is. > >> >> >>> > >> >> >>>> But now that you mention > >> >> >>>> the subops, I only noticed them yesterday, but didn't have > >> the time yet > >> >> >>>> to > >> >> >>>> investigate further. I'll have a look into the subops messages > >> >> and report > >> >> >>>> back. > >> >> >>>> > >> >> >>>> Thanks! > >> >> >>>> Eugen > >> >> >>>> > >> >> >>>> Zitat von Burkhard Linke > >> >> >>>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>: > >> >> >>>> > >> >> >>>> > >> >> >>>>> Hi, > >> >> >>>>> > >> >> >>>>> > >> >> >>>>> just for clarity: > >> >> >>>>> > >> >> >>>>> > >> >> >>>>> Did you parse the slow request messages and use the effective > >> >> OSD in the > >> >> >>>>> statistics? Some message may refer to other OSDs, e.g. > >> >> "waiting for sub > >> >> >>>>> op > >> >> >>>>> on OSD X,Y". The reporting OSD is not the root cause in > >> that case, but > >> >> >>>>> one > >> >> >>>>> of the mentioned OSDs (and I'm currently not aware of a method to > >> >> >>>>> determine > >> >> >>>>> which of the both OSD is the cause in case of 3 replicates.....). 
> >> >> >>>>> > >> >> >>>>> > >> >> >>>>> Regards, > >> >> >>>>> > >> >> >>>>> Burkhard > >> >> >>>>> > >> >> >>>>> > >> >> >>>>> _______________________________________________ > >> >> >>>>> ceph-users mailing list > >> >> >>>>> ceph-users@xxxxxxxxxxxxxx > >> >> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> >>>> > >> >> >>>> > >> >> >>>> > >> >> >>>> > >> >> >>>> > >> >> >>>> -- > >> >> >>>> Eugen Block voice : +49-40-559 51 75 > >> >> >>>> NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77 > >> >> >>>> Postfach 61 03 15 > >> >> >>>> D-22423 Hamburg e-mail : eblock@xxxxxx > >> >> >>>> > >> >> >>>> Vorsitzende des Aufsichtsrates: Angelika Mozdzen > >> >> >>>> Sitz und Registergericht: Hamburg, HRB 90934 > >> >> >>>> Vorstand: Jens-U. Mozdzen > >> >> >>>> USt-IdNr. DE 814 013 983 > >> >> >>>> > >> >> >>>> _______________________________________________ > >> >> >>>> ceph-users mailing list > >> >> >>>> ceph-users@xxxxxxxxxxxxxx > >> >> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> >>> > >> >> >>> > >> >> >>> > >> >> >>> > >> >> >>> -- > >> >> >>> Cheers, > >> >> >>> Brad > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> -- > >> >> >> Eugen Block voice : +49-40-559 51 75 > >> >> >> NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77 > >> >> >> Postfach 61 03 15 > >> >> >> D-22423 Hamburg e-mail : eblock@xxxxxx > >> >> >> > >> >> >> Vorsitzende des Aufsichtsrates: Angelika Mozdzen > >> >> >> Sitz und Registergericht: Hamburg, HRB 90934 > >> >> >> Vorstand: Jens-U. Mozdzen > >> >> >> USt-IdNr. DE 814 013 983 > >> >> >> > >> >> > > >> >> > > >> >> > > >> >> > -- > >> >> > Cheers, > >> >> > Brad > >> >> > >> >> > >> >> > >> > > >> > > >> > -- > >> > Christian Balzer Network/Systems Engineer > >> > chibi@xxxxxxx Global OnLine Japan/Rakuten Communications > >> > http://www.gol.com/ > >> > >> > >> > > > > > > -- > > Christian Balzer Network/Systems Engineer > > chibi@xxxxxxx Global OnLine Japan/Rakuten Communications > > http://www.gol.com/ > > > -- Christian Balzer Network/Systems Engineer chibi@xxxxxxx Global OnLine Japan/Rakuten Communications http://www.gol.com/ _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com