Hey Samuel, I repeated the same test several times before my post and just now 2 more times. It holds up and is also repeatable in reverse order, with the same results. Remember, I restart all OSDs between tests, so any caches should get destroyed and besides, I'm writing. That shouldn't involve very large-scale caching, right? I also waited for the cluster to come to a healthy state after restarting the OSDs, so it's not related to rebalancing or peering-activity, either. Regards, Oliver On wo, 2013-08-21 at 14:07 -0700, Samuel Just wrote: > Try it again in the reverse order, I strongly suspect caching effects. > -Sam > > On Wed, Aug 21, 2013 at 1:34 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: > > Hey Samuel, > > > > Finally got it reproduced on my test-cluster, which was otherwise > > unloaded at the time. First, with Dumpling: > > > > # rbd create --size 102400 test > > # ceph-osd --version > > ceph version 0.67.1-6-g0c4f2f3 > > (0c4f2f34b78b634efe7f4d56694e2edeeda5a130) > > # rbd bench-write test > > bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq > > SEC OPS OPS/SEC BYTES/SEC > > 1 204 200.67 592272.58 > > 2 401 200.24 607462.05 > > 3 625 204.62 639655.53 > > 4 824 205.20 636488.76 > > 5 1026 204.08 630587.77 > > 6 1217 202.69 628965.39 > > 7 1429 203.19 633671.84 > > 8 1626 202.93 635419.90 > > 9 1837 203.11 630421.40 > > 10 2045 204.16 631364.11 > > 11 2247 204.12 625833.91 > > 12 2456 204.29 629952.57 > > 13 2655 204.15 634639.37 > > 14 2888 206.15 637373.53 > > 15 3115 207.12 640826.18 > > 16 3330 207.75 640120.38 > > 17 3529 207.57 636277.18 > > 18 3750 208.24 635036.47 > > 19 3965 208.65 631118.13 > > 20 4216 210.75 638197.93 > > 21 4534 215.83 656890.46 > > 22 4809 218.50 666612.75 > > 23 5076 220.63 678678.36 > > 24 5320 221.39 686760.90 > > 25 5651 226.02 701345.26 > > 26 5967 229.44 706839.16 > > 27 6271 232.05 707958.62 > > 28 6522 232.82 709890.37 > > 29 6773 233.37 711718.21 > > 30 7151 238.23 722924.18 > > > > Then I replaced the OSD with Cuttlefish on all nodes and restarted it > > everywhere. > > > > # ceph-osd --version > > ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff) > > # rbd bench-write test > > bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq > > SEC OPS OPS/SEC BYTES/SEC > > 1 556 555.63 1915665.58 > > 2 1095 545.08 1839139.42 > > 3 1696 564.45 1840316.22 > > 4 2312 577.25 1832641.60 > > 5 2870 573.46 1800558.64 > > 6 3461 576.81 1780993.33 > > 7 4062 579.26 1821246.36 > > 8 4678 581.51 1862003.10 > > 9 5269 585.12 1902219.78 > > 10 5883 587.11 1914700.43 > > 11 6471 587.59 1913208.96 > > 12 7022 585.12 1914375.79 > > 13 7568 580.15 1899650.34 > > 14 8122 579.87 1864572.53 > > 15 8659 576.18 1826927.83 > > 16 9259 576.81 1796653.40 > > 17 9826 577.60 1785328.57 > > 18 10401 576.30 1777479.51 > > 19 10962 576.94 1777431.02 > > 20 11399 569.84 1756633.76 > > 21 11937 567.45 1740147.43 > > 22 12491 567.47 1726288.46 > > 23 13109 569.92 1727347.20 > > 24 13629 567.64 1721996.79 > > 25 14153 566.11 1720782.06 > > 26 14669 563.23 1716275.34 > > 27 15191 562.61 1717677.74 > > 28 15768 562.49 1713059.35 > > 29 16281 561.40 1707289.23 > > 31 16719 536.96 1628452.43 > > ^C > > > > As you can see, the difference is quite dramatic and this confirms it's > > probably something RBD-related. The test-cluster is a standard 3-node > > cluster with 3xMON, 3xMDS, 3xOSD, tied together with 2x1Gbit/s ethernet. > > I replaced only the OSD-binary for these tests. The rest of the > > Ceph-installation is at Dumpling. > > > > > > Regards, > > > > Oliver > > > > On wo, 2013-08-21 at 12:05 -0700, Samuel Just wrote: > >> There haven't been any significant osd side changes that I can think > >> of. Is cpu usage still high? If so, can you post the profiler > >> results again? > >> -Sam > >> > >> On Wed, Aug 21, 2013 at 12:02 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: > >> > Hey Samuel, > >> > > >> > I had a good run on the production-cluster with it and unfortunately, it > >> > still doesn't seem to have solved the problem. It seemed OK for a while > >> > and individual OSD CPU-usage seemed quite low, but as the cluster's load > >> > increased during the day, things got slower again. Write-performance > >> > within a VM crawled to 30MB/sec and at some point, I got only 10MB/sec > >> > on reads in that same VM. I also did RADOS bench-tests with `rados > >> > --pool rbd bench 120 write' and those got several hundreds of MB's/sec > >> > on the same cluster at the same time of day, so maybe the problem is > >> > RBD-related. Is there any code in the OSD that could influence > >> > RBD-performance alone? Do you know of any other significant changes to > >> > the OSD between Cuttlefish and Dumpling that could result in this? > >> > > >> > PS: I also did the same RADOS bench-tests on my test-cluster, both with > >> > Cuttlefish and Dumpling without your fix and got almost identical > >> > results. This confirms that the problem might be in RBD, as Mark suggested. > >> > > >> > > >> > Regards, > >> > > >> > Oliver > >> > > >> > On 20-08-13 19:40, Samuel Just wrote: > >> >> Can you try dumpling head without the option? > >> >> -Sam > >> >> > >> >> On Tue, Aug 20, 2013 at 1:44 AM, Oliver Daudey <oliver@xxxxxxxxx> wrote: > >> >>> Hey Mark, > >> >>> > >> >>> Sorry, but after some more tests I have to report that it only worked > >> >>> partly. The load seems lower with "wip-dumpling-pglog-undirty" in > >> >>> place, but the Cuttlefish-osd still seems significantly faster and even > >> >>> with "wip-dumpling-pglog-undirty" in place, things slow down way too > >> >>> much under load. Unfortunately, only my production-cluster seems busy > >> >>> enough to actually show the problem clearly by slowing down. Below is > >> >>> `perf top'-output, fresh from my production-cluster under it's regular > >> >>> load: > >> >>> > >> >>> First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout = > >> >>> false": > >> >>> 16.53% [kernel] [k] > >> >>> intel_idle > >> >>> 6.47% libleveldb.so.1.9 [.] > >> >>> 0x380a1 > >> >>> 5.76% [kernel] [k] > >> >>> find_busiest_group > >> >>> 4.11% libc-2.11.3.so [.] > >> >>> memcmp > >> >>> 3.95% kvm [.] > >> >>> 0x1f6f31 > >> >>> 2.05% [kernel] [k] > >> >>> default_send_IPI_mask_sequence_phys > >> >>> 2.03% [kernel] [k] > >> >>> _raw_spin_lock > >> >>> 1.87% libleveldb.so.1.9 [.] > >> >>> leveldb::InternalKeyComparator::Compar > >> >>> 1.57% libc-2.11.3.so [.] > >> >>> memcpy > >> >>> 1.37% libleveldb.so.1.9 [.] > >> >>> leveldb::Block::Iter::Next() > >> >>> 1.26% [kernel] [k] > >> >>> hrtimer_interrupt > >> >>> 1.12% [kernel] [k] > >> >>> __hrtimer_start_range_ns > >> >>> 1.09% [kernel] [k] > >> >>> native_write_cr0 > >> >>> 1.05% libstdc++.so.6.0.13 [.] > >> >>> std::string::_M_mutate(unsigned long, > >> >>> 1.00% [kernel] [k] > >> >>> native_write_msr_safe > >> >>> 0.99% [kernel] [k] > >> >>> apic_timer_interrupt > >> >>> 0.98% [kernel] [k] > >> >>> clockevents_program_event > >> >>> 0.96% [kernel] [k] > >> >>> _raw_spin_unlock_irqrestore > >> >>> 0.95% ceph-osd [.] > >> >>> PGLog::undirty() > >> >>> 0.92% [kernel] [k] > >> >>> find_next_bit > >> >>> 0.91% libsnappy.so.1.1.2 [.] > >> >>> snappy::RawUncompress(snappy::Source*, > >> >>> 0.88% [kernel] [k] > >> >>> __schedule > >> >>> 0.87% [kernel] [k] > >> >>> cpumask_next_and > >> >>> 0.84% [kernel] [k] > >> >>> do_select > >> >>> 0.80% [kernel] [k] > >> >>> fget_light > >> >>> 0.77% [kernel] [k] > >> >>> reschedule_interrupt > >> >>> 0.75% [kernel] [k] > >> >>> _raw_spin_lock_irqsave > >> >>> 0.62% libstdc++.so.6.0.13 [.] std::string::append(char > >> >>> const*, unsig > >> >>> 0.59% [kvm_intel] [k] > >> >>> vmx_vcpu_run > >> >>> 0.58% [kernel] [k] > >> >>> copy_user_generic_string > >> >>> 0.56% [kernel] [k] > >> >>> load_balance > >> >>> 0.54% [kernel] [k] > >> >>> tg_load_down > >> >>> 0.53% libpthread-2.11.3.so [.] > >> >>> pthread_mutex_lock > >> >>> 0.52% [kernel] [k] sync_inodes_sb > >> >>> > >> >>> Second, the 0.61.8 osd, under identical load: > >> >>> 21.51% [kernel] [k] > >> >>> intel_idle > >> >>> 6.66% [kernel] [k] > >> >>> find_busiest_group > >> >>> 6.25% kvm [.] > >> >>> 0x2d214b > >> >>> 1.97% [kernel] [k] > >> >>> _raw_spin_lock > >> >>> 1.47% [kernel] [k] > >> >>> native_write_msr_safe > >> >>> 1.44% [kernel] [k] > >> >>> hrtimer_interrupt > >> >>> 1.37% [kernel] [k] > >> >>> __hrtimer_start_range_ns > >> >>> 1.34% [kernel] [k] > >> >>> do_select > >> >>> 1.29% [kernel] [k] > >> >>> fget_light > >> >>> 1.24% [kernel] [k] > >> >>> clockevents_program_event > >> >>> 1.21% [kernel] [k] > >> >>> default_send_IPI_mask_sequence_phys > >> >>> 1.18% [kernel] [k] > >> >>> cpumask_next_and > >> >>> 1.18% [kernel] [k] > >> >>> _raw_spin_unlock_irqrestore > >> >>> 1.15% [kernel] [k] > >> >>> find_next_bit > >> >>> 1.14% [kernel] [k] > >> >>> __schedule > >> >>> 1.11% [kernel] [k] > >> >>> _raw_spin_lock_irqsave > >> >>> 0.98% [kernel] [k] > >> >>> apic_timer_interrupt > >> >>> 0.86% [kernel] [k] > >> >>> copy_user_generic_string > >> >>> 0.83% [kernel] [k] > >> >>> native_write_cr0 > >> >>> 0.76% [kernel] [k] > >> >>> sync_inodes_sb > >> >>> 0.71% [kernel] [k] > >> >>> rcu_needs_cpu > >> >>> 0.69% libpthread-2.11.3.so [.] > >> >>> pthread_mutex_lock > >> >>> 0.66% [kernel] [k] > >> >>> fput > >> >>> 0.62% [kernel] [k] > >> >>> load_balance > >> >>> 0.57% [vdso] [.] > >> >>> 0x7fff3a976700 > >> >>> 0.56% libc-2.11.3.so [.] > >> >>> memcpy > >> >>> 0.56% [kernel] [k] > >> >>> reschedule_interrupt > >> >>> 0.56% [kernel] [k] > >> >>> tg_load_down > >> >>> 0.50% [kernel] [k] iput > >> >>> > >> >>> I see lots of new differences, but again don't know what to make of it > >> >>> and what might be related or significant. LevelDB seems to jump out > >> >>> this time, amongst others. Let me know if you need more info. > >> >>> > >> >>> > >> >>> Regards, > >> >>> > >> >>> Oliver > >> >>> > >> >>> On ma, 2013-08-19 at 15:21 -0500, Mark Nelson wrote: > >> >>>> Hi Oliver, > >> >>>> > >> >>>> Glad that helped! How much more efficient do the cuttlefish OSDs seem > >> >>>> at this point (with wip-dumpling-pglog-undirty)? On modern Intel > >> >>>> platforms we were actually hoping to see CPU usage go down in many cases > >> >>>> due to the use of hardware CRC32 instructions. > >> >>>> > >> >>>> Mark > >> >>>> > >> >>>> On 08/19/2013 03:06 PM, Oliver Daudey wrote: > >> >>>>> Hey Samuel, > >> >>>>> > >> >>>>> Thanks! I installed your version, repeated the same tests on my > >> >>>>> test-cluster and the extra CPU-loading seems to have disappeared. Then > >> >>>>> I replaced one osd of my production-cluster with your modified version > >> >>>>> and it's config-option and it seems to be a lot less CPU-hungry now. > >> >>>>> Although the Cuttlefish-osds still seem to be even more CPU-efficient, > >> >>>>> your changes have definitely helped a lot. We seem to be looking in the > >> >>>>> right direction, at least for this part of the problem. > >> >>>>> > >> >>>>> BTW, I ran `perf top' on the production-node with your modified osd and > >> >>>>> didn't see anything osd-related stand out on top. "PGLog::undirty()" > >> >>>>> was in there, but with much lower usage, right at the bottom of the > >> >>>>> green part of the output. > >> >>>>> > >> >>>>> Many thanks for your help so far! > >> >>>>> > >> >>>>> > >> >>>>> Regards, > >> >>>>> > >> >>>>> Oliver > >> >>>>> > >> >>>>> On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote: > >> >>>>>> You're right, PGLog::undirty() looks suspicious. I just pushed a > >> >>>>>> branch wip-dumpling-pglog-undirty with a new config > >> >>>>>> (osd_debug_pg_log_writeout) which if set to false will disable some > >> >>>>>> strictly debugging checks which occur in PGLog::undirty(). We haven't > >> >>>>>> actually seen these checks causing excessive cpu usage, so this may be > >> >>>>>> a red herring. > >> >>>>>> -Sam > >> >>>>>> > >> >>>>>> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: > >> >>>>>>> Hey Mark, > >> >>>>>>> > >> >>>>>>> On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote: > >> >>>>>>>> On 08/17/2013 06:13 AM, Oliver Daudey wrote: > >> >>>>>>>>> Hey all, > >> >>>>>>>>> > >> >>>>>>>>> This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I > >> >>>>>>>>> created in the tracker. Thought I would pass it through the list as > >> >>>>>>>>> well, to get an idea if anyone else is running into it. It may only > >> >>>>>>>>> show under higher loads. More info about my setup is in the bug-report > >> >>>>>>>>> above. Here goes: > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd > >> >>>>>>>>> and mds. I'm using RBD on this cluster as storage for KVM, CephFS is > >> >>>>>>>>> unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100 > >> >>>>>>>>> +MB/sec on simple linear writes to a file with `dd' inside a VM on this > >> >>>>>>>>> cluster under regular load and the osds usually averaged 20-100% > >> >>>>>>>>> CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for > >> >>>>>>>>> the osds shot up to 100% to 400% in `top' (multi-core system) and the > >> >>>>>>>>> speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users > >> >>>>>>>>> complained that disk-access inside the VMs was significantly slower and > >> >>>>>>>>> the backups of the RBD-store I was running, also got behind quickly. > >> >>>>>>>>> > >> >>>>>>>>> After downgrading only the osds to v0.61.7 Cuttlefish and leaving the > >> >>>>>>>>> rest at 0.67 Dumpling, speed and load returned to normal. I have > >> >>>>>>>>> repeated this performance-hit upon upgrade on a similar test-cluster > >> >>>>>>>>> under no additional load at all. Although CPU-usage for the osds wasn't > >> >>>>>>>>> as dramatic during these tests because there was no base-load from other > >> >>>>>>>>> VMs, I/O-performance dropped significantly after upgrading during these > >> >>>>>>>>> tests as well, and returned to normal after downgrading the osds. > >> >>>>>>>>> > >> >>>>>>>>> I'm not sure what to make of it. There are no visible errors in the logs > >> >>>>>>>>> and everything runs and reports good health, it's just a lot slower, > >> >>>>>>>>> with a lot more CPU-usage. > >> >>>>>>>> > >> >>>>>>>> Hi Oliver, > >> >>>>>>>> > >> >>>>>>>> If you have access to the perf command on this system, could you try > >> >>>>>>>> running: > >> >>>>>>>> > >> >>>>>>>> "sudo perf top" > >> >>>>>>>> > >> >>>>>>>> And if that doesn't give you much, > >> >>>>>>>> > >> >>>>>>>> "sudo perf record -g" > >> >>>>>>>> > >> >>>>>>>> then: > >> >>>>>>>> > >> >>>>>>>> "sudo perf report | less" > >> >>>>>>>> > >> >>>>>>>> during the period of high CPU usage? This will give you a call graph. > >> >>>>>>>> There may be symbols missing, but it might help track down what the OSDs > >> >>>>>>>> are doing. > >> >>>>>>> > >> >>>>>>> Thanks for your help! I did a couple of runs on my test-cluster, > >> >>>>>>> loading it with writes from 3 VMs concurrently and measuring the results > >> >>>>>>> at the first node with all 0.67 Dumpling-components and with the osds > >> >>>>>>> replaced by 0.61.7 Cuttlefish. I let `perf top' run and settle for a > >> >>>>>>> while, then copied anything that showed in red and green into this post. > >> >>>>>>> Here are the results (sorry for the word-wraps): > >> >>>>>>> > >> >>>>>>> First, with 0.61.7 osds: > >> >>>>>>> > >> >>>>>>> 19.91% [kernel] [k] intel_idle > >> >>>>>>> 10.18% [kernel] [k] _raw_spin_lock_irqsave > >> >>>>>>> 6.79% ceph-osd [.] ceph_crc32c_le > >> >>>>>>> 4.93% [kernel] [k] > >> >>>>>>> default_send_IPI_mask_sequence_phys > >> >>>>>>> 2.71% [kernel] [k] copy_user_generic_string > >> >>>>>>> 1.42% libc-2.11.3.so [.] memcpy > >> >>>>>>> 1.23% [kernel] [k] find_busiest_group > >> >>>>>>> 1.13% librados.so.2.0.0 [.] ceph_crc32c_le_intel > >> >>>>>>> 1.11% [kernel] [k] _raw_spin_lock > >> >>>>>>> 0.99% kvm [.] 0x1931f8 > >> >>>>>>> 0.92% [igb] [k] igb_poll > >> >>>>>>> 0.87% [kernel] [k] native_write_cr0 > >> >>>>>>> 0.80% [kernel] [k] csum_partial > >> >>>>>>> 0.78% [kernel] [k] __do_softirq > >> >>>>>>> 0.63% [kernel] [k] hpet_legacy_next_event > >> >>>>>>> 0.53% [ip_tables] [k] ipt_do_table > >> >>>>>>> 0.50% libc-2.11.3.so [.] 0x74433 > >> >>>>>>> > >> >>>>>>> Second test, with 0.67 osds: > >> >>>>>>> > >> >>>>>>> 18.32% [kernel] [k] intel_idle > >> >>>>>>> 7.58% [kernel] [k] _raw_spin_lock_irqsave > >> >>>>>>> 7.04% ceph-osd [.] PGLog::undirty() > >> >>>>>>> 4.39% ceph-osd [.] ceph_crc32c_le_intel > >> >>>>>>> 3.92% [kernel] [k] > >> >>>>>>> default_send_IPI_mask_sequence_phys > >> >>>>>>> 2.25% [kernel] [k] copy_user_generic_string > >> >>>>>>> 1.76% libc-2.11.3.so [.] memcpy > >> >>>>>>> 1.56% librados.so.2.0.0 [.] ceph_crc32c_le_intel > >> >>>>>>> 1.40% libc-2.11.3.so [.] vfprintf > >> >>>>>>> 1.12% libc-2.11.3.so [.] 0x7217b > >> >>>>>>> 1.05% [kernel] [k] _raw_spin_lock > >> >>>>>>> 1.01% [kernel] [k] find_busiest_group > >> >>>>>>> 0.83% kvm [.] 0x193ab8 > >> >>>>>>> 0.80% [kernel] [k] native_write_cr0 > >> >>>>>>> 0.76% [kernel] [k] __do_softirq > >> >>>>>>> 0.73% libc-2.11.3.so [.] _IO_default_xsputn > >> >>>>>>> 0.70% [kernel] [k] csum_partial > >> >>>>>>> 0.68% [igb] [k] igb_poll > >> >>>>>>> 0.58% [kernel] [k] hpet_legacy_next_event > >> >>>>>>> 0.54% [kernel] [k] __schedule > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> What jumps right out, is the "PGLog::undirty()", which doesn't show up > >> >>>>>>> with 0.61.7 at all, but is an extra drag right at top-usage in 0.67. > >> >>>>>>> Note that I didn't manage to fully load the test-cluster CPU-wise, > >> >>>>>>> because of network-constraints and I don't want to take any extra risks > >> >>>>>>> on the production-cluster and test it there, but it seems we found a > >> >>>>>>> possible culprit. > >> >>>>>>> > >> >>>>>>> Any ideas? Thanks again! > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> Regards, > >> >>>>>>> > >> >>>>>>> Oliver > >> >>>>>>> > >> >>>>>>> _______________________________________________ > >> >>>>>>> ceph-users mailing list > >> >>>>>>> ceph-users@xxxxxxxxxxxxxx > >> >>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >>>>>> > >> >>>>> > >> >>>>> > >> >>>>> _______________________________________________ > >> >>>>> ceph-users mailing list > >> >>>>> ceph-users@xxxxxxxxxxxxxx > >> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >>>>> > >> >>>> > >> >>>> _______________________________________________ > >> >>>> ceph-users mailing list > >> >>>> ceph-users@xxxxxxxxxxxxxx > >> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >>>> > >> >>> > >> >>> > >> >>> _______________________________________________ > >> >>> ceph-users mailing list > >> >>> ceph-users@xxxxxxxxxxxxxx > >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> > >> > > >> > > > > > _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com