Hey Samuel,

CPU-usage still seems a bit higher, but not equally so on every OSD. I profiled the node whose OSD uses the most CPU. Note the libleveldb-related entries right at the top; the Cuttlefish-OSD doesn't show those at all. Could they be related to the problem?

OSD version 0.67.1-10-g47c8949 (47c89497b7f69cbf1557cd05b89837c388e2ba2f) on my production-cluster:

  15.11%  [kernel]              [k] intel_idle
   8.61%  libleveldb.so.1.9     [.] 0x3b1b5
   7.47%  libc-2.11.3.so        [.] memcmp
   5.47%  [kernel]              [k] find_busiest_group
   3.97%  kvm                   [.] 0x1fb2d0
   2.90%  libleveldb.so.1.9     [.] leveldb::InternalKeyComparator::Compar
   1.72%  libc-2.11.3.so        [.] memcpy
   1.56%  [kernel]              [k] _raw_spin_lock
   1.49%  libleveldb.so.1.9     [.] leveldb::Block::Iter::Next()
   1.48%  libsnappy.so.1.1.2    [.] snappy::RawUncompress(snappy::Source*,
   1.15%  libstdc++.so.6.0.13   [.] std::string::_M_mutate(unsigned long,
   1.00%  [kernel]              [k] default_send_IPI_mask_sequence_phys
   1.00%  [kernel]              [k] hrtimer_interrupt
   0.99%  [kernel]              [k] __hrtimer_start_range_ns
   0.92%  [kernel]              [k] cpumask_next_and
   0.91%  [kernel]              [k] __schedule
   0.90%  [kernel]              [k] find_next_bit
   0.88%  [kernel]              [k] do_select
   0.86%  [kernel]              [k] _raw_spin_unlock_irqrestore
   0.84%  [kernel]              [k] clockevents_program_event
   0.83%  [kernel]              [k] fget_light
   0.76%  libstdc++.so.6.0.13   [.] std::string::append(char const*, unsig
   0.76%  [kernel]              [k] _raw_spin_lock_irqsave
   0.71%  [kernel]              [k] apic_timer_interrupt
   0.69%  [kernel]              [k] rcu_needs_cpu
   0.62%  [kernel]              [k] sync_inodes_sb
   0.60%  ceph-osd              [.] PGLog::undirty()
   0.58%  [kvm_intel]           [k] vmx_vcpu_run
   0.56%  [kernel]              [k] native_write_cr0
   0.54%  libpthread-2.11.3.so  [.] pthread_mutex_lock
   0.54%  [kernel]              [k] native_write_msr_safe
   0.51%  [kernel]              [k] load_balance


Regards,

  Oliver

On Wed, 2013-08-21 at 12:05 -0700, Samuel Just wrote:
> There haven't been any significant osd side changes that I can think of. Is cpu usage still high? If so, can you post the profiler results again?
> -Sam
>
> On Wed, Aug 21, 2013 at 12:02 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> > Hey Samuel,
> >
> > I had a good run on the production-cluster with it and unfortunately, it still doesn't seem to have solved the problem. It seemed OK for a while and individual OSD CPU-usage seemed quite low, but as the cluster's load increased during the day, things got slower again. Write-performance within a VM crawled to 30MB/sec and at some point, I got only 10MB/sec on reads in that same VM. I also did RADOS bench-tests with `rados --pool rbd bench 120 write' and those got several hundred MB/sec on the same cluster at the same time of day, so maybe the problem is RBD-related. Is there any code in the OSD that could influence RBD-performance alone? Do you know of any other significant changes to the OSD between Cuttlefish and Dumpling that could result in this?
> >
> > PS: I also did the same RADOS bench-tests on my test-cluster, both with Cuttlefish and Dumpling without your fix, and got almost identical results. This supports Mark's suggestion that the problem might be in RBD.
> >
> >
> > Regards,
> >
> >   Oliver
> >
> > On 20-08-13 19:40, Samuel Just wrote:
> >> Can you try dumpling head without the option?
> >> -Sam
> >>
> >> On Tue, Aug 20, 2013 at 1:44 AM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> >>> Hey Mark,
> >>>
> >>> Sorry, but after some more tests I have to report that it only worked partly.
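[A minimal sketch of the comparison Oliver describes above, for reference: the pool name "rbd" and the 120-second duration come from his message; the in-VM target file, block size and count are placeholders.]

  # RADOS-level write benchmark, run directly against the cluster:
  rados --pool rbd bench 120 write

  # Rough RBD-level comparison: a simple linear write from inside a guest VM:
  dd if=/dev/zero of=/root/ddtest.bin bs=1M count=4096 conv=fsync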
> >>> The load seems lower with "wip-dumpling-pglog-undirty" in place, but the Cuttlefish-osd still seems significantly faster, and even with "wip-dumpling-pglog-undirty" in place, things slow down way too much under load. Unfortunately, only my production-cluster seems busy enough to actually show the problem clearly by slowing down. Below is `perf top'-output, fresh from my production-cluster under its regular load:
> >>>
> >>> First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout = false":
> >>>
> >>>   16.53%  [kernel]              [k] intel_idle
> >>>    6.47%  libleveldb.so.1.9     [.] 0x380a1
> >>>    5.76%  [kernel]              [k] find_busiest_group
> >>>    4.11%  libc-2.11.3.so        [.] memcmp
> >>>    3.95%  kvm                   [.] 0x1f6f31
> >>>    2.05%  [kernel]              [k] default_send_IPI_mask_sequence_phys
> >>>    2.03%  [kernel]              [k] _raw_spin_lock
> >>>    1.87%  libleveldb.so.1.9     [.] leveldb::InternalKeyComparator::Compar
> >>>    1.57%  libc-2.11.3.so        [.] memcpy
> >>>    1.37%  libleveldb.so.1.9     [.] leveldb::Block::Iter::Next()
> >>>    1.26%  [kernel]              [k] hrtimer_interrupt
> >>>    1.12%  [kernel]              [k] __hrtimer_start_range_ns
> >>>    1.09%  [kernel]              [k] native_write_cr0
> >>>    1.05%  libstdc++.so.6.0.13   [.] std::string::_M_mutate(unsigned long,
> >>>    1.00%  [kernel]              [k] native_write_msr_safe
> >>>    0.99%  [kernel]              [k] apic_timer_interrupt
> >>>    0.98%  [kernel]              [k] clockevents_program_event
> >>>    0.96%  [kernel]              [k] _raw_spin_unlock_irqrestore
> >>>    0.95%  ceph-osd              [.] PGLog::undirty()
> >>>    0.92%  [kernel]              [k] find_next_bit
> >>>    0.91%  libsnappy.so.1.1.2    [.] snappy::RawUncompress(snappy::Source*,
> >>>    0.88%  [kernel]              [k] __schedule
> >>>    0.87%  [kernel]              [k] cpumask_next_and
> >>>    0.84%  [kernel]              [k] do_select
> >>>    0.80%  [kernel]              [k] fget_light
> >>>    0.77%  [kernel]              [k] reschedule_interrupt
> >>>    0.75%  [kernel]              [k] _raw_spin_lock_irqsave
> >>>    0.62%  libstdc++.so.6.0.13   [.] std::string::append(char const*, unsig
> >>>    0.59%  [kvm_intel]           [k] vmx_vcpu_run
> >>>    0.58%  [kernel]              [k] copy_user_generic_string
> >>>    0.56%  [kernel]              [k] load_balance
> >>>    0.54%  [kernel]              [k] tg_load_down
> >>>    0.53%  libpthread-2.11.3.so  [.] pthread_mutex_lock
> >>>    0.52%  [kernel]              [k] sync_inodes_sb
> >>>
> >>> Second, the 0.61.8 osd, under identical load:
> >>>
> >>>   21.51%  [kernel]              [k] intel_idle
> >>>    6.66%  [kernel]              [k] find_busiest_group
> >>>    6.25%  kvm                   [.] 0x2d214b
> >>>    1.97%  [kernel]              [k] _raw_spin_lock
> >>>    1.47%  [kernel]              [k] native_write_msr_safe
> >>>    1.44%  [kernel]              [k] hrtimer_interrupt
> >>>    1.37%  [kernel]              [k] __hrtimer_start_range_ns
> >>>    1.34%  [kernel]              [k] do_select
> >>>    1.29%  [kernel]              [k] fget_light
> >>>    1.24%  [kernel]              [k] clockevents_program_event
> >>>    1.21%  [kernel]              [k] default_send_IPI_mask_sequence_phys
> >>>    1.18%  [kernel]              [k] cpumask_next_and
> >>>    1.18%  [kernel]              [k] _raw_spin_unlock_irqrestore
> >>>    1.15%  [kernel]              [k] find_next_bit
> >>>    1.14%  [kernel]              [k] __schedule
> >>>    1.11%  [kernel]              [k] _raw_spin_lock_irqsave
> >>>    0.98%  [kernel]              [k] apic_timer_interrupt
> >>>    0.86%  [kernel]              [k] copy_user_generic_string
> >>>    0.83%  [kernel]              [k] native_write_cr0
> >>>    0.76%  [kernel]              [k] sync_inodes_sb
> >>>    0.71%  [kernel]              [k] rcu_needs_cpu
> >>>    0.69%  libpthread-2.11.3.so  [.] pthread_mutex_lock
> >>>    0.66%  [kernel]              [k] fput
> >>>    0.62%  [kernel]              [k] load_balance
> >>>    0.57%  [vdso]                [.] 0x7fff3a976700
> >>>    0.56%  libc-2.11.3.so        [.] memcpy
> >>>    0.56%  [kernel]              [k] reschedule_interrupt
> >>>    0.56%  [kernel]              [k] tg_load_down
> >>>    0.50%  [kernel]              [k] iput
> >>>
> >>> I see lots of new differences, but again don't know what to make of them or what might be related or significant. LevelDB seems to jump out this time, amongst others. Let me know if you need more info.
> >>>
> >>>
> >>> Regards,
> >>>
> >>>   Oliver
> >>>
> >>> On Mon, 2013-08-19 at 15:21 -0500, Mark Nelson wrote:
> >>>> Hi Oliver,
> >>>>
> >>>> Glad that helped! How much more efficient do the cuttlefish OSDs seem at this point (with wip-dumpling-pglog-undirty)? On modern Intel platforms we were actually hoping to see CPU usage go down in many cases due to the use of hardware CRC32 instructions.
> >>>>
> >>>> Mark
> >>>>
> >>>> On 08/19/2013 03:06 PM, Oliver Daudey wrote:
> >>>>> Hey Samuel,
> >>>>>
> >>>>> Thanks! I installed your version, repeated the same tests on my test-cluster and the extra CPU-loading seems to have disappeared. Then I replaced one osd of my production-cluster with your modified version and its config-option, and it seems to be a lot less CPU-hungry now. Although the Cuttlefish-osds still seem to be even more CPU-efficient, your changes have definitely helped a lot. We seem to be looking in the right direction, at least for this part of the problem.
> >>>>>
> >>>>> BTW, I ran `perf top' on the production-node with your modified osd and didn't see anything osd-related stand out on top. "PGLog::undirty()" was in there, but with much lower usage, right at the bottom of the green part of the output.
> >>>>>
> >>>>> Many thanks for your help so far!
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>>   Oliver
> >>>>>
> >>>>> On Mon, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
> >>>>>> You're right, PGLog::undirty() looks suspicious. I just pushed a branch wip-dumpling-pglog-undirty with a new config (osd_debug_pg_log_writeout) which, if set to false, will disable some strictly-debugging checks which occur in PGLog::undirty(). We haven't actually seen these checks causing excessive cpu usage, so this may be a red herring.
> >>>>>> -Sam
> >>>>>>
> >>>>>> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> >>>>>>> Hey Mark,
> >>>>>>>
> >>>>>>> On Sat, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
> >>>>>>>> On 08/17/2013 06:13 AM, Oliver Daudey wrote:
> >>>>>>>>> Hey all,
> >>>>>>>>>
> >>>>>>>>> This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I created in the tracker. Thought I would pass it through the list as well, to get an idea if anyone else is running into it. It may only show under higher loads. More info about my setup is in the bug-report above. Here goes:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd and mds. I'm using RBD on this cluster as storage for KVM; CephFS is unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100+MB/sec on simple linear writes to a file with `dd' inside a VM on this cluster under regular load, and the osds usually averaged 20-100% CPU-utilisation in `top'.
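[A sketch of how the osd_debug_pg_log_writeout option Sam describes above could be set on an OSD host; it only exists on builds that carry the wip-dumpling-pglog-undirty change, the [osd]-section placement is an assumption, and the OSD needs a restart to pick it up.]

  # /etc/ceph/ceph.conf on the OSD host
  [osd]
      osd debug pg log writeout = false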
> >>>>>>>>> After the upgrade to Dumpling, CPU-usage for the osds shot up to 100% to 400% in `top' (multi-core system) and the speed of my writes with `dd' inside a VM dropped to 20-40MB/sec. Users complained that disk-access inside the VMs was significantly slower, and the backups of the RBD-store I was running also got behind quickly.
> >>>>>>>>>
> >>>>>>>>> After downgrading only the osds to v0.61.7 Cuttlefish and leaving the rest at 0.67 Dumpling, speed and load returned to normal. I have repeated this performance-hit upon upgrade on a similar test-cluster under no additional load at all. Although CPU-usage for the osds wasn't as dramatic during these tests because there was no base-load from other VMs, I/O-performance dropped significantly after upgrading during these tests as well, and returned to normal after downgrading the osds.
> >>>>>>>>>
> >>>>>>>>> I'm not sure what to make of it. There are no visible errors in the logs and everything runs and reports good health; it's just a lot slower, with a lot more CPU-usage.
> >>>>>>>>
> >>>>>>>> Hi Oliver,
> >>>>>>>>
> >>>>>>>> If you have access to the perf command on this system, could you try running:
> >>>>>>>>
> >>>>>>>> "sudo perf top"
> >>>>>>>>
> >>>>>>>> And if that doesn't give you much,
> >>>>>>>>
> >>>>>>>> "sudo perf record -g"
> >>>>>>>>
> >>>>>>>> then:
> >>>>>>>>
> >>>>>>>> "sudo perf report | less"
> >>>>>>>>
> >>>>>>>> during the period of high CPU usage? This will give you a call graph. There may be symbols missing, but it might help track down what the OSDs are doing.
> >>>>>>>
> >>>>>>> Thanks for your help! I did a couple of runs on my test-cluster, loading it with writes from 3 VMs concurrently and measuring the results at the first node, with all 0.67 Dumpling-components and with the osds replaced by 0.61.7 Cuttlefish. I let `perf top' run and settle for a while, then copied anything that showed in red and green into this post. Here are the results (sorry for the word-wraps):
> >>>>>>>
> >>>>>>> First, with 0.61.7 osds:
> >>>>>>>
> >>>>>>>   19.91%  [kernel]             [k] intel_idle
> >>>>>>>   10.18%  [kernel]             [k] _raw_spin_lock_irqsave
> >>>>>>>    6.79%  ceph-osd             [.] ceph_crc32c_le
> >>>>>>>    4.93%  [kernel]             [k] default_send_IPI_mask_sequence_phys
> >>>>>>>    2.71%  [kernel]             [k] copy_user_generic_string
> >>>>>>>    1.42%  libc-2.11.3.so       [.] memcpy
> >>>>>>>    1.23%  [kernel]             [k] find_busiest_group
> >>>>>>>    1.13%  librados.so.2.0.0    [.] ceph_crc32c_le_intel
> >>>>>>>    1.11%  [kernel]             [k] _raw_spin_lock
> >>>>>>>    0.99%  kvm                  [.] 0x1931f8
> >>>>>>>    0.92%  [igb]                [k] igb_poll
> >>>>>>>    0.87%  [kernel]             [k] native_write_cr0
> >>>>>>>    0.80%  [kernel]             [k] csum_partial
> >>>>>>>    0.78%  [kernel]             [k] __do_softirq
> >>>>>>>    0.63%  [kernel]             [k] hpet_legacy_next_event
> >>>>>>>    0.53%  [ip_tables]          [k] ipt_do_table
> >>>>>>>    0.50%  libc-2.11.3.so       [.] 0x74433
> >>>>>>>
> >>>>>>> Second test, with 0.67 osds:
> >>>>>>>
> >>>>>>>   18.32%  [kernel]             [k] intel_idle
> >>>>>>>    7.58%  [kernel]             [k] _raw_spin_lock_irqsave
> >>>>>>>    7.04%  ceph-osd             [.] PGLog::undirty()
> >>>>>>>    4.39%  ceph-osd             [.] ceph_crc32c_le_intel
> >>>>>>>    3.92%  [kernel]             [k] default_send_IPI_mask_sequence_phys
> >>>>>>>    2.25%  [kernel]             [k] copy_user_generic_string
> >>>>>>>    1.76%  libc-2.11.3.so       [.] memcpy
> >>>>>>>    1.56%  librados.so.2.0.0    [.] ceph_crc32c_le_intel
> >>>>>>>    1.40%  libc-2.11.3.so       [.] vfprintf
> >>>>>>>    1.12%  libc-2.11.3.so       [.] 0x7217b
> >>>>>>>    1.05%  [kernel]             [k] _raw_spin_lock
> >>>>>>>    1.01%  [kernel]             [k] find_busiest_group
> >>>>>>>    0.83%  kvm                  [.] 0x193ab8
> >>>>>>>    0.80%  [kernel]             [k] native_write_cr0
> >>>>>>>    0.76%  [kernel]             [k] __do_softirq
> >>>>>>>    0.73%  libc-2.11.3.so       [.] _IO_default_xsputn
> >>>>>>>    0.70%  [kernel]             [k] csum_partial
> >>>>>>>    0.68%  [igb]                [k] igb_poll
> >>>>>>>    0.58%  [kernel]             [k] hpet_legacy_next_event
> >>>>>>>    0.54%  [kernel]             [k] __schedule
> >>>>>>>
> >>>>>>> What jumps right out is "PGLog::undirty()", which doesn't show up with 0.61.7 at all, but is an extra drag right at top-usage in 0.67. Note that I didn't manage to fully load the test-cluster CPU-wise because of network-constraints, and I don't want to take any extra risks by testing on the production-cluster, but it seems we found a possible culprit.
> >>>>>>>
> >>>>>>> Any ideas? Thanks again!
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>>   Oliver
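[Mark's profiling suggestion from earlier in the thread, gathered into one runnable sketch; limiting the capture to the ceph-osd processes with -p/pgrep and the 60-second window are assumptions on top of his system-wide commands.]

  # Live, system-wide view while CPU usage is high:
  sudo perf top

  # Or record call graphs for the local ceph-osd processes and inspect later:
  sudo perf record -g -p "$(pgrep -d, ceph-osd)" -- sleep 60
  sudo perf report | less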