Hey Mark,

I'm working on a simple way to reproduce this in case we fail to identify it, and `rados bench' indeed looks promising, as it can also simulate many concurrent ops, which is probably what sets the 3-VM load on my test-cluster apart from the +-80-VM load on my production-cluster.  I'll look into it and try to get it reproduced on as simple a setup as possible, thanks!  The command I used in the VMs is a simple `dd if=/dev/zero of=<somefile> bs=1024M count=2'.  Samuel asked me to try the dumpling head-branch, so I'll do that first.  Thanks!

Regards,

   Oliver
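
(As a sketch of such a reproduction, a rados bench run along the lines Mark suggests below could look roughly like this; the pool name "rbd", the 4 MB object size and the 80 concurrent ops are only placeholders to tune towards the production workload:)

    # ~80 concurrent 4 MB object writes for 120 seconds, with no RBD code involved
    rados -p rbd -b 4194304 bench 120 write -t 80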

On 20-08-13 19:53, Mark Nelson wrote:
> Hi Oliver,
>
> Have you ever tried using RADOS bench?
>
> On one of the client nodes, you can do:
>
> rados -p <pool> -b <size in bytes> bench write 120 -t <concurrent ops>
>
> It would be useful to know if that is also slower without any of the RBD
> code involved.  Btw, what was the dd command line you were using for
> testing?
>
> Thanks,
> Mark
>
> On 08/20/2013 03:44 AM, Oliver Daudey wrote:
>> Hey Mark,
>>
>> Sorry, but after some more tests I have to report that it only worked
>> partly.  The load seems lower with "wip-dumpling-pglog-undirty" in
>> place, but the Cuttlefish-osd still seems significantly faster, and even
>> with "wip-dumpling-pglog-undirty" in place, things slow down way too
>> much under load.  Unfortunately, only my production-cluster seems busy
>> enough to actually show the problem clearly by slowing down.  Below is
>> `perf top'-output, fresh from my production-cluster under its regular
>> load:
>>
>> First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout = false":
>>
>>   16.53%  [kernel]               [k] intel_idle
>>    6.47%  libleveldb.so.1.9      [.] 0x380a1
>>    5.76%  [kernel]               [k] find_busiest_group
>>    4.11%  libc-2.11.3.so         [.] memcmp
>>    3.95%  kvm                    [.] 0x1f6f31
>>    2.05%  [kernel]               [k] default_send_IPI_mask_sequence_phys
>>    2.03%  [kernel]               [k] _raw_spin_lock
>>    1.87%  libleveldb.so.1.9      [.] leveldb::InternalKeyComparator::Compar
>>    1.57%  libc-2.11.3.so         [.] memcpy
>>    1.37%  libleveldb.so.1.9      [.] leveldb::Block::Iter::Next()
>>    1.26%  [kernel]               [k] hrtimer_interrupt
>>    1.12%  [kernel]               [k] __hrtimer_start_range_ns
>>    1.09%  [kernel]               [k] native_write_cr0
>>    1.05%  libstdc++.so.6.0.13    [.] std::string::_M_mutate(unsigned long,
>>    1.00%  [kernel]               [k] native_write_msr_safe
>>    0.99%  [kernel]               [k] apic_timer_interrupt
>>    0.98%  [kernel]               [k] clockevents_program_event
>>    0.96%  [kernel]               [k] _raw_spin_unlock_irqrestore
>>    0.95%  ceph-osd               [.] PGLog::undirty()
>>    0.92%  [kernel]               [k] find_next_bit
>>    0.91%  libsnappy.so.1.1.2     [.] snappy::RawUncompress(snappy::Source*,
>>    0.88%  [kernel]               [k] __schedule
>>    0.87%  [kernel]               [k] cpumask_next_and
>>    0.84%  [kernel]               [k] do_select
>>    0.80%  [kernel]               [k] fget_light
>>    0.77%  [kernel]               [k] reschedule_interrupt
>>    0.75%  [kernel]               [k] _raw_spin_lock_irqsave
>>    0.62%  libstdc++.so.6.0.13    [.] std::string::append(char const*, unsig
>>    0.59%  [kvm_intel]            [k] vmx_vcpu_run
>>    0.58%  [kernel]               [k] copy_user_generic_string
>>    0.56%  [kernel]               [k] load_balance
>>    0.54%  [kernel]               [k] tg_load_down
>>    0.53%  libpthread-2.11.3.so   [.] pthread_mutex_lock
>>    0.52%  [kernel]               [k] sync_inodes_sb
>>
>> Second, the 0.61.8 osd, under identical load:
>>
>>   21.51%  [kernel]               [k] intel_idle
>>    6.66%  [kernel]               [k] find_busiest_group
>>    6.25%  kvm                    [.] 0x2d214b
>>    1.97%  [kernel]               [k] _raw_spin_lock
>>    1.47%  [kernel]               [k] native_write_msr_safe
>>    1.44%  [kernel]               [k] hrtimer_interrupt
>>    1.37%  [kernel]               [k] __hrtimer_start_range_ns
>>    1.34%  [kernel]               [k] do_select
>>    1.29%  [kernel]               [k] fget_light
>>    1.24%  [kernel]               [k] clockevents_program_event
>>    1.21%  [kernel]               [k] default_send_IPI_mask_sequence_phys
>>    1.18%  [kernel]               [k] cpumask_next_and
>>    1.18%  [kernel]               [k] _raw_spin_unlock_irqrestore
>>    1.15%  [kernel]               [k] find_next_bit
>>    1.14%  [kernel]               [k] __schedule
>>    1.11%  [kernel]               [k] _raw_spin_lock_irqsave
>>    0.98%  [kernel]               [k] apic_timer_interrupt
>>    0.86%  [kernel]               [k] copy_user_generic_string
>>    0.83%  [kernel]               [k] native_write_cr0
>>    0.76%  [kernel]               [k] sync_inodes_sb
>>    0.71%  [kernel]               [k] rcu_needs_cpu
>>    0.69%  libpthread-2.11.3.so   [.] pthread_mutex_lock
>>    0.66%  [kernel]               [k] fput
>>    0.62%  [kernel]               [k] load_balance
>>    0.57%  [vdso]                 [.] 0x7fff3a976700
>>    0.56%  libc-2.11.3.so         [.] memcpy
>>    0.56%  [kernel]               [k] reschedule_interrupt
>>    0.56%  [kernel]               [k] tg_load_down
>>    0.50%  [kernel]               [k] iput
>>
>> I see lots of new differences, but again don't know what to make of them
>> or what might be related or significant.  LevelDB seems to jump out this
>> time, amongst others.  Let me know if you need more info.
>>
>>
>> Regards,
>>
>>    Oliver
>>
>> On ma, 2013-08-19 at 15:21 -0500, Mark Nelson wrote:
>>> Hi Oliver,
>>>
>>> Glad that helped!  How much more efficient do the cuttlefish OSDs seem
>>> at this point (with wip-dumpling-pglog-undirty)?  On modern Intel
>>> platforms we were actually hoping to see CPU usage go down in many cases
>>> due to the use of hardware CRC32 instructions.
>>>
>>> Mark
>>>
>>> On 08/19/2013 03:06 PM, Oliver Daudey wrote:
>>>> Hey Samuel,
>>>>
>>>> Thanks!  I installed your version, repeated the same tests on my
>>>> test-cluster and the extra CPU-loading seems to have disappeared.  Then
>>>> I replaced one osd of my production-cluster with your modified version
>>>> and its config-option, and it seems to be a lot less CPU-hungry now.
>>>> Although the Cuttlefish-osds still seem to be even more CPU-efficient,
>>>> your changes have definitely helped a lot.  We seem to be looking in the
>>>> right direction, at least for this part of the problem.
>>>>
>>>> BTW, I ran `perf top' on the production-node with your modified osd and
>>>> didn't see anything osd-related stand out on top.  "PGLog::undirty()"
>>>> was in there, but with much lower usage, right at the bottom of the
>>>> green part of the output.
>>>>
>>>> Many thanks for your help so far!
>>>>
>>>>
>>>> Regards,
>>>>
>>>>    Oliver
>>>>
>>>> On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
>>>>> You're right, PGLog::undirty() looks suspicious.  I just pushed a
>>>>> branch wip-dumpling-pglog-undirty with a new config
>>>>> (osd_debug_pg_log_writeout) which, if set to false, will disable some
>>>>> strictly debugging checks which occur in PGLog::undirty().  We haven't
>>>>> actually seen these checks causing excessive cpu usage, so this may be
>>>>> a red herring.
>>>>> -Sam
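
(For reference, with that branch installed, disabling the check amounts to a ceph.conf entry on the OSD nodes followed by an OSD restart, or an injectargs call against a single OSD for a quick test; the osd id 0 is only an example, and whether this particular option is picked up at runtime is an assumption worth verifying:)

    # ceph.conf on each OSD node, followed by an osd restart
    [osd]
        osd debug pg log writeout = false

    # or, tentatively, at runtime on one osd
    ceph tell osd.0 injectargs '--osd_debug_pg_log_writeout=false'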
>>>>>
>>>>> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>>>>> Hey Mark,
>>>>>>
>>>>>> On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
>>>>>>> On 08/17/2013 06:13 AM, Oliver Daudey wrote:
>>>>>>>> Hey all,
>>>>>>>>
>>>>>>>> This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
>>>>>>>> created in the tracker.  Thought I would pass it through the list as
>>>>>>>> well, to get an idea if anyone else is running into it.  It may only
>>>>>>>> show under higher loads.  More info about my setup is in the
>>>>>>>> bug-report above.  Here goes:
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
>>>>>>>> and mds.  I'm using RBD on this cluster as storage for KVM; CephFS is
>>>>>>>> unused at this time.  While still on v0.61.7 Cuttlefish, I got 70-100
>>>>>>>> +MB/sec on simple linear writes to a file with `dd' inside a VM on this
>>>>>>>> cluster under regular load, and the osds usually averaged 20-100%
>>>>>>>> CPU-utilisation in `top'.  After the upgrade to Dumpling, CPU-usage for
>>>>>>>> the osds shot up to 100% to 400% in `top' (multi-core system) and the
>>>>>>>> speed of my writes with `dd' inside a VM dropped to 20-40MB/sec.  Users
>>>>>>>> complained that disk-access inside the VMs was significantly slower and
>>>>>>>> the backups of the RBD-store I was running also got behind quickly.
>>>>>>>>
>>>>>>>> After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
>>>>>>>> rest at 0.67 Dumpling, speed and load returned to normal.  I have
>>>>>>>> repeated this performance-hit upon upgrade on a similar test-cluster
>>>>>>>> under no additional load at all.  Although CPU-usage for the osds wasn't
>>>>>>>> as dramatic during these tests because there was no base-load from other
>>>>>>>> VMs, I/O-performance dropped significantly after upgrading during these
>>>>>>>> tests as well, and returned to normal after downgrading the osds.
>>>>>>>>
>>>>>>>> I'm not sure what to make of it.  There are no visible errors in the
>>>>>>>> logs and everything runs and reports good health; it's just a lot
>>>>>>>> slower, with a lot more CPU-usage.
>>>>>>>
>>>>>>> Hi Oliver,
>>>>>>>
>>>>>>> If you have access to the perf command on this system, could you try
>>>>>>> running:
>>>>>>>
>>>>>>> "sudo perf top"
>>>>>>>
>>>>>>> And if that doesn't give you much,
>>>>>>>
>>>>>>> "sudo perf record -g"
>>>>>>>
>>>>>>> then:
>>>>>>>
>>>>>>> "sudo perf report | less"
>>>>>>>
>>>>>>> during the period of high CPU usage?  This will give you a call graph.
>>>>>>> There may be symbols missing, but it might help track down what the OSDs
>>>>>>> are doing.
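
(On the profiling itself: perf can also be restricted to the OSD process alone, which keeps kvm and general node activity out of the samples; the nodes here run one osd each, so pidof should return a single pid, and the awk is only a guard against multiple osds:)

    # system-wide, as suggested above
    sudo perf top

    # or limited to one ceph-osd process
    OSD_PID=$(pidof ceph-osd | awk '{print $1}')
    sudo perf top -p $OSD_PID
    sudo perf record -g -p $OSD_PID -- sleep 60
    sudo perf report | less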
>>>>>>
>>>>>> Thanks for your help!  I did a couple of runs on my test-cluster,
>>>>>> loading it with writes from 3 VMs concurrently and measuring the results
>>>>>> at the first node, with all 0.67 Dumpling-components and with the osds
>>>>>> replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle for a
>>>>>> while, then copied anything that showed in red and green into this post.
>>>>>> Here are the results (sorry for the word-wraps):
>>>>>>
>>>>>> First, with 0.61.7 osds:
>>>>>>
>>>>>>   19.91%  [kernel]               [k] intel_idle
>>>>>>   10.18%  [kernel]               [k] _raw_spin_lock_irqsave
>>>>>>    6.79%  ceph-osd               [.] ceph_crc32c_le
>>>>>>    4.93%  [kernel]               [k] default_send_IPI_mask_sequence_phys
>>>>>>    2.71%  [kernel]               [k] copy_user_generic_string
>>>>>>    1.42%  libc-2.11.3.so         [.] memcpy
>>>>>>    1.23%  [kernel]               [k] find_busiest_group
>>>>>>    1.13%  librados.so.2.0.0      [.] ceph_crc32c_le_intel
>>>>>>    1.11%  [kernel]               [k] _raw_spin_lock
>>>>>>    0.99%  kvm                    [.] 0x1931f8
>>>>>>    0.92%  [igb]                  [k] igb_poll
>>>>>>    0.87%  [kernel]               [k] native_write_cr0
>>>>>>    0.80%  [kernel]               [k] csum_partial
>>>>>>    0.78%  [kernel]               [k] __do_softirq
>>>>>>    0.63%  [kernel]               [k] hpet_legacy_next_event
>>>>>>    0.53%  [ip_tables]            [k] ipt_do_table
>>>>>>    0.50%  libc-2.11.3.so         [.] 0x74433
>>>>>>
>>>>>> Second test, with 0.67 osds:
>>>>>>
>>>>>>   18.32%  [kernel]               [k] intel_idle
>>>>>>    7.58%  [kernel]               [k] _raw_spin_lock_irqsave
>>>>>>    7.04%  ceph-osd               [.] PGLog::undirty()
>>>>>>    4.39%  ceph-osd               [.] ceph_crc32c_le_intel
>>>>>>    3.92%  [kernel]               [k] default_send_IPI_mask_sequence_phys
>>>>>>    2.25%  [kernel]               [k] copy_user_generic_string
>>>>>>    1.76%  libc-2.11.3.so         [.] memcpy
>>>>>>    1.56%  librados.so.2.0.0      [.] ceph_crc32c_le_intel
>>>>>>    1.40%  libc-2.11.3.so         [.] vfprintf
>>>>>>    1.12%  libc-2.11.3.so         [.] 0x7217b
>>>>>>    1.05%  [kernel]               [k] _raw_spin_lock
>>>>>>    1.01%  [kernel]               [k] find_busiest_group
>>>>>>    0.83%  kvm                    [.] 0x193ab8
>>>>>>    0.80%  [kernel]               [k] native_write_cr0
>>>>>>    0.76%  [kernel]               [k] __do_softirq
>>>>>>    0.73%  libc-2.11.3.so         [.] _IO_default_xsputn
>>>>>>    0.70%  [kernel]               [k] csum_partial
>>>>>>    0.68%  [igb]                  [k] igb_poll
>>>>>>    0.58%  [kernel]               [k] hpet_legacy_next_event
>>>>>>    0.54%  [kernel]               [k] __schedule
>>>>>>
>>>>>>
>>>>>> What jumps right out is "PGLog::undirty()", which doesn't show up
>>>>>> with 0.61.7 at all, but is an extra drag right at top-usage in 0.67.
>>>>>> Note that I didn't manage to fully load the test-cluster CPU-wise,
>>>>>> because of network-constraints, and I don't want to take any extra
>>>>>> risks on the production-cluster and test it there, but it seems we
>>>>>> found a possible culprit.
>>>>>>
>>>>>> Any ideas?  Thanks again!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>>    Oliver

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com