Hey Sage, I'm all for it and will help testing. Regards, Oliver On 22-08-13 17:23, Sage Weil wrote: > We should perhaps hack the old (cuttlefish and earlier) flushing behavior > into the new code so that we can confirm that it is really the writeback > that is causing the problem and not something else... > > sage > > On Thu, 22 Aug 2013, Oliver Daudey wrote: > >> Hey Samuel, >> >> On wo, 2013-08-21 at 20:27 -0700, Samuel Just wrote: >>> I think the rbd cache one you'd need to run for a few minutes to get >>> meaningful results. It should stabilize somewhere around the actual >>> throughput of your hardware. >> >> Ok, I now also ran this test on Cuttlefish as well as Dumpling. >> >> Cuttlefish: >> # rbd bench-write test --rbd-cache >> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq >> SEC OPS OPS/SEC BYTES/SEC >> 1 13265 13252.45 3466029.67 >> 2 25956 12975.60 3589479.95 >> 3 38475 12818.61 3598590.70 >> 4 50184 12545.16 3530516.34 >> 5 59263 11852.22 3292258.13 >> <...> >> 300 3421530 11405.08 3191555.35 >> 301 3430755 11397.83 3189251.09 >> 302 3443345 11401.73 3190694.98 >> 303 3455230 11403.37 3191478.97 >> 304 3467014 11404.62 3192136.82 >> 305 3475355 11394.57 3189525.71 >> 306 3488067 11398.90 3190553.96 >> 307 3499789 11399.96 3190770.21 >> 308 3510566 11397.93 3190289.49 >> 309 3519829 11390.98 3188620.93 >> 310 3532539 11395.25 3189544.03 >> >> Dumpling: >> # rbd bench-write test --rbd-cache >> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq >> SEC OPS OPS/SEC BYTES/SEC >> 1 13201 13194.63 3353004.50 >> 2 25926 12957.05 3379695.03 >> 3 36624 12206.06 3182087.11 >> 4 46547 11635.35 3035794.95 >> 5 59290 11856.27 3090389.79 >> <...> >> 300 3405215 11350.66 3130092.00 >> 301 3417789 11354.76 3131106.34 >> 302 3430067 11357.83 3131933.41 >> 303 3438792 11349.14 3129734.88 >> 304 3450237 11349.45 3129689.62 >> 305 3462840 11353.53 3130406.43 >> 306 3473151 11350.17 3128942.32 >> 307 3482327 11343.00 3126771.34 >> 308 3495020 11347.44 3127502.07 >> 309 3506894 11349.13 3127781.70 >> 310 3516532 11343.65 3126714.62 >> >> As you can see, the result is virtually identical. What jumps out >> during the cached tests, is that the CPU used by the OSDs is negligible >> in both cases, while without caching, the OSDs get loaded quite well. >> Perhaps the cache masks the problem we're seeing in Dumpling somehow? >> And I'm not changing anything but the OSD-binary during my tests, so >> cache-settings used in VMs are identical in both scenarios. >> >>> >>> Hmm, 10k ios I guess is only 10 rbd chunks. What replication level >>> are you using? Try setting them to 10000000 (you only need to set the >>> xfs ones). >>> >>> For the rand test, try increasing >>> filestore_wbthrottle_xfs_inodes_hard_limit and >>> filestore_wbthrottle_xfs_inodes_start_flusher to 10000 as well as >>> setting the above ios limits. >> >> Ok, my current config: >> filestore wbthrottle xfs ios start flusher = 10000000 >> filestore wbthrottle xfs ios hard limit = 10000000 >> filestore wbthrottle xfs inodes hard limit = 10000 >> filestore wbthrottle xfs inodes start flusher = 10000 >> >> Unfortunately, that still makes no difference at all in the original >> standard-tests. 
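For anyone reproducing this, a minimal sketch of how such overrides are normally applied, assuming nothing about this particular cluster: the options go under the [osd] section of ceph.conf and take effect after an OSD restart, or they can be pushed into a running daemon with injectargs. The osd.0 name below is only an example.

[osd]
    filestore wbthrottle xfs ios start flusher = 10000000
    filestore wbthrottle xfs ios hard limit = 10000000
    filestore wbthrottle xfs inodes hard limit = 10000
    filestore wbthrottle xfs inodes start flusher = 10000

# Push the same values into a running OSD without restarting it (example id):
ceph tell osd.0 injectargs '--filestore_wbthrottle_xfs_ios_hard_limit 10000000 --filestore_wbthrottle_xfs_ios_start_flusher 10000000'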
>> >> Random IO on Dumpling, after 120 secs of runtime: >> # rbd bench-write test --io-pattern=rand >> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern rand >> SEC OPS OPS/SEC BYTES/SEC >> 1 545 534.98 1515804.02 >> 2 1162 580.80 1662416.60 >> 3 1731 576.52 1662966.61 >> 4 2317 579.04 1695129.94 >> 5 2817 562.56 1672754.87 >> <...> >> 120 43564 362.91 1080512.00 >> 121 43774 361.76 1077368.28 >> 122 44419 364.06 1083894.31 >> 123 45046 366.22 1090518.68 >> 124 45287 364.01 1084437.37 >> 125 45334 361.54 1077035.12 >> 126 45336 359.40 1070678.36 >> 127 45797 360.60 1073985.78 >> 128 46388 362.40 1080056.75 >> 129 46984 364.21 1086068.63 >> 130 47604 366.11 1092712.51 >> >> Random IO on Cuttlefish, after 120 secs of runtime: >> rbd bench-write test --io-pattern=rand >> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern rand >> SEC OPS OPS/SEC BYTES/SEC >> 1 1066 1065.54 3115713.13 >> 2 2099 1049.31 2936300.53 >> 3 3218 1072.32 3028707.50 >> 4 4026 1003.23 2807859.15 >> 5 4272 793.80 2226962.63 >> <...> >> 120 66935 557.79 1612483.74 >> 121 68011 562.01 1625419.34 >> 122 68428 558.59 1615376.62 >> 123 68579 557.06 1610780.38 >> 125 68777 549.73 1589816.94 >> 126 69745 553.52 1601671.46 >> 127 70855 557.91 1614293.12 >> 128 71962 562.20 1627070.81 >> 129 72529 562.22 1627120.59 >> 130 73146 562.66 1628818.79 >> >> Confirming your setting took properly: >> # ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep >> wbthrottle >> "filestore_wbthrottle_btrfs_bytes_start_flusher": "41943040", >> "filestore_wbthrottle_btrfs_bytes_hard_limit": "419430400", >> "filestore_wbthrottle_btrfs_ios_start_flusher": "500", >> "filestore_wbthrottle_btrfs_ios_hard_limit": "5000", >> "filestore_wbthrottle_btrfs_inodes_start_flusher": "500", >> "filestore_wbthrottle_xfs_bytes_start_flusher": "41943040", >> "filestore_wbthrottle_xfs_bytes_hard_limit": "419430400", >> "filestore_wbthrottle_xfs_ios_start_flusher": "10000000", >> "filestore_wbthrottle_xfs_ios_hard_limit": "10000000", >> "filestore_wbthrottle_xfs_inodes_start_flusher": "10000", >> "filestore_wbthrottle_btrfs_inodes_hard_limit": "5000", >> "filestore_wbthrottle_xfs_inodes_hard_limit": "10000", >> >> As you can see, your suggested settings made little difference anywhere >> after you let it settle and in the end, there is still a huge difference >> in performance between Cuttlefish and Dumpling. Caching seems to mask >> the problem, though, at least with the type of IO generated by these >> tests. It also reduces CPU-usage by the OSDs to almost none. >> >> >> Regards, >> >> Oliver >> >>> -Sam >>> >>> On Wed, Aug 21, 2013 at 8:13 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>> Hey Samuel, >>>> >>>> On wo, 2013-08-21 at 18:33 -0700, Samuel Just wrote: >>>>> I am dumb. There *has* been a change in the osd which can account for >>>>> this: the wbthrottle limits. We added some logic to force the kernel >>>>> to start flushing writes out earlier, normally a good thing. In this >>>>> case, it's probably doing an fsync every 500 writes. >>>>> >>>>> Can you run 3 tests? >>>>> 1) rerun with --rbd-cache (rbd bench-write test --rbd-cache) >>>> >>>> That flies, with very little load on the OSDs. That load is much higher >>>> without caching. This test was on Dumpling. You get the idea. 
>>>> >>>> # rbd bench-write test --rbd-cache >>>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq >>>> SEC OPS OPS/SEC BYTES/SEC >>>> 1 13832 13817.17 3678357.12 >>>> 2 27590 13793.28 3747373.45 >>>> 3 41144 13708.02 3781505.90 >>>> 4 54510 13622.96 3744543.19 >>>> 5 67249 13448.05 3714589.10 >>>> 6 78778 13128.73 3632892.25 >>>> 7 89507 12783.12 3544970.08 >>>> 8 103171 12895.06 3564180.90 >>>> 9 116306 12921.45 3568128.57 >>>> 10 128946 12894.02 3557625.15 >>>> >>>>> 2) rerun without rbd-cache, but add the following configs to the osds: >>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000 >>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000 >>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000 >>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000 >>>> >>>> That didn't seem to change the results in any way on Dumpling. I'm >>>> running the OSD-storage on XFS, BTW. >>>> >>>>> 3) rerun with the random pattern (rbd bench-write test --io-pattern=rand) >>>> >>>> With Dumpling, it starts out high, then drops off: >>>> >>>> # rbd bench-write test --io-pattern=rand >>>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern rand >>>> SEC OPS OPS/SEC BYTES/SEC >>>> 1 584 569.09 1700360.17 >>>> 2 1165 582.37 1687174.07 >>>> 3 1568 468.14 1364737.79 >>>> 4 1649 411.79 1197760.41 >>>> 5 1819 363.30 1061872.78 >>>> 6 2369 394.17 1146998.37 >>>> 7 2796 395.00 1153270.94 >>>> 8 2939 360.59 1050822.48 >>>> 9 3221 357.87 1040777.01 >>>> 10 3515 339.64 991027.17 >>>> 11 3834 347.46 1021930.90 >>>> 12 4206 350.18 1024435.83 >>>> 13 4424 337.30 981849.71 >>>> 14 4493 320.86 936308.80 >>>> 15 4734 315.29 921501.51 >>>> 16 5285 330.31 969201.95 >>>> 17 5436 319.68 936765.93 >>>> 18 6011 333.91 980649.27 >>>> 19 6376 334.35 983297.54 >>>> 20 6398 311.90 917323.37 >>>> 21 6602 314.36 926410.86 >>>> 22 7167 325.67 962430.15 >>>> 24 7543 310.73 917734.40 >>>> 25 7704 307.98 909611.56 >>>> 26 8279 318.30 938557.27 >>>> 27 8826 323.99 953434.67 >>>> 28 9296 331.87 977987.40 >>>> 29 9476 322.55 951696.77 >>>> 30 9628 320.92 946692.41 >>>> >>>> Same for Cuttlefish, but with significantly more performance over the >>>> entire run: >>>> >>>> # rbd bench-write test --io-pattern=rand >>>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern rand >>>> SEC OPS OPS/SEC BYTES/SEC >>>> 1 920 919.34 2464017.91 >>>> 2 1228 575.73 1597724.10 >>>> 3 2016 669.87 1835987.05 >>>> 4 2222 528.39 1443494.02 >>>> 5 2339 467.32 1268464.39 >>>> 6 3341 556.76 1532374.18 >>>> 7 3677 500.11 1385500.24 >>>> 8 3925 490.60 1361346.40 >>>> 9 4633 510.17 1430694.07 >>>> 10 5275 506.04 1412213.94 >>>> 11 5713 518.16 1449597.21 >>>> 12 6279 523.22 1466611.93 >>>> 13 7224 529.66 1485373.17 >>>> 14 7292 520.08 1461556.30 >>>> 15 8007 533.80 1498859.94 >>>> 16 8218 510.93 1436516.13 >>>> 17 8707 511.54 1439508.19 >>>> 18 9699 538.56 1513392.51 >>>> 19 10264 539.98 1522426.08 >>>> 20 10684 530.23 1496947.12 >>>> 21 10999 523.54 1478991.54 >>>> 22 11357 512.25 1450630.22 >>>> 23 12079 520.24 1475359.56 >>>> 24 12868 530.69 1504088.36 >>>> 25 13315 529.92 1504462.99 >>>> 26 13647 512.39 1451005.94 >>>> 27 14002 517.90 1467738.55 >>>> 28 14343 504.96 1430645.32 >>>> 29 14575 501.53 1419589.36 >>>> 30 15408 513.55 1452033.93 >>>> >>>> Weird that the config-options didn't change anything. I had hoped to >>>> try them on the production-cluster. Anything I could try there? Had to >>>> revert the OSDs there to Cuttlefish again for now, as Dumpling is just >>>> too slow. 
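Since the cached runs above sidestep the slowdown almost entirely, one thing worth sketching for the production side is enabling RBD writeback caching for the KVM guests. This is only an illustration of the usual client-side settings, not a tested recipe for this cluster; the cache sizes shown are just the library defaults.

# ceph.conf on the KVM hosts:
[client]
    rbd cache = true
    rbd cache size = 33554432       # 32 MB per volume
    rbd cache max dirty = 25165824  # must stay below rbd cache size

# Matching libvirt disk definition for the guest; on newer QEMU the drive's
# cache='writeback' mode is mapped onto the RBD cache for that volume:
<driver name='qemu' type='raw' cache='writeback'/>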
>>>> >>>> >>>> Regards, >>>> >>>> Oliver >>>>> >>>>> I expect 1) to largely eliminate the difference (these are sequential >>>>> 4k ios and should be coalesced into much bigger writes). 2) should >>>>> basically restore the same filestore behavior as in cuttlefish. 3) >>>>> should also show little difference since the io will be random. >>>>> >>>>> In normal use, your vms should probably be using rbd caching anyway >>>>> (in addition to the write coalescing already occurring in the guest >>>>> os), so this may not be what you are seeing on your cluster. >>>>> -Sam >>>>> >>>>> On Wed, Aug 21, 2013 at 6:00 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote: >>>>>> You'd need to unmount the fs to actually clear the cache. Did you see >>>>>> a significant difference in load between the runs? To confirm, the >>>>>> rbd client is dumpling the entire time? >>>>>> -Sam >>>>>> >>>>>> On Wed, Aug 21, 2013 at 2:28 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>>>>> Hey Samuel, >>>>>>> >>>>>>> I repeated the same test several times before my post and just now 2 >>>>>>> more times. It holds up and is also repeatable in reverse order, with >>>>>>> the same results. Remember, I restart all OSDs between tests, so any >>>>>>> caches should get destroyed and besides, I'm writing. That shouldn't >>>>>>> involve very large-scale caching, right? I also waited for the cluster >>>>>>> to come to a healthy state after restarting the OSDs, so it's not >>>>>>> related to rebalancing or peering-activity, either. >>>>>>> >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> Oliver >>>>>>> >>>>>>> On wo, 2013-08-21 at 14:07 -0700, Samuel Just wrote: >>>>>>>> Try it again in the reverse order, I strongly suspect caching effects. >>>>>>>> -Sam >>>>>>>> >>>>>>>> On Wed, Aug 21, 2013 at 1:34 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>>>>>>> Hey Samuel, >>>>>>>>> >>>>>>>>> Finally got it reproduced on my test-cluster, which was otherwise >>>>>>>>> unloaded at the time. First, with Dumpling: >>>>>>>>> >>>>>>>>> # rbd create --size 102400 test >>>>>>>>> # ceph-osd --version >>>>>>>>> ceph version 0.67.1-6-g0c4f2f3 >>>>>>>>> (0c4f2f34b78b634efe7f4d56694e2edeeda5a130) >>>>>>>>> # rbd bench-write test >>>>>>>>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq >>>>>>>>> SEC OPS OPS/SEC BYTES/SEC >>>>>>>>> 1 204 200.67 592272.58 >>>>>>>>> 2 401 200.24 607462.05 >>>>>>>>> 3 625 204.62 639655.53 >>>>>>>>> 4 824 205.20 636488.76 >>>>>>>>> 5 1026 204.08 630587.77 >>>>>>>>> 6 1217 202.69 628965.39 >>>>>>>>> 7 1429 203.19 633671.84 >>>>>>>>> 8 1626 202.93 635419.90 >>>>>>>>> 9 1837 203.11 630421.40 >>>>>>>>> 10 2045 204.16 631364.11 >>>>>>>>> 11 2247 204.12 625833.91 >>>>>>>>> 12 2456 204.29 629952.57 >>>>>>>>> 13 2655 204.15 634639.37 >>>>>>>>> 14 2888 206.15 637373.53 >>>>>>>>> 15 3115 207.12 640826.18 >>>>>>>>> 16 3330 207.75 640120.38 >>>>>>>>> 17 3529 207.57 636277.18 >>>>>>>>> 18 3750 208.24 635036.47 >>>>>>>>> 19 3965 208.65 631118.13 >>>>>>>>> 20 4216 210.75 638197.93 >>>>>>>>> 21 4534 215.83 656890.46 >>>>>>>>> 22 4809 218.50 666612.75 >>>>>>>>> 23 5076 220.63 678678.36 >>>>>>>>> 24 5320 221.39 686760.90 >>>>>>>>> 25 5651 226.02 701345.26 >>>>>>>>> 26 5967 229.44 706839.16 >>>>>>>>> 27 6271 232.05 707958.62 >>>>>>>>> 28 6522 232.82 709890.37 >>>>>>>>> 29 6773 233.37 711718.21 >>>>>>>>> 30 7151 238.23 722924.18 >>>>>>>>> >>>>>>>>> Then I replaced the OSD with Cuttlefish on all nodes and restarted it >>>>>>>>> everywhere. 
>>>>>>>>> >>>>>>>>> # ceph-osd --version >>>>>>>>> ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff) >>>>>>>>> # rbd bench-write test >>>>>>>>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq >>>>>>>>> SEC OPS OPS/SEC BYTES/SEC >>>>>>>>> 1 556 555.63 1915665.58 >>>>>>>>> 2 1095 545.08 1839139.42 >>>>>>>>> 3 1696 564.45 1840316.22 >>>>>>>>> 4 2312 577.25 1832641.60 >>>>>>>>> 5 2870 573.46 1800558.64 >>>>>>>>> 6 3461 576.81 1780993.33 >>>>>>>>> 7 4062 579.26 1821246.36 >>>>>>>>> 8 4678 581.51 1862003.10 >>>>>>>>> 9 5269 585.12 1902219.78 >>>>>>>>> 10 5883 587.11 1914700.43 >>>>>>>>> 11 6471 587.59 1913208.96 >>>>>>>>> 12 7022 585.12 1914375.79 >>>>>>>>> 13 7568 580.15 1899650.34 >>>>>>>>> 14 8122 579.87 1864572.53 >>>>>>>>> 15 8659 576.18 1826927.83 >>>>>>>>> 16 9259 576.81 1796653.40 >>>>>>>>> 17 9826 577.60 1785328.57 >>>>>>>>> 18 10401 576.30 1777479.51 >>>>>>>>> 19 10962 576.94 1777431.02 >>>>>>>>> 20 11399 569.84 1756633.76 >>>>>>>>> 21 11937 567.45 1740147.43 >>>>>>>>> 22 12491 567.47 1726288.46 >>>>>>>>> 23 13109 569.92 1727347.20 >>>>>>>>> 24 13629 567.64 1721996.79 >>>>>>>>> 25 14153 566.11 1720782.06 >>>>>>>>> 26 14669 563.23 1716275.34 >>>>>>>>> 27 15191 562.61 1717677.74 >>>>>>>>> 28 15768 562.49 1713059.35 >>>>>>>>> 29 16281 561.40 1707289.23 >>>>>>>>> 31 16719 536.96 1628452.43 >>>>>>>>> ^C >>>>>>>>> >>>>>>>>> As you can see, the difference is quite dramatic and this confirms it's >>>>>>>>> probably something RBD-related. The test-cluster is a standard 3-node >>>>>>>>> cluster with 3xMON, 3xMDS, 3xOSD, tied together with 2x1Gbit/s ethernet. >>>>>>>>> I replaced only the OSD-binary for these tests. The rest of the >>>>>>>>> Ceph-installation is at Dumpling. >>>>>>>>> >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> >>>>>>>>> Oliver >>>>>>>>> >>>>>>>>> On wo, 2013-08-21 at 12:05 -0700, Samuel Just wrote: >>>>>>>>>> There haven't been any significant osd side changes that I can think >>>>>>>>>> of. Is cpu usage still high? If so, can you post the profiler >>>>>>>>>> results again? >>>>>>>>>> -Sam >>>>>>>>>> >>>>>>>>>> On Wed, Aug 21, 2013 at 12:02 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>>>>>>>>> Hey Samuel, >>>>>>>>>>> >>>>>>>>>>> I had a good run on the production-cluster with it and unfortunately, it >>>>>>>>>>> still doesn't seem to have solved the problem. It seemed OK for a while >>>>>>>>>>> and individual OSD CPU-usage seemed quite low, but as the cluster's load >>>>>>>>>>> increased during the day, things got slower again. Write-performance >>>>>>>>>>> within a VM crawled to 30MB/sec and at some point, I got only 10MB/sec >>>>>>>>>>> on reads in that same VM. I also did RADOS bench-tests with `rados >>>>>>>>>>> --pool rbd bench 120 write' and those got several hundreds of MB's/sec >>>>>>>>>>> on the same cluster at the same time of day, so maybe the problem is >>>>>>>>>>> RBD-related. Is there any code in the OSD that could influence >>>>>>>>>>> RBD-performance alone? Do you know of any other significant changes to >>>>>>>>>>> the OSD between Cuttlefish and Dumpling that could result in this? >>>>>>>>>>> >>>>>>>>>>> PS: I also did the same RADOS bench-tests on my test-cluster, both with >>>>>>>>>>> Cuttlefish and Dumpling without your fix and got almost identical >>>>>>>>>>> results. This confirms that the problem might be in RBD, as Mark suggested. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> >>>>>>>>>>> Oliver >>>>>>>>>>> >>>>>>>>>>> On 20-08-13 19:40, Samuel Just wrote: >>>>>>>>>>>> Can you try dumpling head without the option? 
>>>>>>>>>>>> -Sam >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Aug 20, 2013 at 1:44 AM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>>>>>>>>>>> Hey Mark, >>>>>>>>>>>>> >>>>>>>>>>>>> Sorry, but after some more tests I have to report that it only worked >>>>>>>>>>>>> partly. The load seems lower with "wip-dumpling-pglog-undirty" in >>>>>>>>>>>>> place, but the Cuttlefish-osd still seems significantly faster and even >>>>>>>>>>>>> with "wip-dumpling-pglog-undirty" in place, things slow down way too >>>>>>>>>>>>> much under load. Unfortunately, only my production-cluster seems busy >>>>>>>>>>>>> enough to actually show the problem clearly by slowing down. Below is >>>>>>>>>>>>> `perf top'-output, fresh from my production-cluster under it's regular >>>>>>>>>>>>> load: >>>>>>>>>>>>> >>>>>>>>>>>>> First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout = >>>>>>>>>>>>> false": >>>>>>>>>>>>> 16.53% [kernel] [k] >>>>>>>>>>>>> intel_idle >>>>>>>>>>>>> 6.47% libleveldb.so.1.9 [.] >>>>>>>>>>>>> 0x380a1 >>>>>>>>>>>>> 5.76% [kernel] [k] >>>>>>>>>>>>> find_busiest_group >>>>>>>>>>>>> 4.11% libc-2.11.3.so [.] >>>>>>>>>>>>> memcmp >>>>>>>>>>>>> 3.95% kvm [.] >>>>>>>>>>>>> 0x1f6f31 >>>>>>>>>>>>> 2.05% [kernel] [k] >>>>>>>>>>>>> default_send_IPI_mask_sequence_phys >>>>>>>>>>>>> 2.03% [kernel] [k] >>>>>>>>>>>>> _raw_spin_lock >>>>>>>>>>>>> 1.87% libleveldb.so.1.9 [.] >>>>>>>>>>>>> leveldb::InternalKeyComparator::Compar >>>>>>>>>>>>> 1.57% libc-2.11.3.so [.] >>>>>>>>>>>>> memcpy >>>>>>>>>>>>> 1.37% libleveldb.so.1.9 [.] >>>>>>>>>>>>> leveldb::Block::Iter::Next() >>>>>>>>>>>>> 1.26% [kernel] [k] >>>>>>>>>>>>> hrtimer_interrupt >>>>>>>>>>>>> 1.12% [kernel] [k] >>>>>>>>>>>>> __hrtimer_start_range_ns >>>>>>>>>>>>> 1.09% [kernel] [k] >>>>>>>>>>>>> native_write_cr0 >>>>>>>>>>>>> 1.05% libstdc++.so.6.0.13 [.] >>>>>>>>>>>>> std::string::_M_mutate(unsigned long, >>>>>>>>>>>>> 1.00% [kernel] [k] >>>>>>>>>>>>> native_write_msr_safe >>>>>>>>>>>>> 0.99% [kernel] [k] >>>>>>>>>>>>> apic_timer_interrupt >>>>>>>>>>>>> 0.98% [kernel] [k] >>>>>>>>>>>>> clockevents_program_event >>>>>>>>>>>>> 0.96% [kernel] [k] >>>>>>>>>>>>> _raw_spin_unlock_irqrestore >>>>>>>>>>>>> 0.95% ceph-osd [.] >>>>>>>>>>>>> PGLog::undirty() >>>>>>>>>>>>> 0.92% [kernel] [k] >>>>>>>>>>>>> find_next_bit >>>>>>>>>>>>> 0.91% libsnappy.so.1.1.2 [.] >>>>>>>>>>>>> snappy::RawUncompress(snappy::Source*, >>>>>>>>>>>>> 0.88% [kernel] [k] >>>>>>>>>>>>> __schedule >>>>>>>>>>>>> 0.87% [kernel] [k] >>>>>>>>>>>>> cpumask_next_and >>>>>>>>>>>>> 0.84% [kernel] [k] >>>>>>>>>>>>> do_select >>>>>>>>>>>>> 0.80% [kernel] [k] >>>>>>>>>>>>> fget_light >>>>>>>>>>>>> 0.77% [kernel] [k] >>>>>>>>>>>>> reschedule_interrupt >>>>>>>>>>>>> 0.75% [kernel] [k] >>>>>>>>>>>>> _raw_spin_lock_irqsave >>>>>>>>>>>>> 0.62% libstdc++.so.6.0.13 [.] std::string::append(char >>>>>>>>>>>>> const*, unsig >>>>>>>>>>>>> 0.59% [kvm_intel] [k] >>>>>>>>>>>>> vmx_vcpu_run >>>>>>>>>>>>> 0.58% [kernel] [k] >>>>>>>>>>>>> copy_user_generic_string >>>>>>>>>>>>> 0.56% [kernel] [k] >>>>>>>>>>>>> load_balance >>>>>>>>>>>>> 0.54% [kernel] [k] >>>>>>>>>>>>> tg_load_down >>>>>>>>>>>>> 0.53% libpthread-2.11.3.so [.] >>>>>>>>>>>>> pthread_mutex_lock >>>>>>>>>>>>> 0.52% [kernel] [k] sync_inodes_sb >>>>>>>>>>>>> >>>>>>>>>>>>> Second, the 0.61.8 osd, under identical load: >>>>>>>>>>>>> 21.51% [kernel] [k] >>>>>>>>>>>>> intel_idle >>>>>>>>>>>>> 6.66% [kernel] [k] >>>>>>>>>>>>> find_busiest_group >>>>>>>>>>>>> 6.25% kvm [.] 
>>>>>>>>>>>>> 0x2d214b >>>>>>>>>>>>> 1.97% [kernel] [k] >>>>>>>>>>>>> _raw_spin_lock >>>>>>>>>>>>> 1.47% [kernel] [k] >>>>>>>>>>>>> native_write_msr_safe >>>>>>>>>>>>> 1.44% [kernel] [k] >>>>>>>>>>>>> hrtimer_interrupt >>>>>>>>>>>>> 1.37% [kernel] [k] >>>>>>>>>>>>> __hrtimer_start_range_ns >>>>>>>>>>>>> 1.34% [kernel] [k] >>>>>>>>>>>>> do_select >>>>>>>>>>>>> 1.29% [kernel] [k] >>>>>>>>>>>>> fget_light >>>>>>>>>>>>> 1.24% [kernel] [k] >>>>>>>>>>>>> clockevents_program_event >>>>>>>>>>>>> 1.21% [kernel] [k] >>>>>>>>>>>>> default_send_IPI_mask_sequence_phys >>>>>>>>>>>>> 1.18% [kernel] [k] >>>>>>>>>>>>> cpumask_next_and >>>>>>>>>>>>> 1.18% [kernel] [k] >>>>>>>>>>>>> _raw_spin_unlock_irqrestore >>>>>>>>>>>>> 1.15% [kernel] [k] >>>>>>>>>>>>> find_next_bit >>>>>>>>>>>>> 1.14% [kernel] [k] >>>>>>>>>>>>> __schedule >>>>>>>>>>>>> 1.11% [kernel] [k] >>>>>>>>>>>>> _raw_spin_lock_irqsave >>>>>>>>>>>>> 0.98% [kernel] [k] >>>>>>>>>>>>> apic_timer_interrupt >>>>>>>>>>>>> 0.86% [kernel] [k] >>>>>>>>>>>>> copy_user_generic_string >>>>>>>>>>>>> 0.83% [kernel] [k] >>>>>>>>>>>>> native_write_cr0 >>>>>>>>>>>>> 0.76% [kernel] [k] >>>>>>>>>>>>> sync_inodes_sb >>>>>>>>>>>>> 0.71% [kernel] [k] >>>>>>>>>>>>> rcu_needs_cpu >>>>>>>>>>>>> 0.69% libpthread-2.11.3.so [.] >>>>>>>>>>>>> pthread_mutex_lock >>>>>>>>>>>>> 0.66% [kernel] [k] >>>>>>>>>>>>> fput >>>>>>>>>>>>> 0.62% [kernel] [k] >>>>>>>>>>>>> load_balance >>>>>>>>>>>>> 0.57% [vdso] [.] >>>>>>>>>>>>> 0x7fff3a976700 >>>>>>>>>>>>> 0.56% libc-2.11.3.so [.] >>>>>>>>>>>>> memcpy >>>>>>>>>>>>> 0.56% [kernel] [k] >>>>>>>>>>>>> reschedule_interrupt >>>>>>>>>>>>> 0.56% [kernel] [k] >>>>>>>>>>>>> tg_load_down >>>>>>>>>>>>> 0.50% [kernel] [k] iput >>>>>>>>>>>>> >>>>>>>>>>>>> I see lots of new differences, but again don't know what to make of it >>>>>>>>>>>>> and what might be related or significant. LevelDB seems to jump out >>>>>>>>>>>>> this time, amongst others. Let me know if you need more info. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Regards, >>>>>>>>>>>>> >>>>>>>>>>>>> Oliver >>>>>>>>>>>>> >>>>>>>>>>>>> On ma, 2013-08-19 at 15:21 -0500, Mark Nelson wrote: >>>>>>>>>>>>>> Hi Oliver, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Glad that helped! How much more efficient do the cuttlefish OSDs seem >>>>>>>>>>>>>> at this point (with wip-dumpling-pglog-undirty)? On modern Intel >>>>>>>>>>>>>> platforms we were actually hoping to see CPU usage go down in many cases >>>>>>>>>>>>>> due to the use of hardware CRC32 instructions. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Mark >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 08/19/2013 03:06 PM, Oliver Daudey wrote: >>>>>>>>>>>>>>> Hey Samuel, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks! I installed your version, repeated the same tests on my >>>>>>>>>>>>>>> test-cluster and the extra CPU-loading seems to have disappeared. Then >>>>>>>>>>>>>>> I replaced one osd of my production-cluster with your modified version >>>>>>>>>>>>>>> and it's config-option and it seems to be a lot less CPU-hungry now. >>>>>>>>>>>>>>> Although the Cuttlefish-osds still seem to be even more CPU-efficient, >>>>>>>>>>>>>>> your changes have definitely helped a lot. We seem to be looking in the >>>>>>>>>>>>>>> right direction, at least for this part of the problem. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> BTW, I ran `perf top' on the production-node with your modified osd and >>>>>>>>>>>>>>> didn't see anything osd-related stand out on top. "PGLog::undirty()" >>>>>>>>>>>>>>> was in there, but with much lower usage, right at the bottom of the >>>>>>>>>>>>>>> green part of the output. 
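A small aside on the profiling itself: restricting perf to the ceph-osd processes keeps kvm and kernel idle symbols from crowding the top of the list, which makes a symbol like PGLog::undirty() easier to track across runs. A sketch using standard perf options; the 30-second window is arbitrary:

# Live view, OSD processes only:
perf top -p $(pidof ceph-osd | tr ' ' ',')

# Record a call graph for later comparison:
perf record -g -p $(pidof ceph-osd | tr ' ' ',') -- sleep 30
perf report | less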
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Many thanks for your help so far! >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Oliver >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote: >>>>>>>>>>>>>>>> You're right, PGLog::undirty() looks suspicious. I just pushed a >>>>>>>>>>>>>>>> branch wip-dumpling-pglog-undirty with a new config >>>>>>>>>>>>>>>> (osd_debug_pg_log_writeout) which if set to false will disable some >>>>>>>>>>>>>>>> strictly debugging checks which occur in PGLog::undirty(). We haven't >>>>>>>>>>>>>>>> actually seen these checks causing excessive cpu usage, so this may be >>>>>>>>>>>>>>>> a red herring. >>>>>>>>>>>>>>>> -Sam >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>>>>>>>>>>>>>>> Hey Mark, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote: >>>>>>>>>>>>>>>>>> On 08/17/2013 06:13 AM, Oliver Daudey wrote: >>>>>>>>>>>>>>>>>>> Hey all, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I >>>>>>>>>>>>>>>>>>> created in the tracker. Thought I would pass it through the list as >>>>>>>>>>>>>>>>>>> well, to get an idea if anyone else is running into it. It may only >>>>>>>>>>>>>>>>>>> show under higher loads. More info about my setup is in the bug-report >>>>>>>>>>>>>>>>>>> above. Here goes: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd >>>>>>>>>>>>>>>>>>> and mds. I'm using RBD on this cluster as storage for KVM, CephFS is >>>>>>>>>>>>>>>>>>> unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100 >>>>>>>>>>>>>>>>>>> +MB/sec on simple linear writes to a file with `dd' inside a VM on this >>>>>>>>>>>>>>>>>>> cluster under regular load and the osds usually averaged 20-100% >>>>>>>>>>>>>>>>>>> CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for >>>>>>>>>>>>>>>>>>> the osds shot up to 100% to 400% in `top' (multi-core system) and the >>>>>>>>>>>>>>>>>>> speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users >>>>>>>>>>>>>>>>>>> complained that disk-access inside the VMs was significantly slower and >>>>>>>>>>>>>>>>>>> the backups of the RBD-store I was running, also got behind quickly. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> After downgrading only the osds to v0.61.7 Cuttlefish and leaving the >>>>>>>>>>>>>>>>>>> rest at 0.67 Dumpling, speed and load returned to normal. I have >>>>>>>>>>>>>>>>>>> repeated this performance-hit upon upgrade on a similar test-cluster >>>>>>>>>>>>>>>>>>> under no additional load at all. Although CPU-usage for the osds wasn't >>>>>>>>>>>>>>>>>>> as dramatic during these tests because there was no base-load from other >>>>>>>>>>>>>>>>>>> VMs, I/O-performance dropped significantly after upgrading during these >>>>>>>>>>>>>>>>>>> tests as well, and returned to normal after downgrading the osds. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I'm not sure what to make of it. There are no visible errors in the logs >>>>>>>>>>>>>>>>>>> and everything runs and reports good health, it's just a lot slower, >>>>>>>>>>>>>>>>>>> with a lot more CPU-usage. 
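The in-VM test referred to above is a plain sequential dd against a file on the RBD-backed disk; something along these lines, with the file name, block size and count purely illustrative rather than the ones actually used:

# Inside the guest:
dd if=/dev/zero of=testfile bs=1M count=1024 conv=fdatasync
# or, to bypass the guest page cache entirely:
dd if=/dev/zero of=testfile bs=1M count=1024 oflag=direct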
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Oliver, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> If you have access to the perf command on this system, could you try >>>>>>>>>>>>>>>>>> running: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> "sudo perf top" >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> And if that doesn't give you much, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> "sudo perf record -g" >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> then: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> "sudo perf report | less" >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> during the period of high CPU usage? This will give you a call graph. >>>>>>>>>>>>>>>>>> There may be symbols missing, but it might help track down what the OSDs >>>>>>>>>>>>>>>>>> are doing. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks for your help! I did a couple of runs on my test-cluster, >>>>>>>>>>>>>>>>> loading it with writes from 3 VMs concurrently and measuring the results >>>>>>>>>>>>>>>>> at the first node with all 0.67 Dumpling-components and with the osds >>>>>>>>>>>>>>>>> replaced by 0.61.7 Cuttlefish. I let `perf top' run and settle for a >>>>>>>>>>>>>>>>> while, then copied anything that showed in red and green into this post. >>>>>>>>>>>>>>>>> Here are the results (sorry for the word-wraps): >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> First, with 0.61.7 osds: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 19.91% [kernel] [k] intel_idle >>>>>>>>>>>>>>>>> 10.18% [kernel] [k] _raw_spin_lock_irqsave >>>>>>>>>>>>>>>>> 6.79% ceph-osd [.] ceph_crc32c_le >>>>>>>>>>>>>>>>> 4.93% [kernel] [k] >>>>>>>>>>>>>>>>> default_send_IPI_mask_sequence_phys >>>>>>>>>>>>>>>>> 2.71% [kernel] [k] copy_user_generic_string >>>>>>>>>>>>>>>>> 1.42% libc-2.11.3.so [.] memcpy >>>>>>>>>>>>>>>>> 1.23% [kernel] [k] find_busiest_group >>>>>>>>>>>>>>>>> 1.13% librados.so.2.0.0 [.] ceph_crc32c_le_intel >>>>>>>>>>>>>>>>> 1.11% [kernel] [k] _raw_spin_lock >>>>>>>>>>>>>>>>> 0.99% kvm [.] 0x1931f8 >>>>>>>>>>>>>>>>> 0.92% [igb] [k] igb_poll >>>>>>>>>>>>>>>>> 0.87% [kernel] [k] native_write_cr0 >>>>>>>>>>>>>>>>> 0.80% [kernel] [k] csum_partial >>>>>>>>>>>>>>>>> 0.78% [kernel] [k] __do_softirq >>>>>>>>>>>>>>>>> 0.63% [kernel] [k] hpet_legacy_next_event >>>>>>>>>>>>>>>>> 0.53% [ip_tables] [k] ipt_do_table >>>>>>>>>>>>>>>>> 0.50% libc-2.11.3.so [.] 0x74433 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Second test, with 0.67 osds: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 18.32% [kernel] [k] intel_idle >>>>>>>>>>>>>>>>> 7.58% [kernel] [k] _raw_spin_lock_irqsave >>>>>>>>>>>>>>>>> 7.04% ceph-osd [.] PGLog::undirty() >>>>>>>>>>>>>>>>> 4.39% ceph-osd [.] ceph_crc32c_le_intel >>>>>>>>>>>>>>>>> 3.92% [kernel] [k] >>>>>>>>>>>>>>>>> default_send_IPI_mask_sequence_phys >>>>>>>>>>>>>>>>> 2.25% [kernel] [k] copy_user_generic_string >>>>>>>>>>>>>>>>> 1.76% libc-2.11.3.so [.] memcpy >>>>>>>>>>>>>>>>> 1.56% librados.so.2.0.0 [.] ceph_crc32c_le_intel >>>>>>>>>>>>>>>>> 1.40% libc-2.11.3.so [.] vfprintf >>>>>>>>>>>>>>>>> 1.12% libc-2.11.3.so [.] 0x7217b >>>>>>>>>>>>>>>>> 1.05% [kernel] [k] _raw_spin_lock >>>>>>>>>>>>>>>>> 1.01% [kernel] [k] find_busiest_group >>>>>>>>>>>>>>>>> 0.83% kvm [.] 0x193ab8 >>>>>>>>>>>>>>>>> 0.80% [kernel] [k] native_write_cr0 >>>>>>>>>>>>>>>>> 0.76% [kernel] [k] __do_softirq >>>>>>>>>>>>>>>>> 0.73% libc-2.11.3.so [.] 
_IO_default_xsputn >>>>>>>>>>>>>>>>> 0.70% [kernel] [k] csum_partial >>>>>>>>>>>>>>>>> 0.68% [igb] [k] igb_poll >>>>>>>>>>>>>>>>> 0.58% [kernel] [k] hpet_legacy_next_event >>>>>>>>>>>>>>>>> 0.54% [kernel] [k] __schedule >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> What jumps right out, is the "PGLog::undirty()", which doesn't show up >>>>>>>>>>>>>>>>> with 0.61.7 at all, but is an extra drag right at top-usage in 0.67. >>>>>>>>>>>>>>>>> Note that I didn't manage to fully load the test-cluster CPU-wise, >>>>>>>>>>>>>>>>> because of network-constraints and I don't want to take any extra risks >>>>>>>>>>>>>>>>> on the production-cluster and test it there, but it seems we found a >>>>>>>>>>>>>>>>> possible culprit. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Any ideas? Thanks again! >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Oliver _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com