I pushed a branch, wip-dumpling-perf. It does two things: 1) adds a config filestore_wbthrottle_enable (defaults to true) to allow you to disable the wbthrottle altogether 2) causes the wbthrottle when enabled to fdatasync rather than fsync. Can you rerun the random workload with that branch with filestore_wbthrottle_enable on and then off? -Sam On Fri, Aug 23, 2013 at 10:26 AM, Oliver Daudey <oliver@xxxxxxxxx> wrote: > Hey Sage, > > I'm all for it and will help testing. > > > Regards, > > Oliver > > On 22-08-13 17:23, Sage Weil wrote: >> We should perhaps hack the old (cuttlefish and earlier) flushing behavior >> into the new code so that we can confirm that it is really the writeback >> that is causing the problem and not something else... >> >> sage >> >> On Thu, 22 Aug 2013, Oliver Daudey wrote: >> >>> Hey Samuel, >>> >>> On wo, 2013-08-21 at 20:27 -0700, Samuel Just wrote: >>>> I think the rbd cache one you'd need to run for a few minutes to get >>>> meaningful results. It should stabilize somewhere around the actual >>>> throughput of your hardware. >>> >>> Ok, I now also ran this test on Cuttlefish as well as Dumpling. >>> >>> Cuttlefish: >>> # rbd bench-write test --rbd-cache >>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq >>> SEC OPS OPS/SEC BYTES/SEC >>> 1 13265 13252.45 3466029.67 >>> 2 25956 12975.60 3589479.95 >>> 3 38475 12818.61 3598590.70 >>> 4 50184 12545.16 3530516.34 >>> 5 59263 11852.22 3292258.13 >>> <...> >>> 300 3421530 11405.08 3191555.35 >>> 301 3430755 11397.83 3189251.09 >>> 302 3443345 11401.73 3190694.98 >>> 303 3455230 11403.37 3191478.97 >>> 304 3467014 11404.62 3192136.82 >>> 305 3475355 11394.57 3189525.71 >>> 306 3488067 11398.90 3190553.96 >>> 307 3499789 11399.96 3190770.21 >>> 308 3510566 11397.93 3190289.49 >>> 309 3519829 11390.98 3188620.93 >>> 310 3532539 11395.25 3189544.03 >>> >>> Dumpling: >>> # rbd bench-write test --rbd-cache >>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq >>> SEC OPS OPS/SEC BYTES/SEC >>> 1 13201 13194.63 3353004.50 >>> 2 25926 12957.05 3379695.03 >>> 3 36624 12206.06 3182087.11 >>> 4 46547 11635.35 3035794.95 >>> 5 59290 11856.27 3090389.79 >>> <...> >>> 300 3405215 11350.66 3130092.00 >>> 301 3417789 11354.76 3131106.34 >>> 302 3430067 11357.83 3131933.41 >>> 303 3438792 11349.14 3129734.88 >>> 304 3450237 11349.45 3129689.62 >>> 305 3462840 11353.53 3130406.43 >>> 306 3473151 11350.17 3128942.32 >>> 307 3482327 11343.00 3126771.34 >>> 308 3495020 11347.44 3127502.07 >>> 309 3506894 11349.13 3127781.70 >>> 310 3516532 11343.65 3126714.62 >>> >>> As you can see, the result is virtually identical. What jumps out >>> during the cached tests, is that the CPU used by the OSDs is negligible >>> in both cases, while without caching, the OSDs get loaded quite well. >>> Perhaps the cache masks the problem we're seeing in Dumpling somehow? >>> And I'm not changing anything but the OSD-binary during my tests, so >>> cache-settings used in VMs are identical in both scenarios. >>> >>>> >>>> Hmm, 10k ios I guess is only 10 rbd chunks. What replication level >>>> are you using? Try setting them to 10000000 (you only need to set the >>>> xfs ones). >>>> >>>> For the rand test, try increasing >>>> filestore_wbthrottle_xfs_inodes_hard_limit and >>>> filestore_wbthrottle_xfs_inodes_start_flusher to 10000 as well as >>>> setting the above ios limits. >>> >>> Ok, my current config: >>> filestore wbthrottle xfs ios start flusher = 10000000 >>> filestore wbthrottle xfs ios hard limit = 10000000 >>> filestore wbthrottle xfs inodes hard limit = 10000 >>> filestore wbthrottle xfs inodes start flusher = 10000 >>> >>> Unfortunately, that still makes no difference at all in the original >>> standard-tests. >>> >>> Random IO on Dumpling, after 120 secs of runtime: >>> # rbd bench-write test --io-pattern=rand >>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern rand >>> SEC OPS OPS/SEC BYTES/SEC >>> 1 545 534.98 1515804.02 >>> 2 1162 580.80 1662416.60 >>> 3 1731 576.52 1662966.61 >>> 4 2317 579.04 1695129.94 >>> 5 2817 562.56 1672754.87 >>> <...> >>> 120 43564 362.91 1080512.00 >>> 121 43774 361.76 1077368.28 >>> 122 44419 364.06 1083894.31 >>> 123 45046 366.22 1090518.68 >>> 124 45287 364.01 1084437.37 >>> 125 45334 361.54 1077035.12 >>> 126 45336 359.40 1070678.36 >>> 127 45797 360.60 1073985.78 >>> 128 46388 362.40 1080056.75 >>> 129 46984 364.21 1086068.63 >>> 130 47604 366.11 1092712.51 >>> >>> Random IO on Cuttlefish, after 120 secs of runtime: >>> rbd bench-write test --io-pattern=rand >>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern rand >>> SEC OPS OPS/SEC BYTES/SEC >>> 1 1066 1065.54 3115713.13 >>> 2 2099 1049.31 2936300.53 >>> 3 3218 1072.32 3028707.50 >>> 4 4026 1003.23 2807859.15 >>> 5 4272 793.80 2226962.63 >>> <...> >>> 120 66935 557.79 1612483.74 >>> 121 68011 562.01 1625419.34 >>> 122 68428 558.59 1615376.62 >>> 123 68579 557.06 1610780.38 >>> 125 68777 549.73 1589816.94 >>> 126 69745 553.52 1601671.46 >>> 127 70855 557.91 1614293.12 >>> 128 71962 562.20 1627070.81 >>> 129 72529 562.22 1627120.59 >>> 130 73146 562.66 1628818.79 >>> >>> Confirming your setting took properly: >>> # ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep >>> wbthrottle >>> "filestore_wbthrottle_btrfs_bytes_start_flusher": "41943040", >>> "filestore_wbthrottle_btrfs_bytes_hard_limit": "419430400", >>> "filestore_wbthrottle_btrfs_ios_start_flusher": "500", >>> "filestore_wbthrottle_btrfs_ios_hard_limit": "5000", >>> "filestore_wbthrottle_btrfs_inodes_start_flusher": "500", >>> "filestore_wbthrottle_xfs_bytes_start_flusher": "41943040", >>> "filestore_wbthrottle_xfs_bytes_hard_limit": "419430400", >>> "filestore_wbthrottle_xfs_ios_start_flusher": "10000000", >>> "filestore_wbthrottle_xfs_ios_hard_limit": "10000000", >>> "filestore_wbthrottle_xfs_inodes_start_flusher": "10000", >>> "filestore_wbthrottle_btrfs_inodes_hard_limit": "5000", >>> "filestore_wbthrottle_xfs_inodes_hard_limit": "10000", >>> >>> As you can see, your suggested settings made little difference anywhere >>> after you let it settle and in the end, there is still a huge difference >>> in performance between Cuttlefish and Dumpling. Caching seems to mask >>> the problem, though, at least with the type of IO generated by these >>> tests. It also reduces CPU-usage by the OSDs to almost none. >>> >>> >>> Regards, >>> >>> Oliver >>> >>>> -Sam >>>> >>>> On Wed, Aug 21, 2013 at 8:13 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>>> Hey Samuel, >>>>> >>>>> On wo, 2013-08-21 at 18:33 -0700, Samuel Just wrote: >>>>>> I am dumb. There *has* been a change in the osd which can account for >>>>>> this: the wbthrottle limits. We added some logic to force the kernel >>>>>> to start flushing writes out earlier, normally a good thing. In this >>>>>> case, it's probably doing an fsync every 500 writes. >>>>>> >>>>>> Can you run 3 tests? >>>>>> 1) rerun with --rbd-cache (rbd bench-write test --rbd-cache) >>>>> >>>>> That flies, with very little load on the OSDs. That load is much higher >>>>> without caching. This test was on Dumpling. You get the idea. >>>>> >>>>> # rbd bench-write test --rbd-cache >>>>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq >>>>> SEC OPS OPS/SEC BYTES/SEC >>>>> 1 13832 13817.17 3678357.12 >>>>> 2 27590 13793.28 3747373.45 >>>>> 3 41144 13708.02 3781505.90 >>>>> 4 54510 13622.96 3744543.19 >>>>> 5 67249 13448.05 3714589.10 >>>>> 6 78778 13128.73 3632892.25 >>>>> 7 89507 12783.12 3544970.08 >>>>> 8 103171 12895.06 3564180.90 >>>>> 9 116306 12921.45 3568128.57 >>>>> 10 128946 12894.02 3557625.15 >>>>> >>>>>> 2) rerun without rbd-cache, but add the following configs to the osds: >>>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000 >>>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000 >>>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000 >>>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000 >>>>> >>>>> That didn't seem to change the results in any way on Dumpling. I'm >>>>> running the OSD-storage on XFS, BTW. >>>>> >>>>>> 3) rerun with the random pattern (rbd bench-write test --io-pattern=rand) >>>>> >>>>> With Dumpling, it starts out high, then drops off: >>>>> >>>>> # rbd bench-write test --io-pattern=rand >>>>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern rand >>>>> SEC OPS OPS/SEC BYTES/SEC >>>>> 1 584 569.09 1700360.17 >>>>> 2 1165 582.37 1687174.07 >>>>> 3 1568 468.14 1364737.79 >>>>> 4 1649 411.79 1197760.41 >>>>> 5 1819 363.30 1061872.78 >>>>> 6 2369 394.17 1146998.37 >>>>> 7 2796 395.00 1153270.94 >>>>> 8 2939 360.59 1050822.48 >>>>> 9 3221 357.87 1040777.01 >>>>> 10 3515 339.64 991027.17 >>>>> 11 3834 347.46 1021930.90 >>>>> 12 4206 350.18 1024435.83 >>>>> 13 4424 337.30 981849.71 >>>>> 14 4493 320.86 936308.80 >>>>> 15 4734 315.29 921501.51 >>>>> 16 5285 330.31 969201.95 >>>>> 17 5436 319.68 936765.93 >>>>> 18 6011 333.91 980649.27 >>>>> 19 6376 334.35 983297.54 >>>>> 20 6398 311.90 917323.37 >>>>> 21 6602 314.36 926410.86 >>>>> 22 7167 325.67 962430.15 >>>>> 24 7543 310.73 917734.40 >>>>> 25 7704 307.98 909611.56 >>>>> 26 8279 318.30 938557.27 >>>>> 27 8826 323.99 953434.67 >>>>> 28 9296 331.87 977987.40 >>>>> 29 9476 322.55 951696.77 >>>>> 30 9628 320.92 946692.41 >>>>> >>>>> Same for Cuttlefish, but with significantly more performance over the >>>>> entire run: >>>>> >>>>> # rbd bench-write test --io-pattern=rand >>>>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern rand >>>>> SEC OPS OPS/SEC BYTES/SEC >>>>> 1 920 919.34 2464017.91 >>>>> 2 1228 575.73 1597724.10 >>>>> 3 2016 669.87 1835987.05 >>>>> 4 2222 528.39 1443494.02 >>>>> 5 2339 467.32 1268464.39 >>>>> 6 3341 556.76 1532374.18 >>>>> 7 3677 500.11 1385500.24 >>>>> 8 3925 490.60 1361346.40 >>>>> 9 4633 510.17 1430694.07 >>>>> 10 5275 506.04 1412213.94 >>>>> 11 5713 518.16 1449597.21 >>>>> 12 6279 523.22 1466611.93 >>>>> 13 7224 529.66 1485373.17 >>>>> 14 7292 520.08 1461556.30 >>>>> 15 8007 533.80 1498859.94 >>>>> 16 8218 510.93 1436516.13 >>>>> 17 8707 511.54 1439508.19 >>>>> 18 9699 538.56 1513392.51 >>>>> 19 10264 539.98 1522426.08 >>>>> 20 10684 530.23 1496947.12 >>>>> 21 10999 523.54 1478991.54 >>>>> 22 11357 512.25 1450630.22 >>>>> 23 12079 520.24 1475359.56 >>>>> 24 12868 530.69 1504088.36 >>>>> 25 13315 529.92 1504462.99 >>>>> 26 13647 512.39 1451005.94 >>>>> 27 14002 517.90 1467738.55 >>>>> 28 14343 504.96 1430645.32 >>>>> 29 14575 501.53 1419589.36 >>>>> 30 15408 513.55 1452033.93 >>>>> >>>>> Weird that the config-options didn't change anything. I had hoped to >>>>> try them on the production-cluster. Anything I could try there? Had to >>>>> revert the OSDs there to Cuttlefish again for now, as Dumpling is just >>>>> too slow. >>>>> >>>>> >>>>> Regards, >>>>> >>>>> Oliver >>>>>> >>>>>> I expect 1) to largely eliminate the difference (these are sequential >>>>>> 4k ios and should be coalesced into much bigger writes). 2) should >>>>>> basically restore the same filestore behavior as in cuttlefish. 3) >>>>>> should also show little difference since the io will be random. >>>>>> >>>>>> In normal use, your vms should probably be using rbd caching anyway >>>>>> (in addition to the write coalescing already occurring in the guest >>>>>> os), so this may not be what you are seeing on your cluster. >>>>>> -Sam >>>>>> >>>>>> On Wed, Aug 21, 2013 at 6:00 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote: >>>>>>> You'd need to unmount the fs to actually clear the cache. Did you see >>>>>>> a significant difference in load between the runs? To confirm, the >>>>>>> rbd client is dumpling the entire time? >>>>>>> -Sam >>>>>>> >>>>>>> On Wed, Aug 21, 2013 at 2:28 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>>>>>> Hey Samuel, >>>>>>>> >>>>>>>> I repeated the same test several times before my post and just now 2 >>>>>>>> more times. It holds up and is also repeatable in reverse order, with >>>>>>>> the same results. Remember, I restart all OSDs between tests, so any >>>>>>>> caches should get destroyed and besides, I'm writing. That shouldn't >>>>>>>> involve very large-scale caching, right? I also waited for the cluster >>>>>>>> to come to a healthy state after restarting the OSDs, so it's not >>>>>>>> related to rebalancing or peering-activity, either. >>>>>>>> >>>>>>>> >>>>>>>> Regards, >>>>>>>> >>>>>>>> Oliver >>>>>>>> >>>>>>>> On wo, 2013-08-21 at 14:07 -0700, Samuel Just wrote: >>>>>>>>> Try it again in the reverse order, I strongly suspect caching effects. >>>>>>>>> -Sam >>>>>>>>> >>>>>>>>> On Wed, Aug 21, 2013 at 1:34 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>>>>>>>> Hey Samuel, >>>>>>>>>> >>>>>>>>>> Finally got it reproduced on my test-cluster, which was otherwise >>>>>>>>>> unloaded at the time. First, with Dumpling: >>>>>>>>>> >>>>>>>>>> # rbd create --size 102400 test >>>>>>>>>> # ceph-osd --version >>>>>>>>>> ceph version 0.67.1-6-g0c4f2f3 >>>>>>>>>> (0c4f2f34b78b634efe7f4d56694e2edeeda5a130) >>>>>>>>>> # rbd bench-write test >>>>>>>>>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq >>>>>>>>>> SEC OPS OPS/SEC BYTES/SEC >>>>>>>>>> 1 204 200.67 592272.58 >>>>>>>>>> 2 401 200.24 607462.05 >>>>>>>>>> 3 625 204.62 639655.53 >>>>>>>>>> 4 824 205.20 636488.76 >>>>>>>>>> 5 1026 204.08 630587.77 >>>>>>>>>> 6 1217 202.69 628965.39 >>>>>>>>>> 7 1429 203.19 633671.84 >>>>>>>>>> 8 1626 202.93 635419.90 >>>>>>>>>> 9 1837 203.11 630421.40 >>>>>>>>>> 10 2045 204.16 631364.11 >>>>>>>>>> 11 2247 204.12 625833.91 >>>>>>>>>> 12 2456 204.29 629952.57 >>>>>>>>>> 13 2655 204.15 634639.37 >>>>>>>>>> 14 2888 206.15 637373.53 >>>>>>>>>> 15 3115 207.12 640826.18 >>>>>>>>>> 16 3330 207.75 640120.38 >>>>>>>>>> 17 3529 207.57 636277.18 >>>>>>>>>> 18 3750 208.24 635036.47 >>>>>>>>>> 19 3965 208.65 631118.13 >>>>>>>>>> 20 4216 210.75 638197.93 >>>>>>>>>> 21 4534 215.83 656890.46 >>>>>>>>>> 22 4809 218.50 666612.75 >>>>>>>>>> 23 5076 220.63 678678.36 >>>>>>>>>> 24 5320 221.39 686760.90 >>>>>>>>>> 25 5651 226.02 701345.26 >>>>>>>>>> 26 5967 229.44 706839.16 >>>>>>>>>> 27 6271 232.05 707958.62 >>>>>>>>>> 28 6522 232.82 709890.37 >>>>>>>>>> 29 6773 233.37 711718.21 >>>>>>>>>> 30 7151 238.23 722924.18 >>>>>>>>>> >>>>>>>>>> Then I replaced the OSD with Cuttlefish on all nodes and restarted it >>>>>>>>>> everywhere. >>>>>>>>>> >>>>>>>>>> # ceph-osd --version >>>>>>>>>> ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff) >>>>>>>>>> # rbd bench-write test >>>>>>>>>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq >>>>>>>>>> SEC OPS OPS/SEC BYTES/SEC >>>>>>>>>> 1 556 555.63 1915665.58 >>>>>>>>>> 2 1095 545.08 1839139.42 >>>>>>>>>> 3 1696 564.45 1840316.22 >>>>>>>>>> 4 2312 577.25 1832641.60 >>>>>>>>>> 5 2870 573.46 1800558.64 >>>>>>>>>> 6 3461 576.81 1780993.33 >>>>>>>>>> 7 4062 579.26 1821246.36 >>>>>>>>>> 8 4678 581.51 1862003.10 >>>>>>>>>> 9 5269 585.12 1902219.78 >>>>>>>>>> 10 5883 587.11 1914700.43 >>>>>>>>>> 11 6471 587.59 1913208.96 >>>>>>>>>> 12 7022 585.12 1914375.79 >>>>>>>>>> 13 7568 580.15 1899650.34 >>>>>>>>>> 14 8122 579.87 1864572.53 >>>>>>>>>> 15 8659 576.18 1826927.83 >>>>>>>>>> 16 9259 576.81 1796653.40 >>>>>>>>>> 17 9826 577.60 1785328.57 >>>>>>>>>> 18 10401 576.30 1777479.51 >>>>>>>>>> 19 10962 576.94 1777431.02 >>>>>>>>>> 20 11399 569.84 1756633.76 >>>>>>>>>> 21 11937 567.45 1740147.43 >>>>>>>>>> 22 12491 567.47 1726288.46 >>>>>>>>>> 23 13109 569.92 1727347.20 >>>>>>>>>> 24 13629 567.64 1721996.79 >>>>>>>>>> 25 14153 566.11 1720782.06 >>>>>>>>>> 26 14669 563.23 1716275.34 >>>>>>>>>> 27 15191 562.61 1717677.74 >>>>>>>>>> 28 15768 562.49 1713059.35 >>>>>>>>>> 29 16281 561.40 1707289.23 >>>>>>>>>> 31 16719 536.96 1628452.43 >>>>>>>>>> ^C >>>>>>>>>> >>>>>>>>>> As you can see, the difference is quite dramatic and this confirms it's >>>>>>>>>> probably something RBD-related. The test-cluster is a standard 3-node >>>>>>>>>> cluster with 3xMON, 3xMDS, 3xOSD, tied together with 2x1Gbit/s ethernet. >>>>>>>>>> I replaced only the OSD-binary for these tests. The rest of the >>>>>>>>>> Ceph-installation is at Dumpling. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> >>>>>>>>>> Oliver >>>>>>>>>> >>>>>>>>>> On wo, 2013-08-21 at 12:05 -0700, Samuel Just wrote: >>>>>>>>>>> There haven't been any significant osd side changes that I can think >>>>>>>>>>> of. Is cpu usage still high? If so, can you post the profiler >>>>>>>>>>> results again? >>>>>>>>>>> -Sam >>>>>>>>>>> >>>>>>>>>>> On Wed, Aug 21, 2013 at 12:02 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>>>>>>>>>> Hey Samuel, >>>>>>>>>>>> >>>>>>>>>>>> I had a good run on the production-cluster with it and unfortunately, it >>>>>>>>>>>> still doesn't seem to have solved the problem. It seemed OK for a while >>>>>>>>>>>> and individual OSD CPU-usage seemed quite low, but as the cluster's load >>>>>>>>>>>> increased during the day, things got slower again. Write-performance >>>>>>>>>>>> within a VM crawled to 30MB/sec and at some point, I got only 10MB/sec >>>>>>>>>>>> on reads in that same VM. I also did RADOS bench-tests with `rados >>>>>>>>>>>> --pool rbd bench 120 write' and those got several hundreds of MB's/sec >>>>>>>>>>>> on the same cluster at the same time of day, so maybe the problem is >>>>>>>>>>>> RBD-related. Is there any code in the OSD that could influence >>>>>>>>>>>> RBD-performance alone? Do you know of any other significant changes to >>>>>>>>>>>> the OSD between Cuttlefish and Dumpling that could result in this? >>>>>>>>>>>> >>>>>>>>>>>> PS: I also did the same RADOS bench-tests on my test-cluster, both with >>>>>>>>>>>> Cuttlefish and Dumpling without your fix and got almost identical >>>>>>>>>>>> results. This confirms that the problem might be in RBD, as Mark suggested. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> >>>>>>>>>>>> Oliver >>>>>>>>>>>> >>>>>>>>>>>> On 20-08-13 19:40, Samuel Just wrote: >>>>>>>>>>>>> Can you try dumpling head without the option? >>>>>>>>>>>>> -Sam >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Aug 20, 2013 at 1:44 AM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>>>>>>>>>>>> Hey Mark, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sorry, but after some more tests I have to report that it only worked >>>>>>>>>>>>>> partly. The load seems lower with "wip-dumpling-pglog-undirty" in >>>>>>>>>>>>>> place, but the Cuttlefish-osd still seems significantly faster and even >>>>>>>>>>>>>> with "wip-dumpling-pglog-undirty" in place, things slow down way too >>>>>>>>>>>>>> much under load. Unfortunately, only my production-cluster seems busy >>>>>>>>>>>>>> enough to actually show the problem clearly by slowing down. Below is >>>>>>>>>>>>>> `perf top'-output, fresh from my production-cluster under it's regular >>>>>>>>>>>>>> load: >>>>>>>>>>>>>> >>>>>>>>>>>>>> First, the 0.67.1-6-g0c4f2f3 osd with "osd debug pg log writeout = >>>>>>>>>>>>>> false": >>>>>>>>>>>>>> 16.53% [kernel] [k] >>>>>>>>>>>>>> intel_idle >>>>>>>>>>>>>> 6.47% libleveldb.so.1.9 [.] >>>>>>>>>>>>>> 0x380a1 >>>>>>>>>>>>>> 5.76% [kernel] [k] >>>>>>>>>>>>>> find_busiest_group >>>>>>>>>>>>>> 4.11% libc-2.11.3.so [.] >>>>>>>>>>>>>> memcmp >>>>>>>>>>>>>> 3.95% kvm [.] >>>>>>>>>>>>>> 0x1f6f31 >>>>>>>>>>>>>> 2.05% [kernel] [k] >>>>>>>>>>>>>> default_send_IPI_mask_sequence_phys >>>>>>>>>>>>>> 2.03% [kernel] [k] >>>>>>>>>>>>>> _raw_spin_lock >>>>>>>>>>>>>> 1.87% libleveldb.so.1.9 [.] >>>>>>>>>>>>>> leveldb::InternalKeyComparator::Compar >>>>>>>>>>>>>> 1.57% libc-2.11.3.so [.] >>>>>>>>>>>>>> memcpy >>>>>>>>>>>>>> 1.37% libleveldb.so.1.9 [.] >>>>>>>>>>>>>> leveldb::Block::Iter::Next() >>>>>>>>>>>>>> 1.26% [kernel] [k] >>>>>>>>>>>>>> hrtimer_interrupt >>>>>>>>>>>>>> 1.12% [kernel] [k] >>>>>>>>>>>>>> __hrtimer_start_range_ns >>>>>>>>>>>>>> 1.09% [kernel] [k] >>>>>>>>>>>>>> native_write_cr0 >>>>>>>>>>>>>> 1.05% libstdc++.so.6.0.13 [.] >>>>>>>>>>>>>> std::string::_M_mutate(unsigned long, >>>>>>>>>>>>>> 1.00% [kernel] [k] >>>>>>>>>>>>>> native_write_msr_safe >>>>>>>>>>>>>> 0.99% [kernel] [k] >>>>>>>>>>>>>> apic_timer_interrupt >>>>>>>>>>>>>> 0.98% [kernel] [k] >>>>>>>>>>>>>> clockevents_program_event >>>>>>>>>>>>>> 0.96% [kernel] [k] >>>>>>>>>>>>>> _raw_spin_unlock_irqrestore >>>>>>>>>>>>>> 0.95% ceph-osd [.] >>>>>>>>>>>>>> PGLog::undirty() >>>>>>>>>>>>>> 0.92% [kernel] [k] >>>>>>>>>>>>>> find_next_bit >>>>>>>>>>>>>> 0.91% libsnappy.so.1.1.2 [.] >>>>>>>>>>>>>> snappy::RawUncompress(snappy::Source*, >>>>>>>>>>>>>> 0.88% [kernel] [k] >>>>>>>>>>>>>> __schedule >>>>>>>>>>>>>> 0.87% [kernel] [k] >>>>>>>>>>>>>> cpumask_next_and >>>>>>>>>>>>>> 0.84% [kernel] [k] >>>>>>>>>>>>>> do_select >>>>>>>>>>>>>> 0.80% [kernel] [k] >>>>>>>>>>>>>> fget_light >>>>>>>>>>>>>> 0.77% [kernel] [k] >>>>>>>>>>>>>> reschedule_interrupt >>>>>>>>>>>>>> 0.75% [kernel] [k] >>>>>>>>>>>>>> _raw_spin_lock_irqsave >>>>>>>>>>>>>> 0.62% libstdc++.so.6.0.13 [.] std::string::append(char >>>>>>>>>>>>>> const*, unsig >>>>>>>>>>>>>> 0.59% [kvm_intel] [k] >>>>>>>>>>>>>> vmx_vcpu_run >>>>>>>>>>>>>> 0.58% [kernel] [k] >>>>>>>>>>>>>> copy_user_generic_string >>>>>>>>>>>>>> 0.56% [kernel] [k] >>>>>>>>>>>>>> load_balance >>>>>>>>>>>>>> 0.54% [kernel] [k] >>>>>>>>>>>>>> tg_load_down >>>>>>>>>>>>>> 0.53% libpthread-2.11.3.so [.] >>>>>>>>>>>>>> pthread_mutex_lock >>>>>>>>>>>>>> 0.52% [kernel] [k] sync_inodes_sb >>>>>>>>>>>>>> >>>>>>>>>>>>>> Second, the 0.61.8 osd, under identical load: >>>>>>>>>>>>>> 21.51% [kernel] [k] >>>>>>>>>>>>>> intel_idle >>>>>>>>>>>>>> 6.66% [kernel] [k] >>>>>>>>>>>>>> find_busiest_group >>>>>>>>>>>>>> 6.25% kvm [.] >>>>>>>>>>>>>> 0x2d214b >>>>>>>>>>>>>> 1.97% [kernel] [k] >>>>>>>>>>>>>> _raw_spin_lock >>>>>>>>>>>>>> 1.47% [kernel] [k] >>>>>>>>>>>>>> native_write_msr_safe >>>>>>>>>>>>>> 1.44% [kernel] [k] >>>>>>>>>>>>>> hrtimer_interrupt >>>>>>>>>>>>>> 1.37% [kernel] [k] >>>>>>>>>>>>>> __hrtimer_start_range_ns >>>>>>>>>>>>>> 1.34% [kernel] [k] >>>>>>>>>>>>>> do_select >>>>>>>>>>>>>> 1.29% [kernel] [k] >>>>>>>>>>>>>> fget_light >>>>>>>>>>>>>> 1.24% [kernel] [k] >>>>>>>>>>>>>> clockevents_program_event >>>>>>>>>>>>>> 1.21% [kernel] [k] >>>>>>>>>>>>>> default_send_IPI_mask_sequence_phys >>>>>>>>>>>>>> 1.18% [kernel] [k] >>>>>>>>>>>>>> cpumask_next_and >>>>>>>>>>>>>> 1.18% [kernel] [k] >>>>>>>>>>>>>> _raw_spin_unlock_irqrestore >>>>>>>>>>>>>> 1.15% [kernel] [k] >>>>>>>>>>>>>> find_next_bit >>>>>>>>>>>>>> 1.14% [kernel] [k] >>>>>>>>>>>>>> __schedule >>>>>>>>>>>>>> 1.11% [kernel] [k] >>>>>>>>>>>>>> _raw_spin_lock_irqsave >>>>>>>>>>>>>> 0.98% [kernel] [k] >>>>>>>>>>>>>> apic_timer_interrupt >>>>>>>>>>>>>> 0.86% [kernel] [k] >>>>>>>>>>>>>> copy_user_generic_string >>>>>>>>>>>>>> 0.83% [kernel] [k] >>>>>>>>>>>>>> native_write_cr0 >>>>>>>>>>>>>> 0.76% [kernel] [k] >>>>>>>>>>>>>> sync_inodes_sb >>>>>>>>>>>>>> 0.71% [kernel] [k] >>>>>>>>>>>>>> rcu_needs_cpu >>>>>>>>>>>>>> 0.69% libpthread-2.11.3.so [.] >>>>>>>>>>>>>> pthread_mutex_lock >>>>>>>>>>>>>> 0.66% [kernel] [k] >>>>>>>>>>>>>> fput >>>>>>>>>>>>>> 0.62% [kernel] [k] >>>>>>>>>>>>>> load_balance >>>>>>>>>>>>>> 0.57% [vdso] [.] >>>>>>>>>>>>>> 0x7fff3a976700 >>>>>>>>>>>>>> 0.56% libc-2.11.3.so [.] >>>>>>>>>>>>>> memcpy >>>>>>>>>>>>>> 0.56% [kernel] [k] >>>>>>>>>>>>>> reschedule_interrupt >>>>>>>>>>>>>> 0.56% [kernel] [k] >>>>>>>>>>>>>> tg_load_down >>>>>>>>>>>>>> 0.50% [kernel] [k] iput >>>>>>>>>>>>>> >>>>>>>>>>>>>> I see lots of new differences, but again don't know what to make of it >>>>>>>>>>>>>> and what might be related or significant. LevelDB seems to jump out >>>>>>>>>>>>>> this time, amongst others. Let me know if you need more info. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Oliver >>>>>>>>>>>>>> >>>>>>>>>>>>>> On ma, 2013-08-19 at 15:21 -0500, Mark Nelson wrote: >>>>>>>>>>>>>>> Hi Oliver, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Glad that helped! How much more efficient do the cuttlefish OSDs seem >>>>>>>>>>>>>>> at this point (with wip-dumpling-pglog-undirty)? On modern Intel >>>>>>>>>>>>>>> platforms we were actually hoping to see CPU usage go down in many cases >>>>>>>>>>>>>>> due to the use of hardware CRC32 instructions. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Mark >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 08/19/2013 03:06 PM, Oliver Daudey wrote: >>>>>>>>>>>>>>>> Hey Samuel, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks! I installed your version, repeated the same tests on my >>>>>>>>>>>>>>>> test-cluster and the extra CPU-loading seems to have disappeared. Then >>>>>>>>>>>>>>>> I replaced one osd of my production-cluster with your modified version >>>>>>>>>>>>>>>> and it's config-option and it seems to be a lot less CPU-hungry now. >>>>>>>>>>>>>>>> Although the Cuttlefish-osds still seem to be even more CPU-efficient, >>>>>>>>>>>>>>>> your changes have definitely helped a lot. We seem to be looking in the >>>>>>>>>>>>>>>> right direction, at least for this part of the problem. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> BTW, I ran `perf top' on the production-node with your modified osd and >>>>>>>>>>>>>>>> didn't see anything osd-related stand out on top. "PGLog::undirty()" >>>>>>>>>>>>>>>> was in there, but with much lower usage, right at the bottom of the >>>>>>>>>>>>>>>> green part of the output. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Many thanks for your help so far! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Oliver >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote: >>>>>>>>>>>>>>>>> You're right, PGLog::undirty() looks suspicious. I just pushed a >>>>>>>>>>>>>>>>> branch wip-dumpling-pglog-undirty with a new config >>>>>>>>>>>>>>>>> (osd_debug_pg_log_writeout) which if set to false will disable some >>>>>>>>>>>>>>>>> strictly debugging checks which occur in PGLog::undirty(). We haven't >>>>>>>>>>>>>>>>> actually seen these checks causing excessive cpu usage, so this may be >>>>>>>>>>>>>>>>> a red herring. >>>>>>>>>>>>>>>>> -Sam >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote: >>>>>>>>>>>>>>>>>> Hey Mark, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote: >>>>>>>>>>>>>>>>>>> On 08/17/2013 06:13 AM, Oliver Daudey wrote: >>>>>>>>>>>>>>>>>>>> Hey all, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I >>>>>>>>>>>>>>>>>>>> created in the tracker. Thought I would pass it through the list as >>>>>>>>>>>>>>>>>>>> well, to get an idea if anyone else is running into it. It may only >>>>>>>>>>>>>>>>>>>> show under higher loads. More info about my setup is in the bug-report >>>>>>>>>>>>>>>>>>>> above. Here goes: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd >>>>>>>>>>>>>>>>>>>> and mds. I'm using RBD on this cluster as storage for KVM, CephFS is >>>>>>>>>>>>>>>>>>>> unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100 >>>>>>>>>>>>>>>>>>>> +MB/sec on simple linear writes to a file with `dd' inside a VM on this >>>>>>>>>>>>>>>>>>>> cluster under regular load and the osds usually averaged 20-100% >>>>>>>>>>>>>>>>>>>> CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for >>>>>>>>>>>>>>>>>>>> the osds shot up to 100% to 400% in `top' (multi-core system) and the >>>>>>>>>>>>>>>>>>>> speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users >>>>>>>>>>>>>>>>>>>> complained that disk-access inside the VMs was significantly slower and >>>>>>>>>>>>>>>>>>>> the backups of the RBD-store I was running, also got behind quickly. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> After downgrading only the osds to v0.61.7 Cuttlefish and leaving the >>>>>>>>>>>>>>>>>>>> rest at 0.67 Dumpling, speed and load returned to normal. I have >>>>>>>>>>>>>>>>>>>> repeated this performance-hit upon upgrade on a similar test-cluster >>>>>>>>>>>>>>>>>>>> under no additional load at all. Although CPU-usage for the osds wasn't >>>>>>>>>>>>>>>>>>>> as dramatic during these tests because there was no base-load from other >>>>>>>>>>>>>>>>>>>> VMs, I/O-performance dropped significantly after upgrading during these >>>>>>>>>>>>>>>>>>>> tests as well, and returned to normal after downgrading the osds. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I'm not sure what to make of it. There are no visible errors in the logs >>>>>>>>>>>>>>>>>>>> and everything runs and reports good health, it's just a lot slower, >>>>>>>>>>>>>>>>>>>> with a lot more CPU-usage. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi Oliver, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> If you have access to the perf command on this system, could you try >>>>>>>>>>>>>>>>>>> running: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> "sudo perf top" >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> And if that doesn't give you much, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> "sudo perf record -g" >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> then: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> "sudo perf report | less" >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> during the period of high CPU usage? This will give you a call graph. >>>>>>>>>>>>>>>>>>> There may be symbols missing, but it might help track down what the OSDs >>>>>>>>>>>>>>>>>>> are doing. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks for your help! I did a couple of runs on my test-cluster, >>>>>>>>>>>>>>>>>> loading it with writes from 3 VMs concurrently and measuring the results >>>>>>>>>>>>>>>>>> at the first node with all 0.67 Dumpling-components and with the osds >>>>>>>>>>>>>>>>>> replaced by 0.61.7 Cuttlefish. I let `perf top' run and settle for a >>>>>>>>>>>>>>>>>> while, then copied anything that showed in red and green into this post. >>>>>>>>>>>>>>>>>> Here are the results (sorry for the word-wraps): >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> First, with 0.61.7 osds: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 19.91% [kernel] [k] intel_idle >>>>>>>>>>>>>>>>>> 10.18% [kernel] [k] _raw_spin_lock_irqsave >>>>>>>>>>>>>>>>>> 6.79% ceph-osd [.] ceph_crc32c_le >>>>>>>>>>>>>>>>>> 4.93% [kernel] [k] >>>>>>>>>>>>>>>>>> default_send_IPI_mask_sequence_phys >>>>>>>>>>>>>>>>>> 2.71% [kernel] [k] copy_user_generic_string >>>>>>>>>>>>>>>>>> 1.42% libc-2.11.3.so [.] memcpy >>>>>>>>>>>>>>>>>> 1.23% [kernel] [k] find_busiest_group >>>>>>>>>>>>>>>>>> 1.13% librados.so.2.0.0 [.] ceph_crc32c_le_intel >>>>>>>>>>>>>>>>>> 1.11% [kernel] [k] _raw_spin_lock >>>>>>>>>>>>>>>>>> 0.99% kvm [.] 0x1931f8 >>>>>>>>>>>>>>>>>> 0.92% [igb] [k] igb_poll >>>>>>>>>>>>>>>>>> 0.87% [kernel] [k] native_write_cr0 >>>>>>>>>>>>>>>>>> 0.80% [kernel] [k] csum_partial >>>>>>>>>>>>>>>>>> 0.78% [kernel] [k] __do_softirq >>>>>>>>>>>>>>>>>> 0.63% [kernel] [k] hpet_legacy_next_event >>>>>>>>>>>>>>>>>> 0.53% [ip_tables] [k] ipt_do_table >>>>>>>>>>>>>>>>>> 0.50% libc-2.11.3.so [.] 0x74433 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Second test, with 0.67 osds: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 18.32% [kernel] [k] intel_idle >>>>>>>>>>>>>>>>>> 7.58% [kernel] [k] _raw_spin_lock_irqsave >>>>>>>>>>>>>>>>>> 7.04% ceph-osd [.] PGLog::undirty() >>>>>>>>>>>>>>>>>> 4.39% ceph-osd [.] ceph_crc32c_le_intel >>>>>>>>>>>>>>>>>> 3.92% [kernel] [k] >>>>>>>>>>>>>>>>>> default_send_IPI_mask_sequence_phys >>>>>>>>>>>>>>>>>> 2.25% [kernel] [k] copy_user_generic_string >>>>>>>>>>>>>>>>>> 1.76% libc-2.11.3.so [.] memcpy >>>>>>>>>>>>>>>>>> 1.56% librados.so.2.0.0 [.] ceph_crc32c_le_intel >>>>>>>>>>>>>>>>>> 1.40% libc-2.11.3.so [.] vfprintf >>>>>>>>>>>>>>>>>> 1.12% libc-2.11.3.so [.] 0x7217b >>>>>>>>>>>>>>>>>> 1.05% [kernel] [k] _raw_spin_lock >>>>>>>>>>>>>>>>>> 1.01% [kernel] [k] find_busiest_group >>>>>>>>>>>>>>>>>> 0.83% kvm [.] 0x193ab8 >>>>>>>>>>>>>>>>>> 0.80% [kernel] [k] native_write_cr0 >>>>>>>>>>>>>>>>>> 0.76% [kernel] [k] __do_softirq >>>>>>>>>>>>>>>>>> 0.73% libc-2.11.3.so [.] _IO_default_xsputn >>>>>>>>>>>>>>>>>> 0.70% [kernel] [k] csum_partial >>>>>>>>>>>>>>>>>> 0.68% [igb] [k] igb_poll >>>>>>>>>>>>>>>>>> 0.58% [kernel] [k] hpet_legacy_next_event >>>>>>>>>>>>>>>>>> 0.54% [kernel] [k] __schedule >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> What jumps right out, is the "PGLog::undirty()", which doesn't show up >>>>>>>>>>>>>>>>>> with 0.61.7 at all, but is an extra drag right at top-usage in 0.67. >>>>>>>>>>>>>>>>>> Note that I didn't manage to fully load the test-cluster CPU-wise, >>>>>>>>>>>>>>>>>> because of network-constraints and I don't want to take any extra risks >>>>>>>>>>>>>>>>>> on the production-cluster and test it there, but it seems we found a >>>>>>>>>>>>>>>>>> possible culprit. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Any ideas? Thanks again! >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Oliver >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>> ceph-users mailing list >>>>>>>>>>>>>>>>>> ceph-users@xxxxxxxxxxxxxx >>>>>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>> ceph-users mailing list >>>>>>>>>>>>>>>> ceph-users@xxxxxxxxxxxxxx >>>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> ceph-users mailing list >>>>>>>>>>>>>>> ceph-users@xxxxxxxxxxxxxx >>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> ceph-users mailing list >>>>>>>>>>>>>> ceph-users@xxxxxxxxxxxxxx >>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>>> >>>> >>> >>> >>> _______________________________________________ >>> ceph-users mailing list >>> ceph-users@xxxxxxxxxxxxxx >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >>> >> > _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com