Mark,

As we discussed today, here are some data points for min_alloc_size set to 16K vs. the default 4K. I am seeing 4K RW stabilize at a lower value (~25-30%) after a ~1 hour run if I set min_alloc_size = 16K, both with the default rocksdb tuning and with the tuning I posted some time back.

16K min_alloc_size (after 1 and a half hours):
-----------------------

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|   in   out|  int   csw
  3   1  95   2   0   0|  19M   68M|    0     0| 160k  242k|  14k   54k
 18   6  67   8   0   2| 364M  582M| 114M   74M|    0     0| 313k  343k
 18   6  66   8   0   2| 384M  614M| 122M   79M|    0     0| 314k  344k
 18   6  67   7   0   2| 337M  575M| 108M   71M|    0     0| 316k  356k
 18   5  68   7   0   2| 349M  556M| 111M   73M|    0     0| 305k  344k
 17   6  68   7   0   2| 426M  631M| 106M   69M|    0     0| 306k  335k
 19   6  66   7   0   2| 436M  661M| 129M   84M|    0     0| 340k  365k
 19   7  62  10   0   2| 450M  712M| 113M   75M|    0     0| 330k  350k
 20   7  60  11   0   2| 463M  717M| 120M   79M|    0     0| 349k  363k
 21   7  57  13   0   2| 494M  720M| 137M   89M|    0     0| 367k  385k

Default 4K min_alloc_size (after 10 hour run):
--------------------------------

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|   in   out|  int   csw
 44   9  31  13   0   4| 158M  259M| 173M  113M|    0     0| 451k  469k
 41   9  34  12   0   3| 146M  250M| 162M  106M|    0     0| 435k  461k
 43  10  32  12   0   4| 141M  264M| 172M  112M|    0     0| 446k  460k
 45  10  28  14   0   4| 140M  282M| 180M  117M|    0     0| 454k  458k
 44  10  27  14   0   4| 139M  261M| 181M  119M|    0     0| 467k  457k
 46  10  28  12   0   4| 137M  264M| 185M  121M|    0     0| 465k  458k
 46  10  29  11   0   4| 143M  303M| 179M  116M|    0     0| 457k  453k
 46  10  28  12   0   4| 172M  325M| 173M  112M|    0     0| 460k  454k
 44  10  26  16   0   4| 206M  302M| 169M  110M|    0     0| 463k  466k

You can see there is far more disk read/write going on when I enable 16K min_alloc_size, and that is degrading performance over time for me.
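A rough way to put numbers on the two dstat samples above is to average the disk and network columns and compare disk writes against network receive. This is only a sketch under stated assumptions: the values are transcribed by hand from the samples, the first 16K row is treated as dstat's since-boot summary line and skipped, and net recv is used as a loose proxy for incoming write load (it also includes replication traffic). The names RUN_16K, RUN_4K and summarize are made up for illustration.

```python
from statistics import mean

# (disk read, disk write, net recv) as reported by dstat, in M per second.
# Assumption: the first 16K row (19M/68M, zero network) is dstat's
# since-boot summary line, so it is left out here.
RUN_16K = [
    (364, 582, 114), (384, 614, 122), (337, 575, 108),
    (349, 556, 111), (426, 631, 106), (436, 661, 129),
    (450, 712, 113), (463, 717, 120), (494, 720, 137),
]
RUN_4K = [
    (158, 259, 173), (146, 250, 162), (141, 264, 172),
    (140, 282, 180), (139, 261, 181), (137, 264, 185),
    (143, 303, 179), (172, 325, 173), (206, 302, 169),
]


def summarize(rows):
    """Mean read/write/recv plus total disk write divided by total net recv."""
    reads, writes, recvs = zip(*rows)
    return {
        "read": mean(reads),
        "write": mean(writes),
        "recv": mean(recvs),
        "write_per_recv": sum(writes) / sum(recvs),
    }


if __name__ == "__main__":
    for name, rows in (("16K", RUN_16K), ("4K", RUN_4K)):
        s = summarize(rows)
        print(
            f"{name}: read {s['read']:.0f}M, write {s['write']:.0f}M, "
            f"recv {s['recv']:.0f}M, "
            f"disk write per net recv {s['write_per_recv']:.1f}x"
        )
```

With these samples the disk-write to net-recv ratio comes out at roughly 5.4x for 16K versus roughly 1.6x for 4K, which matches the impression of much heavier disk traffic per unit of incoming data in the 16K run.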

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
Sent: Friday, October 07, 2016 7:04 AM
To: ceph-devel
Subject: bluestore performance snapshot - 20161006

Hi Guys,

I wanted to give folks a quick snapshot of bluestore performance on our NVMe test setup. There are a lot of things happening very quickly in the code, so here's a limited snapshot of how we are doing. These are short running tests, so do keep that in mind (5 minutes each).

https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZRXFKRDBrTjZiWGc

The gist of it is:

We are now basically faster than filestore in all short write tests. Increasing the min_alloc size (and, to a lesser extent, the number of cached onodes) brings 4K random write performance up even further. Likely increasing the min_alloc size will improve long running test performance as well due to drastically reduced metadata load on rocksdb. Related to this, the amount of memory that we cache with 4k min_alloc is pretty excessive, even when limiting the number of cached onodes to 4k. The memory allocator work should allow us to make this more flexible, but I suspect that for now we will want to increase the min_alloc size to 16k to help alleviate memory consumption and metadata overhead (and improve 4k random write performance!). The extra WAL write is probably still worth the tradeoff for now.

On the read side we are seeing some regressions. The sequential read case is interesting. We're doing quite a bit worse in recent bluestore even vs older bluestore, and generally quite a bit worse than filestore. Increasing the min_alloc size reduces the degradation, but we still have ground to make up. In these tests rbd readahead is being used in an attempt to achieve client-side readahead since bluestore no longer does it on the OSD side, but it appears to be fairly ineffective.
These are the settings used:

rbd readahead disable after bytes = 0
rbd readahead max bytes = 4194304

By default we require 10 sequential reads to trigger it. I don't think that should be a problem, but perhaps lowering the threshold will help. In general this is an area we still need to focus on.

For random reads, the degradation was previously found to be due to the async messenger. Both filestore and bluestore performance have degraded relative to Jewel in these tests. Haomai suspects fast dispatch as the primary bottleneck here.

So in general, the areas I think we still need to focus on:

1) memory allocator work (much easier caching configuration, better memory usage, better memory fragmentation, etc)
2) Long running tests (Somnath has been doing this, thanks Somnath!)
3) Sequential read performance in bluestore (Need to understand this better)
4) Fast dispatch performance improvement (Dan's RCU work?)

Mark