Yeah, but it's not likely to make Kraken.
-Sam

On Fri, Oct 7, 2016 at 10:27 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
> On Fri, Oct 7, 2016 at 10:03 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>> Hi Guys,
>>
>> I wanted to give folks a quick snapshot of bluestore performance on our
>> NVMe test setup. A lot is happening very quickly in the code, so here's
>> a limited snapshot of how we are doing. These are short-running tests,
>> so do keep that in mind (5 minutes each).
>>
>> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZRXFKRDBrTjZiWGc
>>
>> The gist of it is:
>>
>> We are now basically faster than filestore in all short write tests.
>> Increasing the min_alloc size (and, to a lesser extent, the number of
>> cached onodes) brings 4K random write performance up even further.
>> Increasing the min_alloc size will likely improve long-running test
>> performance as well, due to drastically reduced metadata load on
>> rocksdb. Related to this, the amount of memory we cache with a 4k
>> min_alloc is pretty excessive, even when limiting the number of cached
>> onodes to 4k. The memory allocator work should let us make this more
>> flexible, but I suspect that for now we will want to increase the
>> min_alloc size to 16k to help alleviate memory consumption and metadata
>> overhead (and improve 4k random write performance!). The extra WAL
>> write is probably still worth the tradeoff for now.
>>
>> On the read side we are seeing some regressions. The sequential read
>> case is interesting. We're doing quite a bit worse in recent bluestore
>> even vs older bluestore, and generally quite a bit worse than
>> filestore. Increasing the min_alloc size reduces the degradation, but
>> we still have ground to make up. In these tests rbd readahead is being
>> used in an attempt to achieve client-side readahead, since bluestore no
>> longer does it on the OSD side, but it appears to be fairly
>> ineffective.
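[Editor's note: a back-of-the-envelope sketch of the min_alloc point above. Fewer allocation units per object means fewer extent records pushed through rocksdb, which is the metadata-load reduction Mark describes. The 4 MiB object size is the RBD default; the figures are arithmetic, not measurements.]

```python
# Allocation units needed to fully write one RBD object at various
# bluestore min_alloc sizes. Each unit carries extent/blob metadata,
# so a 4x larger min_alloc means roughly 4x fewer metadata records.
OBJECT_SIZE = 4 * 1024 * 1024  # default RBD object size (4 MiB)

def alloc_units(min_alloc_size, object_size=OBJECT_SIZE):
    """Number of min_alloc-sized units covering one object."""
    return object_size // min_alloc_size

for size in (4096, 16384, 65536):
    print(f"min_alloc={size:>6}: {alloc_units(size):>5} units per object")
# 4k -> 1024 units, 16k -> 256 units, 64k -> 64 units
```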
>> These are the settings used:
>>
>> rbd readahead disable after bytes = 0
>> rbd readahead max bytes = 4194304
>>
>> By default we require 10 sequential reads to trigger it. I don't think
>> that should be a problem, but perhaps lowering the threshold will help.
>> In general this is an area we still need to focus on.
>>
>> For random reads, the degradation was previously found to be due to the
>> async messenger. Both filestore and bluestore performance has degraded
>> relative to Jewel in these tests. Haomai suspects fast dispatch as the
>> primary bottleneck here.
>>
>> So in general, the areas I think we still need to focus on:
>>
>> 1) Memory allocator work (much easier caching configuration, better
>>    memory usage, better memory fragmentation, etc.)
>> 2) Long-running tests (Somnath has been doing this, thanks Somnath!)
>> 3) Sequential read performance in bluestore (need to understand this
>>    better)
>> 4) Fast dispatch performance improvement (Dan's RCU work?)
>
> Oh, it would be great to hear whether anyone is working on RCU. Is it
> really ongoing?
>
>>
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
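[Editor's note: a sketch of the readahead experiment suggested above. The first two options are the ones quoted in the thread; `rbd readahead trigger requests` is the option behind the 10-sequential-reads default, and the value 4 here is an illustrative lower threshold, not a tested recommendation.]

```ini
# ceph.conf, client side: keep client-side rbd readahead always on and
# make it trigger after fewer sequential reads than the default of 10.
[client]
rbd readahead disable after bytes = 0
rbd readahead max bytes = 4194304
rbd readahead trigger requests = 4   ; assumed lower threshold (default 10)
```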