On Fri, Oct 7, 2016 at 10:03 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> Hi Guys,
>
> I wanted to give folks a quick snapshot of bluestore performance on our
> NVMe test setup. There's a lot happening very quickly in the code, so
> here's a limited snapshot of how we are doing. These are short-running
> tests (5 minutes each), so do keep that in mind.
>
> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZRXFKRDBrTjZiWGc
>
> The gist of it is:
>
> We are now basically faster than filestore in all short write tests.
> Increasing the min_alloc size (and, to a lesser extent, the number of
> cached onodes) brings 4K random write performance up even further.
> Increasing the min_alloc size will likely improve long-running test
> performance as well, due to the drastically reduced metadata load on
> rocksdb. Related to this, the amount of memory we cache with a 4k
> min_alloc is pretty excessive, even when limiting the number of cached
> onodes to 4k. The memory allocator work should allow us to make this
> more flexible, but I suspect that for now we will want to increase the
> min_alloc size to 16k to help alleviate memory consumption and metadata
> overhead (and improve 4k random write performance!). The extra WAL
> write is probably still worth the tradeoff for now.
>
> On the read side we are seeing some regressions. The sequential read
> case is interesting: we're doing quite a bit worse in recent bluestore,
> even vs older bluestore, and generally quite a bit worse than filestore.
> Increasing the min_alloc size reduces the degradation, but we still
> have ground to make up. In these tests rbd readahead is being used in
> an attempt to achieve client-side readahead, since bluestore no longer
> does it on the OSD side, but it appears to be fairly ineffective. These
> are the settings used:
>
> rbd readahead disable after bytes = 0
> rbd readahead max bytes = 4194304
>
> By default we require 10 sequential reads to trigger readahead. I don't
> think that should be a problem, but perhaps lowering the threshold will
> help. In general this is an area we still need to focus on.
>
> For random reads, the degradation was previously found to be due to the
> async messenger. Both filestore and bluestore performance has degraded
> relative to Jewel in these tests. Haomai suspects fast dispatch as the
> primary bottleneck here.
>
> So in general, the areas I think we still need to focus on:
>
> 1) Memory allocator work (much easier caching configuration, better
>    memory usage, less memory fragmentation, etc.)
> 2) Long-running tests (Somnath has been doing this, thanks Somnath!)
> 3) Sequential read performance in bluestore (we need to understand this
>    better)
> 4) Fast dispatch performance improvement (Dan's RCU work?)

Oh, it would be great to hear whether anyone is working on RCU. Is it
really ongoing?

> Mark
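
For anyone who wants to poke at the min_alloc comparison themselves, the
knobs discussed above live in the [osd] section of ceph.conf. The option
names below are from memory for this vintage of the bluestore code, so
treat this as a sketch and double-check them against your build:

  [osd]
  # bump the minimum allocation unit from 4k to 16k
  # (assumed option name: bluestore_min_alloc_size)
  bluestore min alloc size = 16384
  # cap the number of cached onodes, as in the 4k-onode runs above
  # (option name from memory; verify it exists in your build)
  bluestore onode cache size = 4096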
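
On the readahead point: the 10-read trigger threshold should also be
tunable on the client side, so lowering it is a cheap experiment. I
believe the option is rbd_readahead_trigger_requests (please verify),
alongside the two settings already quoted above:

  [client]
  rbd readahead disable after bytes = 0
  rbd readahead max bytes = 4194304
  # default is 10 sequential reads before readahead kicks in;
  # try a lower trigger to see if client-side readahead starts helping
  rbd readahead trigger requests = 4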
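
And in case it helps others run the same kind of short test locally,
here is roughly the fio job I would use for the 5-minute 4K random
write case against an RBD image. This is only an illustrative sketch
(pool and image names are placeholders), not necessarily the harness
that produced the numbers above:

  [global]
  ; librbd engine, talks to the cluster directly (no kernel mount)
  ioengine=rbd
  clientname=admin
  pool=rbd
  ; placeholder image name; create it before running
  rbdname=fio-test
  direct=1
  time_based
  runtime=300
  group_reporting

  [randwrite-4k]
  rw=randwrite
  bs=4k
  iodepth=32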