Hi Guys,
I wanted to give folks a quick snapshot of bluestore performance on our
NVMe test setup. A lot is changing very quickly in the code, so this is
only a limited view of where we stand. These are short-running tests
(5 minutes each), so keep that in mind.
https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZRXFKRDBrTjZiWGc
The gist of it is:
We are now basically faster than filestore in all short write tests.
Increasing the min_alloc size (and, to a lesser extent, the number of
cached onodes) brings 4K random write performance up even further.
Increasing the min_alloc size will likely improve long-running test
performance as well, due to the drastically reduced metadata load on
rocksdb.
Related to this, the amount of memory that we cache with a 4k
min_alloc size is pretty excessive, even when limiting the number of
cached onodes to 4k. The memory allocator work should let us make this
more flexible, but I suspect that for now we will want to increase the
min_alloc size to 16k to reduce memory consumption and metadata
overhead (and improve 4k random write performance!). The extra WAL
write is probably still worth the tradeoff for now.
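For anyone who wants to experiment with this, a minimal ceph.conf
sketch of the change I'm suggesting (assuming the bluestore min alloc
size option; the value is in bytes):

bluestore min alloc size = 16384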
On the read side we are seeing some regressions. The sequential read
case is interesting: we're doing quite a bit worse in recent bluestore
even vs older bluestore, and generally quite a bit worse than
filestore. Increasing the min_alloc size reduces the degradation, but
we still have ground to make up. In these tests rbd readahead is
enabled in an attempt to get client-side readahead, since bluestore no
longer does it on the OSD side, but it appears to be fairly
ineffective. These are the settings used:
rbd readahead disable after bytes = 0
rbd readahead max bytes = 4194304
By default we require 10 sequential reads to trigger readahead. I
don't think that should be a problem, but perhaps lowering the
threshold will help; see the sketch below. In general this is an area
we still need to focus on.
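If anyone wants to try a lower threshold, a minimal ceph.conf sketch
(assuming the rbd readahead trigger requests option, which I believe
is what controls the 10-read default):

rbd readahead trigger requests = 4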
For random reads, the degradation was previously found to be due to
the async messenger. Both filestore and bluestore have degraded
relative to Jewel in these tests. Haomai suspects fast dispatch as the
primary bottleneck here.
So in general, the areas I think we still need to focus on:
1) Memory allocator work (much easier caching configuration, better
memory usage, less memory fragmentation, etc.)
2) Long running tests (Somnath has been doing this, thanks Somnath!)
3) Sequential read performance in bluestore (need to understand this better)
4) Fast dispatch performance improvement (Dan's RCU work?)
Mark