Performance Tuning and Ceph

Hi Folks,


Recently there was a thread called "Tuning Nautilus for flash only" that included a reference to a bluestore performance blog post from earlier this year on the Ceph community website.  There was some concern in that thread regarding some of the tuning parameters presented in the article.  We discussed it in the core standup earlier this week and felt we should address it.  I've included a reply that Paul made in that thread, as I think it's particularly relevant.  Before I get into that, though, I absolutely want to encourage folks to run performance tests and report their findings.  To that end, I want to thank Karan and Daniel for their hard work and for being willing to present their results.  This kind of work is difficult, and presenting the results publicly can be a little rough!  Thank you Karan and Daniel, and please continue running tests and reporting your findings!


I also want to thank Paul for making several extremely important and valid points below.  I completely agree that some of the tuning parameters presented in the article shouldn't be used in production.  Beyond disabling checksumming and authentication, I would highly encourage folks to think about the ramifications of setting very low numbers of pg log entries (especially when combined with low per-pool PG counts via the autoscaler).  The effect on recovery could be significant.  Several other tunings in the article may have unintended consequences.  Imagine, for instance, what could happen with 32 concurrent rocksdb compaction threads per OSD on a server that has a large number of OSDs, oversubscribed DB devices, and underpowered CPUs.  Personally, I would be concerned about the overhead under heavy load with large databases full of OMAP data.  There are cases where our defaults may not be optimal, but many were set after a fair amount of performance testing (and even more QE testing).  We tend to be more conservative than not, but often there is at least some level of thought and testing behind the defaults.
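
To make the concern concrete, here is a rough ceph.conf sketch of the class of tunings being discussed.  The option names are real Ceph options, but the values are hypothetical stand-ins rather than anything taken from the article, and this is emphatically not something to copy into production:

    # Illustrative fragment only -- values are hypothetical examples of this
    # class of tuning.  Do NOT use in production.
    [global]
    # turning off cephx removes authentication between daemons and clients
    auth_cluster_required = none
    auth_service_required = none
    auth_client_required = none

    [osd]
    # disabling data checksums removes bluestore's ability to detect corruption
    bluestore_csum_type = none
    # very short pg logs save memory and metadata writes, but force full
    # backfill instead of log-based recovery after even brief OSD outages
    osd_min_pg_log_entries = 10
    osd_max_pg_log_entries = 10
    # many concurrent rocksdb compactions per OSD can overwhelm CPUs and a
    # shared DB device on dense nodes (rest of the rocksdb option string omitted)
    bluestore_rocksdb_options = max_background_compactions=32,...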


In some cases, the optimal tuning may also be hardware or workload specific.  In the community test lab we have two different classes of performance nodes.  One is about 4 years old and uses older Xeon processors and P3700 NVMe drives.  Several years ago, when bluestore was young, we saw that a 16K min_alloc size was significantly faster than 4K for small write workloads, primarily due to encode/decode overhead.  As bluestore matured and improved, the gap between 16K and 4K min_alloc sizes on that hardware largely evaporated.  On our newer nodes, however, we see a significant small write performance improvement when using a 4K min_alloc size (likely because CPU overhead during WAL writes is now a bigger bottleneck than metadata IO in the DB).  Of course, the min_alloc size has a huge effect on the space amplification of small objects as well.  This is just one example where an old set of tests on a single hardware configuration may not tell the whole story (or may even tell the wrong story).
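
One practical note if you want to experiment with this yourself: min_alloc_size is baked into an OSD when it is created (at mkfs time), so changing it means redeploying the OSD rather than just flipping a config option.  A minimal sketch, with 4096 used purely as an example value:

    [osd]
    # only applies to OSDs created after this is set; existing OSDs keep
    # whatever value they were built with
    bluestore_min_alloc_size_ssd = 4096

    # check what a running OSD has configured (this reports the config value,
    # not necessarily the value the OSD was originally created with)
    ceph daemon osd.0 config get bluestore_min_alloc_size_ssd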


What I'm getting at here is that you shouldn't necessarily trust any single set of tests (including mine!).  This is especially true when multiple configuration parameters are changed at the same time and it's not clear how each parameter is affecting the results.  I would encourage folks to look at multiple sets of results, look especially at tests that change a single parameter at a time, and also give higher credence to results that provide evidence for why performance changed.  This might include profiling data, examples where specific code is shown to be sub-optimal, or corroborating data from tests run by other users. And Paul's advice below to run your own benchmarks that are relevant to your use case is spot on as well.
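
If you want a starting point for running your own tests, something along these lines is a reasonable sketch; the pool and image names are placeholders, and the block size, queue depth, and runtime should be adjusted to reflect your actual workload:

    # object-level baseline: 60s of 4 KiB writes with 16 concurrent ops,
    # then random reads against the objects left behind by --no-cleanup
    rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup
    rados bench -p testpool 60 rand -t 16
    rados -p testpool cleanup

    # block-level test against an RBD image using fio's rbd engine
    fio --name=rbd-randwrite --ioengine=rbd --pool=testpool --rbdname=testimg \
        --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based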


Thanks,

Mark



On 11/28/19 10:46 AM, Paul Emmerich wrote:
Please don't run this config in production.
Disabling checksumming is a bad idea, and disabling authentication is
also pretty bad.

There are also a few options in there that no longer exist (osd op
threads) or are no longer relevant (max open files).  In general, you
should not blindly copy config files you find on the Internet; only
set an option to a non-default value after carefully checking what
it does and whether it applies to your use case.

Also, run benchmarks yourself. Use benchmarks that are relevant to
your use case.

Paul




