Performance Tuning and Ceph

Hi Folks,


Recently there was a thread called "Tuning Nautilus for flash only" that included a reference to a bluestore performance blog post from earlier this year on the Ceph community website.  There was some concern in that thread regarding some of the tuning parameters presented in the article.  We discussed it in the core standup earlier this week and felt we should address it.  I've included a reply that Paul made in that thread, as I think it's particularly relevant.  Before I get into that, though, I absolutely want to encourage folks to run performance tests and report their findings.  To that end, I want to thank Karan and Daniel for their hard work and for being willing to present their results.  This kind of work is difficult, and presenting the results publicly can be a little rough!  Thank you Karan and Daniel, and please continue running tests and reporting your findings!


I also want to thank Paul for making several extremely important and valid points below.  I completely agree that some of the tuning parameters presented in the article shouldn't be used in production.  Beyond disabling checksumming and authentication, I would highly encourage folks to think about the ramifications of setting very low numbers of pg log entries (especially when combined with low per-pool PG counts via the autoscaler).  The effect on recovery could be significant.  Several other tunings in the article may have unintended consequences.  Imagine, for instance, what could happen with 32 concurrent rocksdb compaction threads per OSD on a server that has a large number of OSDs, oversubscribed DB devices, and underpowered CPUs.  Personally, I would be concerned about the overhead under heavy load with large databases full of OMAP data.  There are cases where our defaults may not be optimal, but many were set after a fair amount of performance testing (and even more QE testing).  We tend to be more conservative than not, but often there is at least some level of thought and testing behind the defaults.
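
To make the concern concrete, here is a rough ceph.conf sketch of the class of tunings being discussed.  The option names are real Ceph options, but the values are hypothetical stand-ins rather than anything taken from the article, and this is emphatically not something to copy into production:

    # Illustrative fragment only -- values are hypothetical examples of this
    # class of tuning.  Do NOT use in production.
    [global]
    # turning off cephx removes authentication between daemons and clients
    auth_cluster_required = none
    auth_service_required = none
    auth_client_required = none

    [osd]
    # disabling data checksums removes bluestore's ability to detect corruption
    bluestore_csum_type = none
    # very short pg logs save memory and metadata writes, but force full
    # backfill instead of log-based recovery after even brief OSD outages
    osd_min_pg_log_entries = 10
    osd_max_pg_log_entries = 10
    # many concurrent rocksdb compactions per OSD can overwhelm CPUs and a
    # shared DB device on dense nodes (rest of the rocksdb option string omitted)
    bluestore_rocksdb_options = max_background_compactions=32,...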


In some cases, the optimal tuning may also be hardware or workload specific.  In the community test lab we have two different classes of performance nodes.  One is about 4 years old and uses older Xeon processors and P3700 NVMe drives.  Several years ago, when bluestore was young, we saw that a 16K min_alloc size was significantly faster than 4K for small write workloads, primarily due to encode/decode overhead.  As bluestore matured and improved, the gap between 16K and 4K min_alloc sizes on that hardware largely evaporated.  On our newer nodes, however, we see a significant small write performance improvement when using a 4K min_alloc size (likely because CPU overhead during WAL writes is now a bigger bottleneck than metadata IO in the DB).  Of course, the min_alloc size has a huge effect on the space amplification of small objects as well.  This is just one example where an old set of tests on a single hardware configuration may not tell the whole story (or may even tell the wrong story).
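
One practical note if you want to experiment with this yourself: min_alloc_size is baked into an OSD when it is created (at mkfs time), so changing it means redeploying the OSD rather than just flipping a config option.  A minimal sketch, with 4096 used purely as an example value:

    [osd]
    # only applies to OSDs created after this is set; existing OSDs keep
    # whatever value they were built with
    bluestore_min_alloc_size_ssd = 4096

    # check what a running OSD has configured (this reports the config value,
    # not necessarily the value the OSD was originally created with)
    ceph daemon osd.0 config get bluestore_min_alloc_size_ssd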


What I'm getting at here is that you shouldn't necessarily trust any single set of tests (including mine!).  This is especially true when multiple configuration parameters are changed at the same time and it's not clear how each parameter is affecting the results.  I would encourage folks to look at multiple sets of results, look especially at tests that change a single parameter at a time, and also give higher credence to results that provide evidence for why performance changed.  This might include profiling data, examples where specific code is shown to be sub-optimal, or corroborating data from tests run by other users. And Paul's advice below to run your own benchmarks that are relevant to your use case is spot on as well.
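
If you want a starting point for running your own tests, something along these lines is a reasonable sketch; the pool and image names are placeholders, and the block size, queue depth, and runtime should be adjusted to reflect your actual workload:

    # object-level baseline: 60s of 4 KiB writes with 16 concurrent ops,
    # then random reads against the objects left behind by --no-cleanup
    rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup
    rados bench -p testpool 60 rand -t 16
    rados -p testpool cleanup

    # block-level test against an RBD image using fio's rbd engine
    fio --name=rbd-randwrite --ioengine=rbd --pool=testpool --rbdname=testimg \
        --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based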


Thanks,

Mark



On 11/28/19 10:46 AM, Paul Emmerich wrote:
Please don't run this config in production.
Disabling checksumming is a bad idea, and disabling authentication is
also pretty bad.

There are also a few options in there that no longer exist (osd op
threads) or are no longer relevant (max open files).  In general, you
should not blindly copy config files you find on the Internet; only
set an option to a non-default value after carefully checking what
it does and whether it applies to your use case.

Also, run benchmarks yourself. Use benchmarks that are relevant to
your use case.

Paul




