Mark,

Thanks. Hoping this response doesn't get bounced by ceph-devel after changing the output to text from HTML. The question that immediately comes to my mind is "what size SSD partition would be necessary to avoid this write cliff from ever occurring for a given sized HDD/OSD?" That is of course workload dependent and up to the user to determine, but it is something to consider when sizing Luminous clusters.

Bruce

From: Mark Nelson <mnelson@xxxxxxxxxx>
Date: Thursday, July 6, 2017 at 1:07 PM
To: "McFarland, Bruce" <Bruce.McFarland@xxxxxxxxxxxx>, ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
Subject: Re: Rados Bench Scaling question from today's Ceph Perf Call

Hi Bruce,

Sorry, my earlier reply wasn't to the list, so I'm reposting here along with a bit more info.

In that specific test, bluestore was on an OSD with the data on an HDD and the metadata on an NVMe drive. The cliff corresponded with reads during writes to the HDD, which typically means we've filled up the entire rocksdb metadata partition on the NVMe drive and bluefs is rolling new SST files over to the spinning disk (with the associated slowdown). That was about 98GB of metadata for 6M objects. I suspect that if I run another test with a larger metadata partition, the cliff will get pushed farther out. It's also possible that if rocksdb compression were enabled we might be able to fit far more onodes in the database, at the expense of higher CPU usage.

In this case a larger onode cache doesn't seem to help much, since these are new objects and the getattr reads happening in PGBackend::objects_get_attr don't return anything.

The trace from dequeue_op on looks something like:

+ 86.30% PrimaryLogPG::do_op
| + 84.75% PrimaryLogPG::find_object_context
| | + 84.75% PrimaryLogPG::get_object_context
| | + 84.70% PGBackend::objects_get_attr
| | | + 84.70% BlueStore::getattr
| | | + 84.70% BlueStore::Collection::get_onode
| | | + 84.65% RocksDBStore::get
| | | | + 84.65% rocksdb::DB::Get
| | | | + 84.65% rocksdb::DB::Get
| | | | + 84.65% rocksdb::DBImpl::Get
| | | | + 84.65% rocksdb::DBImpl::GetImpl
| | | | + 84.65% rocksdb::Version::Get
| | | | + 84.65% rocksdb::TableCache::Get
| | | | + 84.65% rocksdb::BlockBasedTable::Get
| | | | + 84.50% rocksdb::BlockBasedTable::NewDataBlockIterator
| | | | | + 84.50% rocksdb::BlockBasedTable::NewDataBlockIterator
| | | | | + 84.50% rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache
| | | | | + 84.45% rocksdb::(anonymous namespace)::ReadBlockFromFile
| | | | | | + 84.45% rocksdb::ReadBlockContents
| | | | | | + 84.45% ReadBlock
| | | | | | + 84.45% rocksdb::RandomAccessFileReader::Read
| | | | | | + 84.45% BlueRocksRandomAccessFile::Read
| | | | | | + 84.45% read_random
| | | | | | + 84.45% BlueFS::_read_random
| | | | | | + 84.45% KernelDevice::read_random
| | | | | | + 84.45% KernelDevice::direct_read_unaligned
| | | | | | + 84.45% pread
| | | | | | + 84.45% pread64

Mark

On 07/06/2017 01:24 PM, McFarland, Bruce wrote:
> Mark,
> In today's perf call you showed filestore and bluestore write cliffs. What, in your opinion, is the cause of the bluestore write cliff? Is it the size of the bluefs and/or rocksdb cache? You mentioned it could be solved by more HW, which I took to mean a bigger cache. Is that a correct assumption?
> Thanks for the presentation.
> Bruce
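
As a rough sanity check on the sizing question above, here is a minimal back-of-envelope sketch. It simply assumes the ~98GB-for-6M-objects figure from Mark's test scales linearly (roughly 16 KB of rocksdb metadata per object); the helper name, the headroom factor, and the example object size are purely illustrative, not anything measured in this thread:

  # Rough sizing sketch based on the numbers quoted in the thread:
  # ~98 GB of rocksdb metadata for 6M objects, i.e. on the order of
  # 16 KB per object. Per-object overhead is workload dependent.

  def estimate_db_partition_gb(osd_capacity_tb, avg_object_size_mb,
                               metadata_bytes_per_object=16 * 1024,
                               headroom=1.5):
      """Estimate the DB partition size (GB) needed so that metadata
      for a full OSD still fits on the fast device (with headroom)."""
      objects = (osd_capacity_tb * 1024 * 1024) / avg_object_size_mb
      metadata_bytes = objects * metadata_bytes_per_object
      return metadata_bytes * headroom / (1024 ** 3)

  # Example: an 8 TB HDD filled with 4 MB RADOS objects
  print("%.0f GB" % estimate_db_partition_gb(8, 4))   # ~48 GB

The real answer would still need to be validated against the actual workload, since small-object-heavy workloads push the per-object metadata cost (and the required partition size) up considerably.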