Re: Rados Bench Scaling question from today's Ceph Perf Call

That's the million dollar question, and sadly, as in most situations like this, the answer is that it depends.

If my math is right, in this case it looks like we are using roughly 16K per onode in the DB for 4K objects once you factor in rocksdb space-amp. I don't know how variable that might be as data gets compacted and moved around between different levels in rocksdb. Beyond that, things like enabling/disabling crc checksums, the min_alloc size, and the number of extents per object are going to affect the onode size too.

We should probably do a more extensive analysis of all of the key/value pairs associated with the onode for different-sized objects. Things like the encoding method used (varint, etc.) will have an impact on this too.
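
For a rough back-of-the-envelope answer to the "how big should the metadata partition be" question, here's a quick sketch (Python, just arithmetic -- the ~16K-per-onode figure comes from the test below, and the headroom factor and example HDD size are made-up numbers):

def db_partition_gb(num_objects, bytes_per_onode=16 * 1024, headroom=1.2):
    # Estimated rocksdb/bluefs metadata partition size (in GB) needed to
    # hold onode metadata for num_objects objects, with some headroom so
    # bluefs doesn't start rolling SSTs onto the spinning disk.
    return num_objects * bytes_per_onode * headroom / 1e9

print(db_partition_gb(6e6))          # ~118 GB: the ~98GB observed below plus headroom
print(db_partition_gb(8e12 / 4096))  # ~38,000 GB: worst case, an 8TB HDD packed with 4K objects

Obviously the per-onode number will move around with checksums, min_alloc size, and extent count as above, so this is only a starting point.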

Mark

On 07/06/2017 03:29 PM, McFarland, Bruce wrote:
Mark,
Thanks. Hoping this response doesn’t get bounced by ceph-devel after changing the output from html to text. The question that immediately comes to my mind is “what size SSD partition would be necessary to prevent this write cliff from ever occurring for a given-sized HDD/OSD?” That is, of course, workload dependent and up to the user to determine, but it's something to consider when sizing Luminous clusters.

Bruce


From: Mark Nelson <mnelson@xxxxxxxxxx>
Date: Thursday, July 6, 2017 at 1:07 PM
To: "McFarland, Bruce" <Bruce.McFarland@xxxxxxxxxxxx>, ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
Subject: Re: Rados Bench Scaling question from today's Ceph Perf Call

Hi Bruce,

Sorry, my earlier reply wasn't to the list so reposting here along with
a bit more info.

In that specific test, bluestore was on an OSD with the data on an HDD
and the metadata on an NVMe drive.  The cliff corresponded with reads
hitting the HDD during the write workload, which typically means we've
filled up the entire rocksdb metadata partition on the NVMe drive and
bluefs is rolling new SST files over to the spinning disk (with the
associated slowdown).

That was about 98GB of metadata for 6M objects.  I suspect that if I run
another test with a larger metadata partition the cliff will get pushed
farther out.  It's also possible that if rocksdb compression were
enabled we could fit far more onodes in the database at the expense of
higher CPU usage.
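
A quick sanity check on those numbers (Python, just arithmetic on the
figures above; the 200GB partition is a hypothetical example):

# ~98GB of DB space for 6M objects works out to roughly 16KB per onode.
bytes_per_onode = 98e9 / 6e6
print(bytes_per_onode)            # ~16333 bytes

# If that ratio holds, a (hypothetical) 200GB metadata partition should
# push the cliff out to roughly 12M objects before bluefs has to spill
# SSTs over to the HDD.
print(200e9 / bytes_per_onode)    # ~12.2 million objects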

In this case a larger onode cache doesn't seem to help much since these
are new objects and the getattr reads happening in
PGBackend::objects_get_attr don't return anything.  The trace from
dequeue_op onward looks something like:

+ 86.30% PrimaryLogPG::do_op
| + 84.75% PrimaryLogPG::find_object_context
| | + 84.75% PrimaryLogPG::get_object_context
| |   + 84.70% PGBackend::objects_get_attr
| |   | + 84.70% BlueStore::getattr
| |   |   + 84.70% BlueStore::Collection::get_onode
| |   |     + 84.65% RocksDBStore::get
| |   |     | + 84.65% rocksdb::DB::Get
| |   |     |   + 84.65% rocksdb::DB::Get
| |   |     |     + 84.65% rocksdb::DBImpl::Get
| |   |     |       + 84.65% rocksdb::DBImpl::GetImpl
| |   |     |         + 84.65% rocksdb::Version::Get
| |   |     |           + 84.65% rocksdb::TableCache::Get
| |   |     |             + 84.65% rocksdb::BlockBasedTable::Get
| |   |     |               + 84.50% rocksdb::BlockBasedTable::NewDataBlockIterator
| |   |     |               | + 84.50% rocksdb::BlockBasedTable::NewDataBlockIterator
| |   |     |               |   + 84.50% rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache
| |   |     |               |     + 84.45% rocksdb::(anonymous namespace)::ReadBlockFromFile
| |   |     |               |     | + 84.45% rocksdb::ReadBlockContents
| |   |     |               |     |   + 84.45% ReadBlock
| |   |     |               |     |     + 84.45% rocksdb::RandomAccessFileReader::Read
| |   |     |               |     |       + 84.45% BlueRocksRandomAccessFile::Read
| |   |     |               |     |         + 84.45% read_random
| |   |     |               |     |           + 84.45% BlueFS::_read_random
| |   |     |               |     |             + 84.45% KernelDevice::read_random
| |   |     |               |     |               + 84.45% KernelDevice::direct_read_unaligned
| |   |     |               |     |                 + 84.45% pread
| |   |     |               |     |                   + 84.45% pread64


Mark

On 07/06/2017 01:24 PM, McFarland, Bruce wrote:
Mark,

In today’s perf call you showed filestore and bluestore write cliffs.
What, in your opinion, is the cause of the bluestore write cliff? Is
that the size of the bluefs and/or rocksdb cache? You mentioned it could
be solved by more HW, which I took to mean a bigger cache. Is that a
correct assumption?

Thanks for the presentation.

Bruce







