CCing ceph-devel since this is probably good to discuss in the community now.
Background: A single NVMe OSD running bluestore (with a 4k min_alloc
size). With a 512GB RBD volume, 4k random write performance is quite a
bit lower than with a smaller 16GB RBD volume. Several of us have
noticed this in the past and various efforts were made to improve
encoding performance and eliminate other bottlenecks, though it appears
that it's still a fairly big issue. Last week we took a GDB wallclock
trace that showed ~95% of time in the finisher thread spent in PG::Lock
and the tp_osd_tp threads spending significant time doing onode cache
trimming.
We have more data now. Looking through the code and the trace, there
are a couple of issues here. Not only are we spending a ton of time in
the tp_osd_tp threads just doing cache trimming, but all of this time is
spent in OSD::ShardedOpWQ::_process where we are keeping the PG locked
while this work is being done.  That's very likely why the finisher is
spending ~95% of its time waiting on the PG lock.
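To make the contention pattern concrete without wading through the full
trace, here is a tiny standalone toy model (plain std::thread/std::mutex,
not Ceph code; the 100us/900us split is made up purely to mimic "op work
small, trim work large"):

  #include <chrono>
  #include <cstdio>
  #include <mutex>
  #include <thread>

  // Toy model only (not Ceph code): a "tp_osd_tp" worker holds a per-PG
  // mutex for the whole op, including the cache trim, while a "finisher"
  // thread needs the same mutex to run completions.  The point is just to
  // show why the finisher's wall clock ends up dominated by lock waits.
  static std::mutex pg_lock;

  static void tp_osd_tp()
  {
    for (int i = 0; i < 200; ++i) {
      std::lock_guard<std::mutex> l(pg_lock);
      std::this_thread::sleep_for(std::chrono::microseconds(100));  // do_op work
      std::this_thread::sleep_for(std::chrono::microseconds(900));  // onode trim, still under the lock
    }
  }

  static void finisher(long long *blocked_us)
  {
    for (int i = 0; i < 200; ++i) {
      auto t0 = std::chrono::steady_clock::now();
      std::lock_guard<std::mutex> l(pg_lock);   // completions need the PG lock too
      *blocked_us += std::chrono::duration_cast<std::chrono::microseconds>(
          std::chrono::steady_clock::now() - t0).count();
    }
  }

  int main()
  {
    long long blocked_us = 0;
    auto start = std::chrono::steady_clock::now();
    std::thread a(tp_osd_tp), b(finisher, &blocked_us);
    a.join();
    b.join();
    long long total_us = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("finisher blocked for %lld of %lld us\n", blocked_us, total_us);
    return 0;
  }

The bigger the share of the critical section that the trim takes, the
closer the finisher gets to spending all of its time blocked.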
Igor had the right idea regarding the cache: if we give the OSD
(virtually) unlimited onode cache space we avoid the worst of this and
go back up to 25-30K IOPS (though still short of the 36-37K IOPS with a
smaller RBD volume). I've included a partial gdbtrace showing an
example of tp_osd_tp, bstore_kv_final, bstore_kv_sync, and finisher threads:
https://pastebin.com/JvPrAyRk
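For reference, the "(virtually) unlimited" cache in that run was nothing
clever, just the BlueStore cache knob cranked way up in ceph.conf;
something along these lines (the option names have been churning between
branches, so treat this as illustrative rather than exact):

  [osd]
  # let the onode cache grow essentially without bound for this test
  # (on older builds the count-based bluestore_onode_cache_size option
  # is the equivalent knob)
  bluestore_cache_size = 10737418240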
We still end up spending a lot of time in run, but now it's at least
somewhat better looking. Despite that, we're still spending ~31% of the
time in finisher waiting on the PG lock (in this case 64 PGs for 1 NVMe
backed OSD). The gist of all of this is that bluestore performance on
NVMe is really dependent on onodes being cached.  I also suspect we may
not be able to adopt the lower PG-per-OSD ratios coming out of Loic's
work unless we can also change the locking behavior here.
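Just to illustrate what I mean by changing the locking behavior (not a
patch, and whether BlueStore can actually defer the trim safely is the
open question): in the toy model above, the shape of the change is to
move the trim outside the critical section, so the finisher only
contends with the much shorter do_op portion:

  // Variant of the toy worker above: the trim happens after the lock is
  // dropped (or, equivalently, gets handed off to a background trimmer).
  static void tp_osd_tp_trim_outside()
  {
    for (int i = 0; i < 200; ++i) {
      {
        std::lock_guard<std::mutex> l(pg_lock);
        std::this_thread::sleep_for(std::chrono::microseconds(100));  // do_op work
      }
      std::this_thread::sleep_for(std::chrono::microseconds(900));    // trim, PG lock already released
    }
  }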
Mark
On 05/26/2017 01:18 PM, Sage Weil wrote:
This is crazy!
| + 72.55% PrimaryLogPG::do_request
| | + 72.55% PrimaryLogPG::do_op
| | + 45.95% PrimaryLogPG::find_object_context
| | | + 45.95% PrimaryLogPG::get_object_context
| | | + 43.75% PGBackend::objects_get_attr
| | | | + 43.75% BlueStore::getattr
| | | | + 27.15% BlueStore::Collection::trim_cache
| | | | | + 27.15% BlueStore::Cache::trim
| | | | | + 27.15% BlueStore::LRUCache::_trim
| | | | | + 26.80% BlueStore::Onode::put
:/
On Fri, 26 May 2017, Mark Nelson wrote:
Hi Sage,
I noticed that while doing 4K random writes to fragment a big(er) RBD partition
than my normal setup, it was going much slower (~15K IOPS vs ~35-38K IOPS).  I
grabbed a wallclock trace while it was going to see if it looked interesting.
A ton of time is in ~Extent via _trim in the tp_osd_tp threads (See Ln #5881)
and again in the bstore_kv_final thread (See Ln #6508). There are some other
things in the bstore_kv_final thread, though the next biggest one is on 6888
where we are in lock_guard via BufferSpace::finish_write.
Mark