Re: bluestore prefer wal size

On 03/06/2017 11:53 AM, Sage Weil wrote:
On Mon, 6 Mar 2017, Mark Nelson wrote:
On 03/06/2017 08:59 AM, Sage Weil wrote:
On Fri, 3 Feb 2017, Nick Fisk wrote:
Hi Mark,

-----Original Message-----
From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
Sent: 03 February 2017 15:20
To: nick@xxxxxxxxxx; 'Sage Weil' <sage@xxxxxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: bluestore prefer wal size

Hi Nick,

So I'm still catching up to your testing, but last night I ran through a
number of iodepth=1 fio rbd tests with a single OSD and a single client:

https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZWE85OFI3Q2xQZ00

I tested HDD, HDD+NVMe, and NVMe configurations, comparing filestore
against bluestore with 1k and 16k prefer WAL sizes.  I believe I'm seeing
results similar to yours.  On NVMe the two are pretty close, likely
because of the raw backend throughput the NVMe drives can sustain, but on
HDD bluestore is doing quite a bit worse than filestore for this specific
use case.  During the bluestore HDD tests I watched disk access, and it
appears we are saturating the disk with small writes despite the low
client performance.  I'll be digging into this more today, but I wanted
to send an update to let you know I believe I've reproduced your results
and am looking into it.

Glad to hear that you can reproduce it.  After debugging that if
statement, I think I'm at the point where I'm out of my depth, but if
there is anything I can do, let me know.

I finally got my dev box HDD sorted out and retested this.  The IOPS were
being halved because bluestore was triggering a rocksdb commit (and the
associated latency) just to retire the WAL record.  I rebased and fixed it
so that the WAL entries are deleted lazily (in the next commit round), and
it's 2x faster than before.  I'm getting ~110 IOPS with rados bench 4k
writes at queue depth 1 (6TB 7200rpm WD Black).  That's in the
neighborhood of what I'd expect from a spinner... it's basically seeking
back and forth between two write positions (the rocksdb log and the new
object data at the allocator position).  We could probably bring this up
a bit by batching the WAL work on HDD (build up several IOs' worth and
dispatch them all at once to reduce seeks).  That'll take a bit more
work, though.
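
To make the lazy-retirement idea concrete, here is a minimal,
self-contained C++ sketch.  It is not the actual BlueStore/rocksdb code;
KvStore, WalManager, and the key names are all hypothetical.  It shows a
finished WAL record being queued for deletion and folded into the next
commit round, instead of paying a dedicated kv commit of its own.

// Sketch: retire WAL records lazily instead of committing just to delete them.
#include <iostream>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Stand-in for the key/value backend (rocksdb in BlueStore).
struct KvStore {
  std::map<std::string, std::string> data;
  int commits = 0;

  // The commit (log sync) is the expensive part; we want as few as possible.
  void commit(const std::map<std::string, std::string>& puts,
              const std::vector<std::string>& deletes) {
    for (const auto& p : puts) data[p.first] = p.second;
    for (const auto& k : deletes) data.erase(k);
    ++commits;
  }
};

class WalManager {
  KvStore& kv;
  std::mutex lock;
  std::vector<std::string> pending_cleanups;  // WAL keys to delete lazily
 public:
  explicit WalManager(KvStore& store) : kv(store) {}

  // Called once a WAL entry's deferred write has hit the data device.
  // The eager version would call kv.commit({}, {wal_key}) right here,
  // paying a full commit just to drop the record.
  void wal_applied(const std::string& wal_key) {
    std::lock_guard<std::mutex> l(lock);
    pending_cleanups.push_back(wal_key);   // lazy: just remember it
  }

  // Piggy-back the queued deletions on the next regular commit round.
  void commit_round(const std::map<std::string, std::string>& new_work) {
    std::vector<std::string> cleanups;
    {
      std::lock_guard<std::mutex> l(lock);
      cleanups.swap(pending_cleanups);
    }
    kv.commit(new_work, cleanups);         // one commit covers both
  }
};

int main() {
  KvStore kv;
  WalManager wal(kv);

  // Write #1: its WAL record lands with the commit of the write itself.
  wal.commit_round({{"wal_1", "small overwrite payload"}});
  // Later the deferred data reaches the disk; no extra commit is issued.
  wal.wal_applied("wal_1");
  // The cleanup rides along with the next write's commit.
  wal.commit_round({{"wal_2", "next small overwrite"}});

  std::cout << "kv commits: " << kv.commits            // 2, not 3
            << ", wal_1 still present: " << kv.data.count("wal_1") << "\n";
}

In this toy run the cleanup of wal_1 rides along with wal_2's commit, so
two commits cover what the eager approach would have done in three.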

I've made this change too, and now I get about 400 IOPS out of a straight
HDD (no SSD).  Much better!  I set up the tunables so that the batching
behavior applies only to HDDs and not to SSDs; we may want to revisit
that decision later.
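
To illustrate the HDD-only batching, here is a small self-contained C++
sketch; it is not BlueStore code, and the names (Device, WalQueue,
wal_batch_max) are invented for illustration.  On a rotational device the
queued WAL work is accumulated and dispatched in one submission, while on
flash each op is dispatched immediately to keep latency low.

// Sketch: batch queued WAL work on rotational media, dispatch at once.
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct WalOp {
  uint64_t offset;
  std::string data;
};

struct Device {
  bool rotational;      // true for HDD, false for SSD/NVMe
  int submissions;      // how many times we went to the disk

  void submit(const std::vector<WalOp>& ops) {
    // A real implementation would sort ops by offset before issuing them,
    // so the head sweeps once instead of seeking per op.
    ++submissions;
    std::cout << "submit batch of " << ops.size() << " op(s)\n";
  }
};

class WalQueue {
  Device& dev;
  std::vector<WalOp> batch;
  size_t wal_batch_max;   // tunable; only matters for rotational media
 public:
  WalQueue(Device& d, size_t batch_max) : dev(d), wal_batch_max(batch_max) {}

  void queue(WalOp op) {
    if (!dev.rotational) {
      dev.submit({op});                // SSD: dispatch immediately
      return;
    }
    batch.push_back(std::move(op));    // HDD: build up several IOs' worth
    if (batch.size() >= wal_batch_max)
      flush();
  }

  void flush() {
    if (!batch.empty()) {
      dev.submit(batch);               // one trip to the disk for the batch
      batch.clear();
    }
  }
};

int main() {
  Device hdd{true, 0}, ssd{false, 0};
  WalQueue hq(hdd, 4), sq(ssd, 4);
  for (uint64_t i = 0; i < 8; ++i) {
    hq.queue({i * 4096, "x"});
    sq.queue({i * 4096, "x"});
  }
  hq.flush();
  std::cout << "HDD submissions: " << hdd.submissions   // 2 batches of 4
            << ", SSD submissions: " << ssd.submissions // 8 individual ops
            << "\n";
}

With a batch limit of 4, the eight small ops cost the HDD two submissions
instead of eight, which is the seek reduction the batching is after.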

With the WAL on an NVMe device, I get ~450 IOPS.

Want to give it a try?

What's the branch?

wip-bluestore-prefer-wal-size

I got a chance to give it a quick shot, but it's consistently stuck creating PGs, and eventually the monitor marks the OSDs down due to missing OSD and PG stats.  I'll give it a try with a single-OSD cluster tomorrow.


sage