Re: bluestore prefer wal size

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 03/06/2017 08:59 AM, Sage Weil wrote:
On Fri, 3 Feb 2017, Nick Fisk wrote:
Hi Mark,

-----Original Message-----
From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
Sent: 03 February 2017 15:20
To: nick@xxxxxxxxxx; 'Sage Weil' <sage@xxxxxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: bluestore prefer wal size

Hi Nick,

So I'm still catching up to your testing, but last night I ran through a number of tests with a single OSD, a single client, and iodepth=1 fio
rbd tests:

https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZWE85OFI3Q2xQZ00

I tested HDD, HDD+NVMe, and NVMe configurations looking at filestore, bluestore with 1k and 16k prefer wal sizes.  I believe I'm
seeing similar results to what you saw.  On NVMe we are pretty similar likely due to the raw backend throughput the NVMe drives can
maintain but on HDD, bluestore is doing quite a bit worse than filestore for this specific use case.  During the bluestore HDD tests, I
watched disk access and it appears we are saturating the disk with small writes despite the low client performance.  I'll be digging into
this more today but I wanted to send out an update to let you know I believe I've reproduced your results and am looking into it.

Glad to hear that you can reproduce. After debugging that IF statement,
I think I'm at the point where I am out of my depth. But if there is
anything I can do, let me know.

I finally got my dev box HDD sorted out and retested this.  The IOPS were
being halved because bluestore was triggering a rocksdb commit (and the
associated latency) just to retire the WAL record.  I rebased and fixed it
so that the WAL entries are deleted lazily (in the next commit round) and
it's 2x faster than before.  I'm getting ~110 IOPS with rados bench 4k
writes with queue depth 1 (6TB 7200rpm WD black).  That's in the
neighborhood of what I'd expect from a spinner... it's basically seeking
back and forth between two write positions (the rocksdb log and the new
object data where the allocator position is).  We could probably bring
this up a bit by making the WAL work be batched on HDD (build up several
IOs worth and dispatch it all at once to reduce seeks).  That'll take a
bit more work, though.

With WAL on an NVMe, I get about ~450 IOPS.

Want to give it a try?

What's the branch?


sage


 >

Thanks!
Mark

On 02/03/2017 03:49 AM, Nick Fisk wrote:


-----Original Message-----
From: Nick Fisk [mailto:nick@xxxxxxxxxx]
Sent: 02 February 2017 21:15
To: 'Mark Nelson' <mnelson@xxxxxxxxxx>; 'Sage Weil'
<sage@xxxxxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: RE: bluestore prefer wal size

Hi Mark,
-----Original Message-----
From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
Sent: 02 February 2017 20:17
To: nick@xxxxxxxxxx; 'Sage Weil' <sage@xxxxxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: bluestore prefer wal size

Hi Nick,

On 02/02/2017 08:46 AM, Nick Fisk wrote:
Further update

If I do random writes instead of sequential, I see the "defering
small write
via wal" action. But from the do_alloc_write function not small write.
 Ie. log entry is:
_do_alloc_write defering small 0x1000 write via wal

I just ran some tests using the branch with 4k random and sequential
writes against a single OSD.  I see deferring small write via wal
from do_alloc_write both for both sequential and random when
bluestore_prefer_wal_size is set to be larger than the IO size.
Otherwise, I don't see the messages.  I am also preconditioning with
4MB writes.

So far I've just been testing on NVMe, but I should be able to do
some mixed tests soon to see if there are any interesting performance implications.

I've just looked at the RBD I ran those random tests on and I
realized that it was only about 90% pre-conditioned due to me leaving
the time_based fio option on. After filling completely with writes, now both random and sequential are doing _do_write_small
writes and no deferring wal writes.

From looking through the code today, as far as I understand it

Write_small = write to existing object under min_allow_size Write_big
= write to existing object over min_allow_size Alloc_write = write to
a new object

Is that correct?

If it is, I'm interested in why you are seeing alloc_writes, as I
don't see them if the RBD is fully pre-conditioned? Unless there is something changing with the min_alloc_size as you are using non
rotational media, which is causing this change?

I'm adding a few extra dout's to the write_small function to try and see what's going on, hopefully I will know more tomorrow.


Morning,

So, results are in. I added some debugging before the IF statement
(https://github.com/liewegas/ceph/blob/944b4d3a3869a472c1e66a4fa73c127
66c7801ac/src/os/bluestore/BlueStore.cc#L7885)

So the following section now looks like below:

    dout(20) << __func__ << "  before IF 0x" << std::hex << b_off << "~" << b_len << " - " << b->get_blob().get_ondisk_length() << " -
" << b->get_blob().is_unused(b_off, b_len) << " - " << b->get_blob().is_allocated(b_off, b_len) << " - " << dendl;
    if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
        b->get_blob().get_ondisk_length() >= b_off + b_len &&
        b->get_blob().is_unused(b_off, b_len) &&
        b->get_blob().is_allocated(b_off, b_len)) {

And the output was

_do_write_small  before IF 0xf000~1000 - 10000 - 0 - 1 -

So the " b->get_blob().is_unused(b_off, b_len)" call is returning false, which stops it going into the IF block and doing the prefer wal
write. Is that expected?

Nick

Nick


Mark


And thatʼs for a 4kb random write on an RBD that has been fully
conditioned via fio 4MB writes

So it's something in the sequential nature of the test which trips
it up as I don't see any sign of deferred writes

From looking at the code in the small write function, I can only
see two possibilities that would stop if from reaching the prefer
wal IF statement

  while (ep != o->extent_map.extent_map.end()) {
    if (ep->blob_start() >= end) {
      break;

and

  if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
	b->get_blob().get_ondisk_length() >= b_off + b_len &&
	b->get_blob().is_unused(b_off, b_len) &&
	b->get_blob().is_allocated(b_off, b_len)) {

Unfortunately, thatʼs about as far as I have got, before my
knowledge of
Bluestore becomes a limitation.

Don't know if that triggers any thoughts?


2017-02-02 10:39:31.988782 7ff7282ff700 10
bluestore(/var/lib/ceph/osd/ceph-4) queue_transactions existing
0x56102a5a4900
osr(9.73 0x56102a5db1a0)
2017-02-02 10:39:31.988786 7ff7282ff700 20
bluestore(/var/lib/ceph/osd/ceph-4) _txc_create osr 0x56102a5a4900
=
0x56102ab4be40 seq 260
2017-02-02 10:39:31.988797 7ff7282ff700 15
bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
#9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 key
s
2017-02-02 10:39:31.988802 7ff7282ff700 10
bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
#9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 key
s
= 0
2017-02-02 10:39:31.988806 7ff7282ff700 15
bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
#9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
0xfb000~
1000
2017-02-02 10:39:31.988809 7ff7282ff700 20
bluestore(/var/lib/ceph/osd/ceph-4) _do_write
#9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
0xfb000~1000 - have 0x400000 (4194304) bytes fadvise_flags 0x0
2017-02-02 10:39:31.988812 7ff7282ff700 20
bluestore(/var/lib/ceph/osd/ceph-4) _do_write prefer csum_order 12
target_blob_size
0x80000
2017-02-02 10:39:31.988814 7ff7282ff700 10
bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small 0xfb000~1000
2017-02-02 10:39:31.988816 7ff7282ff700 20
bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small considering
Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260))
bstart 0x80000
2017-02-02 10:39:31.988821 7ff7282ff700 20
bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0) _discard
0x7b000~1000
2017-02-02 10:39:31.988827 7ff7282ff700 20
bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  wal write
0x7b000~1000 of mutable Blob(0x56102aac05a0
blob([0x256480000~80000]
mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
SharedBlob(0x56102aa76260)) at [0x2564fb000~1000]
2017-02-02 10:39:31.988832 7ff7282ff700 20
bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  lex
0xfb000~1000: 0x7b000~1000
Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
crc32c/0x1000)
ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
SharedBlob(0x56102aa76260))
2017-02-02 10:39:31.988836 7ff7282ff700 20
bluestore(/var/lib/ceph/osd/ceph-4) _do_alloc_write txc
0x56102ab4be40 0 blobs
2017-02-02 10:39:31.988837 7ff7282ff700 10 bitmapalloc:reserve
instance 94627427524464 num_used 345738 total 7627973
2017-02-02 10:39:31.988840 7ff7282ff700 20
bluestore(/var/lib/ceph/osd/ceph-4) _wctx_finish lex_old
0xfb000~1000: 0x7b000~1000
Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
crc32c/0x1000)
ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
SharedBlob(0x56102aa76260))
2017-02-02 10:39:31.988844 7ff7282ff700 20
bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
0xfb000~1000
next shard
0x100000 merging 0x80000~7b000: 0x0~7b000 Blob(0x56102aac05a0
blob([0x256480000~80000] mutable+csum crc32c/0x1000)
ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfb000~1000:
0x7b000~1000 Blob(0x56102aac05a0 blob([0x256480000~80000]
mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
SharedBlob(0x56102aa76260))
2017-02-02 10:39:31.988850 7ff7282ff700 20
bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
0xfb000~1000
next shard
0x100000 merging 0x80000~7c000: 0x0~7c000 Blob(0x56102aac05a0
blob([0x256480000~80000] mutable+csum crc32c/0x1000)
ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfc000~4000:
0x7c000~4000 Blob(0x56102aac05a0 blob([0x256480000~80000]
mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
SharedBlob(0x56102aa76260))
2017-02-02 10:39:31.988868 7ff7282ff700 20
bluestore.extentmap(0x56102a8ee3f0) dirty_range mark shard 0x80000
dirty
2017-02-02 10:39:31.988870 7ff7282ff700 10
bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
#9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
0xfb000~1000 = 0
2017-02-02 10:39:31.988875 7ff7282ff700 15
bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
#9:ce000000::::head#
2017-02-02 10:39:31.988882 7ff7282ff700 10
bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
#9:ce000000::::head# = 0
2017-02-02 10:39:31.988885 7ff7282ff700 20
bluestore(/var/lib/ceph/osd/ceph-4) _txc_write_nodes txc
0x56102ab4be40 onodes
0x56102a8ee300 shared_blobs
2017-02-02 10:39:31.988894 7ff7282ff700 20
bluestore.extentmap(0x56102a8ee3f0) update shard 0x80000 is 531
bytes
(was 531) from 1 extents
2017-02-02 10:39:31.988902 7ff7282ff700 20
bluestore(/var/lib/ceph/osd/ceph-4)   onode
#9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# is
424
(422 bytes onode + 2 bytes spanning blobs + 0 bytes inline
extents)
2017-02-02 10:39:31.988914 7ff7282ff700 10
bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
0x56102ab4be40 prepare
2017-02-02 10:39:31.988916 7ff7282ff700 20
bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_io 0x56102ab4be40
2017-02-02 10:39:31.988917 7ff7282ff700 10
bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
0x56102ab4be40 io_done
2017-02-02 10:39:31.988918 7ff7282ff700 20
bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0)
finish_write
buffer(0x56102adb1830 space 0x56102aa762b8 0x7b000~1000 writing
nocache)
2017-02-02 10:39:31.988939 7ff72e30b700 20
bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread wake
2017-02-02 10:39:31.988941 7ff72e30b700 20
bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committing 1
submitting 1 cleaning 0
2017-02-02 10:39:31.988952 7ff72e30b700 20
bluestore(/var/lib/ceph/osd/ceph-4) _txc_finalize_kv txc
0x56102ab4be40 allocated 0x[] released 0x[]
2017-02-02 10:39:31.988999 7ff72e30b700 10
bluestore(/var/lib/ceph/osd/ceph-4) _balance_bluefs_freespace
bluefs
9535 M free
(0.999896) bluestore 222 G free (0.954675), bluefs_ratio 0.0402143
2017-02-02 10:39:31.989400 7ff72e30b700 20
bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committed 1
cleaned 0 in 0.000456
2017-02-02 10:39:31.989405 7ff72e30b700 10
bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
0x56102ab4be40 kv_submitted
2017-02-02 10:39:31.989407 7ff72e30b700 20
bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_kv txc
0x56102ab4be40
2017-02-02 10:39:31.989413 7ff72e30b700 20
bluestore(/var/lib/ceph/osd/ceph-4) _wal_apply txc 0x56102ab4be40
seq
255
2017-02-02 10:39:31.989414 7ff72e30b700 20
bluestore(/var/lib/ceph/osd/ceph-4) _do_wal_op write
[0x2564fb000~1000]



sage


sage



sage


Nick

-----Original Message-----
From: Nick Fisk [mailto:nick@xxxxxxxxxx]
Sent: 01 February 2017 19:03
To: 'Sage Weil' <sweil@xxxxxxxxxx>; 'Mark Nelson'
<mnelson@xxxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: RE: bluestore prefer wal size

Hi Sage,

First results not looking good. It looks like write IO to the
SSD's (sdd and sdi)
is now massively amplified, by somewhere in the region of
about 10x. But I'm still only getting around 100 4kb's seq
write iops from the fio client. This is in comparison to
2500-3000 iops on the SSD's and ~200 iops per spinning disk (sdc,sde,sdg,sdh).

rbd engine: RBD version: 0.1.11
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/476KB/0KB /s]
[0/119/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31171: Wed
Feb
1
18:56:51 2017
  write: io=27836KB, bw=475020B/s, iops=115, runt= 60006msec
    slat (usec): min=6, max=142, avg=10.98, stdev= 8.38
    clat (msec): min=1, max=271, avg= 8.61, stdev= 3.63
     lat (msec): min=1, max=271, avg= 8.62, stdev= 3.63
    clat percentiles (msec):
     |  1.00th=[    8],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
     | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
     | 70.00th=[    9], 80.00th=[    9], 90.00th=[    9], 95.00th=[    9],
     | 99.00th=[   17], 99.50th=[   25], 99.90th=[   33], 99.95th=[   36],
     | 99.99th=[  273]
    bw (KB  /s): min=  191, max=  480, per=100.00%,
avg=464.18,
stdev=34.60
    lat (msec) : 2=0.01%, 4=0.04%, 10=97.69%, 20=1.55%, 50=0.69%
    lat (msec) : 500=0.01%
  cpu          : usr=0.13%, sys=0.12%, ctx=7224, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%,
32=0.0%,
=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%,
=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%, =64=0.0%
     issued    : total=r=0/w=6959/d=0, short=r=0/w=0/d=0,
drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

I've checked via the admin socket and the prefer wal option
is set to
8192.

Random capture of iostat

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz
avgqu-sz
await r_await w_await  svctm  %util
sda               0.00    77.00    0.00    5.50     0.00   338.00   122.91     0.05
11.64
0.00   11.64   8.73   4.80
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00
0.00
0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00  151.50     0.00   302.00     3.99     0.54
3.59
0.00    3.59   3.58  54.20
sdd               0.00     0.00    0.00 1474.00     0.00  4008.00     5.44     0.09
0.06
0.00    0.06   0.06   9.20
sde               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.91
4.20
0.00    4.20   4.20  91.20
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00
0.00
0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00   66.50     0.00   134.00     4.03     0.18
2.68
0.00    2.68   2.68  17.80
sdh               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.80
3.71
0.00    3.71   3.71  80.40
sdi               0.00     0.00    0.00 1134.00     0.00  3082.00     5.44     0.09
0.08
0.00    0.08   0.07   8.40

-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx]
Sent: 01 February 2017 15:34
To: nick@xxxxxxxxxx
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: bluestore prefer wal size

Hey Nick,

I've updated/improved the prefer wal size PR (that sends
small writes through the wal).  See

	https://github.com/ceph/ceph/pull/13217

if you want to try it out.

Thanks!
sage








--
To unsubscribe from this list: send the line "unsubscribe ceph-devel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx More
majordomo info at  http://vger.kernel.org/majordomo-info.html






--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux