Re: bluestore prefer wal size

On Mon, 6 Mar 2017, Mark Nelson wrote:
> On 03/06/2017 08:59 AM, Sage Weil wrote:
> > On Fri, 3 Feb 2017, Nick Fisk wrote:
> > > Hi Mark,
> > > 
> > > > -----Original Message-----
> > > > From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> > > > Sent: 03 February 2017 15:20
> > > > To: nick@xxxxxxxxxx; 'Sage Weil' <sage@xxxxxxxxxxxx>
> > > > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > > Subject: Re: bluestore prefer wal size
> > > > 
> > > > Hi Nick,
> > > > 
> > > > So I'm still catching up to your testing, but last night I ran through a
> > > > number of tests with a single OSD, a single client, and iodepth=1 fio
> > > > rbd tests:
> > > > 
> > > > https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZWE85OFI3Q2xQZ00
> > > > 
> > > > I tested HDD, HDD+NVMe, and NVMe configurations, looking at filestore and
> > > > at bluestore with 1k and 16k prefer wal sizes.  I believe I'm seeing
> > > > similar results to what you saw.  On NVMe we are pretty similar, likely
> > > > due to the raw backend throughput the NVMe drives can maintain, but on HDD
> > > > bluestore is doing quite a bit worse than filestore for this specific use
> > > > case.  During the bluestore HDD tests I watched disk access, and it
> > > > appears we are saturating the disk with small writes despite the low
> > > > client performance.  I'll be digging into this more today, but I wanted to
> > > > send out an update to let you know I believe I've reproduced your results
> > > > and am looking into it.
> > > 
> > > Glad to hear that you can reproduce. After debugging that IF statement,
> > > I think I'm at the point where I am out of my depth. But if there is
> > > anything I can do, let me know.
> > 
> > I finally got my dev box HDD sorted out and retested this.  The IOPS were
> > being halved because bluestore was triggering a rocksdb commit (and the
> > associated latency) just to retire the WAL record.  I rebased and fixed it
> > so that the WAL entries are deleted lazily (in the next commit round) and
> > it's 2x faster than before.  I'm getting ~110 IOPS with rados bench 4k
> > writes with queue depth 1 (6TB 7200rpm WD black).  That's in the
> > neighborhood of what I'd expect from a spinner... it's basically seeking
> > back and forth between two write positions (the rocksdb log and the new
> > object data where the allocator position is).  We could probably bring
> > this up a bit by making the WAL work be batched on HDD (build up several
> > IOs worth and dispatch it all at once to reduce seeks).  That'll take a
> > bit more work, though.

I've made this change too, and now I get about 400 IOPS out of a straight 
HDD (no SSD).  Much better!  I set up the tunables so that it does the 
batching behavior only for HDD and not for SSD; we may want to revisit 
that decision later.
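
For context on the ~110 IOPS number above: at 7200rpm a single revolution is 
~8.3ms, so any pattern that has to reposition the head for every 4k write is 
going to top out somewhere around 100 IOPS before batching.  To illustrate the 
batching idea itself, here is a minimal, self-contained sketch; this is not the 
code in the branch, and the names (WalBatcher, batch_max, etc.) are made up:

    // Sketch only: accumulate deferred (WAL) replays and dispatch them in one
    // pass on rotational media, so several small IOs share a single seek.
    #include <functional>
    #include <vector>

    struct WalOp {
      std::function<void()> replay;   // the deferred write to apply to the data device
    };

    class WalBatcher {
      std::vector<WalOp> pending;
      const size_t batch_max;         // tunable: how many ops to accumulate
      const bool rotational;          // batch on HDD only, not SSD
    public:
      WalBatcher(size_t max, bool rot) : batch_max(max), rotational(rot) {}

      void queue(WalOp op) {
        pending.push_back(std::move(op));
        // SSDs don't pay a seek penalty, so dispatch immediately there.
        if (!rotational || pending.size() >= batch_max)
          flush();
      }

      void flush() {
        for (auto &op : pending)
          op.replay();
        pending.clear();
      }
    };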

> > With WAL on an NVMe, I get about ~450 IOPS.
> > 
> > Want to give it a try?
> 
> What's the branch?

wip-bluestore-prefer-wal-size
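
(Assuming it's pushed to the same github tree as the BlueStore.cc link earlier 
in the thread, something like

    git remote add sage https://github.com/liewegas/ceph.git
    git fetch sage
    git checkout -b wip-bluestore-prefer-wal-size sage/wip-bluestore-prefer-wal-size

should get you a tree to build and test.)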

sage


> 
> 
> > sage
> > 
> > 
> >  >
> > > > 
> > > > Thanks!
> > > > Mark
> > > > 
> > > > On 02/03/2017 03:49 AM, Nick Fisk wrote:
> > > > > 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > > > > > Sent: 02 February 2017 21:15
> > > > > > To: 'Mark Nelson' <mnelson@xxxxxxxxxx>; 'Sage Weil'
> > > > > > <sage@xxxxxxxxxxxx>
> > > > > > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > > > > Subject: RE: bluestore prefer wal size
> > > > > > 
> > > > > > Hi Mark,
> > > > > > > -----Original Message-----
> > > > > > > From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> > > > > > > Sent: 02 February 2017 20:17
> > > > > > > To: nick@xxxxxxxxxx; 'Sage Weil' <sage@xxxxxxxxxxxx>
> > > > > > > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > > > > > Subject: Re: bluestore prefer wal size
> > > > > > > 
> > > > > > > Hi Nick,
> > > > > > > 
> > > > > > > On 02/02/2017 08:46 AM, Nick Fisk wrote:
> > > > > > > > Further update
> > > > > > > > 
> > > > > > > > If I do random writes instead of sequential, I see the "defering small
> > > > > > > > write via wal" action, but coming from the _do_alloc_write function
> > > > > > > > rather than the small-write path.  I.e. the log entry is:
> > > > > > > > _do_alloc_write defering small 0x1000 write via wal
> > > > > > > 
> > > > > > > I just ran some tests using the branch with 4k random and
> > > > > > > sequential
> > > > > > > writes against a single OSD.  I see deferring small write via wal
> > > > > > > from do_alloc_write for both sequential and random when
> > > > > > > bluestore_prefer_wal_size is set to be larger than the IO size.
> > > > > > > Otherwise, I don't see the messages.  I am also preconditioning
> > > > > > > with
> > > > > > > 4MB writes.
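> > > > > > > 
> > > > > > > (For reference, the knob being toggled there is just the new config
> > > > > > > option; in ceph.conf terms the 16k case would look something like
> > > > > > > 
> > > > > > >     [osd]
> > > > > > >         bluestore prefer wal size = 16384
> > > > > > > 
> > > > > > > with 1024 swapped in for the 1k runs.  The deferral only kicks in when
> > > > > > > the value is larger than the IO size.)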
> > > > > > > 
> > > > > > > So far I've just been testing on NVMe, but I should be able to do
> > > > > > > some mixed tests soon to see if there are any interesting
> > > > > > > performance implications.
> > > > > > 
> > > > > > I've just looked at the RBD I ran those random tests on and realized
> > > > > > that it was only about 90% pre-conditioned, due to me leaving the
> > > > > > time_based fio option on. After filling it completely with writes, both
> > > > > > random and sequential are now doing _do_write_small writes and no
> > > > > > deferred wal writes.
> > > > > > 
> > > > > > From looking through the code today, as far as I understand it:
> > > > > > 
> > > > > > Write_small = write to an existing object, under min_alloc_size
> > > > > > Write_big   = write to an existing object, over min_alloc_size
> > > > > > Alloc_write = write to a new object
> > > > > > 
> > > > > > Is that correct?
> > > > > > 
> > > > > > If it is, I'm interested in why you are seeing alloc_writes, as I don't
> > > > > > see them when the RBD is fully pre-conditioned.  Unless something is
> > > > > > changing with the min_alloc_size because you are using non-rotational
> > > > > > media, and that is causing the difference?
> > > > > > 
> > > > > > I'm adding a few extra dout's to the write_small function to try and
> > > > > > see what's going on, hopefully I will know more tomorrow.
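> > > > > > 
> > > > > > To restate my reading above in pseudo-code (just how I currently picture
> > > > > > the dispatch, not the actual code; the condition helpers are invented):
> > > > > > 
> > > > > >     // hypothetical sketch of the write classification as I understand it
> > > > > >     if (is_new_object_or_unwritten_range())
> > > > > >       _do_alloc_write(...);    // allocate fresh space and write it
> > > > > >     else if (length < min_alloc_size)
> > > > > >       _do_write_small(...);    // overwrite path, candidate for wal deferral
> > > > > >     else
> > > > > >       _do_write_big(...);      // larger aligned overwrite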
> > > > > > 
> > > > > 
> > > > > Morning,
> > > > > 
> > > > > So, results are in. I added some debugging before the IF statement
> > > > > (https://github.com/liewegas/ceph/blob/944b4d3a3869a472c1e66a4fa73c12766c7801ac/src/os/bluestore/BlueStore.cc#L7885)
> > > > > 
> > > > > So the following section now looks like below:
> > > > > 
> > > > >     dout(20) << __func__ << "  before IF 0x" << std::hex << b_off << "~" << b_len
> > > > >              << " - " << b->get_blob().get_ondisk_length()
> > > > >              << " - " << b->get_blob().is_unused(b_off, b_len)
> > > > >              << " - " << b->get_blob().is_allocated(b_off, b_len) << " - " << dendl;
> > > > >     if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
> > > > >         b->get_blob().get_ondisk_length() >= b_off + b_len &&
> > > > >         b->get_blob().is_unused(b_off, b_len) &&
> > > > >         b->get_blob().is_allocated(b_off, b_len)) {
> > > > > 
> > > > > And the output was
> > > > > 
> > > > > _do_write_small  before IF 0xf000~1000 - 10000 - 0 - 1 -
> > > > > 
> > > > > So the b->get_blob().is_unused(b_off, b_len) call is returning false,
> > > > > which stops it from going into the IF block and doing the prefer wal
> > > > > write. Is that expected?
> > > > > 
> > > > > Nick
> > > > > 
> > > > > > Nick
> > > > > > 
> > > > > > > 
> > > > > > > Mark
> > > > > > > 
> > > > > > > > 
> > > > > > > > And that's for a 4kb random write on an RBD that has been fully
> > > > > > > > conditioned via fio 4MB writes.
> > > > > > > > 
> > > > > > > > So it's something in the sequential nature of the test which trips it
> > > > > > > > up, as I don't see any sign of deferred writes.
> > > > > > > > 
> > > > > > > > From looking at the code in the small write function, I can only see
> > > > > > > > two possibilities that would stop it from reaching the prefer wal IF
> > > > > > > > statement:
> > > > > > > > 
> > > > > > > >   while (ep != o->extent_map.extent_map.end()) {
> > > > > > > >     if (ep->blob_start() >= end) {
> > > > > > > >       break;
> > > > > > > > 
> > > > > > > > and
> > > > > > > > 
> > > > > > > >   if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
> > > > > > > > 	b->get_blob().get_ondisk_length() >= b_off + b_len &&
> > > > > > > > 	b->get_blob().is_unused(b_off, b_len) &&
> > > > > > > > 	b->get_blob().is_allocated(b_off, b_len)) {
> > > > > > > > 
> > > > > > > > Unfortunately, that's about as far as I have got before my knowledge
> > > > > > > > of Bluestore becomes a limitation.
> > > > > > > > 
> > > > > > > > Don't know if that triggers any thoughts?
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 2017-02-02 10:39:31.988782 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) queue_transactions
> > > > > > > > > existing
> > > > > > > > > 0x56102a5a4900
> > > > > > > > > osr(9.73 0x56102a5db1a0)
> > > > > > > > > 2017-02-02 10:39:31.988786 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_create osr
> > > > > > > > > 0x56102a5a4900
> > > > > > > > > =
> > > > > > > > > 0x56102ab4be40 seq 260
> > > > > > > > > 2017-02-02 10:39:31.988797 7ff7282ff700 15
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2
> > > > > > > > > key
> > > > > > > s
> > > > > > > > > 2017-02-02 10:39:31.988802 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2
> > > > > > > > > key
> > > > > > > s
> > > > > > > > > = 0
> > > > > > > > > 2017-02-02 10:39:31.988806 7ff7282ff700 15
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
> > > > > > > 0xfb000~
> > > > > > > > > 1000
> > > > > > > > > 2017-02-02 10:39:31.988809 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
> > > > > > > > > 0xfb000~1000 - have 0x400000 (4194304) bytes fadvise_flags 0x0
> > > > > > > > > 2017-02-02 10:39:31.988812 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write prefer
> > > > > > > > > csum_order 12
> > > > > > > > > target_blob_size
> > > > > > > > > 0x80000
> > > > > > > > > 2017-02-02 10:39:31.988814 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small
> > > > > > > > > 0xfb000~1000
> > > > > > > > > 2017-02-02 10:39:31.988816 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small
> > > > > > > > > considering
> > > > > > > > > Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
> > > > > > > > > crc32c/0x1000) ref_map(0x0~80000=1)
> > > > > > > > > SharedBlob(0x56102aa76260))
> > > > > > > > > bstart 0x80000
> > > > > > > > > 2017-02-02 10:39:31.988821 7ff7282ff700 20
> > > > > > > > > bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0)
> > > > > > > > > _discard
> > > > > > > > > 0x7b000~1000
> > > > > > > > > 2017-02-02 10:39:31.988827 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  wal write
> > > > > > > > > 0x7b000~1000 of mutable Blob(0x56102aac05a0
> > > > > > > blob([0x256480000~80000]
> > > > > > > > > mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
> > > > > > > > > SharedBlob(0x56102aa76260)) at [0x2564fb000~1000]
> > > > > > > > > 2017-02-02 10:39:31.988832 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  lex
> > > > > > > > > 0xfb000~1000: 0x7b000~1000
> > > > > > > > > Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
> > > > > > > > > crc32c/0x1000)
> > > > > > > > > ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
> > > > > > > > > SharedBlob(0x56102aa76260))
> > > > > > > > > 2017-02-02 10:39:31.988836 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_alloc_write txc
> > > > > > > > > 0x56102ab4be40 0 blobs
> > > > > > > > > 2017-02-02 10:39:31.988837 7ff7282ff700 10 bitmapalloc:reserve
> > > > > > > > > instance 94627427524464 num_used 345738 total 7627973
> > > > > > > > > 2017-02-02 10:39:31.988840 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _wctx_finish lex_old
> > > > > > > > > 0xfb000~1000: 0x7b000~1000
> > > > > > > > > Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
> > > > > > > > > crc32c/0x1000)
> > > > > > > > > ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
> > > > > > > > > SharedBlob(0x56102aa76260))
> > > > > > > > > 2017-02-02 10:39:31.988844 7ff7282ff700 20
> > > > > > > > > bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
> > > > > > > 0xfb000~1000
> > > > > > > > > next shard
> > > > > > > > > 0x100000 merging 0x80000~7b000: 0x0~7b000 Blob(0x56102aac05a0
> > > > > > > > > blob([0x256480000~80000] mutable+csum crc32c/0x1000)
> > > > > > > > > ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and
> > > > > > > > > 0xfb000~1000:
> > > > > > > > > 0x7b000~1000 Blob(0x56102aac05a0 blob([0x256480000~80000]
> > > > > > > > > mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
> > > > > > > > > SharedBlob(0x56102aa76260))
> > > > > > > > > 2017-02-02 10:39:31.988850 7ff7282ff700 20
> > > > > > > > > bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
> > > > > > > 0xfb000~1000
> > > > > > > > > next shard
> > > > > > > > > 0x100000 merging 0x80000~7c000: 0x0~7c000 Blob(0x56102aac05a0
> > > > > > > > > blob([0x256480000~80000] mutable+csum crc32c/0x1000)
> > > > > > > > > ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and
> > > > > > > > > 0xfc000~4000:
> > > > > > > > > 0x7c000~4000 Blob(0x56102aac05a0 blob([0x256480000~80000]
> > > > > > > > > mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
> > > > > > > > > SharedBlob(0x56102aa76260))
> > > > > > > > > 2017-02-02 10:39:31.988868 7ff7282ff700 20
> > > > > > > > > bluestore.extentmap(0x56102a8ee3f0) dirty_range mark shard
> > > > > > > > > 0x80000
> > > > > > > > > dirty
> > > > > > > > > 2017-02-02 10:39:31.988870 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
> > > > > > > > > 0xfb000~1000 = 0
> > > > > > > > > 2017-02-02 10:39:31.988875 7ff7282ff700 15
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
> > > > > > > > > #9:ce000000::::head#
> > > > > > > > > 2017-02-02 10:39:31.988882 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
> > > > > > > > > #9:ce000000::::head# = 0
> > > > > > > > > 2017-02-02 10:39:31.988885 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_write_nodes txc
> > > > > > > > > 0x56102ab4be40 onodes
> > > > > > > > > 0x56102a8ee300 shared_blobs
> > > > > > > > > 2017-02-02 10:39:31.988894 7ff7282ff700 20
> > > > > > > > > bluestore.extentmap(0x56102a8ee3f0) update shard 0x80000 is
> > > > > > > > > 531
> > > > > > > > > bytes
> > > > > > > (was 531) from 1 extents
> > > > > > > > > 2017-02-02 10:39:31.988902 7ff7282ff700 20
> > > > > > > bluestore(/var/lib/ceph/osd/ceph-4)   onode
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
> > > > > > > > > is
> > > > > > > > > 424
> > > > > > > > > (422 bytes onode + 2 bytes spanning blobs + 0 bytes inline
> > > > > > > > > extents)
> > > > > > > > > 2017-02-02 10:39:31.988914 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
> > > > > > > > > 0x56102ab4be40 prepare
> > > > > > > > > 2017-02-02 10:39:31.988916 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_io
> > > > > > > > > 0x56102ab4be40
> > > > > > > > > 2017-02-02 10:39:31.988917 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
> > > > > > > > > 0x56102ab4be40 io_done
> > > > > > > > > 2017-02-02 10:39:31.988918 7ff7282ff700 20
> > > > > > > > > bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0)
> > > > > > > > > finish_write
> > > > > > > > > buffer(0x56102adb1830 space 0x56102aa762b8 0x7b000~1000
> > > > > > > > > writing
> > > > > > > > > nocache)
> > > > > > > > > 2017-02-02 10:39:31.988939 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread wake
> > > > > > > > > 2017-02-02 10:39:31.988941 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committing
> > > > > > > > > 1
> > > > > > > > > submitting 1 cleaning 0
> > > > > > > > > 2017-02-02 10:39:31.988952 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_finalize_kv txc
> > > > > > > > > 0x56102ab4be40 allocated 0x[] released 0x[]
> > > > > > > > > 2017-02-02 10:39:31.988999 7ff72e30b700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _balance_bluefs_freespace
> > > > > > > > > bluefs
> > > > > > > > > 9535 M free
> > > > > > > > > (0.999896) bluestore 222 G free (0.954675), bluefs_ratio
> > > > > > > > > 0.0402143
> > > > > > > > > 2017-02-02 10:39:31.989400 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committed
> > > > > > > > > 1
> > > > > > > > > cleaned 0 in 0.000456
> > > > > > > > > 2017-02-02 10:39:31.989405 7ff72e30b700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
> > > > > > > > > 0x56102ab4be40 kv_submitted
> > > > > > > > > 2017-02-02 10:39:31.989407 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_kv txc
> > > > > > > > > 0x56102ab4be40
> > > > > > > > > 2017-02-02 10:39:31.989413 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _wal_apply txc
> > > > > > > > > 0x56102ab4be40
> > > > > > > > > seq
> > > > > > > > > 255
> > > > > > > > > 2017-02-02 10:39:31.989414 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_wal_op write
> > > > > > > > > [0x2564fb000~1000]
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > sage
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > sage
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > sage
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Nick
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > > > > > > > > > > > > > Sent: 01 February 2017 19:03
> > > > > > > > > > > > > > To: 'Sage Weil' <sweil@xxxxxxxxxx>; 'Mark Nelson'
> > > > > > > > > > > > > <mnelson@xxxxxxxxxx>
> > > > > > > > > > > > > > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > > > > > > > > > > > > Subject: RE: bluestore prefer wal size
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Hi Sage,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > First results are not looking good. It looks like write IO to the
> > > > > > > > > > > > > > SSDs (sdd and sdi) is now massively amplified, by somewhere in the
> > > > > > > > > > > > > > region of 10x, but I'm still only getting around 100 4kB sequential
> > > > > > > > > > > > > > write IOPS from the fio client. That is in comparison to 2500-3000
> > > > > > > > > > > > > > IOPS on the SSDs and ~200 IOPS per spinning disk (sdc, sde, sdg, sdh).
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > rbd engine: RBD version: 0.1.11
> > > > > > > > > > > > > > Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/476KB/0KB /s] [0/119/0 iops] [eta 00m:00s]
> > > > > > > > > > > > > > rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31171: Wed Feb 1 18:56:51 2017
> > > > > > > > > > > > > >   write: io=27836KB, bw=475020B/s, iops=115, runt= 60006msec
> > > > > > > > > > > > > >     slat (usec): min=6, max=142, avg=10.98, stdev= 8.38
> > > > > > > > > > > > > >     clat (msec): min=1, max=271, avg= 8.61, stdev= 3.63
> > > > > > > > > > > > > >      lat (msec): min=1, max=271, avg= 8.62, stdev= 3.63
> > > > > > > > > > > > > >     clat percentiles (msec):
> > > > > > > > > > > > > >      |  1.00th=[    8],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
> > > > > > > > > > > > > >      | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
> > > > > > > > > > > > > >      | 70.00th=[    9], 80.00th=[    9], 90.00th=[    9], 95.00th=[    9],
> > > > > > > > > > > > > >      | 99.00th=[   17], 99.50th=[   25], 99.90th=[   33], 99.95th=[   36],
> > > > > > > > > > > > > >      | 99.99th=[  273]
> > > > > > > > > > > > > >     bw (KB  /s): min=  191, max=  480, per=100.00%, avg=464.18, stdev=34.60
> > > > > > > > > > > > > >     lat (msec) : 2=0.01%, 4=0.04%, 10=97.69%, 20=1.55%, 50=0.69%
> > > > > > > > > > > > > >     lat (msec) : 500=0.01%
> > > > > > > > > > > > > >   cpu          : usr=0.13%, sys=0.12%, ctx=7224, majf=0, minf=5
> > > > > > > > > > > > > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > > > > > > > > > > > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > > > > > > > > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > > > > > > > > >      issued    : total=r=0/w=6959/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> > > > > > > > > > > > > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I've checked via the admin socket and the prefer wal option is set to
> > > > > > > > > > > > > > 8192.
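> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > (That check is just the usual admin socket query, something like
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >     ceph daemon osd.<id> config get bluestore_prefer_wal_size
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > run on the OSD host, with the osd id filled in.)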
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Random capture of iostat
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > > > > > > > > > > > > > sda               0.00    77.00    0.00    5.50     0.00   338.00   122.91     0.05   11.64    0.00   11.64   8.73   4.80
> > > > > > > > > > > > > > sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > > > > > > > > > > > > > sdc               0.00     0.00    0.00  151.50     0.00   302.00     3.99     0.54    3.59    0.00    3.59   3.58  54.20
> > > > > > > > > > > > > > sdd               0.00     0.00    0.00 1474.00     0.00  4008.00     5.44     0.09    0.06    0.00    0.06   0.06   9.20
> > > > > > > > > > > > > > sde               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.91    4.20    0.00    4.20   4.20  91.20
> > > > > > > > > > > > > > sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > > > > > > > > > > > > > sdg               0.00     0.00    0.00   66.50     0.00   134.00     4.03     0.18    2.68    0.00    2.68   2.68  17.80
> > > > > > > > > > > > > > sdh               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.80    3.71    0.00    3.71   3.71  80.40
> > > > > > > > > > > > > > sdi               0.00     0.00    0.00 1134.00     0.00  3082.00     5.44     0.09    0.08    0.00    0.08   0.07   8.40
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > > > > > > > > > > > > > Sent: 01 February 2017 15:34
> > > > > > > > > > > > > > To: nick@xxxxxxxxxx
> > > > > > > > > > > > > > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > > > > > > > > > > > > Subject: bluestore prefer wal size
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Hey Nick,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I've updated/improved the prefer wal size PR (that
> > > > > > > > > > > > > > sends
> > > > > > > > > > > > > > small writes through the wal).  See
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 	https://github.com/ceph/ceph/pull/13217
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > if you want to try it out.
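> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > (Something like
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >     git fetch https://github.com/ceph/ceph pull/13217/head
> > > > > > > > > > > > > >     git checkout FETCH_HEAD
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > should pull the PR down without needing to add a remote.)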
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > > sage
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > 
> > > > > 
> > > 
> > > 
> 
> 
