Hi Nick,

On 02/02/2017 08:46 AM, Nick Fisk wrote:
Further update: if I do random writes instead of sequential, I do see the "defering small write via wal" action, but it comes from the _do_alloc_write function rather than the small write path. I.e. the log entry is:

_do_alloc_write defering small 0x1000 write via wal
I just ran some tests using the branch with 4k random and sequential writes against a single OSD. I see "deferring small write via wal" from _do_alloc_write for both sequential and random writes when bluestore_prefer_wal_size is set to be larger than the IO size. Otherwise, I don't see the messages. I am also preconditioning with 4MB writes.
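In other words, what I'm seeing is consistent with a check along these lines (a rough sketch with made-up names, not the actual code from the branch):

    // Rough sketch, not the actual code from the branch: my reading of the
    // check added on the allocating-write path (_do_alloc_write). Writes
    // smaller than bluestore_prefer_wal_size get sent through the WAL
    // ("defering small ... write via wal"); otherwise no message appears.
    #include <cstdint>

    bool defer_new_write_via_wal(uint64_t write_len,        // e.g. 0x1000 for a 4k IO
                                 uint64_t prefer_wal_size)  // bluestore_prefer_wal_size
    {
      return write_len < prefer_wal_size;
    }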
So far I've just been testing on NVMe, but I should be able to do some mixed tests soon to see if there are any interesting performance implications.
Mark
And that's for a 4kb random write on an RBD that has been fully conditioned via fio 4MB writes. So it's something in the sequential nature of the test which trips it up, as I don't see any sign of deferred writes in that case.

From looking at the code in the small write function, I can only see two possibilities that would stop it from reaching the prefer wal IF statement:

    while (ep != o->extent_map.extent_map.end()) {
      if (ep->blob_start() >= end) {
        break;

and

    if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
        b->get_blob().get_ondisk_length() >= b_off + b_len &&
        b->get_blob().is_unused(b_off, b_len) &&
        b->get_blob().is_allocated(b_off, b_len)) {

Unfortunately, that's about as far as I have got before my knowledge of Bluestore becomes a limitation. Don't know if that triggers any thoughts?
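To make sure I'm reading it correctly, this is roughly the flow as I understand it. It's just a sketch with made-up names, not the real BlueStore code; the point is that either of the two branches above is taken before the prefer wal check can fire:

    // Sketch with made-up names, not the real BlueStore code: my reading of
    // _do_write_small. The prefer wal deferral only happens if neither of the
    // two branches quoted above fires first.
    #include <cstdint>

    enum class SmallWriteAction { DirectToUnused, DeferViaWal, FallThrough };

    SmallWriteAction classify_small_write(
        bool blob_starts_past_end,   // ep->blob_start() >= end
        bool chunk_aligned,          // b_off and b_len are chunk_size aligned
        bool unused_and_allocated,   // ondisk length covers it, is_unused, is_allocated
        uint64_t b_len,              // length of this small write
        uint64_t prefer_wal_size)    // bluestore_prefer_wal_size
    {
      if (blob_starts_past_end) {
        // possibility 1: the loop over the extent map breaks out early
        return SmallWriteAction::FallThrough;
      }
      if (chunk_aligned && unused_and_allocated) {
        // possibility 2: the write goes straight into unused, already-allocated space
        return SmallWriteAction::DirectToUnused;
      }
      // otherwise a write smaller than prefer_wal_size should be deferred via the WAL
      if (b_len < prefer_wal_size) {
        return SmallWriteAction::DeferViaWal;
      }
      return SmallWriteAction::FallThrough;
    }

The full debug log from one of the writes follows.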
2017-02-02 10:39:31.988782 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) queue_transactions existing 0x56102a5a4900 osr(9.73 0x56102a5db1a0)
2017-02-02 10:39:31.988786 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _txc_create osr 0x56102a5a4900 = 0x56102ab4be40 seq 260
2017-02-02 10:39:31.988797 7ff7282ff700 15 bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 keys
2017-02-02 10:39:31.988802 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 keys = 0
2017-02-02 10:39:31.988806 7ff7282ff700 15 bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 0xfb000~1000
2017-02-02 10:39:31.988809 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_write #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 0xfb000~1000 - have 0x400000 (4194304) bytes fadvise_flags 0x0
2017-02-02 10:39:31.988812 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_write prefer csum_order 12 target_blob_size 0x80000
2017-02-02 10:39:31.988814 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small 0xfb000~1000
2017-02-02 10:39:31.988816 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small considering Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) bstart 0x80000
2017-02-02 10:39:31.988821 7ff7282ff700 20 bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0) _discard 0x7b000~1000
2017-02-02 10:39:31.988827 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small wal write 0x7b000~1000 of mutable Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) at [0x2564fb000~1000]
2017-02-02 10:39:31.988832 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small lex 0xfb000~1000: 0x7b000~1000 Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1) SharedBlob(0x56102aa76260))
2017-02-02 10:39:31.988836 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_alloc_write txc 0x56102ab4be40 0 blobs
2017-02-02 10:39:31.988837 7ff7282ff700 10 bitmapalloc:reserve instance 94627427524464 num_used 345738 total 7627973
2017-02-02 10:39:31.988840 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _wctx_finish lex_old 0xfb000~1000: 0x7b000~1000 Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1) SharedBlob(0x56102aa76260))
2017-02-02 10:39:31.988844 7ff7282ff700 20 bluestore.extentmap(0x56102a8ee3f0) compress_extent_map 0xfb000~1000 next shard 0x100000 merging 0x80000~7b000: 0x0~7b000 Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfb000~1000: 0x7b000~1000 Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260))
2017-02-02 10:39:31.988850 7ff7282ff700 20 bluestore.extentmap(0x56102a8ee3f0) compress_extent_map 0xfb000~1000 next shard 0x100000 merging 0x80000~7c000: 0x0~7c000 Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfc000~4000: 0x7c000~4000 Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260))
2017-02-02 10:39:31.988868 7ff7282ff700 20 bluestore.extentmap(0x56102a8ee3f0) dirty_range mark shard 0x80000 dirty
2017-02-02 10:39:31.988870 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 0xfb000~1000 = 0
2017-02-02 10:39:31.988875 7ff7282ff700 15 bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head #9:ce000000::::head#
2017-02-02 10:39:31.988882 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head #9:ce000000::::head# = 0
2017-02-02 10:39:31.988885 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _txc_write_nodes txc 0x56102ab4be40 onodes 0x56102a8ee300 shared_blobs
2017-02-02 10:39:31.988894 7ff7282ff700 20 bluestore.extentmap(0x56102a8ee3f0) update shard 0x80000 is 531 bytes (was 531) from 1 extents
2017-02-02 10:39:31.988902 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) onode #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# is 424 (422 bytes onode + 2 bytes spanning blobs + 0 bytes inline extents)
2017-02-02 10:39:31.988914 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc 0x56102ab4be40 prepare
2017-02-02 10:39:31.988916 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_io 0x56102ab4be40
2017-02-02 10:39:31.988917 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc 0x56102ab4be40 io_done
2017-02-02 10:39:31.988918 7ff7282ff700 20 bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0) finish_write buffer(0x56102adb1830 space 0x56102aa762b8 0x7b000~1000 writing nocache)
2017-02-02 10:39:31.988939 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread wake
2017-02-02 10:39:31.988941 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committing 1 submitting 1 cleaning 0
2017-02-02 10:39:31.988952 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _txc_finalize_kv txc 0x56102ab4be40 allocated 0x[] released 0x[]
2017-02-02 10:39:31.988999 7ff72e30b700 10 bluestore(/var/lib/ceph/osd/ceph-4) _balance_bluefs_freespace bluefs 9535 M free (0.999896) bluestore 222 G free (0.954675), bluefs_ratio 0.0402143
2017-02-02 10:39:31.989400 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committed 1 cleaned 0 in 0.000456
2017-02-02 10:39:31.989405 7ff72e30b700 10 bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc 0x56102ab4be40 kv_submitted
2017-02-02 10:39:31.989407 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_kv txc 0x56102ab4be40
2017-02-02 10:39:31.989413 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _wal_apply txc 0x56102ab4be40 seq 255
2017-02-02 10:39:31.989414 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_wal_op write [0x2564fb000~1000]

Nick

-----Original Message-----
From: Nick Fisk [mailto:nick@xxxxxxxxxx]
Sent: 01 February 2017 19:03
To: 'Sage Weil' <sweil@xxxxxxxxxx>; 'Mark Nelson' <mnelson@xxxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: RE: bluestore prefer wal size

Hi Sage,

First results not looking good. It looks like write IO to the SSDs (sdd and sdi) is now massively amplified, by somewhere in the region of 10x, but I'm still only getting around 100 4kb seq write iops from the fio client. This is in comparison to 2500-3000 iops on the SSDs and ~200 iops per spinning disk (sdc, sde, sdg, sdh).

rbd engine: RBD version: 0.1.11
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/476KB/0KB /s] [0/119/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31171: Wed Feb  1 18:56:51 2017
  write: io=27836KB, bw=475020B/s, iops=115, runt= 60006msec
    slat (usec): min=6, max=142, avg=10.98, stdev= 8.38
    clat (msec): min=1, max=271, avg= 8.61, stdev= 3.63
     lat (msec): min=1, max=271, avg= 8.62, stdev= 3.63
    clat percentiles (msec):
     |  1.00th=[    8],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
     | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
     | 70.00th=[    9], 80.00th=[    9], 90.00th=[    9], 95.00th=[    9],
     | 99.00th=[   17], 99.50th=[   25], 99.90th=[   33], 99.95th=[   36],
     | 99.99th=[  273]
    bw (KB  /s): min=  191, max=  480, per=100.00%, avg=464.18, stdev=34.60
    lat (msec) : 2=0.01%, 4=0.04%, 10=97.69%, 20=1.55%, 50=0.69%
    lat (msec) : 500=0.01%
  cpu          : usr=0.13%, sys=0.12%, ctx=7224, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=6959/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

I've checked via the admin socket and the prefer wal option is set to 8192.

Random capture of iostat:

Device:  rrqm/s  wrqm/s    r/s      w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.00   77.00   0.00     5.50    0.00   338.00   122.91     0.05  11.64    0.00   11.64   8.73   4.80
sdb        0.00    0.00   0.00     0.00    0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
sdc        0.00    0.00   0.00   151.50    0.00   302.00     3.99     0.54   3.59    0.00    3.59   3.58  54.20
sdd        0.00    0.00   0.00  1474.00    0.00  4008.00     5.44     0.09   0.06    0.00    0.06   0.06   9.20
sde        0.00    0.00   0.00   217.00    0.00   434.00     4.00     0.91   4.20    0.00    4.20   4.20  91.20
sdf        0.00    0.00   0.00     0.00    0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
sdg        0.00    0.00   0.00    66.50    0.00   134.00     4.03     0.18   2.68    0.00    2.68   2.68  17.80
sdh        0.00    0.00   0.00   217.00    0.00   434.00     4.00     0.80   3.71    0.00    3.71   3.71  80.40
sdi        0.00    0.00   0.00  1134.00    0.00  3082.00     5.44     0.09   0.08    0.00    0.08   0.07   8.40

-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx]
Sent: 01 February 2017 15:34
To: nick@xxxxxxxxxx
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: bluestore prefer wal size

Hey Nick,

I've updated/improved the prefer wal size PR (that sends small writes through the wal). See https://github.com/ceph/ceph/pull/13217 if you want to try it out.

Thanks!
sage