On Mon, Aug 22, 2016 at 11:10 PM, Igor Fedotov <ifedotov@xxxxxxxxxxxx> wrote:
>
> Will prepare shortly. Any suggestions on desired levels and components?

Hmm, I also hit this problem recently while comparing WAL performance
between FileJournal and RocksDB. In the normal case RocksDB reuses a
recycled log file once enough data has been written, as Sage mentioned.
But I found it strange that under write stress the BlueFS inode metadata
updates are not discarded: the RocksDB WAL log file keeps growing while
hundreds of log records are written, and I have seen the BlueFS WAL log
file's inode size exceed 500MB.

I stopped at looking into DBImpl::SwitchMemtable. Maybe you can check
whether in your case the recycled log isn't being switched?

>
>
>
> On 22.08.2016 18:08, Sage Weil wrote:
>>
>> On Mon, 22 Aug 2016, Igor Fedotov wrote:
>>>
>>> Hi All,
>>>
>>> While testing BlueStore as standalone storage via the FIO plugin I'm
>>> observing huge traffic to the WAL device.
>>>
>>> BlueStore is configured to use two 450 GB Intel SSDs: INTEL SSDSC2BX480G4L.
>>>
>>> The first SSD is split into two partitions (200 & 250 GB) for Block DB
>>> and Block WAL.
>>>
>>> The second is split similarly, and its first 200 GB partition is
>>> allocated for raw Block data.
>>>
>>> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning'
>>> mail. Not much difference compared to the default settings though...
>>>
>>> As a result, when doing 4k sequential writes (8 GB total) to a fresh
>>> store I'm observing (using nmon and other disk monitoring tools)
>>> significant write traffic to the WAL device, and it grows over time
>>> from ~10 MB/s to ~170 MB/s. Raw Block device traffic is pretty stable
>>> at ~30 MB/s.
>>>
>>> Additionally I inserted an output of the BlueFS perf counters on
>>> umount (l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>>>
>>> The resulting values are very frustrating: ~28 GB and 4 GB for
>>> l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
>>
>> Yeah, this doesn't seem right.  Have you generated a log to see what is
>> actually happening on each write?  I don't have any bright ideas about
>> what is going wrong here.
>>
>> sage
>>
>>> Doing 64K writes changes the picture dramatically:
>>>
>>> WAL traffic is stable at 10-12 MB/s and raw Block traffic is at ~400 MB/s.
>>> The BlueFS counters are ~140 MB and 1 KB respectively.
>>>
>>> Surely the write completes much faster in the second case.
>>>
>>> No WAL is reported in the logs at the BlueStore level in either case.
>>>
>>> High BlueFS WAL traffic is also observed when running subsequent random
>>> 4K RW over a store populated this way.
>>>
>>> I'm wondering why the WAL device is involved in the process at all
>>> (writes happen in min_alloc_size blocks) and why the traffic and the
>>> volume of written data are so high?
>>>
>>> Don't we have some fault affecting 4K performance here?
>>>
>>>
>>> Here are my settings and FIO job specification:
>>>
>>> ###########################
>>>
>>> [global]
>>> debug bluestore = 0/0
>>> debug bluefs = 1/0
>>> debug bdev = 0/0
>>> debug rocksdb = 0/0
>>>
>>> # spread objects over 8 collections
>>> osd pool default pg num = 32
>>> log to stderr = false
>>>
>>> [osd]
>>> osd objectstore = bluestore
>>> bluestore_block_create = true
>>> bluestore_block_db_create = true
>>> bluestore_block_wal_create = true
>>> bluestore_min_alloc_size = 4096
>>> #bluestore_max_alloc_size = #or 4096
>>> bluestore_fsck_on_mount = false
>>>
>>> bluestore_block_path=/dev/sdi1
>>> bluestore_block_db_path=/dev/sde1
>>> bluestore_block_wal_path=/dev/sde2
>>>
>>> enable experimental unrecoverable data corrupting features = bluestore rocksdb memdb
>>>
>>> bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>>
>>> rocksdb_cache_size = 4294967296
>>> bluestore_csum = false
>>> bluestore_csum_type = none
>>> bluestore_bluefs_buffered_io = false
>>> bluestore_max_ops = 30000
>>> bluestore_max_bytes = 629145600
>>> bluestore_buffer_cache_size = 104857600
>>> bluestore_block_wal_size = 0
>>>
>>> # use directory= option from fio job file
>>> osd data = ${fio_dir}
>>>
>>> # log inside fio_dir
>>> log file = ${fio_dir}/log
>>> ####################################
>>>
>>> #FIO jobs
>>> #################
>>> # Runs a 4k random write test against the ceph BlueStore.
>>> [global]
>>> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your LD_LIBRARY_PATH
>>>
>>> conf=ceph-bluestore-somnath.conf # must point to a valid ceph configuration file
>>> directory=./fio-bluestore # directory for osd_data
>>>
>>> rw=write
>>> iodepth=16
>>> size=256m
>>>
>>> [bluestore]
>>> nr_files=63
>>> bs=4k # or 64k
>>> numjobs=32
>>> #############
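
One more thing you could try to narrow this down: a rough, untested sketch
on my side (the /tmp path, key names and write count are made up, not part
of the fio plugin) is to open a standalone RocksDB instance with the same
WAL-related options as in bluestore_rocksdb_options above, push 4k values
through it, and watch whether the *.log files get reused (recycled) or keep
piling up:

// standalone check of WAL log recycling with the options used above
#include <cassert>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.recycle_log_file_num = 16;        // same as bluestore_rocksdb_options
  options.write_buffer_size = 83886080;     // 80 MB memtable before switch
  options.max_write_buffer_number = 16;
  options.min_write_buffer_number_to_merge = 2;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb-wal-test", &db);
  assert(s.ok());

  // 4 KB values, similar to the 4k fio case; once a memtable switch
  // happens the WAL should roll over and start reusing recycled logs.
  std::string value(4096, 'x');
  for (int i = 0; i < 200000; ++i) {
    s = db->Put(rocksdb::WriteOptions(), "key-" + std::to_string(i), value);
    assert(s.ok());
  }
  delete db;
  return 0;
}

If the .log files are recycled there but the BlueFS-backed WAL file keeps
growing under the same options, that would point at the BlueFS side (inode
metadata handling) rather than at the memtable switch in
DBImpl::SwitchMemtable.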