On Mon, Aug 22, 2016 at 11:10 PM, Igor Fedotov <ifedotov@xxxxxxxxxxxx> wrote:
>
> Will prepare shortly. Any suggestions on desired levels and components?

Hmm, I also hit this problem recently while comparing WAL performance
between FileJournal and RocksDB. In the normal case RocksDB reuses a
recycled log file once enough data has been written, as Sage mentioned.
But I found it strange that under write stress the BlueFS inode metadata
updates are not discarded: the RocksDB WAL log file keeps growing while
hundreds of log records are written, and I have seen the BlueFS WAL log
file's inode size exceed 500MB.

I stopped at looking into DBImpl::SwitchMemtable. Maybe you can check
whether in your case the recycled log isn't being switched?

>
>
>
> On 22.08.2016 18:08, Sage Weil wrote:
>>
>> On Mon, 22 Aug 2016, Igor Fedotov wrote:
>>>
>>> Hi All,
>>>
>>> While testing BlueStore as standalone storage via the FIO plugin I'm
>>> observing huge traffic to the WAL device.
>>>
>>> BlueStore is configured to use two 450 GB Intel SSDs: INTEL SSDSC2BX480G4L.
>>>
>>> The first SSD is split into two partitions (200 & 250 GB) for Block DB
>>> and Block WAL.
>>>
>>> The second is split similarly, and its first 200 GB partition is
>>> allocated for raw Block data.
>>>
>>> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning'
>>> mail. Not much difference compared to the default settings though...
>>>
>>> As a result, when doing 4k sequential writes (8 GB total) to a fresh
>>> store I'm observing (using nmon and other disk monitoring tools)
>>> significant write traffic to the WAL device, and it grows over time
>>> from ~10 MB/s to ~170 MB/s. Raw Block device traffic is pretty stable
>>> at ~30 MB/s.
>>>
>>> Additionally I inserted an output of the BlueFS perf counters on
>>> umount (l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>>>
>>> The resulting values are very frustrating: ~28 GB and 4 GB for
>>> l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
>>
>> Yeah, this doesn't seem right.  Have you generated a log to see what is
>> actually happening on each write?  I don't have any bright ideas about
>> what is going wrong here.
>>
>> sage
>>
>>> Doing 64K writes changes the picture dramatically:
>>>
>>> WAL traffic is stable at 10-12 MB/s and raw Block traffic is at ~400 MB/s.
>>> The BlueFS counters are ~140 MB and 1 KB respectively.
>>>
>>> Surely the write completes much faster in the second case.
>>>
>>> No WAL is reported in the logs at the BlueStore level in either case.
>>>
>>> High BlueFS WAL traffic is also observed when running subsequent random
>>> 4K RW over a store populated this way.
>>>
>>> I'm wondering why the WAL device is involved in the process at all
>>> (writes happen in min_alloc_size blocks) and why the traffic and the
>>> volume of written data are so high?
>>>
>>> Don't we have some fault affecting 4K performance here?
>>>
>>>
>>> Here are my settings and FIO job specification:
>>>
>>> ###########################
>>>
>>> [global]
>>> debug bluestore = 0/0
>>> debug bluefs = 1/0
>>> debug bdev = 0/0
>>> debug rocksdb = 0/0
>>>
>>> # spread objects over 8 collections
>>> osd pool default pg num = 32
>>> log to stderr = false
>>>
>>> [osd]
>>> osd objectstore = bluestore
>>> bluestore_block_create = true
>>> bluestore_block_db_create = true
>>> bluestore_block_wal_create = true
>>> bluestore_min_alloc_size = 4096
>>> #bluestore_max_alloc_size = #or 4096
>>> bluestore_fsck_on_mount = false
>>>
>>> bluestore_block_path=/dev/sdi1
>>> bluestore_block_db_path=/dev/sde1
>>> bluestore_block_wal_path=/dev/sde2
>>>
>>> enable experimental unrecoverable data corrupting features = bluestore rocksdb memdb
>>>
>>> bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>>
>>> rocksdb_cache_size = 4294967296
>>> bluestore_csum = false
>>> bluestore_csum_type = none
>>> bluestore_bluefs_buffered_io = false
>>> bluestore_max_ops = 30000
>>> bluestore_max_bytes = 629145600
>>> bluestore_buffer_cache_size = 104857600
>>> bluestore_block_wal_size = 0
>>>
>>> # use directory= option from fio job file
>>> osd data = ${fio_dir}
>>>
>>> # log inside fio_dir
>>> log file = ${fio_dir}/log
>>> ####################################
>>>
>>> #FIO jobs
>>> #################
>>> # Runs a 4k random write test against the ceph BlueStore.
>>> [global]
>>> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your LD_LIBRARY_PATH
>>>
>>> conf=ceph-bluestore-somnath.conf # must point to a valid ceph configuration file
>>> directory=./fio-bluestore # directory for osd_data
>>>
>>> rw=write
>>> iodepth=16
>>> size=256m
>>>
>>> [bluestore]
>>> nr_files=63
>>> bs=4k # or 64k
>>> numjobs=32
>>> #############
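
One more thing you could try to narrow this down: a rough, untested sketch
on my side (the /tmp path, key names and write count are made up, not part
of the fio plugin) is to open a standalone RocksDB instance with the same
WAL-related options as in bluestore_rocksdb_options above, push 4k values
through it, and watch whether the *.log files get reused (recycled) or keep
piling up:

// standalone check of WAL log recycling with the options used above
#include <cassert>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.recycle_log_file_num = 16;        // same as bluestore_rocksdb_options
  options.write_buffer_size = 83886080;     // 80 MB memtable before switch
  options.max_write_buffer_number = 16;
  options.min_write_buffer_number_to_merge = 2;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb-wal-test", &db);
  assert(s.ok());

  // 4 KB values, similar to the 4k fio case; once a memtable switch
  // happens the WAL should roll over and start reusing recycled logs.
  std::string value(4096, 'x');
  for (int i = 0; i < 200000; ++i) {
    s = db->Put(rocksdb::WriteOptions(), "key-" + std::to_string(i), value);
    assert(s.ok());
  }
  delete db;
  return 0;
}

If the .log files are recycled there but the BlueFS-backed WAL file keeps
growing under the same options, that would point at the BlueFS side (inode
metadata handling) rather than at the memtable switch in
DBImpl::SwitchMemtable.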