Re: storing pg logs outside of rocksdb

xiaoyan li <wisher2003@xxxxxxxxx> · Thu, 21 Jun 2018 11:30:34 +0800

On Thu, Jun 21, 2018 at 6:46 AM, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
> Hi Lisa,
>
> I gave your branch a whirl.  On the first run I tried to allocate too many
> PGs and it ran out of space and asserted. :D  We'll need to figure out a
> mechanism to allocate space that doesn't depend on a a hardcoded dev value.
Yes, the problem exists.
>
> Ok, now for the goods.  These were just fast 10 minute tests on a tiny RBD
> volume, so take the results with a big grain of salt.  I expect things to
> improve for pglog-split-fastinfo when there's more data in rocksdb though.
> Despite that, the results are interesting!  In pglog-split-fastinfo, rocksdb
> deals with far fewer keys and spends far less time in compaction, but indeed
> having a single WAL for everything means more coalescing of writes with the
> associated benefits (But man that compaction!).  I think on the P3700 you
> can use 512B sectors, so that would at least help with the write-amp but may
> not offer any performance benefits?
Ok, I will have test of 512B sector. And at the meanwhile, I did tests
with Intel P3700 400G.
I still can get (50.4-45.1)/45.1*100 = 11%.
Mark, from your tests, IOPS improves little.

My test environment:
CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Disk: Single Intel P3700 400G
15 10G RBD volumes, 1 OSD, 128 PGs,
iodepth=32

And test results:
pglog-split-fastinfo branch:
rbd0: (groupid=0, jobs=16): err= 0: pid=3563: Thu Jun 21 06:21:07 2018
  write: IOPS=45.1k, BW=176MiB/s (185MB/s)(51.7GiB/300057msec)
    slat (nsec): min=1040, max=961881, avg=4186.57, stdev=3802.90
    clat (msec): min=3, max=175, avg=11.34, stdev=12.97
     lat (msec): min=3, max=175, avg=11.34, stdev=12.97

master branch
rbd0: (groupid=0, jobs=16): err= 0: pid=52676: Thu Jun 21 05:38:08 2018
  write: IOPS=50.4k, BW=197MiB/s (206MB/s)(57.7GiB/300009msec)
    slat (nsec): min=1044, max=3379.9k, avg=4370.65, stdev=3955.84
    clat (msec): min=2, max=178, avg=10.16, stdev=10.41
     lat (msec): min=2, max=178, avg=10.16, stdev=10.41

My ceph.conf:

[global]
    fsid = 18677778-8289-11e7-b44b-a4bf0118d2ff
    pid_path = /var/run/ceph
    osd pool default size = 1
    auth_service_required = none
    auth_cluster_required = none
    auth_client_required = none
    osd_objectstore = bluestore
    mon allow pool delete = true
    debug bluestore = 0/0
    debug osd = 0/0
    debug ms = 0/0
        debug bluefs = 0/0
        debug bdev = 0/0
        debug rocksdb = 0/0
        osd pool default pg num = 64
        osd op num shards = 8
[mon]
    mon_data = /var/lib/ceph/mon.$id
[osd]
    osd_data = /var/lib/ceph/mnt/osd-device-$id-data
    osd_mkfs_type = xfs
    osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
[client]
    rbd_cache = false
[mon.sceph9]
    host = sceph9
    mon addr = 127.0.0.1
    log file = /opt/fio_test/osd/mon.log
[osd.0]
    host = sceph9
    public addr = 127.0.0.1
    cluster addr = 127.0.0.1
    devs = /dev/nvme1n1p3
    bluestore_block_path = /dev/nvme1n1p4
    bluestore_block_db_path = /dev/nvme1n1p2
    bluestore_block_wal_path = /dev/nvme1n1p1
    log file = /opt/fio_test/osd/osd.log
    bluestore_shard_finishers = true
    bdev_block_size = 4096

[global]

ioengine=rbd
clientname=admin
rw=randwrite
bs=4k
time_based=1
runtime=300s
iodepth=32
group_reporting

[rbd0]
pool=rbd
rbdname=rbd0

....

[rbd14]
pool=rbd
rbdname=rbd14

Lisa

>
> This looks like it would be amazing if you could do cache-line granularity
> writes!
>
>
> Test Setup
> ----------
> 4k fio/librbd randwrite 16QD for 600s
> 16GB RBD volume, 1 OSD, 128 PGs
> Single Intel 800GB P3700 NVMe
>
>
> Client
>
> fio Throughput
> ----------
> 65a74d4:              22.5K IOPS
> pglog-split-fastinfo: 22.9K IOPS
>
> fio 99.95th percentile latency
> ---------------------------
> 65a74d4:              25035 usec
> pglog-split-fastinfo:  9372 usec
>
>
> RocksDB
>
> RocksDB Compaction Event Count
> ------------------------------
> 65a74d4:              77
> pglog-split-fastinfo: 70
>
> RocksDB Compaction Total Time
> -----------------------------
> 65a74d4:              66.85s
> pglog-split-fastinfo: 16.39s
>
> RocksDB Compaction Total Input Records
> --------------------------------------
> 65a74d4:              70255125
> pglog-split-fastinfo: 13507190
>
> RocksDB Compaction Total Output Records
> ---------------------------------------
> 65a74d4:              34126131
> pglog-split-fastinfo:  4090954
>
>
> Collectl
>
> RockSDB WAL Device Write (AVG MB/s)
> -----------------------------------
> 65a74d4:              142.74
> pglog-split-fastinfo: 138.71
>
> RocksDB WAL Device Write (AVG OPS/s)
> ------------------------------------
> 65a74d4:              6275.37
> pglog-split-fastinfo: 7314.67
>
>
> RocksDB DB Device Write (AVG MB/s)
> ----------------------------------
> 65a74d4:              24.88
> pglog-split-fastinfo: 10.12
>
> RocksDB DB Device Write (AVG OPS/s)
> -----------------------------------
> 65a74d4:              200.05
> pglog-split-fastinfo:  81.81
>
>
> OSD Block Device Write (AVG MB/s)
> ---------------------------------
> 65a74d4:               87.20
> pglog-split-fastinfo: 177.79*
>
> OSD Block Device Write (AVG OPS/s)
> ----------------------------------
> 65a74d4:              22324.3
> pglog-split-fastinfo: 45514.0*
>
> * pglog writes are now hitting the block device instead of rocksdb log
>
>
> Wallclock Profile of bstore_kv_sync
> -----------------------------------
>
> 65a74d4:
> + 100.00% clone
>   + 100.00% start_thread
>     + 100.00% BlueStore::KVSyncThread::entry
>       + 100.00% BlueStore::_kv_sync_thread
>         + 50.00% RocksDBStore::submit_transaction_sync
>         | + 49.88% RocksDBStore::submit_common
>         | | + 49.88% rocksdb::DBImpl::Write
>         | |   + 49.88% rocksdb::DBImpl::WriteImpl
>         | |     + 46.09% rocksdb::DBImpl::WriteToWAL
>         | |     | + 45.35% rocksdb::WritableFileWriter::Sync
>         | |     | | + 45.23% rocksdb::WritableFileWriter::SyncInternal
>         | |     | | | + 45.23% BlueRocksWritableFile::Sync
>         | |     | | |   + 45.23% fsync
>         | |     | | |     + 45.11% BlueFS::_fsync
>         | |     | | |     | + 28.73% BlueFS::_flush_bdev_safely
>         | |     | | |     | | + 27.63% BlueFS::flush_bdev
>         | |     | | |     | | | + 27.63% KernelDevice::flush
>         | |     | | |     | | |   + 27.63% fdatasync
>         | |     | | |     | | + 0.49% lock
>         | |     | | |     | | + 0.24% BlueFS::wait_for_aio
>         | |     | | |     | | + 0.12% unlock
>         | |     | | |     | | + 0.12% clear
>         | |     | | |     | | + 0.12% BlueFS::_claim_completed_aios
>         | |     | | |     | + 16.38% BlueFS::_flush
>         | |     | | |     |   + 16.38% BlueFS::_flush_range
>         | |     | | |     |     + 14.30% KernelDevice::aio_write
>         | |     | | |     |     | + 14.30% KernelDevice::_sync_write
>         | |     | | |     |     |   + 7.58% pwritev64
>         | |     | | |     |     |   + 6.72% sync_file_range
>         | |     | | |     |     + 1.10% ~list
>         | |     | | |     |     + 0.37% __memset_sse2
>         | |     | | |     |     + 0.24% bluefs_fnode_t::seek
>         | |     | | |     |     + 0.12% ceph::buffer::list::substr_of
>         | |     | | |     |     + 0.12%
> ceph::buffer::list::claim_append_piecewise
>         | |     | | |     + 0.12% unique_lock
>         | |     | | + 0.12% rocksdb::WritableFileWriter::Flush
>         | |     | + 0.73% rocksdb::DBImpl::WriteToWAL
>         | |     + 2.32% rocksdb::WriteBatchInternal::InsertInto
>         | |     + 0.37% rocksdb::WriteThread::ExitAsBatchGroupLeader
>         | |     + 0.12% ~WriteContext
>         | |     + 0.12% rocksdb::WriteThread::JoinBatchGroup
>         | |     + 0.12% rocksdb::WriteThread::EnterAsBatchGroupLeader
>         | |     + 0.12% rocksdb::StopWatch::~StopWatch
>         | |     + 0.12% rocksdb::InstrumentedMutex::Lock
>         | |     + 0.12% rocksdb::DBImpl::MarkLogsSynced
>         | |     + 0.12% operator=
>         | |     + 0.12% PerfStepTimer
>         | + 0.12% ceph_clock_now
>         + 38.51% RocksDBStore::submit_transaction
>         | + 38.51% RocksDBStore::submit_common
>         |   + 38.14% rocksdb::DBImpl::Write
>         |     + 38.02% rocksdb::DBImpl::WriteImpl
>         |       + 25.18% rocksdb::WriteBatchInternal::InsertInto
>         |       | + 25.18% rocksdb::WriteBatch::Iterate
>         |       |   + 23.59% rocksdb::MemTableInserter::PutCF
>         |       |   | + 23.59% rocksdb::MemTableInserter::PutCFImpl
>         |       |   |   + 23.35% rocksdb::MemTable::Add
>         |       |   |   | + 16.99%
> rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator
> const&>::Insert<false>
>         |       |   |   | | + 14.79%
> rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator
> const&>::RecomputeSpliceLevels
>         |       |   |   | | | + 14.18% FindSpliceForLevel<true>
>         |       |   |   | | |   + 8.56% KeyIsAfterNode
>         |       |   |   | | |   + 0.73% Next
>         |       |   |   | | + 1.34% KeyIsAfterNode
>         |       |   |   | | + 0.86%
> rocksdb::MemTable::KeyComparator::operator()
>         |       |   |   | + 4.89% __memcpy_ssse3
>         |       |   |   | + 0.61% rocksdb::MemTable::UpdateFlushState
>         |       |   |   | + 0.37% rocksdb::(anonymous
> namespace)::SkipListRep::Allocate
>         |       |   |   + 0.12%
> rocksdb::ColumnFamilyMemTablesImpl::GetMemTable
>         |       |   + 0.61% rocksdb::ReadRecordFromWriteBatch
>         |       |   + 0.61% rocksdb::MemTableInserter::DeleteCF
>         |       |   + 0.12% operator=
>         |       + 11.49% rocksdb::DBImpl::WriteToWAL
>         |       | + 11.37% rocksdb::DBImpl::WriteToWAL
>         |       | | + 11.12% rocksdb::log::Writer::AddRecord
>         |       | |   + 11.12% rocksdb::log::Writer::EmitPhysicalRecord
>         |       | |     + 8.07% rocksdb::WritableFileWriter::Append
>         |       | |     + 2.08% rocksdb::crc32c::crc32c_3way
>         |       | |     + 0.98% rocksdb::WritableFileWriter::Flush
>         |       | + 0.12% rocksdb::WriteBatchInternal::SetSequence
>         |       + 0.24% rocksdb::WriteThread::EnterAsBatchGroupLeader
>         |       + 0.12% rocksdb::WriteThread::ExitAsBatchGroupLeader
>         |       + 0.12% Writer
>         |       + 0.12% WriteContext
>         + 6.60% std::condition_variable::wait(std::unique_lock<std::mutex>&)
>         + 1.10% RocksDBStore::RocksDBTransactionImpl::rm_single_key
>         + 0.98% std::condition_variable::notify_one()
>         + 0.73% BlueStore::_txc_applied_kv
>         + 0.37% KernelDevice::flush
>         + 0.24% get_deferred_key
>         + 0.24% RocksDBStore::get_transaction
>         + 0.12% ~basic_string
>         + 0.12% swap
>         + 0.12% lock
>         + 0.12% end
>         + 0.12% ceph_clock_now
>         + 0.12% Throttle::put
>
>
> pglog-split-fastinfo:
> + 100.00% clone
>   + 100.00% start_thread
>     + 100.00% BlueStore::KVSyncThread::entry
>       + 100.00% BlueStore::_kv_sync_thread
>         + 53.03% RocksDBStore::submit_transaction_sync
>         | + 52.75% RocksDBStore::submit_common
>         | | + 52.75% rocksdb::DBImpl::Write
>         | |   + 52.60% rocksdb::DBImpl::WriteImpl
>         | |     + 48.55% rocksdb::DBImpl::WriteToWAL
>         | |     | + 47.83% rocksdb::WritableFileWriter::Sync
>         | |     | | + 47.54% rocksdb::WritableFileWriter::SyncInternal
>         | |     | | | + 47.54% BlueRocksWritableFile::Sync
>         | |     | | |   + 47.54% fsync
>         | |     | | |     + 47.54% BlueFS::_fsync
>         | |     | | |       + 31.50% BlueFS::_flush_bdev_safely
>         | |     | | |       | + 31.07% BlueFS::flush_bdev
>         | |     | | |       | | + 31.07% KernelDevice::flush
>         | |     | | |       | |   + 30.92% fdatasync
>         | |     | | |       | |   + 0.14% lock_guard
>         | |     | | |       | + 0.14% lock
>         | |     | | |       | + 0.14% BlueFS::wait_for_aio
>         | |     | | |       + 16.04% BlueFS::_flush
>         | |     | | |         + 16.04% BlueFS::_flush_range
>         | |     | | |           + 11.99% KernelDevice::aio_write
>         | |     | | |           | + 11.99% KernelDevice::_sync_write
>         | |     | | |           |   + 6.36% pwritev64
>         | |     | | |           |   + 5.49% sync_file_range
>         | |     | | |           + 1.59% ~list
>         | |     | | |           + 0.43% IOContext::aio_wait
>         | |     | | |           + 0.29%
> ceph::buffer::list::claim_append_piecewise
>         | |     | | |           + 0.14% list
>         | |     | | |           + 0.14% ceph::buffer::ptr::zero
>         | |     | | |           + 0.14% ceph::buffer::list::substr_of
>         | |     | | |           + 0.14%
> ceph::buffer::list::page_aligned_appender::flush
>         | |     | | |           + 0.14% ceph::buffer::list::list
>         | |     | | |           + 0.14% ceph::buffer::list::append
>         | |     | | |           + 0.14% __memset_sse2
>         | |     | | |           + 0.14%
> _ZN4ceph6buffer4list6appendERKNS0_3ptrEjj@plt
>         | |     | | + 0.29% rocksdb::WritableFileWriter::Flush
>         | |     | + 0.43% rocksdb::DBImpl::WriteToWAL
>         | |     | + 0.14% rocksdb::DBImpl::MergeBatch
>         | |     | + 0.14% operator=
>         | |     + 3.47% rocksdb::WriteBatchInternal::InsertInto
>         | |     + 0.14% rocksdb::InstrumentedMutex::Lock
>         | |     + 0.14% Writer
>         | |     + 0.14% LastSequence
>         | + 0.14% operator-
>         + 29.05% RocksDBStore::submit_transaction
>         | + 28.61% RocksDBStore::submit_common
>         | | + 28.32% rocksdb::DBImpl::Write
>         | |   + 28.03% rocksdb::DBImpl::WriteImpl
>         | |     + 17.77% rocksdb::WriteBatchInternal::InsertInto
>         | |     | + 17.77% rocksdb::WriteBatch::Iterate
>         | |     |   + 17.20% rocksdb::MemTableInserter::PutCF
>         | |     |   | + 17.20% rocksdb::MemTableInserter::PutCFImpl
>         | |     |   |   + 16.47% rocksdb::MemTable::Add
>         | |     |   |   | + 12.72%
> rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator
> const&>::Insert<false>
>         | |     |   |   | | + 10.69%
> rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator
> const&>::RecomputeSpliceLevels
>         | |     |   |   | | | + 10.40% FindSpliceForLevel<true>
>         | |     |   |   | | |   + 6.79% KeyIsAfterNode
>         | |     |   |   | | |   + 0.29% Next
>         | |     |   |   | | + 1.01%
> rocksdb::MemTable::KeyComparator::operator()
>         | |     |   |   | | + 0.72% KeyIsAfterNode
>         | |     |   |   | | + 0.14% SetNext
>         | |     |   |   | + 3.32% __memcpy_ssse3
>         | |     |   |   | + 0.29% rocksdb::MemTable::UpdateFlushState
>         | |     |   |   | + 0.14% rocksdb::(anonymous
> namespace)::SkipListRep::Allocate
>         | |     |   |   + 0.29% SeekToColumnFamily
>         | |     |   |   + 0.29% CheckMemtableFull
>         | |     |   |   + 0.14%
> rocksdb::ColumnFamilyMemTablesImpl::GetMemTable
>         | |     |   + 0.14% rocksdb::ReadRecordFromWriteBatch
>         | |     |   + 0.14% operator=
>         | |     + 9.10% rocksdb::DBImpl::WriteToWAL
>         | |     + 0.29% rocksdb::DBImpl::PreprocessWrite
>         | |     + 0.14% ~WriteContext
>         | |     + 0.14% rocksdb::WriteThread::ExitAsBatchGroupLeader
>         | |     + 0.14% operator=
>         | |     + 0.14% Writer
>         | |     + 0.14% FinalStatus
>         | + 0.14% ceph_clock_now
>         | + 0.14% PerfCounters::tinc
>         + 12.86%
> std::condition_variable::wait(std::unique_lock<std::mutex>&)
>         | + 12.86% pthread_cond_wait@@GLIBC_2.3.2
>         |   + 0.29% __pthread_mutex_cond_lock
>         + 1.45% BlueStore::_txc_applied_kv
>         + 0.87% std::condition_variable::notify_one()
>         + 0.72% KernelDevice::flush
>         + 0.43% RocksDBStore::RocksDBTransactionImpl::rm_single_key
>         + 0.29% ~shared_ptr
>         + 0.14% ~deque
>         + 0.14% unique_lock
>         + 0.14% swap
>         + 0.14% operator--
>         + 0.14% log_state_latency
>         + 0.14% lock
>         + 0.14% get_deferred_key
>         + 0.14% deque
>
> Mark
>
> On 06/20/2018 03:19 AM, xiaoyan li wrote:
>>
>>   Hi all,
>> I wrote a poc to split pglog from Rocksdb and store them into
>> standalone space in the block device.
>> The updates are done in OSD and BlueStore:
>>
>> OSD parts:
>> 1.       Split pglog entries and pglog info from omaps.
>> BlueStore:
>> 1.       Allocate 16M space in block device per PG for storing pglog.
>> 2.       Per every transaction from OSD,  combine pglog entries and
>> pglog info, and write them into a block. The block is set to 4k at
>> this moment.
>>
>> Currently, I only make the write workflow work.
>> With librbd+fio on a cluster with an OSD (on Intel Optane 370G), I got
>> the following performance for 4k random writes, and the performance
>> got 13.87% better.
>>
>> Master:
>>    write: IOPS=48.3k, BW=189MiB/s (198MB/s)(55.3GiB/300009msec)
>>      slat (nsec): min=1032, max=1683.2k, avg=4345.13, stdev=3988.69
>>      clat (msec): min=3, max=123, avg=10.60, stdev= 8.31
>>       lat (msec): min=3, max=123, avg=10.60, stdev= 8.31
>>
>> Pgsplit branch:
>>    write: IOPS=55.0k, BW=215MiB/s (225MB/s)(62.0GiB/300010msec)
>>      slat (nsec): min=1068, max=1339.7k, avg=4360.58, stdev=3878.47
>>      clat (msec): min=2, max=120, avg= 9.30, stdev= 6.92
>>       lat (msec): min=2, max=120, avg= 9.31, stdev= 6.92
>>
>> Here is the POC:
>> https://github.com/lixiaoy1/ceph/commits/pglog-split-fastinfo
>> The problem is that per every transaction, I use a 4k block to save
>> the pglog entries and pglog info which is only 130+920 = 1050 bytes.
>> This wastes a lot of space.
>> Any suggestions?
>>
>> Best wishes
>> Lisa
>>
>> On Thu, Apr 5, 2018 at 12:09 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>
>>>
>>>
>>> On 04/03/2018 09:36 PM, xiaoyan li wrote:
>>>>
>>>>
>>>> On Tue, Apr 3, 2018 at 11:15 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 04/03/2018 09:56 AM, Mark Nelson wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 04/03/2018 08:27 AM, Sage Weil wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 3 Apr 2018, Li Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>      Before we move forward, could someone give a test such that
>>>>>>>> the pglog not written into rocksdb at all, to see how much is the
>>>>>>>> performance improvement as the upper bound, it shoule be less than
>>>>>>>> turning on the bluestore_debug_omit_kv_commit
>>>>>>>
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> (The PetStore behavior doesn't tell us anything about how BlueStore
>>>>>>> will
>>>>>>> behave without the pglog overhead.)
>>>>>>>
>>>>>>> sage
>>>>>>
>>>>>>
>>>>>>
>>>>>> We do have some testing of the bluestore's behavior, though it's about
>>>>>> 6
>>>>>> months old now:
>>>>>>
>>>>>> - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>>>>>>
>>>>>> - 128 PGs
>>>>>>
>>>>>> - stats are sloppy since they only appear every ~10 mins
>>>>>>
>>>>>> *- default min_pg_log_entries = 1500, trim = default, iops = 26.6K*
>>>>>>
>>>>>>      - Default CF - Size:  65.63MB, KeyIn:  22M, KeyDrop:  17M, Flush:
>>>>>> 7.858GB
>>>>>>
>>>>>>      - [M] CF     - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush:
>>>>>> 15.847GB <-- with this workload this is pg log and dup op kv entries
>>>>>>
>>>>>>      - [L] CF     - Size:   1.00MB, KeyIn: 181K, KeyDrop:  80K, Flush:
>>>>>> 0.320GB <-- deferred writes*- min_pg_log_entries = 10, trim = 10, iops
>>>>>> =
>>>>>> 24.2K*
>>>>>>
>>>>>>      - Default CF - Size:  23.15MB, KeyIn:  21M, KeyDrop:  16M, Flush:
>>>>>> 7.538GB
>>>>>>
>>>>>>      - [M] CF     - Size:  60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush:
>>>>>> 8.884GB <-- with this workload this is pg log and dup op kv entries
>>>>>>
>>>>>>      - [L] CF     - Size:   1.12MB, KeyIn: 188K, KeyDrop:  83K, Flush:
>>>>>> 0.331GB <-- deferred writes - min_pg_log_entries = 1, trim = 1, *iops
>>>>>> =
>>>>>> 23.8K*
>>>>>>
>>>>>>      - Default CF - Size:  68.58MB, KeyIn:  22M, KeyDrop:  17M, Flush:
>>>>>> 7.936GB
>>>>>>
>>>>>>      - [M] CF     - Size:  96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush:
>>>>>> 9.289GB <-- with this workload this is pg log and dup op kv entries
>>>>>>
>>>>>>      - [L] CF     - Size:   1.04MB, KeyIn: 209K, KeyDrop:  92K, Flush:
>>>>>> 0.368GB <-- deferred writes
>>>>>>
>>>>>> - min_pg_log_entires = 3000, trim = 1, *iops = 25.8K*
>>>>>>
>>>>>> *
>>>>>> The actual performance variation here I think is much less important
>>>>>> than
>>>>>> the KeyIn behavior.  The NVMe devices in these tests are fast enough
>>>>>> to
>>>>>> absorb a fair amount of overhead.
>>>>>
>>>>>
>>>>>
>>>>> Ugh, sorry.  That will teach me to talk in meeting and paste at the
>>>>> same
>>>>> time.  Those were the wrong stats.  Here are the right ones:
>>>>>
>>>>>>           - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>>>>>>           - 128 PGs
>>>>>>           - stats are sloppy since they only appear every ~10 mins
>>>>>>           - min_pg_log_entries = 3000, trim = default, pginfo hack,
>>>>>> iops
>>>>>> =
>>>>>> 27.8K
>>>>>>               - Default CF - Size:  23.15MB, KeyIn:  24M, KeyDrop:
>>>>>> 19M,
>>>>>> Flush:  8.662GB
>>>>>>               - [M] CF     - Size: 159.97MB, KeyIn: 162M, KeyDrop:
>>>>>> 139M,
>>>>>> Flush: 10.335GB <-- with this workload this is pg log and dup op kv
>>>>>> entries
>>>>>>               - [L] CF     - Size:   1.39MB, KeyIn: 201K, KeyDrop:
>>>>>> 89K,
>>>>>> Flush:  0.355GB <-- deferred writes                -
>>>>>> min_pg_log_entries
>>>>>> =
>>>>>> 3000, trim = default iops = 28.3K
>>>>>>               - Default CF - Size:  23.13MB, KeyIn:  25M, KeyDrop:
>>>>>> 19M,
>>>>>> Flush:  8.762GB
>>>>>>               - [M] CF     - Size: 159.97MB, KeyIn: 202M, KeyDrop:
>>>>>> 175M,
>>>>>> Flush: 16.890GB <-- with this workload this is pg log and dup op kv
>>>>>> entries
>>>>>>               - [L] CF     - Size:   0.86MB, KeyIn: 201K, KeyDrop:
>>>>>> 89K,
>>>>>> Flush:  0.355GB <-- deferred writes
>>>>>>           - default min_pg_log_entries = 1500, trim = default, iops =
>>>>>> 26.6K
>>>>>>               - Default CF - Size:  65.63MB, KeyIn:  22M, KeyDrop:
>>>>>> 17M,
>>>>>> Flush:  7.858GB
>>>>>>               - [M] CF     - Size: 118.09MB, KeyIn: 302M, KeyDrop:
>>>>>> 269M,
>>>>>> Flush: 15.847GB <-- with this workload this is pg log and dup op kv
>>>>>> entries
>>>>>>               - [L] CF     - Size:   1.00MB, KeyIn: 181K, KeyDrop:
>>>>>> 80K,
>>>>>> Flush:  0.320GB <-- deferred writes
>>>>>>           - min_pg_log_entries = 10, trim = 10, iops = 24.2K
>>>>>>               - Default CF - Size:  23.15MB, KeyIn:  21M, KeyDrop:
>>>>>> 16M,
>>>>>> Flush:  7.538GB
>>>>>>               - [M] CF     - Size:  60.89MB, KeyIn: 277M, KeyDrop:
>>>>>> 250M,
>>>>>> Flush:  8.884GB <-- with this workload this is pg log and dup op kv
>>>>>> entries
>>>>>>               - [L] CF     - Size:   1.12MB, KeyIn: 188K, KeyDrop:
>>>>>> 83K,
>>>>>> Flush:  0.331GB <-- deferred writes
>>>>>>           - min_pg_log_entries = 1, trim = 1, iops = 23.8K
>>>>>>               - Default CF - Size:  68.58MB, KeyIn:  22M, KeyDrop:
>>>>>> 17M,
>>>>>> Flush:  7.936GB
>>>>>>               - [M] CF     - Size:  96.39MB, KeyIn: 302M, KeyDrop:
>>>>>> 262M,
>>>>>> Flush:  9.289GB <-- with this workload this is pg log and dup op kv
>>>>>> entries
>>>>>>               - [L] CF     - Size:   1.04MB, KeyIn: 209K, KeyDrop:
>>>>>> 92K,
>>>>>> Flush:  0.368GB <-- deferred writes
>>>>>>           - min_pg_log_entires = 3000, trim = 1, iops = 25.8K
>>>>
>>>>
>>>> Hi Mark, do you extract above results from compaction stats in Rocksdb
>>>> LOG?
>>>
>>>
>>>
>>> Correct, except for the IOPS numbers which were from the client
>>> benchmark.
>>>
>>>
>>>>
>>>> ** Compaction Stats [default] **
>>>> Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB)
>>>> Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt)
>>>> Avg(sec) KeyIn KeyDrop
>>>>
>>>>
>>>> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>     L0      6/0   270.47 MB   1.1      0.0     0.0      0.0       0.2
>>>>     0.2       0.0   1.0      0.0    154.3         1         4    0.329
>>>>       0      0
>>>>     L1      3/0   190.94 MB   0.7      0.0     0.0      0.0       0.0
>>>>     0.0       0.0   0.0      0.0      0.0         0         0    0.000
>>>>       0      0
>>>>    Sum      9/0   461.40 MB   0.0      0.0     0.0      0.0       0.2
>>>>     0.2       0.0   1.0      0.0    154.3         1         4    0.329
>>>>       0      0
>>>>    Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.2
>>>>    0.2       0.0   1.0      0.0    154.3         1         4    0.329
>>>>      0      0
>>>> Uptime(secs): 9.9 total, 9.9 interval
>>>> Flush(GB): cumulative 0.198, interval 0.198
>>>>
>>>>> Note specifically how the KeyIn rate drops with the min_pg_log_entries
>>>>> increased (ie disable dup_ops) and hacking out pginfo.  I suspect that
>>>>> commenting out log_operation would reduce the KeyIn rate significantly
>>>>> further.  Again these drives can absorb a lot of this so the
>>>>> improvement
>>>>> in
>>>>> iops is fairly modest (and setting min_pg_log_entries low actually
>>>>> hurts!),
>>>>> but this isn't just about performance, it's about the behavior that we
>>>>> invoke.  The Petstore results absolutely show us that on very fast
>>>>> storage
>>>>> we see a dramatic CPU usage reduction by removing log_operation and
>>>>> pginfo,
>>>>> so I think we should focus on what kind of behavior we want
>>>>> pglog/pginfo/dup_ops to invoke.
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>>>
>>>>>> *
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Li Wang
>>>>>>>>
>>>>>>>> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Based on your above discussion about pglog, I have the following
>>>>>>>>> rough
>>>>>>>>> design. Please help to give your suggestions.
>>>>>>>>>
>>>>>>>>> There will be three partitions: raw part for customer IOs, Bluefs
>>>>>>>>> for
>>>>>>>>> Rocksdb, and pglog partition.
>>>>>>>>> The former two partitions are same as current. The pglog partition
>>>>>>>>> is
>>>>>>>>> splitted into 1M blocks. We allocate blocks for ring buffers per
>>>>>>>>> pg.
>>>>>>>>> We will have such following data:
>>>>>>>>>
>>>>>>>>> Allocation bitmap (just in memory)
>>>>>>>>>
>>>>>>>>> The pglog partition has a bitmap to record which block is allocated
>>>>>>>>> or
>>>>>>>>> not. We can rebuild it through pg->allocated_block_list when
>>>>>>>>> starting,
>>>>>>>>> and no need to store it in persistent disk. But we will store basic
>>>>>>>>> information about the pglog partition in Rocksdb, like block size,
>>>>>>>>> block number etc when the objectstore is initialized.
>>>>>>>>>
>>>>>>>>> Pg -> allocated_blocks_list
>>>>>>>>>
>>>>>>>>> When a pg is created and IOs start, we can allocate a block for
>>>>>>>>> every
>>>>>>>>> pg. Every pglog entry is less than 300 bytes, 1M can store 3495
>>>>>>>>> entries. When total pglog entries increase and exceed the number,
>>>>>>>>> we
>>>>>>>>> can add a new block to the pg.
>>>>>>>>>
>>>>>>>>> Pg->start_position
>>>>>>>>>
>>>>>>>>> Record the oldest valid entry per pg.
>>>>>>>>>
>>>>>>>>> Pg->next_position
>>>>>>>>>
>>>>>>>>> Record the next entry to add per pg. The data will be updated
>>>>>>>>> frequently, but Rocksdb is suitable for its io mode, and most of
>>>>>>>>> data will be merged.
>>>>>>>>>
>>>>>>>>> Updated Bluestore write progess:
>>>>>>>>>
>>>>>>>>> When writing data to disk (before metadata updating), we can append
>>>>>>>>> the pglog entry to its ring buffer in parallel.
>>>>>>>>> After that, submit pg ring buffer changes like pg->next_position,
>>>>>>>>> and
>>>>>>>>> current other metadata changes to Rocksdb.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari
>>>>>>>>> <varada.kari@xxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang
>>>>>>>>>> <laurence.liwang@xxxxxxxxx>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>      If we wanna store pg log in a standalone ring buffer,
>>>>>>>>>>> another
>>>>>>>>>>> candidate
>>>>>>>>>>> is the deferred write, why not use the ring buffer as the journal
>>>>>>>>>>> for
>>>>>>>>>>> 4K random
>>>>>>>>>>> write, it should be much more lightweight than rocksdb
>>>>>>>>>>>
>>>>>>>>>> It will be similar to FileStore implementation, for small writes.
>>>>>>>>>> That
>>>>>>>>>> comes with the same alignment issues and given
>>>>>>>>>> write amplification. Rocksdb nicely abstracts that and we don't
>>>>>>>>>> make
>>>>>>>>>> it to L0 files because of WAL handling.
>>>>>>>>>>
>>>>>>>>>> Varada
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Li Wang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson
>>>>>>>>>>>>> <mnelson@xxxxxxxxxx>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2) It sure feels like conceptually the pglog should be
>>>>>>>>>>>>>> represented
>>>>>>>>>>>>>> as a
>>>>>>>>>>>>>> per-pg ring buffer rather than key/value data.  Maybe there
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>> really
>>>>>>>>>>>>>> important reasons that it shouldn't be, but I don't currently
>>>>>>>>>>>>>> see
>>>>>>>>>>>>>> them.  As
>>>>>>>>>>>>>> far as the objectstore is concerned, it seems to me like there
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>> valid
>>>>>>>>>>>>>> reasons to provide some kind of log interface and perhaps that
>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>> used for pg_log.  That sort of opens the door for different
>>>>>>>>>>>>>> object
>>>>>>>>>>>>>> store
>>>>>>>>>>>>>> implementations fulfilling that functionality in whatever ways
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> author
>>>>>>>>>>>>>> deems fit.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the reddit lingo, pretty much this.  We should be
>>>>>>>>>>>>> concentrating
>>>>>>>>>>>>> on
>>>>>>>>>>>>> this direction, or ruling it out.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, +1
>>>>>>>>>>>>
>>>>>>>>>>>> It seems like step 1 is a proof of concept branch that encodes
>>>>>>>>>>>> pg_log_entry_t's and writes them to a simple ring buffer.  The
>>>>>>>>>>>> first
>>>>>>>>>>>> questions to answer is (a) whether this does in fact improve
>>>>>>>>>>>> things
>>>>>>>>>>>> significantly and (b) whether we want to have an independent
>>>>>>>>>>>> ring
>>>>>>>>>>>> buffer
>>>>>>>>>>>> for each PG or try to mix them into one big one for the whole
>>>>>>>>>>>> OSD
>>>>>>>>>>>> (or
>>>>>>>>>>>> maybe per shard).
>>>>>>>>>>>>
>>>>>>>>>>>> The second question is how that fares on HDDs.  My guess is that
>>>>>>>>>>>> the
>>>>>>>>>>>> current rocksdb strategy is better because it reduces the number
>>>>>>>>>>>> of
>>>>>>>>>>>> IOs
>>>>>>>>>>>> and the additional data getting compacted (and CPU usage) isn't
>>>>>>>>>>>> the
>>>>>>>>>>>> limiting factor on HDD perforamnce (IOPS are).  (But maybe we'll
>>>>>>>>>>>> get
>>>>>>>>>>>> lucky
>>>>>>>>>>>> and the new strategy will be best for both HDD and SSD..)
>>>>>>>>>>>>
>>>>>>>>>>>> Then we have to modify PGLog to be a complete implementation.  A
>>>>>>>>>>>> strict
>>>>>>>>>>>> ring buffer probably won't work because the PG log might not
>>>>>>>>>>>> trim
>>>>>>>>>>>> and
>>>>>>>>>>>> because log entries are variable length, so there'll probably
>>>>>>>>>>>> need
>>>>>>>>>>>> to be
>>>>>>>>>>>> some simple mapping table (vs a trivial start/end ring buffer
>>>>>>>>>>>> position) to
>>>>>>>>>>>> deal with that.  We have to trim the log periodically, so every
>>>>>>>>>>>> so
>>>>>>>>>>>> many
>>>>>>>>>>>> entries we may want to realign with a min_alloc_size boundary.
>>>>>>>>>>>> We
>>>>>>>>>>>> someones have to back up and rewrite divergent portions of the
>>>>>>>>>>>> log
>>>>>>>>>>>> (during
>>>>>>>>>>>> peering) so we'll need to sort out whether that is a complete
>>>>>>>>>>>> reencode/rewrite or whether we keep encoded entries in ram
>>>>>>>>>>>> (individually
>>>>>>>>>>>> or in chunks), etc etc.
>>>>>>>>>>>>
>>>>>>>>>>>> sage
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>>>> ceph-devel" in
>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>>> More majordomo info at
>>>>>>>>>>>> http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best wishes
>>>>>>>>> Lisa
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in
>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>

-- 
Best wishes
Lisa
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html