On Thu, Jun 21, 2018 at 6:46 AM, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
> Hi Lisa,
>
> I gave your branch a whirl. On the first run I tried to allocate too many
> PGs and it ran out of space and asserted. :D We'll need to figure out a
> mechanism to allocate space that doesn't depend on a hardcoded dev value.

Yes, the problem exists.

>
> Ok, now for the goods. These were just fast 10 minute tests on a tiny RBD
> volume, so take the results with a big grain of salt. I expect things to
> improve for pglog-split-fastinfo when there's more data in rocksdb though.
> Despite that, the results are interesting! In pglog-split-fastinfo, rocksdb
> deals with far fewer keys and spends far less time in compaction, but indeed
> having a single WAL for everything means more coalescing of writes with the
> associated benefits (But man that compaction!). I think on the P3700 you
> can use 512B sectors, so that would at least help with the write-amp but may
> not offer any performance benefits?

OK, I will test with 512B sectors. In the meantime, I ran tests on an
Intel P3700 400G. The difference between master and the
pglog-split-fastinfo branch on this device is still
(50.4-45.1)/45.1*100 ≈ 11.8%. Mark, in your tests the IOPS improvement
is small.

My test environment:
CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Disk: single Intel P3700 400G
15 10G RBD volumes, 1 OSD, 128 PGs, iodepth=32

Test results:

pglog-split-fastinfo branch:
rbd0: (groupid=0, jobs=16): err= 0: pid=3563: Thu Jun 21 06:21:07 2018
  write: IOPS=45.1k, BW=176MiB/s (185MB/s)(51.7GiB/300057msec)
    slat (nsec): min=1040, max=961881, avg=4186.57, stdev=3802.90
    clat (msec): min=3, max=175, avg=11.34, stdev=12.97
     lat (msec): min=3, max=175, avg=11.34, stdev=12.97

master branch:
rbd0: (groupid=0, jobs=16): err= 0: pid=52676: Thu Jun 21 05:38:08 2018
  write: IOPS=50.4k, BW=197MiB/s (206MB/s)(57.7GiB/300009msec)
    slat (nsec): min=1044, max=3379.9k, avg=4370.65, stdev=3955.84
    clat (msec): min=2, max=178, avg=10.16, stdev=10.41
     lat (msec): min=2, max=178, avg=10.16, stdev=10.41

My ceph.conf:

[global]
fsid = 18677778-8289-11e7-b44b-a4bf0118d2ff
pid_path = /var/run/ceph
osd pool default size = 1
auth_service_required = none
auth_cluster_required = none
auth_client_required = none
osd_objectstore = bluestore
mon allow pool delete = true
debug bluestore = 0/0
debug osd = 0/0
debug ms = 0/0
debug bluefs = 0/0
debug bdev = 0/0
debug rocksdb = 0/0
osd pool default pg num = 64
osd op num shards = 8

[mon]
mon_data = /var/lib/ceph/mon.$id

[osd]
osd_data = /var/lib/ceph/mnt/osd-device-$id-data
osd_mkfs_type = xfs
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k

[client]
rbd_cache = false

[mon.sceph9]
host = sceph9
mon addr = 127.0.0.1
log file = /opt/fio_test/osd/mon.log

[osd.0]
host = sceph9
public addr = 127.0.0.1
cluster addr = 127.0.0.1
devs = /dev/nvme1n1p3
bluestore_block_path = /dev/nvme1n1p4
bluestore_block_db_path = /dev/nvme1n1p2
bluestore_block_wal_path = /dev/nvme1n1p1
log file = /opt/fio_test/osd/osd.log
bluestore_shard_finishers = true
bdev_block_size = 4096

My fio job file:

[global]
ioengine=rbd
clientname=admin
rw=randwrite
bs=4k
time_based=1
runtime=300s
iodepth=32
group_reporting

[rbd0]
pool=rbd
rbdname=rbd0
....
[rbd14]
pool=rbd
rbdname=rbd14

Lisa

>
> This looks like it would be amazing if you could do cache-line granularity
> writes!
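
A note on the write-granularity point: one encoded pglog entry plus pginfo is
only about 130+920 = 1050 bytes today, so 512B sectors would already shrink the
per-transaction pglog write from 4 KiB to roughly 1.5 KiB. Another option might
be to keep rewriting the current tail block of the per-PG ring buffer and only
advance to a new block once it is full, so several transactions share one
block. The sketch below only illustrates that idea: the class name and
interfaces are hypothetical, it uses plain std:: types rather than the real
BlueStore/KernelDevice code, and it is not taken from the pglog-split-fastinfo
branch.

// Hypothetical sketch: pack successive pglog records into the tail block of a
// per-PG ring buffer instead of using a fresh 4 KiB block per transaction.
// The device write callback, block size and record encoding are assumptions.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <vector>

class PgLogTailBlock {
public:
  using DevWrite = std::function<void(uint64_t off, const void* data, uint32_t len)>;

  PgLogTailBlock(uint64_t base_offset, uint32_t block_size, DevWrite write)
    : base_(base_offset), block_size_(block_size), buf_(block_size, 0),
      used_(0), block_index_(0), write_(std::move(write)) {}

  // Append one encoded record (pglog entry + pginfo, ~1050 bytes today) and
  // rewrite only the tail block. The same block is rewritten each transaction
  // until it fills up, then we advance to the next block in the ring.
  void append(const std::vector<char>& record) {
    uint32_t len = static_cast<uint32_t>(record.size());
    if (len + sizeof(len) > block_size_)
      throw std::length_error("record larger than ring block");
    if (used_ + sizeof(len) + len > block_size_) {
      // Tail block is full: move on. A real implementation would also handle
      // wrap-around and trimming against pg->start_position here.
      ++block_index_;
      used_ = 0;
      std::fill(buf_.begin(), buf_.end(), 0);
    }
    std::memcpy(buf_.data() + used_, &len, sizeof(len));  // length header
    std::memcpy(buf_.data() + used_ + sizeof(len), record.data(), len);
    used_ += sizeof(len) + len;
    // Rewrite the whole tail block; with 512 B sectors only the dirty sectors
    // would need to be rewritten, cutting write amplification further.
    write_(base_ + uint64_t(block_index_) * block_size_, buf_.data(), block_size_);
  }

private:
  uint64_t base_;        // start of this PG's ring buffer on the device
  uint32_t block_size_;  // e.g. 4096, or the device sector size
  std::vector<char> buf_;
  uint32_t used_;
  uint32_t block_index_;
  DevWrite write_;
};

With 4 KiB blocks this would fit three ~1050-byte records per block before
advancing, so the per-entry space waste mostly disappears; the cost is that the
tail block is rewritten once per transaction instead of written once.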
> > > Test Setup > ---------- > 4k fio/librbd randwrite 16QD for 600s > 16GB RBD volume, 1 OSD, 128 PGs > Single Intel 800GB P3700 NVMe > > > Client > > fio Throughput > ---------- > 65a74d4: 22.5K IOPS > pglog-split-fastinfo: 22.9K IOPS > > fio 99.95th percentile latency > --------------------------- > 65a74d4: 25035 usec > pglog-split-fastinfo: 9372 usec > > > RocksDB > > RocksDB Compaction Event Count > ------------------------------ > 65a74d4: 77 > pglog-split-fastinfo: 70 > > RocksDB Compaction Total Time > ----------------------------- > 65a74d4: 66.85s > pglog-split-fastinfo: 16.39s > > RocksDB Compaction Total Input Records > -------------------------------------- > 65a74d4: 70255125 > pglog-split-fastinfo: 13507190 > > RocksDB Compaction Total Output Records > --------------------------------------- > 65a74d4: 34126131 > pglog-split-fastinfo: 4090954 > > > Collectl > > RockSDB WAL Device Write (AVG MB/s) > ----------------------------------- > 65a74d4: 142.74 > pglog-split-fastinfo: 138.71 > > RocksDB WAL Device Write (AVG OPS/s) > ------------------------------------ > 65a74d4: 6275.37 > pglog-split-fastinfo: 7314.67 > > > RocksDB DB Device Write (AVG MB/s) > ---------------------------------- > 65a74d4: 24.88 > pglog-split-fastinfo: 10.12 > > RocksDB DB Device Write (AVG OPS/s) > ----------------------------------- > 65a74d4: 200.05 > pglog-split-fastinfo: 81.81 > > > OSD Block Device Write (AVG MB/s) > --------------------------------- > 65a74d4: 87.20 > pglog-split-fastinfo: 177.79* > > OSD Block Device Write (AVG OPS/s) > ---------------------------------- > 65a74d4: 22324.3 > pglog-split-fastinfo: 45514.0* > > * pglog writes are now hitting the block device instead of rocksdb log > > > Wallclock Profile of bstore_kv_sync > ----------------------------------- > > 65a74d4: > + 100.00% clone > + 100.00% start_thread > + 100.00% BlueStore::KVSyncThread::entry > + 100.00% BlueStore::_kv_sync_thread > + 50.00% RocksDBStore::submit_transaction_sync > | + 49.88% RocksDBStore::submit_common > | | + 49.88% rocksdb::DBImpl::Write > | | + 49.88% rocksdb::DBImpl::WriteImpl > | | + 46.09% rocksdb::DBImpl::WriteToWAL > | | | + 45.35% rocksdb::WritableFileWriter::Sync > | | | | + 45.23% rocksdb::WritableFileWriter::SyncInternal > | | | | | + 45.23% BlueRocksWritableFile::Sync > | | | | | + 45.23% fsync > | | | | | + 45.11% BlueFS::_fsync > | | | | | | + 28.73% BlueFS::_flush_bdev_safely > | | | | | | | + 27.63% BlueFS::flush_bdev > | | | | | | | | + 27.63% KernelDevice::flush > | | | | | | | | + 27.63% fdatasync > | | | | | | | + 0.49% lock > | | | | | | | + 0.24% BlueFS::wait_for_aio > | | | | | | | + 0.12% unlock > | | | | | | | + 0.12% clear > | | | | | | | + 0.12% BlueFS::_claim_completed_aios > | | | | | | + 16.38% BlueFS::_flush > | | | | | | + 16.38% BlueFS::_flush_range > | | | | | | + 14.30% KernelDevice::aio_write > | | | | | | | + 14.30% KernelDevice::_sync_write > | | | | | | | + 7.58% pwritev64 > | | | | | | | + 6.72% sync_file_range > | | | | | | + 1.10% ~list > | | | | | | + 0.37% __memset_sse2 > | | | | | | + 0.24% bluefs_fnode_t::seek > | | | | | | + 0.12% ceph::buffer::list::substr_of > | | | | | | + 0.12% > ceph::buffer::list::claim_append_piecewise > | | | | | + 0.12% unique_lock > | | | | + 0.12% rocksdb::WritableFileWriter::Flush > | | | + 0.73% rocksdb::DBImpl::WriteToWAL > | | + 2.32% rocksdb::WriteBatchInternal::InsertInto > | | + 0.37% rocksdb::WriteThread::ExitAsBatchGroupLeader > | | + 0.12% ~WriteContext > | | + 0.12% rocksdb::WriteThread::JoinBatchGroup > | | + 
0.12% rocksdb::WriteThread::EnterAsBatchGroupLeader > | | + 0.12% rocksdb::StopWatch::~StopWatch > | | + 0.12% rocksdb::InstrumentedMutex::Lock > | | + 0.12% rocksdb::DBImpl::MarkLogsSynced > | | + 0.12% operator= > | | + 0.12% PerfStepTimer > | + 0.12% ceph_clock_now > + 38.51% RocksDBStore::submit_transaction > | + 38.51% RocksDBStore::submit_common > | + 38.14% rocksdb::DBImpl::Write > | + 38.02% rocksdb::DBImpl::WriteImpl > | + 25.18% rocksdb::WriteBatchInternal::InsertInto > | | + 25.18% rocksdb::WriteBatch::Iterate > | | + 23.59% rocksdb::MemTableInserter::PutCF > | | | + 23.59% rocksdb::MemTableInserter::PutCFImpl > | | | + 23.35% rocksdb::MemTable::Add > | | | | + 16.99% > rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator > const&>::Insert<false> > | | | | | + 14.79% > rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator > const&>::RecomputeSpliceLevels > | | | | | | + 14.18% FindSpliceForLevel<true> > | | | | | | + 8.56% KeyIsAfterNode > | | | | | | + 0.73% Next > | | | | | + 1.34% KeyIsAfterNode > | | | | | + 0.86% > rocksdb::MemTable::KeyComparator::operator() > | | | | + 4.89% __memcpy_ssse3 > | | | | + 0.61% rocksdb::MemTable::UpdateFlushState > | | | | + 0.37% rocksdb::(anonymous > namespace)::SkipListRep::Allocate > | | | + 0.12% > rocksdb::ColumnFamilyMemTablesImpl::GetMemTable > | | + 0.61% rocksdb::ReadRecordFromWriteBatch > | | + 0.61% rocksdb::MemTableInserter::DeleteCF > | | + 0.12% operator= > | + 11.49% rocksdb::DBImpl::WriteToWAL > | | + 11.37% rocksdb::DBImpl::WriteToWAL > | | | + 11.12% rocksdb::log::Writer::AddRecord > | | | + 11.12% rocksdb::log::Writer::EmitPhysicalRecord > | | | + 8.07% rocksdb::WritableFileWriter::Append > | | | + 2.08% rocksdb::crc32c::crc32c_3way > | | | + 0.98% rocksdb::WritableFileWriter::Flush > | | + 0.12% rocksdb::WriteBatchInternal::SetSequence > | + 0.24% rocksdb::WriteThread::EnterAsBatchGroupLeader > | + 0.12% rocksdb::WriteThread::ExitAsBatchGroupLeader > | + 0.12% Writer > | + 0.12% WriteContext > + 6.60% std::condition_variable::wait(std::unique_lock<std::mutex>&) > + 1.10% RocksDBStore::RocksDBTransactionImpl::rm_single_key > + 0.98% std::condition_variable::notify_one() > + 0.73% BlueStore::_txc_applied_kv > + 0.37% KernelDevice::flush > + 0.24% get_deferred_key > + 0.24% RocksDBStore::get_transaction > + 0.12% ~basic_string > + 0.12% swap > + 0.12% lock > + 0.12% end > + 0.12% ceph_clock_now > + 0.12% Throttle::put > > > pglog-split-fastinfo: > + 100.00% clone > + 100.00% start_thread > + 100.00% BlueStore::KVSyncThread::entry > + 100.00% BlueStore::_kv_sync_thread > + 53.03% RocksDBStore::submit_transaction_sync > | + 52.75% RocksDBStore::submit_common > | | + 52.75% rocksdb::DBImpl::Write > | | + 52.60% rocksdb::DBImpl::WriteImpl > | | + 48.55% rocksdb::DBImpl::WriteToWAL > | | | + 47.83% rocksdb::WritableFileWriter::Sync > | | | | + 47.54% rocksdb::WritableFileWriter::SyncInternal > | | | | | + 47.54% BlueRocksWritableFile::Sync > | | | | | + 47.54% fsync > | | | | | + 47.54% BlueFS::_fsync > | | | | | + 31.50% BlueFS::_flush_bdev_safely > | | | | | | + 31.07% BlueFS::flush_bdev > | | | | | | | + 31.07% KernelDevice::flush > | | | | | | | + 30.92% fdatasync > | | | | | | | + 0.14% lock_guard > | | | | | | + 0.14% lock > | | | | | | + 0.14% BlueFS::wait_for_aio > | | | | | + 16.04% BlueFS::_flush > | | | | | + 16.04% BlueFS::_flush_range > | | | | | + 11.99% KernelDevice::aio_write > | | | | | | + 11.99% KernelDevice::_sync_write > | | | | | | + 6.36% pwritev64 > | | | | | | + 5.49% sync_file_range > | | | 
| | + 1.59% ~list > | | | | | + 0.43% IOContext::aio_wait > | | | | | + 0.29% > ceph::buffer::list::claim_append_piecewise > | | | | | + 0.14% list > | | | | | + 0.14% ceph::buffer::ptr::zero > | | | | | + 0.14% ceph::buffer::list::substr_of > | | | | | + 0.14% > ceph::buffer::list::page_aligned_appender::flush > | | | | | + 0.14% ceph::buffer::list::list > | | | | | + 0.14% ceph::buffer::list::append > | | | | | + 0.14% __memset_sse2 > | | | | | + 0.14% > _ZN4ceph6buffer4list6appendERKNS0_3ptrEjj@plt > | | | | + 0.29% rocksdb::WritableFileWriter::Flush > | | | + 0.43% rocksdb::DBImpl::WriteToWAL > | | | + 0.14% rocksdb::DBImpl::MergeBatch > | | | + 0.14% operator= > | | + 3.47% rocksdb::WriteBatchInternal::InsertInto > | | + 0.14% rocksdb::InstrumentedMutex::Lock > | | + 0.14% Writer > | | + 0.14% LastSequence > | + 0.14% operator- > + 29.05% RocksDBStore::submit_transaction > | + 28.61% RocksDBStore::submit_common > | | + 28.32% rocksdb::DBImpl::Write > | | + 28.03% rocksdb::DBImpl::WriteImpl > | | + 17.77% rocksdb::WriteBatchInternal::InsertInto > | | | + 17.77% rocksdb::WriteBatch::Iterate > | | | + 17.20% rocksdb::MemTableInserter::PutCF > | | | | + 17.20% rocksdb::MemTableInserter::PutCFImpl > | | | | + 16.47% rocksdb::MemTable::Add > | | | | | + 12.72% > rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator > const&>::Insert<false> > | | | | | | + 10.69% > rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator > const&>::RecomputeSpliceLevels > | | | | | | | + 10.40% FindSpliceForLevel<true> > | | | | | | | + 6.79% KeyIsAfterNode > | | | | | | | + 0.29% Next > | | | | | | + 1.01% > rocksdb::MemTable::KeyComparator::operator() > | | | | | | + 0.72% KeyIsAfterNode > | | | | | | + 0.14% SetNext > | | | | | + 3.32% __memcpy_ssse3 > | | | | | + 0.29% rocksdb::MemTable::UpdateFlushState > | | | | | + 0.14% rocksdb::(anonymous > namespace)::SkipListRep::Allocate > | | | | + 0.29% SeekToColumnFamily > | | | | + 0.29% CheckMemtableFull > | | | | + 0.14% > rocksdb::ColumnFamilyMemTablesImpl::GetMemTable > | | | + 0.14% rocksdb::ReadRecordFromWriteBatch > | | | + 0.14% operator= > | | + 9.10% rocksdb::DBImpl::WriteToWAL > | | + 0.29% rocksdb::DBImpl::PreprocessWrite > | | + 0.14% ~WriteContext > | | + 0.14% rocksdb::WriteThread::ExitAsBatchGroupLeader > | | + 0.14% operator= > | | + 0.14% Writer > | | + 0.14% FinalStatus > | + 0.14% ceph_clock_now > | + 0.14% PerfCounters::tinc > + 12.86% > std::condition_variable::wait(std::unique_lock<std::mutex>&) > | + 12.86% pthread_cond_wait@@GLIBC_2.3.2 > | + 0.29% __pthread_mutex_cond_lock > + 1.45% BlueStore::_txc_applied_kv > + 0.87% std::condition_variable::notify_one() > + 0.72% KernelDevice::flush > + 0.43% RocksDBStore::RocksDBTransactionImpl::rm_single_key > + 0.29% ~shared_ptr > + 0.14% ~deque > + 0.14% unique_lock > + 0.14% swap > + 0.14% operator-- > + 0.14% log_state_latency > + 0.14% lock > + 0.14% get_deferred_key > + 0.14% deque > > Mark > > On 06/20/2018 03:19 AM, xiaoyan li wrote: >> >> Hi all, >> I wrote a poc to split pglog from Rocksdb and store them into >> standalone space in the block device. >> The updates are done in OSD and BlueStore: >> >> OSD parts: >> 1. Split pglog entries and pglog info from omaps. >> BlueStore: >> 1. Allocate 16M space in block device per PG for storing pglog. >> 2. Per every transaction from OSD, combine pglog entries and >> pglog info, and write them into a block. The block is set to 4k at >> this moment. >> >> Currently, I only make the write workflow work. 
>> With librbd+fio on a cluster with an OSD (on Intel Optane 370G), I got >> the following performance for 4k random writes, and the performance >> got 13.87% better. >> >> Master: >> write: IOPS=48.3k, BW=189MiB/s (198MB/s)(55.3GiB/300009msec) >> slat (nsec): min=1032, max=1683.2k, avg=4345.13, stdev=3988.69 >> clat (msec): min=3, max=123, avg=10.60, stdev= 8.31 >> lat (msec): min=3, max=123, avg=10.60, stdev= 8.31 >> >> Pgsplit branch: >> write: IOPS=55.0k, BW=215MiB/s (225MB/s)(62.0GiB/300010msec) >> slat (nsec): min=1068, max=1339.7k, avg=4360.58, stdev=3878.47 >> clat (msec): min=2, max=120, avg= 9.30, stdev= 6.92 >> lat (msec): min=2, max=120, avg= 9.31, stdev= 6.92 >> >> Here is the POC: >> https://github.com/lixiaoy1/ceph/commits/pglog-split-fastinfo >> The problem is that per every transaction, I use a 4k block to save >> the pglog entries and pglog info which is only 130+920 = 1050 bytes. >> This wastes a lot of space. >> Any suggestions? >> >> Best wishes >> Lisa >> >> On Thu, Apr 5, 2018 at 12:09 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote: >>> >>> >>> >>> On 04/03/2018 09:36 PM, xiaoyan li wrote: >>>> >>>> >>>> On Tue, Apr 3, 2018 at 11:15 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx> >>>> wrote: >>>>> >>>>> >>>>> >>>>> On 04/03/2018 09:56 AM, Mark Nelson wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 04/03/2018 08:27 AM, Sage Weil wrote: >>>>>>> >>>>>>> >>>>>>> On Tue, 3 Apr 2018, Li Wang wrote: >>>>>>>> >>>>>>>> >>>>>>>> Hi, >>>>>>>> Before we move forward, could someone give a test such that >>>>>>>> the pglog not written into rocksdb at all, to see how much is the >>>>>>>> performance improvement as the upper bound, it shoule be less than >>>>>>>> turning on the bluestore_debug_omit_kv_commit >>>>>>> >>>>>>> >>>>>>> +1 >>>>>>> >>>>>>> (The PetStore behavior doesn't tell us anything about how BlueStore >>>>>>> will >>>>>>> behave without the pglog overhead.) 
>>>>>>> >>>>>>> sage >>>>>> >>>>>> >>>>>> >>>>>> We do have some testing of the bluestore's behavior, though it's about >>>>>> 6 >>>>>> months old now: >>>>>> >>>>>> - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD >>>>>> >>>>>> - 128 PGs >>>>>> >>>>>> - stats are sloppy since they only appear every ~10 mins >>>>>> >>>>>> *- default min_pg_log_entries = 1500, trim = default, iops = 26.6K* >>>>>> >>>>>> - Default CF - Size: 65.63MB, KeyIn: 22M, KeyDrop: 17M, Flush: >>>>>> 7.858GB >>>>>> >>>>>> - [M] CF - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: >>>>>> 15.847GB <-- with this workload this is pg log and dup op kv entries >>>>>> >>>>>> - [L] CF - Size: 1.00MB, KeyIn: 181K, KeyDrop: 80K, Flush: >>>>>> 0.320GB <-- deferred writes*- min_pg_log_entries = 10, trim = 10, iops >>>>>> = >>>>>> 24.2K* >>>>>> >>>>>> - Default CF - Size: 23.15MB, KeyIn: 21M, KeyDrop: 16M, Flush: >>>>>> 7.538GB >>>>>> >>>>>> - [M] CF - Size: 60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush: >>>>>> 8.884GB <-- with this workload this is pg log and dup op kv entries >>>>>> >>>>>> - [L] CF - Size: 1.12MB, KeyIn: 188K, KeyDrop: 83K, Flush: >>>>>> 0.331GB <-- deferred writes - min_pg_log_entries = 1, trim = 1, *iops >>>>>> = >>>>>> 23.8K* >>>>>> >>>>>> - Default CF - Size: 68.58MB, KeyIn: 22M, KeyDrop: 17M, Flush: >>>>>> 7.936GB >>>>>> >>>>>> - [M] CF - Size: 96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush: >>>>>> 9.289GB <-- with this workload this is pg log and dup op kv entries >>>>>> >>>>>> - [L] CF - Size: 1.04MB, KeyIn: 209K, KeyDrop: 92K, Flush: >>>>>> 0.368GB <-- deferred writes >>>>>> >>>>>> - min_pg_log_entires = 3000, trim = 1, *iops = 25.8K* >>>>>> >>>>>> * >>>>>> The actual performance variation here I think is much less important >>>>>> than >>>>>> the KeyIn behavior. The NVMe devices in these tests are fast enough >>>>>> to >>>>>> absorb a fair amount of overhead. >>>>> >>>>> >>>>> >>>>> Ugh, sorry. That will teach me to talk in meeting and paste at the >>>>> same >>>>> time. Those were the wrong stats. 
Here are the right ones: >>>>> >>>>>> - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD >>>>>> - 128 PGs >>>>>> - stats are sloppy since they only appear every ~10 mins >>>>>> - min_pg_log_entries = 3000, trim = default, pginfo hack, >>>>>> iops >>>>>> = >>>>>> 27.8K >>>>>> - Default CF - Size: 23.15MB, KeyIn: 24M, KeyDrop: >>>>>> 19M, >>>>>> Flush: 8.662GB >>>>>> - [M] CF - Size: 159.97MB, KeyIn: 162M, KeyDrop: >>>>>> 139M, >>>>>> Flush: 10.335GB <-- with this workload this is pg log and dup op kv >>>>>> entries >>>>>> - [L] CF - Size: 1.39MB, KeyIn: 201K, KeyDrop: >>>>>> 89K, >>>>>> Flush: 0.355GB <-- deferred writes - >>>>>> min_pg_log_entries >>>>>> = >>>>>> 3000, trim = default iops = 28.3K >>>>>> - Default CF - Size: 23.13MB, KeyIn: 25M, KeyDrop: >>>>>> 19M, >>>>>> Flush: 8.762GB >>>>>> - [M] CF - Size: 159.97MB, KeyIn: 202M, KeyDrop: >>>>>> 175M, >>>>>> Flush: 16.890GB <-- with this workload this is pg log and dup op kv >>>>>> entries >>>>>> - [L] CF - Size: 0.86MB, KeyIn: 201K, KeyDrop: >>>>>> 89K, >>>>>> Flush: 0.355GB <-- deferred writes >>>>>> - default min_pg_log_entries = 1500, trim = default, iops = >>>>>> 26.6K >>>>>> - Default CF - Size: 65.63MB, KeyIn: 22M, KeyDrop: >>>>>> 17M, >>>>>> Flush: 7.858GB >>>>>> - [M] CF - Size: 118.09MB, KeyIn: 302M, KeyDrop: >>>>>> 269M, >>>>>> Flush: 15.847GB <-- with this workload this is pg log and dup op kv >>>>>> entries >>>>>> - [L] CF - Size: 1.00MB, KeyIn: 181K, KeyDrop: >>>>>> 80K, >>>>>> Flush: 0.320GB <-- deferred writes >>>>>> - min_pg_log_entries = 10, trim = 10, iops = 24.2K >>>>>> - Default CF - Size: 23.15MB, KeyIn: 21M, KeyDrop: >>>>>> 16M, >>>>>> Flush: 7.538GB >>>>>> - [M] CF - Size: 60.89MB, KeyIn: 277M, KeyDrop: >>>>>> 250M, >>>>>> Flush: 8.884GB <-- with this workload this is pg log and dup op kv >>>>>> entries >>>>>> - [L] CF - Size: 1.12MB, KeyIn: 188K, KeyDrop: >>>>>> 83K, >>>>>> Flush: 0.331GB <-- deferred writes >>>>>> - min_pg_log_entries = 1, trim = 1, iops = 23.8K >>>>>> - Default CF - Size: 68.58MB, KeyIn: 22M, KeyDrop: >>>>>> 17M, >>>>>> Flush: 7.936GB >>>>>> - [M] CF - Size: 96.39MB, KeyIn: 302M, KeyDrop: >>>>>> 262M, >>>>>> Flush: 9.289GB <-- with this workload this is pg log and dup op kv >>>>>> entries >>>>>> - [L] CF - Size: 1.04MB, KeyIn: 209K, KeyDrop: >>>>>> 92K, >>>>>> Flush: 0.368GB <-- deferred writes >>>>>> - min_pg_log_entires = 3000, trim = 1, iops = 25.8K >>>> >>>> >>>> Hi Mark, do you extract above results from compaction stats in Rocksdb >>>> LOG? >>> >>> >>> >>> Correct, except for the IOPS numbers which were from the client >>> benchmark. >>> >>> >>>> >>>> ** Compaction Stats [default] ** >>>> Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) >>>> Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) >>>> Avg(sec) KeyIn KeyDrop >>>> >>>> >>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------- >>>> L0 6/0 270.47 MB 1.1 0.0 0.0 0.0 0.2 >>>> 0.2 0.0 1.0 0.0 154.3 1 4 0.329 >>>> 0 0 >>>> L1 3/0 190.94 MB 0.7 0.0 0.0 0.0 0.0 >>>> 0.0 0.0 0.0 0.0 0.0 0 0 0.000 >>>> 0 0 >>>> Sum 9/0 461.40 MB 0.0 0.0 0.0 0.0 0.2 >>>> 0.2 0.0 1.0 0.0 154.3 1 4 0.329 >>>> 0 0 >>>> Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.2 >>>> 0.2 0.0 1.0 0.0 154.3 1 4 0.329 >>>> 0 0 >>>> Uptime(secs): 9.9 total, 9.9 interval >>>> Flush(GB): cumulative 0.198, interval 0.198 >>>> >>>>> Note specifically how the KeyIn rate drops with the min_pg_log_entries >>>>> increased (ie disable dup_ops) and hacking out pginfo. 
I suspect that >>>>> commenting out log_operation would reduce the KeyIn rate significantly >>>>> further. Again these drives can absorb a lot of this so the >>>>> improvement >>>>> in >>>>> iops is fairly modest (and setting min_pg_log_entries low actually >>>>> hurts!), >>>>> but this isn't just about performance, it's about the behavior that we >>>>> invoke. The Petstore results absolutely show us that on very fast >>>>> storage >>>>> we see a dramatic CPU usage reduction by removing log_operation and >>>>> pginfo, >>>>> so I think we should focus on what kind of behavior we want >>>>> pglog/pginfo/dup_ops to invoke. >>>>> >>>>> Mark >>>>> >>>>> >>>>>> >>>>>> * >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Cheers, >>>>>>>> Li Wang >>>>>>>> >>>>>>>> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>: >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> Based on your above discussion about pglog, I have the following >>>>>>>>> rough >>>>>>>>> design. Please help to give your suggestions. >>>>>>>>> >>>>>>>>> There will be three partitions: raw part for customer IOs, Bluefs >>>>>>>>> for >>>>>>>>> Rocksdb, and pglog partition. >>>>>>>>> The former two partitions are same as current. The pglog partition >>>>>>>>> is >>>>>>>>> splitted into 1M blocks. We allocate blocks for ring buffers per >>>>>>>>> pg. >>>>>>>>> We will have such following data: >>>>>>>>> >>>>>>>>> Allocation bitmap (just in memory) >>>>>>>>> >>>>>>>>> The pglog partition has a bitmap to record which block is allocated >>>>>>>>> or >>>>>>>>> not. We can rebuild it through pg->allocated_block_list when >>>>>>>>> starting, >>>>>>>>> and no need to store it in persistent disk. But we will store basic >>>>>>>>> information about the pglog partition in Rocksdb, like block size, >>>>>>>>> block number etc when the objectstore is initialized. >>>>>>>>> >>>>>>>>> Pg -> allocated_blocks_list >>>>>>>>> >>>>>>>>> When a pg is created and IOs start, we can allocate a block for >>>>>>>>> every >>>>>>>>> pg. Every pglog entry is less than 300 bytes, 1M can store 3495 >>>>>>>>> entries. When total pglog entries increase and exceed the number, >>>>>>>>> we >>>>>>>>> can add a new block to the pg. >>>>>>>>> >>>>>>>>> Pg->start_position >>>>>>>>> >>>>>>>>> Record the oldest valid entry per pg. >>>>>>>>> >>>>>>>>> Pg->next_position >>>>>>>>> >>>>>>>>> Record the next entry to add per pg. The data will be updated >>>>>>>>> frequently, but Rocksdb is suitable for its io mode, and most of >>>>>>>>> data will be merged. >>>>>>>>> >>>>>>>>> Updated Bluestore write progess: >>>>>>>>> >>>>>>>>> When writing data to disk (before metadata updating), we can append >>>>>>>>> the pglog entry to its ring buffer in parallel. >>>>>>>>> After that, submit pg ring buffer changes like pg->next_position, >>>>>>>>> and >>>>>>>>> current other metadata changes to Rocksdb. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari >>>>>>>>> <varada.kari@xxxxxxxxx> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang >>>>>>>>>> <laurence.liwang@xxxxxxxxx> >>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> If we wanna store pg log in a standalone ring buffer, >>>>>>>>>>> another >>>>>>>>>>> candidate >>>>>>>>>>> is the deferred write, why not use the ring buffer as the journal >>>>>>>>>>> for >>>>>>>>>>> 4K random >>>>>>>>>>> write, it should be much more lightweight than rocksdb >>>>>>>>>>> >>>>>>>>>> It will be similar to FileStore implementation, for small writes. 
>>>>>>>>>> That >>>>>>>>>> comes with the same alignment issues and given >>>>>>>>>> write amplification. Rocksdb nicely abstracts that and we don't >>>>>>>>>> make >>>>>>>>>> it to L0 files because of WAL handling. >>>>>>>>>> >>>>>>>>>> Varada >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Cheers, >>>>>>>>>>> Li Wang >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, 28 Mar 2018, Matt Benjamin wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson >>>>>>>>>>>>> <mnelson@xxxxxxxxxx> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2) It sure feels like conceptually the pglog should be >>>>>>>>>>>>>> represented >>>>>>>>>>>>>> as a >>>>>>>>>>>>>> per-pg ring buffer rather than key/value data. Maybe there >>>>>>>>>>>>>> are >>>>>>>>>>>>>> really >>>>>>>>>>>>>> important reasons that it shouldn't be, but I don't currently >>>>>>>>>>>>>> see >>>>>>>>>>>>>> them. As >>>>>>>>>>>>>> far as the objectstore is concerned, it seems to me like there >>>>>>>>>>>>>> are >>>>>>>>>>>>>> valid >>>>>>>>>>>>>> reasons to provide some kind of log interface and perhaps that >>>>>>>>>>>>>> should be >>>>>>>>>>>>>> used for pg_log. That sort of opens the door for different >>>>>>>>>>>>>> object >>>>>>>>>>>>>> store >>>>>>>>>>>>>> implementations fulfilling that functionality in whatever ways >>>>>>>>>>>>>> the >>>>>>>>>>>>>> author >>>>>>>>>>>>>> deems fit. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> In the reddit lingo, pretty much this. We should be >>>>>>>>>>>>> concentrating >>>>>>>>>>>>> on >>>>>>>>>>>>> this direction, or ruling it out. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Yeah, +1 >>>>>>>>>>>> >>>>>>>>>>>> It seems like step 1 is a proof of concept branch that encodes >>>>>>>>>>>> pg_log_entry_t's and writes them to a simple ring buffer. The >>>>>>>>>>>> first >>>>>>>>>>>> questions to answer is (a) whether this does in fact improve >>>>>>>>>>>> things >>>>>>>>>>>> significantly and (b) whether we want to have an independent >>>>>>>>>>>> ring >>>>>>>>>>>> buffer >>>>>>>>>>>> for each PG or try to mix them into one big one for the whole >>>>>>>>>>>> OSD >>>>>>>>>>>> (or >>>>>>>>>>>> maybe per shard). >>>>>>>>>>>> >>>>>>>>>>>> The second question is how that fares on HDDs. My guess is that >>>>>>>>>>>> the >>>>>>>>>>>> current rocksdb strategy is better because it reduces the number >>>>>>>>>>>> of >>>>>>>>>>>> IOs >>>>>>>>>>>> and the additional data getting compacted (and CPU usage) isn't >>>>>>>>>>>> the >>>>>>>>>>>> limiting factor on HDD perforamnce (IOPS are). (But maybe we'll >>>>>>>>>>>> get >>>>>>>>>>>> lucky >>>>>>>>>>>> and the new strategy will be best for both HDD and SSD..) >>>>>>>>>>>> >>>>>>>>>>>> Then we have to modify PGLog to be a complete implementation. A >>>>>>>>>>>> strict >>>>>>>>>>>> ring buffer probably won't work because the PG log might not >>>>>>>>>>>> trim >>>>>>>>>>>> and >>>>>>>>>>>> because log entries are variable length, so there'll probably >>>>>>>>>>>> need >>>>>>>>>>>> to be >>>>>>>>>>>> some simple mapping table (vs a trivial start/end ring buffer >>>>>>>>>>>> position) to >>>>>>>>>>>> deal with that. We have to trim the log periodically, so every >>>>>>>>>>>> so >>>>>>>>>>>> many >>>>>>>>>>>> entries we may want to realign with a min_alloc_size boundary. 
>>>>>>>>>>>> We >>>>>>>>>>>> someones have to back up and rewrite divergent portions of the >>>>>>>>>>>> log >>>>>>>>>>>> (during >>>>>>>>>>>> peering) so we'll need to sort out whether that is a complete >>>>>>>>>>>> reencode/rewrite or whether we keep encoded entries in ram >>>>>>>>>>>> (individually >>>>>>>>>>>> or in chunks), etc etc. >>>>>>>>>>>> >>>>>>>>>>>> sage >>>>>>>>>>>> -- >>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe >>>>>>>>>>>> ceph-devel" in >>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>>>>>>>> More majordomo info at >>>>>>>>>>>> http://vger.kernel.org/majordomo-info.html >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Best wishes >>>>>>>>> Lisa >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" >>>>>> in >>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> >>>> >>>> >>>> >>> >> >> >> > -- Best wishes Lisa -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html