Re: storing pg logs outside of rocksdb


 



Hi Lisa,

I gave your branch a whirl. On the first run I tried to allocate too many PGs, and it ran out of space and asserted. :D We'll need to figure out a mechanism for allocating space that doesn't depend on a hardcoded dev value.

Ok, now for the goods. These were just quick 10 minute tests on a tiny RBD volume, so take the results with a big grain of salt. I expect things to improve for pglog-split-fastinfo when there's more data in rocksdb, though. Despite that, the results are interesting! In pglog-split-fastinfo, rocksdb deals with far fewer keys and spends far less time in compaction, but having a single WAL for everything does mean more coalescing of writes, with the associated benefits (but man, that compaction!). I think on the P3700 you can use 512B sectors, so that would at least help with the write-amp, but it may not offer any performance benefit.

This looks like it would be amazing if you could do cache-line granularity writes!


Test Setup
----------
4k fio/librbd randwrite 16QD for 600s
16GB RBD volume, 1 OSD, 128 PGs
Single Intel 800GB P3700 NVMe


Client

fio Throughput
----------
65a74d4:              22.5K IOPS
pglog-split-fastinfo: 22.9K IOPS

fio 99.95th percentile latency
---------------------------
65a74d4:              25035 usec
pglog-split-fastinfo:  9372 usec


RocksDB

RocksDB Compaction Event Count
------------------------------
65a74d4:              77
pglog-split-fastinfo: 70

RocksDB Compaction Total Time
-----------------------------
65a74d4:              66.85s
pglog-split-fastinfo: 16.39s

RocksDB Compaction Total Input Records
--------------------------------------
65a74d4:              70255125
pglog-split-fastinfo: 13507190

RocksDB Compaction Total Output Records
---------------------------------------
65a74d4:              34126131
pglog-split-fastinfo:  4090954


Collectl

RocksDB WAL Device Write (AVG MB/s)
-----------------------------------
65a74d4:              142.74
pglog-split-fastinfo: 138.71

RocksDB WAL Device Write (AVG OPS/s)
------------------------------------
65a74d4:              6275.37
pglog-split-fastinfo: 7314.67


RocksDB DB Device Write (AVG MB/s)
----------------------------------
65a74d4:              24.88
pglog-split-fastinfo: 10.12

RocksDB DB Device Write (AVG OPS/s)
-----------------------------------
65a74d4:              200.05
pglog-split-fastinfo:  81.81


OSD Block Device Write (AVG MB/s)
---------------------------------
65a74d4:               87.20
pglog-split-fastinfo: 177.79*

OSD Block Device Write (AVG OPS/s)
----------------------------------
65a74d4:              22324.3
pglog-split-fastinfo: 45514.0*

* pglog writes are now hitting the block device instead of rocksdb log


Wallclock Profile of bstore_kv_sync
-----------------------------------

65a74d4:
+ 100.00% clone
  + 100.00% start_thread
    + 100.00% BlueStore::KVSyncThread::entry
      + 100.00% BlueStore::_kv_sync_thread
        + 50.00% RocksDBStore::submit_transaction_sync
        | + 49.88% RocksDBStore::submit_common
        | | + 49.88% rocksdb::DBImpl::Write
        | |   + 49.88% rocksdb::DBImpl::WriteImpl
        | |     + 46.09% rocksdb::DBImpl::WriteToWAL
        | |     | + 45.35% rocksdb::WritableFileWriter::Sync
        | |     | | + 45.23% rocksdb::WritableFileWriter::SyncInternal
        | |     | | | + 45.23% BlueRocksWritableFile::Sync
        | |     | | |   + 45.23% fsync
        | |     | | |     + 45.11% BlueFS::_fsync
        | |     | | |     | + 28.73% BlueFS::_flush_bdev_safely
        | |     | | |     | | + 27.63% BlueFS::flush_bdev
        | |     | | |     | | | + 27.63% KernelDevice::flush
        | |     | | |     | | |   + 27.63% fdatasync
        | |     | | |     | | + 0.49% lock
        | |     | | |     | | + 0.24% BlueFS::wait_for_aio
        | |     | | |     | | + 0.12% unlock
        | |     | | |     | | + 0.12% clear
        | |     | | |     | | + 0.12% BlueFS::_claim_completed_aios
        | |     | | |     | + 16.38% BlueFS::_flush
        | |     | | |     |   + 16.38% BlueFS::_flush_range
        | |     | | |     |     + 14.30% KernelDevice::aio_write
        | |     | | |     |     | + 14.30% KernelDevice::_sync_write
        | |     | | |     |     |   + 7.58% pwritev64
        | |     | | |     |     |   + 6.72% sync_file_range
        | |     | | |     |     + 1.10% ~list
        | |     | | |     |     + 0.37% __memset_sse2
        | |     | | |     |     + 0.24% bluefs_fnode_t::seek
        | |     | | |     |     + 0.12% ceph::buffer::list::substr_of
        | |     | | |     |     + 0.12% ceph::buffer::list::claim_append_piecewise
        | |     | | |     + 0.12% unique_lock
        | |     | | + 0.12% rocksdb::WritableFileWriter::Flush
        | |     | + 0.73% rocksdb::DBImpl::WriteToWAL
        | |     + 2.32% rocksdb::WriteBatchInternal::InsertInto
        | |     + 0.37% rocksdb::WriteThread::ExitAsBatchGroupLeader
        | |     + 0.12% ~WriteContext
        | |     + 0.12% rocksdb::WriteThread::JoinBatchGroup
        | |     + 0.12% rocksdb::WriteThread::EnterAsBatchGroupLeader
        | |     + 0.12% rocksdb::StopWatch::~StopWatch
        | |     + 0.12% rocksdb::InstrumentedMutex::Lock
        | |     + 0.12% rocksdb::DBImpl::MarkLogsSynced
        | |     + 0.12% operator=
        | |     + 0.12% PerfStepTimer
        | + 0.12% ceph_clock_now
        + 38.51% RocksDBStore::submit_transaction
        | + 38.51% RocksDBStore::submit_common
        |   + 38.14% rocksdb::DBImpl::Write
        |     + 38.02% rocksdb::DBImpl::WriteImpl
        |       + 25.18% rocksdb::WriteBatchInternal::InsertInto
        |       | + 25.18% rocksdb::WriteBatch::Iterate
        |       |   + 23.59% rocksdb::MemTableInserter::PutCF
        |       |   | + 23.59% rocksdb::MemTableInserter::PutCFImpl
        |       |   |   + 23.35% rocksdb::MemTable::Add
        |       |   |   | + 16.99% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert<false>
        |       |   |   | | + 14.79% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::RecomputeSpliceLevels
        |       |   |   | | | + 14.18% FindSpliceForLevel<true>
        |       |   |   | | |   + 8.56% KeyIsAfterNode
        |       |   |   | | |   + 0.73% Next
        |       |   |   | | + 1.34% KeyIsAfterNode
        |       |   |   | | + 0.86% rocksdb::MemTable::KeyComparator::operator()
        |       |   |   | + 4.89% __memcpy_ssse3
        |       |   |   | + 0.61% rocksdb::MemTable::UpdateFlushState
        |       |   |   | + 0.37% rocksdb::(anonymous namespace)::SkipListRep::Allocate
        |       |   |   + 0.12% rocksdb::ColumnFamilyMemTablesImpl::GetMemTable
        |       |   + 0.61% rocksdb::ReadRecordFromWriteBatch
        |       |   + 0.61% rocksdb::MemTableInserter::DeleteCF
        |       |   + 0.12% operator=
        |       + 11.49% rocksdb::DBImpl::WriteToWAL
        |       | + 11.37% rocksdb::DBImpl::WriteToWAL
        |       | | + 11.12% rocksdb::log::Writer::AddRecord
        |       | |   + 11.12% rocksdb::log::Writer::EmitPhysicalRecord
        |       | |     + 8.07% rocksdb::WritableFileWriter::Append
        |       | |     + 2.08% rocksdb::crc32c::crc32c_3way
        |       | |     + 0.98% rocksdb::WritableFileWriter::Flush
        |       | + 0.12% rocksdb::WriteBatchInternal::SetSequence
        |       + 0.24% rocksdb::WriteThread::EnterAsBatchGroupLeader
        |       + 0.12% rocksdb::WriteThread::ExitAsBatchGroupLeader
        |       + 0.12% Writer
        |       + 0.12% WriteContext
        + 6.60% std::condition_variable::wait(std::unique_lock<std::mutex>&)
        + 1.10% RocksDBStore::RocksDBTransactionImpl::rm_single_key
        + 0.98% std::condition_variable::notify_one()
        + 0.73% BlueStore::_txc_applied_kv
        + 0.37% KernelDevice::flush
        + 0.24% get_deferred_key
        + 0.24% RocksDBStore::get_transaction
        + 0.12% ~basic_string
        + 0.12% swap
        + 0.12% lock
        + 0.12% end
        + 0.12% ceph_clock_now
        + 0.12% Throttle::put


pglog-split-fastinfo:
+ 100.00% clone
  + 100.00% start_thread
    + 100.00% BlueStore::KVSyncThread::entry
      + 100.00% BlueStore::_kv_sync_thread
        + 53.03% RocksDBStore::submit_transaction_sync
        | + 52.75% RocksDBStore::submit_common
        | | + 52.75% rocksdb::DBImpl::Write
        | |   + 52.60% rocksdb::DBImpl::WriteImpl
        | |     + 48.55% rocksdb::DBImpl::WriteToWAL
        | |     | + 47.83% rocksdb::WritableFileWriter::Sync
        | |     | | + 47.54% rocksdb::WritableFileWriter::SyncInternal
        | |     | | | + 47.54% BlueRocksWritableFile::Sync
        | |     | | |   + 47.54% fsync
        | |     | | |     + 47.54% BlueFS::_fsync
        | |     | | |       + 31.50% BlueFS::_flush_bdev_safely
        | |     | | |       | + 31.07% BlueFS::flush_bdev
        | |     | | |       | | + 31.07% KernelDevice::flush
        | |     | | |       | |   + 30.92% fdatasync
        | |     | | |       | |   + 0.14% lock_guard
        | |     | | |       | + 0.14% lock
        | |     | | |       | + 0.14% BlueFS::wait_for_aio
        | |     | | |       + 16.04% BlueFS::_flush
        | |     | | |         + 16.04% BlueFS::_flush_range
        | |     | | |           + 11.99% KernelDevice::aio_write
        | |     | | |           | + 11.99% KernelDevice::_sync_write
        | |     | | |           |   + 6.36% pwritev64
        | |     | | |           |   + 5.49% sync_file_range
        | |     | | |           + 1.59% ~list
        | |     | | |           + 0.43% IOContext::aio_wait
        | |     | | |           + 0.29% ceph::buffer::list::claim_append_piecewise
        | |     | | |           + 0.14% list
        | |     | | |           + 0.14% ceph::buffer::ptr::zero
        | |     | | |           + 0.14% ceph::buffer::list::substr_of
        | |     | | |           + 0.14% ceph::buffer::list::page_aligned_appender::flush
        | |     | | |           + 0.14% ceph::buffer::list::list
        | |     | | |           + 0.14% ceph::buffer::list::append
        | |     | | |           + 0.14% __memset_sse2
        | |     | | |           + 0.14% _ZN4ceph6buffer4list6appendERKNS0_3ptrEjj@plt
        | |     | | + 0.29% rocksdb::WritableFileWriter::Flush
        | |     | + 0.43% rocksdb::DBImpl::WriteToWAL
        | |     | + 0.14% rocksdb::DBImpl::MergeBatch
        | |     | + 0.14% operator=
        | |     + 3.47% rocksdb::WriteBatchInternal::InsertInto
        | |     + 0.14% rocksdb::InstrumentedMutex::Lock
        | |     + 0.14% Writer
        | |     + 0.14% LastSequence
        | + 0.14% operator-
        + 29.05% RocksDBStore::submit_transaction
        | + 28.61% RocksDBStore::submit_common
        | | + 28.32% rocksdb::DBImpl::Write
        | |   + 28.03% rocksdb::DBImpl::WriteImpl
        | |     + 17.77% rocksdb::WriteBatchInternal::InsertInto
        | |     | + 17.77% rocksdb::WriteBatch::Iterate
        | |     |   + 17.20% rocksdb::MemTableInserter::PutCF
        | |     |   | + 17.20% rocksdb::MemTableInserter::PutCFImpl
        | |     |   |   + 16.47% rocksdb::MemTable::Add
        | |     |   |   | + 12.72% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert<false>
        | |     |   |   | | + 10.69% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::RecomputeSpliceLevels
        | |     |   |   | | | + 10.40% FindSpliceForLevel<true>
        | |     |   |   | | |   + 6.79% KeyIsAfterNode
        | |     |   |   | | |   + 0.29% Next
        | |     |   |   | | + 1.01% rocksdb::MemTable::KeyComparator::operator()
        | |     |   |   | | + 0.72% KeyIsAfterNode
        | |     |   |   | | + 0.14% SetNext
        | |     |   |   | + 3.32% __memcpy_ssse3
        | |     |   |   | + 0.29% rocksdb::MemTable::UpdateFlushState
        | |     |   |   | + 0.14% rocksdb::(anonymous namespace)::SkipListRep::Allocate
        | |     |   |   + 0.29% SeekToColumnFamily
        | |     |   |   + 0.29% CheckMemtableFull
        | |     |   |   + 0.14% rocksdb::ColumnFamilyMemTablesImpl::GetMemTable
        | |     |   + 0.14% rocksdb::ReadRecordFromWriteBatch
        | |     |   + 0.14% operator=
        | |     + 9.10% rocksdb::DBImpl::WriteToWAL
        | |     + 0.29% rocksdb::DBImpl::PreprocessWrite
        | |     + 0.14% ~WriteContext
        | |     + 0.14% rocksdb::WriteThread::ExitAsBatchGroupLeader
        | |     + 0.14% operator=
        | |     + 0.14% Writer
        | |     + 0.14% FinalStatus
        | + 0.14% ceph_clock_now
        | + 0.14% PerfCounters::tinc
        + 12.86% std::condition_variable::wait(std::unique_lock<std::mutex>&)
        | + 12.86% pthread_cond_wait@@GLIBC_2.3.2
        |   + 0.29% __pthread_mutex_cond_lock
        + 1.45% BlueStore::_txc_applied_kv
        + 0.87% std::condition_variable::notify_one()
        + 0.72% KernelDevice::flush
        + 0.43% RocksDBStore::RocksDBTransactionImpl::rm_single_key
        + 0.29% ~shared_ptr
        + 0.14% ~deque
        + 0.14% unique_lock
        + 0.14% swap
        + 0.14% operator--
        + 0.14% log_state_latency
        + 0.14% lock
        + 0.14% get_deferred_key
        + 0.14% deque

Mark

On 06/20/2018 03:19 AM, xiaoyan li wrote:
  Hi all,
I wrote a POC to split the pglog out of Rocksdb and store it in
standalone space on the block device.
The updates are done in the OSD and in BlueStore:

OSD parts:
1. Split the pglog entries and pglog info out of the omaps.
BlueStore:
1. Allocate 16M of space in the block device per PG for storing the pglog.
2. For every transaction from the OSD, combine the pglog entries and
pglog info and write them into a block. The block size is set to 4k at
the moment.
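
Roughly, the per-transaction write looks like this (a simplified sketch
only; the struct and function names below are illustrative, not the
actual code in the branch):

// Sketch: pack one transaction's pglog entries + pglog info into a single
// zero-padded 4k block inside the PG's reserved 16M region (hypothetical names).
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

static const uint64_t PGLOG_BLOCK_SIZE = 4096;      // one 4k block per transaction
static const uint64_t PGLOG_PG_SPACE   = 16 << 20;  // 16M reserved per PG

struct pglog_record {
  std::string encoded_entries;  // encoded pglog entries (~130 bytes here)
  std::string encoded_info;     // encoded pglog info (~920 bytes here)
};

std::vector<char> make_block(const pglog_record& r) {
  size_t need = r.encoded_entries.size() + r.encoded_info.size();
  if (need > PGLOG_BLOCK_SIZE)
    throw std::length_error("pglog record does not fit in one block");
  std::vector<char> block(PGLOG_BLOCK_SIZE, 0);  // pad to the full block size
  memcpy(block.data(), r.encoded_entries.data(), r.encoded_entries.size());
  memcpy(block.data() + r.encoded_entries.size(),
         r.encoded_info.data(), r.encoded_info.size());
  return block;
}

// Byte offset of the next block, wrapping within the PG's 16M region.
uint64_t next_block_offset(uint64_t pg_base, uint64_t next_block) {
  return pg_base + (next_block * PGLOG_BLOCK_SIZE) % PGLOG_PG_SPACE;
}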

Currently, I have only made the write workflow work.
With librbd+fio on a cluster with one OSD (on an Intel Optane 370G), I
got the following performance for 4k random writes; performance
improved by 13.87%.

Master:
   write: IOPS=48.3k, BW=189MiB/s (198MB/s)(55.3GiB/300009msec)
     slat (nsec): min=1032, max=1683.2k, avg=4345.13, stdev=3988.69
     clat (msec): min=3, max=123, avg=10.60, stdev= 8.31
      lat (msec): min=3, max=123, avg=10.60, stdev= 8.31

Pgsplit branch:
   write: IOPS=55.0k, BW=215MiB/s (225MB/s)(62.0GiB/300010msec)
     slat (nsec): min=1068, max=1339.7k, avg=4360.58, stdev=3878.47
     clat (msec): min=2, max=120, avg= 9.30, stdev= 6.92
      lat (msec): min=2, max=120, avg= 9.31, stdev= 6.92

Here is the POC: https://github.com/lixiaoy1/ceph/commits/pglog-split-fastinfo
The problem is that for every transaction, I use a 4k block to save the
pglog entries and pglog info, which are only 130+920 = 1050 bytes
(about a quarter of the block). This wastes a lot of space: roughly
three quarters of every block is padding.
Any suggestions?

Best wishes
Lisa

On Thu, Apr 5, 2018 at 12:09 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:


On 04/03/2018 09:36 PM, xiaoyan li wrote:

On Tue, Apr 3, 2018 at 11:15 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx>
wrote:


On 04/03/2018 09:56 AM, Mark Nelson wrote:



On 04/03/2018 08:27 AM, Sage Weil wrote:

On Tue, 3 Apr 2018, Li Wang wrote:

Hi,
     Before we move forward, could someone run a test where the pglog is
not written into rocksdb at all, to see how much the performance
improvement is as an upper bound?  It should be less than turning on
bluestore_debug_omit_kv_commit.

+1

(The PetStore behavior doesn't tell us anything about how BlueStore will
behave without the pglog overhead.)

sage


We do have some testing of the bluestore's behavior, though it's about 6
months old now:

- ~1 hour 4K random overwrites to RBD on 1 NVMe OSD

- 128 PGs

- stats are sloppy since they only appear every ~10 mins

- default min_pg_log_entries = 1500, trim = default, iops = 26.6K

     - Default CF - Size:  65.63MB, KeyIn:  22M, KeyDrop:  17M, Flush:
7.858GB

     - [M] CF     - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush:
15.847GB <-- with this workload this is pg log and dup op kv entries

     - [L] CF     - Size:   1.00MB, KeyIn: 181K, KeyDrop:  80K, Flush:
0.320GB <-- deferred writes
- min_pg_log_entries = 10, trim = 10, iops = 24.2K

     - Default CF - Size:  23.15MB, KeyIn:  21M, KeyDrop:  16M, Flush:
7.538GB

     - [M] CF     - Size:  60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush:
8.884GB <-- with this workload this is pg log and dup op kv entries

     - [L] CF     - Size:   1.12MB, KeyIn: 188K, KeyDrop:  83K, Flush:
0.331GB <-- deferred writes
- min_pg_log_entries = 1, trim = 1, iops = 23.8K

     - Default CF - Size:  68.58MB, KeyIn:  22M, KeyDrop:  17M, Flush:
7.936GB

     - [M] CF     - Size:  96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush:
9.289GB <-- with this workload this is pg log and dup op kv entries

     - [L] CF     - Size:   1.04MB, KeyIn: 209K, KeyDrop:  92K, Flush:
0.368GB <-- deferred writes

- min_pg_log_entries = 3000, trim = 1, iops = 25.8K

The actual performance variation here I think is much less important
than the KeyIn behavior.  The NVMe devices in these tests are fast
enough to absorb a fair amount of overhead.


Ugh, sorry.  That will teach me to talk in a meeting and paste at the
same time.  Those were the wrong stats.  Here are the right ones:

          - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
          - 128 PGs
          - stats are sloppy since they only appear every ~10 mins
          - min_pg_log_entries = 3000, trim = default, pginfo hack, iops = 27.8K
              - Default CF - Size:  23.15MB, KeyIn:  24M, KeyDrop:  19M,
Flush:  8.662GB
              - [M] CF     - Size: 159.97MB, KeyIn: 162M, KeyDrop: 139M,
Flush: 10.335GB <-- with this workload this is pg log and dup op kv
entries
              - [L] CF     - Size:   1.39MB, KeyIn: 201K, KeyDrop:  89K,
Flush:  0.355GB <-- deferred writes
          - min_pg_log_entries = 3000, trim = default, iops = 28.3K
              - Default CF - Size:  23.13MB, KeyIn:  25M, KeyDrop:  19M,
Flush:  8.762GB
              - [M] CF     - Size: 159.97MB, KeyIn: 202M, KeyDrop: 175M,
Flush: 16.890GB <-- with this workload this is pg log and dup op kv
entries
              - [L] CF     - Size:   0.86MB, KeyIn: 201K, KeyDrop:  89K,
Flush:  0.355GB <-- deferred writes
          - default min_pg_log_entries = 1500, trim = default, iops =
26.6K
              - Default CF - Size:  65.63MB, KeyIn:  22M, KeyDrop:  17M,
Flush:  7.858GB
              - [M] CF     - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M,
Flush: 15.847GB <-- with this workload this is pg log and dup op kv
entries
              - [L] CF     - Size:   1.00MB, KeyIn: 181K, KeyDrop:  80K,
Flush:  0.320GB <-- deferred writes
          - min_pg_log_entries = 10, trim = 10, iops = 24.2K
              - Default CF - Size:  23.15MB, KeyIn:  21M, KeyDrop:  16M,
Flush:  7.538GB
              - [M] CF     - Size:  60.89MB, KeyIn: 277M, KeyDrop: 250M,
Flush:  8.884GB <-- with this workload this is pg log and dup op kv
entries
              - [L] CF     - Size:   1.12MB, KeyIn: 188K, KeyDrop:  83K,
Flush:  0.331GB <-- deferred writes
          - min_pg_log_entries = 1, trim = 1, iops = 23.8K
              - Default CF - Size:  68.58MB, KeyIn:  22M, KeyDrop:  17M,
Flush:  7.936GB
              - [M] CF     - Size:  96.39MB, KeyIn: 302M, KeyDrop: 262M,
Flush:  9.289GB <-- with this workload this is pg log and dup op kv
entries
              - [L] CF     - Size:   1.04MB, KeyIn: 209K, KeyDrop:  92K,
Flush:  0.368GB <-- deferred writes
          - min_pg_log_entries = 3000, trim = 1, iops = 25.8K

Hi Mark, did you extract the above results from the compaction stats in
the Rocksdb LOG?


Correct, except for the IOPS numbers which were from the client benchmark.



** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
    L0      6/0   270.47 MB   1.1      0.0     0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329      0      0
    L1      3/0   190.94 MB   0.7      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000      0      0
   Sum      9/0   461.40 MB   0.0      0.0     0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329      0      0
   Int      0/0     0.00 KB   0.0      0.0     0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329      0      0
Uptime(secs): 9.9 total, 9.9 interval
Flush(GB): cumulative 0.198, interval 0.198

Note specifically how the KeyIn rate drops with min_pg_log_entries
increased (i.e. dup_ops disabled) and with pginfo hacked out.  I suspect
that commenting out log_operation would reduce the KeyIn rate
significantly further.  Again, these drives can absorb a lot of this, so
the improvement in iops is fairly modest (and setting min_pg_log_entries
low actually hurts!), but this isn't just about performance, it's about
the behavior that we invoke.  The Petstore results absolutely show us
that on very fast storage we see a dramatic CPU usage reduction by
removing log_operation and pginfo, so I think we should focus on what
kind of behavior we want pglog/pginfo/dup_ops to invoke.

Mark







Cheers,
Li Wang

2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:

Hi all,

Based on your above discussion about pglog, I have the following rough
design. Please help by giving your suggestions.

There will be three partitions: a raw partition for customer IOs, Bluefs
for Rocksdb, and a pglog partition.
The former two partitions are the same as they are today. The pglog
partition is split into 1M blocks, and we allocate blocks for the ring
buffers per pg. We will have the following data:

Allocation bitmap (just in memory)

The pglog partition has a bitmap to record which blocks are allocated.
We can rebuild it from pg->allocated_block_list when starting, so there
is no need to store it on persistent disk. But we will store basic
information about the pglog partition in Rocksdb, like the block size
and block count, when the objectstore is initialized.

Pg -> allocated_blocks_list

When a pg is created and IOs start, we allocate a block for the pg.
Every pglog entry is less than 300 bytes, so a 1M block can store 3495
entries. When the total number of pglog entries grows beyond that, we
add a new block to the pg.

Pg->start_position

Record the oldest valid entry per pg.

Pg->next_position

Record the next entry to add per pg. This data will be updated
frequently, but Rocksdb is well suited to that IO pattern, and most of
the updates will be merged.

Updated Bluestore write process:

When writing data to disk (before the metadata update), we can append
the pglog entry to its ring buffer in parallel.
After that, submit the pg ring buffer changes, like pg->next_position,
together with the other current metadata changes to Rocksdb.
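
In code, the per-PG metadata would look roughly like this (an
illustrative sketch only; the names are made up and this is not the
actual implementation):

// Rough sketch of the data described above; names are hypothetical.
#include <cstdint>
#include <vector>

static const uint64_t PGLOG_RING_BLOCK_SIZE = 1 << 20;  // 1M blocks in the pglog partition

struct pg_ring_buffer {
  std::vector<uint64_t> allocated_blocks;  // pg -> allocated_blocks_list (persisted)
  uint64_t start_position = 0;             // oldest valid entry (persisted)
  uint64_t next_position  = 0;             // next entry to append (persisted, updated frequently)
};

struct pglog_partition {
  std::vector<bool> allocation_bitmap;  // in memory only; rebuilt from the PGs' block lists at startup
  uint64_t block_size  = PGLOG_RING_BLOCK_SIZE;
  uint64_t block_count = 0;             // block size/count etc. stored once in Rocksdb at init
};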


On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari <varada.kari@xxxxxxxxx>
wrote:

On Fri, Mar 30, 2018 at 1:01 PM, Li Wang <laurence.liwang@xxxxxxxxx>
wrote:

Hi,
     If we want to store the pg log in a standalone ring buffer, another
candidate is the deferred write: why not use the ring buffer as the
journal for 4K random writes? It should be much more lightweight than
rocksdb.

It will be similar to the FileStore implementation for small writes.
That comes with the same alignment issues and the associated write
amplification. Rocksdb nicely abstracts that, and we don't make it to
L0 files because of WAL handling.

Varada

Cheers,
Li Wang


2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:

On Wed, 28 Mar 2018, Matt Benjamin wrote:

On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx>
wrote:

On 03/28/2018 12:21 PM, Adam C. Emerson wrote:

2) It sure feels like conceptually the pglog should be represented as a
per-pg ring buffer rather than key/value data.  Maybe there are really
important reasons that it shouldn't be, but I don't currently see them.
As far as the objectstore is concerned, it seems to me like there are
valid reasons to provide some kind of log interface and perhaps that
should be used for pg_log.  That sort of opens the door for different
object store implementations fulfilling that functionality in whatever
ways the author deems fit.

In the reddit lingo, pretty much this.  We should be concentrating on
this direction, or ruling it out.

Yeah, +1

It seems like step 1 is a proof of concept branch that encodes
pg_log_entry_t's and writes them to a simple ring buffer.  The first
questions to answer are (a) whether this does in fact improve things
significantly and (b) whether we want to have an independent ring buffer
for each PG or try to mix them into one big one for the whole OSD (or
maybe per shard).

The second question is how that fares on HDDs.  My guess is that the
current rocksdb strategy is better because it reduces the number of IOs
and the additional data getting compacted (and CPU usage) isn't the
limiting factor on HDD performance (IOPS are).  (But maybe we'll get
lucky and the new strategy will be best for both HDD and SSD..)

Then we have to modify PGLog to be a complete implementation.  A strict
ring buffer probably won't work because the PG log might not trim and
because log entries are variable length, so there'll probably need to be
some simple mapping table (vs a trivial start/end ring buffer position)
to deal with that.  We have to trim the log periodically, so every so
many entries we may want to realign with a min_alloc_size boundary.  We
sometimes have to back up and rewrite divergent portions of the log
(during peering), so we'll need to sort out whether that is a complete
reencode/rewrite or whether we keep encoded entries in ram (individually
or in chunks), etc etc.
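
To make that concrete, here is one very rough sketch of the kind of
mapping table this implies (hypothetical names only, not an actual
design):

// Hypothetical sketch: map log version -> extent in the per-PG log space,
// since entries are variable length and a plain start/end pair isn't enough.
#include <cstdint>
#include <map>

struct pg_log_extent {
  uint64_t offset;  // byte offset of an encoded run of entries
  uint32_t length;  // variable length, so fixed-size slots don't work
};

// Trimming erases from the front; divergent entries found during peering are
// erased from the back and re-encoded/rewritten, ideally realigned to a
// min_alloc_size boundary every so many entries.
using pg_log_index = std::map<uint64_t, pg_log_extent>;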

sage



--
Best wishes
Lisa











