CDM for: pg log, pg info, and dup ops data storage

Hi Folks,

I had a really good conversation with Josh Durgin on Friday and I wanted to write it down before I forgot everything. I figured I might as well post it here too since folks might be interested or come up with even better ideas.

Background: Recently I've been looking into how much work RocksDB is doing in Bluestore dealing with the creation and deletion of pg log, pg info, and dup op key/value pairs. Now that we are separating OMAP data into its own column family, it is possible to look at OMAP statistics independently of Bluestore metadata:

- ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
- 128 PGs
- stats are sloppy since they only appear every ~10 mins
- [D] CF = onodes, etc
- [M] CF = pglog/dup_ops/pginfo
- [L] CF = deferred IO

First the default:

min_pg_log_entries = 1500, trim = default, iops = 26.6K

[D] CF - Size:  65.63MB, KeyIn:  22M, KeyDrop:  17M, Flush:  7.858GB
[M] CF - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB
[L] CF - Size:   1.00MB, KeyIn: 181K, KeyDrop:  80K, Flush:  0.320GB

With dup ops code path disabled by increasing the min pglog entries:

min_pg_log_entries = 3000, trim = default, iops = 28.3K

[D] CF - Size:  23.13MB, KeyIn:  25M, KeyDrop:  19M, Flush:  8.762GB
[M] CF - Size: 159.97MB, KeyIn: 202M, KeyDrop: 175M, Flush: 16.890GB
[L] CF - Size:   0.86MB, KeyIn: 201K, KeyDrop:  89K, Flush:  0.355GB

Notice the large drop in KeyIn and KeyDrop records for the [M] CF (302M -> 202M and 269M -> 175M, roughly a third fewer of each)!

And now adding a hack to remove pginfo:

min_pg_log_entries = 3000, trim = default, pginfo hack, iops = 27.8K

[D] CF - Size:  23.15MB, KeyIn:  24M, KeyDrop:  19M, Flush:  8.662GB
[M] CF - Size: 159.97MB, KeyIn: 162M, KeyDrop: 139M, Flush: 10.335GB
[L] CF - Size:   1.39MB, KeyIn: 201K, KeyDrop:  89K, Flush:  0.355GB

There are still 162M KeyIn and 139M KeyDrop events for pglog remaining!
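To put rough numbers on how much each change moved the [M] CF counters, here is a small calculation using only the KeyIn/KeyDrop figures quoted above (in millions). The run labels and the pct_drop helper are just names made up for this sketch, and the stats are sloppy to begin with, so treat the percentages as approximate:

```python
# [M] CF counters from the three runs above, in millions of events.
runs = [
    ("default (min_pg_log_entries = 1500)", 302, 269),
    ("min_pg_log_entries = 3000", 202, 175),
    ("min_pg_log_entries = 3000 + pginfo hack", 162, 139),
]

def pct_drop(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before

# Compare each run against the run immediately before it.
rows = []
for (_, prev_in, prev_drop), (name, key_in, key_drop) in zip(runs, runs[1:]):
    rows.append((name, pct_drop(prev_in, key_in), pct_drop(prev_drop, key_drop)))

for name, d_in, d_drop in rows:
    print(f"{name}: KeyIn -{d_in:.1f}%, KeyDrop -{d_drop:.1f}%")
```

Disabling the dup ops path removes about a third of the key traffic, while the pginfo hack only removes about a fifth of what is left.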

So as a result of this I started thinking that there must be a better way to deal with this kind of data than constantly writing and deleting key/value pairs from the database, at least for dup ops and pg log. Based on my conversation with Josh it sounds like we are already reusing pginfo keys which might explain why hacking it out didn't result in a 33% drop in key inserts and deletes.

For pg log and dup ops, my first thought was to create a simple on-disk ring buffer. The idea is that instead of specifying a "min" and "max" number of pglog entries, the user would just assign a portion of the disk (in Bluestore perhaps a portion of the DB device) that would be reserved for the ring buffer, and we'd fill it with as many pglog and dup op entries as we could.

Josh keenly pointed out that we want to make sure that inactive PG logs don't get overwritten by active ones. To deal with this, I proposed a modification: use a series of logs and reference count the per-pg links. You'd never delete an old log until all references to it are removed. The problem here is that if you had a lot of inactive PGs, you could end up with pretty big space amplification.

Advantages:

1) Avoid key creation/deletion
2) sequential appends only
3) Write once (no direct write amplification)

Disadvantages:

1) potential for large space amp when PGs are inactive
2) lots of new code
3) potential for thrashing with the RocksDB WAL and other writes
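As a rough illustration of the "series of logs with reference counts" idea, here is a toy in-memory model. Everything in it (SegmentedLog, entries_per_segment, the PG names) is hypothetical and made up for this sketch; it is not Ceph code and it ignores the on-disk layout entirely. It only shows how a single idle PG can keep an old log alive:

```python
from collections import deque

class SegmentedLog:
    """Toy model: append-only segments, freed only when no PG references them."""

    def __init__(self, entries_per_segment):
        self.entries_per_segment = entries_per_segment
        self.next_id = 0
        self.segments = {}    # segment id -> {"used": int, "refs": {pg: count}}
        self.pg_entries = {}  # pg -> deque of segment ids, oldest entry first
        self._open_segment()

    def _open_segment(self):
        self.active = self.next_id
        self.segments[self.active] = {"used": 0, "refs": {}}
        self.next_id += 1

    def _maybe_free(self, seg_id):
        # A segment can go away only if it is closed and nothing links to it.
        if seg_id != self.active and not self.segments[seg_id]["refs"]:
            del self.segments[seg_id]

    def append(self, pg):
        if self.segments[self.active]["used"] == self.entries_per_segment:
            old = self.active
            self._open_segment()
            self._maybe_free(old)
        seg = self.segments[self.active]
        seg["used"] += 1
        seg["refs"][pg] = seg["refs"].get(pg, 0) + 1
        self.pg_entries.setdefault(pg, deque()).append(self.active)

    def trim(self, pg, keep):
        """Drop the oldest entries of `pg` until at most `keep` remain."""
        entries = self.pg_entries.get(pg, deque())
        while len(entries) > keep:
            seg_id = entries.popleft()
            refs = self.segments[seg_id]["refs"]
            refs[pg] -= 1
            if refs[pg] == 0:
                del refs[pg]
            self._maybe_free(seg_id)

log = SegmentedLog(entries_per_segment=4)
log.append("pg.idle")            # one entry from a PG that then goes quiet
for _ in range(19):
    log.append("pg.busy")
    log.trim("pg.busy", keep=2)  # the busy PG only keeps its last 2 entries

pinned = sorted(log.segments)    # segment 0 survives only because of pg.idle
log.trim("pg.idle", keep=0)
after = sorted(log.segments)     # once the idle PG lets go, segment 0 is freed
print(pinned, after)
```

In this toy run a single entry from the idle PG keeps a whole 4-entry segment around, which is the space amplification concern in miniature.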

Josh pointed out another potential option: create a ring buffer in RocksDB with sequentially ordered keys. The key names themselves are not particularly important: some prefix along with, say, an 8-byte index. In this scenario we'd still end up writing to the WAL first and potentially compacting into an SST file, but the overhead should be less (far less?) than the current behavior. We could also make it general enough that we could have per-pg ring buffers for pglog data and a global ring buffer for dup ops (which is sort of the best-case scenario for both situations).

Advantages:

1) Reuse the KeyValueDB interface (and all of the optimized RocksDB code!)
2) Still avoid new key creation/deletion
3) WAL writes are still sequentially laid out and compaction overhead may not be too bad
4) Shared WAL means less thrashing on spinning disks than multiple independent WALs.
5) Space amp potentially less bad than on-disk ring-buffer approach.

Disadvantages:

1) still has compaction overhead
2) less control over what RocksDB does
3) shared WAL means implications for global KV traffic, even with independent column families.
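Here is a minimal sketch of the "ring buffer in the KV store" idea, assuming a fixed set of keys made of a prefix plus an 8-byte big-endian slot index. A plain Python dict stands in for the KeyValueDB, and the names (KVRingBuffer, the "pglog.1.0." prefix, the capacity) are invented for illustration. In a real RocksDB each overwrite would still go through the WAL and leave an old version for compaction to drop, so this only shows the key reuse, not the cost:

```python
import struct

class KVRingBuffer:
    """Ring buffer over a dict-like KV store that reuses a fixed set of keys."""

    def __init__(self, kv, prefix, capacity):
        self.kv = kv              # dict standing in for a KeyValueDB
        self.prefix = prefix      # bytes, e.g. one prefix per PG
        self.capacity = capacity  # number of slots (keys) in the ring
        self.next_seq = 0         # sequence number of the next entry

    def _key(self, slot):
        # Big-endian so that keys sort in slot order under bytewise comparison.
        return self.prefix + struct.pack(">Q", slot)

    def append(self, value):
        # Overwrite the oldest slot in place; no key is ever deleted.
        slot = self.next_seq % self.capacity
        self.kv[self._key(slot)] = (self.next_seq, value)
        self.next_seq += 1

    def entries(self):
        """Yield (seq, value) for the live entries, oldest first."""
        start = max(0, self.next_seq - self.capacity)
        for seq in range(start, self.next_seq):
            stored_seq, value = self.kv[self._key(seq % self.capacity)]
            yield stored_seq, value

kv = {}
ring = KVRingBuffer(kv, prefix=b"pglog.1.0.", capacity=3)
for i in range(5):
    ring.append(f"op{i}")

live = list(ring.entries())
print(len(kv), live)
```

After five appends into a three-slot ring the store still holds only three keys, and the two oldest entries have been overwritten rather than deleted.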

No matter what we do, I think we need to get away from the current behavior for pglog and dup ops. Unless I'm missing something (and I could be!) I don't think we need to be creating and deleting keys constantly like we currently are doing. I suspect that either of the above approaches would improve the situation dramatically.

Mark