Re: RocksDB configuration

After going through:
https://de.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
I can already answer some of my own questions: notably, compaction happens slowly in the background,
and there is high write amplification for SSDs, which could explain why the SSDs in our MDS nodes reached their limit.
NVMes will likely perform better.
I'm unsure how much of the write amplification hits the WAL and how much hits the DB, though; this would be interesting to learn.

Still, the questions concerning the usefulness of RocksDB compression (and whether that is configurable at all via Ceph tunables),
and potential gains from offline / overnight compaction, remain open.
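On the configurability side: the RocksDB options that BlueStore logs at startup come from the `bluestore_rocksdb_options` setting, so compression should in principle be switchable there. Whether enabling it actually helps is exactly the open question above, so the fragment below is only a sketch; the `kSnappyCompression` choice is an assumption, and since the setting replaces the whole default option string, the other logged defaults are repeated:

```ini
# Hypothetical ceph.conf fragment (sketch, untested): override the RocksDB
# options BlueStore passes down. This replaces the entire default option
# string, so defaults one wants to keep must be restated here.
[osd]
bluestore rocksdb options = compression=kSnappyCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456
```

An OSD restart would be needed for this to take effect. For offline compaction, `ceph-kvstore-tool bluestore-kv <osd-data-path> compact` is the tool intended to run against a stopped OSD, though whether that is safe and worthwhile in practice is precisely what I'm asking about.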

Also, after seeing this, I wonder how bad it really is if RocksDB spills over to a slow device, 
since the "hot" parts should stay on faster devices. 
Does somebody have long-term experience with that? 
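One way to watch spillover is the `bluefs` section of `ceph daemon osd.N perf dump`, which reports `db_used_bytes` and `slow_used_bytes`. A small sketch of reading those counters; the sample numbers are made up, only the counter names are taken from the perf dump output:

```python
import json

def bluefs_spillover(perf_dump):
    """Return (db_used, slow_used, spill_fraction) from a perf-dump dict."""
    bluefs = perf_dump["bluefs"]
    db_used = bluefs["db_used_bytes"]       # RocksDB data on the fast device
    slow_used = bluefs["slow_used_bytes"]   # RocksDB data spilled to the HDD
    total = db_used + slow_used
    return db_used, slow_used, (slow_used / total if total else 0.0)

# Hypothetical sample, as if taken from: ceph daemon osd.0 perf dump
sample = json.loads(
    '{"bluefs": {"db_used_bytes": 6442450944, "slow_used_bytes": 1073741824}}'
)
db, slow, frac = bluefs_spillover(sample)
print(f"DB on fast device: {db >> 20} MiB, spilled to slow: {slow >> 20} MiB ({frac:.0%})")
```

Sampling this over time would at least show how much of the DB actually ends up on the slow device in steady state.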

Currently, our SSD space is too small to hold RocksDB once our cluster becomes full (we only have about 7 GB of SSD per 4 TB of HDD OSD),
so the question is whether we should buy larger SSDs / NVMes, or whether this might actually be a non-issue in the long run.
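For scale, our ratio works out far below the few-percent DB sizing that is often quoted for BlueStore; the 4% target in the arithmetic below is an assumption taken from that commonly quoted guidance, not an official requirement:

```python
# Rough sizing check: SSD (DB) capacity per HDD OSD, as a fraction of OSD size.
# The 4% target is an assumption from commonly quoted guidance.
ssd_gb_per_osd = 7            # from the cluster described above
hdd_tb_per_osd = 4
current = ssd_gb_per_osd / (hdd_tb_per_osd * 1000)  # current DB share of OSD
target = 0.04
needed_gb = target * hdd_tb_per_osd * 1000          # GB needed to hit target
print(f"current DB share: {current:.2%}; ~{needed_gb:.0f} GB would meet a {target:.0%} target")
```

So by that rough yardstick we are more than an order of magnitude short, which is why the "does spillover actually matter" question is so relevant for us.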

Cheers,
	Oliver


On 05.03.2018 at 11:42, Oliver Freyermuth wrote:
> Dear Cephalopodians,
> 
> in the benchmarks with many files, I noted that our bottleneck was mainly the MDS SSD performance,
> and notably, after deleting the many files in CephFS, the RocksDB stayed large and did not shrink.
> Recreating an OSD from scratch and backfilling it, however, resulted in a smaller RocksDB. 
> 
> I noticed some interesting messages in the logs of starting OSDs:
>  set rocksdb option compaction_readahead_size = 2097152
>  set rocksdb option compression = kNoCompression
>  set rocksdb option max_write_buffer_number = 4
>  set rocksdb option min_write_buffer_number_to_merge = 1
>  set rocksdb option recycle_log_file_num = 4
>  set rocksdb option writable_file_max_buffer_size = 0
>  set rocksdb option write_buffer_size = 268435456
> 
> Now I wonder: Can these be configured via Ceph parameters? 
> Can / should one trigger compaction via ceph-kvstore-tool? Is this safe when the corresponding OSD is down; has anybody tested it?
> Is there a fixed time slot when compaction starts (e.g. low load average)? 
> 
> I'm especially curious whether compression would help reduce the write load on the metadata servers. Maybe not, since the synchronization of I/O has to happen in any case,
> and that is more likely to be the actual limit than the bulk I/O.
> 
> Just being curious! 
> 
> Cheers,
> 	Oliver
> 
> 
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


