Thinking about this a little more: one thing I remember from when I was
writing the priority cache manager is that in some cases I saw strange
behavior with the RocksDB block cache when compaction ran. It appeared
that the entire contents of the cache could be invalidated. That would
only make sense if RocksDB was trimming old entries from the cache
rather than just the entries associated with the (now deleted) SST
files, or perhaps it waits to delete all of the SST files until the end
of the compaction cycle, forcing old entries out of the cache and then
invalidating the whole works.
In any event, I wonder if the page cache acting as a secondary cache is
enough on your clusters to work around all of this, by keeping the SST
files backing the previously hot blocks in page cache until compaction
completes. Maybe the combination of snap trimming or other background
work along with compaction is simply thrashing the RocksDB block cache.
For folks who feel comfortable watching IO hitting their DB devices:
can you check whether you see increased bursts of reads to the DB
device after a compaction event has occurred? (A rough way to check is
sketched after the log snippet below.) Compaction events look like this
in the OSD logs:
2020-08-04T17:15:56.603+0000 7fb0cf60d700 4 rocksdb: (Original Log Time
2020/08/04-17:15:56.603585) EVENT_LOG_v1 {"time_micros":
1596561356603574, "job": 5, "event": "compaction_finished",
"compaction_time_micros": 744532, "compaction_time_cpu_micros": 607655,
"output_level": 1, "num_output_files": 2, "total_output_size": 84712923,
"num_input_records": 1714260, "num_output_records": 658541,
"num_subcompactions": 1, "output_compression": "NoCompression",
"num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0,
"lsm_state": [0, 2, 0, 0, 0, 0, 0]}
You can also run this tool to get a nicely formatted list of compaction
events, though it only reports the time offset from the start of the
log rather than absolute timestamps, so looking at the OSD logs
directly is easier if you need to match things up:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
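I haven't double-checked the exact arguments here, but the invocation
should be roughly along these lines (pointing it at an OSD log; check
the script itself for the actual usage):

  python3 ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.0.log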
Mark
On 8/6/20 8:07 AM, Vladimir Prokofev wrote:
Manuel, thank you for your input.
This is actually huge, and the problem is exactly that.
As a side note, I have observed lower memory utilisation on the OSD
nodes since the update, along with heavy throughput on the block.db
devices (up to 100+ MB/s) that was not there before, so logically some
operations that were previously performed in memory were now being
executed directly against the block device. I was digging through
possible causes, but your time-saving message arrived first.
Thank you!
On Thu, 6 Aug 2020 at 14:56, Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:
Hi,
I found the reason for this behavior change.
With 14.2.10 the default value of "bluefs_buffered_io" was changed from
true to false.
https://tracker.ceph.com/issues/44818
Configuring this back to true seems to solve my problems.
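For anyone who wants to try the same, a minimal way to set it via the
config database (my assumption: the OSDs may need a restart before the
change takes effect):

  ceph config set osd bluefs_buffered_io true
  # verify what an OSD will actually use:
  ceph config get osd.0 bluefs_buffered_io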
Regards
Manuel
On Wed, 5 Aug 2020 13:30:45 +0200
Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:
Hello Vladimir,
I just tested this on a single-node test cluster with 60 HDDs (3 of
them with BlueStore without a separate WAL and DB).
With 14.2.10 I see a lot of read IOPS on the BlueStore OSDs while
snaptrimming. With 14.2.9 this was not an issue.
I wonder if this would explain the huge number of slow ops on my big
test cluster (44 nodes, 1056 OSDs) while snaptrimming. I cannot test a
downgrade there, because no packages of older releases are available
for CentOS 8.
Regards
Manuel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx