Thinking about this a little more: one thing I remember from when I was
writing the priority cache manager is that in some cases I saw strange
behavior with the RocksDB block cache when compaction ran. It appeared
that the entire contents of the cache could be invalidated. That would
only make sense if RocksDB was trimming old entries from the cache
rather than just the entries associated with the (now deleted) SST
files, or perhaps it waits to delete all of the SST files until the end
of the compaction cycle, forcing old entries out of the cache and then
invalidating the whole works.
In any event, I wonder if the page cache acting as a secondary cache is
enough on your clusters to work around all of this, by keeping the SST
files backing the previously hot blocks in page cache until compaction
completes. Maybe the combination of snap trimming or other background
work along with compaction is simply thrashing the RocksDB block cache.
For folks who feel comfortable watching IO hitting their DB devices:
can you check whether you see increased bursts of reads to the DB
device after a compaction event has occurred? (A rough way to check is
sketched after the log snippet below.) Compaction events look like this
in the OSD logs:
2020-08-04T17:15:56.603+0000 7fb0cf60d700 4 rocksdb: (Original Log Time
2020/08/04-17:15:56.603585) EVENT_LOG_v1 {"time_micros":
1596561356603574, "job": 5, "event": "compaction_finished",
"compaction_time_micros": 744532, "compaction_time_cpu_micros": 607655,
"output_level": 1, "num_output_files": 2, "total_output_size": 84712923,
"num_input_records": 1714260, "num_output_records": 658541,
"num_subcompactions": 1, "output_compression": "NoCompression",
"num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0,
"lsm_state": [0, 2, 0, 0, 0, 0, 0]}
You can also run this tool to get a nicely formatted list of compaction
events, though it only reports the time offset from the start of the
log rather than absolute timestamps, so looking at the OSD logs
directly is easier if you need to match things up:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
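I haven't double-checked the exact arguments here, but the invocation
should be roughly along these lines (pointing it at an OSD log; check
the script itself for the actual usage):

  python3 ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.0.log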
Mark
On 8/6/20 8:07 AM, Vladimir Prokofev wrote:
Manuel, thank you for your input.
This is actually huge, and the problem is exactly that.
As a side note, I have observed lower memory utilisation on the OSD
nodes since the update, along with heavy throughput on the block.db
devices (up to 100+ MB/s) that was not there before, so logically some
operations that were previously performed in memory were now being
executed directly against the block device. I was digging through
possible causes, but your time-saving message arrived first.
Thank you!
On Thu, 6 Aug 2020 at 14:56, Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:
Hi,
I found the reason for this behavior change.
With 14.2.10 the default value of "bluefs_buffered_io" was changed from
true to false.
https://tracker.ceph.com/issues/44818
Configuring this back to true seems to solve my problems.
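For anyone who wants to try the same, a minimal way to set it via the
config database (my assumption: the OSDs may need a restart before the
change takes effect):

  ceph config set osd bluefs_buffered_io true
  # verify what an OSD will actually use:
  ceph config get osd.0 bluefs_buffered_io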
Regards
Manuel
On Wed, 5 Aug 2020 13:30:45 +0200
Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:
Hello Vladimir,
I just tested this on a single-node test cluster with 60 HDDs (3 of
them with BlueStore without a separate WAL and DB).
With 14.2.10 I see a lot of read IOPS on the BlueStore OSDs while
snaptrimming. With 14.2.9 this was not an issue.
I wonder if this would explain the huge number of slow ops on my big
test cluster (44 nodes, 1056 OSDs) while snaptrimming. I cannot test a
downgrade there, because no packages of older releases are available
for CentOS 8.
Regards
Manuel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx