Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

Mark Nelson <mark.nelson@xxxxxxxxx> · Thu, 28 Sep 2023 11:45:12 -0500

There are some pretty strange compaction behavior happening in these 
logs.  For instance, in osd0, we see a O-1 CF L1 compaction that's 
taking ~204 seconds:

2023-09-21T20:03:59.378+0000 7f16a286c700  4 rocksdb: (Original Log Time 
2023/09/21-20:03:59.381808) EVENT_LOG_v1 {"time_micros": 
1695326639381779, "job": 17834, "event": "compaction_finished", 
"compaction_time_micros": 204158593, "compaction_time_cpu_micros": 
180777275, "output_level": 1, "num_output_files": 36, 
"total_output_size": 1006338775, "num_input_records": 2547576, 
"num_output_records": 2479913, "num_subcompactions": 1, 
"output_compression": "NoCompression", "num_single_delete_mismatches": 
0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 36, 205, 393, 0, 
0, 0]}

That's compacting at only 4.7MB/s.

Yet a little later on, we see an O-1 L3 compaction taking around 20 
seconds despite compacting more data:

2023-09-21T20:04:40.466+0000 7f16a286c700  4 rocksdb: (Original Log Time 
2023/09/21-20:04:40.472527) EVENT_LOG_v1 {"time_micros": 
1695326680472501, "job": 17848, "event": "compaction_finished", 
"compaction_time_micros": 20166590, "compaction_time_cpu_micros": 
16105672, "output_level": 3, "num_output_files": 24, 
"total_output_size": 1597043099, "num_input_records": 3342658, 
"num_output_records": 3247386, "num_subcompactions": 1, 
"output_compression": "NoCompression", "num_single_delete_mismatches": 
0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 35, 206, 393, 0, 
0, 0]}

With a compaction rate of around 75.5MB/s.

For OSD1, the compaction behavior is even worse!  An example O-0 L1 
compaction took ~969 seconds!

2023-09-22T11:24:14.206+0000 7fc4e4807700  4 rocksdb: (Original Log Time 
2023/09/22-11:24:14.211454) EVENT_LOG_v1 {"time_micros": 
1695381854211381, "job": 28548, "event": "compaction_finished", 
"compaction_time_micros": 968781445, "compaction_time_cpu_micros": 
844222174, "output_level": 1, "num_output_files": 56, 
"total_output_size": 1006413354, "num_input_records": 2649457, 
"num_output_records": 2579105, "num_subcompactions": 1, 
"output_compression": "NoCompression", "num_single_delete_mismatches": 
0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 56, 297, 359, 0, 
0, 0]}

With a compaction rate of around 0.99MB/s.

OSD2's worst case compaction rate however was around 19MB/s, but was 
often significantly faster (~100-200MB/s)

I'm leaning toward hybrid allocator issues or hardware issues here, 
because these compaction rates are all over the place and much slower 
than expected on flash hardware.

Mark

On 9/28/23 06:22, Igor Fedotov wrote:
Hi Sudhin,

It looks like manual DB compactions are (periodically?) issued via 
admin socket for your OSDs, which (my working hypothesis) triggers DB 
access stalls.

Here are the log lines indicating such calls

debug 2023-09-22T11:24:55.234+0000 7fc4efa20700  1 osd.1 1192508 
triggering manual compaction

debug 2023-09-21T15:35:22.696+0000 7faf22c8b700  1 osd.2 1180406 
finished manual compaction in 722.287 seconds

So I'm curious if you do have some external stuff performing manual 
OSD compactions? If so - would the primary issue go away when it's 
disabled?

You might want to disable it cluster wide and let OSDs run after that 
for a while to make sure that's the case. Then try to reproduce it 
again by manual running compaction for a specific OSD via CLI. Would 
it fail again?

If the above hypotheses is confirmed I could see two potential root 
causes:

1. Hybrid allocator might cause severe BlueFS stalls which make 
irresponsive. See https://tracker.ceph.com/issues/62815

2. Default RocksDB settings were changed in Reef. See 
https://github.com/ceph/ceph/pull/51900

The easiest way to verify if you're facing 1. is to set 
bluestore_allocator to bitmap for all the OSDs (and restart them) via 
"ceph config set" command .  Then monitor OSDs behavior during manual 
compactions.

For validating 2.  one should revert bluestore_rocksdb_options back to 
the original value 
"compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=16777216,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0"

I'd recommend to do that for a single OSD first. Just in case - we 
don't have much knowledge on how OSDs survive such a reversion hence 
better/safier do that gradually.

Hope this helps and awaiting for the feedback.

Thanks,

Igor

On 27/09/2023 22:04, sbengeri@xxxxxxxxx wrote:
Hi Igor,

I have copied three OSD logs to
https://drive.google.com/file/d/1aQxibFJR6Dzvr3RbuqnpPhaSMhPSL--F/view?usp=sharing 

Hopefully they include some meaningful information.

Thank you.

Sudhin
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nelson@xxxxxxxxx

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx