rocksdb corruption with 16.2.6

Hi,

After upgrading the cluster from 16.2.5 to 16.2.6, several OSDs crashed and now refuse to start due to rocksdb corruption, e.g.:
--------
 2021-09-19T15:47:10.611+0200 7f8bc1f0e700  4 rocksdb: [compaction/compaction_job.cc:1680] [default] Compaction start summary: Base version 6 Base level 0, inputs: [251944(53MB) 251942(42MB) 251940(33MB)], [251935(66MB) 251936(66MB) 251937(4464KB) 251938(8217KB)]

2021-09-19T15:47:10.611+0200 7f8bc1f0e700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1632059230612093, "job": 13, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [251944, 251942, 251940], "files_L1": [251935, 251936, 251937, 251938], "score": 1.27373, "input_data_size": 287841071}

2021-09-19T15:47:13.610+0200 7f8bc1f0e700  3 rocksdb: [db_impl/db_impl_compaction_flush.cc:2808] Compaction error: Corruption: block checksum mismatch: expected 2427092066, got 4051549320  in db/251935.sst offset 18414386 size 4032

2021-09-19T15:47:13.610+0200 7f8bc1f0e700  4 rocksdb: (Original Log Time 2021/09/19-15:47:13.611350) [compaction/compaction_job.cc:760] [default] compacted to: files[3 4 31 138 0 0 0] max score 0.97, MB/sec: 96.0 rd, 0.0 wr, level 1, files in(3, 4) out(1) MB in(130.0, 144.5) out(0.0), read-write-amplify(2.1) write-amplify(0.0) Corruption: block checksum mismatch: expected 2427092066, got 4051549320  in db/251935.sst offset 18414386 size 4032, records in: 1654508, records dropped: 1554257 output_compression: NoCompression

2021-09-19T15:47:13.610+0200 7f8bc1f0e700  4 rocksdb: (Original Log Time 2021/09/19-15:47:13.611381) EVENT_LOG_v1 {"time_micros": 1632059233611365, "job": 13, "event": "compaction_finished", "compaction_time_micros": 2999230, "compaction_time_cpu_micros": 87965, "output_level": 1, "num_output_files": 1, "total_output_size": 25072635, "num_input_records": 1654508, "num_output_records": 100251, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [3, 4, 31, 138, 0, 0, 0]}

2021-09-19T15:47:13.610+0200 7f8bc1f0e700  2 rocksdb: [db_impl/db_impl_compaction_flush.cc:2344] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2427092066, got 4051549320  in db/251935.sst offset 18414386 size 4032, Accumulated background error counts: 1

2021-09-19T15:47:13.636+0200 7f8bbacf1700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2427092066, got 4051549320  in db/251935.sst offset 18414386 size 4032 code = 2 Rocksdb transaction:
PutCF( prefix = O key = 0x8C7FFFFFFFFFFFFFF2EB980CC3'!temp_recovering_12.9d7s12_71578''447639_77916_head!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F000C000078 value size = 905)
PutCF( prefix = O key = 0x8C7FFFFFFFFFFFFFF2EB980CC3'!temp_recovering_12.9d7s12_71578''447639_77916_head!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F value size = 432)
MergeCF( prefix = b key = 0x00000024A4000000 value size = 16)
MergeCF( prefix = T key = 0x000000000000000C value size = 40)
2021-09-19T15:47:13.638+0200 7f8bbacf1700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7f8bbacf1700 time 2021-09-19T15:47:13.637926+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/os/bluestore/BlueStore.cc: 11650: FAILED ceph_assert(r == 0)

 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x56045e16a54c]
 2: /usr/bin/ceph-osd(+0x56a766) [0x56045e16a766]
 3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x45f) [0x56045e79639f]
 4: (BlueStore::_kv_sync_thread()+0x16dc) [0x56045e7cfa0c]
 5: (BlueStore::KVSyncThread::entry()+0x11) [0x56045e7f82d1]
 6: /lib64/libpthread.so.0(+0x814a) [0x7f8bd06f414a]
 7: clone()

2021-09-19T15:47:13.640+0200 7f8bbacf1700 -1 *** Caught signal (Aborted) **
 in thread 7f8bbacf1700 thread_name:bstore_kv_sync

 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7f8bd06feb20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x56045e16a59d]
 5: /usr/bin/ceph-osd(+0x56a766) [0x56045e16a766]
 6: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x45f) [0x56045e79639f]
 7: (BlueStore::_kv_sync_thread()+0x16dc) [0x56045e7cfa0c]
 8: (BlueStore::KVSyncThread::entry()+0x11) [0x56045e7f82d1]
 9: /lib64/libpthread.so.0(+0x814a) [0x7f8bd06f414a]
 10: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
---------
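
(The backtraces above should also be retrievable from the mgr crash module, e.g.:

# ceph crash ls
# ceph crash info <crash_id>

with <crash_id> being a placeholder for an ID taken from the first command's output.)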


I have attached a part of the OSD log.
7 out of ~1600 OSDs have this issue. BlueStore fsck or repair does not help; it actually crashes on most of them. The cluster had been stable for 6 weeks before the upgrade, with no daemon crashes.
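
For reference, fsck/repair was run offline with ceph-bluestore-tool, roughly like this (OSD 123 is just a placeholder ID):

# systemctl stop ceph-osd@123
# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-123
# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123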

Any hints? I upgraded a smaller cluster with 350 OSDs before this one, and none of its OSDs had issues.


Here is the config, just in case. I disabled bluefs_buffered_io after the first few crashes appeared.
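
(It was turned off cluster-wide via the config database with something like:

# ceph config set osd bluefs_buffered_io false

which is why it shows up under the osd section below.)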


# ceph config dump
WHO       MASK  LEVEL     OPTION VALUE                                           RO
global          advanced  objecter_inflight_op_bytes 1073741824
global          advanced  osd_pool_default_pg_autoscale_mode off
  mon           advanced  auth_allow_insecure_global_id_reclaim false
  mon           advanced  mon_allow_pool_delete true
  mon           advanced  mon_max_pg_per_osd 1000
  mgr           advanced  mgr/prometheus/rbd_stats_pools rbd,rbd_data,proxmox,rbd_fastdata,proxmox_fast  *
  mgr           advanced  osd_deep_scrub_interval 1209600.000000
  mgr           basic     target_max_misplaced_ratio 0.800000
  osd           advanced  bluefs_buffered_io false
  osd           advanced  objecter_inflight_ops 10240
  osd           advanced  osd_deep_scrub_interval 1209600.000000
  osd           advanced  osd_max_backfills 32
  osd           advanced  osd_max_pg_per_osd_hard_ratio 20.000000
  osd           advanced  osd_max_scrubs 8
  osd           advanced  osd_op_num_threads_per_shard_ssd 4                                               *
  osd           advanced  osd_op_thread_timeout 90
  osd           advanced  osd_recovery_max_active 10
  osd           advanced  osd_recovery_op_priority 63
  osd           advanced  osd_recovery_sleep_hdd 0.000000
  osd           advanced  osd_scrub_auto_repair true
  mds           basic     mds_cache_memory_limit 17179869184
  mds           advanced  mds_cache_trim_threshold 262144
  mds           advanced  mds_recall_global_max_decay_threshold 131072
  mds           advanced  mds_recall_max_caps 30000
  mds           advanced  mds_recall_max_decay_rate 1.500000
  mds           advanced  mds_recall_max_decay_threshold 131072
  mds           advanced  mds_recall_warning_threshold 262144
  client        advanced  client_force_lazyio true

Best regards,
Andrej

--
_____________________________________________________________
   prof. dr. Andrej Filipcic,   E-mail: Andrej.Filipcic@xxxxxx
   Department of Experimental High Energy Physics - F9
   Jozef Stefan Institute, Jamova 39, P.o.Box 3000
   SI-1001 Ljubljana, Slovenia
   Tel.: +386-1-477-3674    Fax: +386-1-425-7074
-------------------------------------------------------------


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



