Hi,

I attempted to upgrade from 16.2.10 to 16.2.11, and 2 OSDs out of many started crashing in a loop on the very first host:

Jan 25 23:07:53 ceph01 bash[2553123]: Uptime(secs): 0.0 total, 0.0 interval
Jan 25 23:07:53 ceph01 bash[2553123]: Flush(GB): cumulative 0.000, interval 0.000
Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(GB): cumulative 0.000, interval 0.000
Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(Total Files): cumulative 0, interval 0
Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(L0 Files): cumulative 0, interval 0
Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(Keys): cumulative 0, interval 0
Jan 25 23:07:53 ceph01 bash[2553123]: Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Jan 25 23:07:53 ceph01 bash[2553123]: Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Jan 25 23:07:53 ceph01 bash[2553123]: Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Jan 25 23:07:53 ceph01 bash[2553123]: ** File Read Latency Histogram By Level [P] **
Jan 25 23:07:53 ceph01 bash[2553123]: debug -10> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986439) [db_impl/db_impl_compaction_flush.cc:2611] Compaction nothing to do
Jan 25 23:07:53 ceph01 bash[2553123]: debug -9> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986493) [db_impl/db_impl_compaction_flush.cc:2611] Compaction nothing to do
Jan 25 23:07:53 ceph01 bash[2553123]: debug -8> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986500) [db_impl/db_impl_compaction_flush.cc:2611] Compaction nothing to do
Jan 25 23:07:53 ceph01 bash[2553123]: debug -7> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986505) [db_impl/db_impl_compaction_flush.cc:2611] Compaction nothing to do
Jan 25 23:07:53 ceph01 bash[2553123]: debug -6> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: [compaction/compaction_job.cc:1676] [O-2] [JOB 9] Compacting 4@0 + 2@1 files to L1, score 1.00
Jan 25 23:07:53 ceph01 bash[2553123]: debug -5> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: [compaction/compaction_job.cc:1680] [O-2] Compaction start summary: Base version 33 Base level 0, inputs: [649058(959KB) 649046(1510KB) 649024(1323KB) 649002(1396KB)], [648981(66MB) 648982(52MB)]
Jan 25 23:07:53 ceph01 bash[2553123]: debug -4> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1674688072986547, "job": 9, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [649058, 649046, 649024, 649002], "files_L1": [648981, 648982], "score": 1, "input_data_size": 129161327}
Jan 25 23:07:53 ceph01 bash[2553123]: debug -3> 2023-01-25T23:07:52.990+0000 7f7c8b66e700 -1 bdev(0x5619bf0ce400 /var/lib/ceph/osd/ceph-3/block) _aio_thread got r=-1 ((1) Operation not permitted)
Jan 25 23:07:53 ceph01 bash[2553123]: debug -2> 2023-01-25T23:07:52.990+0000 7f7c8b66e700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.11/rpm/el8/BUILD/ceph-16.2.11/src/blk/kernel/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f7c8b66e700 time 2023-01-25T23:07:52.993976+0000
Jan 25 23:07:53 ceph01 bash[2553123]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.11/rpm/el8/BUILD/ceph-16.2.11/src/blk/kernel/KernelDevice.cc: 604: ceph_abort_msg("Unexpected IO error. This may suggest HW issue. Please check your dmesg!")
Jan 25 23:07:53 ceph01 bash[2553123]: ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)
Jan 25 23:07:53 ceph01 bash[2553123]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x5619b2fc1adc]
Jan 25 23:07:53 ceph01 bash[2553123]: 2: (KernelDevice::_aio_thread()+0x1285) [0x5619b3b2a4e5]
Jan 25 23:07:53 ceph01 bash[2553123]: 3: (KernelDevice::AioCompletionThread::entry()+0x11) [0x5619b3b357b1]
Jan 25 23:07:53 ceph01 bash[2553123]: 4: /lib64/libpthread.so.0(+0x81ca) [0x7f7c978851ca]
Jan 25 23:07:53 ceph01 bash[2553123]: 5: clone()
Jan 25 23:07:53 ceph01 bash[2553123]: debug -1> 2023-01-25T23:07:52.994+0000 7f7c8b66e700 -1 *** Caught signal (Aborted) **
Jan 25 23:07:53 ceph01 bash[2553123]: in thread 7f7c8b66e700 thread_name:bstore_aio
Jan 25 23:07:53 ceph01 bash[2553123]: ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)
Jan 25 23:07:53 ceph01 bash[2553123]: 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f7c9788fcf0]
Jan 25 23:07:53 ceph01 bash[2553123]: 2: gsignal()
Jan 25 23:07:53 ceph01 bash[2553123]: 3: abort()
Jan 25 23:07:53 ceph01 bash[2553123]: 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x5619b2fc1bad]
Jan 25 23:07:53 ceph01 bash[2553123]: 5: (KernelDevice::_aio_thread()+0x1285) [0x5619b3b2a4e5]
Jan 25 23:07:53 ceph01 bash[2553123]: 6: (KernelDevice::AioCompletionThread::entry()+0x11) [0x5619b3b357b1]
Jan 25 23:07:53 ceph01 bash[2553123]: 7: /lib64/libpthread.so.0(+0x81ca) [0x7f7c978851ca]

The OSD kept crashing until the host was rebooted; restarting just the OSD wouldn't help. This hasn't happened during any previous upgrade, so it was a rather unexpected development. It's unclear what caused this, but a host reboot seems to have fixed it.
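In case anyone else runs into the same abort, these are the checks I'd suggest before resorting to a reboot. This is only a sketch assuming a cephadm-managed cluster and the osd.3 id from the log above; substitute your own OSD id:

```shell
# Kernel log: look for block-layer I/O errors near the crash time,
# as the abort message itself suggests ("Please check your dmesg!").
dmesg -T | grep -iE 'i/o error|blk_update|nvme' | tail -n 50

# Crashes the cluster has recorded but not yet archived:
ceph crash ls-new

# Try restarting only the affected daemon before rebooting the whole host
# (in my case this did NOT clear the crash loop, but it is the cheaper
# first step):
ceph orch daemon restart osd.3
```

The `ceph` commands obviously need a running cluster and admin keyring; the dmesg check can be run on the OSD host directly.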
The same thing happened to 1 other OSD on another host, with exactly the same symptoms, and it was also solved by a reboot.

Best regards,
Zakhar