Hi,

I attempted to upgrade from 16.2.10 to 16.2.11, and 2 OSDs out of many started crashing in a loop on the very first host:

Jan 25 23:07:53 ceph01 bash[2553123]: Uptime(secs): 0.0 total, 0.0 interval
Jan 25 23:07:53 ceph01 bash[2553123]: Flush(GB): cumulative 0.000, interval 0.000
Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(GB): cumulative 0.000, interval 0.000
Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(Total Files): cumulative 0, interval 0
Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(L0 Files): cumulative 0, interval 0
Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(Keys): cumulative 0, interval 0
Jan 25 23:07:53 ceph01 bash[2553123]: Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Jan 25 23:07:53 ceph01 bash[2553123]: Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Jan 25 23:07:53 ceph01 bash[2553123]: Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Jan 25 23:07:53 ceph01 bash[2553123]: ** File Read Latency Histogram By Level [P] **
Jan 25 23:07:53 ceph01 bash[2553123]: debug -10> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986439) [db_impl/db_impl_compaction_flush.cc:2611] Compaction nothing to do
Jan 25 23:07:53 ceph01 bash[2553123]: debug -9> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986493) [db_impl/db_impl_compaction_flush.cc:2611] Compaction nothing to do
Jan 25 23:07:53 ceph01 bash[2553123]: debug -8> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986500) [db_impl/db_impl_compaction_flush.cc:2611] Compaction nothing to do
Jan 25 23:07:53 ceph01 bash[2553123]: debug -7> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986505) [db_impl/db_impl_compaction_flush.cc:2611] Compaction nothing to do
Jan 25 23:07:53 ceph01 bash[2553123]: debug -6> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: [compaction/compaction_job.cc:1676] [O-2] [JOB 9] Compacting 4@0 + 2@1 files to L1, score 1.00
Jan 25 23:07:53 ceph01 bash[2553123]: debug -5> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: [compaction/compaction_job.cc:1680] [O-2] Compaction start summary: Base version 33 Base level 0, inputs: [649058(959KB) 649046(1510KB) 649024(1323KB) 649002(1396KB)], [648981(66MB) 648982(52MB)]
Jan 25 23:07:53 ceph01 bash[2553123]: debug -4> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1674688072986547, "job": 9, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [649058, 649046, 649024, 649002], "files_L1": [648981, 648982], "score": 1, "input_data_size": 129161327}
Jan 25 23:07:53 ceph01 bash[2553123]: debug -3> 2023-01-25T23:07:52.990+0000 7f7c8b66e700 -1 bdev(0x5619bf0ce400 /var/lib/ceph/osd/ceph-3/block) _aio_thread got r=-1 ((1) Operation not permitted)
Jan 25 23:07:53 ceph01 bash[2553123]: debug -2> 2023-01-25T23:07:52.990+0000 7f7c8b66e700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.11/rpm/el8/BUILD/ceph-16.2.11/src/blk/kernel/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f7c8b66e700 time 2023-01-25T23:07:52.993976+0000
Jan 25 23:07:53 ceph01 bash[2553123]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.11/rpm/el8/BUILD/ceph-16.2.11/src/blk/kernel/KernelDevice.cc: 604: ceph_abort_msg("Unexpected IO error. This may suggest HW issue. Please check your dmesg!")
Jan 25 23:07:53 ceph01 bash[2553123]: ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)
Jan 25 23:07:53 ceph01 bash[2553123]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x5619b2fc1adc]
Jan 25 23:07:53 ceph01 bash[2553123]: 2: (KernelDevice::_aio_thread()+0x1285) [0x5619b3b2a4e5]
Jan 25 23:07:53 ceph01 bash[2553123]: 3: (KernelDevice::AioCompletionThread::entry()+0x11) [0x5619b3b357b1]
Jan 25 23:07:53 ceph01 bash[2553123]: 4: /lib64/libpthread.so.0(+0x81ca) [0x7f7c978851ca]
Jan 25 23:07:53 ceph01 bash[2553123]: 5: clone()
Jan 25 23:07:53 ceph01 bash[2553123]: debug -1> 2023-01-25T23:07:52.994+0000 7f7c8b66e700 -1 *** Caught signal (Aborted) **
Jan 25 23:07:53 ceph01 bash[2553123]: in thread 7f7c8b66e700 thread_name:bstore_aio
Jan 25 23:07:53 ceph01 bash[2553123]: ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)
Jan 25 23:07:53 ceph01 bash[2553123]: 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f7c9788fcf0]
Jan 25 23:07:53 ceph01 bash[2553123]: 2: gsignal()
Jan 25 23:07:53 ceph01 bash[2553123]: 3: abort()
Jan 25 23:07:53 ceph01 bash[2553123]: 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x5619b2fc1bad]
Jan 25 23:07:53 ceph01 bash[2553123]: 5: (KernelDevice::_aio_thread()+0x1285) [0x5619b3b2a4e5]
Jan 25 23:07:53 ceph01 bash[2553123]: 6: (KernelDevice::AioCompletionThread::entry()+0x11) [0x5619b3b357b1]
Jan 25 23:07:53 ceph01 bash[2553123]: 7: /lib64/libpthread.so.0(+0x81ca) [0x7f7c978851ca]

The OSD kept crashing until the host was rebooted; restarting just the OSD wouldn't help. This hasn't happened during any previous upgrade, so it was a rather unexpected development. It's unclear what caused this, but a host reboot seems to have fixed it.
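In case anyone else runs into the same abort, these are the checks I'd suggest before resorting to a reboot. This is only a sketch assuming a cephadm-managed cluster and the osd.3 id from the log above; substitute your own OSD id:

```shell
# Kernel log: look for block-layer I/O errors near the crash time,
# as the abort message itself suggests ("Please check your dmesg!").
dmesg -T | grep -iE 'i/o error|blk_update|nvme' | tail -n 50

# Crashes the cluster has recorded but not yet archived:
ceph crash ls-new

# Try restarting only the affected daemon before rebooting the whole host
# (in my case this did NOT clear the crash loop, but it is the cheaper
# first step):
ceph orch daemon restart osd.3
```

The `ceph` commands obviously need a running cluster and admin keyring; the dmesg check can be run on the OSD host directly.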
The same thing happened to 1 other OSD on another host, with exactly the same symptoms, and it was also solved by a reboot.

Best regards,
Zakhar