Hi,
your ceph version seems to be 17.2.4, not 17.2.6 (17.2.6 is just the
locally installed ceph version on the system where you ran the
command). Could you add the 'ceph versions' output as well?
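To see what the daemons themselves are actually running, something
like this should do (a quick sketch, assuming a cephadm/containerized
deployment since the OSD logs go through bash/systemd units; osd.6 is
only an example id taken from your log):

  ceph versions             # versions reported by the running mon/mgr/osd daemons
  ceph orch ps              # per-daemon image/version, cephadm deployments only
  ceph tell osd.6 version   # ask a single OSD directly

'ceph versions' is the interesting one, the other two are just cross-checks.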
How is the load on the systems when the recovery starts? The OSDs
crash after around 20 minutes, not immediately, which is why I assume
it's some sort of resource bottleneck.
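If possible, capture some basic resource metrics on the OSD hosts
while recovery/backfill is running, for example (just plain
suggestions, use whatever tooling you already have):

  ceph -s                       # recovery/backfill progress
  ceph osd perf                 # per-OSD commit/apply latencies
  iostat -x 5                   # disk utilization and latency per device
  free -m; top                  # memory pressure on the hosts
  journalctl -k | grep -i oom   # did the kernel OOM killer hit an OSD process?

If the OOM killer shows up in the kernel log, that would support the
resource bottleneck theory.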
---snip---
Mar 08 23:46:05 beta-cen bash[922752]: debug 2024-03-09T04:46:05.198+0000 7f0a4bb5d700 0 log_channel(cluster) log [INF] : 2.1b continuing backfill to osd.6 from (9971'17067184,10014'17073660] 2:dc1332a8:::rbd_data.99d921d8edc910.0000000000000198:head to 10014'17073660
Mar 08 23:46:05 beta-cen bash[922752]: debug 2024-03-09T04:46:05.198+0000 7f0a4ab5b700 0 log_channel(cluster) log [INF] : 2.1d continuing backfill to osd.6 from (9972'32706188,10014'32712589] 2:bc276b0b:::rbd_data.307ae0ca08e035.0000000000019bd6:head to 10014'32712589
Mar 08 23:46:05 beta-cen bash[922752]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.4/rpm/el8/BUILD/ceph-17.2.4/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f0a4bb5d700 time 2024-03-09T04:46:05.331039+0000
Mar 08 23:46:05 beta-cen bash[922752]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.4/rpm/el8/BUILD/ceph-17.2.4/src/osd/osd_types.cc: 5888: FAILED ceph_assert(clone_overlap.count(clone))
Mar 08 23:46:05 beta-cen bash[922752]: ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
Mar 08 23:46:05 beta-cen bash[922752]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x55c62ba0d631]
Mar 08 23:46:05 beta-cen bash[922752]: 2: /usr/bin/ceph-osd(+0x5977f7) [0x55c62ba0d7f7]
Mar 08 23:46:05 beta-cen bash[922752]: 3: (SnapSet::get_clone_bytes(snapid_t) const+0xe8) [0x55c62bdc1228]
Mar 08 23:46:05 beta-cen bash[922752]: 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x25e) [0x55c62bc4ff4e]
Mar 08 23:46:05 beta-cen bash[922752]: 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x1281) [0x55c62bcbaeb1]
Mar 08 23:46:05 beta-cen bash[922752]: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xe34) [0x55c62bcc0414]
Mar 08 23:46:05 beta-cen bash[922752]: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x272) [0x55c62bb20852]
Mar 08 23:46:05 beta-cen bash[922752]: 8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x55c62be0cdcd]
Mar 08 23:46:05 beta-cen bash[922752]: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x115f) [0x55c62bb21dbf]
Mar 08 23:46:05 beta-cen bash[922752]: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x55c62c27f8c5]
Mar 08 23:46:05 beta-cen bash[922752]: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55c62c281fe4]
Mar 08 23:46:05 beta-cen bash[922752]: 12: /lib64/libpthread.so.0(+0x81ca) [0x7f0a6bf991ca]
Mar 08 23:46:05 beta-cen bash[922752]: 13: clone()
---snip---
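That said, the assert itself (clone_overlap.count(clone) in
SnapSet::get_clone_bytes) fires while backfill sums up clone sizes,
so it may also be worth checking whether the snapshot metadata of the
affected PGs is consistent. A possible way to do that (just a sketch,
using the PG ids 2.1b and 2.1d from your log; the deep-scrub has to
finish before the list commands show anything useful):

  ceph pg deep-scrub 2.1b
  ceph pg deep-scrub 2.1d
  rados list-inconsistent-snapset 2.1b --format=json-pretty
  rados list-inconsistent-snapset 2.1d --format=json-pretty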
Quoting Romain Lebbadi-Breteau <romain.lebbadi-breteau@xxxxxxxxxx>:
Hi,
Sorry for the bad formatting. Here are the outputs again.
ceph osd df :
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA      OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 3  hdd    1.81879         0      0 B      0 B       0 B      0 B      0 B      0 B      0     0     0  down
12  hdd    1.81879   1.00000  1.8 TiB  385 GiB   383 GiB  6.7 MiB  1.4 GiB  1.4 TiB  20.66  1.73    18  up
13  hdd    1.81879   1.00000  1.8 TiB  422 GiB   421 GiB  5.8 MiB  1.3 GiB  1.4 TiB  22.67  1.90    17  up
15  hdd    1.81879   1.00000  1.8 TiB  264 GiB   263 GiB  4.6 MiB  1.1 GiB  1.6 TiB  14.17  1.19    14  up
16  hdd    9.09520   1.00000  9.1 TiB  1.0 TiB  1023 GiB  8.8 MiB  2.6 GiB  8.1 TiB  11.01  0.92    65  up
17  hdd    1.81879   1.00000  1.8 TiB  319 GiB   318 GiB  6.1 MiB  1.0 GiB  1.5 TiB  17.13  1.43    15  up
 1  hdd    5.45749   1.00000  5.5 TiB  546 GiB   544 GiB  7.8 MiB  1.4 GiB  4.9 TiB   9.76  0.82    29  up
 4  hdd    5.45749   1.00000  5.5 TiB  801 GiB   799 GiB  8.3 MiB  2.4 GiB  4.7 TiB  14.34  1.20    44  up
 8  hdd    5.45749   1.00000  5.5 TiB  708 GiB   706 GiB  9.7 MiB  2.1 GiB  4.8 TiB  12.67  1.06    36  up
11  hdd    5.45749         0      0 B      0 B       0 B      0 B      0 B      0 B      0     0     0  down
14  hdd    1.81879   1.00000  1.8 TiB  200 GiB   198 GiB  3.8 MiB  1.3 GiB  1.6 TiB  10.71  0.90    10  up
 0  hdd    9.09520         0      0 B      0 B       0 B      0 B      0 B      0 B      0     0     0  down
 5  hdd    9.09520   1.00000  9.1 TiB  859 GiB   857 GiB   17 MiB  2.1 GiB  8.3 TiB   9.23  0.77    46  up
 9  hdd    9.09520   1.00000  9.1 TiB  924 GiB   922 GiB   11 MiB  2.3 GiB  8.2 TiB   9.92  0.83    55  up
                       TOTAL    53 TiB  6.3 TiB   6.3 TiB   90 MiB   19 GiB   46 TiB  11.95
MIN/MAX VAR: 0.77/1.90  STDDEV: 4.74
ceph osd pool ls detail :
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 32 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 9327 lfor 0/0/104 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 9018 lfor 0/0/104 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 9149 lfor 0/0/106 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 5 'polyphoto_backup' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 372 lfor 0/0/362 flags hashpspool,selfmanaged_snaps stripe_width 0 compression_algorithm snappy compression_mode aggressive application rbd
The error seems to be caused by a software error (an assertion
failure) in Ceph itself: in the logs I get the message
"FAILED ceph_assert(clone_overlap.count(clone))".
Thanks,
Romain Lebbadi-Breteau
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx