Re: PG damaged "failed_repair"

Hi,

Your Ceph version seems to be 17.2.4, not 17.2.6 (17.2.6 is just the locally installed Ceph version on the system where you ran the command). Could you add the 'ceph versions' output as well?
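For reference, 'ceph versions' reports the daemon versions cluster-wide rather than the local package version, e.g. (output shortened; the counts and the elided hashes below are just placeholders, yours will differ):

  $ ceph versions
  {
      "mon": { "ceph version 17.2.4 (...) quincy (stable)": 3 },
      "mgr": { "ceph version 17.2.4 (...) quincy (stable)": 2 },
      "osd": { "ceph version 17.2.4 (...) quincy (stable)": 11 },
      "overall": { "ceph version 17.2.4 (...) quincy (stable)": 16 }
  }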

How is the load on the systems when the recovery starts? The OSDs crash after around 20 minutes, not immediately, which is why I suspect some sort of resource bottleneck.
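If you want to capture the load while the backfill is running, the usual suspects on the OSD hosts are enough (nothing Ceph-specific here, adjust to taste):

  # cluster and recovery state
  ceph -s
  # per-disk utilization and latency, 5 second intervals
  iostat -x 5
  # check whether the kernel OOM killer took out an OSD process
  dmesg -T | grep -i oom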

---snip---
Mar 08 23:46:05 beta-cen bash[922752]: debug 2024-03-09T04:46:05.198+0000 7f0a4bb5d700  0 log_channel(cluster) log [INF] : 2.1b continuing backfill to osd.6 from (9971'17067184,10014'17073660] 2:dc1332a8:::rbd_data.99d921d8edc910.0000000000000198:head to 10014'17073660
Mar 08 23:46:05 beta-cen bash[922752]: debug 2024-03-09T04:46:05.198+0000 7f0a4ab5b700  0 log_channel(cluster) log [INF] : 2.1d continuing backfill to osd.6 from (9972'32706188,10014'32712589] 2:bc276b0b:::rbd_data.307ae0ca08e035.0000000000019bd6:head to 10014'32712589
Mar 08 23:46:05 beta-cen bash[922752]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.4/rpm/el8/BUILD/ceph-17.2.4/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f0a4bb5d700 time 2024-03-09T04:46:05.331039+0000
Mar 08 23:46:05 beta-cen bash[922752]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.4/rpm/el8/BUILD/ceph-17.2.4/src/osd/osd_types.cc: 5888: FAILED ceph_assert(clone_overlap.count(clone))
Mar 08 23:46:05 beta-cen bash[922752]:  ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
Mar 08 23:46:05 beta-cen bash[922752]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x55c62ba0d631]
Mar 08 23:46:05 beta-cen bash[922752]:  2: /usr/bin/ceph-osd(+0x5977f7) [0x55c62ba0d7f7]
Mar 08 23:46:05 beta-cen bash[922752]:  3: (SnapSet::get_clone_bytes(snapid_t) const+0xe8) [0x55c62bdc1228]
Mar 08 23:46:05 beta-cen bash[922752]:  4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x25e) [0x55c62bc4ff4e]
Mar 08 23:46:05 beta-cen bash[922752]:  5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x1281) [0x55c62bcbaeb1]
Mar 08 23:46:05 beta-cen bash[922752]:  6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xe34) [0x55c62bcc0414]
Mar 08 23:46:05 beta-cen bash[922752]:  7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x272) [0x55c62bb20852]
Mar 08 23:46:05 beta-cen bash[922752]:  8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x55c62be0cdcd]
Mar 08 23:46:05 beta-cen bash[922752]:  9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x115f) [0x55c62bb21dbf]
Mar 08 23:46:05 beta-cen bash[922752]:  10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x55c62c27f8c5]
Mar 08 23:46:05 beta-cen bash[922752]:  11: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55c62c281fe4]
Mar 08 23:46:05 beta-cen bash[922752]:  12: /lib64/libpthread.so.0(+0x81ca) [0x7f0a6bf991ca]
Mar 08 23:46:05 beta-cen bash[922752]:  13: clone()
---snip---
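If the crash module caught these aborts, you can also pull the full backtraces and metadata directly from the cluster (the ID argument below is just a placeholder):

  ceph crash ls
  ceph crash info <crash-id>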


Quoting Romain Lebbadi-Breteau <romain.lebbadi-breteau@xxxxxxxxxx>:

Hi,

Sorry for the bad formatting. Here are the outputs again.

ceph osd df :

ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA      OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 3    hdd  1.81879         0      0 B      0 B       0 B      0 B      0 B      0 B      0     0    0    down
12    hdd  1.81879   1.00000  1.8 TiB  385 GiB   383 GiB  6.7 MiB  1.4 GiB  1.4 TiB  20.66  1.73   18      up
13    hdd  1.81879   1.00000  1.8 TiB  422 GiB   421 GiB  5.8 MiB  1.3 GiB  1.4 TiB  22.67  1.90   17      up
15    hdd  1.81879   1.00000  1.8 TiB  264 GiB   263 GiB  4.6 MiB  1.1 GiB  1.6 TiB  14.17  1.19   14      up
16    hdd  9.09520   1.00000  9.1 TiB  1.0 TiB  1023 GiB  8.8 MiB  2.6 GiB  8.1 TiB  11.01  0.92   65      up
17    hdd  1.81879   1.00000  1.8 TiB  319 GiB   318 GiB  6.1 MiB  1.0 GiB  1.5 TiB  17.13  1.43   15      up
 1    hdd  5.45749   1.00000  5.5 TiB  546 GiB   544 GiB  7.8 MiB  1.4 GiB  4.9 TiB   9.76  0.82   29      up
 4    hdd  5.45749   1.00000  5.5 TiB  801 GiB   799 GiB  8.3 MiB  2.4 GiB  4.7 TiB  14.34  1.20   44      up
 8    hdd  5.45749   1.00000  5.5 TiB  708 GiB   706 GiB  9.7 MiB  2.1 GiB  4.8 TiB  12.67  1.06   36      up
11    hdd  5.45749         0      0 B      0 B       0 B      0 B      0 B      0 B      0     0    0    down
14    hdd  1.81879   1.00000  1.8 TiB  200 GiB   198 GiB  3.8 MiB  1.3 GiB  1.6 TiB  10.71  0.90   10      up
 0    hdd  9.09520         0      0 B      0 B       0 B      0 B      0 B      0 B      0     0    0    down
 5    hdd  9.09520   1.00000  9.1 TiB  859 GiB   857 GiB   17 MiB  2.1 GiB  8.3 TiB   9.23  0.77   46      up
 9    hdd  9.09520   1.00000  9.1 TiB  924 GiB   922 GiB   11 MiB  2.3 GiB  8.2 TiB   9.92  0.83   55      up
                       TOTAL   53 TiB  6.3 TiB   6.3 TiB   90 MiB   19 GiB   46 TiB  11.95
MIN/MAX VAR: 0.77/1.90  STDDEV: 4.74

ceph osd pool ls detail :

pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 32 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 9327 lfor 0/0/104 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 9018 lfor 0/0/104 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 9149 lfor 0/0/106 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 5 'polyphoto_backup' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 372 lfor 0/0/362 flags hashpspool,selfmanaged_snaps stripe_width 0 compression_algorithm snappy compression_mode aggressive application rbd

The crash seems to be caused by a software bug in Ceph itself. In the logs, I get the message "FAILED ceph_assert(clone_overlap.count(clone))".

Thanks,

Romain Lebbadi-Breteau
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





