Re: Ceph 14.2 - some PGs stuck peering.

Hi,

It's not really clear what happened, so I would investigate the root cause first. Did some of the OSDs fail, and if so, why?
To increase the recovery speed you can change these values live:

osd_max_backfills
osd_recovery_max_active

Choose carefully and only increase slowly as it can easily impact client I/O.
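
As a rough illustration of "increase slowly", here is a minimal Python sketch that only builds the command lines for the two options named above and does not run anything. It assumes the centralized config interface (`ceph config set osd <option> <value>`) is how you want to apply the change; the helper names and the example values are hypothetical and are not recommendations.

```python
from typing import List


def build_config_cmds(max_backfills: int, recovery_max_active: int) -> List[List[str]]:
    """Return argv lists for `ceph config set osd <option> <value>`.

    Nothing is executed here; the caller decides whether to run them.
    """
    if max_backfills < 1 or recovery_max_active < 1:
        raise ValueError("both values must be >= 1")
    return [
        ["ceph", "config", "set", "osd", "osd_max_backfills", str(max_backfills)],
        ["ceph", "config", "set", "osd", "osd_recovery_max_active", str(recovery_max_active)],
    ]


def ramp(start: int, stop: int) -> List[int]:
    """Values to try one at a time, raising the setting by 1 per step."""
    return list(range(start, stop + 1))


# Example values only (assumption): print the commands for one step.
for cmd in build_config_cmds(2, 4):
    print(" ".join(cmd))
```

Between steps, watch client I/O before moving to the next value.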

In the subject you write that they're stuck peering but you also write:

PGs seem to be slowly migrating from peering to activating but it's going very slowly - approx 10PGs during last hour.

So they're not stuck, correct?

Regards,
Eugen


Quoting m.sliwinski@xxxxx:

Hi

We have a weird issue with our Ceph cluster: almost all PGs assigned to one specific pool became stuck, locking out all operations without reporting any errors.
Story:
We have 3 different pools: hdd-backed, ssd-backed and nvme-backed.
The ssd pool worked fine for a few months.
Today one of the hosts assigned to the nvme pool restarted, triggering recovery in that pool. It went fast and the cluster returned to an OK state. During these events or shortly after them, the ssd pool became unresponsive: it was impossible to read from or write to it. We decided to slowly restart first the OSDs assigned to it and then, as that didn't help, all the mons, without breaking quorum of course. At this moment both the nvme and hdd pools are working fine, while the ssd one is stuck in recovery. All OSDs in the ssd pool use a large amount of CPU and are exchanging approx. 1 Mpps per OSD server with each other.

PGs seem to be slowly migrating from peering to activating but it's going very slowly - approx 10PGs during last hour.

We were running 14.2.2 OSDs when the issues happened; upgrading to 14.2.13 didn't help. We increased the heartbeat grace, but it didn't change anything. It doesn't seem to be a network problem, as the OSDs don't report problems connecting to the MONs or to each other. The other OSDs (nvme), connected to the same set of switches, work without issues.

Can you help? Can you point me to what I should check or do? I looked online and in the group for causes of peering issues and checked most of them, but nothing helped. I can't use 'ceph pg 28.1cc query' as it hangs, even for PGs that are marked as active+clean in the results of 'ceph pg dump'.

I checked the status of one of the stuck PGs via ceph-objectstore-tool --data-path [...] --op info --pgid 28.29d for all three copies and got:

{
"pgid": "28.29d",
"last_update": "68160'205094",
"last_complete": "68160'205094",
"log_tail": "68062'202000",
"last_user_version": 205094,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [
{
"start": "1",
"length": "3"
}
],
"history": {
"epoch_created": 67698,
"epoch_pool_created": 67698,
"last_epoch_started": 68871,
"last_interval_started": 68851,
"last_epoch_clean": 67746,
"last_interval_clean": 67745,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 69447,
"same_interval_since": 69447,
"same_primary_since": 69411,
"last_scrub": "68062'199623",
"last_scrub_stamp": "2020-11-03 03:32:46.895988",
"last_deep_scrub": "68062'177321",
"last_deep_scrub_stamp": "2020-11-02 01:07:15.963916",
"last_clean_scrub_stamp": "2020-11-03 03:32:46.895988"
},
"stats": {
"version": "68160'205094",
"reported_seq": "378496",
"reported_epoch": "69447",
"state": "peering",
"last_fresh": "2020-11-03 20:55:39.247348",
"last_change": "2020-11-03 20:55:39.247348",
"last_active": "2020-11-03 15:26:24.270088",
"last_peered": "2020-11-03 19:04:43.152655",
"last_clean": "2020-11-03 14:45:02.988293",
"last_became_active": "2020-09-01 13:52:40.091759",
"last_became_peered": "2020-11-03 19:04:42.939991",
"last_unstale": "2020-11-03 20:55:39.247348",
"last_undegraded": "2020-11-03 20:55:39.247348",
"last_fullsized": "2020-11-03 20:55:39.247348",
"mapping_epoch": 69447,
"log_start": "68062'202000",
"ondisk_log_start": "68062'202000",
"created": 67698,
"last_epoch_clean": 67746,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "68062'199623",
"last_scrub_stamp": "2020-11-03 03:32:46.895988",
"last_deep_scrub": "68062'177321",
"last_deep_scrub_stamp": "2020-11-02 01:07:15.963916",
"last_clean_scrub_stamp": "2020-11-03 03:32:46.895988",
"log_size": 3094,
"ondisk_log_size": 3094,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 15173849600,
"num_objects": 3647,
"num_object_clones": 0,
"num_object_copies": 10941,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 3647,
"num_whiteouts": 0,
"num_read": 172836,
"num_read_kb": 6824184,
"num_write": 196190,
"num_write_kb": 21380176,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0,
"num_omap_bytes": 0,
"num_omap_keys": 0,
"num_objects_repaired": 0
},
"up": [
261,
284,
271
],
"acting": [
261,
284,
271
],
"avail_no_missing": [],
"object_location_counts": [],
"blocked_by": [
271,
284
],
"up_primary": 261,
"acting_primary": 261,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 69422,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}




{
"pgid": "28.29d",
"last_update": "68160'205094",
"last_complete": "68160'205094",
"log_tail": "68062'202000",
"last_user_version": 205094,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [
{
"start": "1",
"length": "3"
}
],
"history": {
"epoch_created": 67698,
"epoch_pool_created": 67698,
"last_epoch_started": 68871,
"last_interval_started": 68851,
"last_epoch_clean": 67746,
"last_interval_clean": 67745,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 69630,
"same_interval_since": 69630,
"same_primary_since": 69628,
"last_scrub": "68062'199623",
"last_scrub_stamp": "2020-11-03 03:32:46.895988",
"last_deep_scrub": "68062'177321",
"last_deep_scrub_stamp": "2020-11-02 01:07:15.963916",
"last_clean_scrub_stamp": "2020-11-03 03:32:46.895988"
},
"stats": {
"version": "68160'205094",
"reported_seq": "378445",
"reported_epoch": "69627",
"state": "peering",
"last_fresh": "2020-11-03 21:15:08.819278",
"last_change": "2020-11-03 21:14:18.360957",
"last_active": "2020-11-03 15:26:24.270088",
"last_peered": "2020-11-03 19:04:43.152655",
"last_clean": "2020-11-03 14:45:02.988293",
"last_became_active": "2020-09-01 13:52:40.091759",
"last_became_peered": "2020-11-03 19:04:42.939991",
"last_unstale": "2020-11-03 21:15:08.819278",
"last_undegraded": "2020-11-03 21:15:08.819278",
"last_fullsized": "2020-11-03 21:15:08.819278",
"mapping_epoch": 69630,
"log_start": "68062'202000",
"ondisk_log_start": "68062'202000",
"created": 67698,
"last_epoch_clean": 67746,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "68062'199623",
"last_scrub_stamp": "2020-11-03 03:32:46.895988",
"last_deep_scrub": "68062'177321",
"last_deep_scrub_stamp": "2020-11-02 01:07:15.963916",
"last_clean_scrub_stamp": "2020-11-03 03:32:46.895988",
"log_size": 3094,
"ondisk_log_size": 3094,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 15173849600,
"num_objects": 3647,
"num_object_clones": 0,
"num_object_copies": 10941,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 3647,
"num_whiteouts": 0,
"num_read": 172836,
"num_read_kb": 6824184,
"num_write": 196190,
"num_write_kb": 21380176,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0,
"num_omap_bytes": 0,
"num_omap_keys": 0,
"num_objects_repaired": 0
},
"up": [
261,
284
],
"acting": [
261,
284
],
"avail_no_missing": [],
"object_location_counts": [],
"blocked_by": [
271
],
"up_primary": 261,
"acting_primary": 261,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 69392,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}




{
"pgid": "28.29d",
"last_update": "68160'205094",
"last_complete": "68160'205094",
"log_tail": "68062'202000",
"last_user_version": 205094,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [
{
"start": "1",
"length": "3"
}
],
"history": {
"epoch_created": 67698,
"epoch_pool_created": 67698,
"last_epoch_started": 68871,
"last_interval_started": 68851,
"last_epoch_clean": 67746,
"last_interval_clean": 67745,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 69411,
"same_interval_since": 69411,
"same_primary_since": 69411,
"last_scrub": "68062'199623",
"last_scrub_stamp": "2020-11-03 03:32:46.895988",
"last_deep_scrub": "68062'177321",
"last_deep_scrub_stamp": "2020-11-02 01:07:15.963916",
"last_clean_scrub_stamp": "2020-11-03 03:32:46.895988"
},
"stats": {
"version": "68070'205093",
"reported_seq": "378344",
"reported_epoch": "68160",
"state": "active+clean",
"last_fresh": "2020-11-03 14:45:02.988293",
"last_change": "2020-11-03 03:32:46.896044",
"last_active": "2020-11-03 14:45:02.988293",
"last_peered": "2020-11-03 14:45:02.988293",
"last_clean": "2020-11-03 14:45:02.988293",
"last_became_active": "2020-09-01 13:52:40.091759",
"last_became_peered": "2020-09-01 13:52:40.091759",
"last_unstale": "2020-11-03 14:45:02.988293",
"last_undegraded": "2020-11-03 14:45:02.988293",
"last_fullsized": "2020-11-03 14:45:02.988293",
"mapping_epoch": 69411,
"log_start": "68062'202000",
"ondisk_log_start": "68062'202000",
"created": 67698,
"last_epoch_clean": 67746,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "68062'199623",
"last_scrub_stamp": "2020-11-03 03:32:46.895988",
"last_deep_scrub": "68062'177321",
"last_deep_scrub_stamp": "2020-11-02 01:07:15.963916",
"last_clean_scrub_stamp": "2020-11-03 03:32:46.895988",
"log_size": 3093,
"ondisk_log_size": 3093,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 15173849600,
"num_objects": 3647,
"num_object_clones": 0,
"num_object_copies": 10941,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 3647,
"num_whiteouts": 0,
"num_read": 172836,
"num_read_kb": 6824184,
"num_write": 196190,
"num_write_kb": 21380176,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0,
"num_omap_bytes": 0,
"num_omap_keys": 0,
"num_objects_repaired": 0
},
"up": [
261,
284,
271
],
"acting": [
261,
284,
271
],
"avail_no_missing": [],
"object_location_counts": [],
"blocked_by": [],
"up_primary": 261,
"acting_primary": 261,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 67746,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}
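
The three dumps above are almost identical, so the differences are easy to miss. As a minimal sketch, assuming each dump has been parsed into a dict (for example with `json.loads`), the following Python lists only the selected fields whose values differ between copies. The field selection and the helper names are my own choice for illustration; the sample values are trimmed from the three dumps in the order shown, since the post does not say which OSD each copy came from.

```python
from typing import Any, Dict, List, Tuple

# Fields worth comparing across copies (an illustrative selection).
FIELDS: List[Tuple[str, ...]] = [
    ("last_update",),
    ("last_epoch_started",),
    ("history", "same_interval_since"),
    ("stats", "state"),
    ("stats", "reported_epoch"),
    ("stats", "up"),
    ("stats", "blocked_by"),
]


def pick(info: Dict[str, Any], path: Tuple[str, ...]) -> Any:
    """Walk a nested dict along the given key path."""
    node: Any = info
    for key in path:
        node = node[key]
    return node


def diff_copies(copies: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Return {dotted.field: [value per copy]} for fields that are not identical."""
    out: Dict[str, List[Any]] = {}
    for path in FIELDS:
        values = [pick(c, path) for c in copies]
        if any(v != values[0] for v in values[1:]):
            out[".".join(path)] = values
    return out


def trimmed(les, interval, state, epoch, up, blocked):
    """Build a cut-down copy of one dump with only the compared fields."""
    return {
        "last_update": "68160'205094",
        "last_epoch_started": les,
        "history": {"same_interval_since": interval},
        "stats": {"state": state, "reported_epoch": epoch,
                  "up": up, "blocked_by": blocked},
    }


copies = [
    trimmed(69422, 69447, "peering", "69447", [261, 284, 271], [271, 284]),
    trimmed(69392, 69630, "peering", "69627", [261, 284], [271]),
    trimmed(67746, 69411, "active+clean", "68160", [261, 284, 271], []),
]

for field, values in diff_copies(copies).items():
    print(field, values)
```

On these values, last_update is the same on all three copies, while last_epoch_started, same_interval_since, the reported state and blocked_by all differ.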



Current status of the cluster:
    Reduced data availability: 1021 pgs inactive, 999 pgs peering
    Degraded data redundancy: 18357/94939584 objects degraded (0.019%), 3 pgs degraded, 5 pgs undersized

  services:
    mon: 3 daemons, quorum monb01,monb02,monb03
    mgr: monb03(active), standbys: monb01, monb02
    osd: 285 osds: 284 up, 284 in

  data:
    pools:   9 pools, 9546 pgs
    objects: 31.65 M objects, 120 TiB
    usage:   363 TiB used, 127 TiB / 490 TiB avail
    pgs:     10.696% pgs not active
             18357/94939584 objects degraded (0.019%)
             8520 active+clean
             999  peering
             18   activating
             3    active+clean+scrubbing+deep
             2    activating+undersized+degraded
             2    activating+undersized
             1    active+clean+scrubbing
             1    active+undersized+degraded

  io:
    client:   367 MiB/s rd, 195 MiB/s wr, 24.51 kop/s rd, 5.95 kop/s wr
    cache:    24 MiB/s flush, 90 MiB/s evict, 23 op/s promote
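
The figures in this status output can be cross-checked with a little arithmetic. The Python sketch below recomputes the inactive PG count and percentages from the per-state counts listed above, and gives a rough estimate of how long the remaining PGs would take at the observed rate of about 10 PGs per hour. The constant-rate assumption is mine and is almost certainly too simple; the variable names are illustrative.

```python
# Per-state PG counts copied from the status output above.
pg_states = {
    "active+clean": 8520,
    "peering": 999,
    "activating": 18,
    "active+clean+scrubbing+deep": 3,
    "activating+undersized+degraded": 2,
    "activating+undersized": 2,
    "active+clean+scrubbing": 1,
    "active+undersized+degraded": 1,
}

total_pgs = sum(pg_states.values())

# A PG counts as inactive here if none of its '+'-separated flags is "active".
inactive_pgs = sum(
    count for state, count in pg_states.items()
    if "active" not in state.split("+")
)

inactive_pct = 100 * inactive_pgs / total_pgs
degraded_pct = 100 * 18357 / 94939584

# Assumption: the observed ~10 PGs/hour stays constant.
pgs_per_hour = 10
hours_remaining = inactive_pgs / pgs_per_hour

print(f"total PGs:       {total_pgs}")
print(f"inactive PGs:    {inactive_pgs} ({inactive_pct:.3f}%)")
print(f"degraded objs:   {degraded_pct:.3f}%")
print(f"hours remaining: {hours_remaining:.1f}")
```

The totals agree with the report (9546 PGs, 1021 inactive, 10.696% not active), and at 10 PGs per hour the remaining PGs would need roughly 102 hours, which is why the recovery feels stuck even though it is technically progressing.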
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

