I managed to solve this problem.

To document the resolution: the firewall was blocking communication between the nodes. After disabling everything related to it and restarting the machine, everything went back to normal. (A sketch of checking the firewall and opening Ceph's ports, rather than disabling it outright, follows the quoted message below.)

On Tue, Nov 1, 2022 at 10:46, Murilo Morais <murilo@xxxxxxxxxxxxxx> wrote:

> Good morning everyone!
>
> Today there was an atypical situation in our cluster where the three
> machines shut down.
>
> On powering up, the cluster came back and formed quorum with no problems,
> but the PGs are all stuck peering and I don't see any disk activity on the
> machines. No PG is active.
>
> [ceph: root@dcs1 /]# ceph osd tree
> ID  CLASS  WEIGHT    TYPE NAME      STATUS  REWEIGHT  PRI-AFF
> -1         98.24359  root default
> -3         32.74786      host dcs1
>  0    hdd   2.72899          osd.0      up   1.00000  1.00000
>  1    hdd   2.72899          osd.1      up   1.00000  1.00000
>  2    hdd   2.72899          osd.2      up   1.00000  1.00000
>  3    hdd   2.72899          osd.3      up   1.00000  1.00000
>  4    hdd   2.72899          osd.4      up   1.00000  1.00000
>  5    hdd   2.72899          osd.5      up   1.00000  1.00000
>  6    hdd   2.72899          osd.6      up   1.00000  1.00000
>  7    hdd   2.72899          osd.7      up   1.00000  1.00000
>  8    hdd   2.72899          osd.8      up   1.00000  1.00000
>  9    hdd   2.72899          osd.9      up   1.00000  1.00000
> 10    hdd   2.72899          osd.10     up   1.00000  1.00000
> 11    hdd   2.72899          osd.11     up   1.00000  1.00000
> -5         32.74786      host dcs2
> 12    hdd   2.72899          osd.12     up   1.00000  1.00000
> 13    hdd   2.72899          osd.13     up   1.00000  1.00000
> 14    hdd   2.72899          osd.14     up   1.00000  1.00000
> 15    hdd   2.72899          osd.15     up   1.00000  1.00000
> 16    hdd   2.72899          osd.16     up   1.00000  1.00000
> 17    hdd   2.72899          osd.17     up   1.00000  1.00000
> 18    hdd   2.72899          osd.18     up   1.00000  1.00000
> 19    hdd   2.72899          osd.19     up   1.00000  1.00000
> 20    hdd   2.72899          osd.20     up   1.00000  1.00000
> 21    hdd   2.72899          osd.21     up   1.00000  1.00000
> 22    hdd   2.72899          osd.22     up   1.00000  1.00000
> 23    hdd   2.72899          osd.23     up   1.00000  1.00000
> -7         32.74786      host dcs3
> 24    hdd   2.72899          osd.24     up   1.00000  1.00000
> 25    hdd   2.72899          osd.25     up   1.00000  1.00000
> 26    hdd   2.72899          osd.26     up   1.00000  1.00000
> 27    hdd   2.72899          osd.27     up   1.00000  1.00000
> 28    hdd   2.72899          osd.28     up   1.00000  1.00000
> 29    hdd   2.72899          osd.29     up   1.00000  1.00000
> 30    hdd   2.72899          osd.30     up   1.00000  1.00000
> 31    hdd   2.72899          osd.31     up   1.00000  1.00000
> 32    hdd   2.72899          osd.32     up   1.00000  1.00000
> 33    hdd   2.72899          osd.33     up   1.00000  1.00000
> 34    hdd   2.72899          osd.34     up   1.00000  1.00000
> 35    hdd   2.72899          osd.35     up   1.00000  1.00000
>
> [ceph: root@dcs1 /]# ceph -s
>   cluster:
>     id:     58bbb950-538b-11ed-b237-2c59e53b80cc
>     health: HEALTH_WARN
>             4 filesystems are degraded
>             4 MDSs report slow metadata IOs
>             Reduced data availability: 1153 pgs inactive, 1101 pgs peering
>             26 slow ops, oldest one blocked for 563 sec, daemons
>             [osd.10,osd.13,osd.14,osd.15,osd.16,osd.18,osd.20,osd.21,osd.24,osd.25]...
>             have slow ops.
>
>   services:
>     mon: 3 daemons, quorum dcs1.evocorp,dcs2,dcs3 (age 7m)
>     mgr: dcs1.evocorp.kyqfcd(active, since 15m), standbys: dcs2.rirtyl
>     mds: 4/4 daemons up, 4 standby
>     osd: 36 osds: 36 up (since 6m), 36 in (since 47m); 65 remapped pgs
>
>   data:
>     volumes: 0/4 healthy, 4 recovering
>     pools:   10 pools, 1153 pgs
>     objects: 254.72k objects, 994 GiB
>     usage:   2.8 TiB used, 95 TiB / 98 TiB avail
>     pgs:     100.000% pgs not active
>              1036 peering
>              65   remapped+peering
>              52   activating
>
> [ceph: root@dcs1 /]# ceph health detail
> HEALTH_WARN 4 filesystems are degraded; 4 MDSs report slow metadata IOs;
> Reduced data availability: 1153 pgs inactive, 1101 pgs peering; 26 slow ops,
> oldest one blocked for 673 sec, daemons
> [osd.10,osd.13,osd.14,osd.15,osd.16,osd.18,osd.20,osd.21,osd.24,osd.25]...
> have slow ops.
> [WRN] FS_DEGRADED: 4 filesystems are degraded
>     fs dc_ovirt is degraded
>     fs dc_iso is degraded
>     fs dc_sas is degraded
>     fs pool_tester is degraded
> [WRN] MDS_SLOW_METADATA_IO: 4 MDSs report slow metadata IOs
>     mds.dc_sas.dcs1.wbyuik(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 1063 secs
>     mds.dc_ovirt.dcs1.lpcazs(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 1058 secs
>     mds.pool_tester.dcs1.ixkkfs(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 1058 secs
>     mds.dc_iso.dcs1.jxqqjd(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 1058 secs
> [WRN] PG_AVAILABILITY: Reduced data availability: 1153 pgs inactive, 1101 pgs peering
>     pg 6.c3 is stuck inactive for 50m, current state peering, last acting [30,15,11]
>     pg 6.c4 is stuck peering for 10h, current state peering, last acting [12,0,26]
>     pg 6.c5 is stuck peering for 10h, current state peering, last acting [12,32,6]
>     pg 6.c6 is stuck peering for 11h, current state peering, last acting [30,4,22]
>     pg 6.c7 is stuck peering for 10h, current state peering, last acting [4,14,26]
>     pg 6.c8 is stuck peering for 10h, current state peering, last acting [0,22,32]
>     pg 6.c9 is stuck peering for 11h, current state peering, last acting [32,20,0]
>     pg 6.ca is stuck peering for 11h, current state peering, last acting [31,0,23]
>     pg 6.cb is stuck peering for 10h, current state peering, last acting [8,35,16]
>     pg 6.cc is stuck peering for 10h, current state peering, last acting [8,24,13]
>     pg 6.cd is stuck peering for 10h, current state peering, last acting [15,25,1]
>     pg 6.ce is stuck peering for 11h, current state peering, last acting [27,23,4]
>     pg 6.cf is stuck peering for 11h, current state peering, last acting [25,4,20]
>     pg 7.c4 is stuck peering for 11m, current state remapped+peering, last acting [19,8]
>     pg 7.c5 is stuck peering for 10h, current state peering, last acting [6,14,32]
>     pg 7.c6 is stuck peering for 10h, current state peering, last acting [14,35,5]
>     pg 7.c7 is stuck peering for 10h, current state remapped+peering, last acting [11,14]
>     pg 7.c8 is stuck peering for 10h, current state peering, last acting [21,9,28]
>     pg 7.c9 is stuck peering for 10h, current state peering, last acting [0,30,15]
>     pg 7.ca is stuck peering for 10h, current state peering, last acting [23,2,26]
>     pg 7.cb is stuck peering for 10h, current state peering, last acting [23,9,24]
>     pg 7.cc is stuck peering for 10h, current state peering, last acting [23,27,0]
>     pg 7.cd is stuck peering for 11m, current state remapped+peering, last acting [13,6]
>     pg 7.ce is stuck peering for 10h, current state peering, last acting [16,1,25]
>     pg 7.cf is stuck peering for 11h, current state peering, last acting [24,16,8]
>     pg 9.c0 is stuck peering for 10h, current state peering, last acting [21,28]
>     pg 9.c1 is stuck peering for 10h, current state peering, last acting [12,31]
>     pg 9.c2 is stuck peering for 10h, current state peering, last acting [6,27]
>     pg 9.c3 is stuck peering for 10h, current state peering, last acting [9,27]
>     pg 9.c4 is stuck peering for 50m, current state peering, last acting [17,34]
>     pg 9.c5 is stuck peering for 11h, current state peering, last acting [31,8]
>     pg 9.c6 is stuck peering for 10h, current state peering, last acting [1,29]
>     pg 9.c7 is stuck peering for 10h, current state peering, last acting [12,30]
>     pg 9.c8 is stuck peering for 11h, current state peering, last acting [26,3]
>     pg 9.c9 is stuck peering for 11h, current state peering, last acting [29,13]
>     pg 9.ca is stuck peering for 11h, current state peering, last acting [25,6]
>     pg 9.cb is stuck peering for 10h, current state peering, last acting [16,9]
>     pg 9.cc is stuck peering for 4h, current state peering, last acting [4,29]
>     pg 10.c0 is stuck peering for 11h, current state peering, last acting [32,19]
>     pg 10.c1 is stuck peering for 10h, current state peering, last acting [23,6]
>     pg 10.c2 is stuck peering for 11h, current state peering, last acting [24,7]
>     pg 10.c3 is stuck peering for 38m, current state peering, last acting [5,20]
>     pg 10.c4 is stuck peering for 10h, current state peering, last acting [21,4]
>     pg 10.c5 is stuck peering for 10h, current state peering, last acting [12,8]
>     pg 10.c6 is stuck peering for 11h, current state peering, last acting [34,7]
>     pg 10.c7 is stuck peering for 10h, current state peering, last acting [17,30]
>     pg 10.c8 is stuck peering for 11h, current state peering, last acting [24,19]
>     pg 10.c9 is stuck inactive for 54m, current state activating, last acting [13,3]
>     pg 10.ca is stuck peering for 10h, current state peering, last acting [16,6]
>     pg 10.cb is stuck peering for 11h, current state peering, last acting [26,13]
>     pg 10.cf is stuck peering for 50m, current state peering, last acting [21,24]
> [WRN] SLOW_OPS: 26 slow ops, oldest one blocked for 673 sec, daemons [osd.10,osd.13,osd.14,osd.15,osd.16,osd.18,osd.20,osd.21,osd.24,osd.25]... have slow ops.
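For anyone who runs into the same symptom (every OSD up and in, but 100% of PGs stuck peering and no disk activity), here is a minimal sketch of the checks and the firewall change that avoids disabling the firewall entirely. It assumes the hosts run firewalld with the interfaces in the default public zone (as on EL-based distributions); adapt it if the nodes use ufw or plain nftables. The PG id is just one of the stuck PGs from the listing above.

# Ask the primary of a stuck PG why it is not going active; the
# recovery_state section of the output shows where peering is stuck,
# e.g. which peer OSDs the primary is still waiting to hear from.
ceph pg 6.c3 query

# Check whether a host firewall is active on each node.
systemctl status firewalld

# Instead of disabling the firewall, open the ports Ceph needs.
# firewalld ships service definitions for this: "ceph-mon" (3300/tcp and
# 6789/tcp) and "ceph" (6800-7300/tcp for OSDs, MDSs and MGRs).
firewall-cmd --zone=public --add-service=ceph-mon --permanent
firewall-cmd --zone=public --add-service=ceph --permanent
firewall-cmd --reload

# The PGs should then move from peering to active+clean.
ceph -s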