On Tue, Feb 4, 2020 at 12:39, Rodrigo Severo - Fábrica <rodrigo@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> I have a rather small CephFS cluster with 3 machines right now, all of
> them sharing the MDS, MON, MGR and OSD roles.
>
> I had to move all machines to a new physical location and,
> unfortunately, I had to move all of them at the same time.
>
> They are already on again, but Ceph isn't accessible: all PGs are in
> the peering state and the OSDs keep going down and coming back up.
>
> Here is some info about my cluster:
>
> -------------------------------------------
> # ceph -s
>   cluster:
>     id:     e348b63c-d239-4a15-a2ce-32f29a00431c
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             1 MDSs report slow metadata IOs
>             2 osds down
>             1 host (2 osds) down
>             Reduced data availability: 324 pgs inactive, 324 pgs peering
>             7 daemons have recently crashed
>             10 slow ops, oldest one blocked for 206 sec, mon.a2-df has slow ops
>
>   services:
>     mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
>     mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
>     mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
>     osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
>     rgw: 1 daemon active (a2-df)
>
>   data:
>     pools:   7 pools, 324 pgs
>     objects: 850.25k objects, 744 GiB
>     usage:   2.3 TiB used, 14 TiB / 16 TiB avail
>     pgs:     100.000% pgs not active
>              324 peering
> -------------------------------------------
>
> -------------------------------------------
> # ceph osd df tree
> ID  CLASS     WEIGHT   REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
> -1            16.37366        - 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -        root default
> -10           16.37366        - 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -            datacenter df
>  -3            5.45799        - 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a1-df
>   3 hdd-slow   3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00   0   down             osd.3
>   0 hdd        1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB 1.1 TiB 41.43 3.00   0   down             osd.0
>  -5            5.45799        - 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a2-df
>   4 hdd-slow   3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up             osd.4
>   1 hdd        1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB 1.1 TiB 41.42 3.00 224     up             osd.1
>  -7            5.45767        - 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a3-df
>   5 hdd-slow   3.63869  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up             osd.5
>   2 hdd        1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB 1.1 TiB 41.43 3.00 224     up             osd.2
>                           TOTAL 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83
> MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
> -------------------------------------------
>
> At this exact moment both OSDs from server a1-df were down, but that's
> changing. Sometimes I have only one OSD down, but most of the time I
> have 2. And exactly which ones are actually down keeps changing.
>
> What should I do to get my cluster back up? Just wait?

I just found out that I have a few PGs "stuck peering":

-------------------------------------------
# ceph health detail | grep peering
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; 2 osds down; 1 host (2 osds) down; Reduced data availability: 324 pgs inactive, 324 pgs peering; 7 daemons have recently crashed; 80 slow ops, oldest one blocked for 33 sec, daemons [osd.0,osd.1] have slow ops.
PG_AVAILABILITY Reduced data availability: 324 pgs inactive, 324 pgs peering
    pg 1.39 is stuck peering for 14011.965915, current state peering, last acting [0,1]
    pg 1.3a is stuck peering for 14084.993947, current state peering, last acting [0,1]
    pg 1.3b is stuck peering for 14274.225311, current state peering, last acting [0,1]
    pg 1.3c is stuck peering for 15937.859532, current state peering, last acting [1,0]
    pg 1.3d is stuck peering for 15786.873447, current state peering, last acting [1,0]
    pg 1.3e is stuck peering for 15841.947891, current state peering, last acting [1,0]
    pg 1.3f is stuck peering for 15841.912853, current state peering, last acting [1,0]
    pg 1.40 is stuck peering for 14031.769901, current state peering, last acting [0,1]
    pg 1.41 is stuck peering for 14010.216124, current state peering, last acting [0,1]
    pg 1.42 is stuck peering for 15841.895446, current state peering, last acting [1,0]
    pg 1.43 is stuck peering for 15915.024413, current state peering, last acting [1,0]
    pg 1.44 is stuck peering for 13872.015272, current state peering, last acting [0,1]
    pg 1.45 is stuck peering for 15684.413850, current state peering, last acting [1,0]
    pg 1.46 is stuck peering for 15906.378461, current state peering, last acting [1,0]
    pg 1.47 is stuck peering for 14377.822032, current state peering, last acting [0,1]
    pg 1.48 is stuck peering for 14085.032316, current state peering, last acting [0,1]
    pg 1.49 is stuck peering for 14085.030366, current state peering, last acting [0,1]
    pg 1.4a is stuck peering for 14667.451862, current state peering, last acting [0,1]
    pg 1.4b is stuck peering for 14048.714764, current state peering, last acting [0,1]
    pg 1.4c is stuck peering for 13998.360919, current state peering, last acting [0,1]
    pg 1.4d is stuck peering for 15693.831021, current state peering, last acting [1,0]
    pg 2.38 is stuck peering for 15841.882464, current state peering, last acting [1,0]
    pg 2.39 is stuck peering for 15841.881968, current state peering, last acting [1,0]
    pg 2.3a is stuck peering for 14085.032520, current state peering, last acting [0,1]
    pg 2.3b is stuck inactive for 12717.975044, current state peering, last acting [0,1]
    pg 2.3c is stuck peering for 15841.947367, current state peering, last acting [1,0]
    pg 2.3d is stuck peering for 15732.221067, current state peering, last acting [1,0]
    pg 2.3e is stuck peering for 15938.007321, current state peering, last acting [0,1]
    pg 2.3f is stuck peering for 14084.992407, current state peering, last acting [0,1]
    pg 7.38 is stuck peering for 14080.942444, current state peering, last acting [3,4]
    pg 7.39 is stuck peering for 14048.869554, current state peering, last acting [3,4]
    pg 7.3a is stuck peering for 14048.869790, current state peering, last acting [3,4]
    pg 7.3b is stuck peering for 14080.943240, current state peering, last acting [3,4]
    pg 7.3c is stuck peering for 15842.114296, current state peering, last acting [4,3]
    pg 7.3d is stuck peering for 14048.870194, current state peering, last acting [3,4]
    pg 7.3e is stuck peering for 15842.105944, current state peering, last acting [4,3]
    pg 7.3f is stuck peering for 15842.111549, current state peering, last acting [4,3]
    pg 7.40 is stuck peering for 14048.869572, current state peering, last acting [3,4]
    pg 7.41 is stuck peering for 14048.868747, current state peering, last acting [3,4]
    pg 7.42 is stuck peering for 15845.175729, current state peering, last acting [4,3]
    pg 7.43 is stuck peering for 15842.105227, current state peering, last acting [4,3]
    pg 7.44 is stuck peering for 15845.196486, current state peering, last acting [4,3]
    pg 7.45 is stuck peering for 14048.869849, current state peering, last acting [3,4]
    pg 7.46 is stuck peering for 14080.942650, current state peering, last acting [3,4]
    pg 7.47 is stuck peering for 15845.197875, current state peering, last acting [4,3]
    pg 7.4a is stuck peering for 15842.113906, current state peering, last acting [4,3]
    pg 7.4b is stuck peering for 15845.197205, current state peering, last acting [4,3]
    pg 7.4c is stuck peering for 14048.869937, current state peering, last acting [3,4]
    pg 7.4d is stuck peering for 14048.869137, current state peering, last acting [3,4]
    pg 7.4e is stuck peering for 15842.111699, current state peering, last acting [4,3]
    pg 7.4f is stuck peering for 14080.943391, current state peering, last acting [3,4]
-------------------------------------------

Why is that? How can I fix it?


Rodrigo
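
P.S.: If it would help, I can post what one of the stuck PGs says about
its own peering state. This is roughly what I have in mind running next
(just a sketch, using pg 1.39 from the list above as the example):

-------------------------------------------
# Ask one stuck PG what its peering process is waiting on; the
# "recovery_state" section of the JSON output should show where peering
# is blocked and which OSDs the PG is still waiting to hear from.
ceph pg 1.39 query
-------------------------------------------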