Re: All pgs peering indefinitely

I would guess that you have something preventing OSD-to-OSD communication
on ports 6800-7300, or OSD-to-mon communication on ports 6789 and/or 3300.
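
For example (the hostnames below are placeholders for your actual mon and
OSD hosts), something like this, run from each OSD host, should quickly
show whether those ports are reachable:

  # mon ports (msgr v1 and v2)
  nc -zv <mon-host> 6789
  nc -zv <mon-host> 3300

  # a port in the OSD range on each of the other OSD hosts
  nc -zv <other-osd-host> 6800

  # and look for local firewall rules that might be dropping traffic
  iptables -L -n

If connectivity checks out everywhere, running "ceph pg <pgid> query" against
one of the stuck PGs (e.g. pg 1.39) should show what peering is waiting on.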


Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Tue, Feb 4, 2020 at 12:44 PM Rodrigo Severo - Fábrica <rodrigo@xxxxxxxxxxxxxxxxxxx> wrote:

> On Tue, Feb 4, 2020 at 12:39 PM Rodrigo Severo - Fábrica <rodrigo@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> >
> > I have a rather small CephFS cluster with 3 machines right now: all of
> > them sharing MDS, MON, MGR and OSD roles.
> >
> > I had to move all of the machines to a new physical location and,
> > unfortunately, I had to move them all at the same time.
> >
> > They are already powered on again, but Ceph isn't accessible: all pgs are
> > in the peering state and the OSDs keep going down and coming back up.
> >
> > Here is some info about my cluster:
> >
> > -------------------------------------------
> > # ceph -s
> >   cluster:
> >     id:     e348b63c-d239-4a15-a2ce-32f29a00431c
> >     health: HEALTH_WARN
> >             1 filesystem is degraded
> >             1 MDSs report slow metadata IOs
> >             2 osds down
> >             1 host (2 osds) down
> >             Reduced data availability: 324 pgs inactive, 324 pgs peering
> >             7 daemons have recently crashed
> >             10 slow ops, oldest one blocked for 206 sec, mon.a2-df has slow ops
> >
> >   services:
> >     mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
> >     mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
> >     mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
> >     osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
> >     rgw: 1 daemon active (a2-df)
> >
> >   data:
> >     pools:   7 pools, 324 pgs
> >     objects: 850.25k objects, 744 GiB
> >     usage:   2.3 TiB used, 14 TiB / 16 TiB avail
> >     pgs:     100.000% pgs not active
> >              324 peering
> > -------------------------------------------
> >
> > -------------------------------------------
> > # ceph osd df tree
> > ID  CLASS    WEIGHT   REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
> >  -1          16.37366        -  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -        root default
> > -10          16.37366        -  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -            datacenter df
> >  -3           5.45799        - 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a1-df
> >   3 hdd-slow  3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00   0   down             osd.3
> >   0      hdd  1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB 1.1 TiB 41.43 3.00   0   down             osd.0
> >  -5           5.45799        - 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a2-df
> >   4 hdd-slow  3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up             osd.4
> >   1      hdd  1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB 1.1 TiB 41.42 3.00 224     up             osd.1
> >  -7           5.45767        - 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a3-df
> >   5 hdd-slow  3.63869  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up             osd.5
> >   2      hdd  1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB 1.1 TiB 41.43 3.00 224     up             osd.2
> >                          TOTAL  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83
> > MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
> > -------------------------------------------
> >
> > At this exact moment both OSDs from server a1-df were down, but that keeps
> > changing. Sometimes I have only one OSD down, but most of the time I have
> > 2, and exactly which ones are down keeps changing.
> >
> > What should I do to get my cluster back up? Just wait?
>
> I just found out that I have a few pgs "stuck peering":
>
> -------------------------------------------
> # ceph health detail | grep peering
> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs;
> 2 osds down; 1 host (2 osds) down; Reduced data availability: 324 pgs
> inactive, 324 pgs peering; 7 daemons have recently crashed; 80 slow
> ops, oldest one blocked for 33 sec, daemons [osd.0,osd.1] have slow
> ops.
> PG_AVAILABILITY Reduced data availability: 324 pgs inactive, 324 pgs
> peering
>     pg 1.39 is stuck peering for 14011.965915, current state peering,
> last acting [0,1]
>     pg 1.3a is stuck peering for 14084.993947, current state peering,
> last acting [0,1]
>     pg 1.3b is stuck peering for 14274.225311, current state peering,
> last acting [0,1]
>     pg 1.3c is stuck peering for 15937.859532, current state peering,
> last acting [1,0]
>     pg 1.3d is stuck peering for 15786.873447, current state peering,
> last acting [1,0]
>     pg 1.3e is stuck peering for 15841.947891, current state peering,
> last acting [1,0]
>     pg 1.3f is stuck peering for 15841.912853, current state peering,
> last acting [1,0]
>     pg 1.40 is stuck peering for 14031.769901, current state peering,
> last acting [0,1]
>     pg 1.41 is stuck peering for 14010.216124, current state peering,
> last acting [0,1]
>     pg 1.42 is stuck peering for 15841.895446, current state peering,
> last acting [1,0]
>     pg 1.43 is stuck peering for 15915.024413, current state peering,
> last acting [1,0]
>     pg 1.44 is stuck peering for 13872.015272, current state peering,
> last acting [0,1]
>     pg 1.45 is stuck peering for 15684.413850, current state peering,
> last acting [1,0]
>     pg 1.46 is stuck peering for 15906.378461, current state peering,
> last acting [1,0]
>     pg 1.47 is stuck peering for 14377.822032, current state peering,
> last acting [0,1]
>     pg 1.48 is stuck peering for 14085.032316, current state peering,
> last acting [0,1]
>     pg 1.49 is stuck peering for 14085.030366, current state peering,
> last acting [0,1]
>     pg 1.4a is stuck peering for 14667.451862, current state peering,
> last acting [0,1]
>     pg 1.4b is stuck peering for 14048.714764, current state peering,
> last acting [0,1]
>     pg 1.4c is stuck peering for 13998.360919, current state peering,
> last acting [0,1]
>     pg 1.4d is stuck peering for 15693.831021, current state peering,
> last acting [1,0]
>     pg 2.38 is stuck peering for 15841.882464, current state peering,
> last acting [1,0]
>     pg 2.39 is stuck peering for 15841.881968, current state peering,
> last acting [1,0]
>     pg 2.3a is stuck peering for 14085.032520, current state peering,
> last acting [0,1]
>     pg 2.3b is stuck inactive for 12717.975044, current state peering,
> last acting [0,1]
>     pg 2.3c is stuck peering for 15841.947367, current state peering,
> last acting [1,0]
>     pg 2.3d is stuck peering for 15732.221067, current state peering,
> last acting [1,0]
>     pg 2.3e is stuck peering for 15938.007321, current state peering,
> last acting [0,1]
>     pg 2.3f is stuck peering for 14084.992407, current state peering,
> last acting [0,1]
>     pg 7.38 is stuck peering for 14080.942444, current state peering,
> last acting [3,4]
>     pg 7.39 is stuck peering for 14048.869554, current state peering,
> last acting [3,4]
>     pg 7.3a is stuck peering for 14048.869790, current state peering,
> last acting [3,4]
>     pg 7.3b is stuck peering for 14080.943240, current state peering,
> last acting [3,4]
>     pg 7.3c is stuck peering for 15842.114296, current state peering,
> last acting [4,3]
>     pg 7.3d is stuck peering for 14048.870194, current state peering,
> last acting [3,4]
>     pg 7.3e is stuck peering for 15842.105944, current state peering,
> last acting [4,3]
>     pg 7.3f is stuck peering for 15842.111549, current state peering,
> last acting [4,3]
>     pg 7.40 is stuck peering for 14048.869572, current state peering,
> last acting [3,4]
>     pg 7.41 is stuck peering for 14048.868747, current state peering,
> last acting [3,4]
>     pg 7.42 is stuck peering for 15845.175729, current state peering,
> last acting [4,3]
>     pg 7.43 is stuck peering for 15842.105227, current state peering,
> last acting [4,3]
>     pg 7.44 is stuck peering for 15845.196486, current state peering,
> last acting [4,3]
>     pg 7.45 is stuck peering for 14048.869849, current state peering,
> last acting [3,4]
>     pg 7.46 is stuck peering for 14080.942650, current state peering,
> last acting [3,4]
>     pg 7.47 is stuck peering for 15845.197875, current state peering,
> last acting [4,3]
>     pg 7.4a is stuck peering for 15842.113906, current state peering,
> last acting [4,3]
>     pg 7.4b is stuck peering for 15845.197205, current state peering,
> last acting [4,3]
>     pg 7.4c is stuck peering for 14048.869937, current state peering,
> last acting [3,4]
>     pg 7.4d is stuck peering for 14048.869137, current state peering,
> last acting [3,4]
>     pg 7.4e is stuck peering for 15842.111699, current state peering,
> last acting [4,3]
>     pg 7.4f is stuck peering for 14080.943391, current state peering,
> last acting [3,4]
> -------------------------------------------
>
>
> Why is that? How can I fix it?
>
>
> Rodrigo
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



