Rodrigo;

Best bet would be to check logs.  Check the OSD logs on the affected server.  Check cluster logs on the MONs.  Check OSD logs on the other servers.

Your Ceph version(s) and your OS distribution and version would also be useful for troubleshooting this OSD flapping issue.

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
DHilsbos@xxxxxxxxxxxxxx
www.PerformAir.com


-----Original Message-----
From: Rodrigo Severo - Fábrica [mailto:rodrigo@xxxxxxxxxxxxxxxxxxx]
Sent: Tuesday, February 04, 2020 11:05 AM
To: Wesley Dillingham
Cc: ceph-users
Subject: Re: All pgs peering indefinitely

On Tue, Feb 4, 2020 at 14:54, Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx> wrote:
>
>
> I would guess that you have something preventing osd-to-osd communication on ports 6800-7300, or osd-to-mon communication on port 6789 and/or 3300.

The 3 servers are on the same subnet. They are connected to an unmanaged switch. And none of them has any firewall (iptables) rules blocking anything. They can ping one another.

Can you think of some other way that traffic could be blocked? Or some other test I could do to check for connectivity?

Regards,

Rodrigo
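
On the connectivity question, a quick sketch of checks that could be run between the hosts, assuming nc and ss are installed and the default Ceph ports Wes mentioned (the hostnames are the ones from the status output quoted below; adjust to the real addresses):

-------------------------------------------
# From a1-df, confirm the MON ports on the other hosts answer (msgr v2 and v1):
nc -zv a2-df 3300
nc -zv a2-df 6789
nc -zv a3-df 3300

# OSD daemons bind somewhere in the 6800-7300 range; list what the local OSDs
# are actually listening on, then probe a couple of those ports from the peers:
sudo ss -tlnp | grep ceph-osd
nc -zv a2-df 6800
-------------------------------------------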
>
>
> Respectfully,
>
> Wes Dillingham
> wes@xxxxxxxxxxxxxxxxx
> LinkedIn
>
>
> On Tue, Feb 4, 2020 at 12:44 PM Rodrigo Severo - Fábrica <rodrigo@xxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Tue, Feb 4, 2020 at 12:39, Rodrigo Severo - Fábrica
>> <rodrigo@xxxxxxxxxxxxxxxxxxx> wrote:
>> >
>> > Hi,
>> >
>> >
>> > I have a rather small CephFS cluster with 3 machines right now, all of
>> > them sharing MDS, MON, MGR and OSD roles.
>> >
>> > I had to move all machines to a new physical location and,
>> > unfortunately, I had to move all of them at the same time.
>> >
>> > They are already on again, but Ceph isn't accessible: all pgs are
>> > in peering state and the OSDs keep going down and up again.
>> >
>> > Here is some info about my cluster:
>> >
>> > -------------------------------------------
>> > # ceph -s
>> >   cluster:
>> >     id:     e348b63c-d239-4a15-a2ce-32f29a00431c
>> >     health: HEALTH_WARN
>> >             1 filesystem is degraded
>> >             1 MDSs report slow metadata IOs
>> >             2 osds down
>> >             1 host (2 osds) down
>> >             Reduced data availability: 324 pgs inactive, 324 pgs peering
>> >             7 daemons have recently crashed
>> >             10 slow ops, oldest one blocked for 206 sec, mon.a2-df has slow ops
>> >
>> >   services:
>> >     mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
>> >     mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
>> >     mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
>> >     osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
>> >     rgw: 1 daemon active (a2-df)
>> >
>> >   data:
>> >     pools:   7 pools, 324 pgs
>> >     objects: 850.25k objects, 744 GiB
>> >     usage:   2.3 TiB used, 14 TiB / 16 TiB avail
>> >     pgs:     100.000% pgs not active
>> >              324 peering
>> > -------------------------------------------
>> >
>> > -------------------------------------------
>> > # ceph osd df tree
>> > ID  CLASS    WEIGHT   REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
>> >  -1          16.37366        - 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -        root default
>> > -10          16.37366        - 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -        datacenter df
>> >  -3           5.45799        - 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -        host a1-df
>> >   3 hdd-slow  3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00   0   down osd.3
>> >   0 hdd       1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB 1.1 TiB 41.43 3.00   0   down osd.0
>> >  -5           5.45799        - 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -        host a2-df
>> >   4 hdd-slow  3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up osd.4
>> >   1 hdd       1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB 1.1 TiB 41.42 3.00 224     up osd.1
>> >  -7           5.45767        - 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -        host a3-df
>> >   5 hdd-slow  3.63869  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up osd.5
>> >   2 hdd       1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB 1.1 TiB 41.43 3.00 224     up osd.2
>> >                 TOTAL   16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83
>> > MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
>> > -------------------------------------------
>> >
>> > At this exact moment both OSDs from server a1-df were down, but that
>> > keeps changing. Sometimes I have only one OSD down, but most of the
>> > time I have 2. And exactly which ones are down keeps changing.
>> >
>> > What should I do to get my cluster back up? Just wait?
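
Following Dominic's suggestion above, the first places to look would be the OSD logs on the flapping host, the cluster log on a MON, and the recorded crashes. A sketch, assuming a package-based install with systemd units and the default log locations (adjust the OSD IDs and paths to the actual deployment):

-------------------------------------------
# On the affected host (a1-df in the output above), follow one of the flapping OSDs:
journalctl -u ceph-osd@0 --since "2 hours ago"
less /var/log/ceph/ceph-osd.0.log

# On a MON host, the cluster log plus the crashes reported by "ceph -s":
less /var/log/ceph/ceph.log
ceph crash ls
ceph crash info <crash-id>    # <crash-id> is a placeholder taken from "ceph crash ls"

# The version information Dominic asked for:
ceph versions
cat /etc/os-release
-------------------------------------------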
>>
>> I just found out that I have a few pgs "stuck peering":
>>
>> -------------------------------------------
>> # ceph health detail | grep peering
>> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs;
>> 2 osds down; 1 host (2 osds) down; Reduced data availability: 324 pgs
>> inactive, 324 pgs peering; 7 daemons have recently crashed; 80 slow
>> ops, oldest one blocked for 33 sec, daemons [osd.0,osd.1] have slow ops.
>> PG_AVAILABILITY Reduced data availability: 324 pgs inactive, 324 pgs peering
>>     pg 1.39 is stuck peering for 14011.965915, current state peering, last acting [0,1]
>>     pg 1.3a is stuck peering for 14084.993947, current state peering, last acting [0,1]
>>     pg 1.3b is stuck peering for 14274.225311, current state peering, last acting [0,1]
>>     pg 1.3c is stuck peering for 15937.859532, current state peering, last acting [1,0]
>>     pg 1.3d is stuck peering for 15786.873447, current state peering, last acting [1,0]
>>     pg 1.3e is stuck peering for 15841.947891, current state peering, last acting [1,0]
>>     pg 1.3f is stuck peering for 15841.912853, current state peering, last acting [1,0]
>>     pg 1.40 is stuck peering for 14031.769901, current state peering, last acting [0,1]
>>     pg 1.41 is stuck peering for 14010.216124, current state peering, last acting [0,1]
>>     pg 1.42 is stuck peering for 15841.895446, current state peering, last acting [1,0]
>>     pg 1.43 is stuck peering for 15915.024413, current state peering, last acting [1,0]
>>     pg 1.44 is stuck peering for 13872.015272, current state peering, last acting [0,1]
>>     pg 1.45 is stuck peering for 15684.413850, current state peering, last acting [1,0]
>>     pg 1.46 is stuck peering for 15906.378461, current state peering, last acting [1,0]
>>     pg 1.47 is stuck peering for 14377.822032, current state peering, last acting [0,1]
>>     pg 1.48 is stuck peering for 14085.032316, current state peering, last acting [0,1]
>>     pg 1.49 is stuck peering for 14085.030366, current state peering, last acting [0,1]
>>     pg 1.4a is stuck peering for 14667.451862, current state peering, last acting [0,1]
>>     pg 1.4b is stuck peering for 14048.714764, current state peering, last acting [0,1]
>>     pg 1.4c is stuck peering for 13998.360919, current state peering, last acting [0,1]
>>     pg 1.4d is stuck peering for 15693.831021, current state peering, last acting [1,0]
>>     pg 2.38 is stuck peering for 15841.882464, current state peering, last acting [1,0]
>>     pg 2.39 is stuck peering for 15841.881968, current state peering, last acting [1,0]
>>     pg 2.3a is stuck peering for 14085.032520, current state peering, last acting [0,1]
>>     pg 2.3b is stuck inactive for 12717.975044, current state peering, last acting [0,1]
>>     pg 2.3c is stuck peering for 15841.947367, current state peering, last acting [1,0]
>>     pg 2.3d is stuck peering for 15732.221067, current state peering, last acting [1,0]
>>     pg 2.3e is stuck peering for 15938.007321, current state peering, last acting [0,1]
>>     pg 2.3f is stuck peering for 14084.992407, current state peering, last acting [0,1]
>>     pg 7.38 is stuck peering for 14080.942444, current state peering, last acting [3,4]
>>     pg 7.39 is stuck peering for 14048.869554, current state peering, last acting [3,4]
>>     pg 7.3a is stuck peering for 14048.869790, current state peering, last acting [3,4]
>>     pg 7.3b is stuck peering for 14080.943240, current state peering, last acting [3,4]
>>     pg 7.3c is stuck peering for 15842.114296, current state peering, last acting [4,3]
>>     pg 7.3d is stuck peering for 14048.870194, current state peering, last acting [3,4]
>>     pg 7.3e is stuck peering for 15842.105944, current state peering, last acting [4,3]
>>     pg 7.3f is stuck peering for 15842.111549, current state peering, last acting [4,3]
>>     pg 7.40 is stuck peering for 14048.869572, current state peering, last acting [3,4]
>>     pg 7.41 is stuck peering for 14048.868747, current state peering, last acting [3,4]
>>     pg 7.42 is stuck peering for 15845.175729, current state peering, last acting [4,3]
>>     pg 7.43 is stuck peering for 15842.105227, current state peering, last acting [4,3]
>>     pg 7.44 is stuck peering for 15845.196486, current state peering, last acting [4,3]
>>     pg 7.45 is stuck peering for 14048.869849, current state peering, last acting [3,4]
>>     pg 7.46 is stuck peering for 14080.942650, current state peering, last acting [3,4]
>>     pg 7.47 is stuck peering for 15845.197875, current state peering, last acting [4,3]
>>     pg 7.4a is stuck peering for 15842.113906, current state peering, last acting [4,3]
>>     pg 7.4b is stuck peering for 15845.197205, current state peering, last acting [4,3]
>>     pg 7.4c is stuck peering for 14048.869937, current state peering, last acting [3,4]
>>     pg 7.4d is stuck peering for 14048.869137, current state peering, last acting [3,4]
>>     pg 7.4e is stuck peering for 15842.111699, current state peering, last acting [4,3]
>>     pg 7.4f is stuck peering for 14080.943391, current state peering, last acting [3,4]
>> -------------------------------------------
>>
>>
>> Why is that? How can I fix it?
>>
>>
>> Rodrigo
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
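
For the stuck-peering question above: querying one of the listed PGs usually shows, in its recovery_state section, which OSDs or which step peering is waiting on. A sketch using IDs taken from the listing (note that the query may hang if the PG's primary OSD happens to be down at that moment):

-------------------------------------------
# Ask one of the stuck PGs what it is blocked on:
ceph pg 1.39 query

# Cluster-wide view of which OSDs are blocking peering:
ceph osd blocked-by

# List only the inactive/stuck PGs:
ceph pg dump_stuck inactive
-------------------------------------------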
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
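
One more check that is sometimes worth doing after a physical move, in case an interface or address came up differently: confirm that the addresses the MONs and OSDs have registered in the cluster maps are ones the hosts can actually reach (a sketch; osd.0 is just an example ID from the thread):

-------------------------------------------
# Addresses the MONs advertise:
ceph mon dump

# Addresses each OSD registered (public and cluster network):
ceph osd dump | grep "^osd\."

# Host and address details for a single flapping OSD:
ceph osd find 0
-------------------------------------------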