Rodrigo;

Best bet would be to check logs.  Check the OSD logs on the affected server.  Check cluster logs on the MONs.  Check OSD logs on the other servers.

Your Ceph version(s) and your OS distribution and version would also be useful for troubleshooting this OSD flapping issue.

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
DHilsbos@xxxxxxxxxxxxxx
www.PerformAir.com


-----Original Message-----
From: Rodrigo Severo - Fábrica [mailto:rodrigo@xxxxxxxxxxxxxxxxxxx]
Sent: Tuesday, February 04, 2020 11:05 AM
To: Wesley Dillingham
Cc: ceph-users
Subject: Re: All pgs peering indefinitely

On Tue, Feb 4, 2020 at 14:54, Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx> wrote:
>
>
> I would guess that you have something preventing osd-to-osd communication on ports 6800-7300, or osd-to-mon communication on port 6789 and/or 3300.

The 3 servers are on the same subnet. They are connected to an unmanaged switch. And none of them has any firewall (iptables) rules blocking anything. They can ping one another.

Can you think of some other way that traffic could be blocked? Or some other test I could do to check for connectivity?

Regards,

Rodrigo
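
On the connectivity question, a quick sketch of checks that could be run between the hosts, assuming nc and ss are installed and the default Ceph ports Wes mentioned (the hostnames are the ones from the status output quoted below; adjust to the real addresses):

-------------------------------------------
# From a1-df, confirm the MON ports on the other hosts answer (msgr v2 and v1):
nc -zv a2-df 3300
nc -zv a2-df 6789
nc -zv a3-df 3300

# OSD daemons bind somewhere in the 6800-7300 range; list what the local OSDs
# are actually listening on, then probe a couple of those ports from the peers:
sudo ss -tlnp | grep ceph-osd
nc -zv a2-df 6800
-------------------------------------------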
>
>
> Respectfully,
>
> Wes Dillingham
> wes@xxxxxxxxxxxxxxxxx
> LinkedIn
>
>
> On Tue, Feb 4, 2020 at 12:44 PM Rodrigo Severo - Fábrica <rodrigo@xxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Tue, Feb 4, 2020 at 12:39, Rodrigo Severo - Fábrica
>> <rodrigo@xxxxxxxxxxxxxxxxxxx> wrote:
>> >
>> > Hi,
>> >
>> >
>> > I have a rather small CephFS cluster with 3 machines right now, all of
>> > them sharing MDS, MON, MGR and OSD roles.
>> >
>> > I had to move all machines to a new physical location and,
>> > unfortunately, I had to move all of them at the same time.
>> >
>> > They are already on again, but Ceph isn't accessible: all pgs are
>> > in peering state and the OSDs keep going down and up again.
>> >
>> > Here is some info about my cluster:
>> >
>> > -------------------------------------------
>> > # ceph -s
>> >   cluster:
>> >     id:     e348b63c-d239-4a15-a2ce-32f29a00431c
>> >     health: HEALTH_WARN
>> >             1 filesystem is degraded
>> >             1 MDSs report slow metadata IOs
>> >             2 osds down
>> >             1 host (2 osds) down
>> >             Reduced data availability: 324 pgs inactive, 324 pgs peering
>> >             7 daemons have recently crashed
>> >             10 slow ops, oldest one blocked for 206 sec, mon.a2-df has slow ops
>> >
>> >   services:
>> >     mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
>> >     mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
>> >     mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
>> >     osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
>> >     rgw: 1 daemon active (a2-df)
>> >
>> >   data:
>> >     pools:   7 pools, 324 pgs
>> >     objects: 850.25k objects, 744 GiB
>> >     usage:   2.3 TiB used, 14 TiB / 16 TiB avail
>> >     pgs:     100.000% pgs not active
>> >              324 peering
>> > -------------------------------------------
>> >
>> > -------------------------------------------
>> > # ceph osd df tree
>> > ID  CLASS    WEIGHT   REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
>> >  -1          16.37366        - 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -        root default
>> > -10          16.37366        - 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -        datacenter df
>> >  -3           5.45799        - 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -        host a1-df
>> >   3 hdd-slow  3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00   0   down osd.3
>> >   0 hdd       1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB 1.1 TiB 41.43 3.00   0   down osd.0
>> >  -5           5.45799        - 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -        host a2-df
>> >   4 hdd-slow  3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up osd.4
>> >   1 hdd       1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB 1.1 TiB 41.42 3.00 224     up osd.1
>> >  -7           5.45767        - 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -        host a3-df
>> >   5 hdd-slow  3.63869  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up osd.5
>> >   2 hdd       1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB 1.1 TiB 41.43 3.00 224     up osd.2
>> >                 TOTAL   16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83
>> > MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
>> > -------------------------------------------
>> >
>> > At this exact moment both OSDs from server a1-df were down, but that
>> > keeps changing. Sometimes I have only one OSD down, but most of the
>> > time I have 2. And exactly which ones are down keeps changing.
>> >
>> > What should I do to get my cluster back up? Just wait?
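
Following Dominic's suggestion above, the first places to look would be the OSD logs on the flapping host, the cluster log on a MON, and the recorded crashes. A sketch, assuming a package-based install with systemd units and the default log locations (adjust the OSD IDs and paths to the actual deployment):

-------------------------------------------
# On the affected host (a1-df in the output above), follow one of the flapping OSDs:
journalctl -u ceph-osd@0 --since "2 hours ago"
less /var/log/ceph/ceph-osd.0.log

# On a MON host, the cluster log plus the crashes reported by "ceph -s":
less /var/log/ceph/ceph.log
ceph crash ls
ceph crash info <crash-id>    # <crash-id> is a placeholder taken from "ceph crash ls"

# The version information Dominic asked for:
ceph versions
cat /etc/os-release
-------------------------------------------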
>>
>> I just found out that I have a few pgs "stuck peering":
>>
>> -------------------------------------------
>> # ceph health detail | grep peering
>> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs;
>> 2 osds down; 1 host (2 osds) down; Reduced data availability: 324 pgs
>> inactive, 324 pgs peering; 7 daemons have recently crashed; 80 slow
>> ops, oldest one blocked for 33 sec, daemons [osd.0,osd.1] have slow ops.
>> PG_AVAILABILITY Reduced data availability: 324 pgs inactive, 324 pgs peering
>>     pg 1.39 is stuck peering for 14011.965915, current state peering, last acting [0,1]
>>     pg 1.3a is stuck peering for 14084.993947, current state peering, last acting [0,1]
>>     pg 1.3b is stuck peering for 14274.225311, current state peering, last acting [0,1]
>>     pg 1.3c is stuck peering for 15937.859532, current state peering, last acting [1,0]
>>     pg 1.3d is stuck peering for 15786.873447, current state peering, last acting [1,0]
>>     pg 1.3e is stuck peering for 15841.947891, current state peering, last acting [1,0]
>>     pg 1.3f is stuck peering for 15841.912853, current state peering, last acting [1,0]
>>     pg 1.40 is stuck peering for 14031.769901, current state peering, last acting [0,1]
>>     pg 1.41 is stuck peering for 14010.216124, current state peering, last acting [0,1]
>>     pg 1.42 is stuck peering for 15841.895446, current state peering, last acting [1,0]
>>     pg 1.43 is stuck peering for 15915.024413, current state peering, last acting [1,0]
>>     pg 1.44 is stuck peering for 13872.015272, current state peering, last acting [0,1]
>>     pg 1.45 is stuck peering for 15684.413850, current state peering, last acting [1,0]
>>     pg 1.46 is stuck peering for 15906.378461, current state peering, last acting [1,0]
>>     pg 1.47 is stuck peering for 14377.822032, current state peering, last acting [0,1]
>>     pg 1.48 is stuck peering for 14085.032316, current state peering, last acting [0,1]
>>     pg 1.49 is stuck peering for 14085.030366, current state peering, last acting [0,1]
>>     pg 1.4a is stuck peering for 14667.451862, current state peering, last acting [0,1]
>>     pg 1.4b is stuck peering for 14048.714764, current state peering, last acting [0,1]
>>     pg 1.4c is stuck peering for 13998.360919, current state peering, last acting [0,1]
>>     pg 1.4d is stuck peering for 15693.831021, current state peering, last acting [1,0]
>>     pg 2.38 is stuck peering for 15841.882464, current state peering, last acting [1,0]
>>     pg 2.39 is stuck peering for 15841.881968, current state peering, last acting [1,0]
>>     pg 2.3a is stuck peering for 14085.032520, current state peering, last acting [0,1]
>>     pg 2.3b is stuck inactive for 12717.975044, current state peering, last acting [0,1]
>>     pg 2.3c is stuck peering for 15841.947367, current state peering, last acting [1,0]
>>     pg 2.3d is stuck peering for 15732.221067, current state peering, last acting [1,0]
>>     pg 2.3e is stuck peering for 15938.007321, current state peering, last acting [0,1]
>>     pg 2.3f is stuck peering for 14084.992407, current state peering, last acting [0,1]
>>     pg 7.38 is stuck peering for 14080.942444, current state peering, last acting [3,4]
>>     pg 7.39 is stuck peering for 14048.869554, current state peering, last acting [3,4]
>>     pg 7.3a is stuck peering for 14048.869790, current state peering, last acting [3,4]
>>     pg 7.3b is stuck peering for 14080.943240, current state peering, last acting [3,4]
>>     pg 7.3c is stuck peering for 15842.114296, current state peering, last acting [4,3]
>>     pg 7.3d is stuck peering for 14048.870194, current state peering, last acting [3,4]
>>     pg 7.3e is stuck peering for 15842.105944, current state peering, last acting [4,3]
>>     pg 7.3f is stuck peering for 15842.111549, current state peering, last acting [4,3]
>>     pg 7.40 is stuck peering for 14048.869572, current state peering, last acting [3,4]
>>     pg 7.41 is stuck peering for 14048.868747, current state peering, last acting [3,4]
>>     pg 7.42 is stuck peering for 15845.175729, current state peering, last acting [4,3]
>>     pg 7.43 is stuck peering for 15842.105227, current state peering, last acting [4,3]
>>     pg 7.44 is stuck peering for 15845.196486, current state peering, last acting [4,3]
>>     pg 7.45 is stuck peering for 14048.869849, current state peering, last acting [3,4]
>>     pg 7.46 is stuck peering for 14080.942650, current state peering, last acting [3,4]
>>     pg 7.47 is stuck peering for 15845.197875, current state peering, last acting [4,3]
>>     pg 7.4a is stuck peering for 15842.113906, current state peering, last acting [4,3]
>>     pg 7.4b is stuck peering for 15845.197205, current state peering, last acting [4,3]
>>     pg 7.4c is stuck peering for 14048.869937, current state peering, last acting [3,4]
>>     pg 7.4d is stuck peering for 14048.869137, current state peering, last acting [3,4]
>>     pg 7.4e is stuck peering for 15842.111699, current state peering, last acting [4,3]
>>     pg 7.4f is stuck peering for 14080.943391, current state peering, last acting [3,4]
>> -------------------------------------------
>>
>>
>> Why is that? How can I fix it?
>>
>>
>> Rodrigo
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
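
For the stuck-peering question above: querying one of the listed PGs usually shows, in its recovery_state section, which OSDs or which step peering is waiting on. A sketch using IDs taken from the listing (note that the query may hang if the PG's primary OSD happens to be down at that moment):

-------------------------------------------
# Ask one of the stuck PGs what it is blocked on:
ceph pg 1.39 query

# Cluster-wide view of which OSDs are blocking peering:
ceph osd blocked-by

# List only the inactive/stuck PGs:
ceph pg dump_stuck inactive
-------------------------------------------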
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
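
One more check that is sometimes worth doing after a physical move, in case an interface or address came up differently: confirm that the addresses the MONs and OSDs have registered in the cluster maps are ones the hosts can actually reach (a sketch; osd.0 is just an example ID from the thread):

-------------------------------------------
# Addresses the MONs advertise:
ceph mon dump

# Addresses each OSD registered (public and cluster network):
ceph osd dump | grep "^osd\."

# Host and address details for a single flapping OSD:
ceph osd find 0
-------------------------------------------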