I would guess that you have something preventing OSD-to-OSD communication
on ports 6800-7300, or OSD-to-MON communication on port 6789 and/or 3300.
Two quick ways to check are sketched below the quoted message.

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Tue, Feb 4, 2020 at 12:44 PM Rodrigo Severo - Fábrica
<rodrigo@xxxxxxxxxxxxxxxxxxx> wrote:

> On Tue, Feb 4, 2020 at 12:39, Rodrigo Severo - Fábrica
> <rodrigo@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I have a rather small CephFS cluster with 3 machines right now, all of
> > them sharing the MDS, MON, MGR and OSD roles.
> >
> > I had to move all machines to a new physical location and,
> > unfortunately, I had to move all of them at the same time.
> >
> > They are already on again, but Ceph isn't accessible: all pgs are in
> > peering state and the OSDs keep going down and up again.
> >
> > Here is some info about my cluster:
> >
> > -------------------------------------------
> > # ceph -s
> >   cluster:
> >     id:     e348b63c-d239-4a15-a2ce-32f29a00431c
> >     health: HEALTH_WARN
> >             1 filesystem is degraded
> >             1 MDSs report slow metadata IOs
> >             2 osds down
> >             1 host (2 osds) down
> >             Reduced data availability: 324 pgs inactive, 324 pgs peering
> >             7 daemons have recently crashed
> >             10 slow ops, oldest one blocked for 206 sec, mon.a2-df has slow ops
> >
> >   services:
> >     mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
> >     mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
> >     mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
> >     osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
> >     rgw: 1 daemon active (a2-df)
> >
> >   data:
> >     pools:   7 pools, 324 pgs
> >     objects: 850.25k objects, 744 GiB
> >     usage:   2.3 TiB used, 14 TiB / 16 TiB avail
> >     pgs:     100.000% pgs not active
> >              324 peering
> > -------------------------------------------
> >
> > -------------------------------------------
> > # ceph osd df tree
> > ID  CLASS    WEIGHT   REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
> > -1           16.37366        - 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -        root default
> > -10          16.37366        - 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -            datacenter df
> >  -3           5.45799        - 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a1-df
> >   3 hdd-slow  3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00   0   down             osd.3
> >   0 hdd       1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB 1.1 TiB 41.43 3.00   0   down             osd.0
> >  -5           5.45799        - 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a2-df
> >   4 hdd-slow  3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up             osd.4
> >   1 hdd       1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB 1.1 TiB 41.42 3.00 224     up             osd.1
> >  -7           5.45767        - 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a3-df
> >   5 hdd-slow  3.63869  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up             osd.5
> >   2 hdd       1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB 1.1 TiB 41.43 3.00 224     up             osd.2
> >                  TOTAL 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83
> > MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
> > -------------------------------------------
> >
> > At this exact moment both OSDs from server a1-df are down, but that
> > keeps changing: sometimes I have only one OSD down, but most of the
> > time I have 2, and exactly which ones are actually down keeps changing.
> >
> > What should I do to get my cluster back up? Just wait?
>
> I just found out that I have a few pgs "stuck peering":
>
> -------------------------------------------
> # ceph health detail | grep peering
> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs;
> 2 osds down; 1 host (2 osds) down; Reduced data availability: 324 pgs
> inactive, 324 pgs peering; 7 daemons have recently crashed; 80 slow
> ops, oldest one blocked for 33 sec, daemons [osd.0,osd.1] have slow ops.
> PG_AVAILABILITY Reduced data availability: 324 pgs inactive, 324 pgs peering
>     pg 1.39 is stuck peering for 14011.965915, current state peering, last acting [0,1]
>     pg 1.3a is stuck peering for 14084.993947, current state peering, last acting [0,1]
>     pg 1.3b is stuck peering for 14274.225311, current state peering, last acting [0,1]
>     pg 1.3c is stuck peering for 15937.859532, current state peering, last acting [1,0]
>     pg 1.3d is stuck peering for 15786.873447, current state peering, last acting [1,0]
>     pg 1.3e is stuck peering for 15841.947891, current state peering, last acting [1,0]
>     pg 1.3f is stuck peering for 15841.912853, current state peering, last acting [1,0]
>     pg 1.40 is stuck peering for 14031.769901, current state peering, last acting [0,1]
>     pg 1.41 is stuck peering for 14010.216124, current state peering, last acting [0,1]
>     pg 1.42 is stuck peering for 15841.895446, current state peering, last acting [1,0]
>     pg 1.43 is stuck peering for 15915.024413, current state peering, last acting [1,0]
>     pg 1.44 is stuck peering for 13872.015272, current state peering, last acting [0,1]
>     pg 1.45 is stuck peering for 15684.413850, current state peering, last acting [1,0]
>     pg 1.46 is stuck peering for 15906.378461, current state peering, last acting [1,0]
>     pg 1.47 is stuck peering for 14377.822032, current state peering, last acting [0,1]
>     pg 1.48 is stuck peering for 14085.032316, current state peering, last acting [0,1]
>     pg 1.49 is stuck peering for 14085.030366, current state peering, last acting [0,1]
>     pg 1.4a is stuck peering for 14667.451862, current state peering, last acting [0,1]
>     pg 1.4b is stuck peering for 14048.714764, current state peering, last acting [0,1]
>     pg 1.4c is stuck peering for 13998.360919, current state peering, last acting [0,1]
>     pg 1.4d is stuck peering for 15693.831021, current state peering, last acting [1,0]
>     pg 2.38 is stuck peering for 15841.882464, current state peering, last acting [1,0]
>     pg 2.39 is stuck peering for 15841.881968, current state peering, last acting [1,0]
>     pg 2.3a is stuck peering for 14085.032520, current state peering, last acting [0,1]
>     pg 2.3b is stuck inactive for 12717.975044, current state peering, last acting [0,1]
>     pg 2.3c is stuck peering for 15841.947367, current state peering, last acting [1,0]
>     pg 2.3d is stuck peering for 15732.221067, current state peering, last acting [1,0]
>     pg 2.3e is stuck peering for 15938.007321, current state peering, last acting [0,1]
>     pg 2.3f is stuck peering for 14084.992407, current state peering, last acting [0,1]
>     pg 7.38 is stuck peering for 14080.942444, current state peering, last acting [3,4]
>     pg 7.39 is stuck peering for 14048.869554, current state peering, last acting [3,4]
>     pg 7.3a is stuck peering for 14048.869790, current state peering, last acting [3,4]
>     pg 7.3b is stuck peering for 14080.943240, current state peering, last acting [3,4]
>     pg 7.3c is stuck peering for 15842.114296, current state peering, last acting [4,3]
>     pg 7.3d is stuck peering for 14048.870194, current state peering, last acting [3,4]
>     pg 7.3e is stuck peering for 15842.105944, current state peering, last acting [4,3]
>     pg 7.3f is stuck peering for 15842.111549, current state peering, last acting [4,3]
>     pg 7.40 is stuck peering for 14048.869572, current state peering, last acting [3,4]
>     pg 7.41 is stuck peering for 14048.868747, current state peering, last acting [3,4]
>     pg 7.42 is stuck peering for 15845.175729, current state peering, last acting [4,3]
>     pg 7.43 is stuck peering for 15842.105227, current state peering, last acting [4,3]
>     pg 7.44 is stuck peering for 15845.196486, current state peering, last acting [4,3]
>     pg 7.45 is stuck peering for 14048.869849, current state peering, last acting [3,4]
>     pg 7.46 is stuck peering for 14080.942650, current state peering, last acting [3,4]
>     pg 7.47 is stuck peering for 15845.197875, current state peering, last acting [4,3]
>     pg 7.4a is stuck peering for 15842.113906, current state peering, last acting [4,3]
>     pg 7.4b is stuck peering for 15845.197205, current state peering, last acting [4,3]
>     pg 7.4c is stuck peering for 14048.869937, current state peering, last acting [3,4]
>     pg 7.4d is stuck peering for 14048.869137, current state peering, last acting [3,4]
>     pg 7.4e is stuck peering for 15842.111699, current state peering, last acting [4,3]
>     pg 7.4f is stuck peering for 14080.943391, current state peering, last acting [3,4]
> -------------------------------------------
>
>
> Why is that? How can I fix it?
>
>
> Rodrigo
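
First check -- is the traffic actually getting through between the hosts?
A minimal sketch, assuming the host names from the ceph osd df tree output
above, that netcat (nc) and ss are available on the nodes, and firewalld as
the firewall (adjust for plain iptables/nftables otherwise). Run it from
each host against the other two:

-------------------------------------------
# From a1-df against a2-df, for example:

# MON ports (msgr v1 and v2):
nc -zv a2-df 6789
nc -zv a2-df 3300

# See which ports the OSD daemons are actually listening on (run on a2-df):
ss -tlnp | grep ceph-osd

# Then probe a few ports in the default OSD range from the remote host:
for port in $(seq 6800 6810); do nc -zv a2-df "$port"; done

# And make sure no firewall rule is dropping the 6800-7300 range:
firewall-cmd --list-all
-------------------------------------------

If a port that an OSD is listening on is not reachable from the other
hosts, that would explain the OSDs flapping and the PGs never finishing
peering.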
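Second check -- if connectivity looks fine, ask one of the stuck PGs what it
is waiting for. The PG id below is just the first one from the list above;
the interesting part of the output is the recovery_state section, which
names the OSDs or events peering is blocked on:

-------------------------------------------
# Query the PG (the request is answered by its primary OSD):
ceph pg 1.39 query | less

# Compact overview of everything that is stuck:
ceph pg dump_stuck inactive
-------------------------------------------

If recovery_state shows the PG waiting on an OSD that the map reports as up,
that again points at a network problem between those daemons rather than a
bad disk.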