On Tue, Feb 4, 2020 at 12:39, Rodrigo Severo - Fábrica <rodrigo@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> I have a rather small CephFS cluster with 3 machines right now, all of
> them sharing the MDS, MON, MGR and OSD roles.
>
> I had to move all machines to a new physical location and,
> unfortunately, I had to move all of them at the same time.
>
> They are already on again, but Ceph isn't accessible: all PGs are in
> the peering state and the OSDs keep going down and coming back up.
>
> Here is some info about my cluster:
>
> -------------------------------------------
> # ceph -s
>   cluster:
>     id:     e348b63c-d239-4a15-a2ce-32f29a00431c
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             1 MDSs report slow metadata IOs
>             2 osds down
>             1 host (2 osds) down
>             Reduced data availability: 324 pgs inactive, 324 pgs peering
>             7 daemons have recently crashed
>             10 slow ops, oldest one blocked for 206 sec, mon.a2-df has slow ops
>
>   services:
>     mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
>     mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
>     mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
>     osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
>     rgw: 1 daemon active (a2-df)
>
>   data:
>     pools:   7 pools, 324 pgs
>     objects: 850.25k objects, 744 GiB
>     usage:   2.3 TiB used, 14 TiB / 16 TiB avail
>     pgs:     100.000% pgs not active
>              324 peering
> -------------------------------------------
>
> -------------------------------------------
> # ceph osd df tree
> ID  CLASS     WEIGHT   REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
> -1            16.37366        - 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -        root default
> -10           16.37366        - 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -            datacenter df
>  -3            5.45799        - 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a1-df
>   3 hdd-slow   3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00   0   down             osd.3
>   0 hdd        1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB 1.1 TiB 41.43 3.00   0   down             osd.0
>  -5            5.45799        - 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a2-df
>   4 hdd-slow   3.63899  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up             osd.4
>   1 hdd        1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB 1.1 TiB 41.42 3.00 224     up             osd.1
>  -7            5.45767        - 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a3-df
>   5 hdd-slow   3.63869  1.00000 3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up             osd.5
>   2 hdd        1.81898  1.00000 1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB 1.1 TiB 41.43 3.00 224     up             osd.2
>                           TOTAL 16 TiB  2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83
> MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
> -------------------------------------------
>
> At this exact moment both OSDs from server a1-df were down, but that's
> changing. Sometimes I have only one OSD down, but most of the time I
> have 2. And exactly which ones are actually down keeps changing.
>
> What should I do to get my cluster back up? Just wait?

I just found out that I have a few PGs "stuck peering":

-------------------------------------------
# ceph health detail | grep peering
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; 2 osds down; 1 host (2 osds) down; Reduced data availability: 324 pgs inactive, 324 pgs peering; 7 daemons have recently crashed; 80 slow ops, oldest one blocked for 33 sec, daemons [osd.0,osd.1] have slow ops.
PG_AVAILABILITY Reduced data availability: 324 pgs inactive, 324 pgs peering
    pg 1.39 is stuck peering for 14011.965915, current state peering, last acting [0,1]
    pg 1.3a is stuck peering for 14084.993947, current state peering, last acting [0,1]
    pg 1.3b is stuck peering for 14274.225311, current state peering, last acting [0,1]
    pg 1.3c is stuck peering for 15937.859532, current state peering, last acting [1,0]
    pg 1.3d is stuck peering for 15786.873447, current state peering, last acting [1,0]
    pg 1.3e is stuck peering for 15841.947891, current state peering, last acting [1,0]
    pg 1.3f is stuck peering for 15841.912853, current state peering, last acting [1,0]
    pg 1.40 is stuck peering for 14031.769901, current state peering, last acting [0,1]
    pg 1.41 is stuck peering for 14010.216124, current state peering, last acting [0,1]
    pg 1.42 is stuck peering for 15841.895446, current state peering, last acting [1,0]
    pg 1.43 is stuck peering for 15915.024413, current state peering, last acting [1,0]
    pg 1.44 is stuck peering for 13872.015272, current state peering, last acting [0,1]
    pg 1.45 is stuck peering for 15684.413850, current state peering, last acting [1,0]
    pg 1.46 is stuck peering for 15906.378461, current state peering, last acting [1,0]
    pg 1.47 is stuck peering for 14377.822032, current state peering, last acting [0,1]
    pg 1.48 is stuck peering for 14085.032316, current state peering, last acting [0,1]
    pg 1.49 is stuck peering for 14085.030366, current state peering, last acting [0,1]
    pg 1.4a is stuck peering for 14667.451862, current state peering, last acting [0,1]
    pg 1.4b is stuck peering for 14048.714764, current state peering, last acting [0,1]
    pg 1.4c is stuck peering for 13998.360919, current state peering, last acting [0,1]
    pg 1.4d is stuck peering for 15693.831021, current state peering, last acting [1,0]
    pg 2.38 is stuck peering for 15841.882464, current state peering, last acting [1,0]
    pg 2.39 is stuck peering for 15841.881968, current state peering, last acting [1,0]
    pg 2.3a is stuck peering for 14085.032520, current state peering, last acting [0,1]
    pg 2.3b is stuck inactive for 12717.975044, current state peering, last acting [0,1]
    pg 2.3c is stuck peering for 15841.947367, current state peering, last acting [1,0]
    pg 2.3d is stuck peering for 15732.221067, current state peering, last acting [1,0]
    pg 2.3e is stuck peering for 15938.007321, current state peering, last acting [0,1]
    pg 2.3f is stuck peering for 14084.992407, current state peering, last acting [0,1]
    pg 7.38 is stuck peering for 14080.942444, current state peering, last acting [3,4]
    pg 7.39 is stuck peering for 14048.869554, current state peering, last acting [3,4]
    pg 7.3a is stuck peering for 14048.869790, current state peering, last acting [3,4]
    pg 7.3b is stuck peering for 14080.943240, current state peering, last acting [3,4]
    pg 7.3c is stuck peering for 15842.114296, current state peering, last acting [4,3]
    pg 7.3d is stuck peering for 14048.870194, current state peering, last acting [3,4]
    pg 7.3e is stuck peering for 15842.105944, current state peering, last acting [4,3]
    pg 7.3f is stuck peering for 15842.111549, current state peering, last acting [4,3]
    pg 7.40 is stuck peering for 14048.869572, current state peering, last acting [3,4]
    pg 7.41 is stuck peering for 14048.868747, current state peering, last acting [3,4]
    pg 7.42 is stuck peering for 15845.175729, current state peering, last acting [4,3]
    pg 7.43 is stuck peering for 15842.105227, current state peering, last acting [4,3]
    pg 7.44 is stuck peering for 15845.196486, current state peering, last acting [4,3]
    pg 7.45 is stuck peering for 14048.869849, current state peering, last acting [3,4]
    pg 7.46 is stuck peering for 14080.942650, current state peering, last acting [3,4]
    pg 7.47 is stuck peering for 15845.197875, current state peering, last acting [4,3]
    pg 7.4a is stuck peering for 15842.113906, current state peering, last acting [4,3]
    pg 7.4b is stuck peering for 15845.197205, current state peering, last acting [4,3]
    pg 7.4c is stuck peering for 14048.869937, current state peering, last acting [3,4]
    pg 7.4d is stuck peering for 14048.869137, current state peering, last acting [3,4]
    pg 7.4e is stuck peering for 15842.111699, current state peering, last acting [4,3]
    pg 7.4f is stuck peering for 14080.943391, current state peering, last acting [3,4]
-------------------------------------------

Why is that? How can I fix it?


Rodrigo
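
P.S.: If it would help, I can post what one of the stuck PGs says about
its own peering state. This is roughly what I have in mind running next
(just a sketch, using pg 1.39 from the list above as the example):

-------------------------------------------
# Ask one stuck PG what its peering process is waiting on; the
# "recovery_state" section of the JSON output should show where peering
# is blocked and which OSDs the PG is still waiting to hear from.
ceph pg 1.39 query
-------------------------------------------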