Hi,

I have a rather small CephFS cluster with 3 machines right now, all of
them sharing the MDS, MON, MGR and OSD roles. I had to move all machines
to a new physical location and, unfortunately, I had to move all of them
at the same time. They are already powered on again, but Ceph is not
accessible: all PGs are stuck in the peering state and the OSDs keep
going down and coming back up. Here is some info about my cluster:

-------------------------------------------
# ceph -s
  cluster:
    id:     e348b63c-d239-4a15-a2ce-32f29a00431c
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            2 osds down
            1 host (2 osds) down
            Reduced data availability: 324 pgs inactive, 324 pgs peering
            7 daemons have recently crashed
            10 slow ops, oldest one blocked for 206 sec, mon.a2-df has slow ops

  services:
    mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
    mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
    mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
    osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
    rgw: 1 daemon active (a2-df)

  data:
    pools:   7 pools, 324 pgs
    objects: 850.25k objects, 744 GiB
    usage:   2.3 TiB used, 14 TiB / 16 TiB avail
    pgs:     100.000% pgs not active
             324 peering
-------------------------------------------

-------------------------------------------
# ceph osd df tree
 ID  CLASS     WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1            16.37366         -   16 TiB  2.3 TiB  2.3 TiB  1.1 GiB  8.1 GiB   14 TiB  13.83  1.00    -          root default
-10            16.37366         -   16 TiB  2.3 TiB  2.3 TiB  1.1 GiB  8.1 GiB   14 TiB  13.83  1.00    -          datacenter df
 -3             5.45799         -  5.5 TiB  773 GiB  770 GiB  382 MiB  2.7 GiB  4.7 TiB  13.83  1.00    -          host a1-df
  3  hdd-slow   3.63899   1.00000  3.6 TiB  1.1 GiB   90 MiB      0 B    1 GiB  3.6 TiB   0.03  0.00    0    down      osd.3
  0  hdd        1.81898   1.00000  1.8 TiB  772 GiB  770 GiB  382 MiB  1.7 GiB  1.1 TiB  41.43  3.00    0    down      osd.0
 -5             5.45799         -  5.5 TiB  773 GiB  770 GiB  370 MiB  2.7 GiB  4.7 TiB  13.83  1.00    -          host a2-df
  4  hdd-slow   3.63899   1.00000  3.6 TiB  1.1 GiB   90 MiB      0 B    1 GiB  3.6 TiB   0.03  0.00  100      up      osd.4
  1  hdd        1.81898   1.00000  1.8 TiB  772 GiB  770 GiB  370 MiB  1.7 GiB  1.1 TiB  41.42  3.00  224      up      osd.1
 -7             5.45767         -  5.5 TiB  773 GiB  770 GiB  387 MiB  2.7 GiB  4.7 TiB  13.83  1.00    -          host a3-df
  5  hdd-slow   3.63869   1.00000  3.6 TiB  1.1 GiB   90 MiB      0 B    1 GiB  3.6 TiB   0.03  0.00  100      up      osd.5
  2  hdd        1.81898   1.00000  1.8 TiB  772 GiB  770 GiB  387 MiB  1.7 GiB  1.1 TiB  41.43  3.00  224      up      osd.2
                             TOTAL   16 TiB  2.3 TiB  2.3 TiB  1.1 GiB  8.1 GiB   14 TiB  13.83
MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
-------------------------------------------

At this exact moment both OSDs from server a1-df are down, but that keeps
changing: sometimes only one OSD is down, but most of the time it is two,
and exactly which ones are down keeps changing as well.

What should I do to get my cluster back up? Just wait?

Regards,

Rodrigo Severo
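
P.S. In case more detail helps, these are the commands I plan to run next
to gather diagnostics. Just a sketch: osd.0 is only an example of one of
the flapping OSDs, and the ceph-osd@ systemd unit name is from my
package-based install, so it may differ on other deployments.

-------------------------------------------
# Show which OSDs are marked down right now and the detailed
# health warnings behind HEALTH_WARN
ceph health detail

# Follow the log of one of the flapping OSDs to look for
# heartbeat or network errors around the moment it goes down
journalctl -u ceph-osd@0 -f

# Inspect the "7 daemons have recently crashed" reports
ceph crash ls
ceph crash info <crash-id>
-------------------------------------------

I can post any of that output here if it is useful.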