recover ceph-mon

I have a test one-node Ceph cluster with 4 OSDs, set up under the assumption that a second node would be added just before production.
Linux 4.19.0-6-amd64 - Debian 10 - Ceph version 12.2.11
Unfortunately, the system drive failed before that happened.
I recovered the system from a full backup.
Since no changes had been made to the cluster configuration after that backup, I hoped it would just work.
For reasons I can't understand, the ceph status was OK for the first few seconds after boot (134 active+clean, 2 active+clean+scrubbing+deep), but a minute later the status changed to:
# ceph status
  cluster:
    id:     e02f2885-946b-46c8-91d5-146dd724ecaf
    health: HEALTH_WARN
            1 filesystem is degraded
            2 osds down
            1 slice (2 osds) down
            Reduced data availability: 136 pgs inactive, 15 pgs peering

  services:
    mon: 1 daemons, quorum rbd0
    mgr: rbd0(active)
    mds: fs-1/1/1 up  {0=rbd0=up:replay}
    osd: 5 osds: 1 up, 3 in

  data:
    pools:   2 pools, 136 pgs
    objects: 118.53k objects, 429GiB
    usage:   7.15TiB used, 3.77TiB / 10.9TiB avail
    pgs:     88.971% pgs unknown
             11.029% pgs not active
             121 unknown
             15  peering

# ceph osd dump
epoch 1983
fsid e02f2885-946b-46c8-91d5-146dd724ecaf
created 2019-08-16 15:14:07.783009
modified 2020-02-29 13:55:39.212461
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 27
full_ratio 0.97
backfillfull_ratio 0.94
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release luminous
pool 1 'fs_data' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 1595 flags hashpspool stripe_width 0 application cephfs
pool 2 'fs_meta' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 last_change 1595 flags hashpspool stripe_width 0 application cephfs
max_osd 5
osd.0 down out weight 0 up_from 1970 up_thru 1973 down_at 1975 last_clean_interval [1949,1963) 192.168.101.111:6806/440 192.168.101.111:6807/440 192.168.101.111:6808/440 192.168.101.111:6809/440 autoout,exists 78eaeb63-47c9-4962-b8ff-46607921f4f6
osd.1 down in  weight 1 up_from 1970 up_thru 1970 down_at 1975 last_clean_interval [1952,1963) 192.168.101.111:6801/439 192.168.101.111:6810/439 192.168.101.111:6811/439 192.168.101.111:6812/439 exists c4c4c85d-f537-4199-823b-b7ab01c78f03
osd.2 down in  weight 1 up_from 1969 up_thru 1975 down_at 1976 last_clean_interval [1946,1963) 192.168.101.111:6802/441 192.168.101.111:6803/441 192.168.101.111:6804/441 192.168.101.111:6805/441 exists bd66a9c3-bfa4-4352-816e-2e4cd86389f3
osd.3 down out weight 0 up_from 1617 up_thru 1619 down_at 1631 last_clean_interval [1602,1610) 192.168.101.111:6805/933 192.168.101.111:6806/933 192.168.101.111:6807/933 192.168.101.111:6808/933 exists f247115b-c6d5-49b1-9b0e-e799c50be379
osd.4 up   in  weight 1 up_from 1973 up_thru 1973 down_at 1972 last_clean_interval [1956,1963) 192.168.101.111:6813/442 192.168.101.111:6814/442 192.168.101.111:6815/442 192.168.101.111:6816/442 exists,up c208221e-1228-4247-a742-0c16ce01d38f
blacklist 192.168.101.111:6800/2636437603 expires 2020-03-01 13:26:01.809132


"ceph pg query" of any PG didn't response.

I can't find any errors in journalctl or in /var/log/ceph/*.
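In case I am simply looking in the wrong place, this is roughly where I checked (the unit and file names are my guesses based on the hostname rbd0 and the OSD ids):

# journalctl -b -u ceph-mon@rbd0 --no-pager
# journalctl -b -u ceph-osd@0 --no-pager     <- likewise for osd 1-4
# less /var/log/ceph/ceph-mon.rbd0.log
# less /var/log/ceph/ceph-osd.0.log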
I wonder why only osd.4 is up, what "autoout" means, why 15 PGs are stuck peering, where to look for more detailed information, and whether there is a way to restore the data.
Please help me understand what happened and how to restore the data, if that is possible.
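I can post the output of further commands if that would help; for example (all standard ceph CLI, and I am happy to run anything else suggested):

# ceph health detail
# ceph osd tree
# ceph pg dump_stuck inactive
# ceph daemon mon.rbd0 mon_status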