Colleagues, I have an update. Since yesterday the ceph health situation has become much worse than before. We found that:

- "ceph -s" reports that some PGs are in the stale state;
- almost all diagnostic ceph subcommands hang. For example, "ceph osd ls", "ceph osd dump", "ceph osd tree" and "ceph health detail" still produce output, but "ceph osd status", all "ceph pg ..." commands and several others hang.

So it looks like the MDS daemon crashes were only the first sign of the problem.

I have read that the "stale" state for a PG means that all nodes storing that placement group may be down - but that is not the case here: all OSD daemons are up on all three nodes:

-------
ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
-1         68.05609  root default
-3         22.68536      host asrv-dev-stor-1
 0    hdd   5.45799          osd.0                 up   1.00000  1.00000
 1    hdd   5.45799          osd.1                 up   1.00000  1.00000
 2    hdd   5.45799          osd.2                 up   1.00000  1.00000
 3    hdd   5.45799          osd.3                 up   1.00000  1.00000
12    ssd   0.42670          osd.12                up   1.00000  1.00000
13    ssd   0.42670          osd.13                up   1.00000  1.00000
-5         22.68536      host asrv-dev-stor-2
 4    hdd   5.45799          osd.4                 up   1.00000  1.00000
 5    hdd   5.45799          osd.5                 up   1.00000  1.00000
 6    hdd   5.45799          osd.6                 up   1.00000  1.00000
 7    hdd   5.45799          osd.7                 up   1.00000  1.00000
14    ssd   0.42670          osd.14                up   1.00000  1.00000
15    ssd   0.42670          osd.15                up   1.00000  1.00000
-7         22.68536      host asrv-dev-stor-3
 8    hdd   5.45799          osd.8                 up   1.00000  1.00000
10    hdd   5.45799          osd.10                up   1.00000  1.00000
11    hdd   5.45799          osd.11                up   1.00000  1.00000
18    hdd   5.45799          osd.18                up   1.00000  1.00000
16    ssd   0.42670          osd.16                up   1.00000  1.00000
17    ssd   0.42670          osd.17                up   1.00000  1.00000
-------

Could it be a physical problem with our drives? "smartctl -a" reports nothing wrong. We have also started a surface check with dd, but that will take at least 7 hours per drive...

What else should we do?
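For reference, the drive checks mentioned above look roughly like this on our side (the device name is a placeholder and the exact dd flags may differ slightly from what we ran, so treat this as a sketch rather than an exact transcript):

-------
# SMART health of each drive - reports nothing wrong so far
smartctl -a /dev/sdX

# full-surface read check of one drive (at least 7 hours per drive)
dd if=/dev/sdX of=/dev/null bs=4M conv=noerror status=progress
-------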
The output of "ceph health detail": ceph health detail HEALTH_ERR 1 MDSs report damaged metadata; insufficient standby MDS daemons available; Reduced data availability: 50 pgs stale; 90 daemons have recently crashed; 3 mgr modules have recently crashed [ERR] MDS_DAMAGE: 1 MDSs report damaged metadata mds.asrv-dev-stor-2(mds.0): Metadata damage detected [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available have 0; want 1 more [WRN] PG_AVAILABILITY: Reduced data availability: 50 pgs stale pg 5.0 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,11] pg 5.13 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10] pg 5.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,2] pg 5.19 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,10] pg 5.1e is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11] pg 5.22 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18] pg 5.26 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,18] pg 5.29 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,6] pg 5.2b is stuck stale for 10h, current state stale+active+clean, last acting [0,18,6] pg 5.30 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7] pg 5.37 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,0] pg 5.3c is stuck stale for 67m, current state stale+active+clean, last acting [4,10,3] pg 5.43 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18] pg 5.44 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11] pg 5.45 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,3] pg 5.47 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,1] pg 5.48 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11] pg 5.60 is stuck stale for 10h, current state stale+active+clean, last acting [0,10,7] pg 7.2 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10] pg 7.4 is stuck stale for 67m, current state stale+active+clean, last acting [4,18,3] pg 7.f is stuck stale for 10h, current state stale+active+clean, last acting [0,4,8] pg 7.13 is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11] pg 7.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10] pg 7.1b is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0] pg 7.1f is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11] pg 7.2a is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8] pg 7.35 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10] pg 7.36 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,8] pg 7.37 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7] pg 7.38 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,11] pg 9.10 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,8] pg 9.16 is stuck stale for 10h, current state stale+active+clean, last acting [0,4,11] pg 9.20 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,8] pg 9.2a is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0] pg 9.33 is stuck stale for 10h, current state stale+active+clean, last acting [0,18,5] pg 9.3a 
is stuck stale for 10h, current state stale+active+clean, last acting [0,8,5] pg 9.48 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11] pg 9.4b is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11] pg 9.4f is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10] pg 9.52 is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0] pg 9.53 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,7] pg 9.56 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,18] pg 9.5a is stuck stale for 10h, current state stale+active+clean, last acting [0,7,8] pg 9.5d is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10] pg 9.6b is stuck stale for 67m, current state stale+active+clean, last acting [4,11,0] pg 9.6f is stuck stale for 67m, current state stale+active+clean, last acting [4,2,18] pg 9.73 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10] pg 9.76 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,2] pg 9.79 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8] pg 9.7f is stuck stale for 10h, current state stale+active+clean, last acting [0,10,5] _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx
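Since every stale PG above has osd.0 or osd.4 as its primary in the "last acting" set, and the mon-routed "ceph pg ..." commands hang, one check we could still do is to query those two daemons locally over their admin sockets, which does not go through the monitors. A rough sketch, assuming the default admin socket setup on the storage nodes:

-------
# run locally on asrv-dev-stor-1 and asrv-dev-stor-2 respectively;
# talks to the OSD via its local admin socket, bypassing the mons
ceph daemon osd.0 status
ceph daemon osd.4 status
-------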