Hello,

'almost all diagnostic ceph subcommands hang!' -> this rang a bell. We had a similar issue, with many ceph commands hanging, caused by a missing L3 ACL between the MGRs and a new MDS machine that we had added to the cluster. I second Eugen's analysis: it is a network issue, whatever the OSI layer.

Regards,
Frédéric.

----- On 26 Apr 24, at 9:31, Eugen Block eblock@xxxxxx wrote:

> Hi,
>
> it's unlikely that all OSDs fail at the same time; it seems like a
> network issue. Do you have an active MGR? Just a couple of days ago
> someone reported incorrect OSD stats because no MGR was up. Although
> your 'ceph health detail' output doesn't mention that, there are still
> issues when MGR processes are active according to ceph but no longer
> respond.
> I would probably start with basic network debugging, e.g. iperf and
> pings on the public and cluster networks (if present), and so on.
>
> Regards,
> Eugen
>
> Quoting Alexey GERASIMOV <alexey.gerasimov@xxxxxxxxxxxxxxx>:
>
>> Colleagues, I have an update.
>>
>> Since yesterday the ceph health situation has become much worse than
>> it was before. We found that:
>> - "ceph -s" reports that some PGs are in the stale state
>> - almost all diagnostic ceph subcommands hang! For example, "ceph
>> osd ls", "ceph osd dump", "ceph osd tree" and "ceph health detail"
>> still produce output, but "ceph osd status", all the "ceph pg ..."
>> commands and other ones hang.
>>
>> So it looks like the crashes of the MDS daemons were only the first
>> signs of the problem.
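[Editor's note: a minimal sketch of the basic network checks Eugen suggests above. The host names are the storage nodes mentioned later in this thread; substitute the addresses of your MON/MGR/OSD nodes on both the public and the cluster network, and iperf3 is assumed to be installed separately.]

```shell
#!/bin/sh
# Reachability sweep over the storage nodes (placeholder host names from
# this thread - replace with public- and cluster-network addresses).
hosts="asrv-dev-stor-1 asrv-dev-stor-2 asrv-dev-stor-3"
results=$(for h in $hosts; do
    if ping -c 3 -W 2 "$h" >/dev/null 2>&1; then
        echo "$h: reachable"
    else
        echo "$h: NOT reachable"
    fi
done)
echo "$results"

# For throughput or MTU problems, run 'iperf3 -s' on one node and then,
# from another node:
#   iperf3 -c asrv-dev-stor-2 -t 10
```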
>> I read that the "stale" state for a PG means that all nodes storing
>> that placement group may be down -- but that's wrong here: all OSD
>> daemons are up on all three nodes:
>>
>> ------- ceph osd tree
>> ID  CLASS  WEIGHT    TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
>> -1         68.05609  root default
>> -3         22.68536      host asrv-dev-stor-1
>>  0    hdd   5.45799          osd.0                 up   1.00000  1.00000
>>  1    hdd   5.45799          osd.1                 up   1.00000  1.00000
>>  2    hdd   5.45799          osd.2                 up   1.00000  1.00000
>>  3    hdd   5.45799          osd.3                 up   1.00000  1.00000
>> 12    ssd   0.42670          osd.12                up   1.00000  1.00000
>> 13    ssd   0.42670          osd.13                up   1.00000  1.00000
>> -5         22.68536      host asrv-dev-stor-2
>>  4    hdd   5.45799          osd.4                 up   1.00000  1.00000
>>  5    hdd   5.45799          osd.5                 up   1.00000  1.00000
>>  6    hdd   5.45799          osd.6                 up   1.00000  1.00000
>>  7    hdd   5.45799          osd.7                 up   1.00000  1.00000
>> 14    ssd   0.42670          osd.14                up   1.00000  1.00000
>> 15    ssd   0.42670          osd.15                up   1.00000  1.00000
>> -7         22.68536      host asrv-dev-stor-3
>>  8    hdd   5.45799          osd.8                 up   1.00000  1.00000
>> 10    hdd   5.45799          osd.10                up   1.00000  1.00000
>> 11    hdd   5.45799          osd.11                up   1.00000  1.00000
>> 18    hdd   5.45799          osd.18                up   1.00000  1.00000
>> 16    ssd   0.42670          osd.16                up   1.00000  1.00000
>> 17    ssd   0.42670          osd.17                up   1.00000  1.00000
>>
>> Could it be a physical problem with our drives? "smartctl -a"
>> reports nothing wrong. We have also started a surface check with
>> dd, but that will take at least 7 hours per drive...
>>
>> What else should we do?
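[Editor's note: one common form of the dd surface check mentioned above is a sequential read of the whole device; dd exits non-zero at the first read error, which, together with "smartctl -a" and the kernel log (dmesg), points at a bad sector. A sketch, with /dev/sdb as a placeholder device name:]

```shell
#!/bin/sh
# Read an entire block device (or file) sequentially and report errors.
# 'status=progress' is GNU dd; drop it on non-GNU systems.
surface_check() {
    dd if="$1" of=/dev/null bs=1M status=progress
}

# Example - reads the whole drive, several hours for a ~5.5 TB HDD:
#   surface_check /dev/sdb
```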
>>
>> The output of "ceph health detail":
>>
>> ceph health detail
>> HEALTH_ERR 1 MDSs report damaged metadata; insufficient standby MDS
>> daemons available; Reduced data availability: 50 pgs stale; 90
>> daemons have recently crashed; 3 mgr modules have recently crashed
>> [ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
>>     mds.asrv-dev-stor-2(mds.0): Metadata damage detected
>> [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
>>     have 0; want 1 more
>> [WRN] PG_AVAILABILITY: Reduced data availability: 50 pgs stale
>>     pg 5.0 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,11]
>>     pg 5.13 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10]
>>     pg 5.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,2]
>>     pg 5.19 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,10]
>>     pg 5.1e is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
>>     pg 5.22 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18]
>>     pg 5.26 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,18]
>>     pg 5.29 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,6]
>>     pg 5.2b is stuck stale for 10h, current state stale+active+clean, last acting [0,18,6]
>>     pg 5.30 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7]
>>     pg 5.37 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,0]
>>     pg 5.3c is stuck stale for 67m, current state stale+active+clean, last acting [4,10,3]
>>     pg 5.43 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18]
>>     pg 5.44 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11]
>>     pg 5.45 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,3]
>>     pg 5.47 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,1]
>>     pg 5.48 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11]
>>     pg 5.60 is stuck stale for 10h, current state stale+active+clean, last acting [0,10,7]
>>     pg 7.2 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10]
>>     pg 7.4 is stuck stale for 67m, current state stale+active+clean, last acting [4,18,3]
>>     pg 7.f is stuck stale for 10h, current state stale+active+clean, last acting [0,4,8]
>>     pg 7.13 is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
>>     pg 7.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10]
>>     pg 7.1b is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
>>     pg 7.1f is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11]
>>     pg 7.2a is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8]
>>     pg 7.35 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
>>     pg 7.36 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,8]
>>     pg 7.37 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7]
>>     pg 7.38 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,11]
>>     pg 9.10 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,8]
>>     pg 9.16 is stuck stale for 10h, current state stale+active+clean, last acting [0,4,11]
>>     pg 9.20 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,8]
>>     pg 9.2a is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
>>     pg 9.33 is stuck stale for 10h, current state stale+active+clean, last acting [0,18,5]
>>     pg 9.3a is stuck stale for 10h, current state stale+active+clean, last acting [0,8,5]
>>     pg 9.48 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11]
>>     pg 9.4b is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
>>     pg 9.4f is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
>>     pg 9.52 is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
>>     pg 9.53 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,7]
>>     pg 9.56 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,18]
>>     pg 9.5a is stuck stale for 10h, current state stale+active+clean, last acting [0,7,8]
>>     pg 9.5d is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
>>     pg 9.6b is stuck stale for 67m, current state stale+active+clean, last acting [4,11,0]
>>     pg 9.6f is stuck stale for 67m, current state stale+active+clean, last acting [4,2,18]
>>     pg 9.73 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10]
>>     pg 9.76 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,2]
>>     pg 9.79 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8]
>>     pg 9.7f is stuck stale for 10h, current state stale+active+clean, last acting [0,10,5]
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx