Colleagues, I have an update. Since yesterday the ceph health situation has become much worse than before. We found that:

- "ceph -s" reports that some PGs are in the stale state;
- almost all diagnostic ceph subcommands hang. For example, "ceph osd ls", "ceph osd dump", "ceph osd tree" and "ceph health detail" still produce output, but "ceph osd status", all "ceph pg ..." commands and several others hang.

So it looks like the MDS daemon crashes were only the first sign of the problem.

I have read that the "stale" state for a PG means that all nodes storing that placement group may be down - but that is not the case here: all OSD daemons are up on all three nodes:

-------
ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
-1         68.05609  root default
-3         22.68536      host asrv-dev-stor-1
 0    hdd   5.45799          osd.0                 up   1.00000  1.00000
 1    hdd   5.45799          osd.1                 up   1.00000  1.00000
 2    hdd   5.45799          osd.2                 up   1.00000  1.00000
 3    hdd   5.45799          osd.3                 up   1.00000  1.00000
12    ssd   0.42670          osd.12                up   1.00000  1.00000
13    ssd   0.42670          osd.13                up   1.00000  1.00000
-5         22.68536      host asrv-dev-stor-2
 4    hdd   5.45799          osd.4                 up   1.00000  1.00000
 5    hdd   5.45799          osd.5                 up   1.00000  1.00000
 6    hdd   5.45799          osd.6                 up   1.00000  1.00000
 7    hdd   5.45799          osd.7                 up   1.00000  1.00000
14    ssd   0.42670          osd.14                up   1.00000  1.00000
15    ssd   0.42670          osd.15                up   1.00000  1.00000
-7         22.68536      host asrv-dev-stor-3
 8    hdd   5.45799          osd.8                 up   1.00000  1.00000
10    hdd   5.45799          osd.10                up   1.00000  1.00000
11    hdd   5.45799          osd.11                up   1.00000  1.00000
18    hdd   5.45799          osd.18                up   1.00000  1.00000
16    ssd   0.42670          osd.16                up   1.00000  1.00000
17    ssd   0.42670          osd.17                up   1.00000  1.00000
-------

Could it be a physical problem with our drives? "smartctl -a" reports nothing wrong. We have also started a surface check with dd, but that will take at least 7 hours per drive...

What else should we do?
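For reference, the drive checks mentioned above look roughly like this on our side (the device name is a placeholder and the exact dd flags may differ slightly from what we ran, so treat this as a sketch rather than an exact transcript):

-------
# SMART health of each drive - reports nothing wrong so far
smartctl -a /dev/sdX

# full-surface read check of one drive (at least 7 hours per drive)
dd if=/dev/sdX of=/dev/null bs=4M conv=noerror status=progress
-------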
The output of "ceph health detail": ceph health detail HEALTH_ERR 1 MDSs report damaged metadata; insufficient standby MDS daemons available; Reduced data availability: 50 pgs stale; 90 daemons have recently crashed; 3 mgr modules have recently crashed [ERR] MDS_DAMAGE: 1 MDSs report damaged metadata mds.asrv-dev-stor-2(mds.0): Metadata damage detected [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available have 0; want 1 more [WRN] PG_AVAILABILITY: Reduced data availability: 50 pgs stale pg 5.0 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,11] pg 5.13 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10] pg 5.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,2] pg 5.19 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,10] pg 5.1e is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11] pg 5.22 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18] pg 5.26 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,18] pg 5.29 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,6] pg 5.2b is stuck stale for 10h, current state stale+active+clean, last acting [0,18,6] pg 5.30 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7] pg 5.37 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,0] pg 5.3c is stuck stale for 67m, current state stale+active+clean, last acting [4,10,3] pg 5.43 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18] pg 5.44 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11] pg 5.45 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,3] pg 5.47 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,1] pg 5.48 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11] pg 5.60 is stuck stale for 10h, current state stale+active+clean, last acting [0,10,7] pg 7.2 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10] pg 7.4 is stuck stale for 67m, current state stale+active+clean, last acting [4,18,3] pg 7.f is stuck stale for 10h, current state stale+active+clean, last acting [0,4,8] pg 7.13 is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11] pg 7.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10] pg 7.1b is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0] pg 7.1f is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11] pg 7.2a is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8] pg 7.35 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10] pg 7.36 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,8] pg 7.37 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7] pg 7.38 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,11] pg 9.10 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,8] pg 9.16 is stuck stale for 10h, current state stale+active+clean, last acting [0,4,11] pg 9.20 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,8] pg 9.2a is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0] pg 9.33 is stuck stale for 10h, current state stale+active+clean, last acting [0,18,5] pg 9.3a 
is stuck stale for 10h, current state stale+active+clean, last acting [0,8,5] pg 9.48 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11] pg 9.4b is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11] pg 9.4f is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10] pg 9.52 is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0] pg 9.53 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,7] pg 9.56 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,18] pg 9.5a is stuck stale for 10h, current state stale+active+clean, last acting [0,7,8] pg 9.5d is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10] pg 9.6b is stuck stale for 67m, current state stale+active+clean, last acting [4,11,0] pg 9.6f is stuck stale for 67m, current state stale+active+clean, last acting [4,2,18] pg 9.73 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10] pg 9.76 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,2] pg 9.79 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8] pg 9.7f is stuck stale for 10h, current state stale+active+clean, last acting [0,10,5] _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx
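Since every stale PG above has osd.0 or osd.4 as its primary in the "last acting" set, and the mon-routed "ceph pg ..." commands hang, one check we could still do is to query those two daemons locally over their admin sockets, which does not go through the monitors. A rough sketch, assuming the default admin socket setup on the storage nodes:

-------
# run locally on asrv-dev-stor-1 and asrv-dev-stor-2 respectively;
# talks to the OSD via its local admin socket, bypassing the mons
ceph daemon osd.0 status
ceph daemon osd.4 status
-------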