Hello,

'almost all diagnostic ceph subcommands hang!' -> this rang a bell. We had a similar issue, with many ceph commands hanging, caused by a missing L3 ACL between the MGRs and a new MDS machine that we had added to the cluster. I second Eugen's analysis: it is a network issue, whatever the OSI layer.

Regards,
Frédéric.

----- On 26 Apr 24, at 9:31, Eugen Block eblock@xxxxxx wrote:

> Hi,
>
> it's unlikely that all OSDs fail at the same time; it seems like a
> network issue. Do you have an active MGR? Just a couple of days ago
> someone reported incorrect OSD stats because no MGR was up. Although
> your 'ceph health detail' output doesn't mention that, there are still
> issues when MGR processes are active according to ceph but no longer
> respond.
> I would probably start with basic network debugging, e.g. iperf and
> pings on the public and cluster networks (if present), and so on.
>
> Regards,
> Eugen
>
> Quoting Alexey GERASIMOV <alexey.gerasimov@xxxxxxxxxxxxxxx>:
>
>> Colleagues, I have an update.
>>
>> Since yesterday the ceph health situation has become much worse than
>> it was before. We found that:
>> - "ceph -s" reports that some PGs are in the stale state
>> - almost all diagnostic ceph subcommands hang! For example, "ceph
>> osd ls", "ceph osd dump", "ceph osd tree" and "ceph health detail"
>> still produce output, but "ceph osd status", all the "ceph pg ..."
>> commands and other ones hang.
>>
>> So it looks like the crashes of the MDS daemons were only the first
>> signs of the problem.
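[Editor's note: a minimal sketch of the basic network checks Eugen suggests above. The host names are the storage nodes mentioned later in this thread; substitute the addresses of your MON/MGR/OSD nodes on both the public and the cluster network, and iperf3 is assumed to be installed separately.]

```shell
#!/bin/sh
# Reachability sweep over the storage nodes (placeholder host names from
# this thread - replace with public- and cluster-network addresses).
hosts="asrv-dev-stor-1 asrv-dev-stor-2 asrv-dev-stor-3"
results=$(for h in $hosts; do
    if ping -c 3 -W 2 "$h" >/dev/null 2>&1; then
        echo "$h: reachable"
    else
        echo "$h: NOT reachable"
    fi
done)
echo "$results"

# For throughput or MTU problems, run 'iperf3 -s' on one node and then,
# from another node:
#   iperf3 -c asrv-dev-stor-2 -t 10
```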
>> I read that the "stale" state for a PG means that all nodes storing
>> that placement group may be down -- but that's wrong here: all OSD
>> daemons are up on all three nodes:
>>
>> ------- ceph osd tree
>> ID  CLASS  WEIGHT    TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
>> -1         68.05609  root default
>> -3         22.68536      host asrv-dev-stor-1
>>  0    hdd   5.45799          osd.0                 up   1.00000  1.00000
>>  1    hdd   5.45799          osd.1                 up   1.00000  1.00000
>>  2    hdd   5.45799          osd.2                 up   1.00000  1.00000
>>  3    hdd   5.45799          osd.3                 up   1.00000  1.00000
>> 12    ssd   0.42670          osd.12                up   1.00000  1.00000
>> 13    ssd   0.42670          osd.13                up   1.00000  1.00000
>> -5         22.68536      host asrv-dev-stor-2
>>  4    hdd   5.45799          osd.4                 up   1.00000  1.00000
>>  5    hdd   5.45799          osd.5                 up   1.00000  1.00000
>>  6    hdd   5.45799          osd.6                 up   1.00000  1.00000
>>  7    hdd   5.45799          osd.7                 up   1.00000  1.00000
>> 14    ssd   0.42670          osd.14                up   1.00000  1.00000
>> 15    ssd   0.42670          osd.15                up   1.00000  1.00000
>> -7         22.68536      host asrv-dev-stor-3
>>  8    hdd   5.45799          osd.8                 up   1.00000  1.00000
>> 10    hdd   5.45799          osd.10                up   1.00000  1.00000
>> 11    hdd   5.45799          osd.11                up   1.00000  1.00000
>> 18    hdd   5.45799          osd.18                up   1.00000  1.00000
>> 16    ssd   0.42670          osd.16                up   1.00000  1.00000
>> 17    ssd   0.42670          osd.17                up   1.00000  1.00000
>>
>> Could it be a physical problem with our drives? "smartctl -a"
>> reports nothing wrong. We have also started a surface check with
>> dd, but that will take at least 7 hours per drive...
>>
>> What else should we do?
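[Editor's note: one common form of the dd surface check mentioned above is a sequential read of the whole device; dd exits non-zero at the first read error, which, together with "smartctl -a" and the kernel log (dmesg), points at a bad sector. A sketch, with /dev/sdb as a placeholder device name:]

```shell
#!/bin/sh
# Read an entire block device (or file) sequentially and report errors.
# 'status=progress' is GNU dd; drop it on non-GNU systems.
surface_check() {
    dd if="$1" of=/dev/null bs=1M status=progress
}

# Example - reads the whole drive, several hours for a ~5.5 TB HDD:
#   surface_check /dev/sdb
```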
>>
>> The output of "ceph health detail":
>>
>> ceph health detail
>> HEALTH_ERR 1 MDSs report damaged metadata; insufficient standby MDS
>> daemons available; Reduced data availability: 50 pgs stale; 90
>> daemons have recently crashed; 3 mgr modules have recently crashed
>> [ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
>>     mds.asrv-dev-stor-2(mds.0): Metadata damage detected
>> [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
>>     have 0; want 1 more
>> [WRN] PG_AVAILABILITY: Reduced data availability: 50 pgs stale
>>     pg 5.0 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,11]
>>     pg 5.13 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10]
>>     pg 5.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,2]
>>     pg 5.19 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,10]
>>     pg 5.1e is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
>>     pg 5.22 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18]
>>     pg 5.26 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,18]
>>     pg 5.29 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,6]
>>     pg 5.2b is stuck stale for 10h, current state stale+active+clean, last acting [0,18,6]
>>     pg 5.30 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7]
>>     pg 5.37 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,0]
>>     pg 5.3c is stuck stale for 67m, current state stale+active+clean, last acting [4,10,3]
>>     pg 5.43 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18]
>>     pg 5.44 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11]
>>     pg 5.45 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,3]
>>     pg 5.47 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,1]
>>     pg 5.48 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11]
>>     pg 5.60 is stuck stale for 10h, current state stale+active+clean, last acting [0,10,7]
>>     pg 7.2 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10]
>>     pg 7.4 is stuck stale for 67m, current state stale+active+clean, last acting [4,18,3]
>>     pg 7.f is stuck stale for 10h, current state stale+active+clean, last acting [0,4,8]
>>     pg 7.13 is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
>>     pg 7.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10]
>>     pg 7.1b is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
>>     pg 7.1f is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11]
>>     pg 7.2a is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8]
>>     pg 7.35 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
>>     pg 7.36 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,8]
>>     pg 7.37 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7]
>>     pg 7.38 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,11]
>>     pg 9.10 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,8]
>>     pg 9.16 is stuck stale for 10h, current state stale+active+clean, last acting [0,4,11]
>>     pg 9.20 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,8]
>>     pg 9.2a is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
>>     pg 9.33 is stuck stale for 10h, current state stale+active+clean, last acting [0,18,5]
>>     pg 9.3a is stuck stale for 10h, current state stale+active+clean, last acting [0,8,5]
>>     pg 9.48 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11]
>>     pg 9.4b is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
>>     pg 9.4f is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
>>     pg 9.52 is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
>>     pg 9.53 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,7]
>>     pg 9.56 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,18]
>>     pg 9.5a is stuck stale for 10h, current state stale+active+clean, last acting [0,7,8]
>>     pg 9.5d is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
>>     pg 9.6b is stuck stale for 67m, current state stale+active+clean, last acting [4,11,0]
>>     pg 9.6f is stuck stale for 67m, current state stale+active+clean, last acting [4,2,18]
>>     pg 9.73 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10]
>>     pg 9.76 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,2]
>>     pg 9.79 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8]
>>     pg 9.7f is stuck stale for 10h, current state stale+active+clean, last acting [0,10,5]
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx