Re: MDS crash

Hi,

it's unlikely that all OSDs fail at the same time; this looks more like a network issue. Do you have an active MGR? Just a couple of days ago someone reported incorrect OSD stats because no MGR was up. Your 'ceph health detail' output doesn't mention that, but there have also been cases where MGR processes are active according to ceph but don't respond anymore.

I would probably start with basic network debugging, e.g. iperf and pings on the public and cluster networks (if present).
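For example, something along these lines (hostnames taken from your 'ceph osd tree' output, the IP addresses are placeholders):

ceph mgr stat                          # is there an active MGR at all?
iperf3 -s                              # on asrv-dev-stor-1
iperf3 -c asrv-dev-stor-1              # on asrv-dev-stor-2: raw throughput between the nodes
ping -c 10 <public IP of stor-1>       # reachability/latency on the public network
ping -c 10 <cluster IP of stor-1>      # and on the cluster network, if you have one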

Regards,
Eugen

Quoting Alexey GERASIMOV <alexey.gerasimov@xxxxxxxxxxxxxxx>:

Colleagues, I have the update.

Starting from yesterday the situation with ceph health is much worse than it was before.
We found that:
- "ceph -s" reports that some PGs are in the stale state
- almost all diagnostic ceph subcommands hang! For example, "ceph osd ls", "ceph osd dump", "ceph osd tree" and "ceph health detail" still produce output, but "ceph osd status", all "ceph pg ..." commands and others just hang.
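One way to bound such a test instead of letting it block the shell indefinitely (the timeout values here are arbitrary, just a sketch):

ceph --connect-timeout 15 pg stat      # give up on the monitor connection after 15 seconds
timeout 30 ceph osd status             # or wrap the whole command in a hard timeout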

So it looks like the crashes of the MDS daemons were only the first sign of the problem. I read that the "stale" state for PGs means that all OSDs storing the placement group may be down - but that is not the case here, all OSD daemons are up on all three nodes (one way to verify that they actually respond is sketched after the tree below):

------- ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
-1         68.05609  root default
-3         22.68536      host asrv-dev-stor-1
 0    hdd   5.45799          osd.0                 up   1.00000  1.00000
 1    hdd   5.45799          osd.1                 up   1.00000  1.00000
 2    hdd   5.45799          osd.2                 up   1.00000  1.00000
 3    hdd   5.45799          osd.3                 up   1.00000  1.00000
12    ssd   0.42670          osd.12                up   1.00000  1.00000
13    ssd   0.42670          osd.13                up   1.00000  1.00000
-5         22.68536      host asrv-dev-stor-2
 4    hdd   5.45799          osd.4                 up   1.00000  1.00000
 5    hdd   5.45799          osd.5                 up   1.00000  1.00000
 6    hdd   5.45799          osd.6                 up   1.00000  1.00000
 7    hdd   5.45799          osd.7                 up   1.00000  1.00000
14    ssd   0.42670          osd.14                up   1.00000  1.00000
15    ssd   0.42670          osd.15                up   1.00000  1.00000
-7         22.68536      host asrv-dev-stor-3
 8    hdd   5.45799          osd.8                 up   1.00000  1.00000
10    hdd   5.45799          osd.10                up   1.00000  1.00000
11    hdd   5.45799          osd.11                up   1.00000  1.00000
18    hdd   5.45799          osd.18                up   1.00000  1.00000
16    ssd   0.42670          osd.16                up   1.00000  1.00000
17    ssd   0.42670          osd.17                up   1.00000  1.00000
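One way to double-check that the OSDs are not just marked up but actually responding is to query them locally through their admin sockets, run on the host that carries the OSD (a sketch; OSD IDs taken from the tree above, and the systemd unit name assumes a package-based deployment, with cephadm it differs):

ceph daemon osd.0 status               # talks to the daemon directly, bypassing the monitors
systemctl status ceph-osd@0            # service state of the OSD on the host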

Could it be a physical problem with our drives? "smartctl -a" reports nothing wrong. We also started a surface check of the drives using dd, but it will take at least 7 hours per drive...
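For the drive check itself, a non-destructive read pass plus the drive's own extended self-test would look roughly like this (/dev/sdX is a placeholder):

dd if=/dev/sdX of=/dev/null bs=1M status=progress   # sequential read of the whole device, data is discarded
smartctl -t long /dev/sdX                           # starts the extended self-test in the drive firmware
smartctl -a /dev/sdX                                # check the self-test result later in this output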

What else should we do?

The output of  "ceph health detail":

ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata; insufficient standby MDS daemons available; Reduced data availability: 50 pgs stale; 90 daemons have recently crashed; 3 mgr modules have recently crashed
[ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
    mds.asrv-dev-stor-2(mds.0): Metadata damage detected
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[WRN] PG_AVAILABILITY: Reduced data availability: 50 pgs stale
    pg 5.0 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,11]
    pg 5.13 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10]
    pg 5.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,2]
    pg 5.19 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,10]
    pg 5.1e is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
    pg 5.22 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18]
    pg 5.26 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,18]
    pg 5.29 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,6]
    pg 5.2b is stuck stale for 10h, current state stale+active+clean, last acting [0,18,6]
    pg 5.30 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7]
    pg 5.37 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,0]
    pg 5.3c is stuck stale for 67m, current state stale+active+clean, last acting [4,10,3]
    pg 5.43 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18]
    pg 5.44 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11]
    pg 5.45 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,3]
    pg 5.47 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,1]
    pg 5.48 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11]
    pg 5.60 is stuck stale for 10h, current state stale+active+clean, last acting [0,10,7]
    pg 7.2 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10]
    pg 7.4 is stuck stale for 67m, current state stale+active+clean, last acting [4,18,3]
    pg 7.f is stuck stale for 10h, current state stale+active+clean, last acting [0,4,8]
    pg 7.13 is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
    pg 7.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10]
    pg 7.1b is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
    pg 7.1f is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11]
    pg 7.2a is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8]
    pg 7.35 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
    pg 7.36 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,8]
    pg 7.37 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7]
    pg 7.38 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,11]
    pg 9.10 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,8]
    pg 9.16 is stuck stale for 10h, current state stale+active+clean, last acting [0,4,11]
    pg 9.20 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,8]
    pg 9.2a is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
    pg 9.33 is stuck stale for 10h, current state stale+active+clean, last acting [0,18,5]
    pg 9.3a is stuck stale for 10h, current state stale+active+clean, last acting [0,8,5]
    pg 9.48 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11]
    pg 9.4b is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
    pg 9.4f is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
    pg 9.52 is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
    pg 9.53 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,7]
    pg 9.56 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,18]
    pg 9.5a is stuck stale for 10h, current state stale+active+clean, last acting [0,7,8]
    pg 9.5d is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
    pg 9.6b is stuck stale for 67m, current state stale+active+clean, last acting [4,11,0]
    pg 9.6f is stuck stale for 67m, current state stale+active+clean, last acting [4,2,18]
    pg 9.73 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10]
    pg 9.76 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,2]
    pg 9.79 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8]
    pg 9.7f is stuck stale for 10h, current state stale+active+clean, last acting [0,10,5]
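To see how the stale PGs are distributed over their acting primaries (the first OSD in each acting set), the health output can be piped through something like this (assuming it was saved to health.txt):

grep -oE 'last acting \[[0-9]+' health.txt | sort | uniq -c   # counts stale PGs per acting primary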


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

