Hi, it's unlikely that all OSDs failed at the same time; this looks more
like a network issue. Do you have an active MGR? Just a couple of days ago
someone reported incorrect OSD stats because no MGR was up. Although
your 'ceph health detail' output doesn't mention that, there can still be
issues when MGR processes are active according to Ceph but no longer
respond.
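A quick way to verify is something like this (the restart command
depends on how the cluster is deployed, the example assumes cephadm):

   # show the active MGR and any standbys
   ceph mgr stat
   ceph -s | grep mgr

   # if the active MGR looks hung, restarting it is usually harmless
   ceph orch daemon restart mgr.<daemon-name>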
I would probably start with basic network debugging, e.g. iperf and
pings on both the public and the cluster network (if you have a separate
one), and so on.
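For example (assuming iperf3 is installed; the hostnames are just taken
from your 'ceph osd tree' output):

   # throughput between two storage nodes
   iperf3 -s                      # on asrv-dev-stor-2
   iperf3 -c asrv-dev-stor-2      # on asrv-dev-stor-1

   # reachability and MTU sanity check (8972 assumes MTU 9000,
   # use 1472 for a standard 1500 MTU); repeat on the cluster network IPs
   ping -c 5 asrv-dev-stor-2
   ping -M do -s 8972 -c 5 asrv-dev-stor-2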
Regards,
Eugen
Quoting Alexey GERASIMOV <alexey.gerasimov@xxxxxxxxxxxxxxx>:
Colleagues, I have an update.
Since yesterday the cluster health situation has become much worse than
it was before.
We found that
- "ceph -s" reports that some PGs are in the stale state
- almost all diagnostic ceph subcommands hang! For example, "ceph osd
ls", "ceph osd dump", "ceph osd tree" and "ceph health detail" still
produce output, but "ceph osd status", all "ceph pg ..." commands and
others hang.
So it looks like the crashes of the MDS daemons were only the first
sign of the problem.
I read that the "stale" state means that all OSDs storing a placement
group may be down - but that is not the case here: all OSD daemons are
up on all three nodes:
------- ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 68.05609 root default
-3 22.68536 host asrv-dev-stor-1
0 hdd 5.45799 osd.0 up 1.00000 1.00000
1 hdd 5.45799 osd.1 up 1.00000 1.00000
2 hdd 5.45799 osd.2 up 1.00000 1.00000
3 hdd 5.45799 osd.3 up 1.00000 1.00000
12 ssd 0.42670 osd.12 up 1.00000 1.00000
13 ssd 0.42670 osd.13 up 1.00000 1.00000
-5 22.68536 host asrv-dev-stor-2
4 hdd 5.45799 osd.4 up 1.00000 1.00000
5 hdd 5.45799 osd.5 up 1.00000 1.00000
6 hdd 5.45799 osd.6 up 1.00000 1.00000
7 hdd 5.45799 osd.7 up 1.00000 1.00000
14 ssd 0.42670 osd.14 up 1.00000 1.00000
15 ssd 0.42670 osd.15 up 1.00000 1.00000
-7 22.68536 host asrv-dev-stor-3
8 hdd 5.45799 osd.8 up 1.00000 1.00000
10 hdd 5.45799 osd.10 up 1.00000 1.00000
11 hdd 5.45799 osd.11 up 1.00000 1.00000
18 hdd 5.45799 osd.18 up 1.00000 1.00000
16 ssd 0.42670 osd.16 up 1.00000 1.00000
17 ssd 0.42670 osd.17 up 1.00000 1.00000
Could it be a physical problem with our drives? "smartctl -a" reports
nothing wrong. We also started a surface check with dd, but it will
take at least 7 hours per drive...
What else should we do?
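For reference, the surface check we are running is roughly the
following (the device name is just an example):

   # read every sector and discard the data; any unreadable sector
   # shows up as an I/O error
   dd if=/dev/sdb of=/dev/null bs=1M status=progress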
The output of "ceph health detail":
ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata; insufficient standby MDS
daemons available; Reduced data availability: 50 pgs stale; 90
daemons have recently crashed; 3 mgr modules have recently crashed
[ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
mds.asrv-dev-stor-2(mds.0): Metadata damage detected
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
have 0; want 1 more
[WRN] PG_AVAILABILITY: Reduced data availability: 50 pgs stale
pg 5.0 is stuck stale for 67m, current state stale+active+clean,
last acting [4,1,11]
pg 5.13 is stuck stale for 67m, current state
stale+active+clean, last acting [4,0,10]
pg 5.18 is stuck stale for 67m, current state
stale+active+clean, last acting [4,11,2]
pg 5.19 is stuck stale for 67m, current state
stale+active+clean, last acting [4,3,10]
pg 5.1e is stuck stale for 10h, current state
stale+active+clean, last acting [0,7,11]
pg 5.22 is stuck stale for 10h, current state
stale+active+clean, last acting [0,6,18]
pg 5.26 is stuck stale for 67m, current state
stale+active+clean, last acting [4,1,18]
pg 5.29 is stuck stale for 10h, current state
stale+active+clean, last acting [0,11,6]
pg 5.2b is stuck stale for 10h, current state
stale+active+clean, last acting [0,18,6]
pg 5.30 is stuck stale for 10h, current state
stale+active+clean, last acting [0,8,7]
pg 5.37 is stuck stale for 67m, current state
stale+active+clean, last acting [4,10,0]
pg 5.3c is stuck stale for 67m, current state
stale+active+clean, last acting [4,10,3]
pg 5.43 is stuck stale for 10h, current state
stale+active+clean, last acting [0,6,18]
pg 5.44 is stuck stale for 67m, current state
stale+active+clean, last acting [4,2,11]
pg 5.45 is stuck stale for 67m, current state
stale+active+clean, last acting [4,11,3]
pg 5.47 is stuck stale for 67m, current state
stale+active+clean, last acting [4,10,1]
pg 5.48 is stuck stale for 10h, current state
stale+active+clean, last acting [0,5,11]
pg 5.60 is stuck stale for 10h, current state
stale+active+clean, last acting [0,10,7]
pg 7.2 is stuck stale for 67m, current state stale+active+clean,
last acting [4,2,10]
pg 7.4 is stuck stale for 67m, current state stale+active+clean,
last acting [4,18,3]
pg 7.f is stuck stale for 10h, current state stale+active+clean,
last acting [0,4,8]
pg 7.13 is stuck stale for 10h, current state
stale+active+clean, last acting [0,7,11]
pg 7.18 is stuck stale for 67m, current state
stale+active+clean, last acting [4,0,10]
pg 7.1b is stuck stale for 67m, current state
stale+active+clean, last acting [4,8,0]
pg 7.1f is stuck stale for 10h, current state
stale+active+clean, last acting [0,5,11]
pg 7.2a is stuck stale for 10h, current state
stale+active+clean, last acting [0,6,8]
pg 7.35 is stuck stale for 10h, current state
stale+active+clean, last acting [0,6,10]
pg 7.36 is stuck stale for 67m, current state
stale+active+clean, last acting [4,2,8]
pg 7.37 is stuck stale for 10h, current state
stale+active+clean, last acting [0,8,7]
pg 7.38 is stuck stale for 10h, current state
stale+active+clean, last acting [0,6,11]
pg 9.10 is stuck stale for 67m, current state
stale+active+clean, last acting [4,0,8]
pg 9.16 is stuck stale for 10h, current state
stale+active+clean, last acting [0,4,11]
pg 9.20 is stuck stale for 67m, current state
stale+active+clean, last acting [4,3,8]
pg 9.2a is stuck stale for 67m, current state
stale+active+clean, last acting [4,8,0]
pg 9.33 is stuck stale for 10h, current state
stale+active+clean, last acting [0,18,5]
pg 9.3a is stuck stale for 10h, current state
stale+active+clean, last acting [0,8,5]
pg 9.48 is stuck stale for 67m, current state
stale+active+clean, last acting [4,2,11]
pg 9.4b is stuck stale for 10h, current state
stale+active+clean, last acting [0,7,11]
pg 9.4f is stuck stale for 10h, current state
stale+active+clean, last acting [0,6,10]
pg 9.52 is stuck stale for 67m, current state
stale+active+clean, last acting [4,8,0]
pg 9.53 is stuck stale for 10h, current state
stale+active+clean, last acting [0,11,7]
pg 9.56 is stuck stale for 10h, current state
stale+active+clean, last acting [0,5,18]
pg 9.5a is stuck stale for 10h, current state
stale+active+clean, last acting [0,7,8]
pg 9.5d is stuck stale for 10h, current state
stale+active+clean, last acting [0,6,10]
pg 9.6b is stuck stale for 67m, current state
stale+active+clean, last acting [4,11,0]
pg 9.6f is stuck stale for 67m, current state
stale+active+clean, last acting [4,2,18]
pg 9.73 is stuck stale for 67m, current state
stale+active+clean, last acting [4,2,10]
pg 9.76 is stuck stale for 67m, current state
stale+active+clean, last acting [4,10,2]
pg 9.79 is stuck stale for 10h, current state
stale+active+clean, last acting [0,6,8]
pg 9.7f is stuck stale for 10h, current state
stale+active+clean, last acting [0,10,5]
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx