You can start by posting more details: at least the output of
"ceph osd tree", "cat ceph.conf" and "ceph osd df", so we can see what
settings you are running and how your cluster is balanced at the moment.
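A minimal sketch for collecting that output into one file you can paste into a reply. The function name collect_ceph_info, the output file name and the config path /etc/ceph/ceph.conf are assumptions; adjust them to your setup.

```shell
# Hypothetical helper: gather the three requested outputs into one file.
# Assumes the ceph CLI is on PATH; the config path defaults to
# /etc/ceph/ceph.conf (an assumption, pass another path as the 2nd argument).
collect_ceph_info() {
    out="${1:-ceph-info.txt}"
    conf="${2:-/etc/ceph/ceph.conf}"
    {
        echo "== ceph osd tree =="
        ceph osd tree
        echo "== ceph osd df =="
        ceph osd df
        echo "== $conf =="
        cat "$conf"
    } > "$out" 2>&1
    echo "wrote $out"
}

# Usage: collect_ceph_info /tmp/ceph-info.txt
```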
Generally:
Inconsistent PGs are PGs that have scrub errors. Use
"rados list-inconsistent-pg [pool]" and "rados list-inconsistent-obj [pg]"
to locate the objects with problems. Compare and fix the objects using the
info from
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent
Also read http://ceph.com/geen-categorie/ceph-manually-repair-object/
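The two rados commands can be chained roughly like this. It is only a sketch: the function name list_inconsistent is made up, and it assumes list-inconsistent-pg prints a JSON array of PG ids such as ["0.a","0.5"].

```shell
# Hypothetical helper: list every inconsistent PG in a pool, then dump the
# inconsistent objects of each one. Assumes the rados CLI is on PATH.
list_inconsistent() {
    pool="$1"
    # Strip the JSON punctuation to get one PG id per line.
    rados list-inconsistent-pg "$pool" | tr -d '[]" ' | tr ',' '\n' |
    while read -r pgid; do
        [ -n "$pgid" ] || continue
        echo "== pg $pgid =="
        rados list-inconsistent-obj "$pgid" --format=json-pretty < /dev/null
    done
}

# Usage: list_inconsistent rbd    (rbd is just an example pool name)
```

Work out which replica holds the bad copy before running "ceph pg repair [pg]"; the two links above describe how.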
Since you have so many scrub errors, I would assume there are more bad
disks. Check the SMART values of all disks and look for read errors in the
logs. If you find any, drain those disks by setting their crush weight to
0, and remove them from the cluster once they are empty. Personally I use
smartmontools, which sends me emails about bad disks, and I check disks
manually with: smartctl -a /dev/sda || echo bad-disk: $?
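The exit status that the one-liner prints is a bitmask (see the EXIT STATUS section of the smartctl man page). Below is a small sketch that decodes it and loops over several disks. The function names are made up and the message texts are my own wording, not smartctl's.

```shell
# Decode a smartctl exit status (a bitmask). Prints one line per set bit.
decode_smartctl_status() {
    status="$1"
    [ "$status" -eq 0 ] && { echo "ok"; return 0; }
    bit=0
    for msg in \
        "command line did not parse" \
        "device open failed" \
        "SMART command failed or checksum error" \
        "SMART status: DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $msg"
        fi
        bit=$((bit + 1))
    done
}

# Hypothetical helper: run smartctl on each device given and report the
# ones that return a non-zero status.
check_disks() {
    for dev in "$@"; do
        rc=0
        smartctl -a "$dev" > /dev/null 2>&1 || rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "bad-disk: $dev (exit $rc)"
            decode_smartctl_status "$rc"
        fi
    done
}

# Usage: check_disks /dev/sd?
# Draining the OSD on a failing disk (osd.7 is a made-up id):
#   ceph osd crush reweight osd.7 0
```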
PGs that are down+peering need to have one of the acting OSDs started
again, or to have the objects recovered using the methods we have
discussed previously.
Ref:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#placement-group-down-peering-failure
NB: do not mark any OSDs as lost, since that means data loss.
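To get a short list of the down+peering PGs and their acting sets out of a long "ceph health detail" dump, something like the following may help. It is a sketch: the function name is made up, and it assumes the unwrapped line format "pg <id> is down+peering, acting [a,b]".

```shell
# Hypothetical helper: read "ceph health detail" text on stdin and print
# each down+peering PG with its acting set, e.g. "0.2c [1,11]".
down_peering_pgs() {
    awk '$1 == "pg" && $3 == "is" && $4 == "down+peering," && $5 == "acting" { print $2, $6 }'
}

# Usage:
#   ceph health detail | down_peering_pgs
#   ceph pg 0.2c query    # then look at the recovery_state section
```

The output of "ceph pg [pg] query" should show which OSDs peering is waiting for, which tells you which down OSD to try to start first.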
I would:
- Check the SMART stats of all disks. Drain disks that are going bad. Make
sure you have enough space on the good disks to drain them properly.
- Check scrub errors and objects. Fix those that are fixable. Some may
require an object from a down OSD.
- Try to get the down OSDs running again if possible. If you manage to get
one running, let it recover and stabilize.
- Recover and inject objects from OSDs that do not run. Start with one PG
at a time; once you get the hang of the method you can do multiple PGs at
the same time.
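For the last step, here is a sketch of what moving one PG could look like, assuming the method is an export/import with ceph-objectstore-tool. That tool, the default data path /var/lib/ceph/osd/ceph-<id>, the journal location and the export file path are all assumptions on my part. The function only prints the commands so you can review them before running anything, and both OSDs must be stopped while the tool runs.

```shell
# Hypothetical helper: print (not run) the commands to export one PG from a
# stopped OSD and import it into another stopped OSD.
# Assumes the default layout /var/lib/ceph/osd/ceph-<id> with the journal
# inside it; the export file path under /tmp is also an assumption.
print_pg_move() {
    pgid="$1"; src="$2"; dst="$3"
    file="/tmp/pg-$pgid.export"
    echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$src --journal-path /var/lib/ceph/osd/ceph-$src/journal --op export --pgid $pgid --file $file"
    echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$dst --journal-path /var/lib/ceph/osd/ceph-$dst/journal --op import --file $file"
}

# Usage (the OSD ids 3 and 5 are made up):
#   print_pg_move 0.e 3 5
```

Make sure the target OSD has enough free space for the PG before importing.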
good luck
Ronny Aasen
On 11. sep. 2017 06:51, hjcho616 wrote:
It took a while. It appears to have cleaned up quite a bit, but it still
has issues. I've been seeing the message below for more than a day, and CPU
and I/O utilization are low, so it looks like something is stuck. I
rebooted the OSDs several times earlier when it looked stuck, and each
time it would work on something else, but now it is not changing much.
What can I try now?
Regards,
Hong
# ceph health detail
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs
degraded; 6 pgs down; 11 pgs inconsistent; 6 pgs peering; 6 pgs
recovering; 16 pgs stale; 22 pgs stuck degraded; 6 pgs stuck inactive;
16 pgs stuck stale; 28 pgs stuck unclean; 16 pgs stuck undersized; 16
pgs undersized; 1 requests are blocked > 32 sec; 1 osds have slow
requests; recovery 221990/4503980 objects degraded (4.929%); recovery
147/2251990 unfound (0.007%); 95 scrub errors; mds cluster is degraded;
no legacy OSD present but 'sortbitwise' flag is not set
pg 0.e is stuck inactive since forever, current state down+peering, last
acting [11,2]
pg 1.d is stuck inactive since forever, current state down+peering, last
acting [11,2]
pg 1.28 is stuck inactive since forever, current state down+peering,
last acting [11,6]
pg 0.29 is stuck inactive since forever, current state down+peering,
last acting [11,6]
pg 1.2b is stuck inactive since forever, current state down+peering,
last acting [1,11]
pg 0.2c is stuck inactive since forever, current state down+peering,
last acting [1,11]
pg 0.e is stuck unclean since forever, current state down+peering, last
acting [11,2]
pg 0.a is stuck unclean for 1233182.248198, current state
stale+active+undersized+degraded+inconsistent, last acting [0]
pg 2.8 is stuck unclean for 1238044.714421, current state
stale+active+undersized+degraded, last acting [0]
pg 2.1a is stuck unclean for 1238933.203920, current state
active+recovering+degraded, last acting [2,11]
pg 2.3 is stuck unclean for 1238882.443876, current state
stale+active+undersized+degraded, last acting [0]
pg 2.27 is stuck unclean for 1295260.765981, current state
active+recovering+degraded, last acting [11,6]
pg 0.d is stuck unclean for 1230831.504001, current state
stale+active+undersized+degraded, last acting [0]
pg 1.c is stuck unclean for 1238044.715698, current state
stale+active+undersized+degraded, last acting [0]
pg 1.3d is stuck unclean for 1232066.572856, current state
stale+active+undersized+degraded, last acting [0]
pg 1.28 is stuck unclean since forever, current state down+peering, last
acting [11,6]
pg 0.29 is stuck unclean since forever, current state down+peering, last
acting [11,6]
pg 1.2b is stuck unclean since forever, current state down+peering, last
acting [1,11]
pg 2.2f is stuck unclean for 1238127.474088, current state
active+recovering+degraded+remapped, last acting [9,10]
pg 0.0 is stuck unclean for 1233182.247776, current state
stale+active+undersized+degraded, last acting [0]
pg 0.2c is stuck unclean since forever, current state down+peering, last
acting [1,11]
pg 2.b is stuck unclean for 1238044.640982, current state
stale+active+undersized+degraded, last acting [0]
pg 1.1b is stuck unclean for 1234021.660986, current state
stale+active+undersized+degraded, last acting [0]
pg 0.1c is stuck unclean for 1232574.189549, current state
stale+active+undersized+degraded, last acting [0]
pg 1.4 is stuck unclean for 1293624.075753, current state
stale+active+undersized+degraded, last acting [0]
pg 0.5 is stuck unclean for 1237356.776788, current state
stale+active+undersized+degraded+inconsistent, last acting [0]
pg 2.1f is stuck unclean for 8825246.729513, current state
active+recovering+degraded, last acting [10,2]
pg 1.d is stuck unclean since forever, current state down+peering, last
acting [11,2]
pg 2.39 is stuck unclean for 1238933.214406, current state
stale+active+undersized+degraded, last acting [0]
pg 1.3a is stuck unclean for 2125299.164204, current state
stale+active+undersized+degraded, last acting [0]
pg 0.3b is stuck unclean for 1233432.895409, current state
stale+active+undersized+degraded, last acting [0]
pg 2.3c is stuck unclean for 1238933.208648, current state
active+recovering+degraded, last acting [10,2]
pg 2.35 is stuck unclean for 1295260.753354, current state
active+recovering+degraded, last acting [11,6]
pg 1.9 is stuck unclean for 1238044.722811, current state
stale+active+undersized+degraded, last acting [0]
pg 0.a is stuck undersized for 1229917.081228, current state
stale+active+undersized+degraded+inconsistent, last acting [0]
pg 2.8 is stuck undersized for 1229917.081016, current state
stale+active+undersized+degraded, last acting [0]
pg 2.b is stuck undersized for 1229917.068181, current state
stale+active+undersized+degraded, last acting [0]
pg 1.9 is stuck undersized for 1229917.075164, current state
stale+active+undersized+degraded, last acting [0]
pg 0.5 is stuck undersized for 1229917.085330, current state
stale+active+undersized+degraded+inconsistent, last acting [0]
pg 1.4 is stuck undersized for 1229917.085148, current state
stale+active+undersized+degraded, last acting [0]
pg 0.d is stuck undersized for 1229917.080800, current state
stale+active+undersized+degraded, last acting [0]
pg 1.c is stuck undersized for 1229917.080592, current state
stale+active+undersized+degraded, last acting [0]
pg 1.3d is stuck undersized for 1229816.808393, current state
stale+active+undersized+degraded, last acting [0]
pg 0.3b is stuck undersized for 1229917.074358, current state
stale+active+undersized+degraded, last acting [0]
pg 1.3a is stuck undersized for 1229917.076592, current state
stale+active+undersized+degraded, last acting [0]
pg 2.39 is stuck undersized for 1229917.077505, current state
stale+active+undersized+degraded, last acting [0]
pg 0.1c is stuck undersized for 1229816.811773, current state
stale+active+undersized+degraded, last acting [0]
pg 1.1b is stuck undersized for 1229816.812506, current state
stale+active+undersized+degraded, last acting [0]
pg 2.3 is stuck undersized for 1229917.090143, current state
stale+active+undersized+degraded, last acting [0]
pg 0.0 is stuck undersized for 1229917.073670, current state
stale+active+undersized+degraded, last acting [0]
pg 0.a is stuck degraded for 1229917.081375, current state
stale+active+undersized+degraded+inconsistent, last acting [0]
pg 2.8 is stuck degraded for 1229917.081162, current state
stale+active+undersized+degraded, last acting [0]
pg 2.b is stuck degraded for 1229917.068328, current state
stale+active+undersized+degraded, last acting [0]
pg 0.5 is stuck degraded for 1229917.085470, current state
stale+active+undersized+degraded+inconsistent, last acting [0]
pg 1.4 is stuck degraded for 1229917.085288, current state
stale+active+undersized+degraded, last acting [0]
pg 2.3c is stuck degraded for 2732.174512, current state
active+recovering+degraded, last acting [10,2]
pg 0.d is stuck degraded for 1229917.080946, current state
stale+active+undersized+degraded, last acting [0]
pg 1.c is stuck degraded for 1229917.080739, current state
stale+active+undersized+degraded, last acting [0]
pg 1.3d is stuck degraded for 1229816.808539, current state
stale+active+undersized+degraded, last acting [0]
pg 0.3b is stuck degraded for 1229917.074504, current state
stale+active+undersized+degraded, last acting [0]
pg 1.3a is stuck degraded for 1229917.076739, current state
stale+active+undersized+degraded, last acting [0]
pg 2.39 is stuck degraded for 1229917.077652, current state
stale+active+undersized+degraded, last acting [0]
pg 2.1f is stuck degraded for 2732.122575, current state
active+recovering+degraded, last acting [10,2]
pg 0.1c is stuck degraded for 1229816.811926, current state
stale+active+undersized+degraded, last acting [0]
pg 1.1b is stuck degraded for 1229816.812659, current state
stale+active+undersized+degraded, last acting [0]
pg 2.1a is stuck degraded for 2744.851402, current state
active+recovering+degraded, last acting [2,11]
pg 2.3 is stuck degraded for 1229917.090302, current state
stale+active+undersized+degraded, last acting [0]
pg 0.0 is stuck degraded for 1229917.073830, current state
stale+active+undersized+degraded, last acting [0]
pg 2.27 is stuck degraded for 2744.828928, current state
active+recovering+degraded, last acting [11,6]
pg 2.2f is stuck degraded for 2731.468651, current state
active+recovering+degraded+remapped, last acting [9,10]
pg 1.9 is stuck degraded for 1229917.075428, current state
stale+active+undersized+degraded, last acting [0]
pg 2.35 is stuck degraded for 2744.828894, current state
active+recovering+degraded, last acting [11,6]
pg 0.a is stuck stale for 1227812.807624, current state
stale+active+undersized+degraded+inconsistent, last acting [0]
pg 2.8 is stuck stale for 1227812.807638, current state
stale+active+undersized+degraded, last acting [0]
pg 2.b is stuck stale for 1227812.807651, current state
stale+active+undersized+degraded, last acting [0]
pg 1.9 is stuck stale for 1227812.807665, current state
stale+active+undersized+degraded, last acting [0]
pg 0.5 is stuck stale for 1227812.807699, current state
stale+active+undersized+degraded+inconsistent, last acting [0]
pg 1.4 is stuck stale for 1227812.807709, current state
stale+active+undersized+degraded, last acting [0]
pg 0.d is stuck stale for 1227812.807624, current state
stale+active+undersized+degraded, last acting [0]
pg 1.c is stuck stale for 1227812.807634, current state
stale+active+undersized+degraded, last acting [0]
pg 1.3d is stuck stale for 1227812.807799, current state
stale+active+undersized+degraded, last acting [0]
pg 0.3b is stuck stale for 1227812.807813, current state
stale+active+undersized+degraded, last acting [0]
pg 1.3a is stuck stale for 1227812.807823, current state
stale+active+undersized+degraded, last acting [0]
pg 2.39 is stuck stale for 1227812.807833, current state
stale+active+undersized+degraded, last acting [0]
pg 0.1c is stuck stale for 1227812.807936, current state
stale+active+undersized+degraded, last acting [0]
pg 1.1b is stuck stale for 1227812.807960, current state
stale+active+undersized+degraded, last acting [0]
pg 2.3 is stuck stale for 1227812.808049, current state
stale+active+undersized+degraded, last acting [0]
pg 0.0 is stuck stale for 1227812.808073, current state
stale+active+undersized+degraded, last acting [0]
pg 0.38 is active+clean+inconsistent, acting [11,1]
pg 2.35 is active+recovering+degraded, acting [11,6], 29 unfound
pg 0.36 is active+clean+inconsistent, acting [10,1]
pg 2.2f is active+recovering+degraded+remapped, acting [9,10], 24 unfound
pg 0.2c is down+peering, acting [1,11]
pg 1.2b is down+peering, acting [1,11]
pg 0.29 is down+peering, acting [11,6]
pg 1.28 is down+peering, acting [11,6]
pg 0.26 is active+clean+inconsistent, acting [6,11]
pg 2.27 is active+recovering+degraded, acting [11,6], 19 unfound
pg 0.23 is active+clean+inconsistent, acting [6,10]
pg 2.1a is active+recovering+degraded, acting [2,11], 29 unfound
pg 0.18 is active+clean+inconsistent, acting [11,1]
pg 2.1f is active+recovering+degraded, acting [10,2], 20 unfound
pg 0.3c is active+clean+inconsistent, acting [1,11]
pg 0.3d is active+clean+inconsistent, acting [2,11]
pg 2.3c is active+recovering+degraded, acting [10,2], 26 unfound
pg 0.b is active+clean+inconsistent, acting [11,2]
pg 1.d is down+peering, acting [11,2]
pg 0.e is down+peering, acting [11,2]
pg 0.f is active+clean+inconsistent, acting [11,6]
1 ops are blocked > 4194.3 sec on osd.10
1 osds have slow requests
recovery 221990/4503980 objects degraded (4.929%)
recovery 147/2251990 unfound (0.007%)
95 scrub errors
mds cluster is degraded
mds.MDS1.1 at 192.168.1.20:6802/1862098088 rank 0 is replaying journal
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com