Re: Power outages!!! help!

you can start by posting more details: at least the output of
"ceph osd tree", "cat ceph.conf" and "ceph osd df", so we can see what settings you are running and how your cluster is balanced at the moment.
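
if it helps, here is a minimal sketch that gathers those three outputs into one file you can attach. the function name, the CEPH_CONF variable and the default file name are my own placeholders, and /etc/ceph/ceph.conf is only the usual default location, so adjust as needed.

```shell
# Collect the requested cluster details into a single file for posting.
# CEPH_CONF and the output file name are assumptions, not ceph settings.
collect_cluster_info() {
    out="${1:-ceph-report.txt}"
    conf="${CEPH_CONF:-/etc/ceph/ceph.conf}"
    {
        echo "== ceph osd tree =="
        ceph osd tree
        echo "== ceph osd df =="
        ceph osd df
        echo "== $conf =="
        cat "$conf"
    } > "$out" 2>&1
    echo "wrote $out"
}
```

run it as "collect_cluster_info" on a node that has an admin keyring, and read through the file before posting it.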

generally:

inconsistent pg's are pg's that have scrub errors. use "rados list-inconsistent-pg [pool]" and "rados list-inconsistent-obj [pg]" to locate the objects with problems. compare and fix the objects using the info from http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent and also read http://ceph.com/geen-categorie/ceph-manually-repair-object/
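
a rough sketch of how the two rados commands can be chained over all pools. the function name is made up, and it assumes that list-inconsistent-pg prints a JSON array of pg ids such as ["0.38","0.36"] and that your pool names contain no spaces.

```shell
# Print every inconsistent pg in every pool, followed by the per-object scrub detail.
list_inconsistent() {
    for pool in $(rados lspools); do
        # Strip the JSON brackets and quotes so the pg ids can be looped over.
        pgs=$(rados list-inconsistent-pg "$pool" | tr -d '[]"' | tr ',' ' ')
        for pg in $pgs; do
            echo "### pool=$pool pg=$pg"
            rados list-inconsistent-obj "$pg" --format=json-pretty
        done
    done
}
```

the per-object output shows which shard has the error. only once you know which copy is the bad one does it make sense to repair, as described in the links above.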


since you have so many scrub errors i would assume there are more bad disks. check the SMART values of all disks and look for read errors in the logs. if you find any bad disks, drain them by setting their crush weight to 0, and remove them from the cluster once they are empty. personally i use smartmontools, which sends me emails about bad disks, and i check disks manually with smartctl -a /dev/sda || echo bad-disk: $?
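
the manual check and the drain step could be wrapped roughly like this. the function names are placeholders of mine. note that smartctl's exit status is a bitmask, so a non-zero value can also mean "errors recorded in the log at some point" and not only "disk is failing now"; look at the full smartctl -a output before you decide.

```shell
# Run smartctl on each device given as an argument and flag non-zero exit statuses.
# Returns the number of flagged disks as its own exit status.
check_disks() {
    bad=0
    for dev in "$@"; do
        if smartctl -a "$dev" > /dev/null 2>&1; then
            echo "ok: $dev"
        else
            status=$?
            echo "bad-disk: $dev (smartctl exit status $status)"
            bad=$((bad + 1))
        fi
    done
    return "$bad"
}

# Drain one osd by setting its crush weight to 0; the argument is the numeric osd id.
drain_osd() {
    ceph osd crush reweight "osd.$1" 0
}
```

for example "check_disks /dev/sda /dev/sdb" on each osd host, then "drain_osd 3" for the osd that sits on a bad disk. wait for the data to move off before removing the osd.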


pg's that are down+peering need to have one of the acting osd's started again, or to have the objects recovered using the methods we have discussed previously. ref: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#placement-group-down-peering-failure
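
to see which osd each down+peering pg is waiting for, something like the following may help. it is only a sketch: the function name is mine, the awk pattern assumes the "pg X is down+peering, acting [...]" lines as they appear in your health detail output, and the key names i grep for are what i remember from "ceph pg query" output, so treat them as an assumption and look at the full query output if nothing matches.

```shell
# For every pg reported as down+peering, print the parts of "ceph pg query"
# that mention what peering is blocked on.
show_blocked_pgs() {
    ceph health detail |
        awk '/^pg [0-9a-f.]+ is down\+peering/ { print $2 }' |
        sort -u |
        while read -r pg; do
            echo "### $pg"
            # -A 3 also shows the few lines following each matching key.
            ceph pg "$pg" query | grep -E -A 3 'blocked|down_osds_we_would_probe'
        done
}
```

the osd ids it reports are the ones to try to bring back up first.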

nb: do not mark any osd's as lost since that = dataloss.


I would:
- check smart stats of all disks. drain disks that are going bad. make sure you have enough space on good disks to drain them properly.
- check scrub errors and objects. fix those that are fixable. some may require an object from a down osd.
- try to get down osd's running again if possible. if you manage to get one running, let it recover and stabilize.
- recover and inject objects from osd's that do not run. start by doing one pg at a time, and once you get the hang of the method you can do multiple pg's at the same time.
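
for the last step, and assuming the method is the ceph-objectstore-tool export/import one (that is my assumption here, the details are not repeated in this mail), the commands for a single pg look roughly like what this dry-run helper prints. the function name and the /root export location are hypothetical, and the paths assume filestore osd's in the default /var/lib/ceph/osd/ceph-N location with the journal inside it. it only prints the commands, it does not run anything.

```shell
# Dry run: print the export and import commands for one pg.
# $1 = pg id, $2 = id of the osd that will not start (source),
# $3 = id of a healthy osd to import into (target).
print_pg_copy_cmds() {
    pg="$1"; src="$2"; dst="$3"
    file="/root/pg-$pg.export"   # hypothetical location, needs enough free space
    echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$src --journal-path /var/lib/ceph/osd/ceph-$src/journal --pgid $pg --op export --file $file"
    echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$dst --journal-path /var/lib/ceph/osd/ceph-$dst/journal --op import --file $file"
}
```

the osd daemon must be stopped on whichever osd the tool is working on. start the target osd again after the import and let the pg recover before moving on to the next one.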


good luck
Ronny Aasen



On 11. sep. 2017 06:51, hjcho616 wrote:
It took a while. It appears to have cleaned up quite a bit, but it still has issues. I've been seeing the message below for more than a day, and CPU and I/O utilization are low, so it looks like something is stuck. I rebooted the OSDs several times earlier when it looked stuck, and it would then work on something else, but now it is not changing much. What can I try now?

Regards,
Hong

# ceph health detail
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs degraded; 6 pgs down; 11 pgs inconsistent; 6 pgs peering; 6 pgs recovering; 16 pgs stale; 22 pgs stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 28 pgs stuck unclean; 16 pgs stuck undersized; 16 pgs undersized; 1 requests are blocked > 32 sec; 1 osds have slow requests; recovery 221990/4503980 objects degraded (4.929%); recovery 147/2251990 unfound (0.007%); 95 scrub errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set
pg 0.e is stuck inactive since forever, current state down+peering, last acting [11,2]
pg 1.d is stuck inactive since forever, current state down+peering, last acting [11,2]
pg 1.28 is stuck inactive since forever, current state down+peering, last acting [11,6]
pg 0.29 is stuck inactive since forever, current state down+peering, last acting [11,6]
pg 1.2b is stuck inactive since forever, current state down+peering, last acting [1,11]
pg 0.2c is stuck inactive since forever, current state down+peering, last acting [1,11]
pg 0.e is stuck unclean since forever, current state down+peering, last acting [11,2]
pg 0.a is stuck unclean for 1233182.248198, current state stale+active+undersized+degraded+inconsistent, last acting [0]
pg 2.8 is stuck unclean for 1238044.714421, current state stale+active+undersized+degraded, last acting [0]
pg 2.1a is stuck unclean for 1238933.203920, current state active+recovering+degraded, last acting [2,11]
pg 2.3 is stuck unclean for 1238882.443876, current state stale+active+undersized+degraded, last acting [0]
pg 2.27 is stuck unclean for 1295260.765981, current state active+recovering+degraded, last acting [11,6]
pg 0.d is stuck unclean for 1230831.504001, current state stale+active+undersized+degraded, last acting [0]
pg 1.c is stuck unclean for 1238044.715698, current state stale+active+undersized+degraded, last acting [0]
pg 1.3d is stuck unclean for 1232066.572856, current state stale+active+undersized+degraded, last acting [0]
pg 1.28 is stuck unclean since forever, current state down+peering, last acting [11,6]
pg 0.29 is stuck unclean since forever, current state down+peering, last acting [11,6]
pg 1.2b is stuck unclean since forever, current state down+peering, last acting [1,11]
pg 2.2f is stuck unclean for 1238127.474088, current state active+recovering+degraded+remapped, last acting [9,10]
pg 0.0 is stuck unclean for 1233182.247776, current state stale+active+undersized+degraded, last acting [0]
pg 0.2c is stuck unclean since forever, current state down+peering, last acting [1,11]
pg 2.b is stuck unclean for 1238044.640982, current state stale+active+undersized+degraded, last acting [0]
pg 1.1b is stuck unclean for 1234021.660986, current state stale+active+undersized+degraded, last acting [0]
pg 0.1c is stuck unclean for 1232574.189549, current state stale+active+undersized+degraded, last acting [0]
pg 1.4 is stuck unclean for 1293624.075753, current state stale+active+undersized+degraded, last acting [0]
pg 0.5 is stuck unclean for 1237356.776788, current state stale+active+undersized+degraded+inconsistent, last acting [0]
pg 2.1f is stuck unclean for 8825246.729513, current state active+recovering+degraded, last acting [10,2]
pg 1.d is stuck unclean since forever, current state down+peering, last acting [11,2]
pg 2.39 is stuck unclean for 1238933.214406, current state stale+active+undersized+degraded, last acting [0]
pg 1.3a is stuck unclean for 2125299.164204, current state stale+active+undersized+degraded, last acting [0]
pg 0.3b is stuck unclean for 1233432.895409, current state stale+active+undersized+degraded, last acting [0]
pg 2.3c is stuck unclean for 1238933.208648, current state active+recovering+degraded, last acting [10,2]
pg 2.35 is stuck unclean for 1295260.753354, current state active+recovering+degraded, last acting [11,6]
pg 1.9 is stuck unclean for 1238044.722811, current state stale+active+undersized+degraded, last acting [0]
pg 0.a is stuck undersized for 1229917.081228, current state stale+active+undersized+degraded+inconsistent, last acting [0]
pg 2.8 is stuck undersized for 1229917.081016, current state stale+active+undersized+degraded, last acting [0]
pg 2.b is stuck undersized for 1229917.068181, current state stale+active+undersized+degraded, last acting [0]
pg 1.9 is stuck undersized for 1229917.075164, current state stale+active+undersized+degraded, last acting [0]
pg 0.5 is stuck undersized for 1229917.085330, current state stale+active+undersized+degraded+inconsistent, last acting [0]
pg 1.4 is stuck undersized for 1229917.085148, current state stale+active+undersized+degraded, last acting [0]
pg 0.d is stuck undersized for 1229917.080800, current state stale+active+undersized+degraded, last acting [0]
pg 1.c is stuck undersized for 1229917.080592, current state stale+active+undersized+degraded, last acting [0]
pg 1.3d is stuck undersized for 1229816.808393, current state stale+active+undersized+degraded, last acting [0]
pg 0.3b is stuck undersized for 1229917.074358, current state stale+active+undersized+degraded, last acting [0]
pg 1.3a is stuck undersized for 1229917.076592, current state stale+active+undersized+degraded, last acting [0]
pg 2.39 is stuck undersized for 1229917.077505, current state stale+active+undersized+degraded, last acting [0]
pg 0.1c is stuck undersized for 1229816.811773, current state stale+active+undersized+degraded, last acting [0]
pg 1.1b is stuck undersized for 1229816.812506, current state stale+active+undersized+degraded, last acting [0]
pg 2.3 is stuck undersized for 1229917.090143, current state stale+active+undersized+degraded, last acting [0]
pg 0.0 is stuck undersized for 1229917.073670, current state stale+active+undersized+degraded, last acting [0]
pg 0.a is stuck degraded for 1229917.081375, current state stale+active+undersized+degraded+inconsistent, last acting [0]
pg 2.8 is stuck degraded for 1229917.081162, current state stale+active+undersized+degraded, last acting [0]
pg 2.b is stuck degraded for 1229917.068328, current state stale+active+undersized+degraded, last acting [0]
pg 0.5 is stuck degraded for 1229917.085470, current state stale+active+undersized+degraded+inconsistent, last acting [0]
pg 1.4 is stuck degraded for 1229917.085288, current state stale+active+undersized+degraded, last acting [0]
pg 2.3c is stuck degraded for 2732.174512, current state active+recovering+degraded, last acting [10,2]
pg 0.d is stuck degraded for 1229917.080946, current state stale+active+undersized+degraded, last acting [0]
pg 1.c is stuck degraded for 1229917.080739, current state stale+active+undersized+degraded, last acting [0]
pg 1.3d is stuck degraded for 1229816.808539, current state stale+active+undersized+degraded, last acting [0]
pg 0.3b is stuck degraded for 1229917.074504, current state stale+active+undersized+degraded, last acting [0]
pg 1.3a is stuck degraded for 1229917.076739, current state stale+active+undersized+degraded, last acting [0]
pg 2.39 is stuck degraded for 1229917.077652, current state stale+active+undersized+degraded, last acting [0]
pg 2.1f is stuck degraded for 2732.122575, current state active+recovering+degraded, last acting [10,2]
pg 0.1c is stuck degraded for 1229816.811926, current state stale+active+undersized+degraded, last acting [0]
pg 1.1b is stuck degraded for 1229816.812659, current state stale+active+undersized+degraded, last acting [0]
pg 2.1a is stuck degraded for 2744.851402, current state active+recovering+degraded, last acting [2,11]
pg 2.3 is stuck degraded for 1229917.090302, current state stale+active+undersized+degraded, last acting [0]
pg 0.0 is stuck degraded for 1229917.073830, current state stale+active+undersized+degraded, last acting [0]
pg 2.27 is stuck degraded for 2744.828928, current state active+recovering+degraded, last acting [11,6]
pg 2.2f is stuck degraded for 2731.468651, current state active+recovering+degraded+remapped, last acting [9,10]
pg 1.9 is stuck degraded for 1229917.075428, current state stale+active+undersized+degraded, last acting [0]
pg 2.35 is stuck degraded for 2744.828894, current state active+recovering+degraded, last acting [11,6]
pg 0.a is stuck stale for 1227812.807624, current state stale+active+undersized+degraded+inconsistent, last acting [0]
pg 2.8 is stuck stale for 1227812.807638, current state stale+active+undersized+degraded, last acting [0]
pg 2.b is stuck stale for 1227812.807651, current state stale+active+undersized+degraded, last acting [0]
pg 1.9 is stuck stale for 1227812.807665, current state stale+active+undersized+degraded, last acting [0]
pg 0.5 is stuck stale for 1227812.807699, current state stale+active+undersized+degraded+inconsistent, last acting [0]
pg 1.4 is stuck stale for 1227812.807709, current state stale+active+undersized+degraded, last acting [0]
pg 0.d is stuck stale for 1227812.807624, current state stale+active+undersized+degraded, last acting [0]
pg 1.c is stuck stale for 1227812.807634, current state stale+active+undersized+degraded, last acting [0]
pg 1.3d is stuck stale for 1227812.807799, current state stale+active+undersized+degraded, last acting [0]
pg 0.3b is stuck stale for 1227812.807813, current state stale+active+undersized+degraded, last acting [0]
pg 1.3a is stuck stale for 1227812.807823, current state stale+active+undersized+degraded, last acting [0]
pg 2.39 is stuck stale for 1227812.807833, current state stale+active+undersized+degraded, last acting [0]
pg 0.1c is stuck stale for 1227812.807936, current state stale+active+undersized+degraded, last acting [0]
pg 1.1b is stuck stale for 1227812.807960, current state stale+active+undersized+degraded, last acting [0]
pg 2.3 is stuck stale for 1227812.808049, current state stale+active+undersized+degraded, last acting [0]
pg 0.0 is stuck stale for 1227812.808073, current state stale+active+undersized+degraded, last acting [0]
pg 0.38 is active+clean+inconsistent, acting [11,1]
pg 2.35 is active+recovering+degraded, acting [11,6], 29 unfound
pg 0.36 is active+clean+inconsistent, acting [10,1]
pg 2.2f is active+recovering+degraded+remapped, acting [9,10], 24 unfound
pg 0.2c is down+peering, acting [1,11]
pg 1.2b is down+peering, acting [1,11]
pg 0.29 is down+peering, acting [11,6]
pg 1.28 is down+peering, acting [11,6]
pg 0.26 is active+clean+inconsistent, acting [6,11]
pg 2.27 is active+recovering+degraded, acting [11,6], 19 unfound
pg 0.23 is active+clean+inconsistent, acting [6,10]
pg 2.1a is active+recovering+degraded, acting [2,11], 29 unfound
pg 0.18 is active+clean+inconsistent, acting [11,1]
pg 2.1f is active+recovering+degraded, acting [10,2], 20 unfound
pg 0.3c is active+clean+inconsistent, acting [1,11]
pg 0.3d is active+clean+inconsistent, acting [2,11]
pg 2.3c is active+recovering+degraded, acting [10,2], 26 unfound
pg 0.b is active+clean+inconsistent, acting [11,2]
pg 1.d is down+peering, acting [11,2]
pg 0.e is down+peering, acting [11,2]
pg 0.f is active+clean+inconsistent, acting [11,6]
1 ops are blocked > 4194.3 sec on osd.10
1 osds have slow requests
recovery 221990/4503980 objects degraded (4.929%)
recovery 147/2251990 unfound (0.007%)
95 scrub errors
mds cluster is degraded
mds.MDS1.1 at 192.168.1.20:6802/1862098088 rank 0 is replaying journal

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com