Hi Sam
I have restarted osd.23 with the debug log settings and have extracted these 0.cfa-related log lines, but I can't interpret them. There might be more; I can provide the complete log file if you need it: http://pastebin.com/dYsihsx4
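For reference, pulling the 0.cfa entries out of an osd log only takes a grep along these lines (a sketch; the log path assumes the default layout on Ubuntu):

  # collect everything that mentions pg 0.cfa from osd.23's log
  grep '0\.cfa' /var/log/ceph/ceph-osd.23.log > osd.23-0.cfa.log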
0.cfa has been out for so long that it shows up as being down forever:
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 mons down, quorum 0,1,2,4 h1,h5,s2,s4
pg 0.cfa is stuck inactive since forever, current state incomplete, last acting [23,50,18]
pg 0.cfa is stuck unclean since forever, current state incomplete, last acting [23,50,18]
pg 0.cfa is incomplete, acting [23,50,18]
Also, we can't revert 0.cfa:
root@h0:~# ceph pg 0.cfa mark_unfound_lost revert
pg has no unfound objects
This stuck pg seems to be filling up our mons (they need to keep old data around, right?), which makes starting a new mon a task of seemingly herculean proportions.
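As a data point, the growth shows up directly in the size of each mon's backing store (a sketch; the path assumes the default mon data directory and a cluster named ceph):

  # rough on-disk size of the monitor store on each mon host
  du -sh /var/lib/ceph/mon/ceph-*/store.db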
Any ideas on how to proceed?
thanks
Jens-Christian
--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fischer@xxxxxxxxx
Try restarting the two osd processes with debug osd = 20, debug ms = 1, debug filestore = 20. Restarting the osds may clear the problem, but if it recurs, the logs should help explain what's going on. -Sam
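A sketch of one way to apply those settings, assuming the two acting osds are 23 and 50 and a stock Ubuntu install (the service names in particular may differ per setup):

  # raise the debug levels on the running daemons...
  ceph tell osd.23 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'
  ceph tell osd.50 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'
  # ...or, to capture the peering itself, add the same settings to the [osd]
  # section of ceph.conf (debug osd = 20, debug ms = 1, debug filestore = 20)
  # and restart the two osds so they come up with the extra logging:
  service ceph restart osd.23
  service ceph restart osd.50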
On Wed, Aug 14, 2013 at 12:17 AM, Jens-Christian Fischer <jens-christian.fischer@xxxxxxxxx> wrote:
On 13.08.2013, at 21:09, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
You can run 'ceph pg 0.cfa mark_unfound_lost revert'. (Revert Lost section of http://ceph.com/docs/master/rados/operations/placement-groups/). -Sam
As I wrote further down in the information below, ceph wouldn't let me do that:
root@ineri ~$ ceph pg 0.cfa mark_unfound_lost revert
pg has 2 objects but we haven't probed all sources, not marking lost
I'm looking for a way to force the (re-)probing of the sources…
cheers jc
On Tue, Aug 13, 2013 at 6:50 AM, Jens-Christian Fischer <jens-christian.fischer@xxxxxxxxx> wrote:
We have a cluster of 10 servers with 64 OSDs and 5 mons on them. The OSDs are 3 TB disks formatted with btrfs, and the servers run either Ubuntu 12.10 or 13.04.
Recently one of the servers (13.04) came to a standstill (due to problems with btrfs, something we have seen a few times). I decided not to try to recover the disks, but to reformat them with XFS. I removed the OSDs, reformatted the disks, and re-created the OSDs (they got the same OSD numbers).
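For context, the usual manual remove/re-create sequence looks roughly like the following (a sketch only; the device name, crush weight, auth caps and service commands are assumptions, not a record of what was actually run here):

  # --- remove the failed osd (repeat for each affected id, e.g. 9) ---
  ceph osd out 9
  service ceph stop osd.9                  # or: stop ceph-osd id=9 (upstart)
  ceph osd crush remove osd.9
  ceph auth del osd.9
  ceph osd rm 9

  # --- re-create it on a freshly formatted XFS disk ---
  ceph osd create                          # hands back the lowest free id (9 here)
  mkfs.xfs /dev/sdX                        # hypothetical device
  mount /dev/sdX /var/lib/ceph/osd/ceph-9
  ceph-osd -i 9 --mkfs --mkkey
  ceph auth add osd.9 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-9/keyring
  ceph osd crush add osd.9 1.0 host=<hostname>
  service ceph start osd.9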
I redid this twice (because I had partitioned the disks wrongly the first time) and I ended up with 2 unfound "pieces" in one pg:
root@s2:~# ceph health detail
HEALTH_WARN 1 pgs degraded; 1 pgs recovering; 1 pgs stuck unclean; recovery 4448/28915270 degraded (0.015%); 2/9854766 unfound (0.000%)
pg 0.cfa is stuck unclean for 1004252.309704, current state active+recovering+degraded+remapped, last acting [23,50]
pg 0.cfa is active+recovering+degraded+remapped, acting [23,50], 2 unfound
recovery 4448/28915270 degraded (0.015%); 2/9854766 unfound (0.000%)
root@s2:~# ceph pg 0.cfa query
{ "state": "active+recovering+degraded+remapped", "epoch": 28197, "up": [ 23, 50, 18], "acting": [ 23, 50], "info": { "pgid": "0.cfa", "last_update": "28082'7774", "last_complete": "23686'7083", "log_tail": "14360'4061", "last_backfill": "MAX", "purged_snaps": "[]", "history": { "epoch_created": 1, "last_epoch_started": 28197, "last_epoch_clean": 24810, "last_epoch_split": 0, "same_up_since": 28195, "same_interval_since": 28196, "same_primary_since": 26036, "last_scrub": "20585'6801", "last_scrub_stamp": "2013-07-28 15:40:53.298786", "last_deep_scrub": "20585'6801", "last_deep_scrub_stamp": "2013-07-28 15:40:53.298786", "last_clean_scrub_stamp": "2013-07-28 15:40:53.298786"}, "stats": { "version": "28082'7774", "reported": "28197'41950", "state": "active+recovering+degraded+remapped", "last_fresh": "2013-08-13 14:34:33.057271", "last_change": "2013-08-13 14:34:33.057271", "last_active": "2013-08-13 14:34:33.057271", "last_clean": "2013-08-01 23:50:18.414082", "last_became_active": "2013-05-29 13:10:51.366237", "last_unstale": "2013-08-13 14:34:33.057271", "mapping_epoch": 28195, "log_start": "14360'4061", "ondisk_log_start": "14360'4061", "created": 1, "last_epoch_clean": 24810, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "20585'6801", "last_scrub_stamp": "2013-07-28 15:40:53.298786", "last_deep_scrub": "20585'6801", "last_deep_scrub_stamp": "2013-07-28 15:40:53.298786", "last_clean_scrub_stamp": "2013-07-28 15:40:53.298786", "log_size": 0, "ondisk_log_size": 0, "stats_invalid": "0", "stat_sum": { "num_bytes": 145307402, "num_objects": 2234, "num_object_clones": 0, "num_object_copies": 0, "num_objects_missing_on_primary": 0, "num_objects_degraded": 0, "num_objects_unfound": 0, "num_read": 744, "num_read_kb": 410184, "num_write": 7774, "num_write_kb": 1155438, "num_scrub_errors": 0, "num_shallow_scrub_errors": 0, "num_deep_scrub_errors": 0, "num_objects_recovered": 3998, "num_bytes_recovered": 278803622, "num_keys_recovered": 0}, "stat_cat_sum": {}, "up": [ 23, 50, 18], "acting": [ 23, 50]}, "empty": 0, "dne": 0, "incomplete": 0, "last_epoch_started": 28197}, "recovery_state": [ { "name": "Started\/Primary\/Active", "enter_time": "2013-08-13 14:34:33.026698", "might_have_unfound": [ { "osd": 9, "status": "querying"}, { "osd": 18, "status": "querying"}, { "osd": 50, "status": "already probed"}], "recovery_progress": { "backfill_target": 50, "waiting_on_backfill": 0, "backfill_pos": "96220cfa\/10000799e82.00000000\/head\/\/0", "backfill_info": { "begin": "0\/\/0\/\/-1", "end": "0\/\/0\/\/-1", "objects": []}, "peer_backfill_info": { "begin": "0\/\/0\/\/-1", "end": "0\/\/0\/\/-1", "objects": []}, "backfills_in_flight": [], "pull_from_peer": [], "pushing": []}, "scrub": { "scrubber.epoch_start": "0", "scrubber.active": 0, "scrubber.block_writes": 0, "scrubber.finalizing": 0, "scrubber.waiting_on": 0, "scrubber.waiting_on_whom": []}}, { "name": "Started", "enter_time": "2013-08-13 14:34:32.024282"}]}
I have tried to mark those two pieces as lost, but ceph wouldn't let me (because the pg is still in "querying" state for osd 9 and 18). I have restarted the OSDs, but I can't force any other status change.
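Two command-line checks that might help here (a sketch; marking the stuck peers down to force a fresh round of peering is only a guess at a nudge, not a documented fix):

  # list the unfound objects and see which osds the pg still wants to probe
  ceph pg 0.cfa list_missing
  ceph pg 0.cfa query | grep -A 8 might_have_unfound
  # one thing to try: mark the peers stuck in "querying" down, so the pg
  # re-peers and queries them again when they report back in
  ceph osd down 9
  ceph osd down 18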
What next? Take the OSDs (9, 18) out again and rebuild them?
thanks for your help
Jens-Christian
--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fischer@xxxxxxxxx
http://www.switch.ch
http://www.switch.ch/socialmedia