Re: one pg stuck with 2 unfound pieces

Try restarting the two osd processes with debug osd = 20, debug ms =
1, debug filestore = 20.  Restarting the osds may clear the problem,
but if it recurs, the logs should help explain what's going on.
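For example (a rough sketch; I'm assuming the two osds meant here are the ones
still shown as "querying" in your pg query, osd.9 and osd.18, and that you're
on the stock init scripts, so adjust the ids and service names for your setup):

  # add to the [osd] section of ceph.conf on the affected hosts:
  #   debug osd = 20
  #   debug ms = 1
  #   debug filestore = 20
  # then restart the daemons so they log from startup:
  service ceph restart osd.9        # sysvinit
  restart ceph-osd id=9             # upstart equivalent

  # alternatively, raise the levels on the running daemons without a restart:
  ceph tell osd.9 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'
  ceph tell osd.18 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'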
-Sam

On Wed, Aug 14, 2013 at 12:17 AM, Jens-Christian Fischer
<jens-christian.fischer@xxxxxxxxx> wrote:
> On 13.08.2013, at 21:09, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>
>> You can run 'ceph pg 0.cfa mark_unfound_lost revert'. (Revert Lost
>> section of http://ceph.com/docs/master/rados/operations/placement-groups/).
>> -Sam
>
>
> As I wrote further down (in the quoted mail below), ceph wouldn't let me do that:
>
> root@ineri ~$ ceph pg 0.cfa  mark_unfound_lost revert
> pg has 2 objects but we haven't probed all sources, not marking lost
>
> I'm looking for a way to force the (re-)probing of the sources…
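>
> (As far as I can tell, list_missing and the pg query are the ways to see
> which sources still need probing, e.g.:
>
>   ceph pg 0.cfa list_missing   # lists the unfound objects in this pg
>   ceph pg 0.cfa query          # "might_have_unfound" under recovery_state
>                                # shows which osds are still "querying"
>
> but neither seems to re-trigger the probing itself.)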
>
> cheers
> jc
>
>
>
>
>>
>> On Tue, Aug 13, 2013 at 6:50 AM, Jens-Christian Fischer
>> <jens-christian.fischer@xxxxxxxxx> wrote:
>>> We have a cluster with 10 servers, 64 OSDs and 5 Mons on them. The OSDs are
>>> 3 TB disks formatted with btrfs, and the servers run either Ubuntu 12.10 or
>>> 13.04.
>>>
>>> Recently one of the servers (13.04) froze (due to problems with btrfs -
>>> something we have seen a few times). I decided not to try to recover the
>>> disks, but to reformat them with XFS. I removed the OSDs, reformatted the
>>> disks, and re-created the OSDs (they got the same OSD numbers).
>>>
>>> I redid this twice (because I had wrongly partitioned the disks in the first
>>> place) and ended up with 2 unfound "pieces" in one pg:
>>>
>>> root@s2:~# ceph health detail
>>> HEALTH_WARN 1 pgs degraded; 1 pgs recovering; 1 pgs stuck unclean; recovery
>>> 4448/28915270 degraded (0.015%); 2/9854766 unfound (0.000%)
>>> pg 0.cfa is stuck unclean for 1004252.309704, current state
>>> active+recovering+degraded+remapped, last acting [23,50]
>>> pg 0.cfa is active+recovering+degraded+remapped, acting [23,50], 2 unfound
>>> recovery 4448/28915270 degraded (0.015%); 2/9854766 unfound (0.000%)
>>>
>>>
>>> root@s2:~# ceph pg 0.cfa query
>>>
>>> { "state": "active+recovering+degraded+remapped",
>>>  "epoch": 28197,
>>>  "up": [
>>>        23,
>>>        50,
>>>        18],
>>>  "acting": [
>>>        23,
>>>        50],
>>>  "info": { "pgid": "0.cfa",
>>>      "last_update": "28082'7774",
>>>      "last_complete": "23686'7083",
>>>      "log_tail": "14360'4061",
>>>      "last_backfill": "MAX",
>>>      "purged_snaps": "[]",
>>>      "history": { "epoch_created": 1,
>>>          "last_epoch_started": 28197,
>>>          "last_epoch_clean": 24810,
>>>          "last_epoch_split": 0,
>>>          "same_up_since": 28195,
>>>          "same_interval_since": 28196,
>>>          "same_primary_since": 26036,
>>>          "last_scrub": "20585'6801",
>>>          "last_scrub_stamp": "2013-07-28 15:40:53.298786",
>>>          "last_deep_scrub": "20585'6801",
>>>          "last_deep_scrub_stamp": "2013-07-28 15:40:53.298786",
>>>          "last_clean_scrub_stamp": "2013-07-28 15:40:53.298786"},
>>>      "stats": { "version": "28082'7774",
>>>          "reported": "28197'41950",
>>>          "state": "active+recovering+degraded+remapped",
>>>          "last_fresh": "2013-08-13 14:34:33.057271",
>>>          "last_change": "2013-08-13 14:34:33.057271",
>>>          "last_active": "2013-08-13 14:34:33.057271",
>>>          "last_clean": "2013-08-01 23:50:18.414082",
>>>          "last_became_active": "2013-05-29 13:10:51.366237",
>>>          "last_unstale": "2013-08-13 14:34:33.057271",
>>>          "mapping_epoch": 28195,
>>>          "log_start": "14360'4061",
>>>          "ondisk_log_start": "14360'4061",
>>>          "created": 1,
>>>          "last_epoch_clean": 24810,
>>>          "parent": "0.0",
>>>          "parent_split_bits": 0,
>>>          "last_scrub": "20585'6801",
>>>          "last_scrub_stamp": "2013-07-28 15:40:53.298786",
>>>          "last_deep_scrub": "20585'6801",
>>>          "last_deep_scrub_stamp": "2013-07-28 15:40:53.298786",
>>>          "last_clean_scrub_stamp": "2013-07-28 15:40:53.298786",
>>>          "log_size": 0,
>>>          "ondisk_log_size": 0,
>>>          "stats_invalid": "0",
>>>          "stat_sum": { "num_bytes": 145307402,
>>>              "num_objects": 2234,
>>>              "num_object_clones": 0,
>>>              "num_object_copies": 0,
>>>              "num_objects_missing_on_primary": 0,
>>>              "num_objects_degraded": 0,
>>>              "num_objects_unfound": 0,
>>>              "num_read": 744,
>>>              "num_read_kb": 410184,
>>>              "num_write": 7774,
>>>              "num_write_kb": 1155438,
>>>              "num_scrub_errors": 0,
>>>              "num_shallow_scrub_errors": 0,
>>>              "num_deep_scrub_errors": 0,
>>>              "num_objects_recovered": 3998,
>>>              "num_bytes_recovered": 278803622,
>>>              "num_keys_recovered": 0},
>>>          "stat_cat_sum": {},
>>>          "up": [
>>>                23,
>>>                50,
>>>                18],
>>>          "acting": [
>>>                23,
>>>                50]},
>>>      "empty": 0,
>>>      "dne": 0,
>>>      "incomplete": 0,
>>>      "last_epoch_started": 28197},
>>>  "recovery_state": [
>>>        { "name": "Started\/Primary\/Active",
>>>          "enter_time": "2013-08-13 14:34:33.026698",
>>>          "might_have_unfound": [
>>>                { "osd": 9,
>>>                  "status": "querying"},
>>>                { "osd": 18,
>>>                  "status": "querying"},
>>>                { "osd": 50,
>>>                  "status": "already probed"}],
>>>          "recovery_progress": { "backfill_target": 50,
>>>              "waiting_on_backfill": 0,
>>>              "backfill_pos": "96220cfa\/10000799e82.00000000\/head\/\/0",
>>>              "backfill_info": { "begin": "0\/\/0\/\/-1",
>>>                  "end": "0\/\/0\/\/-1",
>>>                  "objects": []},
>>>              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
>>>                  "end": "0\/\/0\/\/-1",
>>>                  "objects": []},
>>>              "backfills_in_flight": [],
>>>              "pull_from_peer": [],
>>>              "pushing": []},
>>>          "scrub": { "scrubber.epoch_start": "0",
>>>              "scrubber.active": 0,
>>>              "scrubber.block_writes": 0,
>>>              "scrubber.finalizing": 0,
>>>              "scrubber.waiting_on": 0,
>>>              "scrubber.waiting_on_whom": []}},
>>>        { "name": "Started",
>>>          "enter_time": "2013-08-13 14:34:32.024282"}]}
>>>
>>> I have tried to mark those two pieces as lost, but ceph wouldn't let me
>>> (because the pg is still in "querying" state on osds 9 and 18). I have
>>> restarted those OSDs, but I can't force any other status change.
>>>
>>> What next? Take the OSDs (9, 18) out again and rebuild them?
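>>>
>>> (If taking them out again is the way to go, I assume the procedure would be
>>> roughly:
>>>
>>>   ceph osd out 9    # let the cluster remap data away from the osd
>>>   ceph osd out 18
>>>   # wait for the remapping to settle, stop the daemons, then
>>>   # re-create the osds and mark them "in" again
>>>
>>> but I'd prefer a less drastic option if there is one.)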
>>>
>>> thanks for your help
>>> Jens-Christian
>>>
>>>
>>> --
>>> SWITCH
>>> Jens-Christian Fischer, Peta Solutions
>>> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
>>> phone +41 44 268 15 15, direct +41 44 268 15 71
>>> jens-christian.fischer@xxxxxxxxx
>>> http://www.switch.ch
>>>
>>> http://www.switch.ch/socialmedia
>>>
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



