Hi Sam,

Thanks for your reply. Unfortunately I didn't capture all of this data at the time of the issue; what I do have I've pasted below.

FYI, the only way I found to fix this issue was to temporarily reduce the number of replicas in the pool to 1. The stuck pgs then disappeared, and at that point I increased the replicas back to 2. Obviously this is not a great workaround, so I am keen to get to the bottom of the problem here.

Thanks again for your help.

Chris

# ceph health detail
HEALTH_WARN 7 pgs stuck unclean
pg 3.5a is stuck unclean for 335339.172516, current state active, last acting [5,4]
pg 3.54 is stuck unclean for 335339.157608, current state active, last acting [15,7]
pg 3.55 is stuck unclean for 335339.167154, current state active, last acting [16,9]
pg 3.1c is stuck unclean for 335339.174150, current state active, last acting [8,16]
pg 3.a is stuck unclean for 335339.177001, current state active, last acting [0,8]
pg 3.4 is stuck unclean for 335339.165377, current state active, last acting [17,4]
pg 3.5 is stuck unclean for 335339.149507, current state active, last acting [2,6]

# ceph pg 3.5a query
{ "state": "active",
  "epoch": 699,
  "up": [5, 4],
  "acting": [5, 4],
  "info": { "pgid": "3.5a",
      "last_update": "413'688",
      "last_complete": "413'688",
      "log_tail": "0'0",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 67,
          "last_epoch_started": 644,
          "last_epoch_clean": 644,
          "last_epoch_split": 0,
          "same_up_since": 643,
          "same_interval_since": 643,
          "same_primary_since": 561,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2013-08-01 15:23:29.253783",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2013-08-01 15:23:29.253783",
          "last_clean_scrub_stamp": "2013-08-01 15:23:29.253783"},
      "stats": { "version": "413'688",
          "reported": "561'1484",
          "state": "active",
          "last_fresh": "2013-08-02 12:25:41.793582",
          "last_change": "2013-08-02 09:54:08.163758",
          "last_active": "2013-08-02 12:25:41.793582",
          "last_clean": "2013-08-02 09:49:34.246621",
          "last_became_active": "0.000000",
          "last_unstale": "2013-08-02 12:25:41.793582",
          "mapping_epoch": 641,
          "log_start": "0'0",
          "ondisk_log_start": "0'0",
          "created": 67,
          "last_epoch_clean": 67,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2013-08-01 15:23:29.253783",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2013-08-01 15:23:29.253783",
          "last_clean_scrub_stamp": "2013-08-01 15:23:29.253783",
          "log_size": 0,
          "ondisk_log_size": 0,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 134217728,
              "num_objects": 32,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 688,
              "num_write_kb": 327680,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 45,
              "num_bytes_recovered": 188743680,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [5, 4],
          "acting": [5, 4]},
      "empty": 0,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 644},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-08-02 09:49:56.504882",
          "might_have_unfound": [],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "backfill_pos": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "pull_from_peer": [],
              "pushing": []},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2013-08-02 09:49:55.501261"}]}
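In case it's useful, the workaround I mention above was essentially the following (the pool name is just a placeholder here - substitute whichever pool holds the stuck pgs):

# ceph osd pool set <pool-name> size 1
  ... wait until "ceph health detail" no longer reports the stuck pgs ...
# ceph osd pool set <pool-name> size 2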
"scrubber.epoch_start": "0", "scrubber.active": 0, "scrubber.block_writes": 0, "scrubber.finalizing": 0, "scrubber.waiting_on": 0, "scrubber.waiting_on_whom": []}}, { "name": "Started", "enter_time": "2013-08-02 09:49:55.501261"}]} -----Original Message----- From: Samuel Just [mailto:sam.just@xxxxxxxxxxx] Sent: 12 August 2013 22:52 To: Howarth, Chris [CCC-OT_IT] Cc: ceph-users@xxxxxxxx Subject: Re: Ceph pgs stuck unclean Can you attach the output of: ceph -s ceph pg dump ceph osd dump and run ceph osd getmap -o /tmp/osdmap and attach /tmp/osdmap/ -Sam On Wed, Aug 7, 2013 at 1:58 AM, Howarth, Chris <chris.howarth@xxxxxxxx> wrote: > Hi, > > One of our OSD disks failed on a cluster and I replaced it, but > when it failed it did not completely recover and I have a number of > pgs which are stuck unclean: > > > > # ceph health detail > > HEALTH_WARN 7 pgs stuck unclean > > pg 3.5a is stuck unclean for 335339.172516, current state active, last > acting [5,4] > > pg 3.54 is stuck unclean for 335339.157608, current state active, last > acting [15,7] > > pg 3.55 is stuck unclean for 335339.167154, current state active, last > acting [16,9] > > pg 3.1c is stuck unclean for 335339.174150, current state active, last > acting [8,16] > > pg 3.a is stuck unclean for 335339.177001, current state active, last > acting [0,8] > > pg 3.4 is stuck unclean for 335339.165377, current state active, last > acting [17,4] > > pg 3.5 is stuck unclean for 335339.149507, current state active, last > acting [2,6] > > > > Does anyone know how to fix these ? I tried the following, but this > does not seem to work: > > > > # ceph pg 3.5 mark_unfound_lost revert > > pg has no unfound objects > > > > thanks > > > > Chris > > __________________________ > > Chris Howarth > > OS Platforms Engineering > > Citi Architecture & Technology Engineering > > (e) chris.howarth@xxxxxxxx > > (t) +44 (0) 20 7508 3848 > > (f) +44 (0) 20 7508 0964 > > (mail-drop) CGC-06-3A > > > > > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com