Dear all,
I've solved the issue. It turns out my CRUSH map was a bit wonky: the weight of a datacenter bucket was not equal to the sum of the weights of the OSDs below it. I must have accidentally edited it by hand at some point.
was:

-9      3       datacenter COM1
-6      6           room 02-WIRECEN
-4      3               host ceph2
<snip>
-2      3               host ceph1
<snip>

should be:

-9      6       datacenter COM1
-6      6           room 02-WIRECEN
-4      3               host ceph2
<snip>
-2      3               host ceph1
<snip>
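(In case anyone wants to check their own map: the bucket weights show up in the decompiled CRUSH map. Something along these lines will pull it out, decompile it, and push an edited version back in; the file paths here are just examples.)

ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
# edit /tmp/crushmap.txt if needed, then recompile and inject it
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
ceph osd setcrushmap -i /tmp/crushmap.new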
Moving a host out of the bucket and then moving it back in solved the problem.
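(For anyone hitting the same thing: the move can be done from the CLI, roughly like the commands below, using the bucket names from my map above.)

# move the host bucket out to the root of the hierarchy...
ceph osd crush move ceph1 root=default
# ...and then back into its room
ceph osd crush move ceph1 datacenter=COM1 room=02-WIRECEN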
- WP
On Fri, Jan 10, 2014 at 12:22 PM, YIP Wai Peng <yipwp@xxxxxxxxxxxxxxx> wrote:
Hi Wido,

Thanks for the reply. I've dumped the query below. "recovery_state" doesn't say anything, and there are also no missing or unfound objects. What else could be wrong?

- WP

P.S.: I am running tunables optimal already.

{
  "state": "active+remapped",
  "epoch": 6500,
  "up": [7],
  "acting": [7,3],
  "info": {
    "pgid": "1.fa",
    "last_update": "0'0",
    "last_complete": "0'0",
    "log_tail": "0'0",
    "last_user_version": 0,
    "last_backfill": "MAX",
    "purged_snaps": "[]",
    "history": {
      "epoch_created": 1,
      "last_epoch_started": 6377,
      "last_epoch_clean": 6379,
      "last_epoch_split": 0,
      "same_up_since": 6365,
      "same_interval_since": 6365,
      "same_primary_since": 6348,
      "last_scrub": "0'0",
      "last_scrub_stamp": "2014-01-09 11:37:18.202247",
      "last_deep_scrub": "0'0",
      "last_deep_scrub_stamp": "2014-01-09 11:37:18.202247",
      "last_clean_scrub_stamp": "2014-01-09 11:37:18.202247"
    },
    "stats": {
      "version": "0'0",
      "reported_seq": "4320",
      "reported_epoch": "6500",
      "state": "active+remapped",
      "last_fresh": "2014-01-10 12:19:46.219163",
      "last_change": "2014-01-10 11:18:53.147842",
      "last_active": "2014-01-10 12:19:46.219163",
      "last_clean": "2014-01-09 22:02:41.243761",
      "last_became_active": "0.000000",
      "last_unstale": "2014-01-10 12:19:46.219163",
      "mapping_epoch": 6351,
      "log_start": "0'0",
      "ondisk_log_start": "0'0",
      "created": 1,
      "last_epoch_clean": 6379,
      "parent": "0.0",
      "parent_split_bits": 0,
      "last_scrub": "0'0",
      "last_scrub_stamp": "2014-01-09 11:37:18.202247",
      "last_deep_scrub": "0'0",
      "last_deep_scrub_stamp": "2014-01-09 11:37:18.202247",
      "last_clean_scrub_stamp": "2014-01-09 11:37:18.202247",
      "log_size": 0,
      "ondisk_log_size": 0,
      "stats_invalid": "0",
      "stat_sum": {
        "num_bytes": 0,
        "num_objects": 0,
        "num_object_clones": 0,
        "num_object_copies": 0,
        "num_objects_missing_on_primary": 0,
        "num_objects_degraded": 0,
        "num_objects_unfound": 0,
        "num_read": 0,
        "num_read_kb": 0,
        "num_write": 0,
        "num_write_kb": 0,
        "num_scrub_errors": 0,
        "num_shallow_scrub_errors": 0,
        "num_deep_scrub_errors": 0,
        "num_objects_recovered": 0,
        "num_bytes_recovered": 0,
        "num_keys_recovered": 0
      },
      "stat_cat_sum": {},
      "up": [7],
      "acting": [7,3]
    },
    "empty": 1,
    "dne": 0,
    "incomplete": 0,
    "last_epoch_started": 6377
  },
  "recovery_state": [
    {
      "name": "Started\/Primary\/Active",
      "enter_time": "2014-01-10 11:18:53.147802",
      "might_have_unfound": [],
      "recovery_progress": {
        "backfill_target": -1,
        "waiting_on_backfill": 0,
        "last_backfill_started": "0\/\/0\/\/-1",
        "backfill_info": {
          "begin": "0\/\/0\/\/-1",
          "end": "0\/\/0\/\/-1",
          "objects": []
        },
        "peer_backfill_info": {
          "begin": "0\/\/0\/\/-1",
          "end": "0\/\/0\/\/-1",
          "objects": []
        },
        "backfills_in_flight": [],
        "recovering": [],
        "pg_backend": {
          "pull_from_peer": [],
          "pushing": []
        }
      },
      "scrub": {
        "scrubber.epoch_start": "4757",
        "scrubber.active": 0,
        "scrubber.block_writes": 0,
        "scrubber.finalizing": 0,
        "scrubber.waiting_on": 0,
        "scrubber.waiting_on_whom": []
      }
    },
    {
      "name": "Started",
      "enter_time": "2014-01-10 11:18:40.137868"
    }
  ]
}

On Fri, Jan 10, 2014 at 12:16 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
On 01/10/2014 05:13 AM, YIP Wai Peng wrote:
Dear all,
I have some PGs that are stuck unclean, and I'm trying to understand why.
Hopefully someone can help me shed some light on it.
For example, one of them is:
# ceph pg dump_stuck unclean
1.fa    0    0    0    0    0    0    0    active+remapped    2014-01-10 11:18:53.147842    0'0    6452:4272    [7]    [7,3]    0'0    2014-01-09 11:37:18.202247    0'0    2014-01-09 11:37:18.202247
My pool 1 looks like this:
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 256 pgp_num 256 last_change 2605 owner 0
Rule 3 is:
rule different_host {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
My osd tree looks like
# id    weight  type name               up/down reweight
-1      40      root default
-7      3           datacenter CR2
-5      3               host ceph3
6       1                   osd.6       up      1
7       1                   osd.7       up      1
8       1                   osd.8       up      1
<snip>
-9      3           datacenter COM1
-6      6               room 02-WIRECEN
-4      3                   host ceph2
3       1                       osd.3   up      1
4       1                       osd.4   up      1
5       1                       osd.5   up      1
osd.7 and osd.3 are on different hosts, so the rule is satisfied. Why is
the PG still in the 'remapped' state, and what is it waiting for?
Try:
$ ceph pg 1.fa query
That will tell you why the PG is stuck.
- Peng
--
Wido den Hollander
42on B.V.
Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com