Re: trying to understand stuck_unclean

Dear all,

I've solved the issue. It turns out my CRUSH map was a bit wonky: the weight of a datacenter bucket was not equal to the sum of the weights of the OSDs below it. I must have accidentally edited it by hand at some point.

was

-9  3   datacenter COM1
-6  6       room 02-WIRECEN
-4  3           host ceph2
<snip>
-2  3           host ceph1
<snip>


should be

-9  6   datacenter COM1
-6  6       room 02-WIRECEN
-4  3           host ceph2
<snip>
-2  3           host ceph1
<snip>


Moving a host away from the bucket and moving it back solved the problem.
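
In case it helps anyone else, this is roughly what that move looks like on the command line (the bucket and host names are from my setup, so adjust accordingly):

$ ceph osd crush move ceph2 root=default
$ ceph osd crush move ceph2 root=default datacenter=COM1 room=02-WIRECEN

Alternatively, the weight could be fixed by hand in a decompiled copy of the CRUSH map; a rough sketch, with arbitrary file names (edit the text file so the COM1 weight equals the sum of its children, then inject it back):

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
$ crushtool -c crushmap.txt -o crushmap.new
$ ceph osd setcrushmap -i crushmap.new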

- WP


On Fri, Jan 10, 2014 at 12:22 PM, YIP Wai Peng <yipwp@xxxxxxxxxxxxxxx> wrote:
Hi Wido,

Thanks for the reply. I've dumped the query output below.

"recovery_state" doesn't say anything, there are also no missing or unfounded objects. What else could be wrong?

- WP

P.S.: I am already running the optimal tunables.


{ "state": "active+remapped",
  "epoch": 6500,
  "up": [
        7],
  "acting": [
        7,
        3],
  "info": { "pgid": "1.fa",
      "last_update": "0'0",
      "last_complete": "0'0",
      "log_tail": "0'0",
      "last_user_version": 0,
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 1,
          "last_epoch_started": 6377,
          "last_epoch_clean": 6379,
          "last_epoch_split": 0,
          "same_up_since": 6365,
          "same_interval_since": 6365,
          "same_primary_since": 6348,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2014-01-09 11:37:18.202247",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2014-01-09 11:37:18.202247",
          "last_clean_scrub_stamp": "2014-01-09 11:37:18.202247"},
      "stats": { "version": "0'0",
          "reported_seq": "4320",
          "reported_epoch": "6500",
          "state": "active+remapped",
          "last_fresh": "2014-01-10 12:19:46.219163",
          "last_change": "2014-01-10 11:18:53.147842",
          "last_active": "2014-01-10 12:19:46.219163",
          "last_clean": "2014-01-09 22:02:41.243761",
          "last_became_active": "0.000000",
          "last_unstale": "2014-01-10 12:19:46.219163",
          "mapping_epoch": 6351,
          "log_start": "0'0",
          "ondisk_log_start": "0'0",
          "created": 1,
          "last_epoch_clean": 6379,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2014-01-09 11:37:18.202247",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2014-01-09 11:37:18.202247",
          "last_clean_scrub_stamp": "2014-01-09 11:37:18.202247",
          "log_size": 0,
          "ondisk_log_size": 0,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 0,
              "num_objects": 0,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 0,
              "num_write_kb": 0,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 0,
              "num_bytes_recovered": 0,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [
                7],
          "acting": [
                7,
                3]},
      "empty": 1,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 6377},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2014-01-10 11:18:53.147802",
          "might_have_unfound": [],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "last_backfill_started": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "recovering": [],
              "pg_backend": { "pull_from_peer": [],
                  "pushing": []}},
          "scrub": { "scrubber.epoch_start": "4757",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2014-01-10 11:18:40.137868"}]}



On Fri, Jan 10, 2014 at 12:16 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
On 01/10/2014 05:13 AM, YIP Wai Peng wrote:
Dear all,

I have some PGs that are stuck_unclean, and I'm trying to understand why.
Hopefully someone can help shed some light on it.

For example, one of them is

# ceph pg dump_stuck unclean
pg_stat objects mip degr unf bytes log disklog state            state_stamp
1.fa    0       0   0    0   0     0   0       active+remapped  2014-01-10 11:18:53.147842

v    reported   up   acting  last_scrub  scrub_stamp                 last_deep_scrub  deep_scrub_stamp
0'0  6452:4272  [7]  [7,3]   0'0         2014-01-09 11:37:18.202247  0'0              2014-01-09 11:37:18.202247



My pool 1 looks like this

pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 3 object_hash
rjenkins pg_num 256 pgp_num 256 last_change 2605 owner 0


The rule 3 is

rule different_host {
         ruleset 3
         type replicated
         min_size 1
         max_size 10
         step take default
         step chooseleaf firstn 0 type host
         step emit
}
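
(Aside: a rule like this can be dry-run against the compiled CRUSH map with crushtool. A rough sketch, with the rule number and replica count taken from the pool above and arbitrary file names:

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -i crushmap.bin --test --rule 3 --num-rep 2 --show-bad-mappings

Any "bad mapping" output would mean CRUSH failed to find two OSDs for some inputs, which usually points at a problem with the map or the weights.)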


My osd tree looks like

# id    weight  type name                       up/down reweight
-1      40      root default
-7      3               datacenter CR2
-5      3                       host ceph3
6       1                               osd.6   up      1
7       1                               osd.7   up      1
8       1                               osd.8   up      1
<snip>
-9      3               datacenter COM1
-6      6                       room 02-WIRECEN
-4      3                               host ceph2
3       1                                       osd.3   up      1
4       1                                       osd.4   up      1
5       1                                       osd.5   up      1


osd.7 and osd.3 are on different hosts, so the rule is satisfied. Why is
the PG still in the 'remapped' state, and what is it waiting for?


Try:

$ ceph pg 1.fa query

That will tell you the cause of why the PG is stuck.
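
If the query output is not conclusive, it can also help to look at the stuck PGs as a group; roughly:

$ ceph health detail            # lists each stuck PG with its current state and acting set
$ ceph pg dump_stuck unclean    # the same information in table form
$ ceph pg <pgid> query          # then check the "recovery_state" section for each PG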

- Peng





--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
