Dear all,
I've solved the issue. It turns out my CRUSH map was a bit wonky: the weight of a datacenter bucket was not equal to the sum of the weights of the OSDs below it. I must have accidentally edited it by hand at some point.
was:

-9      3       datacenter COM1
-6      6           room 02-WIRECEN
-4      3               host ceph2
<snip>
-2      3               host ceph1
<snip>

should be:

-9      6       datacenter COM1
-6      6           room 02-WIRECEN
-4      3               host ceph2
<snip>
-2      3               host ceph1
<snip>
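(In case anyone wants to check their own map: the bucket weights show up in the decompiled CRUSH map. Something along these lines will pull it out, decompile it, and push an edited version back in; the file paths here are just examples.)

ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
# edit /tmp/crushmap.txt if needed, then recompile and inject it
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
ceph osd setcrushmap -i /tmp/crushmap.new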
Moving a host out of the bucket and then moving it back in solved the problem.
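(For anyone hitting the same thing: the move can be done from the CLI, roughly like the commands below, using the bucket names from my map above.)

# move the host bucket out to the root of the hierarchy...
ceph osd crush move ceph1 root=default
# ...and then back into its room
ceph osd crush move ceph1 datacenter=COM1 room=02-WIRECEN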
- WP
On Fri, Jan 10, 2014 at 12:22 PM, YIP Wai Peng <yipwp@xxxxxxxxxxxxxxx> wrote:
Hi Wido,

Thanks for the reply. I've dumped the query below. "recovery_state" doesn't say anything, and there are also no missing or unfound objects. What else could be wrong?

- WP

P.S.: I am running tunables optimal already.

{
  "state": "active+remapped",
  "epoch": 6500,
  "up": [7],
  "acting": [7,3],
  "info": {
    "pgid": "1.fa",
    "last_update": "0'0",
    "last_complete": "0'0",
    "log_tail": "0'0",
    "last_user_version": 0,
    "last_backfill": "MAX",
    "purged_snaps": "[]",
    "history": {
      "epoch_created": 1,
      "last_epoch_started": 6377,
      "last_epoch_clean": 6379,
      "last_epoch_split": 0,
      "same_up_since": 6365,
      "same_interval_since": 6365,
      "same_primary_since": 6348,
      "last_scrub": "0'0",
      "last_scrub_stamp": "2014-01-09 11:37:18.202247",
      "last_deep_scrub": "0'0",
      "last_deep_scrub_stamp": "2014-01-09 11:37:18.202247",
      "last_clean_scrub_stamp": "2014-01-09 11:37:18.202247"
    },
    "stats": {
      "version": "0'0",
      "reported_seq": "4320",
      "reported_epoch": "6500",
      "state": "active+remapped",
      "last_fresh": "2014-01-10 12:19:46.219163",
      "last_change": "2014-01-10 11:18:53.147842",
      "last_active": "2014-01-10 12:19:46.219163",
      "last_clean": "2014-01-09 22:02:41.243761",
      "last_became_active": "0.000000",
      "last_unstale": "2014-01-10 12:19:46.219163",
      "mapping_epoch": 6351,
      "log_start": "0'0",
      "ondisk_log_start": "0'0",
      "created": 1,
      "last_epoch_clean": 6379,
      "parent": "0.0",
      "parent_split_bits": 0,
      "last_scrub": "0'0",
      "last_scrub_stamp": "2014-01-09 11:37:18.202247",
      "last_deep_scrub": "0'0",
      "last_deep_scrub_stamp": "2014-01-09 11:37:18.202247",
      "last_clean_scrub_stamp": "2014-01-09 11:37:18.202247",
      "log_size": 0,
      "ondisk_log_size": 0,
      "stats_invalid": "0",
      "stat_sum": {
        "num_bytes": 0,
        "num_objects": 0,
        "num_object_clones": 0,
        "num_object_copies": 0,
        "num_objects_missing_on_primary": 0,
        "num_objects_degraded": 0,
        "num_objects_unfound": 0,
        "num_read": 0,
        "num_read_kb": 0,
        "num_write": 0,
        "num_write_kb": 0,
        "num_scrub_errors": 0,
        "num_shallow_scrub_errors": 0,
        "num_deep_scrub_errors": 0,
        "num_objects_recovered": 0,
        "num_bytes_recovered": 0,
        "num_keys_recovered": 0
      },
      "stat_cat_sum": {},
      "up": [7],
      "acting": [7,3]
    },
    "empty": 1,
    "dne": 0,
    "incomplete": 0,
    "last_epoch_started": 6377
  },
  "recovery_state": [
    {
      "name": "Started\/Primary\/Active",
      "enter_time": "2014-01-10 11:18:53.147802",
      "might_have_unfound": [],
      "recovery_progress": {
        "backfill_target": -1,
        "waiting_on_backfill": 0,
        "last_backfill_started": "0\/\/0\/\/-1",
        "backfill_info": {
          "begin": "0\/\/0\/\/-1",
          "end": "0\/\/0\/\/-1",
          "objects": []
        },
        "peer_backfill_info": {
          "begin": "0\/\/0\/\/-1",
          "end": "0\/\/0\/\/-1",
          "objects": []
        },
        "backfills_in_flight": [],
        "recovering": [],
        "pg_backend": {
          "pull_from_peer": [],
          "pushing": []
        }
      },
      "scrub": {
        "scrubber.epoch_start": "4757",
        "scrubber.active": 0,
        "scrubber.block_writes": 0,
        "scrubber.finalizing": 0,
        "scrubber.waiting_on": 0,
        "scrubber.waiting_on_whom": []
      }
    },
    {
      "name": "Started",
      "enter_time": "2014-01-10 11:18:40.137868"
    }
  ]
}

On Fri, Jan 10, 2014 at 12:16 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
On 01/10/2014 05:13 AM, YIP Wai Peng wrote:
Dear all,
I have some PGs that are stuck unclean, and I'm trying to understand why.
Hopefully someone can help me shed some light on it.
For example, one of them is:
# ceph pg dump_stuck unclean
1.fa    0    0    0    0    0    0    0    active+remapped    2014-01-10 11:18:53.147842    0'0    6452:4272    [7]    [7,3]    0'0    2014-01-09 11:37:18.202247    0'0    2014-01-09 11:37:18.202247
My pool 1 looks like this:
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 256 pgp_num 256 last_change 2605 owner 0
Rule 3 is:
rule different_host {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
My osd tree looks like
# id    weight  type name               up/down reweight
-1      40      root default
-7      3           datacenter CR2
-5      3               host ceph3
6       1                   osd.6       up      1
7       1                   osd.7       up      1
8       1                   osd.8       up      1
<snip>
-9      3           datacenter COM1
-6      6               room 02-WIRECEN
-4      3                   host ceph2
3       1                       osd.3   up      1
4       1                       osd.4   up      1
5       1                       osd.5   up      1
osd.7 and osd.3 are on different hosts, so the rule is satisfied. Why is
the PG still in the 'remapped' state, and what is it waiting for?
Try:
$ ceph pg 1.fa query
That will tell you why the PG is stuck.
- Peng
--
Wido den Hollander
42on B.V.
Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com