Hi Sam,

Thanks for your reply. Unfortunately I didn't capture all of this data at the time of the issue; what I do have I've pasted below.

FYI, the only way I found to fix this issue was to temporarily reduce the number of replicas in the pool to 1. The stuck pgs then disappeared, and at that point I increased the replicas back to 2. Obviously this is not a great workaround, so I am keen to get to the bottom of the problem here.

Thanks again for your help.

Chris

# ceph health detail
HEALTH_WARN 7 pgs stuck unclean
pg 3.5a is stuck unclean for 335339.172516, current state active, last acting [5,4]
pg 3.54 is stuck unclean for 335339.157608, current state active, last acting [15,7]
pg 3.55 is stuck unclean for 335339.167154, current state active, last acting [16,9]
pg 3.1c is stuck unclean for 335339.174150, current state active, last acting [8,16]
pg 3.a is stuck unclean for 335339.177001, current state active, last acting [0,8]
pg 3.4 is stuck unclean for 335339.165377, current state active, last acting [17,4]
pg 3.5 is stuck unclean for 335339.149507, current state active, last acting [2,6]

# ceph pg 3.5a query
{ "state": "active",
  "epoch": 699,
  "up": [5, 4],
  "acting": [5, 4],
  "info": { "pgid": "3.5a",
      "last_update": "413'688",
      "last_complete": "413'688",
      "log_tail": "0'0",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 67,
          "last_epoch_started": 644,
          "last_epoch_clean": 644,
          "last_epoch_split": 0,
          "same_up_since": 643,
          "same_interval_since": 643,
          "same_primary_since": 561,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2013-08-01 15:23:29.253783",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2013-08-01 15:23:29.253783",
          "last_clean_scrub_stamp": "2013-08-01 15:23:29.253783"},
      "stats": { "version": "413'688",
          "reported": "561'1484",
          "state": "active",
          "last_fresh": "2013-08-02 12:25:41.793582",
          "last_change": "2013-08-02 09:54:08.163758",
          "last_active": "2013-08-02 12:25:41.793582",
          "last_clean": "2013-08-02 09:49:34.246621",
          "last_became_active": "0.000000",
          "last_unstale": "2013-08-02 12:25:41.793582",
          "mapping_epoch": 641,
          "log_start": "0'0",
          "ondisk_log_start": "0'0",
          "created": 67,
          "last_epoch_clean": 67,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2013-08-01 15:23:29.253783",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2013-08-01 15:23:29.253783",
          "last_clean_scrub_stamp": "2013-08-01 15:23:29.253783",
          "log_size": 0,
          "ondisk_log_size": 0,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 134217728,
              "num_objects": 32,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 0,
              "num_read_kb": 0,
              "num_write": 688,
              "num_write_kb": 327680,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 45,
              "num_bytes_recovered": 188743680,
              "num_keys_recovered": 0},
          "stat_cat_sum": {},
          "up": [5, 4],
          "acting": [5, 4]},
      "empty": 0,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 644},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-08-02 09:49:56.504882",
          "might_have_unfound": [],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "backfill_pos": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "pull_from_peer": [],
              "pushing": []},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2013-08-02 09:49:55.501261"}]}
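In case it's useful, the workaround I mention above was essentially the following (the pool name is just a placeholder here - substitute whichever pool holds the stuck pgs):

# ceph osd pool set <pool-name> size 1
  ... wait until "ceph health detail" no longer reports the stuck pgs ...
# ceph osd pool set <pool-name> size 2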
"scrubber.epoch_start": "0", "scrubber.active": 0, "scrubber.block_writes": 0, "scrubber.finalizing": 0, "scrubber.waiting_on": 0, "scrubber.waiting_on_whom": []}}, { "name": "Started", "enter_time": "2013-08-02 09:49:55.501261"}]} -----Original Message----- From: Samuel Just [mailto:sam.just@xxxxxxxxxxx] Sent: 12 August 2013 22:52 To: Howarth, Chris [CCC-OT_IT] Cc: ceph-users@xxxxxxxx Subject: Re: Ceph pgs stuck unclean Can you attach the output of: ceph -s ceph pg dump ceph osd dump and run ceph osd getmap -o /tmp/osdmap and attach /tmp/osdmap/ -Sam On Wed, Aug 7, 2013 at 1:58 AM, Howarth, Chris <chris.howarth@xxxxxxxx> wrote: > Hi, > > One of our OSD disks failed on a cluster and I replaced it, but > when it failed it did not completely recover and I have a number of > pgs which are stuck unclean: > > > > # ceph health detail > > HEALTH_WARN 7 pgs stuck unclean > > pg 3.5a is stuck unclean for 335339.172516, current state active, last > acting [5,4] > > pg 3.54 is stuck unclean for 335339.157608, current state active, last > acting [15,7] > > pg 3.55 is stuck unclean for 335339.167154, current state active, last > acting [16,9] > > pg 3.1c is stuck unclean for 335339.174150, current state active, last > acting [8,16] > > pg 3.a is stuck unclean for 335339.177001, current state active, last > acting [0,8] > > pg 3.4 is stuck unclean for 335339.165377, current state active, last > acting [17,4] > > pg 3.5 is stuck unclean for 335339.149507, current state active, last > acting [2,6] > > > > Does anyone know how to fix these ? I tried the following, but this > does not seem to work: > > > > # ceph pg 3.5 mark_unfound_lost revert > > pg has no unfound objects > > > > thanks > > > > Chris > > __________________________ > > Chris Howarth > > OS Platforms Engineering > > Citi Architecture & Technology Engineering > > (e) chris.howarth@xxxxxxxx > > (t) +44 (0) 20 7508 3848 > > (f) +44 (0) 20 7508 0964 > > (mail-drop) CGC-06-3A > > > > > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com