Right, if you think about it, any objects written during the time
without 1,2,3 really do require 4 to recover. You can reduce the risk
of this by setting min_size to something greater than 8, but you also
won't be able to recover with fewer than min_size OSDs. So if you set
min_size to 9 and lose 1,2,3, you won't have lost data, but you won't
be able to recover until you reduce min_size. It's mainly there so
that you won't accept writes during a brief outage which brings you
down to 8.

Note, I think you could have marked osd 8 lost and then marked the
unrecoverable objects lost.
-Sam

On Thu, Nov 13, 2014 at 11:20 AM, GuangYang <yguang11@xxxxxxxxxxx> wrote:
> Thanks Sam for the quick response. Just want to make sure I understand it correctly:
>
> If we have [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] and all of 1,2,3 are down, the PG is still active because we are using 8 + 3. Once 4 goes down as well, the PG cannot become active again unless we bring 4 back up, even if we bring up 1,2,3. Is my understanding correct here?
>
> Thanks,
> Guang
>
> ----------------------------------------
>> Date: Thu, 13 Nov 2014 09:06:27 -0800
>> Subject: Re: PG down
>> From: sam.just@xxxxxxxxxxx
>> To: yguang11@xxxxxxxxxxx
>> CC: ceph-devel@xxxxxxxxxxxxxxx
>>
>> It looks like the acting set went down to the min allowable size and
>> went active with osd 8. At that point you needed every member of that
>> acting set to go active later on to avoid losing writes. You can
>> prevent this by setting a min_size above the number of data chunks.
>> -Sam
>>
>> On Thu, Nov 13, 2014 at 4:15 AM, GuangYang <yguang11@xxxxxxxxxxx> wrote:
>>> Hi Sam,
>>> Yesterday one PG went down in our cluster and I am confused by the PG state. I am not sure whether it is a bug (or an issue that has already been fixed, as I see a couple of related fixes in giant). It would be nice if you could help take a look.
>>>
>>> Here is what happened:
>>>
>>> We are using an EC pool with 8 data chunks and 3 coding chunks. Say the PG has the up/acting set [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. One OSD in the set went down and came back up, which triggered PG recovery. However, during recovery the primary OSD crashed due to a corrupted file chunk, then another OSD became primary, started recovery and crashed, and so on, until 4 OSDs in the set were down and the PG was marked down.
>>>
>>> After that, we left the OSD with the corrupted data down and started all the other crashed OSDs. We expected the PG to become active; however, the PG is still down with the following query information:
>>>
>>> { "state": "down+remapped+inconsistent+peering",
>>>   "epoch": 4469,
>>>   "up": [
>>>         377,
>>>         107,
>>>         328,
>>>         263,
>>>         395,
>>>         467,
>>>         352,
>>>         475,
>>>         333,
>>>         37,
>>>         380],
>>>   "acting": [
>>>         2147483647,
>>>         107,
>>>         328,
>>>         263,
>>>         395,
>>>         2147483647,
>>>         352,
>>>         475,
>>>         333,
>>>         37,
>>>         380],
>>>   ...
>>>         377]}],
>>>   "probing_osds": [
>>>         "37(9)",
>>>         "107(1)",
>>>         "263(3)",
>>>         "328(2)",
>>>         "333(8)",
>>>         "352(6)",
>>>         "377(0)",
>>>         "380(10)",
>>>         "395(4)",
>>>         "467(5)",
>>>         "475(7)"],
>>>   "blocked": "peering is blocked due to down osds",
>>>   "down_osds_we_would_probe": [
>>>         8],
>>>   "peering_blocked_by": [
>>>         { "osd": 8,
>>>           "current_lost_at": 0,
>>>           "comment": "starting or marking this osd lost may let us proceed"}]},
>>> { "name": "Started",
>>>   "enter_time": "2014-11-12 10:12:23.067369"}],
>>> }
>>>
>>> Here osd.8 is the one with the corrupted data.
>>>
>>> The way we worked around this issue was to set norecover and start osd.8, get that PG active, remove the object (via rados), and then unset norecover, after which things became clean again. But the most confusing part is that even when osd.8 was the only OSD left down, the PG couldn't become active.
>>>
>>> We are using firefly v0.80.4.
>>>
>>> Thanks,
>>> Guang
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
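
As a rough illustration of Sam's explanation at the top of the thread, the sketch below models the [1..11] example with 8 data chunks and 3 coding chunks. It is a deliberately simplified model, not Ceph's peering code: the names went_active and can_recover and the "list of acting sets" representation of past intervals are hypothetical, and the rule it encodes (an interval that may have accepted writes needs at least k of its acting members up again) is an assumption drawn from the discussion above.

```python
from typing import Iterable, Set

K = 8  # data chunks, from the 8+3 profile discussed in the thread
M = 3  # coding chunks


def went_active(acting: Set[int], min_size: int, k: int = K) -> bool:
    """Assumed rule: an interval may have accepted writes only if at
    least max(min_size, k) members of the acting set were up."""
    return len(acting) >= max(min_size, k)


def can_recover(intervals: Iterable[Set[int]], up_now: Set[int],
                min_size: int, k: int = K) -> bool:
    """Assumed rule: every interval that may have accepted writes must
    still have at least k of its acting members up right now."""
    for acting in intervals:
        if went_active(acting, min_size, k) and len(acting & up_now) < k:
            return False
    return True


full = set(range(1, K + M + 1))        # OSDs 1..11
without_123 = full - {1, 2, 3}         # interval while 1,2,3 were down
intervals = [full, without_123]
up_now = full - {4}                    # 1,2,3 are back, 4 is down

# min_size = 8: the 8-member interval took writes, so 4 is required.
print(can_recover(intervals, up_now, min_size=8))  # False
# min_size = 9: the 8-member interval never went active.
print(can_recover(intervals, up_now, min_size=9))  # True
```

With min_size equal to the number of data chunks, writes made while only 4..11 were up exist on exactly k shards, so losing any one of them blocks recovery. Raising min_size to 9 trades availability during that outage for not depending on every surviving OSD afterwards.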