On 3/9/16, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Wed, Mar 9, 2016 at 6:25 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> For replicated pools we default to min_size=2 when size=3
>> (size - size/2) in order to avoid the split-brain scenario described,
>> for example, here:
>> http://www.spinics.net/lists/ceph-devel/msg27008.html
>>
>> But for erasure pools we default to min_size=k, which I think is a
>> recipe for similar problems.
>>
>> Shouldn't we default to at least min_size=k+1?
>>
>> diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
>> index 77e26de..5d51686 100644
>> --- a/src/mon/OSDMonitor.cc
>> +++ b/src/mon/OSDMonitor.cc
>> @@ -4427,7 +4427,7 @@ int OSDMonitor::prepare_pool_size(const unsigned pool_type,
>>        err = get_erasure_code(erasure_code_profile, &erasure_code, ss);
>>        if (err == 0) {
>>          *size = erasure_code->get_chunk_count();
>> -        *min_size = erasure_code->get_data_chunk_count();
>> +        *min_size = erasure_code->get_data_chunk_count() + 1;
>>        }
>>      }
>>      break;
>
> Well, losing any OSDs at that point would be bad, since the data would
> become inaccessible until you get that whole set back, but there's not
> really any chance of serving up bad reads like Sam is worried about in
> the ReplicatedPG case. (...at least, assuming you have more data chunks
> than parity chunks.) Send in a PR on github?
> -Greg

Oops, that link discussed reads, but I'm more worried about writes. I.e.,
if we allow writes while only k OSDs are up, one of the m down OSDs could
come back and start backfilling or recovery, and then one of the k OSDs
that took the writes could go down before recovery completes, leaving
those writes on fewer than k available OSDs.

PR incoming.

.. Dan
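
For concreteness, here is a small standalone sketch (plain C++, not Ceph
code) of how the two defaults behave under an assumed k=4, m=2 profile;
the profile and numbers are purely illustrative:

// Illustration only: assumed k=4, m=2 profile, showing why min_size = k
// leaves no margin for a further failure during recovery.
#include <iostream>

int main() {
  const unsigned k = 4, m = 2;           // assumed example profile
  const unsigned size = k + m;           // total chunks per object = 6

  const unsigned min_size_old = k;       // current default
  const unsigned min_size_new = k + 1;   // proposed default

  // Scenario from the mail: m OSDs are down, writes land on the remaining
  // k OSDs, then one of those k OSDs fails before recovery completes.
  const unsigned up_while_degraded = size - m;                    // 4 shards up
  const unsigned up_after_extra_failure = up_while_degraded - 1;  // 3 shards up

  std::cout << "writes allowed while degraded, old default: "
            << (up_while_degraded >= min_size_old) << "\n"    // 1 (yes)
            << "writes allowed while degraded, new default: "
            << (up_while_degraded >= min_size_new) << "\n"    // 0 (no)
            << "new writes readable after the extra failure: "
            << (up_after_extra_failure >= k) << "\n";         // 0 (needs k shards)
  return 0;
}

With min_size = k+1 the PG would refuse writes in that degraded state,
rather than accept data that a single additional failure can make
unreadable until the lost shard comes back.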