RE: Notes from a discussion a design to allow EC overwrites

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This scheme fundamentally relies on the temporary objects "gracefully" transitioning into being portions of full-up long-term durable objects.

This means that if the allocation size for a temporary object significantly mismatches the size of the mutation (partial stripe write) you're creating a problem that's proportional to the mismatch size. 

So either, NewStore is able to efficiently allocate small chunks
OR
You have some kind of background cleanup process that reclaims that space (i.e., a "straightener")

The right choices depend on having a shared notion of the operational profile that you're trying to optimize for.

The fundamental question becomes, are you going to optimize for small-block random writes?

In my experience this is a key use-case in virtually every customer's evaluation scenario. I believe we MUST make this case reasonably efficient.

It seems to me that the lowest-complexity "fix" for the problem is to teach NewStore to have two different allocation sizes (big and small :)). 
Naturally the allocator becomes more complex. Worst case, you're now left with the garbage collection problem. Which I suspect could be punted to a subsequent release (i.e., I'm out of large blocks, but there's plenty of fragmented available space -- This can happen, but's a pretty pathological case which becomes rare-er and rare-er as you scale-out)

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx


-----Original Message-----
From: Samuel Just [mailto:sjust@xxxxxxxxxx] 
Sent: Friday, November 13, 2015 7:39 AM
To: Sage Weil <sweil@xxxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx; Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Durgin, Josh <jdurgin@xxxxxxxxxx>; Farnum, Gregory <gfarnum@xxxxxxxxxx>
Subject: Re: Notes from a discussion a design to allow EC overwrites

Lazily persisting the intermediate entries would certainly also work, but there's an argument that it needlessly adds to the write transaction.

Actually, we probably want to avoid having small writes be full stripe writes -- with a 8+3 code the difference between modifying a single stripelet and modifying the full stripe is 4 writes vs 11 writes.

It means that during peering, any log we can find (particularly the shortest one) from the most recent active interval isn't an upper bound on writes committed to the client (the (actingset.size() - M - 1)th one is?) -- we'd have to think carefully about the implications of that.
-Sam

On Fri, Nov 13, 2015 at 5:35 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Thu, 12 Nov 2015, Samuel Just wrote:
>> I was present for a discussion about allowing EC overwrites and 
>> thought it would be good to summarize it for the list:
>>
>> Commit Protocol:
>> 1) client sends write to primary
>> 2) primary reads in partial stripes needed for partial stripe 
>> overwrites from replicas
>> 3) primary sends prepares to participating replicas and queues its 
>> own prepare locally
>> 4) once all prepares are complete, primary sends a commit to the 
>> client
>> 5) primary sends applies to all participating replicas
>>
>> When we get the prepare, we write out a temp object with the data to 
>> be written.  On apply, we use an objectstore primitive to atomically 
>> move those extents into the actual object.  The log entry contains 
>> the name/id for the temp object so it can be applied on apply or removed on rollback.
>
> Currently we assume that temp objects are/can be cleared out on restart.
> This will need to change.  And we'll need to be careful that they get 
> cleaned out when peering completes (and the rollforward/rollback 
> decision is made.
>
> If the stripes are small, then the objectstore primitive may not 
> actually be that efficient.  I'd suggest also hinting that the temp 
> object will be swapped later, so that the backend can, if it's small, 
> store it in a cheap temporary location in the expectation that it will get rewritten later.
> (In particular, the newstore allocation chunk is currently targetting 
> 512kb, and this will only be efficient with narrow stripes, so it'll 
> just get double-written.  We'll want to keep the temp value in the kv 
> store [log, hopefully] and not bother to allocate disk and rewrite 
> it.)
>
>> Each log entry contains a list of the shard ids modified.  During 
>> peering, we use the same protocol for choosing the authoritative log 
>> for the existing EC pool, except that we first take the longest 
>> candidate log and use it to extend shorter logs until they hit an entry they should have witnessed, but didn't.
>>
>> Implicit in the above scheme is the fact that if an object is 
>> written, but a particular shard isn't changed, the osd with that 
>> shard will have a copy of the object with the correct data, but an 
>> out of date object_into (notably, the version will be old).  To avoid 
>> this messing up the missing set during the log scan on OSD start, 
>> we'll skip log entries we wouldn't have participated in (we may not 
>> even choose to store them, see below).  This does generally pose a 
>> challenge for maintaining prior_version.  It seems like it shouldn't 
>> be much of a problem since rollbacks can only happen on prepared log 
>> entries which haven't been applied, so log merging can never result in a divergent entry causing a missing object.  I think we can get by without it then?
>>
>> We can go further with the above and actually never persist a log 
>> entry on a shard which it did not participate in.  As long as peering 
>> works correctly, the union of the logs we got must have all entries.  
>> The primary will need to keep a complete copy in memory to manage dup op detection and recovery, however.
>
> That sounds more complex to me.  Maybe instead we could lazily persist 
> the entries (on the next pg write) so that it is always a contiguous sequence?
>
>> 2) above can also be done much more efficiently.  Some codes will 
>> allow the parity chunks to be reconstructed more efficiently, so some 
>> thought will have to go into restructuring the plugin interface to 
>> allow more efficient
>
> Hopefully the stripe size is chosen such that most writes will end up 
> being full stripe writes (we should figure out if the EC performance 
> degrades significantly in that case?).
>
> An alternative would be to do something like
>
> 1) client sends write to primary
> 2) primary sends prepare to the first M+1 shards, who write it in a 
> temporary object/location
> 3) primary acks write once they ack
> 4) asynchronously, primary recalculates the affected stripes and sends 
> an overwrite.
>
>  - step 4 doesn't need to be 2-phase, since we have the original data 
> persisted already on >M shards
>  - the client-observed latency is bounded by only M+1 OSDs (not
> acting.size())
>
> I suspect you discussed this option, though, and have other concerns 
> around its complexity?
>
> sage
��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f




[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux