Sorry, still on vacation and I haven't really wrapped my head around everything that's being discussed. However, w.r.t. wal operations, I would strongly favor an approach that minimizes the amount of "future" operations that are recorded (which I'll call intentions -- i.e., non-binding hints about extra work that needs to get done). Much of the complexity here is because the intentions -- after being recorded -- will need to be altered based on subsequent operations. Hence every write operation will need to digest the historical intentions and potentially update them -- this is VERY complex, potentially much more complex than code that simply examines the current state and re-determines the correct next operation (i.e., de-wal, gc, etc.) Additional complexity arises because you're recording two sets of state that require consistency checking -- in my experience, this road leads to perdition.... Maybe I've missed something critical (vacation-headed) if so, please ignore. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > -----Original Message----- > From: Sage Weil [mailto:sage@xxxxxxxxxxxx] > Sent: Thursday, May 12, 2016 6:54 AM > To: Igor Fedotov <ifedotov@xxxxxxxxxxxx> > Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph- > devel@xxxxxxxxxxxxxxx > Subject: Re: 2 related bluestore questions > > On Wed, 11 May 2016, Sage Weil wrote: > > On Wed, 11 May 2016, Igor Fedotov wrote: > > > On 11.05.2016 16:10, Sage Weil wrote: > > > > On Wed, 11 May 2016, Igor Fedotov wrote: > > > > > > I took a stab at a revised wal_op_t here: > > > > > > > > > > > > > > > > > > https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/ > > > > > > os/bluestore/bluestore_types.h#L595-L605 > > > > > > > > > > > > This is enough to implement the basic wal overwrite case here: > > > > > > > > > > > > > > > > > > https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/ > > > > > > os/bluestore/BlueStore.cc#L5522-L5578 > > > > > > > > > > > > It's overkill for that, but something like this ought to be > > > > > > sufficiently general to express the more complicated wal (and > > > > > > compaction/gc/cleanup) operations, where we are reading bits > > > > > > of data from lots of different previous blobs, verifying > > > > > > checksums, and then assembling the results into a new buffer > > > > > > that gets written somewhere else. The read_extent_map and > > > > > > write_map offsets are logical offsets in a buffer we assemble > > > > > > and then write to b_off~b_len in the specific blob. I didn't > > > > > > get to the _do_wal_op part that actually does it, but it would > > > > > > do the final write, csum calculation, and metadata update. > > > > > > Probably... the allocation would happen then too, if the > > > > > > specified blob didn't already have pextents. Tha way we can > > > > > > do compression at that stage as well? > > > > > > > > > > > > What do you think? > > > > > Not completely sure that it's a good idea to have read stage > > > > > description stored in WAL record? Wouldn't that produce any > > > > > conflicts/inconsistencies when multiple WAL records deal with > > > > > the same or close lextents and previous WAL updates lextents to > > > > > read. May be it's better to prepare such a description exactly > > > > > when WAL is applied? And WAL record to have just a basic write > > > > > info? > > > > Yeah, I think this is a problem. I see two basic paths: > > > > > > > > - We do a wal flush before queueing a new wal event to avoid > > > > races like this. Or perhaps we only do it when the wal event(s) > > > > touch the same blob(s). That's simple to reason about, but means > > > > that a series of small IOs to the same object (or blob) will > > > > serialize the kv commit and wal r/m/w operations. (Note that this > > > > is no worse than the naive approach of doing the read part up > > > > front, and it only happens when you have successive wal ops on the > same object (or blob)). > > > > > > > > - We describe the wal read-side in terms of the current onode > > > > state. For example, 'read object offset 0..100, use provided > > > > buffer for 100..4096, overwrite block'. That can be pipelined. > > > > But there are other operations that would require we flush the wal > > > > events, like a truncate or zero or other write that clobbers that region > of the object. > > > > Maybe/hopefully in those cases we don't care (it no longer matters > > > > that this wal event do the write we originally intended) but we'd > > > > need to think pretty carefully about it. FWIW, truncate already > > > > does an > > > > o->flush(). > > > I'd prefer the second approach. Probably with some modification... > > > As far as I understand with the approach above you are trying to > > > locate all write logic at a single place and have WAL machinery as a > > > straightforward executor for already prepared tasks. Not sure this > > > is beneficial enough. But definitely it's more complex and > > > error-prone. And potentially you will need extend WAL machinery task > description from time to time... > > > As an alternative one can eliminate that read description in WAL > > > record at all. Let's simply record what loffset we are going to > > > write to and data itself. Thus we have simple write request description. > > > And when WAL is applied corresponding code should determine how to > > > do the write properly using the current lextent/blob maps state. > > > This way Write Op apply can be just a regular write handling that > > > performs sync RMW or any other implementation depending on the > > > current state, some policy, or whatever else that fits the best at the > specific moment. > > > > I like that way better! We can just add a force_sync argument to > > _do_write. That also lets us trivially disable wal (by forcing sync > > w/ a config option or whatever). > > > > The downside is that any logically conflicting request (an overlapping > > write or truncate or zero) needs to drain the wal events, whereas with > > a lower-level wal description there might be cases where we can ignore > > the wal operation. I suspect the trivial solution of o->flush() on > > write/truncate/zero will be pretty visible in benchmarks. Tracking > > in-flight wal ops with an interval_set would probably work well enough. > > Hmm, I'm not sure this will pan out. The main problem is that if we call back > into the write code (with a sync flag), we will have to do write IO, and this > wreaks havoc on our otherwise (mostly) orderly state machine. > I think it can be done if we build in a similar guard like _txc_finish_io so that > we wait for the wal events to also complete IO in order before committing > them. I think. > > But the other problem is the checksum thing that came up in another thread, > where the read-side of a read/modify/write might fail teh checksum because > the wal write hit disk but the kv portion didn't commit. I see a few options: > > 1) If there are checksums and we're doing a sub-block overwrite, we have to > write/cow it elsewhere. This probably means min_alloc_size cow operations > for small writes. In which case we needn't bother doing a wal even in the > first place--the whole point is to enable an overwrite. > > 2) We do loose checksum validation that will accept either the old checksum > or the expected new checksum for the read stage. This handles these two > crash cases: > > * kv commit of op + wal event > <crash here, or> > * do wal io (completely) > <crash before cleaning up wal event> > * kv cleanup of wal event > > but not the case where we only partially complete the wal io. Which means > there is a small probability is "corrupt" ourselves on crash (not really corrupt, > but confuse ourselves such that we refuse to replay the wal events on > startup). > > 3) Same as 2, but simply warn if we fail that read-side checksum on replay. > This basically introduces a *very* small window which could allow an ondisk > corruption to get absorbed into our checksum. This could just be #2 + a > config option so we warn instead of erroring out. > > 4) Same as 2, but we try every combination of old and new data on > block/sector boundaries to find a valid checksum on the read-side. > > I think #1 is a non-starter because it turns a 4K write into a 64K read + seek + > 64K write on an HDD. Or forces us to run with min_alloc_size=4K on HDD, > which would risk very bad fragmentation. > > Which makes we want #3 (initially) and then #4. But... if we do the "wal is > just a logical write", that means this weird replay handling logic creeps into > the normal write path. > > I'm currently leaning toward keeping the wal events special (lower-level), but > doing what we can to make it work with the same mid- to low-level helper > functions (for reading and verifying blobs, etc.). > > sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html