Re: 2 related bluestore questions

On 12.05.2016 0:38, Allen Samuels wrote:
> Sorry, still on vacation and I haven't really wrapped my head around
> everything that's being discussed. However, w.r.t. wal operations, I
> would strongly favor an approach that minimizes the amount of "future"
> operations that are recorded (which I'll call intentions -- i.e.,
> non-binding hints about extra work that needs to get done). Much of
> the complexity here is because the intentions -- after being recorded
> -- will need to be altered based on subsequent operations. Hence every
> write operation will need to digest the historical intentions and
> potentially update them -- this is VERY complex, potentially much more
> complex than code that simply examines the current state and
> re-determines the correct next operation (i.e., de-wal, gc, etc.)
>
> Additional complexity arises because you're recording two sets of
> state that require consistency checking -- in my experience, this road
> leads to perdition....

Completely agree.

> Maybe I've missed something critical (vacation-headed); if so, please
> ignore.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx


-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
Sent: Thursday, May 12, 2016 6:54 AM
To: Igor Fedotov <ifedotov@xxxxxxxxxxxx>
Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: 2 related bluestore questions

On Wed, 11 May 2016, Sage Weil wrote:
On Wed, 11 May 2016, Igor Fedotov wrote:
On 11.05.2016 16:10, Sage Weil wrote:
On Wed, 11 May 2016, Igor Fedotov wrote:
I took a stab at a revised wal_op_t here:


https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605

This is enough to implement the basic wal overwrite case here:


https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578

It's overkill for that, but something like this ought to be
sufficiently general to express the more complicated wal (and
compaction/gc/cleanup) operations, where we are reading bits
of data from lots of different previous blobs, verifying
checksums, and then assembling the results into a new buffer
that gets written somewhere else.  The read_extent_map and
write_map offsets are logical offsets in a buffer we assemble
and then write to b_off~b_len in the specific blob.  I didn't
get to the _do_wal_op part that actually does it, but it would
do the final write, csum calculation, and metadata update.
Probably... the allocation would happen then too, if the
specified blob didn't already have pextents.  That way we can
do compression at that stage as well?
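
For concreteness, here's a rough sketch of the shape being described
(field names are illustrative, not the actual bluestore_types.h
definitions):

  #include <cstdint>
  #include <map>
  #include <vector>

  struct wal_extent_sketch_t {
    uint64_t blob;    // which previous blob to read from
    uint64_t offset;  // offset within that blob
    uint64_t length;  // how much to read
  };

  struct wal_op_sketch_t {
    // logical offset in the assembly buffer -> previous-blob extent to
    // read (and checksum-verify) into the buffer at that offset
    std::map<uint64_t, wal_extent_sketch_t> read_extent_map;
    // logical offset in the assembly buffer -> new data to splice in
    std::map<uint64_t, std::vector<char>> write_map;
    uint64_t blob;   // target blob for the final write
    uint64_t b_off;  // destination offset within that blob
    uint64_t b_len;  // length of the assembled buffer to write
  };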

What do you think?
I'm not completely sure it's a good idea to have the read-stage
description stored in the WAL record. Wouldn't that produce
conflicts/inconsistencies when multiple WAL records deal with the
same or nearby lextents, and a previous WAL record updates the
lextents to read? Maybe it's better to prepare such a description
exactly when the WAL is applied, and have the WAL record carry just
the basic write info?
Yeah, I think this is a problem.  I see two basic paths:

   - We do a wal flush before queueing a new wal event to avoid
races like this. Or perhaps we only do it when the wal event(s)
touch the same blob(s).  That's simple to reason about, but means
that a series of small IOs to the same object (or blob) will
serialize the kv commit and wal r/m/w operations.  (Note that this
is no worse than the naive approach of doing the read part up
front, and it only happens when you have successive wal ops on the
same object (or blob)).
   - We describe the wal read-side in terms of the current onode
state.  For example, 'read object offset 0..100, use provided
buffer for 100..4096, overwrite block'.  That can be pipelined.
But there are other operations that would require we flush the wal
events, like a truncate or zero or other write that clobbers that region
of the object.
Maybe/hopefully in those cases we don't care (it no longer matters
that this wal event do the write we originally intended) but we'd
need to think pretty carefully about it.  FWIW, truncate already
does an o->flush().
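
For illustration, the second path's wal record might carry an
onode-relative read-side description along these lines (a sketch;
field names are hypothetical):

  #include <cstdint>
  #include <vector>

  // "read object offset 0..100, use provided buffer for 100..4096,
  // overwrite block", expressed against the current onode state
  struct wal_logical_rmw_sketch_t {
    uint64_t read_off;      // start of the read from the object, e.g. 0
    uint64_t read_len;      // how much existing data to read, e.g. 100
    uint64_t buf_off;       // where the provided data lands, e.g. 100
    std::vector<char> buf;  // the provided buffer (100..4096)
    uint64_t dst_off;       // block-aligned destination of the overwrite
  };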
I'd prefer the second approach, probably with some modification...
As far as I understand, with the approach above you are trying to
locate all the write logic in a single place and have the WAL
machinery be a straightforward executor of already-prepared tasks.
I'm not sure that's beneficial enough, but it's definitely more
complex and error-prone, and you will potentially need to extend the
WAL task description from time to time...
As an alternative, one can eliminate that read description from the
WAL record altogether: simply record what loffset we are going to
write to, and the data itself. Thus we have a simple write request
description, and when the WAL is applied the corresponding code
determines how to do the write properly using the current
lextent/blob map state. This way, applying a write op can be just
regular write handling that performs a sync RMW or any other
implementation, depending on the current state, some policy, or
whatever fits best at that specific moment.
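
Roughly, with illustrative names, such a record reduces to:

  #include <cstdint>
  #include <vector>

  // Minimal WAL record for this alternative: just the logical target
  // and the data; all read/modify/write planning happens at apply time
  // from the then-current lextent/blob maps.
  struct wal_write_sketch_t {
    uint64_t loffset;        // logical offset within the object
    std::vector<char> data;  // payload to write
  };

  // Applying it is then just a regular synchronous write, e.g. via a
  // hypothetical force_sync mode of _do_write() (argument names here
  // are made up):
  //   _do_write(txc, onode, op.loffset, op.data.size(), bl, FORCE_SYNC);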
I like that way better!  We can just add a force_sync argument to
_do_write.  That also lets us trivially disable wal (by forcing sync
w/ a config option or whatever).

The downside is that any logically conflicting request (an overlapping
write or truncate or zero) needs to drain the wal events, whereas with
a lower-level wal description there might be cases where we can ignore
the wal operation.  I suspect the trivial solution of o->flush() on
write/truncate/zero will be pretty visible in benchmarks.  Tracking
in-flight wal ops with an interval_set would probably work well enough.
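
A sketch of that tracking idea (assuming Ceph's interval_set<uint64_t>
from include/interval_set.h; the struct and method names here are made
up):

  #include "include/interval_set.h"

  struct wal_inflight_tracker_t {
    interval_set<uint64_t> inflight;  // logical extents of queued wal ops

    // callers flush on conflict before queueing, so ranges never overlap
    void queued(uint64_t off, uint64_t len)   { inflight.insert(off, len); }
    void finished(uint64_t off, uint64_t len) { inflight.erase(off, len); }

    // write/truncate/zero only need o->flush() when this returns true
    bool must_flush(uint64_t off, uint64_t len) const {
      return inflight.intersects(off, len);
    }
  };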
Hmm, I'm not sure this will pan out.  The main problem is that if we call back
into the write code (with a sync flag), we will have to do write IO, and this
wreaks havoc on our otherwise (mostly) orderly state machine.
I think it can be done if we build in a guard similar to _txc_finish_io, so
that we wait for the wal events to also complete their IO in order before
committing them.  I think.

But the other problem is the checksum thing that came up in another thread,
where the read-side of a read/modify/write might fail the checksum because
the wal write hit disk but the kv portion didn't commit.  I see a few options:

  1) If there are checksums and we're doing a sub-block overwrite, we have to
write/cow it elsewhere.  This probably means min_alloc_size cow operations
for small writes.  In which case we needn't bother doing a wal event in the
first place--the whole point is to enable an overwrite.

  2) We do loose checksum validation that will accept either the old checksum
or the expected new checksum for the read stage.  This handles these two
crash cases:

  * kv commit of op + wal event
    <crash here, or>
  * do wal io (completely)
    <crash before cleaning up wal event>
  * kv cleanup of wal event

but not the case where we only partially complete the wal io.  Which means
there is a small probability we "corrupt" ourselves on crash (not really
corrupt, but confuse ourselves such that we refuse to replay the wal events
on startup).
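
A sketch of that loose check (the helper name is made up):

  #include <cstdint>

  // On wal replay, accept the pre-write checksum (the wal io never hit
  // this block) or the post-write checksum (the wal io finished but the
  // kv cleanup was lost).  A partially-torn block matches neither.
  inline bool loose_csum_ok(uint32_t actual, uint32_t old_csum,
                            uint32_t new_csum) {
    return actual == old_csum || actual == new_csum;
  }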

  3) Same as 2, but simply warn if we fail that read-side checksum on replay.
This basically introduces a *very* small window which could allow an ondisk
corruption to get absorbed into our checksum.  This could just be #2 + a
config option so we warn instead of erroring out.

  4) Same as 2, but we try every combination of old and new data on
block/sector boundaries to find a valid checksum on the read-side.
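
For illustration, the read-side search in #4 could look like this
(names and the csum callback are hypothetical):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  using csum_fn = uint32_t (*)(const char*, size_t);

  // If the wal io was torn, a checksum block may hold any per-sector
  // mix of old and new data.  Try each mix until one matches the
  // stored checksum; for a 4K block of 512-byte sectors that is at
  // most 2^8 = 256 combinations.
  std::vector<char> find_valid_mix(const std::vector<char>& old_blk,
                                   const std::vector<char>& new_blk,
                                   size_t sector_size, uint32_t expected,
                                   csum_fn csum) {
    size_t n = old_blk.size() / sector_size;   // sectors per csum block
    for (uint64_t mask = 0; mask < (1ull << n); ++mask) {
      std::vector<char> blk(old_blk.size());
      for (size_t i = 0; i < n; ++i) {
        const auto& src = (mask & (1ull << i)) ? new_blk : old_blk;
        std::copy(src.begin() + i * sector_size,
                  src.begin() + (i + 1) * sector_size,
                  blk.begin() + i * sector_size);
      }
      if (csum(blk.data(), blk.size()) == expected)
        return blk;   // a plausible on-disk state; replay can proceed
    }
    return {};  // nothing matches: treat as real corruption
  }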

I think #1 is a non-starter because it turns a 4K write into a 64K read + seek +
64K write on an HDD.  Or forces us to run with min_alloc_size=4K on HDD,
which would risk very bad fragmentation.

Which makes me want #3 (initially) and then #4.  But... if we do the "wal is
just a logical write" approach, that means this weird replay handling logic
creeps into the normal write path.

I'm currently leaning toward keeping the wal events special (lower-level), but
doing what we can to make it work with the same mid- to low-level helper
functions (for reading and verifying blobs, etc.).

sage



