Re: 2 related bluestore questions

Igor Fedotov <ifedotov@xxxxxxxxxxxx> · Thu, 12 May 2016 17:27:12 +0300

On 11.05.2016 23:54, Sage Weil wrote:
On Wed, 11 May 2016, Sage Weil wrote:
On Wed, 11 May 2016, Igor Fedotov wrote:

I like that way better!  We can just add a force_sync argument to
_do_write.  That also lets us trivially disable wal (by forcing sync w/
a config option or whatever).

The downside is that any logically conflicting request (an overlapping
write or truncate or zero) needs to drain the wal events, whereas with a
lower-level wal description there might be cases where we can ignore the
wal operation.  I suspect the trivial solution of o->flush() on
write/truncate/zero will be pretty visible in benchmarks.  Tracking
in-flight wal ops with an interval_set would probably work well enough.
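
To make the interval_set idea concrete, here is a minimal sketch of tracking in-flight wal extents so that only an overlapping write/truncate/zero has to drain. Everything in it is hypothetical: the InflightExtents class and needs_flush() are illustrative names, not bluestore code, and the class is a toy stand-in for a real interval_set (it does not merge adjacent extents).

```python
class InflightExtents:
    """Toy stand-in for an interval_set of in-flight wal extents (hypothetical)."""

    def __init__(self):
        self._extents = []  # list of (offset, length) tuples

    def insert(self, offset, length):
        # Record an extent when a wal event is queued.
        self._extents.append((offset, length))

    def erase(self, offset, length):
        # Drop the extent once the wal io has completed and been cleaned up.
        self._extents.remove((offset, length))

    def intersects(self, offset, length):
        # Half-open ranges [offset, offset + length) overlap test.
        end = offset + length
        return any(o < end and offset < o + l for o, l in self._extents)


def needs_flush(inflight, offset, length):
    """Return True only if the new request overlaps an in-flight wal extent."""
    return inflight.intersects(offset, length)


inflight = InflightExtents()
inflight.insert(0, 4096)                          # a wal overwrite is pending on [0, 4096)
overlapping = needs_flush(inflight, 1024, 512)    # conflicts, must drain first
disjoint = needs_flush(inflight, 4096, 4096)      # adjacent but not overlapping
inflight.erase(0, 4096)
after_cleanup = needs_flush(inflight, 1024, 512)  # nothing in flight any more
```

With something like this, the unconditional o->flush() would only be needed when needs_flush() returns True.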
Hmm, I'm not sure this will pan out.  The main problem is that if we call
back into the write code (with a sync flag), we will have to do write
IO, and this wreaks havoc on our otherwise (mostly) orderly state machine.
I think it can be done if we build in a similar guard like _txc_finish_io
so that we wait for the wal events to also complete IO in order before
committing them.  I think.

But the other problem is the checksum thing that came up in another
thread, where the read-side of a read/modify/write might fail the checksum
because the wal write hit disk but the kv portion didn't commit.  I see a
few options:

  1) If there are checksums and we're doing a sub-block overwrite, we
have to write/cow it elsewhere.  This probably means min_alloc_size cow
operations for small writes.  In which case we needn't bother doing a wal
even in the first place--the whole point is to enable an overwrite.

  2) We do loose checksum validation that will accept either the old
checksum or the expected new checksum for the read stage.  This handles
these two crash cases:
Probably I missed something, but it seems to me that we don't have any
'expected new checksum' for the whole new block after the crash.
What we can have is the old block checksum in KV and a checksum for the
overwritten portion of the block in the WAL. To have the full new checksum,
one has to do the read and store the new checksum to KV afterwards.
Or do you mean write + KV update under 'do wal io'?
  * kv commit of op + wal event
    <crash here, or>
  * do wal io (completely)
    <crash before cleaning up wal event>
  * kv cleanup of wal event

but not the case where we only partially complete the wal io.  Which means
there is a small probability that we "corrupt" ourselves on crash (not
really corrupt, but confuse ourselves such that we refuse to replay the
wal events on startup).
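
As a rough illustration of the loose validation in option 2, the sketch below accepts a block whose checksum matches either the old or the expected new value. It assumes the expected new checksum is available at replay time, which is exactly the point questioned in the inline reply above; loose_verify() and the sector contents are made up for the example, and crc32 stands in for whatever checksum is actually used.

```python
import zlib

SECTOR = 512


def loose_verify(block, old_csum, new_csum):
    """Accept the block if it matches the old OR the expected new checksum."""
    actual = zlib.crc32(block)
    return actual == old_csum or actual == new_csum


# Hypothetical two-sector block that the wal event overwrites completely.
old_block = b"A" * SECTOR + b"B" * SECTOR
new_block = b"C" * SECTOR + b"D" * SECTOR
old_csum = zlib.crc32(old_block)
new_csum = zlib.crc32(new_block)  # assumption: known before replay

# Crash before any wal io: the disk still holds the old block.
before_io = loose_verify(old_block, old_csum, new_csum)

# Crash after the wal io completed: the disk holds the new block.
after_io = loose_verify(new_block, old_csum, new_csum)

# Crash mid-io: first sector new, second sector still old.
torn_block = new_block[:SECTOR] + old_block[SECTOR:]
torn = loose_verify(torn_block, old_csum, new_csum)
```

The torn case is the one that fails the check, which corresponds to the partially completed wal io described above.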

  3) Same as 2, but simply warn if we fail that read-side checksum on
replay.  This basically introduces a *very* small window which could allow
an ondisk corruption to get absorbed into our checksum.  This could just
be #2 + a config option so we warn instead of erroring out.

  4) Same as 2, but we try every combination of old and new data on
block/sector boundaries to find a valid checksum on the read-side.
It is still unclear to me where we can get the old data from when we've
just overwritten it.
E.g. the old block was
<1,2,3> and the new data is <4>, with the resulting block = <4,2,3>.
We have checksums for <1,2,3> and for <4> in KV, and the <4,2,3> block on
disk. How can one detect an error in an invalid <4,5,3> block unless
we store a checksum for <4,2,3> before the write?
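
The <1,2,3> / <4> example can be played through in a few lines. This is only a sketch under assumptions: each number is modelled as one 512-byte sector, crc32 stands in for the real checksum, and the helper names are made up.

```python
import zlib

SECTOR = 512


def sector(n):
    # One sector filled with the byte value n (illustrative content only).
    return bytes([n]) * SECTOR


old_block = sector(1) + sector(2) + sector(3)   # <1,2,3>
new_data = sector(4)                            # <4>, overwrites the first sector

# What we are assumed to have in KV/WAL after the crash.
csum_old_block = zlib.crc32(old_block)
csum_new_data = zlib.crc32(new_data)


def verify_after_overwrite(block_on_disk):
    """Check the block using only the two checksums listed above."""
    head_ok = zlib.crc32(block_on_disk[:SECTOR]) == csum_new_data
    whole_ok = zlib.crc32(block_on_disk) == csum_old_block
    return head_ok, whole_ok


good = sector(4) + sector(2) + sector(3)   # <4,2,3>, the intended result
bad = sector(4) + sector(5) + sector(3)    # <4,5,3>, second sector corrupted

good_result = verify_after_overwrite(good)
bad_result = verify_after_overwrite(bad)
indistinguishable = good_result == bad_result

# Storing a checksum for the merged block before the write is what would
# tell the two apart.
csum_merged = zlib.crc32(good)
detected_with_merged = zlib.crc32(bad) != csum_merged
```

Both blocks pass the check on the overwritten sector and fail the old whole-block checksum, so with only those two checksums the corrupted <4,5,3> looks the same as the valid <4,2,3>.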

I think #1 is a non-starter because it turns a 4K write into a 64K read +
seek + 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on
HDD, which would risk very bad fragmentation.

Which makes me want #3 (initially) and then #4.  But... if we do the "wal
is just a logical write", that means this weird replay handling logic
creeps into the normal write path.

I'm currently leaning toward keeping the wal events special (lower-level),
but doing what we can to make it work with the same mid- to low-level
helper functions (for reading and verifying blobs, etc.).

sage
