Re: 2 related bluestore questions

On 12.05.2016 5:58, Sage Weil wrote:
On Wed, 11 May 2016, Allen Samuels wrote:
Sorry, still on vacation and I haven't really wrapped my head around
everything that's being discussed. However, w.r.t. wal operations, I
would strongly favor an approach that minimizes the amount of "future"
operations that are recorded (which I'll call intentions -- i.e.,
non-binding hints about extra work that needs to get done). Much of the
complexity here is because the intentions -- after being recorded --
will need to be altered based on subsequent operations. Hence every
write operation will need to digest the historical intentions and
potentially update them -- this is VERY complex, potentially much more
complex than code that simply examines the current state and
re-determines the correct next operation (i.e., de-wal, gc, etc.)

Additional complexity arises because you're recording two sets of state
that require consistency checking -- in my experience, this road leads
to perdition....
I agree it has to be something manageable that we can reason about.  I
think the question for me is mostly about which path minimizes the
complexity while still getting us a reasonable level of performance.

I had one new thought, see below...

The downside is that any logically conflicting request (an overlapping
write or truncate or zero) needs to drain the wal events, whereas with
a lower-level wal description there might be cases where we can ignore
the wal operation.  I suspect the trivial solution of o->flush() on
write/truncate/zero will be pretty visible in benchmarks.  Tracking
in-flight wal ops with an interval_set would probably work well enough.
Hmm, I'm not sure this will pan out.  The main problem is that if we call back
into the write code (with a sync flag), we will have to do write IO, and this
wreaks havoc on our otherwise (mostly) orderly state machine.
I think it can be done if we build in a guard similar to _txc_finish_io so that
we wait for the wal events to also complete IO in order before committing
them.  I think.
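
For illustration, the in-flight tracking could be as simple as the sketch below; the names are hypothetical and a std::map stands in for Ceph's interval_set<uint64_t>, it is not actual BlueStore code:

  // Track extents that still have wal ops in flight so that only a
  // logically conflicting write/truncate/zero has to drain, instead of
  // an unconditional o->flush().  Hypothetical sketch.
  #include <cstdint>
  #include <map>

  struct InflightWal {
    std::map<uint64_t, uint64_t> extents;   // offset -> length

    void start(uint64_t off, uint64_t len)  { extents[off] = len; }
    void finish(uint64_t off)               { extents.erase(off); }

    // does [off, off+len) overlap any extent with wal io still pending?
    bool conflicts(uint64_t off, uint64_t len) const {
      for (auto& e : extents)
        if (off < e.first + e.second && e.first < off + len)
          return true;
      return false;
    }
  };

  // write/truncate/zero path (pseudocode):
  //   if (inflight.conflicts(offset, length))
  //     wait for those wal events to commit;   // instead of a full flush
  //   else
  //     proceed without draining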

But the other problem is the checksum thing that came up in another thread,
where the read-side of a read/modify/write might fail the checksum because
the wal write hit disk but the kv portion didn't commit. I see a few options:

  1) If there are checksums and we're doing a sub-block overwrite, we have to
write/cow it elsewhere.  This probably means min_alloc_size cow operations
for small writes.  In which case we needn't bother doing a wal even in the
first place--the whole point is to enable an overwrite.

  2) We do loose checksum validation that will accept either the old checksum
or the expected new checksum for the read stage.  This handles these two
crash cases:

  * kv commit of op + wal event
    <crash here, or>
  * do wal io (completely)
    <crash before cleaning up wal event>
  * kv cleanup of wal event

but not the case where we only partially complete the wal io.  Which means
there is a small probability we "corrupt" ourselves on crash (not really corrupt,
but confuse ourselves such that we refuse to replay the wal events on
startup).

  3) Same as 2, but simply warn if we fail that read-side checksum on replay.
This basically introduces a *very* small window which could allow an ondisk
corruption to get absorbed into our checksum.  This could just be #2 + a
config option so we warn instead of erroring out.

  4) Same as 2, but we try every combination of old and new data on
block/sector boundaries to find a valid checksum on the read-side.
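
For what it's worth, the read-side search in #4 could be brute force along these lines (purely a sketch; csum() here is a toy stand-in for whatever checksum the blob actually uses, e.g. crc32c):

  // Option 4 sketch: on replay, accept the chunk if *some* mix of old
  // and new sectors matches the expected checksum.  2^nsectors checksum
  // computations, but this only ever runs in the crash-replay path.
  #include <cstdint>
  #include <cstring>
  #include <vector>

  // toy stand-in for the blob's real checksum function
  static uint32_t csum(const std::vector<uint8_t>& buf) {
    uint32_t c = 0;
    for (uint8_t b : buf) c = c * 131 + b;
    return c;
  }

  static bool find_valid_mix(const std::vector<uint8_t>& old_chunk,
                             const std::vector<uint8_t>& new_chunk,
                             size_t sector_size, uint32_t expected)
  {
    size_t n = old_chunk.size() / sector_size;
    std::vector<uint8_t> mix(old_chunk.size());
    for (uint64_t mask = 0; mask < (1ull << n); ++mask) {
      for (size_t i = 0; i < n; ++i) {
        const auto& src = (mask & (1ull << i)) ? new_chunk : old_chunk;
        std::memcpy(&mix[i * sector_size], &src[i * sector_size], sector_size);
      }
      if (csum(mix) == expected)
        return true;   // a partially applied wal write; safe to replay
    }
    return false;      // genuinely bad data (or fall back to #3 and warn)
  }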

I think #1 is a non-starter because it turns a 4K write into a 64K read + seek +
64K write on an HDD.  Or forces us to run with min_alloc_size=4K on HDD,
which would risk very bad fragmentation.

Which makes me want #3 (initially) and then #4.  But... if we do the "wal is
just a logical write", that means this weird replay handling logic creeps into
the normal write path.

I'm currently leaning toward keeping the wal events special (lower-level), but
doing what we can to make it work with the same mid- to low-level helper
functions (for reading and verifying blobs, etc.).
It occurred to me that this checksum consistency issue only comes up when
we are updating something that is smaller than the csum block size.  And
the real source of the problem is that you have a sequence of

  1- journal intent (kv wal item)
  2- do read io
  3- verify csum
  4- do write io
  5- cancel intent (remove kv wal item)

If we have an order like

  1- do read io
  2- journal intent for entire csum chunk (kv wal item)
  3- do write io
  4- cancel intent
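
In rough, hypothetical C++ (invented helper names, not the actual BlueStore functions), that reordering would look something like:

  // Reordered flow: the wal item carries the *entire* csum chunk with
  // the new bytes already merged in, so replay is a blind overwrite and
  // never needs to read-verify anything.  All names are invented.
  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct WalItem {
    uint64_t offset;               // csum-chunk-aligned disk offset
    std::vector<uint8_t> data;     // full csum chunk to (over)write
  };

  std::vector<uint8_t> read_chunk(uint64_t off, size_t len);  // 1- do read io
  void journal_intent(const WalItem& item);                   // 2- kv wal item
  void write_chunk(const WalItem& item);                      // 3- do write io
  void cancel_intent(const WalItem& item);                    // 4- remove wal item

  void small_overwrite(uint64_t chunk_off, size_t chunk_len,
                       uint64_t off, const std::vector<uint8_t>& newdata)
  {
    WalItem item{chunk_off, read_chunk(chunk_off, chunk_len)};
    std::copy(newdata.begin(), newdata.end(),
              item.data.begin() + (off - chunk_off));   // merge new bytes
    journal_intent(item);   // commits with the rest of the transaction
    write_chunk(item);      // after a crash, replay just redoes this write
    cancel_intent(item);
  }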

I suspect this will cause consistency issues when handling multiple writes to the same extent if a subsequent write doesn't wait for the WAL apply to complete.
E.g. we have block <1,2,3> and two writes <4,,> & <,,5>.
In your case the second WAL item will contain the <1,2,5> block instead of <4,2,5>.
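
A toy sequential version of that scenario, just to make the hazard concrete:

  // Both writers snapshot the chunk before either wal item is applied,
  // so the second wal item captures stale data and the first update is
  // lost.  Illustrative only.
  #include <array>
  #include <cstdio>

  int main() {
    std::array<int,3> disk = {1, 2, 3};

    std::array<int,3> snapA = disk;  snapA[0] = 4;   // write <4,,>
    std::array<int,3> snapB = disk;  snapB[2] = 5;   // write <,,5>, read too early

    disk = snapA;   // apply wal item A -> {4,2,3}
    disk = snapB;   // apply wal item B -> {1,2,5}; the 4 is lost

    std::printf("%d %d %d\n", disk[0], disk[1], disk[2]);  // 1 2 5, not 4 2 5
    return 0;
  }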

And remember, you have o->flush for reading but don't have one for writing. But in your case you're introducing an o->flush for writing as well, just to perform the read...


Then the issue goes away.  And I'm thinking if the csum chunk is big
enough that the #2 step is too big of a wal item to perform well, then the
problem is your choice of csum block size, not the approach.  I.e., use a
4kb csum block size for rbd images, and use large blocks (128k, 512k,
whatever) only for things that never see random overwrites (rgw data).

If that is good enough, then it might also mean that we can make the wal
operations never do reads--just (over)writes, further simplifying things
on that end.  In the jewel bluestore the only times we do reads are for
partial block updates (do we really care about these?  a buffer cache
could absorb them when it matters) and for copy/cow operations post-clone
(which I think are simple enough to deal with separately).

sage
