Re: 2 related bluestore questions

Igor Fedotov <ifedotov@xxxxxxxxxxxx> · Thu, 12 May 2016 17:47:44 +0300

On 12.05.2016 14:54, Allen Samuels wrote:
Ok, I hope I'm starting to understand.

The real source of the problem is the desire to "write in place". The kv wal item (intention) and its complexity stem from the need to recognize that a chunk may have an incorrect checksum for a transient period of time because of an in-place update operation.
+1

One simple solution to this is to eliminate write-in-place. You're already reading a bunch of data, computing the new checksums and then writing out the correct data. If you cow the write, then all of the torn page stuff goes away and you cut down the number of kv updates significantly (no intention is required, only a single kv update is required for the entire cow process). Then the only thing that's required to make this perform more or less equivalent to the complicated logic that you're dealing with now is to make sure that the COW destination location is "close" to the original source data location. Basically, you'll want to modify the allocation algorithm so as to leave some "space" in each cylinder so that the COW destination can always be close to the source (you only need a few per cylinder since each COW frees up another one). This is a tiny bit ugly, but much much simpler than all of the machinations that you're going through now.
Why do we need a close COW destination? To reduce drive head
repositioning, or what?
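
For reference, the "close destination" idea can be sketched roughly as follows. This assumes a hypothetical allocator that keeps a sorted list of free block numbers and hands out the one nearest to the source block; `NearestFreeAllocator`, `alloc_near` and `release` are made-up names for illustration, not BlueStore's allocator API.

```python
import bisect

class NearestFreeAllocator:
    """Hypothetical allocator: hands out the free block closest to a hint."""

    def __init__(self, free_blocks):
        self.free = sorted(free_blocks)

    def alloc_near(self, src):
        """Remove and return the free block nearest to src."""
        if not self.free:
            raise RuntimeError("no free blocks")
        i = bisect.bisect_left(self.free, src)
        # Only the neighbours on either side of the hint can be nearest.
        candidates = self.free[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda b: abs(b - src))
        self.free.remove(best)
        return best

    def release(self, block):
        """Return a block to the free list (each COW frees its source)."""
        bisect.insort(self.free, block)
```

On an HDD the distance between block numbers is only a rough proxy for seek cost, so this is a simplification of the per-cylinder reservation described above.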

I would point out that one important optimization is to do a good job of handling append log files. These will look like multiple-write-in-places that will have to get merged. I believe that this optimization is particularly easy to handle with the COW approach that I described.
Well, this starts to remind me of the original proposal I made for the
WAL/GC implementation in ExtentManager ;)
- just do every write to a new location and update the lextent map, then
perform lazy/deferred garbage collection/defragmentation.
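
A minimal sketch of that idea, assuming a toy in-memory model: every write goes to a fresh physical block, a single map update repoints the logical block, and the old copy is only queued for later collection. `CowStore`, `lextent_map` and `gc` are illustrative names here, not the ExtentManager interface.

```python
class CowStore:
    """Toy model: write to a new location, update the map, collect garbage later."""

    def __init__(self, nblocks):
        self.disk = [None] * nblocks          # physical blocks
        self.free = list(range(nblocks))      # unallocated physical blocks
        self.lextent_map = {}                 # logical block -> physical block
        self.garbage = []                     # stale blocks awaiting GC

    def write(self, lblock, data):
        pblock = self.free.pop(0)
        self.disk[pblock] = data              # 1. data lands in a new location
        old = self.lextent_map.get(lblock)
        self.lextent_map[lblock] = pblock     # 2. single metadata update
        if old is not None:
            self.garbage.append(old)          # 3. old copy is reclaimed lazily

    def read(self, lblock):
        return self.disk[self.lextent_map[lblock]]

    def gc(self):
        """Deferred garbage collection: return stale blocks to the free list."""
        reclaimed = len(self.garbage)
        for pblock in self.garbage:
            self.disk[pblock] = None
            self.free.append(pblock)
        self.garbage.clear()
        return reclaimed
```

Until `gc()` runs, the old data is still intact on disk, which is why no intention record is needed in this model: a crash before the map update simply leaves the old mapping in place.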

I would point out that nothing prevents an addition of a non-COW update-in-place at a future time.

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
Sent: Thursday, May 12, 2016 12:58 PM
To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
Cc: Igor Fedotov <ifedotov@xxxxxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
Subject: RE: 2 related bluestore questions

On Wed, 11 May 2016, Allen Samuels wrote:
Sorry, still on vacation and I haven't really wrapped my head around
everything that's being discussed. However, w.r.t. wal operations, I
would strongly favor an approach that minimizes the amount of "future"
operations that are recorded (which I'll call intentions -- i.e.,
non-binding hints about extra work that needs to get done). Much of
the complexity here is because the intentions -- after being recorded
-- will need to be altered based on subsequent operations. Hence every
write operation will need to digest the historical intentions and
potentially update them -- this is VERY complex, potentially much more
complex than code that simply examines the current state and
re-determines the correct next operation (i.e., de-wal, gc, etc.)

Additional complexity arises because you're recording two sets of
state that require consistency checking -- in my experience, this road
leads to perdition....
I agree it has to be something manageable that we can reason about.  I think
the question for me is mostly about which path minimizes the complexity
while still getting us a reasonable level of performance.

I had one new thought, see below...

The downside is that any logically conflicting request (an
overlapping write or truncate or zero) needs to drain the wal
events, whereas with a lower-level wal description there might be
cases where we can ignore the wal operation.  I suspect the
trivial solution of o->flush() on write/truncate/zero will be
pretty visible in benchmarks.  Tracking in-flight wal ops with an
interval_set would probably work well enough.
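
A rough sketch of that tracking, assuming a plain list of byte ranges as a stand-in for Ceph's interval_set; `InflightWal` and its methods are hypothetical names used only for illustration.

```python
class InflightWal:
    """Tracks byte ranges with pending wal ops for one object."""

    def __init__(self):
        self.ranges = []  # (offset, length) pairs with wal io outstanding

    def add(self, offset, length):
        self.ranges.append((offset, length))

    def complete(self, offset, length):
        self.ranges.remove((offset, length))

    def conflicts(self, offset, length):
        """True if [offset, offset+length) overlaps any in-flight wal range."""
        end = offset + length
        return any(o < end and offset < o + l for o, l in self.ranges)
```

With something like this, a write/truncate/zero would only need to drain the wal events when `conflicts()` returns True, instead of calling o->flush() unconditionally.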
Hmm, I'm not sure this will pan out.  The main problem is that if we
call back into the write code (with a sync flag), we will have to do
write IO, and this wreaks havoc on our otherwise (mostly) orderly state
machine.
I think it can be done if we build in a similar guard like
_txc_finish_io so that we wait for the wal events to also complete
IO in order before committing them.  I think.

But the other problem is the checksum thing that came up in another
thread, where the read-side of a read/modify/write might fail the
checksum because the wal write hit disk but the kv portion didn't commit.
I see a few options:
  1) If there are checksums and we're doing a sub-block overwrite, we
have to write/cow it elsewhere.  This probably means min_alloc_size
cow operations for small writes.  In which case we needn't bother
doing a wal even in the first place--the whole point is to enable an
overwrite.
  2) We do loose checksum validation that will accept either the old
checksum or the expected new checksum for the read stage.  This
handles these two crash cases:

  * kv commit of op + wal event
    <crash here, or>
  * do wal io (completely)
    <crash before cleaning up wal event>
  * kv cleanup of wal event

but not the case where we only partially complete the wal io.  Which
means there is a small probability we "corrupt" ourselves on crash
(not really corrupt, but confuse ourselves such that we refuse to
replay the wal events on startup).

  3) Same as 2, but simply warn if we fail that read-side checksum on
replay.
This basically introduces a *very* small window which could allow an
ondisk corruption to get absorbed into our checksum.  This could
just be #2 + a config option so we warn instead of erroring out.

  4) Same as 2, but we try every combination of old and new data on
block/sector boundaries to find a valid checksum on the read-side.
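
To make options 2 and 3 concrete, here is a rough sketch assuming CRC32 (via Python's zlib.crc32) as the checksum; the checksum BlueStore actually uses may differ, and `loose_verify` and its `warn_only` flag are hypothetical names. It does not attempt option 4.

```python
import zlib

def loose_verify(data, old_csum, new_csum, warn_only=False):
    """Accept data whose CRC32 matches either the old or the expected new checksum.

    Returns "old" or "new" on a match. On a mismatch it raises (option 2),
    or returns "warn" when warn_only is set (option 3).
    """
    actual = zlib.crc32(data)
    if actual == old_csum:
        return "old"    # wal io has not touched the block yet
    if actual == new_csum:
        return "new"    # wal io completed, cleanup did not
    if warn_only:
        return "warn"   # torn write or real corruption; cannot tell which
    raise IOError("checksum mismatch on wal replay read")
```

A partially completed wal write matches neither checksum, which is exactly the case where option 2 refuses to replay and option 3 only warns.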

I think #1 is a non-starter because it turns a 4K write into a 64K
read + seek + 64K write on an HDD.  Or forces us to run with
min_alloc_size=4K on HDD, which would risk very bad fragmentation.

Which makes me want #3 (initially) and then #4.  But... if we do the
"wal is just a logical write", that means this weird replay handling
logic creeps into the normal write path.

I'm currently leaning toward keeping the wal events special
(lower-level), but doing what we can to make it work with the same
mid- to low-level helper functions (for reading and verifying blobs, etc.).
It occurred to me that this checksum consistency issue only comes up when
we are updating something that is smaller than the csum block size.  And the
real source of the problem is that you have a sequence of

  1- journal intent (kv wal item)
  2- do read io
  3- verify csum
  4- do write io
  5- cancel intent (remove kv wal item)

If we have an order like

  1- do read io
  2- journal intent for entire csum chunk (kv wal item)
  3- do write io
  4- cancel intent

Then the issue goes away.  And I'm thinking if the csum chunk is big enough
that the #2 step is too big of a wal item to perform well, then the problem is
your choice of csum block size, not the approach.  I.e., use a 4kb csum block
size for rbd images, and use large blocks (128k, 512k,
whatever) only for things that never see random overwrites (rgw data).
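
A minimal sketch of the second ordering, assuming a dict standing in for the kv store and a bytearray for the block device; `overwrite`, `replay` and `CSUM_BLOCK` are hypothetical names. Because the journaled item carries the whole csum chunk, replay is a blind overwrite and never has to read or verify what is on disk.

```python
CSUM_BLOCK = 4096  # assumed csum block size

def overwrite(dev, kv, offset, data):
    """Sub-chunk overwrite; assumes the write stays within one csum chunk."""
    start = offset - offset % CSUM_BLOCK
    rel = offset - start
    chunk = bytearray(dev[start:start + CSUM_BLOCK])   # 1- do read io
    chunk[rel:rel + len(data)] = data                  # merge new data in memory
    kv[("wal", start)] = bytes(chunk)                  # 2- journal intent (whole chunk)
    dev[start:start + CSUM_BLOCK] = chunk              # 3- do write io
    del kv[("wal", start)]                             # 4- cancel intent

def replay(dev, kv):
    """After a crash, re-apply surviving intents; no read or csum check needed."""
    for key in [k for k in kv if k[0] == "wal"]:
        start = key[1]
        dev[start:start + CSUM_BLOCK] = kv.pop(key)
```

The cost is that every small overwrite journals CSUM_BLOCK bytes, which is why the csum block size matters so much in this scheme.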

If that is good enough, then it might also mean that we can make the wal
operations never do reads--just (over)writes, further simplifying things on
that end.  In the jewel bluestore the only times we do reads are for partial
block updates (do we really care about these?  a buffer cache could absorb
them when it matters) and for copy/cow operations post-clone (which I think
are simple enough to be dealt with separately).

sage
