Re: 2 related bluestore questions

On 12.05.2016 20:09, Sage Weil wrote:
On Thu, 12 May 2016, Igor Fedotov wrote:
Well, it goes to the new space and updates all the maps.

Then the WAL comes into action - where will it write? To the new
location? And overwrite the new data?
To the old location.  The modes I describe in the doc are all in terms
of pextents (with the possible exception of E+F) for this reason...
deferred IO, not deferred logical operations.
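
In code, that distinction might look roughly like the following
(hypothetical stand-in structs, not the real BlueStore wal types):

    #include <cstdint>
    #include <vector>

    struct WalDeferredIO {        // what the doc's modes describe
      uint64_t pextent_offset;    // physical target, fixed at queue time
      std::vector<uint8_t> data;  // exact bytes to land there
    };

    struct WalLogicalOp {         // what they deliberately are *not*
      uint64_t object_offset;     // logical target, re-resolved at apply
      std::vector<uint8_t> data;  // time, so an intervening remap or
    };                            // reallocation could redirect the write
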
Sounds almost good.
Sorry for the obtrusiveness, but IMHO there is still a minor chance that
the destination pextent (released by the second write bypassing the WAL)
gets reallocated to another object, and thus the deferred write might
destroy data there.
	https://github.com/liewegas/ceph/commit/c7cb76889669bf2c1abffd69f05d1c9e15c41e3c#commitcomment-17453409

For E+F, the source is always immutable (compressed blob or clone).  To
avoid this sort of race on the destination... I'm not sure.  I'm sort of
wondering, though, if we should even bother with the 'cow' part.  We used
to have to do this because we didn't have a lextent mapping.  Now, if we
have a small overwrite of a cloned/shared lextent/blob, we can (as
sketched below):

  - allocate a new min_alloc_size blob
  - write the new data into the relevant block in that blob
  - update lextent_t map *only for those bytes*
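
In code, that path might look roughly like the following self-contained
sketch (every type here is a hypothetical stand-in for the real
lextent/blob structures; the data write itself and the splitting of
overlapping map entries are elided):

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <memory>

    constexpr uint64_t MIN_ALLOC_SIZE = 64 * 1024;

    struct Blob {                   // stand-in blob record
      uint64_t disk_offset = 0;     // pextent start on disk
      uint64_t length = MIN_ALLOC_SIZE;
      bool shared = false;          // cloned/shared blobs stay immutable
    };

    struct LExtent {                // one lextent map entry
      std::shared_ptr<Blob> blob;
      uint64_t blob_offset;         // offset of these bytes in the blob
      uint64_t length;
    };

    using LExtentMap = std::map<uint64_t, LExtent>;  // logical offset key

    // Small overwrite of a shared blob: no cow and no wal read.  Allocate
    // a fresh min_alloc_size blob, write only the new bytes, and remap
    // just those bytes.  (A real version must also split/trim any
    // overlapping entries in the map.)
    void small_overwrite(LExtentMap& lmap, uint64_t off, uint64_t len,
                         const std::function<uint64_t(uint64_t)>& alloc) {
      auto blob = std::make_shared<Blob>();
      blob->disk_offset = alloc(MIN_ALLOC_SIZE);

      // Keep the data at its natural position inside the new blob so that
      // later adjacent overwrites land contiguously (e.g. object offset
      // 68k with 64k blobs -> blob offset 4k).
      uint64_t blob_off = off % MIN_ALLOC_SIZE;
      // ... issue the data write to blob->disk_offset + blob_off ...

      // Update the lextent map for *only* the overwritten bytes; all other
      // bytes keep referencing the old shared blob.
      lmap[off] = LExtent{blob, blob_off, len};
    }
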
That's cool!
The only (minor?) drawback I can see is potential read inefficiency when
csum is enabled: one has to read both overlapping blobs to verify
checksums when a read spans the interval.
There's no write- or read-amp that way, and if we later get more random
overwrites nearby they can just fill in the other unused parts of the
blob, and eventually the lextent mapping will merge/simplify to
reference the whole thing.  (I'm assuming that if we wrote at, say,
object offset 68k and min_alloc_size is 64k, we'd write at offset 4k in
the new 64k blob, so that later when adjacent blocks get filled in it
would be contiguous.)
Anyway, that would be *no* copy/cow type wal events at all.  The only
read-like thing that would remain would be C, which is a pretty trivial
case (no csum, no comp, just a read/modify/write of a partial block.)  I
think it also means that no wal events would need to do metadata (csum)
updates after all.
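
Case C then reduces to something like this sketch (a toy in-memory
device stands in for the block layer; nothing here is the real BlueStore
code):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    constexpr uint64_t BLOCK_SIZE = 4096;
    using Device = std::vector<uint8_t>;  // toy flat device image

    // Case C: no csum, no compression -- read the containing block,
    // splice in the new bytes, write the block back.  No metadata needs
    // updating afterwards.  (Assumes the write stays inside one block.)
    void rmw_partial_block(Device& dev, uint64_t off, const uint8_t* data,
                           uint64_t len) {
      uint64_t block_off = off - (off % BLOCK_SIZE);
      std::vector<uint8_t> buf(dev.begin() + block_off,
                               dev.begin() + block_off + BLOCK_SIZE);
      std::memcpy(buf.data() + (off - block_off), data, len);
      std::memcpy(dev.data() + block_off, buf.data(), BLOCK_SIZE);
    }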

I pushed an update to that doc:

	https://github.com/liewegas/ceph/blob/76ab431ec2aed0b90f2f0354d89f4bccd23e7ae2/doc/dev/bluestore.rst

The D case may or may not be worth it.  It's nice for efficient small
overwrites of big compressed blobs.  OTOH, E accomplishes the same thing
at the expense of using a bit more disk space.  (For SSDs, E won't matter,
since min_alloc_size would be 4K anyway.)

sage




On 12.05.2016 19:48, Sage Weil wrote:
On Thu, 12 May 2016, Igor Fedotov wrote:
The second write in my example isn't processed through WAL - it's large
and overwrites the whole blob...
If it's large, it wouldn't overwrite--it would go to newly allocated
space.  We can *never* overwrite without wal or else we corrupt previous
data...
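
In other words, roughly this rule (sketch with hypothetical names; the
real choice also considers alignment and blob boundaries, not just
length):

    #include <cstdint>

    enum class WriteStrategy { WalDeferred, NewAllocation };

    // A write either goes through the wal (small, sub-alloc-unit) or to
    // freshly allocated space plus a lextent remap (large).  An in-place
    // overwrite of live data without wal is never allowed: a crash
    // mid-write would corrupt the previous contents.
    WriteStrategy choose_strategy(uint64_t len, uint64_t min_alloc_size) {
      return len >= min_alloc_size ? WriteStrategy::NewAllocation
                                   : WriteStrategy::WalDeferred;
    }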

sage


On 12.05.2016 19:43, Sage Weil wrote:
On Thu, 12 May 2016, Igor Fedotov wrote:
Yet another potential issue with WAL I can imagine:

Let's have some small write going to WAL, followed by a larger aligned
overwrite to the same extent that bypasses WAL.  Is it possible that the
first write is processed later and overwrites the second one?  I think
so.
Yeah, that would be chaos.  The wal ops are already ordered by the
sequencer (or ordered globally, if bluestore_sync_wal_apply=true), so
this can't happen.
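
The guarantee is per-sequencer FIFO ordering, along these lines
(simplified sketch, not the real Sequencer code; it assumes a single
applier thread per sequencer):

    #include <deque>
    #include <functional>
    #include <mutex>

    struct Sequencer {
      std::mutex lock;
      std::deque<std::function<void()>> wal_queue;  // submission order

      void queue_wal(std::function<void()> op) {
        std::lock_guard<std::mutex> g(lock);
        wal_queue.push_back(std::move(op));
      }

      // Always applies the oldest op first, so a later large overwrite
      // can never be passed by an earlier small wal write on the same
      // sequencer.
      void apply_next() {
        std::function<void()> op;
        {
          std::lock_guard<std::mutex> g(lock);
          if (wal_queue.empty())
            return;
          op = std::move(wal_queue.front());
          wal_queue.pop_front();
        }
        op();
      }
    };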

sage


This way we can probably come to the conclusion that all requests should
be processed in sequence.  One should prohibit multiple processing flows
for requests, as this may break their ordering.

Yeah - I'm attacking the WAL concept this way...


Thanks,
Igor

On 12.05.2016 5:58, Sage Weil wrote:
On Wed, 11 May 2016, Allen Samuels wrote:
Sorry, still on vacation and I haven't really wrapped my head
around
everything that's being discussed. However, w.r.t. wal operations,
I
would strongly favor an approach that minimizes the amount of
"future"
operations that are recorded (which I'll call intentions -- i.e.,
non-binding hints about extra work that needs to get done). Much
of
the
complexity here is because the intentions -- after being recorded
--
will need to be altered based on subsequent operations. Hence
every
write operation will need to digest the historical intentions and
potentially update them -- this is VERY complex, potentially much
more
complex than code that simply examines the current state and
re-determines the correct next operation (i.e., de-wal, gc, etc.)

Additional complexity arises because you're recording two sets of
state
that require consistency checking -- in my experience, this road
leads
to perdition....
I agree it has to be something manageable that we can reason about.  I
think the question for me is mostly about which path minimizes the
complexity while still getting us a reasonable level of performance.

I had one new thought, see below...

The downside is that any logically conflicting request (an overlapping
write or truncate or zero) needs to drain the wal events, whereas with a
lower-level wal description there might be cases where we can ignore the
wal operation.  I suspect the trivial solution of o->flush() on
write/truncate/zero will be pretty visible in benchmarks.  Tracking
in-flight wal ops with an interval_set would probably work well enough.
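
The tracking could look like this self-contained sketch (a std::map
stands in for Ceph's interval_set here):

    #include <cstdint>
    #include <map>

    // Track in-flight wal extents per object and flush only on an actual
    // overlap, instead of an unconditional o->flush() on every
    // write/truncate/zero.
    class InflightWal {
      std::map<uint64_t, uint64_t> spans_;  // start offset -> length

     public:
      void add(uint64_t off, uint64_t len) { spans_[off] = len; }
      void remove(uint64_t off) { spans_.erase(off); }

      bool overlaps(uint64_t off, uint64_t len) const {
        auto it = spans_.lower_bound(off);
        if (it != spans_.end() && it->first < off + len)
          return true;   // a span starts inside [off, off+len)
        if (it != spans_.begin()) {
          --it;          // the span starting before off...
          if (it->first + it->second > off)
            return true; // ...extends into [off, off+len)
        }
        return false;
      }
    };

    // usage: if (inflight.overlaps(off, len)) { /* drain wal events */ }
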
Hmm, I'm not sure this will pan out.  The main problem is that if we
call back into the write code (with a sync flag), we will have to do
write IO, and this wreaks havoc on our otherwise (mostly) orderly state
machine.

I think it can be done if we build in a guard similar to _txc_finish_io
so that we wait for the wal events to also complete IO in order before
committing them.  I think.
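
That guard could take roughly this shape (sketch; single-threaded for
brevity, the real thing would need locking):

    #include <deque>

    struct WalEvent {
      bool io_done = false;
      void commit() { /* kv cleanup of the wal item would go here */ }
    };

    struct WalCommitGuard {
      std::deque<WalEvent*> pending;  // submission order

      // IO completions may arrive out of order; commits still run
      // strictly from the front, so an unfinished head blocks everything
      // behind it (the same idea as the _txc_finish_io guard).
      void on_io_complete(WalEvent* ev) {
        ev->io_done = true;
        while (!pending.empty() && pending.front()->io_done) {
          pending.front()->commit();
          pending.pop_front();
        }
      }
    };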

But the other problem is the checksum thing that came up in another
thread, where the read-side of a read/modify/write might fail the
checksum because the wal write hit disk but the kv portion didn't
commit.  I see a few options:

     1) If there are checksums and we're doing a sub-block overwrite, we
have to write/cow it elsewhere.  This probably means min_alloc_size cow
operations for small writes.  In which case we needn't bother doing a
wal event in the first place--the whole point is to enable an overwrite.

     2) We do loose checksum validation that will accept either the old
checksum or the expected new checksum for the read stage.  This handles
these two crash cases:

     * kv commit of op + wal event
       <crash here, or>
     * do wal io (completely)
       <crash before cleaning up wal event>
     * kv cleanup of wal event

but not the case where we only partially complete the wal io.  Which
means there is a small probability we "corrupt" ourselves on crash (not
really corrupt, but confuse ourselves such that we refuse to replay the
wal events on startup).

     3) Same as 2, but simply warn if we fail that read-side checksum on
replay.  This basically introduces a *very* small window which could
allow an ondisk corruption to get absorbed into our checksum.  This
could just be #2 + a config option so we warn instead of erroring out.

     4) Same as 2, but we try every combination of old and new data on
block/sector boundaries to find a valid checksum on the read-side (see
the sketch below).
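
For options 2 and 4, the read-side check might look like this sketch
(toy checksum function, not crc32c; it assumes both the old and the new
chunk images are at hand, and it tries only prefix splits, where "every
combination" would enumerate per-sector subsets):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using Buf = std::vector<uint8_t>;

    uint32_t csum(const Buf& d) {  // toy stand-in checksum
      uint32_t c = 0;
      for (uint8_t b : d) c = c * 131 + b;
      return c;
    }

    // Option 2: accept the read if it matches either the old checksum or
    // the expected new checksum.
    bool loose_verify(const Buf& read_data, uint32_t old_csum,
                      uint32_t new_csum) {
      uint32_t c = csum(read_data);
      return c == old_csum || c == new_csum;
    }

    // Option 4: a partially completed wal io leaves a mix of old and new
    // sectors on disk; accept the read if it matches some sector-boundary
    // prefix mix of the two images.  (Assumes the chunk size is a
    // multiple of the sector size.)
    bool mixed_verify(const Buf& old_chunk, const Buf& new_chunk,
                      const Buf& read_data, size_t sector = 4096) {
      Buf trial = old_chunk;
      for (size_t split = 0; split <= trial.size(); split += sector) {
        std::copy(new_chunk.begin(), new_chunk.begin() + split,
                  trial.begin());
        if (csum(trial) == csum(read_data))
          return true;
      }
      return false;
    }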

I think #1 is a non-starter because it turns a 4K write into a 64K read
+ seek + 64K write on an HDD.  Or forces us to run with min_alloc_size=4K
on HDD, which would risk very bad fragmentation.

Which makes me want #3 (initially) and then #4.  But... if we do the
"wal is just a logical write", that means this weird replay handling
logic creeps into the normal write path.

I'm currently leaning toward keeping the wal events special
(lower-level), but doing what we can to make it work with the same mid-
to low-level helper functions (for reading and verifying blobs, etc.).
It occurred to me that this checksum consistency issue only comes up
when we are updating something that is smaller than the csum block
size.  And the real source of the problem is that you have a sequence of

     1- journal intent (kv wal item)
     2- do read io
     3- verify csum
     4- do write io
     5- cancel intent (remove kv wal item)

If we have an order like

     1- do read io
     2- journal intent for entire csum chunk (kv wal item)
     3- do write io
     4- cancel intent

Then the issue goes away.  And I'm thinking if the csum chunk is big
enough that the #2 step is too big of a wal item to perform well, then
the problem is your choice of csum block size, not the approach.  I.e.,
use a 4kb csum block size for rbd images, and use large blocks (128k,
512k, whatever) only for things that never see random overwrites (rgw
data).
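
A sketch of that reordered sequence, with a toy in-memory device and kv
store standing in for the real ones:

    #include <cstdint>
    #include <cstring>
    #include <map>
    #include <vector>

    using Chunk = std::vector<uint8_t>;
    std::map<uint64_t, Chunk> disk;    // toy device: chunk offset -> bytes
    std::map<uint64_t, Chunk> kv_wal;  // toy kv store of wal intents

    void overwrite_small(uint64_t chunk_off, uint64_t off_in_chunk,
                         const uint8_t* data, uint64_t len) {
      Chunk chunk = disk[chunk_off];             // 1- do read io
      if (chunk.size() < off_in_chunk + len)
        chunk.resize(off_in_chunk + len);
      std::memcpy(chunk.data() + off_in_chunk, data, len);
      kv_wal[chunk_off] = chunk;                 // 2- journal intent for
                                                 //    the entire csum chunk
      disk[chunk_off] = chunk;                   // 3- do write io
      kv_wal.erase(chunk_off);                   // 4- cancel intent
    }

    // Replay after a crash: any surviving intent holds the whole chunk,
    // so it is rewritten verbatim -- no read and no csum check to fail.
    void replay() {
      for (auto& kv : kv_wal)
        disk[kv.first] = kv.second;
      kv_wal.clear();
    }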

If that is good enough, then it might also mean that we can make the wal
operations never do reads--just (over)writes, further simplifying things
on that end.  In the jewel bluestore the only times we do reads are for
partial block updates (do we really care about these?  a buffer cache
could absorb them when it matters) and for copy/cow operations
post-clone (which I think are simple enough to deal with separately).

sage