Re: 2 related bluestore questions

On Thu, 12 May 2016, Igor Fedotov wrote:
> The second write in my example isn't processed through WAL - it's large and
> overwrites the whole blob...

If it's large, it wouldn't overwrite--it would go to newly allocated 
space.  We can *never* overwrite without wal or else we corrupt previous 
data...

sage


> 
> 
> On 12.05.2016 19:43, Sage Weil wrote:
> > On Thu, 12 May 2016, Igor Fedotov wrote:
> > > Yet another potential issue with WAL I can imagine:
> > > 
> > > Let's have some small write going to WAL followed by a larger aligned
> > > overwrite to the same extent that bypasses WAL. Is it possible that the
> > > first write is processed later and overwrites the second one? I think so.
> > Yeah, that would be chaos.  The wal ops are already ordered by the
> > sequencer (or ordered globally, if bluestore_sync_wal_apply=true), so this
> > can't happen.
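The ordering guarantee described here can be modeled in a few lines. This is a hedged Python sketch, not the BlueStore C++ code; `Sequencer`, `submit`, and `drain` are illustrative names, and the real sequencer interleaves with IO completion rather than draining synchronously:

```python
from collections import deque

class Sequencer:
    """Toy model of the ordering guarantee: ops submitted to one
    sequencer apply strictly in submission order, so a queued small
    WAL write can never land on top of a later large overwrite."""
    def __init__(self):
        self.queue = deque()

    def submit(self, op):
        self.queue.append(op)

    def drain(self, store):
        # FIFO apply: the overwrite submitted last is applied last.
        while self.queue:
            op = self.queue.popleft()
            op(store)

store = {}
seq = Sequencer()
seq.submit(lambda s: s.update({"extent": "small-wal-write"}))
seq.submit(lambda s: s.update({"extent": "large-overwrite"}))
seq.drain(store)
# store["extent"] is now "large-overwrite": the later op wins.
```

The point is only that a single FIFO per sequencer is sufficient to rule out the reordering Igor describes.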
> > 
> > sage
> > 
> > 
> > > This way we can probably come to the conclusion that all requests should
> > > be processed in sequence. One should prohibit multiple processing flows
> > > for requests, as that may break their ordering.
> > > 
> > > Yeah - I'm attacking the WAL concept from this angle...
> > > 
> > > 
> > > Thanks,
> > > Igor
> > > 
> > > On 12.05.2016 5:58, Sage Weil wrote:
> > > > On Wed, 11 May 2016, Allen Samuels wrote:
> > > > > Sorry, still on vacation and I haven't really wrapped my head around
> > > > > everything that's being discussed. However, w.r.t. wal operations, I
> > > > > would strongly favor an approach that minimizes the amount of "future"
> > > > > operations that are recorded (which I'll call intentions -- i.e.,
> > > > > non-binding hints about extra work that needs to get done). Much of
> > > > > the
> > > > > complexity here is because the intentions -- after being recorded --
> > > > > will need to be altered based on subsequent operations. Hence every
> > > > > write operation will need to digest the historical intentions and
> > > > > potentially update them -- this is VERY complex, potentially much more
> > > > > complex than code that simply examines the current state and
> > > > > re-determines the correct next operation (i.e., de-wal, gc, etc.)
> > > > > 
> > > > > Additional complexity arises because you're recording two sets of
> > > > > state
> > > > > that require consistency checking -- in my experience, this road leads
> > > > > to perdition....
> > > > I agree it has to be something manageable that we can reason about.  I
> > > > think the question for me is mostly about which path minimizes the
> > > > complexity while still getting us a reasonable level of performance.
> > > > 
> > > > I had one new thought, see below...
> > > > 
> > > > > > > The downside is that any logically conflicting request (an
> > > > > > > overlapping
> > > > > > > write or truncate or zero) needs to drain the wal events, whereas
> > > > > > > with
> > > > > > > a lower-level wal description there might be cases where we can
> > > > > > > ignore
> > > > > > > the wal operation.  I suspect the trivial solution of o->flush()
> > > > > > > on
> > > > > > > write/truncate/zero will be pretty visible in benchmarks.
> > > > > > > Tracking
> > > > > > > in-flight wal ops with an interval_set would probably work well
> > > > > > > enough.
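The interval_set idea mentioned above might look roughly like this. A Python sketch; `InFlightWal` and its methods are hypothetical stand-ins for Ceph's `interval_set<uint64_t>`, and a real implementation would merge adjacent extents:

```python
class InFlightWal:
    """Track extents covered by in-flight wal ops as (offset, length)
    intervals; only a conflicting write needs to drain the wal queue,
    instead of an unconditional o->flush()."""
    def __init__(self):
        self.extents = []  # stand-in for interval_set

    def insert(self, off, length):
        self.extents.append((off, length))

    def erase(self, off, length):
        # Called once the wal op has committed.
        self.extents.remove((off, length))

    def intersects(self, off, length):
        return any(o < off + length and off < o + l
                   for o, l in self.extents)

wal = InFlightWal()
wal.insert(0, 4096)                       # small WAL write in flight
needs_flush = wal.intersects(2048, 4096)  # overlapping overwrite: drain first
```

A non-overlapping write (say at offset 8192) would skip the drain entirely, which is what keeps this cheaper than flushing on every write/truncate/zero.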
> > > > > > Hmm, I'm not sure this will pan out.  The main problem is that if we
> > > > > > call back
> > > > > > into the write code (with a sync flag), we will have to do write IO,
> > > > > > and
> > > > > > this
> > > > > > wreaks havoc on our otherwise (mostly) orderly state machine.
> > > > > > I think it can be done if we build in a similar guard like
> > > > > > _txc_finish_io so that
> > > > > > we wait for the wal events to also complete IO in order before
> > > > > > committing
> > > > > > them.  I think.
> > > > > > 
> > > > > > But the other problem is the checksum thing that came up in another
> > > > > > thread,
> > > > > > where the read-side of a read/modify/write might fail the checksum
> > > > > > because
> > > > > > the wal write hit disk but the kv portion didn't commit. I see a few
> > > > > > options:
> > > > > > 
> > > > > >    1) If there are checksums and we're doing a sub-block overwrite,
> > > > > > we
> > > > > > have to
> > > > > > write/cow it elsewhere.  This probably means min_alloc_size cow
> > > > > > operations
> > > > > > for small writes.  In which case we needn't bother doing a wal even
> > > > > > in
> > > > > > the
> > > > > > first place--the whole point is to enable an overwrite.
> > > > > > 
> > > > > >    2) We do loose checksum validation that will accept either the
> > > > > > old
> > > > > > checksum
> > > > > > or the expected new checksum for the read stage.  This handles these
> > > > > > two
> > > > > > crash cases:
> > > > > > 
> > > > > >    * kv commit of op + wal event
> > > > > >      <crash here, or>
> > > > > >    * do wal io (completely)
> > > > > >      <crash before cleaning up wal event>
> > > > > >    * kv cleanup of wal event
> > > > > > 
> > > > > > but not the case where we only partially complete the wal io.  Which
> > > > > > means
> > > > > > there is a small probability we "corrupt" ourselves on crash (not
> > > > > > really
> > > > > > corrupt,
> > > > > > but confuse ourselves such that we refuse to replay the wal events
> > > > > > on
> > > > > > startup).
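Option 2's loose validation could be sketched as below. Python for brevity, with CRC32 standing in for whichever csum algorithm is configured; `loose_verify` is an illustrative name, not a BlueStore function:

```python
import zlib

def csum(data):
    # Stand-in for the configured checksum algorithm.
    return zlib.crc32(data)

def loose_verify(data, old_csum, new_csum):
    """Option 2: on the read side of wal replay, accept either the
    pre-write or the expected post-write checksum, since a crash may
    land between the kv commit and the wal io completing."""
    c = csum(data)
    return c == old_csum or c == new_csum

old_block, new_block = b"A" * 4096, b"B" * 4096
# Crash before the wal io: we read back old data -- accepted.
ok_old = loose_verify(old_block, csum(old_block), csum(new_block))
# Crash after the wal io but before wal cleanup: new data -- accepted.
ok_new = loose_verify(new_block, csum(old_block), csum(new_block))
```

As noted above, the case this does not cover is a partially completed wal io, where the block matches neither checksum.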
> > > > > > 
> > > > > >    3) Same as 2, but simply warn if we fail that read-side checksum
> > > > > > on
> > > > > > replay.
> > > > > > This basically introduces a *very* small window which could allow an
> > > > > > ondisk
> > > > > > corruption to get absorbed into our checksum.  This could just be #2
> > > > > > + a
> > > > > > config option so we warn instead of erroring out.
> > > > > > 
> > > > > >    4) Same as 2, but we try every combination of old and new data on
> > > > > > block/sector boundaries to find a valid checksum on the read-side.
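One way to read #4: if replay has both the pre- and post-image of the chunk available (an assumption -- the wal item would need to record both), then "trying every combination on block/sector boundaries" collapses to checking each sector against either image. A hedged Python sketch, not the proposed implementation:

```python
SECTOR = 512

def is_torn_wal_state(disk, old, new, sector=SECTOR):
    """A failed read-side checksum is explainable as a torn wal write
    iff every sector of the on-disk data matches either the old or the
    new bytes; anything else is real corruption."""
    for i in range(0, len(disk), sector):
        s = disk[i:i + sector]
        if s != old[i:i + sector] and s != new[i:i + sector]:
            return False
    return True

old, new = b"o" * 1024, b"n" * 1024
torn = new[:512] + old[512:]   # first sector written, crash before the second
ok = is_torn_wal_state(torn, old, new)
```

A sector holding bytes from neither image fails the check, which is what distinguishes a torn write from on-disk corruption.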
> > > > > > 
> > > > > > I think #1 is a non-starter because it turns a 4K write into a 64K
> > > > > > read
> > > > > > + seek +
> > > > > > 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on
> > > > > > HDD,
> > > > > > which would risk very bad fragmentation.
> > > > > > 
> > > > > > Which makes me want #3 (initially) and then #4.  But... if we do the
> > > > > > "wal is
> > > > > > just a logical write", that means this weird replay handling logic
> > > > > > creeps into
> > > > > > the normal write path.
> > > > > > 
> > > > > > I'm currently leaning toward keeping the wal events special
> > > > > > (lower-level), but
> > > > > > doing what we can to make it work with the same mid- to low-level
> > > > > > helper
> > > > > > functions (for reading and verifying blobs, etc.).
> > > > It occurred to me that this checksum consistency issue only comes up when
> > > > we are updating something that is smaller than the csum block size.  And
> > > > the real source of the problem is that you have a sequence of
> > > > 
> > > >    1- journal intent (kv wal item)
> > > >    2- do read io
> > > >    3- verify csum
> > > >    4- do write io
> > > >    5- cancel intent (remove kv wal item)
> > > > 
> > > > If we have an order like
> > > > 
> > > >    1- do read io
> > > >    2- journal intent for entire csum chunk (kv wal item)
> > > >    3- do write io
> > > >    4- cancel intent
> > > > 
> > > > Then the issue goes away.  And I'm thinking if the csum chunk is big
> > > > enough that the #2 step is too big of a wal item to perform well, then
> > > > the
> > > > problem is your choice of csum block size, not the approach.  I.e., use
> > > > a
> > > > 4kb csum block size for rbd images, and use large blocks (128k, 512k,
> > > > whatever) only for things that never see random overwrites (rgw data).
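The reordered four-step sequence could look roughly like the following toy model. Python sketch only; the function names and the `kv` dict are illustrative, and the key idea taken from the steps above is that the wal item journals the *entire* post-image csum chunk, so replay is a blind overwrite:

```python
def submit_small_overwrite(kv, disk, chunk_off, small_off, small_bytes,
                           csum_size=4096):
    """Reordered sequence: (1) read the affected csum chunk, (2) journal
    the full post-image chunk as the wal item, (3) do the write io,
    (4) cancel the intent."""
    # 1) read io: fetch the whole csum chunk
    old = bytes(disk[chunk_off:chunk_off + csum_size])
    # 2) journal intent for the entire csum chunk
    rel = small_off - chunk_off
    post = old[:rel] + small_bytes + old[rel + len(small_bytes):]
    kv["wal"] = (chunk_off, post)
    # 3) do write io
    disk[chunk_off:chunk_off + csum_size] = post
    # 4) cancel intent
    del kv["wal"]

def replay(kv, disk):
    # Replay is a blind overwrite of the journaled post-image:
    # no read io, hence no read-side checksum to get confused about.
    if "wal" in kv:
        off, post = kv["wal"]
        disk[off:off + len(post)] = post
        del kv["wal"]

disk = bytearray(b"o" * 4096)
kv = {}
submit_small_overwrite(kv, disk, 0, 512, b"N" * 512)
```

The cost, as noted, is a wal item the size of the csum chunk, which is why this only makes sense with a small (e.g. 4KB) csum block size on random-overwrite workloads.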
> > > > 
> > > > If that is good enough, then it might also mean that we can make the wal
> > > > operations never do reads--just (over)writes, further simplifying things
> > > > on that end.  In the jewel bluestore the only time we do reads is for
> > > > partial block updates (do we really care about these?  a buffer cache
> > > > could absorb them when it matters) and for copy/cow operations
> > > > post-clone (which I think are simple enough to be dealt with
> > > > separately).
> > > > 
> > > > sage
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> 