On Tue, 6 Aug 2019, Jeff Layton wrote:
> On Tue, 2019-08-06 at 14:39 +0000, Sage Weil wrote:
> > On Tue, 6 Aug 2019, Jeff Layton wrote:
> > > On Tue, 2019-08-06 at 13:25 +0000, Sage Weil wrote:
> > > > On Tue, 6 Aug 2019, Jeff Layton wrote:
> > > > > On Tue, 2019-08-06 at 03:27 +0000, Sage Weil wrote:
> > > > > > On Mon, 5 Aug 2019, Jeff Layton wrote:
> > > > > > > xfstest generic/451 intermittently fails. The test does O_DIRECT writes
> > > > > > > to a file, and then reads back the result using buffered I/O, while
> > > > > > > running a separate set of tasks that are also doing buffered reads.
> > > > > > >
> > > > > > > The client will invalidate the cache prior to a direct write, but it's
> > > > > > > easy for one of the other readers' replies to race in and reinstantiate
> > > > > > > the invalidated range with stale data.
> > > > > >
> > > > > > Maybe a silly question, but: what if the write path did the invalidation
> > > > > > after the write instead of before? Then any racing read will see the new
> > > > > > data on disk.
> > > > >
> > > > > I tried that originally. It reduces the race window somewhat, but it's
> > > > > still present since a reply to a concurrent read can get in just after
> > > > > the invalidation occurs. You really do have to serialize them to fix
> > > > > this, AFAICT.
> > > >
> > > > I've always assumed that viewing the ordering for concurrent operations as
> > > > non-deterministic is the only sane approach. If the read initiates before
> > > > the write completes you have no obligation to reflect the result of the
> > > > write.
> > > >
> > > > Is that what you're trying to do?
> > >
> > > Not exactly.
> > >
> > > In this testcase, we have one thread that is alternating between DIO
> > > writes and buffered reads. Logically, the buffered read should always
> > > reflect the result of the DIO write...and indeed if we run that program
> > > in isolation it always does, since the cache is invalidated just prior
> > > to the DIO write.
> > >
> > > The issue occurs when other tasks are doing buffered reads at the same
> > > time. Sometimes the reply to one of those will come in after the
> > > invalidation but before the subsequent buffered read by the writing
> > > task. If that OSD read occurs before the OSD write then it'll populate
> >
> > This part confuses me. Let's say the DIO thread does 'write a', 'read
> > (a)', 'write b', then 'read (b)'. And the interfering buffered reader
> > thread does a 'read (a)'. In order for that to pollute the cache, the
> > reply has to arrive *after* the 'write b' ack is received and the
> > invalidation happens. However, I don't think it's possible for a read ack
> > to arrive *after* a write and not reflect that write, since the reply
> > delivery from the osd to client is ordered.
> >
> > Am I misunderstanding the sequence of events?
> >
> > 1: write a
> > 1: get write ack, invalidate
> > 1: read (sees a)
> > 1: write b
> > 2: start buffered read
> > 1: get write ack, invalidate
> > 2: get buffered read reply, cache 'b'
> > 1: read (sees cached b)
> >
> > If the buffered read starts earlier, then we would get the reply before
> > the write ack, and the cached result would get invalidated when the direct
> > write completes.
> >
>
> The problem here though (IIUC) is that we don't have a strict guarantee
> on the ordering of the reply processing, in particular between buffered
> and direct I/O. We issue requests to libceph's OSD infrastructure, but
> the readpages machinery is mostly done in the context of the task that
> initiated it.
>
> When we get the reply from the OSD, the pagevec is populated and then
> the initiating task is awoken, but nothing ensures that the initiating
> tasks doing the postprocessing end up serialized in the "correct" order.
>
> IOW, even though the read reply came in before the write reply, the
> writing task ended up getting scheduled first when they were both
> awoken.

Aha, I understand now.  Thanks!
sage

>
> > It's sometimes possible for reads that race with writes to complete
> > *sooner* (i.e., return pre-write value, because they don't wait for things
> > like degraded objects to be repaired, which writes do wait for), but then the
> > read completes first. There is a flag you can set to force reads to be
> > ordered like writes (RWORDERED), which means they will take at least as
> > long as racing writes. But IIRC reads never take longer than writes.
> >
> > > the pagecache with stale data. The subsequent read by the writing thread
> > > then ends up reading that stale data out of the cache.
> > >
> > > Doing the invalidation after the DIO write reduces this race window to
> > > some degree, but doesn't fully close it. You have to serialize things
> > > such that buffered read requests aren't dispatched until all of the DIO
> > > write replies have come in.
> > > --
> > > Jeff Layton <jlayton@xxxxxxxxxx>
> > >
> >
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>
>
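
To make the serialization Jeff describes above concrete, the following is a
minimal userspace sketch of the idea, assuming a plain reader/writer lock is
an acceptable model: a direct write holds the lock exclusively across both the
OSD write and the invalidation, while buffered reads take it shared around
their read-and-repopulate step, so a racing read reply can never put pre-write
data back into the cache. The names (dio_write, buffered_read, backing_store,
page_cache) are hypothetical stand-ins rather than the ceph client's actual
functions, and a pthread rwlock stands in for whatever kernel-side lock the
real fix would use.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/*
 * Toy model: backing_store is the object on the OSDs, page_cache is the
 * client's cached copy.  io_lock orders buffered reads against direct
 * writes; cache_mutex only keeps the readers' cache updates consistent
 * with one another.
 */
static pthread_rwlock_t io_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t cache_mutex = PTHREAD_MUTEX_INITIALIZER;
static char backing_store[64];
static char page_cache[64];
static int  cache_valid;

static void dio_write(const char *buf)
{
	/* Exclude buffered readers across the write *and* the invalidation. */
	pthread_rwlock_wrlock(&io_lock);
	strcpy(backing_store, buf);	/* "OSD write" */
	cache_valid = 0;		/* invalidate the cached range */
	pthread_rwlock_unlock(&io_lock);
}

static void buffered_read(char *buf)
{
	/*
	 * Readers may run concurrently with each other, but never with a
	 * direct write, so a read reply cannot repopulate the cache with
	 * stale, pre-write data.
	 */
	pthread_rwlock_rdlock(&io_lock);
	pthread_mutex_lock(&cache_mutex);
	if (!cache_valid) {
		strcpy(page_cache, backing_store);	/* "OSD read" + populate */
		cache_valid = 1;
	}
	strcpy(buf, page_cache);
	pthread_mutex_unlock(&cache_mutex);
	pthread_rwlock_unlock(&io_lock);
}

int main(void)
{
	char buf[64];

	/*
	 * The generic/451 pattern from the writing thread's point of view:
	 * a DIO write followed by a buffered read must observe the new data,
	 * no matter how many other threads call buffered_read() meanwhile.
	 */
	dio_write("new data");
	buffered_read(buf);
	printf("read back: %s\n", buf);
	return 0;
}

Build with cc -pthread. The only property the sketch relies on is that the
direct write and its invalidation form a single critical section with respect
to buffered reads; whether the kernel client would express that with an
rwsem, i_rwsem, or some other mechanism is a separate question for the patch
itself.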