Re: [PATCH] ceph: add buffered/direct exclusionary locking for reads and writes

On Tue, 2019-08-06 at 14:39 +0000, Sage Weil wrote:
> On Tue, 6 Aug 2019, Jeff Layton wrote:
> > On Tue, 2019-08-06 at 13:25 +0000, Sage Weil wrote:
> > > On Tue, 6 Aug 2019, Jeff Layton wrote:
> > > > On Tue, 2019-08-06 at 03:27 +0000, Sage Weil wrote:
> > > > > On Mon, 5 Aug 2019, Jeff Layton wrote:
> > > > > > xfstest generic/451 intermittently fails. The test does O_DIRECT writes
> > > > > > to a file, and then reads back the result using buffered I/O, while
> > > > > > running a separate set of tasks that are also doing buffered reads.
> > > > > > 
> > > > > > The client will invalidate the cache prior to a direct write, but it's
> > > > > > easy for one of the other readers' replies to race in and reinstantiate
> > > > > > the invalidated range with stale data.
> > > > > 
> > > > > Maybe a silly question, but: what if the write path did the invalidation 
> > > > > after the write instead of before?  Then any racing read will see the new 
> > > > > data on disk.
> > > > > 
> > > > 
> > > > I tried that originally. It reduces the race window somewhat, but it's
> > > > still present since a reply to a concurrent read can get in just after
> > > > the invalidation occurs. You really do have to serialize them to fix
> > > > this, AFAICT.
> > > 
> > > I've always assumed that viewing the ordering for concurrent operations as 
> > > non-deterministic is the only sane approach.  If the read initiates before 
> > > the write completes you have no obligation to reflect the result of the 
> > > write.
> > > 
> > > Is that what you're trying to do?
> > > 
> > 
> > Not exactly.
> > 
> > In this testcase, we have one thread that is alternating between DIO
> > writes and buffered reads. Logically, the buffered read should always
> > reflect the result of the DIO write...and indeed if we run that program
> > in isolation it always does, since the cache is invalidated just prior
> > to the DIO write.
> > 
> > The issue occurs when other tasks are doing buffered reads at the same
> > time. Sometimes the reply to one of those will come in after the
> > invalidation but before the subsequent buffered read by the writing
> > task. If that OSD read occurs before the OSD write then it'll populate
> 
> This part confuses me.  Let's say the DIO thread does 'write a', 'read 
> (a)', 'write b', then 'read (b)'.  And the interfering buffered reader 
> thread does a 'read (a)'.  In order for that to pollute the cache, the 
> reply has to arrive *after* the 'write b' ack is received and the 
> invalidation happens.  However, I don't think it's possible for a read ack 
> to arrive *after* a write and not reflect that write, since the reply 
> delivery from the osd to client is ordered.
> 
> Am I misunderstanding the sequence of events?
> 
> 1: write a
> 1: get write ack, invalidate
> 1: read (sees a)
> 1: write b
> 2: start buffered read
> 1: get write ack, invalidate
> 2: get buffered read reply, cache 'b'
> 1: read (sees cached b)
>
> If the buffered read starts earlier, then we would get the reply before 
> the write ack, and the cached result would get invalidated when the direct 
> write completes.
> 

The problem here though (IIUC) is that we don't have a strict guarantee
on the ordering of the reply processing, in particular between buffered
and direct I/O. We issue requests to libceph's OSD infrastructure, but
the readpages machinery is mostly done in the context of the task that
initiated it.

When the reply comes in from the OSD, the pagevec is populated and the
initiating task is then awoken, but nothing ensures that the tasks doing
that post-processing end up running in the "correct" order.

IOW, even though the read reply came in before the write reply, the
writing task ended up getting scheduled first when they were both
awoken.
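
That's the race the patch is trying to close, by excluding buffered and
direct I/O from one another on the inode. Roughly, the idea is something
like the sketch below (illustrative only, not the actual patch code --
the struct and function names are made up): buffered I/O takes a
per-inode lock shared, direct I/O takes it exclusive, so a buffered read
reply can't repopulate the pagecache while a DIO write (and its
invalidation) is still in flight.

#include <linux/rwsem.h>

/* hypothetical per-inode state, just for illustration */
struct demo_inode_info {
	struct rw_semaphore	i_io_rwsem;
};

/* buffered reads/writes can run concurrently with one another */
static void demo_start_io_buffered(struct demo_inode_info *ci)
{
	down_read(&ci->i_io_rwsem);
}

static void demo_end_io_buffered(struct demo_inode_info *ci)
{
	up_read(&ci->i_io_rwsem);
}

/* direct I/O excludes all buffered I/O until its OSD replies are in */
static void demo_start_io_direct(struct demo_inode_info *ci)
{
	down_write(&ci->i_io_rwsem);
}

static void demo_end_io_direct(struct demo_inode_info *ci)
{
	up_write(&ci->i_io_rwsem);
}

With something like that in place, an interfering buffered read either
finishes populating the pagecache before the DIO write begins its
invalidation, or it isn't dispatched until the write and its
invalidation are done -- either way it can't slip stale data in between.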

> It's sometimes possible for reads that race with writes to complete 
> *sooner* (i.e., return the pre-write value, because they don't wait for 
> things like degraded objects to be repaired, which writes do wait for), 
> but then the 
> read completes first.  There is a flag you can set to force reads to be 
> ordered like writes (RWORDERED), which means they will take at least as 
> long as racing writes. But IIRC reads never take longer than writes.
> 
>
> > the pagecache with stale data. The subsequent read by the writing thread
> > then ends up reading that stale data out of the cache.
> > 
> > Doing the invalidation after the DIO write reduces this race window to
> > some degree, but doesn't fully close it. You have to serialize things
> > such that buffered read requests aren't dispatched until all of the DIO
> > write replies have come in.
> > -- 
> > Jeff Layton <jlayton@xxxxxxxxxx>
> > 
> > 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>



