On Tue, 6 Aug 2019, Jeff Layton wrote:
> On Tue, 2019-08-06 at 14:39 +0000, Sage Weil wrote:
> > On Tue, 6 Aug 2019, Jeff Layton wrote:
> > > On Tue, 2019-08-06 at 13:25 +0000, Sage Weil wrote:
> > > > On Tue, 6 Aug 2019, Jeff Layton wrote:
> > > > > On Tue, 2019-08-06 at 03:27 +0000, Sage Weil wrote:
> > > > > > On Mon, 5 Aug 2019, Jeff Layton wrote:
> > > > > > > xfstest generic/451 intermittently fails. The test does O_DIRECT writes
> > > > > > > to a file, and then reads back the result using buffered I/O, while
> > > > > > > running a separate set of tasks that are also doing buffered reads.
> > > > > > >
> > > > > > > The client will invalidate the cache prior to a direct write, but it's
> > > > > > > easy for one of the other readers' replies to race in and reinstantiate
> > > > > > > the invalidated range with stale data.
> > > > > >
> > > > > > Maybe a silly question, but: what if the write path did the invalidation
> > > > > > after the write instead of before? Then any racing read will see the new
> > > > > > data on disk.
> > > > >
> > > > > I tried that originally. It reduces the race window somewhat, but it's
> > > > > still present since a reply to a concurrent read can get in just after
> > > > > the invalidation occurs. You really do have to serialize them to fix
> > > > > this, AFAICT.
> > > >
> > > > I've always assumed that viewing the ordering for concurrent operations as
> > > > non-deterministic is the only sane approach. If the read initiates before
> > > > the write completes you have no obligation to reflect the result of the
> > > > write.
> > > >
> > > > Is that what you're trying to do?
> > >
> > > Not exactly.
> > >
> > > In this testcase, we have one thread that is alternating between DIO
> > > writes and buffered reads. Logically, the buffered read should always
> > > reflect the result of the DIO write...and indeed if we run that program
> > > in isolation it always does, since the cache is invalidated just prior
> > > to the DIO write.
> > >
> > > The issue occurs when other tasks are doing buffered reads at the same
> > > time. Sometimes the reply to one of those will come in after the
> > > invalidation but before the subsequent buffered read by the writing
> > > task. If that OSD read occurs before the OSD write then it'll populate
> >
> > This part confuses me. Let's say the DIO thread does 'write a', 'read
> > (a)', 'write b', then 'read (b)'. And the interfering buffered reader
> > thread does a 'read (a)'. In order for that to pollute the cache, the
> > reply has to arrive *after* the 'write b' ack is received and the
> > invalidation happens. However, I don't think it's possible for a read ack
> > to arrive *after* a write and not reflect that write, since the reply
> > delivery from the osd to client is ordered.
> >
> > Am I misunderstanding the sequence of events?
> >
> > 1: write a
> > 1: get write ack, invalidate
> > 1: read (sees a)
> > 1: write b
> > 2: start buffered read
> > 1: get write ack, invalidate
> > 2: get buffered read reply, cache 'b'
> > 1: read (sees cached b)
> >
> > If the buffered read starts earlier, then we would get the reply before
> > the write ack, and the cached result would get invalidated when the direct
> > write completes.
> >
>
> The problem here though (IIUC) is that we don't have a strict guarantee
> on the ordering of the reply processing, in particular between buffered
> and direct I/O. We issue requests to libceph's OSD infrastructure, but
> the readpages machinery is mostly done in the context of the task that
> initiated it.
>
> When we get the reply from the OSD, the pagevec is populated and then
> the initiating task is awoken, but nothing ensures that the initiating
> tasks doing the postprocessing end up serialized in the "correct" order.
>
> IOW, even though the read reply came in before the write reply, the
> writing task ended up getting scheduled first when they were both
> awoken.

Aha, I understand now.  Thanks!
sage

>
> > It's sometimes possible for reads that race with writes to complete
> > *sooner* (i.e., return pre-write value, because they don't wait for things
> > like degraded objects to be repaired, which writes do wait for), but then the
> > read completes first. There is a flag you can set to force reads to be
> > ordered like writes (RWORDERED), which means they will take at least as
> > long as racing writes. But IIRC reads never take longer than writes.
> >
> > > the pagecache with stale data. The subsequent read by the writing thread
> > > then ends up reading that stale data out of the cache.
> > >
> > > Doing the invalidation after the DIO write reduces this race window to
> > > some degree, but doesn't fully close it. You have to serialize things
> > > such that buffered read requests aren't dispatched until all of the DIO
> > > write replies have come in.
> > > --
> > > Jeff Layton <jlayton@xxxxxxxxxx>
> > >
> >
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>
>
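
To make the serialization Jeff describes above concrete, the following is a
minimal userspace sketch of the idea, assuming a plain reader/writer lock is
an acceptable model: a direct write holds the lock exclusively across both the
OSD write and the invalidation, while buffered reads take it shared around
their read-and-repopulate step, so a racing read reply can never put pre-write
data back into the cache. The names (dio_write, buffered_read, backing_store,
page_cache) are hypothetical stand-ins rather than the ceph client's actual
functions, and a pthread rwlock stands in for whatever kernel-side lock the
real fix would use.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/*
 * Toy model: backing_store is the object on the OSDs, page_cache is the
 * client's cached copy.  io_lock orders buffered reads against direct
 * writes; cache_mutex only keeps the readers' cache updates consistent
 * with one another.
 */
static pthread_rwlock_t io_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t cache_mutex = PTHREAD_MUTEX_INITIALIZER;
static char backing_store[64];
static char page_cache[64];
static int  cache_valid;

static void dio_write(const char *buf)
{
	/* Exclude buffered readers across the write *and* the invalidation. */
	pthread_rwlock_wrlock(&io_lock);
	strcpy(backing_store, buf);	/* "OSD write" */
	cache_valid = 0;		/* invalidate the cached range */
	pthread_rwlock_unlock(&io_lock);
}

static void buffered_read(char *buf)
{
	/*
	 * Readers may run concurrently with each other, but never with a
	 * direct write, so a read reply cannot repopulate the cache with
	 * stale, pre-write data.
	 */
	pthread_rwlock_rdlock(&io_lock);
	pthread_mutex_lock(&cache_mutex);
	if (!cache_valid) {
		strcpy(page_cache, backing_store);	/* "OSD read" + populate */
		cache_valid = 1;
	}
	strcpy(buf, page_cache);
	pthread_mutex_unlock(&cache_mutex);
	pthread_rwlock_unlock(&io_lock);
}

int main(void)
{
	char buf[64];

	/*
	 * The generic/451 pattern from the writing thread's point of view:
	 * a DIO write followed by a buffered read must observe the new data,
	 * no matter how many other threads call buffered_read() meanwhile.
	 */
	dio_write("new data");
	buffered_read(buf);
	printf("read back: %s\n", buf);
	return 0;
}

Build with cc -pthread. The only property the sketch relies on is that the
direct write and its invalidation form a single critical section with respect
to buffered reads; whether the kernel client would express that with an
rwsem, i_rwsem, or some other mechanism is a separate question for the patch
itself.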