On Mon, 2016-05-02 at 19:03 +0300, Boaz Harrosh wrote:
> On 05/02/2016 06:51 PM, Vishal Verma wrote:
> > On Mon, 2016-05-02 at 18:41 +0300, Boaz Harrosh wrote:
> > > On 04/29/2016 12:16 AM, Vishal Verma wrote:
> > > >
> > > > All IO in a dax filesystem used to go through dax_do_io, which
> > > > cannot handle media errors, and thus cannot provide a recovery
> > > > path that can send a write through the driver to clear errors.
> > > >
> > > > Add a new iocb flag for DAX, and set it only for DAX mounts. In
> > > > the IO path for DAX filesystems, use the same direct_IO path for
> > > > both DAX and direct_io iocbs, but use the flags to identify when
> > > > we are in O_DIRECT mode vs non O_DIRECT with DAX, and for
> > > > O_DIRECT, use the conventional direct_IO path instead of DAX.
> > > >
> > > Really? What are you thinking here?
> > >
> > > What about all the current users of O_DIRECT? You have just made
> > > them 4 times slower and "less concurrent"* than "buffered io"
> > > users, since the direct_IO path will queue an IO request and all.
> > > (And if it is not so slow, then why do we need dax_do_io at all?
> > > [Rhetorical])
> > >
> > > I hate it that you overload the semantics of a known and expected
> > > O_DIRECT flag for special pmem quirks. This is an incompatible
> > > and unrelated overload of the semantics of O_DIRECT.
> >
> > We overloaded O_DIRECT a long time ago when we made DAX piggyback on
> > the same path:
> >
> > static inline bool io_is_direct(struct file *filp)
> > {
> > 	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> > }
> >
> No, as far as the user is concerned we have not. The O_DIRECT user
> is still getting all the semantics he wants, i.e. no syncs, no
> memory cache usage, no copies ...
>
> Only with DAX the buffered IO is the same, since with pmem it is
> faster. Then why not? The basic contract with the user did not break.
>
> The above was just an implementation detail to easily navigate
> through the Linux vfs IO stack and make the least amount of changes
> in every FS that wanted to support DAX. (And since dax_do_io is much
> more like direct_IO than like page-cache IO.)
>
> > Yes, O_DIRECT on a DAX mounted file system will now be slower, but -
> >
> > > > This allows us a recovery path in the form of opening the file
> > > > with O_DIRECT and writing to it with the usual O_DIRECT
> > > > semantics (sector alignment restrictions).
> > > >
> > > I understand that you want a sector aligned IO, right? For the
> > > clearing of errors. But I hate it that you forced all O_DIRECT IO
> > > to be slow for this.
> > > Can you not make dax_do_io handle media errors? At least for the
> > > parts of the IO that are aligned.
> > > (And your recovery path application above can use only aligned
> > > IO to make sure.)
> > >
> > > Please look for another solution. Even a special
> > > IOCTL_DAX_CLEAR_ERROR
> >
> > - see all the versions of this series prior to this one, where we
> > try to do a fallback...
> >
> And?
>
> So now all O_DIRECT APPs go 4 times slower. I will have a look, but
> if it is really so bad then please consider an IOCTL or syscall. Or
> a special O_DAX_ERRORS flag ...

I'm curious where the 4x slowdown comes from. The O_DIRECT path still
avoids page-cache copies, and it does not go through request queues
(since pmem is a bio-based driver). The only overhead is that of
submitting a bio - and while I agree it is more overhead than
dax_do_io, 4x seems a bit high.

> Please do not trash all the O_DIRECT users, they are the more
> important clients, like DBs and VMs.

Shouldn't they be using mmaps and dax faults? I was under the
impression that the dax_do_io path is a nice-to-have, but anyone who
really wants to use DAX will want the mmap/fault path, not the IO
path. This change just makes the IO path 'more correct' by giving it
a way to deal with errors.
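To make that concrete, the recovery write from user space would look
roughly like the sketch below. This is purely illustrative - the
function name is made up, the 4096-byte alignment is simply a safe
choice for any logical sector size (real code should query the
device), and most error handling is trimmed:

/* Hypothetical helper: overwrite a known-bad, sector-aligned range of
 * a file with good data via O_DIRECT, so the write goes through the
 * block driver and gives it a chance to clear the media error.
 * 'offset' and 'len' are assumed to be 4096-byte aligned.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int clear_bad_range(const char *path, off_t offset, size_t len)
{
	void *buf;
	int fd = open(path, O_WRONLY | O_DIRECT);

	if (fd < 0)
		return -1;
	/* O_DIRECT needs an aligned buffer as well as aligned offset/len */
	if (posix_memalign(&buf, 4096, len)) {
		close(fd);
		return -1;
	}
	memset(buf, 0, len);	/* or refill with data restored from a backup */
	if (pwrite(fd, buf, len, offset) != (ssize_t)len) {
		free(buf);
		close(fd);
		return -1;
	}
	fsync(fd);
	free(buf);
	close(fd);
	return 0;
}

The point is simply that a sector-aligned O_DIRECT overwrite of the
bad range reaches the driver, which can then clear the error for
those sectors.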
>
> Thanks
> Boaz
>
> > > [* "less concurrent" because of the queuing done in bdev. Note
> > > how pmem is not even multi-queue, and even if it were it would be
> > > much slower than DAX because of the code depth and all the locks
> > > and task switches done in the block layer. In DAX the final
> > > memcpy is done directly on the user-mode thread]
> > >
> > > Thanks
> > > Boaz
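For reference, the mmap/fault route mentioned above looks roughly like
this from user space. A minimal sketch, assuming the file already
exists on a DAX mount and is at least 'len' bytes long; the function
name and the bare-bones error handling are illustrative only:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: write 'len' bytes at the start of a DAX file
 * by storing through a shared mapping instead of using read/write.
 */
int write_via_dax_mmap(const char *path, const void *data, size_t len)
{
	void *addr;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return -1;
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		close(fd);
		return -1;
	}
	memcpy(addr, data, len);	/* the copy runs on this thread */
	msync(addr, len, MS_SYNC);	/* ask the kernel to make it durable */
	munmap(addr, len);
	close(fd);
	return 0;
}

With DAX there is no page cache in the middle: the faults map the
file's pmem pages straight into the process, so the memcpy above is
the whole data path - which is exactly the footnote's point about the
final memcpy happening on the user-mode thread.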