On Mon, 2016-05-02 at 19:03 +0300, Boaz Harrosh wrote:
> On 05/02/2016 06:51 PM, Vishal Verma wrote:
> > On Mon, 2016-05-02 at 18:41 +0300, Boaz Harrosh wrote:
> > > On 04/29/2016 12:16 AM, Vishal Verma wrote:
> > > >
> > > > All IO in a dax filesystem used to go through dax_do_io, which
> > > > cannot handle media errors, and thus cannot provide a recovery
> > > > path that can send a write through the driver to clear errors.
> > > >
> > > > Add a new iocb flag for DAX, and set it only for DAX mounts. In
> > > > the IO path for DAX filesystems, use the same direct_IO path for
> > > > both DAX and direct_io iocbs, but use the flags to identify when
> > > > we are in O_DIRECT mode vs non O_DIRECT with DAX, and for
> > > > O_DIRECT, use the conventional direct_IO path instead of DAX.
> > > >
> > > Really? What are you thinking here?
> > >
> > > What about all the current users of O_DIRECT? You have just made
> > > them 4 times slower and "less concurrent"* than "buffered io"
> > > users, since the direct_IO path will queue an IO request and all.
> > > (And if it is not so slow, then why do we need dax_do_io at all?
> > > [Rhetorical])
> > >
> > > I hate it that you overload the semantics of a known and expected
> > > O_DIRECT flag for special pmem quirks. This is an incompatible
> > > and unrelated overload of the semantics of O_DIRECT.
> >
> > We overloaded O_DIRECT a long time ago when we made DAX piggyback on
> > the same path:
> >
> > static inline bool io_is_direct(struct file *filp)
> > {
> > 	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> > }
> >
> No, as far as the user is concerned we have not. The O_DIRECT user
> is still getting all the semantics he wants, i.e. no syncs, no
> memory cache usage, no copies ...
>
> Only with DAX the buffered IO is the same, since with pmem it is
> faster. Then why not? The basic contract with the user did not break.
>
> The above was just an implementation detail to easily navigate
> through the Linux vfs IO stack and make the least amount of changes
> in every FS that wanted to support DAX. (And since dax_do_io is much
> more like direct_IO than like page-cache IO.)
>
> > Yes, O_DIRECT on a DAX mounted file system will now be slower, but -
> >
> > > > This allows us a recovery path in the form of opening the file
> > > > with O_DIRECT and writing to it with the usual O_DIRECT
> > > > semantics (sector alignment restrictions).
> > > >
> > > I understand that you want a sector aligned IO, right? For the
> > > clearing of errors. But I hate it that you forced all O_DIRECT IO
> > > to be slow for this.
> > > Can you not make dax_do_io handle media errors? At least for the
> > > parts of the IO that are aligned.
> > > (And your recovery path application above can use only aligned
> > > IO to make sure.)
> > >
> > > Please look for another solution. Even a special
> > > IOCTL_DAX_CLEAR_ERROR
> >
> > - see all the versions of this series prior to this one, where we
> > try to do a fallback...
> >
> And?
>
> So now all O_DIRECT APPs go 4 times slower. I will have a look, but
> if it is really so bad then please consider an IOCTL or syscall. Or
> a special O_DAX_ERRORS flag ...

I'm curious where the 4x slowdown comes from. The O_DIRECT path still
avoids page-cache copies, and it does not go through request queues
(since pmem is a bio-based driver). The only overhead is that of
submitting a bio - and while I agree it is more overhead than
dax_do_io, 4x seems a bit high.

> Please do not trash all the O_DIRECT users, they are the more
> important clients, like DBs and VMs.

Shouldn't they be using mmaps and dax faults? I was under the
impression that the dax_do_io path is a nice-to-have, but anyone who
really wants to use DAX will want the mmap/fault path, not the IO
path. This change just makes the IO path 'more correct' by giving it
a way to deal with errors.
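To make that concrete, the recovery write from user space would look
roughly like the sketch below. This is purely illustrative - the
function name is made up, the 4096-byte alignment is simply a safe
choice for any logical sector size (real code should query the
device), and most error handling is trimmed:

/* Hypothetical helper: overwrite a known-bad, sector-aligned range of
 * a file with good data via O_DIRECT, so the write goes through the
 * block driver and gives it a chance to clear the media error.
 * 'offset' and 'len' are assumed to be 4096-byte aligned.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int clear_bad_range(const char *path, off_t offset, size_t len)
{
	void *buf;
	int fd = open(path, O_WRONLY | O_DIRECT);

	if (fd < 0)
		return -1;
	/* O_DIRECT needs an aligned buffer as well as aligned offset/len */
	if (posix_memalign(&buf, 4096, len)) {
		close(fd);
		return -1;
	}
	memset(buf, 0, len);	/* or refill with data restored from a backup */
	if (pwrite(fd, buf, len, offset) != (ssize_t)len) {
		free(buf);
		close(fd);
		return -1;
	}
	fsync(fd);
	free(buf);
	close(fd);
	return 0;
}

The point is simply that a sector-aligned O_DIRECT overwrite of the
bad range reaches the driver, which can then clear the error for
those sectors.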
>
> Thanks
> Boaz
>
> > > [* "less concurrent" because of the queuing done in bdev. Note
> > > how pmem is not even multi-queue, and even if it were it would be
> > > much slower than DAX because of the code depth and all the locks
> > > and task switches done in the block layer. In DAX the final
> > > memcpy is done directly on the user-mode thread]
> > >
> > > Thanks
> > > Boaz
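For reference, the mmap/fault route mentioned above looks roughly like
this from user space. A minimal sketch, assuming the file already
exists on a DAX mount and is at least 'len' bytes long; the function
name and the bare-bones error handling are illustrative only:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: write 'len' bytes at the start of a DAX file
 * by storing through a shared mapping instead of using read/write.
 */
int write_via_dax_mmap(const char *path, const void *data, size_t len)
{
	void *addr;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return -1;
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		close(fd);
		return -1;
	}
	memcpy(addr, data, len);	/* the copy runs on this thread */
	msync(addr, len, MS_SYNC);	/* ask the kernel to make it durable */
	munmap(addr, len);
	close(fd);
	return 0;
}

With DAX there is no page cache in the middle: the faults map the
file's pmem pages straight into the process, so the memcpy above is
the whole data path - which is exactly the footnote's point about the
final memcpy happening on the user-mode thread.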