On Thu 26-01-17 11:36:35, NeilBrown wrote: > On Wed, Jan 25 2017, Theodore Ts'o wrote: > > On Tue, Jan 24, 2017 at 03:34:04AM +0000, Trond Myklebust wrote: > >> The reason why I'm thinking open() is because it has to be a contract > >> between a specific application and the kernel. If the application > >> doesn't open the file with the O_TIMEOUT flag, then it shouldn't see > >> nasty non-POSIX timeout errors, even if there is another process that > >> is using that flag on the same file. > >> > >> The only place where that is difficult to manage is when the file is > >> mmap()ed (no file descriptor), so you'd presumably have to disallow > >> mixing mmap and O_TIMEOUT. > > > > Well, technically there *is* a file descriptor when you do an mmap. > > You can close the fd after you call mmap(), but the mmap bumps the > > refcount on the struct file while the memory map is active. > > > > I would argue though that at least for buffered writes, the timeout > > has to be property of the underlying inode, and if there is an attempt > > to set timeout on an inode that already has a timeout set to some > > other non-zero value, the "set timeout" operation should fail with a > > "timeout already set". That's becuase we really don't want to have to > > keep track, on a per-page basis, which struct file was responsible for > > dirtying a page --- and what if it is dirtied by two different file > > descriptors? > > You seem to have a very different idea to the one that is forming in my > mind. In my vision, once the data has entered the page cache, it > doesn't matter at all where it came from. It will remain in the page > cache, as a dirty page, until it is successfully written or until an > unrecoverable error occurs. There are no timeouts once the data is in > the page cache. Heh, this has somehow drifted away from the original topic of handling IO errors :) > Actually, I'm leaning away from timeouts in general. I'm not against > them, but not entirely sure they are useful. > > To be more specific, I imagine a new open flag "O_IO_NDELAY". It is a > bit like O_NDELAY, but it explicitly affects IO, never the actual open() > call, and it is explicitly allowed on regular files and block devices. > > When combined with O_DIRECT, it effectively means "no retries". For > block devices and files backed by block devices, > REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be > reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't > help. > Non-block-device filesystems would behave differently. e.g. NFS would > probably use a RPC_TASK_SOFT call instead of the normal 'hard' call. > > When used without O_DIRECT: > - read would trigger read-ahead much as it does now (which can do > nothing if there are resource issues) and would only return data > if it was already in the cache. There was a patch set which did this [1]. Not on per-fd basis but rather on per-IO basis. Andrew blocked it because he was convinced that mincore() is good enough interface for this. > - write would try to allocate a page, tell the filesystem that it > is dirty so that journal space is reserved or whatever is needed, > and would tell the dirty_pages rate-limiting that another page was > dirty. If the rate-limiting reported that we cannot dirty a page > without waiting, or if any other needed resources were not available, > then the write would fail (-EWOULDBLOCK). > - fsync would just fail if there were any dirty pages. It might also > do the equivalent of sync_file_range(SYNC_FILE_RANGE_WRITE) without > any *WAIT* flags. (alternately, fsync could remain unchanged, and > sync_file_range() could gain a SYNC_FILE_RANGE_TEST flag). > > > With O_DIRECT there would be a delay, but it would be limited and there > would be no retry. There is not currently any way to impose a specific > delay on REQ_FAILFAST* requests. > Without O_DIRECT, there could be no significant delay, though code might > have to wait for a mutex or similar. > There are a few places that a timeout could usefully be inserted, but > I'm not sure that would be better than just having the app try again in > a little while - it would have to be prepared for that anyway. > > I would like O_DIRECT|O_IO_NDELAY for mdadm so we could safely work with > devices that block when no paths are available. For O_DIRECT writes, there are database people who want to do non-blocking AIO writes. Although the problem they want to solve is different - rather similar to the one patch set [1] is trying to solve for buffered reads - they want to do AIO write and they want it really non-blocking so they can do IO submission directly from computation thread without the cost of the offload to a different process which normally does the IO. Now you need something different for mdadm but interfaces should probably be consistent... > > That being said, I suspect that for many applications, the timeout is > > going to be *much* more interesting for O_DIRECT writes, and there we > > can certainly have different timeouts on a per-fd basis. This is > > especially for cases where the timeout is implemented in storage > > device, using multi-media extensions, and where the timout might be > > measured in milliseconds (e.g., no point reading a video frame if its > > been delayed too long). That being said, it block layer would need to > > know about this as well, since the timeout needs to be relative to > > when the read(2) system call is issued, not to when it is finally > > submitted to the storage device. > > Yes. If a deadline could be added to "struct bio", and honoured by > drivers, then that would make a timeout much more interesting for > O_DIRECT. Timeouts are nice but IMO a lot of work and I suspect you'd really need a dedicated "real-time" IO scheduler for this. Honza [1] https://lwn.net/Articles/636955/ -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>