Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()

On Tue, Jan 24, 2017 at 03:34:04AM +0000, Trond Myklebust wrote:
> The reason why I'm thinking open() is because it has to be a contract
> between a specific application and the kernel. If the application
> doesn't open the file with the O_TIMEOUT flag, then it shouldn't see
> nasty non-POSIX timeout errors, even if there is another process that
> is using that flag on the same file.
> 
> The only place where that is difficult to manage is when the file is
> mmap()ed (no file descriptor), so you'd presumably have to disallow
> mixing mmap and O_TIMEOUT.

Well, technically there *is* a file descriptor when you do an mmap.
You can close the fd after you call mmap(), but the mmap bumps the
refcount on the struct file while the memory map is active.
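
To make that concrete, here's a minimal userspace sketch (assuming
/tmp/example exists and is at least a page long); the mapping stays
valid after close() precisely because mmap() took its own reference
on the struct file:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/tmp/example", O_RDONLY);
        char *p;

        if (fd < 0)
                return 1;

        p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);              /* fd is gone, but the mapping is still live */

        if (p != MAP_FAILED) {
                printf("first byte: %c\n", p[0]);
                munmap(p, 4096);
        }
        return 0;
}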

I would argue, though, that at least for buffered writes, the timeout
has to be a property of the underlying inode, and if there is an
attempt to set a timeout on an inode that already has its timeout set
to some other non-zero value, the "set timeout" operation should fail
with a "timeout already set" error.  That's because we really don't
want to have to keep track, on a per-page basis, of which struct file
was responsible for dirtying a page --- and what if it is dirtied via
two different file descriptors?
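
In pseudo-kernel-code, the semantics I have in mind would be roughly
the following (purely illustrative: the i_write_timeout_ms field
doesn't exist, and -EBUSY is just a placeholder for "timeout already
set"):

/*
 * Hypothetical sketch only.  The point is "first non-zero setter
 * wins; a later, conflicting setter fails", enforced at the inode.
 */
static int set_inode_write_timeout(struct inode *inode, unsigned int ms)
{
        int ret = 0;

        spin_lock(&inode->i_lock);
        if (inode->i_write_timeout_ms && inode->i_write_timeout_ms != ms)
                ret = -EBUSY;           /* "timeout already set" */
        else
                inode->i_write_timeout_ms = ms;
        spin_unlock(&inode->i_lock);

        return ret;
}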

That being said, I suspect that for many applications, the timeout is
going to be *much* more interesting for O_DIRECT writes, and there we
can certainly have different timeouts on a per-fd basis.  This is
especially true for cases where the timeout is implemented in the
storage device, using multi-media extensions, and where the timeout
might be measured in milliseconds (e.g., there's no point reading a
video frame if it's been delayed too long).  The block layer would
need to know about this as well, since the timeout needs to be
relative to when the read(2) system call is issued, not to when the
request is finally submitted to the storage device.
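
In other words, the deadline has to be captured when the syscall
enters the kernel and carried with the request.  The arithmetic is
trivial; here's a userspace sketch (the per-fd timeout value is of
course hypothetical):

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Captured at read(2) entry, using the (hypothetical) per-fd timeout. */
static uint64_t read_deadline_ns(unsigned int fd_timeout_ms)
{
        return now_ns() + (uint64_t)fd_timeout_ms * 1000000ULL;
}

/* Checked when the request is about to be dispatched to the device. */
static bool deadline_expired(uint64_t deadline_ns)
{
        return now_ns() >= deadline_ns;
}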

And if the process has suitable privileges, perhaps the I/O scheduler
should take the timeout into account, so that reads with a timeout
attached are submitted first, on the presumption that reads without a
timeout can afford to be queued.  If the process doesn't have suitable
privileges, or if the cgroup has exceeded its I/O quota, perhaps the
right answer would be to fail the read right away.  In the case of a
cluster file system, if a particular server knows it can't serve a
particular low-latency read within the SLO, it might be worthwhile to
signal to the cluster file system client that it should start doing an
erasure-code reconstruction right away (or read from one of the
mirrors if the file is stored with n=3 replication, etc.)
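
Client-side, that policy might look roughly like this (all of the
helpers and the error-code convention are made up; it's only the
fallback order that matters):

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct chunk;   /* opaque handle; hypothetical */

/* Hypothetical helpers, declarations only. */
int read_from_primary(struct chunk *c, void *buf, size_t len,
                      uint64_t deadline_ns);
int read_from_any_mirror(struct chunk *c, void *buf, size_t len,
                         uint64_t deadline_ns);
int reconstruct_from_parity(struct chunk *c, void *buf, size_t len,
                            uint64_t deadline_ns);
int chunk_has_mirrors(struct chunk *c);

static int read_chunk_low_latency(struct chunk *c, void *buf, size_t len,
                                  uint64_t deadline_ns)
{
        int ret = read_from_primary(c, buf, len, deadline_ns);

        /* Primary can't meet the SLO: don't wait, go to the redundancy. */
        if (ret == -ETIME || ret == -EAGAIN) {
                if (chunk_has_mirrors(c))
                        ret = read_from_any_mirror(c, buf, len, deadline_ns);
                else
                        ret = reconstruct_from_parity(c, buf, len, deadline_ns);
        }
        return ret;
}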

So depending on what the goals of userspace are, there are a number of
different kernel policies that might be the best match for the
particular application in question.  In particular, if you are trying
to provide low latency reads to assure decent response time for web
applications, it may be *reads* that are much more interesting for
timeout purposes rather than *writes*.

(Especially in a distributed system, you're going to be using some
kind of encoding with redundancy, so as long as enough of the writes
have completed, it doesn't matter if the other writes take a long time
--- although if you eventually decide that the write's never going to
make it, it's ideal if you can reshard the chunk more aggressively,
instead of waiting for the scrubbing pass to notice that some of the
redundant copies of the chunk had gotten corrupted or were never
written out.)
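
A trivial sketch of what "enough of the writes have completed" means
with, say, n=3 replication and a quorum of two (the constants are
obviously illustrative):

#include <stdbool.h>

#define NR_REPLICAS     3       /* n=3 replication, as above */
#define WRITE_QUORUM    2       /* acks needed before the write "completes" */

static bool write_complete(const bool acked[NR_REPLICAS])
{
        int i, n = 0;

        for (i = 0; i < NR_REPLICAS; i++)
                if (acked[i])
                        n++;
        /* Stragglers get repaired (or the chunk resharded) later. */
        return n >= WRITE_QUORUM;
}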

Cheers,

					- Ted


