On Thu, 25 Oct 2012, Theodore Ts'o wrote:
On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
By trusting fsync(). And if you don't care about immediate Durability
you can run the fsync() in a background thread and mark the associated
transaction as completed in the next transaction to be written after
the fsync() completes.
The challenge is when you have entagled metadata updates. That is,
you update file A, and file B, and file A and B might share metadata.
In order to sync file A, you also have to update part of the metadata
for the updates to file B, which means calculating the dependencies of
what you have to drag in can get very complicated. You can keep track
of what bits of the metadata you have to undo and then redo before
writing out the metadata for fsync(A), but that basically means you
have to implement soft updates, and all of the complexity this
implies: http://lwn.net/Articles/339337/
If you can keep all of the metadata separate, this can be somewhat
mitigated, but usually the block allocation records (regardless of
whether you use a tree, or a bitmap, or some other data structure)
tends of have entanglement problems.
hmm, two thoughts occur to me.
1. to avoid entanglement, put the two files in separate directories
2. take advantage of entaglement to enforce ordering
thread 1 (repeated): write new message to file 1, spawn new thread to
fsync
thread 2: write to file 2 that message1-5 are being worked on
thread 2 (later): write to file 2 that messages 1-5 are done
when thread 1 spawns the new thread to do the fsync, the system will be
forced to write the data to file 2 as of the time it does the fsync.
This should make it so that you never have data written to file2 that
refers to data that hasn't been written to file1 yet.
It certainly is not impossible; RDBMS's have implemented this. On the
other hand, they generally aren't as fast as file systems for
non-transactional workloads, and people really care about performance
on those sorts of workloads for file systems.
the RDBMS's have implemented stronger guarantees than what we are needing
A few years ago I was investigating this for logging. With the reliable
(RDBMS style) , but inefficent disk queue that rsyslog has, writing to a
high-end fusion-io SSD, ext2 resulted in ~8K logs/sec, ext3 resultedin ~2K
logs/sec, and JFS/XFS resulted in ~4K logs/sec (ext4 wasn't considered
stable enough at the time to be tested)
Still, if you want to try to implement such a thing, by all means,
give it a try. But I think you'll find that creating a file system
that can compete with existing file systems for performance, and
*then* also supports a transactional model, is going to be quite a
challenge.
The question is trying to figure a way to get ordering right with existing
filesystms (preferrably without using something too tied to a single
filesystem implementation), not try and create a new one.
The frustrating thing is that when people point out how things like sqlite
are so horribly slow, the reply seems to be "well, that's what you get for
doing so many fsyncs, don't do that", when there is a 'problem' like the
KDE "config loss" problem a few years ago, the response is "well, that's
what you get for not doing fsync"
Both responses are correct, from a purely technical point of view.
But what's missing is any way to get the result of ordered I/O that will
let you do something pretty fast, but with the guarantee that, if you
loose data in a crash, the only loss you are risking is that your most
recent data may be missing. (either for one file, or using multiple files
if that's what it takes)
Since this topic came up again, I figured I'd poke a bit and try to either
get educated on how to do this "right" or try and see if there's something
that could be added to the kernel to make it possible for userspace
programs to do this.
What I think userspace really needs is something like a barrier function
call. "for this fd, don't re-order writes as they go down through the
stack"
If the hardware is going to reorder things once it hits the hardware, this
is going to hurt performance (how much depends on a lot of stuff)
but the filesystems are able to make their journals work, so there should
be some way to let userspace do some sort of similar ordering
David Lang
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html