Jeff Garzik wrote:
> > It's not optimal even then.
> >
> > Devices: On a software RAID, you ideally don't want to issue flushes
> > to all drives if your database did a 1-block commit entry. (But they
> > probably use O_DIRECT anyway, changing the rules again.) But all that
> > can be optimised in generic VFS code eventually. It doesn't need
> > filesystem assistance in most cases.
>
> My own idea is that we create a FLUSH command for blkdev request
> queues, to exist alongside READ, WRITE, and the current barrier
> implementation. Then FLUSH could be passed down through MD or DM.

I like your thought, and it has the benefit of being simple.

My thought is very similar, but with (hopefully not premature...)
optimisations:

 - I would merge FLUSH with a preceding write in some cases, converting
   the pair into an FUA write command. The generic request queue is
   probably the best place to detect and merge. This is so that
   userspace filesystems (including guest VMs) and databases can do
   journal commits with the same I/O sequence as in-kernel filesystems.
   (A rough sketch of such a merge follows below.)

 - I would create BARRIER too, so that a userspace API can ask for this
   weaker form of fsync, which may improve the throughput of userspace
   journalling.

 - I would include a sector range in FLUSH and BARRIER, so that MD and
   DM can flush _only_ the relevant sub-devices. This may improve
   journalling performance for both kernel and userspace filesystems,
   because journal commits are often very small and hit only one or two
   sub-devices in a RAID. (Also sketched below.)

 - I would ask the nice MD and DM people to accept tag-barriers rather
   than flush-barriers on the input queue, converting them to
   tag-barriers, flush-barriers and independent FLUSHes on the
   sub-device queues, according to the sector ranges and subsequent
   writes.

It's not obvious, but my barrier proposal which started this thread is
designed to support an efficient inter-sub-device flush-barrier when
necessary, and a single-sub-device tag-barrier when possible.

-- Jamie
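
To make the FUA merge concrete, here is a minimal, self-contained sketch
of the kind of check a generic request queue could make: when a FLUSH
arrives whose sector range is covered by the immediately preceding,
not-yet-dispatched WRITE, fold it into that write as an FUA write.
This is plain user-space C for illustration only; the types and helper
names (struct request, queue_flush, flush_covered_by_write and so on)
are invented for the example and are not the kernel's request-queue API.

/* flush_fua_merge.c -- illustrative only; all names are hypothetical and
 * not part of any real kernel or block-layer API. */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

enum req_op { OP_READ, OP_WRITE, OP_FLUSH, OP_BARRIER };

struct request {
    enum req_op op;
    uint64_t    sector;   /* start of the affected range            */
    uint32_t    nsectors; /* length of the range (0 = whole device) */
    bool        fua;      /* write must reach stable media          */
    bool        dispatched;
};

#define QDEPTH 16

struct queue {
    struct request reqs[QDEPTH];
    int            count;
};

/* Would a FLUSH covering [sector, sector+nsectors) be satisfied by
 * making only the write 'w' durable?  True when the flush range lies
 * entirely inside the write's own range. */
static bool flush_covered_by_write(const struct request *w,
                                   uint64_t sector, uint32_t nsectors)
{
    return nsectors != 0 &&
           sector >= w->sector &&
           sector + nsectors <= w->sector + w->nsectors;
}

/* Queue a FLUSH, merging it into the preceding WRITE as an FUA write
 * when that is sufficient.  This sketch only looks at the immediately
 * preceding queued request. */
static void queue_flush(struct queue *q, uint64_t sector, uint32_t nsectors)
{
    if (q->count > 0) {
        struct request *prev = &q->reqs[q->count - 1];

        if (prev->op == OP_WRITE && !prev->dispatched &&
            flush_covered_by_write(prev, sector, nsectors)) {
            prev->fua = true;   /* WRITE + FLUSH  ->  FUA write */
            return;
        }
    }

    q->reqs[q->count++] = (struct request){
        .op = OP_FLUSH, .sector = sector, .nsectors = nsectors,
    };
}

static void queue_write(struct queue *q, uint64_t sector, uint32_t nsectors)
{
    q->reqs[q->count++] = (struct request){
        .op = OP_WRITE, .sector = sector, .nsectors = nsectors,
    };
}

int main(void)
{
    struct queue q = { .count = 0 };

    /* A one-block journal commit: write the commit block, then flush it. */
    queue_write(&q, 1024, 8);
    queue_flush(&q, 1024, 8);      /* merged into the write above */

    /* A flush over a wider range cannot be merged and stays separate.   */
    queue_write(&q, 2048, 8);
    queue_flush(&q, 0, 0);         /* whole-device flush */

    for (int i = 0; i < q.count; i++)
        printf("req %d: op=%d sector=%llu n=%u fua=%d\n", i,
               q.reqs[i].op,
               (unsigned long long)q.reqs[i].sector,
               q.reqs[i].nsectors, q.reqs[i].fua);
    return 0;
}

A real implementation would also have to verify that no earlier,
still-volatile write falls inside the flush range before dropping the
standalone FLUSH; the sketch glosses over that.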
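
The sector-range point can be illustrated with an equally hypothetical
toy: a RAID-0 style layout with a fixed chunk size, where a range-limited
FLUSH is routed only to the sub-devices its chunks land on, while a
rangeless FLUSH still fans out to all of them. The layout, chunk size and
function names below are invented for the example and do not correspond
to the real MD or DM code.

/* range_flush_routing.c -- illustrative only; the striping layout and
 * all names are hypothetical, not taken from MD or DM. */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define NDEVS         4    /* sub-devices in the hypothetical stripe set */
#define CHUNK_SECTORS 128  /* stripe chunk size in sectors               */

/* Map a logical sector to the sub-device holding it in a simple RAID-0
 * style layout: consecutive chunks rotate across the sub-devices. */
static int subdev_for_sector(uint64_t sector)
{
    return (int)((sector / CHUNK_SECTORS) % NDEVS);
}

/* Given FLUSH(sector, nsectors), mark only the sub-devices the range
 * actually touches.  nsectors == 0 means "whole device": flush all. */
static void route_flush(uint64_t sector, uint32_t nsectors,
                        bool needs_flush[NDEVS])
{
    for (int i = 0; i < NDEVS; i++)
        needs_flush[i] = (nsectors == 0);
    if (nsectors == 0)
        return;

    /* Walk the range chunk by chunk; a range shorter than one full
     * stripe usually hits only one or two sub-devices. */
    uint64_t end = sector + nsectors;
    for (uint64_t s = sector; s < end; ) {
        needs_flush[subdev_for_sector(s)] = true;
        s = (s / CHUNK_SECTORS + 1) * CHUNK_SECTORS; /* next chunk start */
    }
}

static void show(const char *what, uint64_t sector, uint32_t nsectors)
{
    bool needs_flush[NDEVS];
    route_flush(sector, nsectors, needs_flush);

    printf("%s (sector=%llu, n=%u): flush", what,
           (unsigned long long)sector, nsectors);
    for (int i = 0; i < NDEVS; i++)
        if (needs_flush[i])
            printf(" dev%d", i);
    printf("\n");
}

int main(void)
{
    /* A small journal commit touches one sub-device... */
    show("1-block commit", 1024, 8);
    /* ...a commit straddling a chunk boundary touches two... */
    show("straddling commit", 127, 8);
    /* ...and a rangeless flush still has to hit every sub-device. */
    show("whole-device flush", 0, 0);
    return 0;
}

With a 128-sector chunk, the one-block commit in the example reaches a
single sub-device, which is exactly the case where flushing every drive
in the array would be wasted work.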