Christoph Hellwig, on 07/30/2010 06:20 PM wrote:
On Fri, Jul 30, 2010 at 05:44:08PM +0400, Vladislav Bolkhovitin wrote:
Yes, but why not take it a step further and allow the waiting/draining to be
eliminated completely using ORDERED requests? Current advanced storage
hardware allows that.
There are a few cases where we could do that - the fsync without metadata
changes above would be the prime example. But there's a lot of lower-hanging
fruit before we get to the point where it's worth trying.
Yes, but since an interface and file system update is coming anyway,
why not design the interface now and then gradually fill it in with an
implementation?
All barrier discussions get very hot. That is a clear sign that the
current approach doesn't satisfy many people, from FS developers to
storage vendors and users. I believe this is because the whole barrier
ideology is unnatural, hence all the trouble fitting it to real life.
Apparently, this approach needs some redesign to get into a more
acceptable form.
IMHO, all that is needed is:
1. Allow requests to optionally be combined into groups, and allow optional
properties to be set per group: caching and ordering modes (see below). Each
group would reflect a higher-level operation.
2. Allow request groups to be chained. Each chain would reflect an ordering
dependency between groups, i.e. between higher-level operations.
This interface is a natural extension of the current interface, and natural
for storage too. In the extreme case of an empty group, it could be
implemented as a barrier, although, since there would be no dependencies
between unchained groups, those would still be freely reordered against each
other.
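
To make this more concrete, here is a minimal sketch of what such a
grouping/chaining interface could look like at the block layer. Nothing
below exists in the kernel; all the blk_group_* names, types and
signatures are purely my own illustration of the shape of the proposal:

/*
 * Purely illustrative sketch - nothing here exists in the kernel.
 */
#include <linux/blkdev.h>       /* struct request_queue, struct bio */

/* Optional per-group caching behaviour (see the list below). */
enum blk_group_cache_mode {
        BLK_GROUP_CACHE_NONE,   /* no cache flushing needed */
        BLK_GROUP_FLUSH_EACH,   /* flush after each request */
        BLK_GROUP_FLUSH_AT_END, /* flush once after all finished */
};

/* Optional per-group ordering behaviour. */
enum blk_group_order_mode {
        BLK_GROUP_ORDER_NONE,   /* requests may be reordered freely */
        BLK_GROUP_ORDERED,      /* requests execute in submission order */
};

/* One group reflects one higher-level operation. */
struct blk_request_group {
        struct request_queue            *q;
        enum blk_group_cache_mode       cache_mode;
        enum blk_group_order_mode       order_mode;
        /* list of member requests, chain links, completion, ... */
};

/* Create a group with the given optional properties. */
struct blk_request_group *blk_group_alloc(struct request_queue *q,
                                          enum blk_group_cache_mode cache,
                                          enum blk_group_order_mode order);

/* Add a bio to the group instead of submitting it standalone. */
void blk_group_add_bio(struct blk_request_group *grp, struct bio *bio);

/*
 * Chain @next after @prev: nothing in @next may reach the device before
 * everything in @prev has. Unchained groups stay freely reorderable.
 */
void blk_group_chain(struct blk_request_group *prev,
                     struct blk_request_group *next);

/* Submit the group; @wait selects synchronous or async completion. */
int blk_group_submit(struct blk_request_group *grp, bool wait);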
We would need request grouping sooner or later anyway, because otherwise it
is impossible to implement selective cache flushing instead of flushing the
cache for the whole device as we do now. This is a highly demanded feature,
especially for shared and distributed devices.
The caching properties would be:
- None (default) - no cache flushing needed.
- "Flush after each request". It would be translated to FUA on write-back
devices with FUA support, to a (write, sync_cache) sequence on write-back
devices without FUA, and to nothing on write-through devices.
- "Flush at once after all finished". It would be translated to one or more
SYNC_CACHE commands, executed after everything has finished and syncing
_only_ what was modified in the group, not the whole device as now.
The ordering properties would be:
- None (default) - no ordering dependency between requests in the group.
- ORDERED - all requests in the group must be executed in order.
Additionally, if the backend device supports ORDERED commands, this facility
would be used to eliminate extra queue draining. For instance, "flush after
each request" on WB devices without FUA would become a sequence of ORDERED
commands: [(write, sync_cache) ... (write, sync_cache), wait]. Compare that
to the [(write, wait, sync_cache, wait) ... (write, wait, sync_cache, wait)]
needed to achieve the same without ORDERED command support.
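
As a rough illustration only, reusing the hypothetical types from the
sketch above, the per-device translation of the "flush after each
request" mode could look like this (none of these helpers exist; they
just name the cases described above):

/*
 * Illustrative only: how a group's "flush after each request" caching
 * mode could be translated depending on the device.
 */
static void group_submit_one_write(struct blk_request_group *grp,
                                   struct bio *bio)
{
        if (grp->cache_mode != BLK_GROUP_FLUSH_EACH) {
                submit_plain_write(bio);                /* hypothetical */
                return;
        }

        if (device_is_write_through(grp->q)) {
                /* write-through cache: nothing extra is needed */
                submit_plain_write(bio);
        } else if (device_has_fua(grp->q)) {
                /* write-back with FUA: a single WRITE with FUA set */
                submit_fua_write(bio);
        } else if (device_supports_ordered_tags(grp->q)) {
                /*
                 * Write-back without FUA but with ORDERED commands:
                 * (write, sync_cache) issued as ORDERED, no draining,
                 * one wait at the very end of the group.
                 */
                submit_ordered_write_then_sync_cache(bio);
        } else {
                /*
                 * Write-back without FUA and without ORDERED commands:
                 * today's (write, wait, sync_cache, wait) per request.
                 */
                submit_write_wait_sync_cache_wait(bio);
        }
}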
For instance, your example of the fsync in XFS would become:
1) Write out all the data blocks as a group with no caching and no ordering
properties.
2) Wait for that group to finish.
3) Propagate any I/O errors to the inode so we can pick them up.
4) Update the inode size in the shadow in-memory structure.
5) Start a transaction to log the inode size in a new group with the property
"Flush at once after all finished" and no ordering (or, if necessary - it
isn't clear from your text - ORDERED).
6) Write out a log buffer containing the inode and btree updates in another
new group, chained after the group from (5), with the necessary cache
flushing and ordering properties.
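
Expressed with the hypothetical blk_group_* interface sketched above,
that could look roughly like this (not actual XFS code; all the helpers
are again only assumptions for illustration):

/* Illustrative sketch only - reuses the hypothetical blk_group_* API. */
static int fsync_with_groups(struct inode *inode)
{
        struct request_queue *q = inode_request_queue(inode); /* hypothetical */
        struct blk_request_group *data, *size, *log;
        int error;

        /* 1) data blocks: no caching, no ordering properties */
        data = blk_group_alloc(q, BLK_GROUP_CACHE_NONE, BLK_GROUP_ORDER_NONE);
        queue_data_writes(data, inode);                 /* hypothetical */

        /* 2) + 3) wait and propagate any I/O errors to the inode */
        error = blk_group_submit(data, true);
        if (error)
                record_inode_write_error(inode, error); /* hypothetical */

        /* 4) update the inode size in the shadow in-memory structure */
        update_shadow_inode_size(inode);                /* hypothetical */

        /* 5) transaction logging the inode size: flush once after all done */
        size = blk_group_alloc(q, BLK_GROUP_FLUSH_AT_END, BLK_GROUP_ORDER_NONE);
        queue_inode_size_log(size, inode);              /* hypothetical */

        /* 6) log buffer with inode + btree updates, chained after (5) */
        log = blk_group_alloc(q, BLK_GROUP_FLUSH_AT_END, BLK_GROUP_ORDERED);
        queue_log_buffer(log, inode);                   /* hypothetical */
        blk_group_chain(size, log);

        /* the chain, not a wait, enforces the ordering between the groups */
        blk_group_submit(size, false);
        return blk_group_submit(log, false);
}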
I believe this can be implemented acceptably simply and effectively,
including at the I/O scheduler level, and I have some ideas for that.
Just my 5c from the storage vendor side.
But in most cases we don't just drain an imaginary queue but actually
need to modify software state before finishing one class of I/O and
submitting the next.
Again, take the example of fsync, but this time we have actually
extended the file and need to log an inode size update, as well
as a modification to the btree blocks.
Now the fsync in XFS looks like this:
1) write out all the data blocks using WRITE
2) wait for these to finish
3) propagate any I/O errors to the inode so we can pick them up
4) update the inode size in the shadow in-memory structure
5) start a transaction to log the inode size
6) flush the write cache to make sure the data really is on disk
Here there should be a "6.1) wait for it to finish", which could be
eliminated if the requests were sent ordered, correct?
7) write out a log buffer containing the inode and btree updates
8) if the FUA bit is not supported, flush the cache again
and yes, the flush in 6) is important so that we don't happen
to log the inode size update before all data has made it to disk,
in case the cache flush in 8) is interrupted
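
For contrast with the grouped version above, the current sequence with
all the waits written out explicitly would look roughly like this
(purely illustrative pseudo-C, not actual XFS code; every helper here
is hypothetical):

/* Rough illustration of the eight steps above, not real XFS code. */
static int fsync_with_cache_flushes(struct inode *inode)
{
        int error;

        write_data_blocks(inode);                       /* 1) plain WRITEs */
        error = wait_for_data_writes(inode);            /* 2) */
        if (error)
                record_inode_write_error(inode, error); /* 3) */

        update_shadow_inode_size(inode);                /* 4) */
        start_inode_size_transaction(inode);            /* 5) */

        /*
         * 6) + "6.1)": flush the write cache and wait, so the data is on
         * the media before the log buffer below can possibly reach it.
         * With ORDERED commands this is the wait that could go away.
         */
        flush_write_cache_and_wait(inode);

        write_log_buffer(inode);                        /* 7) inode + btree */
        if (!bdev_supports_fua(inode))                  /* hypothetical */
                flush_write_cache_and_wait(inode);      /* 8) */

        return error;
}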