Re: [RFC] relaxed barrier semantics

Christoph Hellwig, on 07/30/2010 06:20 PM wrote:
> On Fri, Jul 30, 2010 at 05:44:08PM +0400, Vladislav Bolkhovitin wrote:
>> Yes, but why not go a step further and allow the waiting/draining to be
>> completely eliminated using ORDERED requests? Current advanced storage
>> hardware allows that.

> There are a few cases where we could do that - the fsync without metadata
> changes above would be the prime example.  But there's a lot of lower
> hanging fruit until we get to the point where it's worth trying.

Yes, but since an interface and file system update is coming anyway, why not design the interface now and then gradually fill in the implementation behind it?

Barrier discussions always run hot. That alone suggests the current approach satisfies few people, from FS developers to storage vendors and users. I believe this is because the whole barrier concept is not natural, so it takes too much effort to fit it to real-life use. Apparently, this approach needs some redesign to take a more acceptable form.

IMHO, all that is needed is:

1. Allow requests to be optionally combined into groups, and allow each group to be given optional properties: caching and ordering modes (see below). Each group would reflect a higher-level operation.

2. Allow request groups to be chained. Each chain would reflect an ordering dependency between groups, i.e. between higher-level operations.

This interface is a natural extension of the current one, and it is natural for storage too. In the extreme case of an empty group it could be implemented as a barrier, although, since there would be no dependencies between unchained groups, such groups could still be freely reordered with respect to each other.

We will need request grouping sooner or later anyway, because without it, it is impossible to implement selective cache flushing instead of flushing the cache of the whole device as we do now. That is a highly demanded feature, especially for shared and distributed devices.
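
To make this a bit more concrete, below is a minimal sketch of what such a grouping interface could look like. Everything in it is invented for illustration (none of these types or names exist in the block layer); it only tries to capture the two points above, using the caching and ordering modes described below:

#include <stddef.h>

/* Hypothetical sketch only; all names are invented for illustration. */

enum group_cache_mode {
        GRP_CACHE_NONE,         /* default: no cache flushing needed */
        GRP_FLUSH_EACH,         /* flush after each request in the group */
        GRP_FLUSH_AFTER_ALL,    /* one flush once all requests have finished */
};

enum group_order_mode {
        GRP_ORDER_NONE,         /* default: requests may be freely reordered */
        GRP_ORDERED,            /* requests must be executed in order */
};

struct request;                 /* opaque stand-in for a block layer request */

struct request_group {
        enum group_cache_mode cache_mode;
        enum group_order_mode order_mode;
        struct request **requests;              /* members of this group */
        size_t nr_requests;
        struct request_group *chained_after;    /* this group must not start
                                                   before the group it is
                                                   chained after has completed */
};

An empty group with only a chain link would then degenerate into the barrier case mentioned above.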

The caching properties would be:

- None (default): no cache flushing needed.

- "Flush after each request": translated to a FUA write on write-back devices with FUA support, to a (write, SYNC_CACHE) sequence on write-back devices without FUA, and to nothing on write-through devices.

- "Flush once after all requests have finished": translated to one or more SYNC_CACHE commands, executed after everything in the group has completed and syncing _only_ what the group modified, not the whole device as now.

The ordering properties would be:

- None (default): no ordering dependency between requests in the group.

- ORDERED: all requests in the group must be executed in order.

Additionally, if the backend device supports ORDERED commands, this facility would be used to eliminate the extra queue draining. For instance, "flush after each request" on write-back devices without FUA would become a sequence of ORDERED commands: [(write, sync_cache) ... (write, sync_cache), wait], compared with the [(write, wait, sync_cache, wait) ... (write, wait, sync_cache, wait)] needed to achieve the same without ORDERED command support.
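
Spelled out as submission sequences (again with invented helper names standing in for whatever the queueing code would really call), the difference looks roughly like this:

struct request;

/* Invented helpers for illustration: */
void submit(struct request *rq);                /* queue as a SIMPLE command */
void submit_ordered(struct request *rq);        /* queue as an ORDERED command */
void drain(void);                               /* wait for everything queued
                                                   so far to complete */
struct request *sync_cache_for(struct request *rq);     /* SYNC_CACHE covering
                                                           only rq's range */

/* "Flush after each request" on a write-back device without FUA. */

static void flush_each_without_ordered(struct request **w, int n)
{
        int i;

        /* (write, wait, sync_cache, wait) ... (write, wait, sync_cache, wait) */
        for (i = 0; i < n; i++) {
                submit(w[i]);
                drain();
                submit(sync_cache_for(w[i]));
                drain();
        }
}

static void flush_each_with_ordered(struct request **w, int n)
{
        int i;

        /* (write, sync_cache) ... (write, sync_cache), then a single wait at
           the end; the device itself preserves the order, so no per-step
           draining is needed. */
        for (i = 0; i < n; i++) {
                submit_ordered(w[i]);
                submit_ordered(sync_cache_for(w[i]));
        }
        drain();
}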

For instance, your example of fsync in XFS would become:

1) Write out all the data blocks as a group with no caching or ordering properties.

2) Wait for that group to finish.

3) Propagate any I/O errors to the inode so we can pick them up.

4) Update the inode size in the shadow in-memory structure.

5) Start a transaction to log the inode size, in a new group with the "flush once after all requests have finished" property and no ordering (or, if necessary (it isn't clear from your text), ORDERED).

6) Write out a log buffer containing the inode and btree updates in a new group chained after the group from (5), with the necessary cache-flushing and ordering properties.
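
Put together, and reusing the invented names from the group sketch earlier in this mail, steps 1-6 above could be expressed roughly as:

struct fsync_ctx;               /* stand-in for the inode / transaction state */
struct request_group;           /* from the sketch above */

/* Invented helpers built on top of the earlier group sketch: */
struct request_group *group_alloc(enum group_cache_mode c, enum group_order_mode o);
void group_chain_after(struct request_group *grp, struct request_group *prev);
int group_submit(struct request_group *grp);
int group_submit_and_wait(struct request_group *grp);
void queue_data_writes(struct request_group *grp, struct fsync_ctx *ctx);
void queue_inode_size_log(struct request_group *grp, struct fsync_ctx *ctx);
void queue_log_buffer(struct request_group *grp, struct fsync_ctx *ctx);
void propagate_io_error(struct fsync_ctx *ctx, int error);
void update_shadow_inode_size(struct fsync_ctx *ctx);

static int fsync_with_groups(struct fsync_ctx *ctx)
{
        struct request_group *data, *size_grp, *log_grp;
        int error;

        /* 1) + 2): data blocks as a plain group, wait for it to finish */
        data = group_alloc(GRP_CACHE_NONE, GRP_ORDER_NONE);
        queue_data_writes(data, ctx);
        error = group_submit_and_wait(data);

        /* 3) propagate any I/O errors to the inode */
        if (error)
                propagate_io_error(ctx, error);

        /* 4) update the inode size in the shadow in-memory structure */
        update_shadow_inode_size(ctx);

        /* 5) transaction logging the inode size, in a new group with
              "flush once after all requests have finished" and no ordering */
        size_grp = group_alloc(GRP_FLUSH_AFTER_ALL, GRP_ORDER_NONE);
        queue_inode_size_log(size_grp, ctx);

        /* 6) log buffer with the inode and btree updates, chained after the
              group from (5); the chain carries the ordering, so no explicit
              cache-flush-and-drain is needed in between */
        log_grp = group_alloc(GRP_FLUSH_AFTER_ALL, GRP_ORDER_NONE);
        group_chain_after(log_grp, size_grp);
        queue_log_buffer(log_grp, ctx);

        error = group_submit(size_grp);
        if (error)
                return error;
        return group_submit(log_grp);
}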

I believe this can be implemented reasonably simply and efficiently, including at the I/O scheduler level, and I have some ideas for how to do it.

Just my 5c from the storage vendor side.

> But in most cases we don't just drain an imaginary queue but actually
> need to modify software state before finishing one class of I/O and
> submitting the next.
>
> Again, take the example of fsync, but this time we have actually
> extended the file and need to log an inode size update, as well
> as a modification to the btree blocks.
>
> Now the fsync in XFS looks like this:
>
> 1) write out all the data blocks using WRITE
> 2) wait for these to finish
> 3) propagate any I/O error to the inode so we can pick them up
> 4) update the inode size in the shadow in-memory structure
> 5) start a transaction to log the inode size
> 6) flush the write cache to make sure the data really is on disk

Here there should also be a "6.1) wait for it to finish", which could be eliminated if the requests were sent as ORDERED, correct?

> 7) write out a log buffer containing the inode and btree updates
> 8) if the FUA bit is not supported flush the cache again
>
> and yes, the flush in 6) is important so that we don't happen
> to log the inode size update before all data has made it to disk
> in case the cache flush in 8) is interrupted

