On 07/03/2013 11:18 PM, Vladislav Bolkhovitin wrote:
Ric Wheeler, on 07/03/2013 11:31 AM wrote:
Journals are normally big (128MB or so?) - I don't think that this is unique to xfs.
We're mixing a bunch of concepts here. The filesystems have a lot of
different requirements, and atomics are just one small part.
Creating a new file often uses resources freed by deleting past files, so
the delete of the old must be ordered against the allocation for the new.
They are really separate atomic units, but you can't handle them completely
independently.
If our existing journal commit is:
* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush
Which part of this does an atomic write help with?
We would still need at least:
* atomic write of data blocks & commit blocks
* flush
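For concreteness, a minimal sketch of the two commit sequences. dev_write(),
dev_flush() and dev_atomic_write_scattered() are hypothetical stand-ins for
bio submission and cache flushing, not real kernel interfaces:

#include <stddef.h>

struct extent {
        unsigned long long lba; /* target block address */
        const void *buf;        /* data to write */
        size_t len;             /* length in bytes */
};

int dev_write(const struct extent *e);                          /* hypothetical */
int dev_flush(void);                                            /* hypothetical */
int dev_atomic_write_scattered(const struct extent *v, int nr); /* hypothetical */

/* Today's journal commit: two flushes per transaction. */
void commit_today(const struct extent *data, int nr,
                  const struct extent *commit_blk)
{
        int i;

        for (i = 0; i < nr; i++)
                dev_write(&data[i]);
        dev_flush();            /* data must be stable before... */
        dev_write(commit_blk);
        dev_flush();            /* ...the commit block can be trusted */
}

/* With an atomic write: data + commit block as one all-or-nothing unit,
 * and the remaining flush is for durability only. */
void commit_atomic(const struct extent *all, int nr)
{
        dev_atomic_write_scattered(all, nr);
        dev_flush();
}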
Not necessary.
Consider the case where you are creating many small files in a big directory.
Every such operation needs three actions: add a new directory entry, allocate
free space, and write the data there. If one atomic (scattered) write command
is used for each operation, and you order the commands against each other
where needed, e.g. by using the ORDERED SCSI task attribute or by queue
draining, you don't need any intermediate flushes; only one final flush is
sufficient. In case of a crash, some of the new files would simply
"disappear", but everything would be fully consistent, so the only recovery
needed would be to recreate them.
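A sketch of that scheme, with the same hypothetical helpers as above;
ORDERED_SUBMIT is likewise made up, standing in for the SCSI ORDERED task
attribute (or queue draining):

#include <stddef.h>

struct extent { unsigned long long lba; const void *buf; size_t len; };

#define ORDERED_SUBMIT 0x1      /* hypothetical: ORDERED task attribute */

int dev_flush(void);                                    /* hypothetical */
int dev_atomic_write_scattered_flags(const struct extent *v,
                                     int nr, int flags); /* hypothetical */

struct file_op {
        struct extent dirent;   /* new directory entry */
        struct extent bitmap;   /* free-space allocation update */
        struct extent data;     /* the file contents */
};

/* One file create = one atomic scattered write of all three pieces,
 * ordered against the previous creates.  No intermediate flushes. */
void create_many(const struct file_op *ops, int n)
{
        int i;

        for (i = 0; i < n; i++) {
                struct extent v[3] = {
                        ops[i].dirent, ops[i].bitmap, ops[i].data
                };

                dev_atomic_write_scattered_flags(v, 3, ORDERED_SUBMIT);
        }
        dev_flush();    /* single durability point: a crash before this
                         * loses whole files, never partial ones */
}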
The worry I have is the intermediate state after we have sent the array a
scattered IO marked as atomic. Before we send down a queue flush of some
kind, can we trust the array to lose either all of those parts on power
failure or none of them?
Not to mention we still end up having to persist a broader range of data than we
would otherwise need.
An even worse nightmare would be sending down atomic scattered write A,
followed by atomic scattered write B, ..., followed by atomic scattered
write Y - all without a sync - followed by a crash. What semantics or
ordering promises do we have in this case if the power drops? Is there a
promise that they are durable in the sequence sent to the target, or could
we end up with write B and not write A after a crash?
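Spelled out with the same hypothetical helper, the question is which
post-crash states the spec would permit:

#include <stddef.h>

struct extent { unsigned long long lba; const void *buf; size_t len; };

int dev_atomic_write_scattered(const struct extent *v, int nr); /* hypothetical */

extern const struct extent A[2], B[2], Y[2];    /* illustrative atomic units */

void stream_without_sync(void)
{
        dev_atomic_write_scattered(A, 2);
        dev_atomic_write_scattered(B, 2);
        /* ... */
        dev_atomic_write_scattered(Y, 2);
        /* Power fails here, before any flush.  If completions imply
         * in-order durability, the survivors form a prefix: {}, {A},
         * {A,B}, ...  If only per-command atomicity is promised, any
         * subset can survive - e.g. {B} without {A}, which is useless
         * for ordering-dependent metadata. */
}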
The catch is that our current flush mechanisms are still pretty brute force:
they act either across the whole device or temporally (everything written
before this point is flushed before the ack).
I still think it would be useful to have the atomic write really be atomic
and durable just for that IO - no flush needed.
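Something in the spirit of the existing FUA bit, extended to cover the whole
atomic unit, would express this; ATOMIC_FUA and the helper below are
hypothetical:

#include <stddef.h>

struct extent { unsigned long long lba; const void *buf; size_t len; };

#define ATOMIC_FUA 0x2  /* hypothetical: whole unit on media at completion */

int dev_atomic_write_scattered_flags(const struct extent *v,
                                     int nr, int flags); /* hypothetical */

/* All-or-nothing AND stable on media when this returns: no separate
 * cache flush, and nothing unrelated in the write cache gets forced out. */
int commit_atomic_durable(const struct extent *all, int nr)
{
        return dev_atomic_write_scattered_flags(all, nr, ATOMIC_FUA);
}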
Can you give a sequence for the use case for the non-durable atomic write that
would not need a sync?
See above.
Your above example still had a flush (or use of ORDERED SCSI commands).
Can we really trust all devices to make something atomic
that is not durable :) ?
Sure, if the application allows that and the atomicity property itself is durable, why not?
Vlad
P.S. With atomic writes there's no need for a journal, no?
Durable and atomic are not the same - we need to make sure that the
specification is clear and that the behaviours are uniform (mandated) if we
are to make use of them. We have been burnt in the past by things like the
TRIM command leaving stale data with some vendors and not others (which led
to an update of the spec :))
I think that you would need to have durability between the atomic writes in
order to do away with the journal.
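If the durable-atomic command sketched above existed, the journal-free
variant would look like this; whether any device can actually promise these
semantics is exactly the spec question:

#include <stddef.h>

struct extent { unsigned long long lba; const void *buf; size_t len; };

#define ATOMIC_FUA 0x2  /* hypothetical, as above */

int dev_atomic_write_scattered_flags(const struct extent *v,
                                     int nr, int flags); /* hypothetical */

/* Journal-free metadata update: write the new copies of every touched
 * block in one atomic, durable command.  A crash leaves either the old
 * state or the new state - never a mix, and never reordered ahead of an
 * earlier acked update.  That combination is what replaces the journal. */
int update_without_journal(const struct extent *blocks, int nr)
{
        return dev_atomic_write_scattered_flags(blocks, nr, ATOMIC_FUA);
}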
Ric