Neil Brown <neilb@xxxxxxxxxxxxxxx> wrote:
> Well, how about I wade in.....

Sure! :-)

> A RAID controller, whether software, firmware, or hardware, will also
> re-order requests to make best use of the devices.

Possibly. I have written block device drivers that maintain write order,
however (or at least do so if you ask them to, with the right switch),
because ...

> Any filesystem that assumes that requests will not be re-ordered is
> broken, as the assumption is wrong.
> I would be *very* surprised if Reiserfs makes this assumption.

.. because that is EXACTLY what Hans Reiser has said to me. I don't think
I've kept the mail, but I remember it. A quick google for reiserfs + write
ordering shows up some suggestive quotes:

  > We cannot use the buffer.c dirty list anyway because bdflush can write
  > those buffers to disk at any time. Transactions have to control the
  > write ordering ...

(hey, that was Hans quoting Stephen). From the Linux High Availability
website (http://linuxha.trick.ca/DataRedundancyByDrbd):

  Since later WRITE requests might depend on successful finished previous
  ones, this is needed to assure strict write ordering on both nodes. ...

Well, I'm not going to search now. One can simply ask HR and find out what
the current status is vis a vis reiserfs.

To be certain what I am talking about, I'll define write ordering as:

  Writes are not reordered and reads may not be reordered beyond the
  writes that bound them either side.

> Until relatively recently, the only assumption that could be made is
> that a write request will be handled sometime between when it is made,
> and when the request completes (i.e. the end_io callback is called).

This _appears_ to be what Stephen is saying he needs, from which I deduce
that he probably has a single-threaded implementation in ext3, because:

> If several requests are concurrent they could commit in any order.

Yes.
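
The definition of write ordering given above can be checked mechanically.
A minimal Python sketch, assuming (purely for illustration) that each
request is a unique (kind, tag) pair with kind "W" or "R". The function
name preserves_write_ordering is made up for this note and is not kernel
code:

```python
def preserves_write_ordering(submitted, executed):
    """Return True if `executed` respects the write-ordering definition.

    Both arguments are lists of unique (kind, tag) pairs, where kind is
    "W" (write) or "R" (read). This representation is an assumption.
    """
    if sorted(submitted) != sorted(executed):
        raise ValueError("executed must be a permutation of submitted")

    def writes_only(seq):
        return [op for op in seq if op[0] == "W"]

    # Rule 1: writes complete in exactly the order they were submitted.
    if writes_only(submitted) != writes_only(executed):
        return False

    def writes_before_each_read(seq):
        count, result = 0, {}
        for op in seq:
            if op[0] == "W":
                count += 1
            else:
                result[op] = count
        return result

    # Rule 2: each read stays between the same two writes that bounded it,
    # i.e. the number of writes preceding it is unchanged.
    return writes_before_each_read(submitted) == writes_before_each_read(executed)
```

With the submitted order W1 R1 W2, the executed order W1 W2 R1 fails the
check: that is the case where R1 would see what W2 wrote. Reads that sit
between the same pair of writes may still swap places with each other.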

> With only this guarantee, the simplest approach for a journalling
> filesystem is to write the content of a journal entry, wait for the
> writes to complete, and then write a single block "header" which
> describes and hence commits that journal entry. The journal entry is
> not "safe" until this second write completes.
>
> This is equally applicable for IDE drives, SCSI drives, software
> RAID1, software RAID5, hardware RAID etc.

I would agree. No matter what the hardware people say, or what claims are
made for equipment, I don't see how there can be any way of knowing the
order in which writes are committed internally. I only hope that reads are
not reordered beyond writes :-). (This would cause you to read the wrong
data: you go W1 R1 W2 and yet read in R1 what you wrote in W2!)

> More recently (2.6 only) Linux has had support for "write barriers".
> The idea here is that you submit a number of write requests, then a

Well, I seem to recall that at some point "request specials" were to act
as request barriers, but I don't know if that is still the case. When a
driver received one it had to flush all outstanding requests before acking
the special. Perhaps Linus gave up on people implementing that and put the
support in the kernel core, so as to enforce it? It would be possible. Or
maybe he dropped it.

> "barrier", then some more write requests. (The "barrier" might be a
> flag on the last request of a list, I'm not sure of that detail). The
> meaning is that no write request submitted after the barrier will be
> attempted until all requests submitted before the barrier are
> complete. Some drives support this concept natively so Linux simply
> does not re-order requests across a barrier, and sends the barrier at
> the appropriate time. Drives can do their own re-ordering but will
> not reorder across a barrier (if they support the barrier concept).

Yes.
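
The barrier semantics quoted above can be modelled in a few lines. This is
a toy Python sketch, not the kernel's implementation: the BARRIER marker
and the dispatch function are invented names, and shuffling merely stands
in for whatever re-ordering an elevator or a drive might do:

```python
import random

BARRIER = "BARRIER"  # hypothetical marker, not a kernel constant


def dispatch(requests, rng=None):
    """Return one possible completion order for `requests`.

    Requests submitted between two barriers may complete in any order,
    but no request crosses a barrier in either direction.
    """
    if rng is None:
        rng = random.Random()
    completed, segment = [], []
    for req in requests:
        if req == BARRIER:
            # Everything submitted before the barrier completes first.
            rng.shuffle(segment)
            completed.extend(segment)
            segment = []
        else:
            segment.append(req)
    # Requests after the last barrier are again free to be re-ordered.
    rng.shuffle(segment)
    completed.extend(segment)
    return completed
```

For the input a, b, c, BARRIER, d, e the first three completions are always
some permutation of a, b, c and the last two some permutation of d, e.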

> If Linux needs to write a barrier to a device that doesn't support
> barriers (as the md/raid currently doesn't) it will (should) submit
> all requests before the barrier, flush them out, wait for them to
> complete, then allow other requests to be forwarded.

But I at least don't know if "Linux" does that :-). By "Linux" you either
mean some part of the block subsystem, or fs's acting on their own.

> In short, md/raid provides the same guarantees as normal drives, and
> any filesystem that expects more is broken.

Normal drives do not reorder writes. Their drivers also probably make no
attempt to do so, nor not to do so, but in the nature of things (single
input, single output) it is unlikely that they do. Software RAID on the
other hand is fundamentally parallel so the intrinsic likelihood that
something somewhere gets reordered is much higher, and I believe you agree
with me that no attempt is made to either check on or prevent it.

> Definitely put your journal on RAID with at least as much redundancy
> as your main filesystem (I put my filesystem on raid5 and my journal
> on raid1).

:-) But I don't think you ought to put the journal on raid - what good
does it do you to do so? (apart from testing out raid :). After all, the
journal integrity is not itself guaranteed by a journal, and it is a point
of failure for the whole system, and it is a point where you have doubled
i/o density over and above the normal journal rate, which is already
extremely high if you do data journalling, since ALL the data on the
system flows through that point first. So you will stress the disk there
as well as making all your data vulnerable to anything that happens there.

What extra benefit do you get from putting it there that is not balanced
by greater risks? I'm curious! Surely raid is about "spreading your eggs
out through several baskets"?
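
For reference, the write, wait, then commit scheme quoted earlier works the
same way wherever the journal lives. A user-space Python sketch under
stated assumptions: the record layout (a 4-byte "CMIT" tag plus the payload
length) and the name commit_entry are hypothetical, and os.fsync is only as
strong as the OS and the drive make it:

```python
import os
import struct


def _write_all(fd, data):
    # os.write may write fewer bytes than requested, so loop until done.
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def commit_entry(path, payload):
    """Append one journal entry: payload first, commit record second."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        # Step 1: write the entry content and wait for it to complete.
        _write_all(fd, payload)
        os.fsync(fd)
        # Step 2: only now write the small record that commits the entry.
        # The entry is not "safe" until this second write completes.
        _write_all(fd, struct.pack("<4sI", b"CMIT", len(payload)))
        os.fsync(fd)
    finally:
        os.close(fd)
```

The point of the two waits is that nothing here relies on the lower layers
preserving order: the commit record is not even submitted until the
payload is reported complete.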

Peter