On Wednesday April 11, dan.j.williams@xxxxxxxxx wrote:
> > From: Mark Hahn [mailto:hahn@xxxxxxxxxxx]
> >
> > > In its current implementation write-back mode acknowledges writes
> > > before they have reached non-volatile media.
> >
> > which is basically normal for unix, no?
> I am referring to when bi_end_io is called on the bio submitted to MD.
> Normally it is not called until after the bi_end_io event for the bio
> submitted to the backing disk.
>
> > are you planning to support barriers? (which are the block system's
> > way of supporting filesystem atomicity).
> Not as a part of these performance experiments.  But, I have wondered
> what the underlying issues are behind raid5 not supporting barriers.
> Currently in raid5.c:make_request:
>
> 	if (unlikely(bio_barrier(bi))) {
> 		bio_endio(bi, bi->bi_size, -EOPNOTSUPP);
> 		return 0;
> 	}

I should be getting this explanation down to a fine art.  I seem to be
delivering it in multiple forums.

My position is that for a virtual device that stores some blocks on
some devices and other blocks on other devices (e.g. raid0, raid5,
linear, LVM, but not raid1), barrier support in the individual devices
is unusable, and that to achieve the same goal it is just as easy for
the filesystem to order requests itself and to use blkdev_issue_flush
to force sync-to-disk.

The semantics of a barrier (as I understand it) are that all writes
prior to the barrier are safe before the barrier write is commenced,
and that the barrier write itself is safe before any subsequent write
is commenced.  (I think those semantics are stronger than we should be
exporting - just the first half should be enough - but such is life.)

On a single drive, this is achieved by not re-ordering requests around
a barrier, and by asking the device not to re-order requests either.
When you have multiple devices, you cannot ask them not to re-order
requests with respect to each other, so the same mechanism cannot be
used.  Instead, you would have to:
 - plug the meta-device and unplug all the lower-level queues,
 - wait for all writes to complete,
 - call blkdev_issue_flush to make sure the data is safe,
 - issue the barrier write and wait for it to complete,
 - call blkdev_issue_flush again (well, maybe the barrier write could
   have been sent with BIO_RW_BARRIER for the same effect),
 - then unplug the queue.

And the thing is that all of that complexity ALREADY needs to be in
the filesystem.  Because if a device doesn't support barriers, the
filesystem should wait for all dependent writes to complete and then
issue the 'barrier' write (and probably call blkdev_issue_flush as
well).  And the filesystem is positioned to do this BETTER, because it
can know which writes are really dependent and which might be
incidental.

Ext3 gets this right, except that it never bothers with
blkdev_issue_flush.  XFS doesn't even bother trying (it's designed to
be used with reliable drives).  reiserfs might actually get it
completely right, as it does have a call to blkdev_issue_flush in what
looks like the right place, but I cannot be sure without lots of code
review.

dm/stripe currently gets this wrong.  If it gets a barrier request it
just passes it down to the one target drive, thus failing to ensure
any ordering with respect to the other drives.

All that said: raid5 is probably in a better position than most to
implement a barrier, as it keeps careful track of everything that is
happening and could easily wait for all prior writes to complete.
This might mesh well with the write-behind approach to caching.  But I
would still rather that the filesystem just got it right for us.
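
To make that "the filesystem gets it right itself" sequence concrete,
here is a rough user-space sketch of the ordering only - not how any
real filesystem implements it.  fdatasync() stands in for
blkdev_issue_flush(), the files stand in for component devices, and
the file names, commit record and write_all() helper are invented for
illustration:

	/*
	 * Sketch: issue the dependent writes, make them safe, and only
	 * then issue the "barrier" write and make that safe as well.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	static void write_all(int fd, const void *buf, size_t len)
	{
		const char *p = buf;
		while (len > 0) {
			ssize_t n = write(fd, p, len);
			if (n < 0) { perror("write"); exit(1); }
			p += n;
			len -= n;
		}
	}

	int main(void)
	{
		/* Two files stand in for the components of a striped device. */
		int dev0 = open("dev0.img", O_WRONLY | O_CREAT, 0644);
		int dev1 = open("dev1.img", O_WRONLY | O_CREAT, 0644);
		int log  = open("journal.img", O_WRONLY | O_CREAT, 0644);
		if (dev0 < 0 || dev1 < 0 || log < 0) { perror("open"); exit(1); }

		char data[4096];
		memset(data, 'D', sizeof(data));

		/* 1. Issue all the dependent writes. */
		write_all(dev0, data, sizeof(data));
		write_all(dev1, data, sizeof(data));

		/* 2. Wait until they are safe on stable storage -
		 *    the blkdev_issue_flush step above. */
		if (fdatasync(dev0) < 0 || fdatasync(dev1) < 0) {
			perror("fdatasync"); exit(1);
		}

		/* 3. Only now issue the "barrier" write (think: journal
		 *    commit record) and make it safe as well. */
		write_all(log, "COMMIT", 6);
		if (fdatasync(log) < 0) { perror("fdatasync"); exit(1); }

		close(dev0); close(dev1); close(log);
		return 0;
	}

The point is just the ordering: nothing that depends on the commit
record reaching the media is issued until the writes it depends on
have been flushed.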
With a single drive, the drive can implement a barrier more
efficiently than the filesystem.  With multiple drives, the
meta-device can at best be as efficient as the filesystem.

NeilBrown