Re: Why does one get mismatches?

Michael Evans wrote:
On Fri, Feb 26, 2010 at 2:20 PM, Asdo <asdo@xxxxxxxxxxxxx> wrote:
Neil Brown wrote:
Actually, I'm no longer convinced that the checksumming idea would work.
If a mem-mapped page that the app is updating every millisecond (i.e. at
intervals shorter than the write latency) were written, then every time a
write completed the checksum would be different, so we would have to
reschedule the write, which would not be the correct behaviour at all.
So I think that the only way to address this in the md layer is to copy
the data and write the copy.  There is already code to copy the data for
write-behind that could possibly be leveraged to do a copy always.
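
To illustrate the point above, here is a minimal user-space sketch (not md
code) of why writing the live page can leave two mirror legs different,
while writing a private copy cannot. The names mirror_write and touch are
hypothetical, and the "legs" are plain byte strings standing in for disks.

```python
def mirror_write(page, mutate_between, copy_first):
    """Simulate sending one page to two mirror legs.

    mutate_between stands in for an application that modifies a
    mem-mapped page while the write is still in flight.
    """
    # With copy_first, both legs read from a private snapshot.
    src = bytes(page) if copy_first else page
    leg_a = bytes(src)        # first leg picks up the data
    mutate_between(page)      # application changes the live page
    leg_b = bytes(src)        # second leg picks up the data later
    return leg_a, leg_b


def touch(page):
    # Hypothetical application update: flip the first byte.
    page[0] ^= 0xFF


a, b = mirror_write(bytearray(8), touch, copy_first=False)
print("no copy, legs match:", a == b)

a, b = mirror_write(bytearray(8), touch, copy_first=True)
print("copy,    legs match:", a == b)
```

Without the copy the second leg sees the modified byte; with the copy both
legs receive the same snapshot, at the cost of one extra memory copy per
write.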

The concerns about slowdowns from copying could be addressed by making the
copy a runtime choice, triggered through a sysctl interface or a file in
/sys/block/mdX/md/ where one can echo "1" to enable copies for this type of
raid. Or better, 1 could be the default (slower but safer, or if not safer,
at least it would avoid needless questions about mismatches on this ML from
new users, and allow detection of REAL mismatches, which can be due to
cabling or defective disks), and echoing 0 would improve performance at the
cost of seeing lots of false-positive mismatches.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Isn't there some way of making the page copy-on-write using hardware
and/or an in-kernel structure?  Ideally copying could be avoided
/unless/ there is a change.  That way each operation looks like an
atomic commit.

As I think about this, one idea is to add a write-in-progress flag, so that the filesystem, or library, or whatever would know not to change the page. That would mean that every filesystem would need to be enhanced, or that the "safe write" would be optional on a per-filesystem basis. An implementation of O_DIRECT could do it, or not, and there could be a safe way to write.

However, it occurs to me that there are several other levels involved, so this could be better but not perfect. While md could flag the start and finish of a write, the next level down, the device driver, would then need to do the same thing, so that md knows when the data no longer need to be frozen. "But wait, there's more," as they say: the device driver needs to track when the data are transferred to the actual device, and the device needs to report when the data actually hit the platter, or you could still have mismatches.

All of that is reminiscent of the discussion of barriers, flush cache commands, and other performance-impacting practices. So in the long run I think the most effective solution, the one with the greatest improvement at the lowest cost in performance, is a copy. Now if Neil liked my idea of doing a checksum before and after a write, and copying only in the cases where the data had changed, the impact could be pretty small.
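
The checksum-then-copy idea can be sketched in user space as follows. This is only an illustration under stated assumptions, not md code: checked_write and scribble are hypothetical names, the legs are bytearrays standing in for mirror devices, and CRC32 from Python's zlib stands in for whatever checksum would really be used.

```python
import zlib


def checked_write(page, legs, during_write=None):
    """Write the live page to every leg; redo from a copy if it changed.

    Returns True when the fallback copy was needed.
    """
    before = zlib.crc32(page)
    for i, leg in enumerate(legs):
        leg[:] = page                 # zero-copy path: read the live page
        if during_write and i == 0:
            during_write(page)        # application modifies page mid-write
    if zlib.crc32(page) == before:
        return False                  # page looks stable, no copy needed
    snapshot = bytes(page)            # page changed: copy once ...
    for leg in legs:
        leg[:] = snapshot             # ... and rewrite every leg from it
    return True


def scribble(page):
    # Hypothetical application update while the write is in flight.
    page[0] = ord("z")


legs = [bytearray(4), bytearray(4)]
print("copy needed:", checked_write(bytearray(b"abcd"), legs))
print("copy needed:", checked_write(bytearray(b"abcd"), legs, scribble))
print("legs match: ", legs[0] == legs[1])
```

Because the retry writes from a snapshot, it is done at most once per write rather than being rescheduled indefinitely. One limitation of this sketch: a page that changes and then changes back between the two checksums would pass the check even though the legs may differ.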

All that depends on two things: Neil thinking the whole thing is worth doing, and no one finding a flaw in my proposal to do a checksum rather than a copy each time.

And to return to your original question: no. Hardware COW works on memory pages, but a buffer could span pages, and a write to a page might not land in the part of the page used for the i/o buffer. So as nice as that would be, I don't think the hardware supports it. And even if it did, the COW needs to be done in the layer which tries to change the buffer, so md would set COW and the filesystem would have to deal with it. I am pretty sure that's a layering violation, big time. The advisory "write in progress" flag might be acceptable, since it's information the f/s can use or not.
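
The page-granularity problem can be made concrete with a little arithmetic. The sketch below assumes a 4096-byte page and uses made-up addresses; pages_spanned and write_hits_buffer are hypothetical helper names, not kernel functions.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def pages_spanned(buf_start, buf_len):
    """Page numbers that would have to be write-protected for the buffer."""
    first = buf_start // PAGE_SIZE
    last = (buf_start + buf_len - 1) // PAGE_SIZE
    return list(range(first, last + 1))


def write_hits_buffer(write_addr, buf_start, buf_len):
    """True only if the faulting write actually lands inside the buffer."""
    return buf_start <= write_addr < buf_start + buf_len


# A 4096-byte i/o buffer that starts halfway through page 1.
buf_start, buf_len = 6144, 4096
protected = pages_spanned(buf_start, buf_len)
print("protected pages:", protected)

# A write to address 4100 is on page 1, so it would fault ...
addr = 4100
print("faults:", addr // PAGE_SIZE in protected)
# ... even though it does not touch the buffer at all.
print("touches buffer:", write_hits_buffer(addr, buf_start, buf_len))
```

Here one buffer forces two pages to be protected, and a write elsewhere on one of those pages would trigger a fault without affecting the data being written out.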

--
Bill Davidsen <davidsen@xxxxxxx>
 "We can't solve today's problems by using the same thinking we
  used in creating them." - Einstein
