Neil Brown wrote:
> On Fri, 26 Feb 2010 15:48:58 -0500
> Bill Davidsen <davidsen@xxxxxxx> wrote:
>>> The idea of calculating a checksum before and after certainly has
>>> some merit, if we could choose a checksum algorithm which was
>>> sufficiently strong and sufficiently fast, though in many cases a
>>> large part of the cost would just be bringing the page contents into
>>> cache - twice.
>>> It has the advantage over copying the page of not needing to
>>> allocate extra memory.
>>> If someone wanted to try and prototype this and see how it goes, I'd
>>> be happy to advise....
>> Disagree if you wish, but MD5 should be fine for this. While it is
>> not cryptographically strong on files, where the size can be changed
>> and evildoers can calculate values to append to the data, it should
>> be adequate for data of unchanging size. It's cheap, fast, and
>> readily available.
> Actually, I'm no longer convinced that the checksumming idea would
> work. If we wrote out a memory-mapped page that the app is updating
> every millisecond (i.e. at intervals shorter than the write latency),
> then every time a write completed the checksum would be different, so
> we would have to reschedule the write, which would not be the correct
> behaviour at all.
> So I think that the only way to address this in the md layer is to
> copy the data and write the copy. There is already code to copy the
> data for write-behind that could possibly be leveraged to do a copy
> always.
Your point about that possibility is valid, but consider this: if the
checksum comparison fails, do the copy at that point and write again.
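
To sketch what I mean (a rough illustration only; helper names like
page_checksum() and write_page_to_mirrors() are made up for this
example, they are not real md functions):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical: checksum one page. MD5 or any fast hash would do,
 * since the length never changes. */
extern uint32_t page_checksum(const void *page);

/* Hypothetical: issue the same buffer to every mirror and wait. */
extern void write_page_to_mirrors(const void *page);

/*
 * Write a page to all mirrors, detecting concurrent modification.
 * Only when the page changed under us do we pay for a private copy,
 * so the common (stable) case stays cheap.
 */
static void raid1_stable_write(const void *page)
{
	uint32_t before = page_checksum(page);

	write_page_to_mirrors(page);

	if (page_checksum(page) != before) {
		/* The page changed mid-write, so the mirrors may now
		 * differ. Snapshot it once and rewrite the frozen copy
		 * everywhere. */
		void *copy = malloc(PAGE_SIZE);

		if (!copy)
			return; /* no memory: leave mismatch for resync */
		memcpy(copy, page, PAGE_SIZE);
		write_page_to_mirrors(copy);
		free(copy);
	}
}

The rewrite uses a frozen private copy, so it cannot mismatch again;
the worst case is one extra checksum and one extra write per racing
update, which should avoid the reschedule-forever behaviour you
describe.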
> Or I could just stop setting mismatch_cnt for raid1 and raid10. That
> would also fix the problem :-)
s/fix/hide/ ;-)
My feeling is that there are many ways to change the data in flight:
O_DIRECT, aio, threads, mmap, and probably some I haven't found yet.
Rather than trying to prevent that, which would take a flaming layer
violation, perhaps use my thought above: detect the fact that the data
has changed, and at that point do a copy and write unchanging data to
all drives. How that plays with O_DIRECT I can't say, but it sounds to
me as if it should eliminate the mismatches without a huge performance
impact. Let me know if this addresses your concern about rescheduling
the write forever, without taking much overhead.
The question is why this happens with raid-1 and doesn't seem to with
raid-[56]. And I don't see mismatches on my raid-10, although I'm
pretty sure that neither mmap nor O_DIRECT is used on those arrays.
What would seem to be optimal is some COW (copy-on-write) protection on
the buffer, to prevent it from being modified while it's being used for
actual i/o. The hardware doesn't seem to support that, though: page
size, buffer size, and sector size all vary.
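
As a purely illustrative userspace analogue of that COW idea (not
something md could use as-is), write-protecting the page for the
duration of the i/o would make a concurrent store fault instead of
silently changing the buffer under the device. Note that mprotect()
works only on whole pages, which is exactly the granularity mismatch
above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *buf;

	/* Page-aligned buffer, since protection is per page. */
	if (posix_memalign((void **)&buf, psz, psz))
		return 1;
	memset(buf, 'A', psz);

	/* "Start of i/o": freeze the page. A concurrent writer would
	 * now take SIGSEGV instead of changing the data under us. */
	if (mprotect(buf, psz, PROT_READ))
		return 1;

	/* ... the device would see a guaranteed-stable buffer here ... */
	printf("page at %p frozen for the duration of the write\n",
	       (void *)buf);

	/* "End of i/o": thaw the page again. */
	mprotect(buf, psz, PROT_READ | PROT_WRITE);
	free(buf);
	return 0;
}

In the kernel the analogue would presumably be write-protecting the
page table entries while the page is under writeback, but as said, page
size, buffer size and sector size don't line up neatly.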
--
Bill Davidsen <davidsen@xxxxxxx>
"We can't solve today's problems by using the same thinking we
used in creating them." - Einstein