On Thu, 25 Feb 2010 08:22:10 +0100
Goswin von Brederlow <goswin-v-b@xxxxxx> wrote:

> Neil Brown <neilb@xxxxxxx> writes:
>
> > On Wed, 24 Feb 2010 09:46:23 -0500
> > Bill Davidsen <davidsen@xxxxxxx> wrote:
> >
> >> > There is no question of data corruption.
> >> > When memory changes between being written to one device and to another, this
> >> > does not cause corruption, only inconsistency.  Either the block will be
> >> > written again consistently soon, or it will never be read.
> >> >
> >>
> >> Just what is it that rewrites the data block? The user program doesn't
> >> know it's needed, the filesystem, if any, doesn't know it's needed, and
> >> as far as I can tell md doesn't do checksum before issuing the write and
> >> after the last write is done. Doesn't make a copy and write from that.
> >> So what sees that the data has changed and rewrites it?
> >>
> >
> > The filesystem re-writes the block, though probably it is more accurate to
> > say 'the page cache' rewrites the block (the page cache is essentially just a
> > library of code that the filesystem uses).
> >
> > When a page is changed, its 'Dirty' flag is set.
> > Before a page is written out, the Dirty flag is cleared.
> > So if a page is written differently to two devices, then it must have been
> > changed after the Dirty flag was clear, so the Dirty flag will be set, so the
> > page cache will try to write it out again (after about 30 seconds or at
> > unmount time).
>
> So maybe MD could check the dirty flag after write and then output a
> warning so we can track down the issue. MD could also rewrite the page
> prior to setting the disks in-sync until the dirty bit is clear after a
> write.

md isn't able to see the dirty bit.  It gets a 'bio', which has a 'biovec'
which has a list of pages with offset and size.  It does not know if the
page is in the page cache or not so it cannot know if the dirty flag on the
page means anything or not.
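To make the Dirty-flag argument quoted above concrete, here is a toy model in Python. It is only a sketch: `Page`, `writeback` and the two dictionary "mirrors" are hypothetical stand-ins invented for illustration, not kernel structures and not md code. It shows a page changing between the write to the first device and the write to the second, and the later rewrite that makes the two copies agree again.

```python
# Toy model only: Page, writeback() and the "mirrors" are hypothetical
# stand-ins for illustration, not kernel structures or md code.

class Page:
    def __init__(self, data):
        self.data = data
        self.dirty = True          # new data that has not been written out

    def modify(self, data):
        self.data = data
        self.dirty = True          # any change sets the Dirty flag


def writeback(page, mirrors, racing_write=None):
    """Write one page to every mirror, as a mirrored write fans out.

    racing_write, if given, is data the application stores into the page
    after the first mirror has been written but before the second.
    """
    page.dirty = False                     # Dirty is cleared before the write
    for i, mirror in enumerate(mirrors):
        mirror["block"] = page.data        # the lower layer only sees the data
        if i == 0 and racing_write is not None:
            page.modify(racing_write)      # the change lands mid-write


def run_demo():
    page = Page("v1")
    mirrors = [{"block": None}, {"block": None}]

    # First write-out: the page changes between the two device writes.
    writeback(page, mirrors, racing_write="v2")
    after_first = (mirrors[0]["block"], mirrors[1]["block"], page.dirty)

    # Later (about 30 seconds, or at unmount) the page cache finds the
    # Dirty flag set and writes the page again, with no racing change.
    if page.dirty:
        writeback(page, mirrors)
    after_second = (mirrors[0]["block"], mirrors[1]["block"], page.dirty)
    return after_first, after_second
```

In this model the mirrors differ after the first write-out, but the page is left dirty, so the second write-out brings both copies to the same contents. Note that `writeback` only copies `page.data` to the mirrors; that mirrors md's position of being handed data to write without knowing what the flag on the page means.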

Yes, it technically could check the dirty bit and if it sees any of them set
then it could reschedule the writes.  However,
 1- this is a layering violation - it is the wrong thing to do.
 2- it might not work.  The filesystem could keep the 'dirty' status
    elsewhere such as in a 'buffer_head', and only copy it through to the
    page occasionally.
 3- it could cause a live-lock.  If an application is changing a mapped page
    quite regularly, then the current pagecache will write it out every 30
    seconds or so.  Your proposed change would write it out again and again
    as soon as the previous write completes.

So, no: we cannot do that.

NeilBrown
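
Point 3 above can be illustrated with a rough model. This is a sketch under stated assumptions, not a measurement: `count_writes` and its parameters are hypothetical, the 30-second period is the figure mentioned in this thread, and the application is assumed to touch the page at fixed intervals and at least once per period.

```python
WRITEBACK_PERIOD_MS = 30_000   # "every 30 seconds or so", as stated above


def count_writes(duration_ms, modify_every_ms, write_ms, resubmit):
    """Count the writes issued for one mapped page (toy model).

    The application modifies the page at every multiple of modify_every_ms.
    resubmit=False: the page is written once per writeback period.
    resubmit=True:  the page is rewritten immediately whenever it changed
                    while the previous write was in flight.
    Assumes the page is always dirty again by the next periodic tick.
    """
    writes = 0
    t = 0 if resubmit else WRITEBACK_PERIOD_MS   # start time of next write
    while t + write_ms <= duration_ms:
        writes += 1
        end = t + write_ms
        # Did a modification land while this write was in flight?
        changed = (end // modify_every_ms) > (t // modify_every_ms)
        if resubmit and changed:
            t = end                               # rewrite straight away
        else:
            # Wait for the next periodic writeback tick.
            t = (end // WRITEBACK_PERIOD_MS + 1) * WRITEBACK_PERIOD_MS
    return writes
```

With a page modified every 5 ms and a write that takes 10 ms, every write in the resubmit case finds the page changed, so the loop never reaches a point where the page is clean, while the periodic case issues one write per period.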