Hi,

On Wed, Mar 06, 2002 at 11:14:42AM +0100, Heinz J. Mauelshagen wrote:
> I heard about it yesterday.
>
> TTBOMK none of the RedHat folks came back to us directly or on any LVM
> list :-(
>
> They definitely should feed back the code changes to us or give
> pointers, where to retrieve them.

Will do. I didn't realise it had gone out to rawhide quite so quickly
--- sorry about that.

It's a test patch to try to make pvmove not quite so broken. I started
thinking about it a few weeks back when I realised that the current
pvmove code has a bug which affects all filesystems, but which ext2 gets
away with most of the time.

That particular problem is the use of fsync_dev() to flush pending IOs
to disk when we lock a PE for pvmove. fsync_dev() doesn't work: it
flushes all buffers which are marked dirty in the buffer cache, true,
but there are many cases where IOs bypass the buffer cache entirely
(raw IO, swap, O_DIRECT, and the ext3 journal), and there are also
cases in reiserfs where dirty data in the buffer cache is temporarily
marked not-dirty, confusing fsync_dev().

The impact of the bug is that although the PE lock prevents new IO from
taking place on the corresponding LE, it does _not_ wait for all
existing IO to complete; so it is quite possible for pvmove to copy the
imperfectly-locked old PE to the new PV, only for a pending write to
complete afterwards on the old PE. Net effect --- data corruption; we
have essentially lost a write.

There were two other related bugs in pvmove. One is that it uses
buffered IO against the physical disk device to do the PE copy, and
that buffered IO is not cache-coherent with respect to changes
happening on the logical devices. Anything which populates the buffer
cache with the contents of the PE can cause *massive* corruption on
pvmove, as we propagate bulk stale data to the new disk. The other
related bug is that because the PE copy is done in user space, it is
prone to VM deadlock.
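To make the "lock must also drain in-flight IO" point concrete, here is a
toy user-space model of the per-extent accounting a safe PE lock needs:
every IO remapped through lvm_map would bump a pending count (in the real
patch, by pushing a tracking buffer_head whose completion callback does
the decrement), and the lock is only complete once that count drains to
zero. All names here (pe_state, io_start, io_done, pe_lock_quiesced) are
illustrative, not the real LVM1 symbols, and the real kernel code would
sleep and use proper locking rather than return early.

```c
#include <assert.h>

/* Per-PE state: how many IOs are in flight, and whether the PE is
 * locked for pvmove.  A real implementation needs a spinlock and a
 * wait queue around this; omitted to keep the model small. */
struct pe_state {
	int pending;	/* IOs remapped onto this PE, not yet completed */
	int locked;	/* PE locked for pvmove: refuse new IO */
};

/* Called where lvm_map remaps a buffer_head onto this PE. */
int io_start(struct pe_state *pe)
{
	if (pe->locked)
		return -1;	/* the PE lock holds off new IO */
	pe->pending++;
	return 0;
}

/* Called from the tracking buffer_head's completion (b_end_io). */
void io_done(struct pe_state *pe)
{
	pe->pending--;
}

/* Safe lock: refuse new IO, then report whether in-flight IO has
 * drained.  The kernel patch would sleep here until pending == 0,
 * instead of the fsync_dev() call that misses non-buffer-cache IO. */
int pe_lock_quiesced(struct pe_state *pe)
{
	pe->locked = 1;
	return pe->pending == 0;
}
```

The point of the model is the ordering: fsync_dev() can return while
pending is still non-zero, which is exactly the window in which pvmove
copies stale data.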
The solution I've been testing is to add a locked-PE-copy ioctl, which
locks the PE *safely*, copies it using kiobufs, and then updates the LV
maps, all in one ioctl. Doing the safe PE lock is a little tricky, since
it requires that LVM track all IOs in transit so that it can do a
guaranteed flush of pending IOs, rather than the half-hearted attempt
that fsync_dev() makes.

The patch is in two chunks: one factors out the kiobuf bulk-copy code
from the snapshot support, and the second implements the new ioctl. It
seems to work well in testing, and I haven't measured any performance
penalty from the new IO tracking. However, that tracking _will_ have
some cost, because it basically cannot be done without pushing a new
buffer_head on top of each incoming one passing through lvm_map.

There's a third patch involved which is just a backport of 2.5's mempool
code, an efficient and deadlock-free way of allocating the buffer_heads
used to track the pending IOs.

I had a long exchange with Joe Thornber about all of this --- his
solution for LVM2 looks really solid (and incidentally requires the same
pushing of completion structs that I do in my LVM1 patch), so for now
I'm just looking for the minimal fix necessary to really cure the bugs
in LVM1.

The patches I'm using are at

    http://people.redhat.com/sct/patches/lvm/

They apply to our rawhide kernel, which is basically 2.4.18-ac3, and I
think they should apply to any 2.4.19 kernel. One of the patches brings
the important fixes from 1.0.3 into the Marcelo kernel (most are already
in 2.4, of course). It probably won't apply directly to CVS 1.0.3,
because that and 2.4 have diverged quite a bit recently.

lvm-1.0.3-pvmove.patch is the user-space diff, and it tries quite hard
to fall back from the new ioctl to the old pvmove code if the ioctl
doesn't seem to be present (it caches that result and doesn't even try
it the second time).
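The "probe once, cache the miss" fallback in the user-space diff can be
sketched like this. This is a self-contained model, not the real tool
code: try_new_ioctl() and old_pvmove_copy() are stand-ins for the real
ioctl call and the legacy copy path, and the call counters exist only so
the caching behaviour can be observed. An old kernel rejects an unknown
ioctl with ENOTTY, which is what the probe keys off.

```c
#include <errno.h>

/* Counters so the caching behaviour is observable in this model. */
static int new_ioctl_calls = 0;
static int old_path_calls = 0;

/* -1 = not probed yet, 0 = kernel lacks the ioctl, 1 = kernel has it */
static int new_ioctl_state = -1;

/* Stand-in for the new locked-PE-copy ioctl, as issued against a
 * kernel WITHOUT the patch: the request is unknown, so ENOTTY. */
static int try_new_ioctl(void)
{
	new_ioctl_calls++;
	errno = ENOTTY;
	return -1;
}

/* Stand-in for the original user-space PE copy. */
static int old_pvmove_copy(void)
{
	old_path_calls++;
	return 0;
}

int pvmove_copy_pe(void)
{
	if (new_ioctl_state != 0) {
		if (try_new_ioctl() == 0) {
			new_ioctl_state = 1;	/* kernel supports it */
			return 0;
		}
		if (errno != ENOTTY)
			return -1;	/* real failure from a supporting kernel */
		new_ioctl_state = 0;	/* cache the miss: never probe again */
	}
	return old_pvmove_copy();	/* legacy user-space copy */
}
```

Note the errno check: only ENOTTY means "ioctl absent, fall back"; any
other error from a kernel that does have the ioctl must be reported, not
papered over by the old path.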
Cheers,
 Stephen

_______________________________________________
linux-lvm mailing list
linux-lvm@sistina.com
http://lists.sistina.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://www.sistina.com/lvm/Pages/howto.html