Hi,

On Sun, Oct 21, 2001 at 03:25:56PM -0400, Jason A. Lixfeld wrote:

> Folks, I'm really stressed here.  I'm sending this to both lists to
> see if anyone can offer any assistance.

> Anyway, 2.4.10-ac11 worked fine for about 5 days.  We started to get
> low on space on the RAID, so we deleted stuff off of one of the LVMs
> to make room, and then we moved stuff from the RAID over to the LVM
> we had just freed up space on.

We test ext3 extensively under load, but if it has particular problems
over LVM I'd be interested in knowing.  All I can suggest right now to
narrow things down is that you see whether ext2 works any better.

Just glancing over the LVM code, though, I don't think that its locking
code is safe in the presence of other filesystem activity.
lvm_do_pe_lock_unlock does try to flush existing IO, but it does so
with

	pe_lock_req.lock = UNLOCK_PE;
	fsync_dev(pe_lock_req.data.lv_dev);
	pe_lock_req.lock = LOCK_PE;

which (a) doesn't wait for existing IO to complete if that IO was
submitted outside the buffer cache (so it won't catch raw IO, direct
IO, journal activity, or RAID1 IOs); and (b) allows new IO to be
submitted while the fsync is going on, so by the time LOCK_PE is
re-asserted, loads of new IO may already have been submitted to the
device.

LVM folks, am I missing something here?  I can't see how you can assert
that the device is truly quiescent after LOCK_PE has been set.

The 1.0.1-rc4 code seems to be improved in that it does another
fsync_dev after finally setting LOCK_PE, but fsync_dev is still
inadequate here for any IO submitted directly via submit_bh(), rather
than through the buffer cache.  This bug would be more likely to hit
ext3 than ext2, as ext3 uses submit_bh directly for a lot of its
journal IO, but there are plenty of cases outside ext3 that will also
hit this problem.

Cheers,
 Stephen
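
----

To make point (b) concrete: a quiesce that is actually safe has to
assert the lock first, and only then wait for in-flight IO to drain,
counting every request regardless of how it was submitted.  The
following is a minimal userspace sketch of that pattern using plain
pthreads; it is illustrative only, not LVM or kernel code, and every
name in it (locked, in_flight, quiesce, ...) is invented for the
example.

	/*
	 * Illustrative userspace model of a safe quiesce: block NEW
	 * submissions first, then wait for everything already in
	 * flight.  Contrast with the UNLOCK_PE / fsync_dev / LOCK_PE
	 * sequence quoted above, which flushes before locking and so
	 * leaves a window for fresh IO to slip in.
	 */
	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
	static int locked;       /* analogue of LOCK_PE being set       */
	static int in_flight;    /* requests submitted but not complete */

	/* Every submission path (buffer cache, direct submission, ...)
	 * goes through here, so in_flight counts ALL outstanding IO --
	 * exactly the accounting that fsync_dev cannot provide. */
	static void submit_request(void)
	{
		pthread_mutex_lock(&lock);
		while (locked)                 /* no new IO while locked */
			pthread_cond_wait(&wake, &lock);
		in_flight++;
		pthread_mutex_unlock(&lock);
	}

	/* Called from IO completion. */
	static void complete_request(void)
	{
		pthread_mutex_lock(&lock);
		if (--in_flight == 0)
			pthread_cond_broadcast(&wake);
		pthread_mutex_unlock(&lock);
	}

	/* Safe ordering: assert the lock FIRST, then drain. */
	static void quiesce(void)
	{
		pthread_mutex_lock(&lock);
		locked = 1;                    /* new submissions block now */
		while (in_flight > 0)          /* wait out existing IO      */
			pthread_cond_wait(&wake, &lock);
		pthread_mutex_unlock(&lock);
	}

	static void unquiesce(void)
	{
		pthread_mutex_lock(&lock);
		locked = 0;
		pthread_cond_broadcast(&wake);
		pthread_mutex_unlock(&lock);
	}

	static void *writer(void *arg)
	{
		(void)arg;
		for (int i = 0; i < 1000; i++) {
			submit_request();
			complete_request();
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		pthread_create(&t, NULL, writer, NULL);
		quiesce();
		/* Guaranteed: in_flight == 0 and no submitter can run. */
		printf("device quiesced; no IO in flight\n");
		unquiesce();
		pthread_join(t, NULL);
		return 0;
	}

Once quiesce() returns, the device really is idle: nothing can be
outstanding, because every submission path is blocked and the in-flight
count has reached zero.  The LOCK_PE/fsync_dev sequence gives neither
guarantee.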
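On the submit_bh() point specifically: fsync_dev can only wait on
buffers the buffer cache knows to be dirty, so a request handed
straight to the driver never appears on any list it walks.  A toy
model, again purely illustrative (the names write_via_cache,
write_direct and flush_dev are invented, and the arrays stand in for
the 2.4 dirty-buffer lists and request queue):

	/* Two submission paths into one request queue.  The flush
	 * walks only the dirty list, so it cannot account for
	 * requests that arrived via the direct path. */
	#include <stdio.h>

	#define MAXQ 16

	static int dirty[MAXQ],  n_dirty;    /* buffer-cache dirty bufs */
	static int queued[MAXQ], n_queued;   /* requests at the driver  */

	/* Ordinary write through the buffer cache. */
	static void write_via_cache(int blk) { dirty[n_dirty++] = blk; }

	/* submit_bh()-style: straight to the driver, no dirty entry. */
	static void write_direct(int blk)    { queued[n_queued++] = blk; }

	/* fsync_dev analogue: submits and waits for the DIRTY buffers,
	 * but has no handle on IO that entered the queue directly. */
	static void flush_dev(void)
	{
		while (n_dirty > 0) {
			int blk = dirty[--n_dirty];
			(void)blk;     /* ... submit and wait for it ... */
		}
	}

	int main(void)
	{
		write_via_cache(1);
		write_direct(2);               /* e.g. a journal write */
		flush_dev();
		printf("dirty=%d, still queued=%d\n", n_dirty, n_queued);
		/* prints "dirty=0, still queued=1": flushed, not idle */
		return 0;
	}

The flush leaves the directly-submitted request in flight, which is why
ext3's journal IO, raw IO, direct IO and RAID1 resync traffic all fall
through this particular crack.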