On 02/26/2013 01:00 PM, Sage Weil wrote:
> On Tue, 26 Feb 2013, Yan, Zheng wrote:
>>> It looks to me like truncates can get queued for later, so that's not
>>> the case? And how could the client receive a truncate while in the
>>> middle of writing? Either it's got the write caps (in which case
>>> nobody else can truncate), or it shouldn't be writing... I suppose if
>>> somebody enables lazyIO that would do it, but again in that case I
>>> imagine we'd rather do the truncate before writing:
>>
>> Commit 22cddde104 makes buffered writes get/put caps for individual
>> pages, so the MDS can revoke the write caps in the middle of a write.
>>
>>>> I think it's better to do vmtruncate after write finishes.
>>>
>>> What if you're writing past the truncate point? Then your written
>>> data would be truncated off even though it should have been a file
>>> extension.
>>
>> I was wrong; I thought the order of write and truncate was not
>> important. Now I'm worried about the correctness of commit
>> 22cddde104, we probably should revert that commit.
>
> I'm sorry I haven't had time to dig into this thread. Do you think we
> should revert that commit for this merge window?

No, that commit fixes an i_size update bug, which is more important than
the deadlock. But I think the eventual fix is to revert that commit and
also remove the optimization that drops Fw caps early (the one intended
to avoid slow cap revocation caused by balance_dirty_pages).

> My gut feeling is that the whole locking scheme here needs to be
> reworked. I suspect where we'll end up is something more like XFS,
> where there is an explicit truncate (or other) mutex that is separate
> from i_mutex. The fact that we were relying on i_mutex serialization
> from the VFS was fragile and (as we can see) ultimately a flawed
> approach. And the locking and races between writes, mmap, and
> truncation in the page cache code are fragile and annoying. The only
> good thing going for us is that fsx is pretty good at stress testing
> the code.
:) If we want to keep the write operation atomic, write and vmtruncate
need to be protected by a mutex. I don't think introducing a new mutex
makes things simpler.

>>> Oh, you're right about the locking (I think, anyway?). However,
>>> i_size doesn't protect us -- we might for instance have truncated to
>>> zero and then back up to the original file size. :( But that doesn't
>>> look like it's handled correctly anyway (racy) -- am I misreading
>>> that badly or is the infrastructure in fact broken in that case?
>>
>> You are right. Probably we can fix the race by disallowing both read
>> and write when there is a pending truncate.
>
> FWIW, I think the high-level strategy from before should still be
> basically sound: we can queue a truncation without blocking, and any
> write path will apply a pending truncation under the appropriate lock.
> But I think it's going to mean carefully documenting the locking
> requirements for the various paths and working out a new locking
> structure.
>
> Yan, is this something you are able to put some time into? I would
> like to work on it, but it's going to be hard for me to find the time
> in the next several weeks to get to it.

OK, I will work on it.

Regards
Yan, Zheng