On Sun, Jan 19, 2014 at 11:38:22AM +0200, Alex Lyakas wrote: > Hi Dave, > I believe I understand your intent of having the discard tree almost > fully disconnected from the busy extents tree. I planned to use the > the same pagb_lock for the discard tree as well, which solves the > atomicity and ordering issues you have described. If we use a > different lock (mutex), then some additional care is needed. You can probably just change the pagb_lock to a mutex. In the uncontended case they are almost as fast as spinlocks, and I don't think there's much contention on this lock right now... > One thing I like less about such approach, is that we need to be > careful of not growing the discard tree too large. If, for some > reason, the worker thread is not able to keep up discarding extents > while they are queued for discard, we may end up with too many discard > ranges in the tree. If that happens, then we'll burn CPU managing the discard tree on every allocation and waiting for discards to complete. It should self throttle. > Also at the same time user can trigger an offline > discard, which will add more discard ranges to the tree. While the > online discard that you have, kind of throttles itself. So if the > underlying block device is slow on discarding, the whole system will > be slowed down accordingly. discard already slows the whole system down badly. Moving them to the background won't prevent that - but it will reduce the impact because it discards in the transaction commit completion path completely stall the entire file system for the duration of the discards. This is the primary reason why mounting with -o discard is so slow - discards are done in a path of global serialisation. Moving it to the background avoids this problem, so regardless of how much discard work we queue up, the filesystem is still going to be faster than the current code because we've removed the global serialisation point. In the case of fstrim, this makes fstrim run very quickly because it doesn't issue synchronous discards with the AGF locked. We still do the same amount of discard work in the same amount of time in the background, but we don't block AGs or userspace while we wait for the discards to be done. That's also a pretty major win. IOWs, I think the first step is to push all the discard into the background via it's own rbtree as you've been doing. If we add tracepoints and stats to the code, we can track the discard queue depth easily enough with a bit of perf or trace-cmd output scripting, and then determine whether throttling is necessary. We can use the loop device to run tests as it implements discard via hole punching the backing file, and that will tells us pretty quickly if we need additional throttling by looking at the queue depths and the amount of CPU being spent manipulating the discard rbtree (i.e. via perf profiling). Throttling a workqueue isn't hard to do, but there's no point in doing it if it isn't necessary... > [snip] > > I have one additional question regarding your comment on metadata > > >> Q1: why it is ok to do so? why it is ok for "metadata" to reuse part > >> (or all) of the busy extent before its extent-free-intent is > >> committed? > > > > The metadata changes are logged, and crash recovery allows correct > > ordering of the free, realloc and modify process for metadata. Hence > > it doesn't matter that we overwrite the contents of the block before > > the free transaction is on disk - the correct contents will always > > be present after recovery. > > > > We can't do that for user data because we don't log user data. > > Therefore if we allow user data to overwrite the block whil eit is > > still busy, crash recovery may not result in the block having the > > correct contents (i.e. the transaction that freed the block never > > reaches the journal) and we hence expose some other user's data or > > metadata in the file. > If that is the case, why cannot we just issue an async discard before > the busy extent is committed? I understand that if we crashed, we > might have knocked off some of the user data (or changed it to some > new data). Because a user data extent is not free fo us to make arbitrary changes to it until the transaction that frees it commits. i.e. If we crash then the file must either contain the original data or be completely gone. Go google "null files after crash" and you'll see just how important people consider files being intact after a crash.... > But can XFS get corrupted (unmountable) this way? You said > earlier that it can, but now you are saying that reusing a metadata > block, before its busy extent commits, is fine. No, not exactly. There are 4 different cases to take into account: 1. metadata -> free -> alloc as metadata is the situation where there are not problems reusing the busy extent because all the modifications are in the log. Hence crash recovery always ends up with the correct result no mater where the crash occurs. 2. user data -> free -> alloc as metadata is safe because the metadata can't be written in place until the free transaction and the metadata changes are fully committed in the journal. Hence on a crash the user will either see in-tact data in their file, or the file will have had the extent removed successfully and the filesystem has safely reused it for metadata. 3. metadata -> free -> alloc as user data and 4. user data -> free -> alloc as user data are the problematic cases when it comes to IO ordering - if we allow reallocation of busy extents, then the new userdata can be written to disk before the free/alloc transactions are committed to the journal. If we crash at this point, the after recovery the new data will be pointed to by the old user. If the old user is user data, then we've corrupted their file by exposing some other users data to them. If the old user is metadata, then we've corrupted the filesystem because the metadata has been overwritten by user data before the journal recovery has read and freed the metadata block. That will cause recovery (and hence mount) to fail. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs