On Thu, Jul 06, 2017 at 10:19:06AM -0700, Darrick J. Wong wrote:
> > While I was looking into this, I realized that there are some
> > implementation details about how the jbd2 layer works that haven't
> > been written down, and it might be useful to document it for
> > everyone.  (This probably needs to go into the ext4 wiki, with some
> > editing.)
>
> Additionally ext4-jbd2.h ?

Yeah.  We should have a succinct summary of this in ext4_jbd2.h; I don't
think we should have a huge long essay in there.  That should either go
in Documentation/filesystems or in the ext4 wiki.

> > In general, under-estimating the number of credits needed is far
> > worse than over-estimating.  Under-estimating can cause the above,
> > which will end up marking the file system corrupt.  We can actually
> > do better; in fact, what we probably ought to do is to try to see if
> > we can extend the transaction, print an ext4 warning plus a
> > WARN_ON(1) to get
>
> I don't know if that's a good idea -- I'd rather train the developers
> that they cannot underestimate ever than let them paper over things
> like this that will eventually blow out on someone else's machine.
>
> But I guess a noisy WARN would get peoples' attention.

A WARN will cause xfstests to consider the test failed, so as long as we
can trigger it in xfstests, we can get it fixed.  If it happens in the
field, the poor user isn't going to do anything actionable, so doing a
hard stop on the file system when we could have continued doesn't quite
seem fair to the user, even if it makes it much more likely that the
user will file an angry bug report with the distro help desk.  :-)

The better approach might be to have distros write scripts that search
for warnings in the logs, run them out of cron, and then somehow ask the
user for permission to report the WARNs to the distro's bug tracking
system (or some other tracking system).

> > So over-estimating by a few 10's of credits is probably not
> > noticeable at all.  Over-estimating by hundreds of credits can start
> > causing performance impacts.  How?  By forcing transactions to close
> > earlier than the normal 5 second timeout due to a perceived lack of
> > space, when the journal is almost full due to a credit
> > over-estimation.  Even so, the performance impact is not necessarily
> > that great, and typically only shows up under heavy load --- and we
> > or the system administrator can mitigate this problem fairly easily
> > by increasing the journal size.
>
> /me wonders if it'd be useful to track how closely the estimates fit
> reality in debugfs to make it easier to spot-check how good of a job
> we're all doing?

It's really only important to track this for those handle types where
the estimates are huge, or could become huge --- especially if that
happens via handle extensions.  If we are estimating 20 and we're only
using 4, it's hard for me to get excited about that.

That being said, having added tracepoints to track handle usage, I
already mostly know the answer to this question, and I suspect it hasn't
changed that much.  The really big problem user is truncate, and that's
because it takes a long time and can potentially touch a large number of
blocks.

If we are worried about improving ext4 benchmarks where all N CPU cores
are trying to write to the file system at once, the clear optimization
is to change truncate so that we lock out changes to the inode, then
scan the extent tree (and/or indirect blocks) to accumulate an extent
map of blocks that need to be released, and do that *without* starting a
handle.
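To make the shape of that concrete, here is a very rough sketch.  Every
helper marked "hypothetical" below (and the blocks_to_free accumulator)
is invented purely for illustration; only ext4_journal_start() /
ext4_journal_stop(), the EXT4_HT_TRUNCATE handle type, and
EXT4_RESERVE_TRANS_BLOCKS are existing interfaces, and I'm assuming
their current form:

        /*
         * Hand-wavy sketch of the two-phase truncate idea.  This is not
         * real ext4 code; it only shows where the handle would start.
         */
        static int ext4_two_phase_truncate(struct inode *inode)
        {
                struct blocks_to_free to_free;  /* hypothetical accumulator */
                handle_t *handle;
                int credits, err;

                /*
                 * Phase 1: no handle held.  Lock out changes to the inode
                 * and walk the extent tree (and/or indirect blocks),
                 * remembering which blocks will be released, but
                 * modifying nothing.
                 */
                lock_out_inode_changes(inode);           /* hypothetical */
                collect_blocks_to_free(inode, &to_free); /* hypothetical */

                /*
                 * Phase 2: now we know exactly which block bitmaps and
                 * block group descriptors we will dirty, so the credit
                 * request can be tight instead of a worst-case guess.
                 */
                credits = count_bitmaps_and_bgds(&to_free) + /* hypothetical */
                          EXT4_RESERVE_TRANS_BLOCKS; /* inode, sb, etc. */

                handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
                if (IS_ERR(handle)) {
                        err = PTR_ERR(handle);
                        goto out;
                }

                free_collected_blocks(handle, inode, &to_free); /* hypothetical */
                clear_inode_block_array(handle, inode);         /* hypothetical */
                err = ext4_journal_stop(handle);
        out:
                allow_inode_changes(inode);              /* hypothetical */
                return err;
        }

The point is just that the expensive tree walk happens with no handle
open at all, and the handle we do start asks for a credit count we can
actually justify.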
Once we know which blocks need to be released, and especially if we are
truncating to zero (the delete-file case), we don't need to touch the
extent tree blocks as we currently do.  We can calculate how many bitmap
blocks and block group descriptors we would need to update, then try to
start a handle for that number of blocks, free the blocks, and clear the
inode's i_blocks[] array.  Boom, done.

This optimization would reduce the number of metadata blocks we need to
modify, and hence the number of blocks we need to journal, and it would
also significantly reduce the handle hold time.  That in turn improves
performance, since one of the ways highly parallel benchmarks get
dragged down by ext4 is when we have 31 CPUs waiting for the last handle
to stop, which throttles the workload that would have been running on
those 31 other CPUs and drags down our Spec SFS numbers or whatever.

I haven't done this optimization because I haven't had time, and for
most workloads (other than the NFS / Samba case), we generally don't
have all of the CPUs blocked on file system operations.  So other than
making the benchmark weenies at Phoronix happy (and those people who
actually want to run an NFS server, of course!), it hasn't been a
high-priority thing for me to do.

The other handle operation which can require a large number of credits
is inode allocation, and that's just a really complex case, especially
if quotas, ACLs, and encryption are all enabled at once.  So that's the
other one we could try optimizing, but I doubt there are as many
opportunities for optimization there as in truncate.

Cheers,

					- Ted
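P.S.  Since it came up above, here is roughly what the "try to extend
the handle and complain loudly" idea could look like.  The wrapper name
is made up; handle->h_buffer_credits, jbd2_journal_extend(),
ext4_warning() and WARN_ON() are the existing interfaces (assuming their
current form), and a positive return from jbd2_journal_extend() is the
existing jbd2 convention for "the running transaction has no room left":

        /* Hypothetical wrapper: not real ext4 code, just the idea. */
        static int ext4_try_extend_credits(struct super_block *sb,
                                           handle_t *handle, int needed)
        {
                int missing = needed - handle->h_buffer_credits;
                int err;

                if (missing <= 0)
                        return 0;       /* the estimate was fine */

                /*
                 * We under-estimated.  Be noisy so that xfstests (and
                 * developers) notice, but try to keep the file system
                 * running rather than marking it corrupt.
                 */
                ext4_warning(sb, "handle under-estimated credits (have %d, need %d)",
                             handle->h_buffer_credits, needed);
                WARN_ON(1);

                err = jbd2_journal_extend(handle, missing);
                if (err > 0)
                        /*
                         * The running transaction has no room left; the
                         * caller would have to restart the handle instead.
                         */
                        err = -ENOSPC;
                return err;
        }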