On Wed, Jan 08, 2014 at 12:43:35PM +0100, Lukáš Czerner wrote:
> So it means that we would have to have bigger journal which is
> multiple zones (or bands) of size long, right ? However I assume that
> the optimal journal size in this case will be very much dependent
> on the workload used - for example small file workload or other metadata
> heavy workloads would need bigger journal. Could we possibly make journal
> size variable ?

The journal size is already variable, e.g., "mke2fs -J size=512M".
But yes, the optimal journal size will be highly workload-dependent.

> > The journal is not truncated when the file system is unmounted, and so
> > there is no difference between mounting a file system which has been
> > cleanly unmounted or after a system crash.
>
> I would maybe argue that clean unmount might be the right time for
> checkpoint and resetting journal head back to the beginning because
> I do not see it as a performance sensitive operation. This would in
> turn help us on subsequent mount and run.

Yes, maybe.  It depends on how the SMR drive handles random writes.  I
suspect that most of the time, if the zones are closer to 256MB or 512MB
than to 32MB, the SMR drive is not going to rewrite the entire zone just
to handle a couple of random writes.  If we are doing an unmount, and so
we don't care about performance for these random writes, and if there is
a way for us to hint to the SMR drive that no, really, it should do a
full zone rewrite even if we are only updating a dozen blocks out of the
256MB zone, then sure, this might be a good thing to do.

But if the SMR drive takes these random metadata writes and writes them
to some staging area, then it might not improve performance after we
reboot and remount the file system --- indeed, depending on the location
and nature of the staging area, it might make things worse.

I think we will need to do some experiments, and perhaps get some input
from SMR drive vendors.  They probably won't be willing to release
detailed design information without our being under NDA, but we can
probably explain the design, and watch how their faces grin or twitch or
scowl.  :-)

BTW, even if we do have NDA information from one vendor, it does not
necessarily follow that other vendors use the same tradeoffs.  So even
if some of us have NDA'ed information from one or two vendors, I'm a bit
hesitant about hard-coding the design based on what they tell us.
Besides the risk that one of the vendors might do things differently,
there is also the concern that future versions of the drive might use
different schemes for managing the logical->physical translation layer.
So we will probably want to keep our implementation and design flexible.

> While this helps a lot to avoid random writes it could possibly
> result in much higher seek rates especially with bigger journals.
> We're trying hard to keep data and associated metadata close
> together and this would very much break that. This might be
> especially bad with SMR devices because those are designed to be much
> bigger in size. But of course this is a trade-off which makes it
> very important to have good benchmark.

While the file system is mounted, if a metadata block is being
referenced frequently, it will be kept in memory, so the fact that we
would have to seek to some random journal location whenever we need to
read that metadata block might not be a big deal.

(This is similar to the argument made by log-structured file systems,
which claims that if we have enough memory, the fact that the metadata
is badly fragmented doesn't matter.  Yes, if we are under heavy memory
pressure, it might not work out.)
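As a purely illustrative sketch (user-space C, not jbd2 code, and every
name below is invented), the map from a metadata block's home location
to its newest copy in the journal could be as simple as a small hash
table that is updated at commit time and consulted on reads:

/*
 * Toy illustration only -- this is not jbd2 code, and all names are
 * invented.  The idea: keep an in-memory map from a metadata block's
 * home location in the file system to the location of its newest
 * logged copy in the journal.  While the file system is mounted, a
 * read of such a block can be redirected through this map, so the
 * "random journal location" only costs a seek when the block has
 * fallen out of the cache.
 */
#include <stdint.h>
#include <stdio.h>

#define MAP_SLOTS 4096			/* power of two; no resizing in this toy */

struct jmap_entry {
	uint64_t fs_block;		/* block number in the file system */
	uint64_t journal_block;		/* newest copy's location in the journal */
	int in_use;
};

static struct jmap_entry jmap[MAP_SLOTS];

static unsigned int jmap_hash(uint64_t fs_block)
{
	return (unsigned int)(fs_block * 0x9E3779B97F4A7C15ULL) & (MAP_SLOTS - 1);
}

/*
 * Called when a commit logs a new copy of fs_block; a later commit
 * simply overwrites the older mapping.
 */
static void jmap_update(uint64_t fs_block, uint64_t journal_block)
{
	unsigned int i = jmap_hash(fs_block);

	while (jmap[i].in_use && jmap[i].fs_block != fs_block)
		i = (i + 1) & (MAP_SLOTS - 1);
	jmap[i].fs_block = fs_block;
	jmap[i].journal_block = journal_block;
	jmap[i].in_use = 1;
}

/*
 * Returns the journal location of the newest logged copy, or 0 if the
 * home location on disk is already current.
 */
static uint64_t jmap_lookup(uint64_t fs_block)
{
	unsigned int i = jmap_hash(fs_block);

	while (jmap[i].in_use) {
		if (jmap[i].fs_block == fs_block)
			return jmap[i].journal_block;
		i = (i + 1) & (MAP_SLOTS - 1);
	}
	return 0;
}

int main(void)
{
	jmap_update(8193, 1024);	/* commit 1 logs fs block 8193 */
	jmap_update(8193, 1536);	/* a later commit logs it again */
	printf("read fs block 8193 from journal block %llu\n",
	       (unsigned long long)jmap_lookup(8193));
	return 0;
}

A real implementation would of course need to size the table sensibly,
drop entries as the journal is checkpointed, and deal with locking; the
sketch is only meant to show that the lookup is cheap once the map is
in memory.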
> I assume that the information about the newest commits for
> particular metadata blocks would be kept in memory ? Otherwise it
> would be quite expensive operation. But it seems unavoidable on
> mount time, so it might really be better to clear the journal at
> unmount when we should have all this information already in memory ?

Information about all commits and what blocks are still associated with
them is already being kept in memory.  Currently this is being done via
a jh/bh; we'd want to do this differently, since we wouldn't necessarily
enforce that all blocks which are in the journal must be in the buffer
cache.  (Although if we did keep all blocks in the journal in the buffer
cache, it would address the issue you raised above, at the expense of
using a large amount of memory --- more memory than we would be
comfortable using, although I'd bet it is still less memory than, say,
ZFS requires.  :-)

					- Ted

P.S.  One other benefit of this design which I forgot to mention in this
version of the draft: using this scheme would also allow us to implement
true read-only mounts and file system checks, without requiring that we
modify the file system by replaying the journal before proceeding with
the mount or the e2fsck run.
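To make the P.S. slightly more concrete, here is another toy sketch
(again user-space C, with a made-up journal layout rather than the real
jbd2 on-disk format) of the mount-time scan: walk the journal once, keep
only the newest logged copy of each metadata block, and never write
anything back to the device.

/*
 * Again a toy, not the real jbd2 on-disk format: pretend the journal
 * has already been parsed into an array of (fs block, journal block,
 * commit id) records.  The mount-time scan just keeps the newest
 * record for each fs block.  Note that nothing here ever writes to
 * the device.
 */
#include <stdint.h>
#include <stdio.h>

struct logged_block {
	uint64_t fs_block;		/* home location in the file system */
	uint64_t journal_block;		/* where this copy sits in the journal */
	uint32_t commit_id;		/* transaction that logged it */
};

/* A pretend journal: three commits, with fs block 8193 logged twice. */
static const struct logged_block journal[] = {
	{ 8193, 1024, 1 },
	{   34, 1025, 1 },
	{ 8193, 1536, 2 },
	{  517, 1537, 3 },
};

int main(void)
{
	size_t n = sizeof(journal) / sizeof(journal[0]);
	size_t i, j;

	/*
	 * Keep only the newest copy of each fs block (O(n^2) for
	 * brevity; a real scan would build a hash table as in the
	 * earlier sketch).
	 */
	for (i = 0; i < n; i++) {
		int superseded = 0;

		for (j = i + 1; j < n; j++)
			if (journal[j].fs_block == journal[i].fs_block)
				superseded = 1;
		if (!superseded)
			printf("fs block %llu -> journal block %llu (commit %u)\n",
			       (unsigned long long)journal[i].fs_block,
			       (unsigned long long)journal[i].journal_block,
			       (unsigned)journal[i].commit_id);
	}
	return 0;
}

A real scan would obviously have to validate commit blocks and
checksums before trusting anything; the point is only that building the
map requires nothing but reads, which is what makes the truly read-only
mount or e2fsck pass possible.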