On Wed, 8 Jan 2014, Theodore Ts'o wrote:

> Date: Wed, 8 Jan 2014 10:20:37 -0500
> From: Theodore Ts'o <tytso@xxxxxxx>
> To: Lukáš Czerner <lczerner@xxxxxxxxxx>
> Cc: linux-ext4@xxxxxxxxxxxxxxx
> Subject: Re: A proposal for making ext4's journal more SMR (and flash) friendly
>
> On Wed, Jan 08, 2014 at 12:43:35PM +0100, Lukáš Czerner wrote:
> > So it means that we would have to have a bigger journal which is
> > multiple zones (or bands) in size, right? However I assume that
> > the optimal journal size in this case will be very much dependent
> > on the workload used - for example a small-file workload or other
> > metadata-heavy workloads would need a bigger journal. Could we
> > possibly make the journal size variable?
>
> The journal size is already variable, e.g., "mke2fs -J size=512M".
> But yes, the optimal journal size will be highly variable.

Yes, but I meant variable while the file system is mounted, within
some boundaries of course. But I guess we'll have to think about it
once we actually have some code done and hardware to test on. I am
just mentioning it because it might turn out to be a problem, and I
would not want users to have to pick the right journal size for
every file system.

> > > The journal is not truncated when the file system is unmounted,
> > > and so there is no difference between mounting a file system
> > > which has been cleanly unmounted or after a system crash.
> >
> > I would maybe argue that a clean unmount might be the right time
> > for checkpointing and resetting the journal head back to the
> > beginning, because I do not see it as a performance-sensitive
> > operation. This would in turn help us on the subsequent mount and
> > run.
>
> Yes, maybe. It depends on how the SMR drive handles random writes.
> I suspect that most of the time, if the zones are closer to 256MB
> or 512MB rather than 32MB, the SMR drive is not going to rewrite
> the entire zone just to handle a couple of random writes. If we
> are doing an unmount, and so we don't care about performance for
> these random writes, and if there is a way for us to hint to the
> SMR drive that no, really, it really should do a full zone rewrite,
> even if we are only updating a dozen blocks out of the 256MB zone,
> then sure, this might be a good thing to do.
>
> But if the SMR drive takes these random metadata writes and writes
> them to some staging area, then it might not improve performance
> after we reboot and remount the file system --- indeed, depending
> on the location and nature of the staging area, it might make
> things worse.
>
> I think we will need to do some experiments, and perhaps get some
> input from SMR drive vendors. They probably won't be willing to
> release detailed design information without our being under NDA,
> but we can probably explain the design, and watch how their faces
> grin or twitch or scowl. :-)
>
> BTW, even if we do have NDA information from one vendor, it might
> not necessarily follow that other vendors use the same tradeoffs.
> So even if some of us have NDA'ed information from one or two
> vendors, I'm a bit hesitant about hard-coding the design based on
> what they tell us. Besides the risk that one of the vendors might
> do things differently, there is also the concern that future
> versions of the drive might use different schemes for managing the
> logical->physical translation layer. So we will probably want to
> keep our implementation and design flexible.

I very much agree; the firmware implementations of those drives will
probably change a lot during the first generations.

> > While this helps a lot to avoid random writes, it could possibly
> > result in much higher seek rates, especially with bigger journals.
> > We're trying hard to keep data and associated metadata close
> > together, and this would very much break that. This might be
> > especially bad with SMR devices because those are designed to be
> > much bigger in size. But of course this is a trade-off, which
> > makes it very important to have good benchmarks.
>
> While the file system is mounted, if the metadata block is being
> referenced frequently, it will be kept in memory, so the fact that
> it would have to seek to some random journal location if we need
> to read that metadata block might not be a big deal. (This is
> similar to the argument used by log-structured file systems, which
> claims that if we have enough memory, the fact that the metadata
> is badly fragmented doesn't matter. Yes, if we are under heavy
> memory pressure, it might not work out.)
>
> > I assume that the information about the newest commits for
> > particular metadata blocks would be kept in memory? Otherwise it
> > would be quite an expensive operation. But it seems unavoidable
> > at mount time, so it might really be better to clear the journal
> > at unmount, when we should have all this information already in
> > memory?
>
> Information about all commits and what blocks are still associated
> with them is already being kept in memory. Currently this is being
> done via a jh/bh; we'd want to do this differently, since we
> wouldn't necessarily enforce that all blocks which are in the
> journal must be in the buffer cache. (Although if we did keep all
> blocks in the journal in the buffer cache, it would address the
> issue you raised above, at the expense of using a large amount of
> memory --- more memory than we would be comfortable using, although
> I'd bet it is still less memory than, say, ZFS requires. :-)

I did not even think about always keeping those blocks in memory; it
should definitely be subject to memory reclaim. With a big enough
journal and the right workload this could grow out of proportion :)
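Just to make sure we are picturing the same thing, below is roughly
the kind of per-block bookkeeping I would imagine. This is purely a
sketch with made-up names (journal_map_entry, journal_map_resolve and
so on do not exist in jbd2 today): one small entry per metadata block
whose newest copy lives only in the journal, indexed by the block's
final location, so that the buffer itself can be dropped under memory
pressure and re-read from the journal on demand.

/*
 * Hypothetical sketch only, nothing of this exists in jbd2.
 * Assumes <linux/rbtree.h>, <linux/slab.h> and <linux/jbd2.h>.
 */
struct journal_map_entry {
	struct rb_node	node;		/* keyed by fs_blocknr */
	sector_t	fs_blocknr;	/* final ("home") location on disk */
	sector_t	jnl_blocknr;	/* newest copy inside the journal */
	tid_t		tid;		/* transaction that wrote that copy */
};

/*
 * Redirect a metadata read: use the newest copy in the journal if we
 * have one, otherwise fall back to the home location (the block has
 * already been checkpointed, or was never journalled).
 */
static sector_t journal_map_resolve(struct rb_root *map,
				    sector_t fs_blocknr)
{
	struct rb_node *n = map->rb_node;

	while (n) {
		struct journal_map_entry *e =
			rb_entry(n, struct journal_map_entry, node);

		if (fs_blocknr < e->fs_blocknr)
			n = n->rb_left;
		else if (fs_blocknr > e->fs_blocknr)
			n = n->rb_right;
		else
			return e->jnl_blocknr;
	}
	return fs_blocknr;
}

The map entries are tiny compared to the buffers they describe, so
keeping the map itself pinned while the buffers go through normal
reclaim seems like a reasonable middle ground to me.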
> 						- Ted
>
> P.S. One other benefit of this design which I forgot to mention in
> this version of the draft: using this scheme would also allow us to
> implement true read-only mounts and file system checks, without
> requiring that we modify the file system by replaying the journal
> before proceeding with the mount or the e2fsck run.

Right, that would be a nice side effect.

Thanks!
-Lukas
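P.S. For the read-only case above, the same hypothetical map could be
built by scanning the committed transactions once and remembering
only where the newest copy of each block lives, instead of replaying
them; reads would then go through journal_map_resolve() and nothing
would ever be written back to the device. Again just a sketch with
made-up names:

/*
 * Record (or update) the newest journal copy of a metadata block.
 * A read-only mount or e2fsck run would call this while scanning
 * committed transactions, rather than replaying them.
 */
static int journal_map_record(struct rb_root *map, sector_t fs_blocknr,
			      sector_t jnl_blocknr, tid_t tid)
{
	struct rb_node **p = &map->rb_node, *parent = NULL;
	struct journal_map_entry *e;

	while (*p) {
		parent = *p;
		e = rb_entry(parent, struct journal_map_entry, node);

		if (fs_blocknr < e->fs_blocknr) {
			p = &(*p)->rb_left;
		} else if (fs_blocknr > e->fs_blocknr) {
			p = &(*p)->rb_right;
		} else {
			/* scanned in commit order, so the later copy wins */
			e->jnl_blocknr = jnl_blocknr;
			e->tid = tid;
			return 0;
		}
	}

	e = kmalloc(sizeof(*e), GFP_NOFS);
	if (!e)
		return -ENOMEM;
	e->fs_blocknr = fs_blocknr;
	e->jnl_blocknr = jnl_blocknr;
	e->tid = tid;
	rb_link_node(&e->node, parent, p);
	rb_insert_color(&e->node, map);
	return 0;
}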