On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz <john.stultz@xxxxxxxxxx> wrote:

>
> After Kernel Summit and Plumbers, I wanted to consider all the various
> side-discussions and try to summarize my current thoughts here along
> with sending out my current implementation for review.
>
> Also: I'm going on four weeks of paternity leave in the very near
> (but non-deterministic) future. So while I hope I still have time
> for some discussion, I may have to deal with fussier complaints
> then yours. :) In any case, you'll have more time to chew on
> the idea and come up with amazing suggestions. :)

Hi John,

I wonder if you are trying to please everyone and risking pleasing
no-one? Well, maybe not quite that extreme, but you can't please all
the people all the time.

For example, allowing sub-page volatile regions seems to be above and
beyond the call of duty. You cannot mmap sub-pages, so why should they
be volatile?

Similarly the suggestion of using madvise - while tempting - is
probably a minority interest and can probably be managed with library
code. I'm glad you haven't pursued it.

I think discarding whole ranges at a time is very sensible, and so
merging adjacent ranges is best avoided. If you require page-aligned
ranges this becomes trivial - is that right?

I wonder if the oldest page/oldest range issue can be defined away by
requiring apps to touch the first page in a range when they touch the
range. Then the age of a range is the age of the first page.
Non-initial pages could even be kept off the free list .... though
that might confuse NUMA page reclaim if a range had pages from
different nodes.

Application to non-tmpfs files seems very unclear and so probably best
avoided. If I understand you correctly, then you have suggested both
that a volatile range would be a "lazy hole punch" and a "don't let
this get written to disk yet" flag. It cannot really be both. The
former sounds like fallocate, the latter like fadvise. I think the
latter sounds more like the general purpose of volatile ranges, but I
also suspect that some journalling filesystems might be uncomfortable
providing a guarantee like that. So I would suggest firmly stating
that it is a tmpfs-only feature. If someone wants something vaguely
similar for other filesystems, let them implement it separately.

The SIGBUS interface could have some merit if it really reduces
overhead. I worry about app bugs that could result from the
non-deterministic behaviour. A range could get unmapped while it is in
use, and testing for the case of "get a SIGBUS halfway through
accessing something" would not be straightforward (SIGBUS on the first
step of access should be easy). I guess that is up to the app writer,
but I have never liked anything about the signal interface and
encouraging further use doesn't feel wise.

That's my 2c worth for now. Keep up the good work,

NeilBrown