On Wed, Aug 27, 2014 at 02:30:55PM -0700, Andrew Morton wrote: > On Wed, 27 Aug 2014 16:22:20 -0500 (CDT) Christoph Lameter <cl@xxxxxxxxx> wrote: > > > > Some explanation of why one would use ext4 instead of, say, > > > suitably-modified ramfs/tmpfs/rd/etc? > > > > The NVDIMM contents survive reboot and therefore ramfs and friends wont > > work with it. > > See "suitably modified". Presumably this type of memory would need to > come from a particular page allocator zone. ramfs would be unweildy > due to its use to dentry/inode caches, but rd/etc should be feasible. <sigh> That's where we started about two years ago with that horrible pramfs trainwreck. To start with: brd is a block device, not a filesystem. We still need the filesystem on top of a persistent ram disk to make it useful to applications. We can do this with ext4/XFS right now, and that is the fundamental basis on which DAX is built. For sake of the discussion, however, let's walk through what is required to make an "existing" ramfs persistent. Persistence means we can't just wipe it and start again if it gets corrupted, and rebooting is not a fix for problems. Hence we need to be able to identify it, check it, repair it, ensure metadata operations are persistent across machine crashes, etc, so there is all sorts of management tools required by a persistent ramfs. But most important of all: the persistent storage format needs to be forwards and backwards compatible across kernel versions. Hence we can't encode any structure the kernel uses internally into the persistent storage because they aren't stable structures. That means we need to marshall objects between the persistence domain and the volatile domain in an orderly fashion. We can avoid using the dentry/inode *caches* by freeing those volatile objects the moment reference counts dop to zero rather than putting them on LRUs. However, we can't store them in persistent storage and we can't avoid using them to interface with the VFS, so it makes little sense to burn CPU continually marshalling such structures in and out of volatile memory if we have free RAM to do so. So even with a "persistent ramfs" caching the working set of volatile VFS objects makes sense from a peformance point of view. Then you've got crash recovery management: NVDIMMs are not synchronous: they can still lose data while it is being written on power loss. And we can't update persistent memory piecemeal as the VFS code modifies metadata - there needs to be synchronisation points, otherwise we will always have inconsistent metadata state in persistent memory. Persistent memory also can't do atomic writes across multiple, disjoint CPU cachelines or NVDIMMs, and this is what is needed for synchroniation points for multi-object metadata modification operations to be consistent after a crash. There is some work in the nvme working groups to define this, but so far there hasn't been any useful outcome, and then we willhave to wait for CPUs to implement those interfaces. Hence the metadata that indexes the persistent RAM needs to use COW techniques, use a log structure or use WAL (journalling). Hence that "persistent ramfs" is now looking much more like a database or traditional filesystem. Further, it's going to need to scale to very large amounts of storage. We're talking about machines with *tens of TB* of NVDIMM capacity in the immediate future and so free space manangement and concurrency of allocation and freeing of used space is going to be fundamental to the performance of the persistent NVRAM filesystem. So, you end up with block/allocation groups to subdivide the space. Looking a lot like ext4 or XFS at this point. And now you have to scale to indexing tens of millions of everything. At least tens of millions - hundreds of millions to billions is more likely, because storing tens of terabytes of small files is going to require indexing billions of files. And because there is no performance penalty for doing this, people will use the filesystem as a great big database. So now you have to have a scalable posix compatible directory structures, scalable freespace indexation, dynamic, scalable inode allocation, freeing, etc. Oh, and it also needs to be highly concurrent to handle machines with hundreds of CPU cores. Funnily enough, we already have a couple of persistent storage implementations that solve these problems to varying degrees. ext4 is one of them, if you ignore the scalability and concurrency requirements. XFS is the other. And both will run unmodified on a persistant ram block device, which we *already have*. And so back to DAX. What users actually want from their high speed persistant RAM storage is direct, cpu addressable access to that persistent storage. They don't want to have to care about how to find an object in the persistent storage - that's what filesystems are for - they just want to be able to read and write to it directly. That's what DAX does - it provides existing filesystems a method for exposing direct access to the persistent RAM to applications in a manner that application developers are already familiar with. It's a win-win situation all round. IOWs, ext4/XFS + DAX gets us to a place that is good enough for most users and the hardware capabilities we expect to see in the next 5 years. And hopefully that will be long enough to bring a purpose built, next generation persistent memory filesystem to production quality that can take full advantage of the technology... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html