From: Andiry Xu <jix024@xxxxxxxxxxx> NOVA is a log-structured file system tailored for byte-addressable non-volatile memories. It was designed and developed at the Non-Volatile Systems Laboratory in the Computer Science and Engineering Department at the University of California, San Diego. Its primary authors are Andiry Xu <jix024@xxxxxxxxxxxx>, Lu Zhang <luzh@xxxxxxxxxxxx>, and Steven Swanson <swanson@xxxxxxxxxxxx>. These two papers provide a detailed, high-level description of NOVA's design goals and approach: NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories In The 14th USENIX Conference on File and Storage Technologies (FAST '16) (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf) NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System In The 26th ACM Symposium on Operating Systems Principles (SOSP '17) (http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf) This patchset contains features from the FAST paper. We leave NOVA-Fortis features, such as snapshot, metadata and data replication and RAID parity for future submission. Signed-off-by: Andiry Xu <jix024@xxxxxxxxxxx> --- Documentation/filesystems/00-INDEX | 2 + Documentation/filesystems/nova.txt | 498 +++++++++++++++++++++++++++++++++++++ MAINTAINERS | 8 + 3 files changed, 508 insertions(+) create mode 100644 Documentation/filesystems/nova.txt diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index b7bd6c9..dc5c722 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -95,6 +95,8 @@ nfs/ - nfs-related documentation. nilfs2.txt - info and mount options for the NILFS2 filesystem. +nova.txt + - info on the NOVA filesystem. ntfs.txt - info and mount options for the NTFS filesystem (Windows NT). ocfs2.txt diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt new file mode 100644 index 0000000..4728f50 --- /dev/null +++ b/Documentation/filesystems/nova.txt @@ -0,0 +1,498 @@ +The NOVA Filesystem +=================== + +NOn-Volatile memory Accelerated file system (NOVA) is a DAX file system +designed to provide a high performance and production-ready file system +tailored for byte-addressable non-volatile memories (e.g., NVDIMMs +and Intel's soon-to-be-released 3DXPoint DIMMs). +NOVA combines design elements from many other file systems +and adapts conventional log-structured file system techniques to +exploit the fast random access that NVMs provide. In particular, NOVA maintains +separate logs for each inode to improve concurrency, and stores file data +outside the log to minimize log size and reduce garbage collection costs. NOVA's +logs provide metadata and data atomicity and focus on simplicity and +reliability, keeping complex metadata structures in DRAM to accelerate lookup +operations. + +NOVA was developed by the Non-Volatile Systems Laboratory (NVSL) in +the Computer Science and Engineering Department at the University of +California, San Diego. + +A more thorough discussion of NOVA's design is avaialable in these two papers: + +NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories +Jian Xu and Steven Swanson +In The 14th USENIX Conference on File and Storage Technologies (FAST '16) + +NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System +Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, +Tamires Brito Da Silva, Andy Rudoff and Steven Swanson +In The 26th ACM Symposium on Operating Systems Principles (SOSP '17) + +This version of NOVA contains features from the FAST paper. +NOVA-Fortis features, such as snapshot, metadata and data protection and replication +are left for future submission. + +The main NOVA features include: + + * POSIX semantics + * Directly access (DAX) byte-addressable NVMM without page caching + * Per-CPU NVMM pool to maximize concurrency + * Strong consistency guarantees with 8-byte atomic stores + + +Filesystem Design +================= + +NOVA divides NVMM into several regions. NOVA's 512B superblock contains global +file system information and the recovery inode. The recovery inode represents a +special file that stores recovery information (e.g., the list of unallocated +NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also +provides per-CPU journals for complex file operations that involve multiple +inodes. The rest of the available NVMM stores logs and file data. + +NOVA is log-structured and stores a separate log for each inode to maximize +concurrency and provide atomicity for operations that affect a single file. The +logs only store metadata and comprise a linked list of 4 KB pages. Log entries +are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log +pages may reside anywhere in NVMM. + +NOVA keeps copies of most file metadata in DRAM during normal +operations, eliminating the need to access metadata in NVMM during reads. + +NOVA supports both copy-on-write and in-place file data updates and appends +metadata about the write to the log. For operations that affect multiple inodes +NOVA uses lightweight, fixed-length journals –one per core. + +NOVA divides the allocatable NVMM into multiple regions, one region per CPU +core. A per-core allocator manages each of the regions, minimizing contention +during memory allocation. + +After a system crash, NOVA must scan all the logs to rebuild the memory +allocator state. Since, there are many logs, NOVA aggressively parallelizes the +scan. + + +Building and using NOVA +======================= + +To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`), +DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support. Install as usual. + +NOVA runs on a pmem non-volatile memory region. You can create one of these +regions with the `memmap` kernel command line option. For instance, adding +`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting +from address 8GB, and the kernel will create a `pmem0` block device under the +`/dev` directory. + +After the OS has booted, you can initialize a NOVA instance with the following commands: + + +# modprobe nova +# mount -t NOVA -o init /dev/pmem0 /mnt/nova + + +The above commands create a NOVA instance on `/dev/pmem0` and mounts it on +`/mnt/nova`. + +NOVA support several module command line options: + + * measure_timing: Measure the timing of file system operations for profiling (default: 0) + + * inplace_data_updates: Update data in place rather than with COW (default: 0) + +To recover an existing NOVA instance, mount NOVA without the init option, for example: + +# mount -t NOVA /dev/pmem0 /mnt/nova + + +Sysfs support +------------- + +NOVA provides sysfs support to enable user to get/set information of +a running NOVA instance. +After mount, NOVA creates four entries under proc directory /proc/fs/nova/pmem#/: + +timing_stats IO_stats allocator gc + +Show NOVA file operation timing statistics: +# cat /proc/fs/NOVA/pmem#/timing_stats + +Clear timing statistics: +# echo 1 > /proc/fs/NOVA/pmem#/timing_stats + +Show NOVA I/O statistics: +# cat /proc/fs/NOVA/pmem#/IO_stats + +Clear I/O statistics: +# echo 1 > /proc/fs/NOVA/pmem#/IO_stats + +Show NOVA allocator information: +# cat /proc/fs/NOVA/pmem#/allocator + +Manual garbage collection: +# echo #inode_number > /proc/fs/NOVA/pmem#/gc + + +Source File Structure +===================== + + * nova_def.h/nova.h + Defines NOVA macros and key inline functions. + + * balloc.{h,c} + NOVA's pmem allocator implementation. + + * bbuild.c + Implements recovery routines to restore the in-use inode list and the NVMM + allocator information. + + * dax.c + Implements DAX read/write and mmap functions to access file data. NOVA uses + copy-on-write to modify file pages by default, unless inplace data update is + enabled at mount-time. + + * dir.c + Contains functions to create, update, and remove NOVA dentries. + + * file.c + Implements file-related operations such as open, fallocate, llseek, fsync, + and flush. + + * gc.c + NOVA's garbage collection functions. + + * inode.{h,c} + Creates, reads, and frees NOVA inode tables and inodes. + + * ioctl.c + Implements some ioctl commands to call NOVA's internal functions. + + * journal.{h,c} + For operations that affect multiple inodes NOVA uses lightweight, + fixed-length journals – one per core. This file contains functions to + create and manage the lite journals. + + * log.{h,c} + Functions to manipulate NOVA inode logs, including log page allocation, log + entry creation, commit, modification, and deletion. + + * namei.c + Functions to create/remove files, directories, and links. It also looks for + the NOVA inode number for a given path name. + + * rebuild.c + When mounting NOVA, rebuild NOVA inodes from its logs. + + * stats.{h,c} + Provide routines to gather and print NOVA usage statistics. + + * super.{h,c} + Super block structures and NOVA FS layout and entry points for NOVA + mounting and unmounting, initializing or recovering the NOVA super block + and other global file system information. + + * symlink.c + Implements functions to create and read symbolic links in the filesystem. + + * sysfs.c + Implements sysfs entries to take user inputs for printing NOVA statistics. + + +Filesystem Layout +================= + +A NOVA file systems resides in single PMEM device. +NOVA divides the device into 4KB blocks. + + block ++---------------------------------------------------------+ +| 0 | primary super block (struct nova_super_block) | ++---------------------------------------------------------+ +| 1 | Reserved inodes | ++---------------------------------------------------------+ +| 2 - 15 | reserved | ++---------------------------------------------------------+ +| 16 - 31 | Inode table pointers | ++---------------------------------------------------------+ +| 32 - 47 | Journal pointers | ++---------------------------------------------------------+ +| 48 - 63 | reserved | ++---------------------------------------------------------+ +| ... | log and data pages | ++---------------------------------------------------------+ +| n-2 | replica reserved Inodes | ++---------------------------------------------------------+ +| n-1 | replica super block | ++---------------------------------------------------------+ + + + +Superblock and Associated Structures +==================================== + +The beginning of the PMEM device hold the super block and its associated +tables. These include reserved inodes, a table of pointers to the journals +NOVA uses for complex operations, and pointers to inodes tables. NOVA +maintains replicas of the super block and reserved inodes in the last two +blocks of the PMEM area. + + +Block Allocator/Free Lists +========================== + +NOVA uses per-CPU allocators to manage free PMEM blocks. On initialization, +NOVA divides the range of blocks in the PMEM device among the CPUs, and those +blocks are managed solely by that CPU. We call these ranges of "allocation regions". +Each allocator maintains a red-black tree of unallocated ranges (struct +nova_range_node). + +Allocation Functions +-------------------- + +NOVA allocate PMEM blocks using two mechanisms: + +1. Static allocation as defined in super.h + +2. Allocation for log and data pages via nova_new_log_blocks() and +nova_new_data_blocks(). + + +PMEM Address Translation +------------------------ + +In NOVA's persistent data structures, memory locations are given as offsets +from the beginning of the PMEM region. nova_get_block() translates offsets to +PMEM addresses. nova_get_addr_off() performs the reverse translation. + + +Inodes +====== + +NOVA maintains per-CPU inode tables, and inode numbers are striped across the +tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1). + +The inodes themselves live in a set of linked lists (one per CPU) of 2MB +blocks. The last 8 bytes of each block points to the next block. Pointers to +heads of these list live in PMEM block INODE_TABLE_START. +Additional space for inodes is allocated on demand. + +To allocate inodes, NOVA maintains a per-cpu "inuse_list" in DRAM holds a RB +tree that holds ranges of allocated inode numbers. + + +Logs +==== + +NOVA maintains a log for each inode that records updates to the inode's +metadata and holds pointers to the file data. NOVA makes updates to file data +and metadata atomic by atomically appending log entries to the log. + +Each inode contains pointers to head and tail of the inode's log. When the log +grows past the end of the last page, nova allocates additional space. For +short logs (less than 1MB) , it doubles the length. For longer logs, it adds a +fixed amount of additional space (1MB). + +Log space is reclaimed during garbage collection. + +Log Entries +----------- + +There are four kinds of log entry, documented in log.h. The log entries have +several entries in common: + + 1. 'epoch_id' gives the epoch during which the log entry was created. + Creating a snapshot increments the epoch_id for the file systems. + Currently disabled (always zero). + + 2. 'trans_id' is per-inode, monotone increasing, number assigned each + log entry. It provides an ordering over FS operations on a single inode. + + 3. 'invalid' is true if the effects of this entry are dead and the log + entry can be garbage collected. + + 4. 'csum' is a CRC32 checksum for the entry. Currently it is disabled. + +Log structure +------------- + +The logs comprise a linked list of PMEM blocks. The tail of each block +contains some metadata about the block and pointers to the next block and +block's replica (struct nova_inode_page_tail). + ++----------------+ +| log entry | ++----------------+ +| log entry | ++----------------+ +| ... | ++----------------+ +| tail | +| metadata | +| -> next block | ++----------------+ + + +Journals +======== + +NOVA uses a lightweight journaling mechanisms to provide atomicity for +operations that modify more than one on inode. The journals providing logging +for two operations: + +1. Single word updates (JOURNAL_ENTRY) +2. Copying inodes (JOURNAL_INODE) + +The journals are undo logs: NOVA creates the journal entries for an operation, +and if the operation does not complete due to a system failure, the recovery +process rolls back the changes using the journal entries. + +To commit, NOVA drops the log. + +NOVA maintains one journal per CPU. The head and tail pointers for each +journal live in a reserved page near the beginning of the file system. + +During recovery, NOVA scans the journals and undoes the operations described by +each entry. + + +File and Directory Access +========================= + +To access file data via read(), NOVA maintains a radix tree in DRAM for each +inode (nova_inode_info_header.tree) that maps file offsets to write log +entries. For directories, the same tree maps a hash of filenames to their +corresponding dentry. + +In both cases, the nova populates the tree when the file or directory is opened +by scanning its log. + + +MMap and DAX +============ + +NOVA leverages the kernel's DAX mechanisms for mmap and file data access. +NOVA supports DAX-style mmap, i.e. mapping NVM pages directly to the +application's address space. + + +Garbage Collection +================== + +NOVA recovers log space with a two-phase garbage collection system. When a log +reaches the end of its allocated pages, NOVA allocates more space. Then, the +fast GC algorithm scans the log to remove pages that have no valid entries. +Then, it estimates how many pages the logs valid entries would fill. If this +is less than half the number of pages in the log, the second GC phase copies +the valid entries to new pages. + +For example (V=valid; I=invalid): + ++---+ +---+ +---+ +| I | | I | | V | ++---+ +---+ Thorough +---+ +| V | | V | GC | V | ++---+ +---+ =====> +---+ +| I | | I | | V | ++---+ +---+ +---+ +| V | | V | | V | ++---+ +---+ +---+ + | | + V V ++---+ +---+ +| I | | V | ++---+ +---+ +| I | fast GC | I | ++---+ ====> +---+ +| I | | I | ++---+ +---+ +| I | | V | ++---+ +---+ + | + V ++---+ +| V | ++---+ +| I | ++---+ +| I | ++---+ +| V | ++---+ + + +Umount and Recovery +=================== + +Clean umount/mount +------------------ + +On a clean unmount, NOVA saves the contents of many of its DRAM data structures +to PMEM to accelerate the next mount: + +1. NOVA stores the allocator state for each of the per-cpu allocators to the + log of a reserved inode (NOVA_BLOCK_NODE_INO). + +2. NOVA stores the per-CPU lists of alive inodes (the inuse_list) to the + NOVA_BLOCK_INODELIST_INO reserved inode. + +After a clean unmount, the following mount restores these data and then +invalidates them. + +Recovery after failures +----------------------- + +In case of a unclean dismount (e.g., system crash), NOVA must rebuild these +DRAM structures by scanning the inode logs. NOVA log scanning is fast because +per-CPU inode tables and per-inode logs allow for parallel recovery. + +The number of live log entries in an inode log is roughly the number of extents +in the file. As a result, NOVA only needs to scan a small fraction of the NVMM +during recovery. + +The NOVA failure recovery consists of two steps: + +First, NOVA checks its lite weight journals and rolls back any uncommitted +transactions to restore the file system to a consistent state. + +Second, NOVA starts a recovery thread on each CPU and scans the inode tables in +parallel, performing log scanning for every valid inode in the inode table. +NOVA use different recovery mechanisms for directory inodes and file inodes: +For a directory inode, NOVA scans the log's linked list to enumerate the pages +it occupies, but it does not inspect the log's contents. For a file inode, +NOVA reads the write entries in the log to enumerate the data pages. + +During the recovery scan NOVA builds a bitmap of occupied pages, and rebuilds +the allocator based on the result. After this process completes, the file +system is ready to accept new requests. + +During the same scan, it rebuilds the list of available inodes. + + +Gaps, Missing Features, and Development Status +============================================== + +Although NOVA is a fully-functional file system, there is still much work left +to be done. In particular, (at least) the following items are currently missing: + +1. Snapshot, metadata and data replication and protection are left for future submission. +2. There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system). +3. NOVA only works on x86-64 kernels. +4. NOVA does not currently support extended attributes or ACL. +5. NOVA doesn't provide quota support. +6. Moving NOVA file systems between machines with different numbers of CPUs does not work. + +None of these are fundamental limitations of NOVA's design. + +NOVA is complete and robust enough to run a range of complex applications, but +it is not yet ready for production use. Our current focus is on adding a few +missing features from the list above and finding/fixing bugs. + + +Hacking and Contributing +======================== + +If you find bugs, please report them at https://github.com/NVSL/linux-nova/issues. + +If you have other questions or suggestions you can contact the NOVA developers +at cse-nova-hackers@xxxxxxxxxxxx. diff --git a/MAINTAINERS b/MAINTAINERS index 4623caf..89ac59b 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9825,6 +9825,14 @@ F: drivers/power/supply/bq27xxx_battery_i2c.c F: drivers/power/supply/isp1704_charger.c F: drivers/power/supply/rx51_battery.c +NOVA FILESYSTEM +M: Andiry Xu <jix024@xxxxxxxxxxx> +M: Steven Swanson <swanson@xxxxxxxxxxx> +L: linux-fsdevel@xxxxxxxxxxxxxxx +L: linux-nvdimm@xxxxxxxxxxxx +F: Documentation/filesystems/nova.txt +F: fs/nova/ + NTB AMD DRIVER M: Shyam Sundar S K <Shyam-sundar.S-k@xxxxxxx> L: linux-ntb@xxxxxxxxxxxxxxxx -- 2.7.4