On Thu, Feb 09, 2023 at 05:41:22AM +0000, Allison Henderson wrote: > On Thu, 2023-02-02 at 15:14 -0800, Darrick J. Wong wrote: > > On Thu, Feb 02, 2023 at 07:14:22AM +0000, Allison Henderson wrote: > > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > > > > Add a discussion of pageable kernel memory, since online fsck > > > > needs > > > > quite a bit more memory than most other parts of the filesystem > > > > to > > > > stage > > > > records and other information. > > > > > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > --- > > > > .../filesystems/xfs-online-fsck-design.rst | 490 > > > > ++++++++++++++++++++ > > > > 1 file changed, 490 insertions(+) > > > > > > > > > > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst > > > > b/Documentation/filesystems/xfs-online-fsck-design.rst > > > > index 419eb54ee200..9d7a2ef1d0dd 100644 > > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst > > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst > > > > @@ -383,6 +383,8 @@ Algorithms") of Srinivasan. > > > > However, any data structure builder that maintains a resource > > > > lock > > > > for the > > > > duration of the repair is *always* an offline algorithm. > > > > > > > > +.. _secondary_metadata: > > > > + > > > > Secondary Metadata > > > > `````````````````` > > > > > > > > @@ -1746,3 +1748,491 @@ Scrub teardown disables all static keys > > > > obtained by ``xchk_fshooks_enable``. > > > > > > > > For more information, please see the kernel documentation of > > > > Documentation/staging/static-keys.rst. > > > > + > > > > +.. _xfile: > > > > + > > > > +Pageable Kernel Memory > > > > +---------------------- > > > > + > > > > +Demonstrations of the first few prototypes of online repair > > > > revealed > > > > new > > > > +technical requirements that were not originally identified. > > > > +For the first demonstration, the code walked whatever filesystem > > > > +metadata it needed to synthesize new records and inserted > > > > records > > > > into a new > > > > +btree as it found them. > > > > +This was subpar since any additional corruption or runtime > > > > errors > > > > encountered > > > > +during the walk would shut down the filesystem. > > > > +After remount, the blocks containing the half-rebuilt data > > > > structure > > > > would not > > > > +be accessible until another repair was attempted. > > > > +Solving the problem of half-rebuilt data structures will be > > > > discussed in the > > > > +next section. > > > > + > > > > +For the second demonstration, the synthesized records were > > > > instead > > > > stored in > > > > +kernel slab memory. > > > > +Doing so enabled online repair to abort without writing to the > > > > filesystem if > > > > +the metadata walk failed, which prevented online fsck from > > > > making > > > > things worse. > > > > +However, even this approach needed improving upon. > > > > + > > > > +There are four reasons why traditional Linux kernel memory > > > > management isn't > > > > +suitable for storing large datasets: > > > > + > > > > +1. Although it is tempting to allocate a contiguous block of > > > > memory > > > > to create a > > > > + C array, this cannot easily be done in the kernel because it > > > > cannot be > > > > + relied upon to allocate multiple contiguous memory pages. > > > > + > > > > +2. 
While disparate physical pages can be virtually mapped > > > > together, > > > > installed > > > > + memory might still not be large enough to stage the entire > > > > record > > > > set in > > > > + memory while constructing a new btree. > > > > + > > > > +3. To overcome these two difficulties, the implementation was > > > > adjusted to use > > > > + doubly linked lists, which means every record object needed > > > > two > > > > 64-bit list > > > > + head pointers, which is a lot of overhead. > > > > + > > > > +4. Kernel memory is pinned, which can drive the system out of > > > > memory, leading > > > > + to OOM kills of unrelated processes. > > > > + > > > I think I maybe might just jump to whatever the current plan is > > > instead of trying to keep a record of the dev history in the > > > document. > > > I'm sure we're not done yet, dev really never is, so in order for > > > the > > > documentation to be maintained, it would just get bigger and bigger > > > to > > > keep documenting it this way. It's not that the above isn't > > > valuable, > > > but maybe a different kind of document really. > > > > OK, I've shortened this introduction to outline the requirements, and > > trimmed the historical information to a sidebar: > > > > "Some online checking functions work by scanning the filesystem to > > build > > a shadow copy of an ondisk metadata structure in memory and comparing > > the two copies. For online repair to rebuild a metadata structure, it > > must compute the record set that will be stored in the new structure > > before it can persist that new structure to disk. Ideally, repairs > > complete with a single atomic commit that introduces a new data > > structure. To meet these goals, the kernel needs to collect a large > > amount of information in a place that doesn’t require the correct > > operation of the filesystem. > > > > "Kernel memory isn’t suitable because: > > > > * Allocating a contiguous region of memory to create a C array is > > very > > difficult, especially on 32-bit systems. > > > > * Linked lists of records introduce double pointer overhead, which > > is > > very high, and eliminate the possibility of indexed lookups. > > > > * Kernel memory is pinned, which can drive the system into OOM > > conditions. > > > > * The system might not have sufficient memory to stage all the > > information. > > > > "At any given time, online fsck does not need to keep the entire > > record > > set in memory, which means that individual records can be paged out > > if > > necessary. Continued development of online fsck demonstrated that the > > ability to perform indexed data storage would also be very useful. > > Fortunately, the Linux kernel already has a facility for > > byte-addressable and pageable storage: tmpfs. In-kernel graphics > > drivers > > (most notably i915) take advantage of tmpfs files to store > > intermediate > > data that doesn’t need to be in memory at all times, so that usage > > precedent is already established. Hence, the xfile was born! > > > > Historical Sidebar > > ------------------ > > > > "The first edition of online repair inserted records into a new btree > > as > > it found them, which failed because the filesystem could shut down with > > a half-built data structure, which would be live after recovery finished. > > > > "The second edition solved the half-rebuilt structure problem by > > storing > > everything in memory, but frequently ran the system out of memory.
> > > > "The third edition solved the OOM problem by using linked lists, but > > the > > list overhead was extreme." > Ok, I think that's cleaner > > > > > > > > > > > > > +For the third iteration, attention swung back to the possibility > > > > of > > > > using > > > > > > Due to the large volume of metadata that needs to be processed, > > > ofsck > > > uses... > > > > > > > +byte-indexed array-like storage to reduce the overhead of in- > > > > memory > > > > records. > > > > +At any given time, online repair does not need to keep the > > > > entire > > > > record set in > > > > +memory, which means that individual records can be paged out. > > > > +Creating new temporary files in the XFS filesystem to store > > > > intermediate data > > > > +was explored and rejected for some types of repairs because a > > > > filesystem with > > > > +compromised space and inode metadata should never be used to fix > > > > compromised > > > > +space or inode metadata. > > > > +However, the kernel already has a facility for byte-addressable > > > > and > > > > pageable > > > > +storage: shmfs. > > > > +In-kernel graphics drivers (most notably i915) take advantage of > > > > shmfs files > > > > +to store intermediate data that doesn't need to be in memory at > > > > all > > > > times, so > > > > +that usage precedent is already established. > > > > +Hence, the ``xfile`` was born! > > > > + > > > > +xfile Access Models > > > > +``````````````````` > > > > + > > > > +A survey of the intended uses of xfiles suggested these use > > > > cases: > > > > + > > > > +1. Arrays of fixed-sized records (space management btrees, > > > > directory > > > > and > > > > + extended attribute entries) > > > > + > > > > +2. Sparse arrays of fixed-sized records (quotas and link counts) > > > > + > > > > +3. Large binary objects (BLOBs) of variable sizes (directory and > > > > extended > > > > + attribute names and values) > > > > + > > > > +4. Staging btrees in memory (reverse mapping btrees) > > > > + > > > > +5. Arbitrary contents (realtime space management) > > > > + > > > > +To support the first four use cases, high level data structures > > > > wrap > > > > the xfile > > > > +to share functionality between online fsck functions. > > > > +The rest of this section discusses the interfaces that the xfile > > > > presents to > > > > +four of those five higher level data structures. > > > > +The fifth use case is discussed in the :ref:`realtime summary > > > > <rtsummary>` case > > > > +study. > > > > + > > > > +The most general storage interface supported by the xfile > > > > enables > > > > the reading > > > > +and writing of arbitrary quantities of data at arbitrary offsets > > > > in > > > > the xfile. > > > > +This capability is provided by ``xfile_pread`` and > > > > ``xfile_pwrite`` > > > > functions, > > > > +which behave similarly to their userspace counterparts. > > > > +XFS is very record-based, which suggests that the ability to > > > > load > > > > and store > > > > +complete records is important. > > > > +To support these cases, a pair of ``xfile_obj_load`` and > > > > ``xfile_obj_store`` > > > > +functions are provided to read and persist objects into an > > > > xfile. > > > > +They are internally the same as pread and pwrite, except that > > > > they > > > > treat any > > > > +error as an out of memory error. 
> > > > +For online repair, squashing error conditions in this manner is > > > > an > > > > acceptable > > > > +behavior because the only reaction is to abort the operation > > > > back to > > > > userspace. > > > > +All five xfile use cases can be serviced by these four functions. > > > > + > > > > +However, no discussion of file access idioms is complete without > > > > answering the > > > > +question, "But what about mmap?" > > > I actually wouldn't spend too much time discussing solutions that > > > didn't work for whatever reason, unless someone's really asking for > > > it. > > > I think this section would read just fine with the last > > > paragraph here trimmed off > > > > Since I wrote this, I've been experimenting with wiring up the tmpfs > > file page cache folios to the xfs buffer cache. Pinning the folios > > in > > this manner makes it so that online fsck can (more or less) directly > > access the xfile contents. Much to my surprise, this has actually > > held > > up in testing, so ... it's no longer a solution that "didn't really > > work". :) > > > > I also need to s/page/folio/ now that willy has finished that > > conversion. This section has been rewritten as such: > > > > "However, no discussion of file access idioms is complete without > > answering the question, “But what about mmap?” It is convenient to > > access storage directly with pointers, just like userspace code does > > with regular memory. Online fsck must not drive the system into OOM > > conditions, which means that xfiles must be responsive to memory > > reclamation. tmpfs can only push a pagecache folio to the swap cache > > if > > the folio is neither pinned nor locked, which means the xfile must > > not > > pin too many folios. > > > > "Short term direct access to xfile contents is done by locking the > > pagecache folio and mapping it into kernel address space. > > Programmatic > > access (e.g. pread and pwrite) uses this mechanism. Folio locks are > > not > > supposed to be held for long periods of time, so long term direct > > access > > to xfile contents is done by bumping the folio refcount, mapping it > > into > > kernel address space, and dropping the folio lock. These long term > > users > > must be responsive to memory reclaim by hooking into the shrinker > > infrastructure to know when to release folios. > > > > "The xfile_get_page and xfile_put_page functions are provided to > > retrieve the (locked) folio that backs part of an xfile and to > > release > > it. The only users of these folio lease functions are the xfarray > > sorting algorithms and the in-memory btrees." > Alrighty, sounds like a good update then > > > > > > > +It would be *much* more convenient if kernel code could access > > > > pageable kernel > > > > +memory with pointers, just like userspace code does with regular > > > > memory. > > > > +Like any other filesystem that uses the page cache, reads and > > > > writes > > > > of xfile > > > > +data lock the cache page and map it into the kernel address > > > > space > > > > for the > > > > +duration of the operation. > > > > +Unfortunately, shmfs can only write a file page to the swap > > > > device > > > > if the page > > > > +is unmapped and unlocked, which means the xfile risks causing > > > > OOM > > > > problems > > > > +unless it is careful not to pin too many pages. > > > > +Therefore, the xfile steers most of its users towards > > > > programmatic > > > > access so > > > > +that backing pages are not kept locked in memory for longer than > > > > is > > > > necessary.
> > > > +However, for callers performing quick linear scans of xfile > > > > data, > > > > +``xfile_get_page`` and ``xfile_put_page`` functions are provided > > > > to > > > > pin a page > > > > +in memory. > > > > +So far, the only code to use these functions is the xfarray > > > > :ref:`sorting > > > > +<xfarray_sort>` algorithms. > > > > + > > > > +xfile Access Coordination > > > > +````````````````````````` > > > > + > > > > +For security reasons, xfiles must be owned privately by the > > > > kernel. > > > > +They are marked ``S_PRIVATE`` to prevent interference from the > > > > security system, > > > > +must never be mapped into process file descriptor tables, and > > > > their > > > > pages must > > > > +never be mapped into userspace processes. > > > > + > > > > +To avoid locking recursion issues with the VFS, all accesses to > > > > the > > > > shmfs file > > > > +are performed by manipulating the page cache directly. > > > > +xfile writes call the ``->write_begin`` and ``->write_end`` > > > > functions of the > > > > +xfile's address space to grab writable pages, copy the caller's > > > > buffer into the > > > > +page, and release the pages. > > > > +xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages > > > xfile readers > > > > OK. > > > > > > directly before > > > > +copying the contents into the caller's buffer. > > > > +In other words, xfiles ignore the VFS read and write code paths > > > > to > > > > avoid > > > > +having to create a dummy ``struct kiocb`` and to avoid taking > > > > inode > > > > and > > > > +freeze locks. > > > > + > > > > +If an xfile is shared between threads to stage repairs, the > > > > caller > > > > must provide > > > > +its own locks to coordinate access. > > > Ofsck threads that share an xfile between stage repairs will use > > > their > > > own locks to coordinate access with each other. > > > > > > ? > > > > Hm. I wonder if there's a misunderstanding here? > > > > Online fsck functions themselves are single-threaded, which is to say > > that they themselves neither queue workers nor start kthreads. > > However, > > an xfile created by a running fsck function can be accessed from > > other > > threads if the fsck function also hooks itself into filesystem code. > > > > The live update section has a nice diagram of how that works: > > https://djwong.org/docs/xfs-online-fsck-design/#filesystem-hooks > > > > Oh ok, I think I got hung up on who the callers were. How about > "xfiles shared between threads running from hooked filesystem functions > will use their own locks to coordinate access with each other." I don't want to mention filesystem hooks before the chapter that introduces them. How about: "For example, if a scrub function stores scan results in an xfile and needs other threads to provide updates to the scanned data, the scrub function must provide a lock for all threads to share."
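To make that concrete, here's a rough sketch of the idiom I have in mind. The xfile_obj_store call is the object storage function discussed earlier in this section (though I'm guessing at the exact argument order here), and the xchk_* structure, helper, and record type names are invented purely for illustration:

/* Purely illustrative shadow record type. */
struct xchk_rec {
	u64			ino;
	u64			count;
};

/* Hypothetical scan data shared by a scrub function and its updaters. */
struct xchk_scan_data {
	struct xfile		*xf;	/* shadow records live here */
	struct mutex		lock;	/* provided by the scrub function */
};

/*
 * Store one shadow record.  Every thread that touches the scan data,
 * whether it's the scanning thread itself or some other updater, must
 * take the same lock before reading or writing the xfile.
 */
static int
xchk_scan_store_rec(
	struct xchk_scan_data	*sd,
	u64			recno,
	const struct xchk_rec	*rec)
{
	int			error;

	mutex_lock(&sd->lock);
	error = xfile_obj_store(sd->xf, rec, sizeof(*rec),
			recno * sizeof(*rec));
	mutex_unlock(&sd->lock);
	return error;
}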
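Similarly, the short-term page lease idiom might look like the following sketch. xfile_get_page and xfile_put_page are the functions described above, but the signatures are my best guess and the walker function is made up for the example:

/*
 * Hypothetical linear scan of xfile contents, one page lease at a
 * time.  xfile_get_page returns the locked page backing that part of
 * the xfile; putting it back promptly keeps the page reclaimable so
 * that tmpfs can push it to swap if memory gets tight.
 */
static int
xchk_walk_xfile(
	struct xfile		*xf,
	loff_t			pos,
	size_t			len)
{
	struct page		*page;
	void			*p;
	size_t			count;

	while (len > 0) {
		count = min_t(size_t, len,
				PAGE_SIZE - offset_in_page(pos));

		page = xfile_get_page(xf, pos, count, 0);
		if (IS_ERR(page))
			return PTR_ERR(page);

		p = kmap_local_page(page);
		/* ...examine @count bytes at p + offset_in_page(pos)... */
		kunmap_local(p);

		xfile_put_page(xf, page);
		pos += count;
		len -= count;
	}
	return 0;
}

--D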