Re: [PATCH 07/14] xfs: document pageable kernel memory

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 9 Feb 2023 15:14:49 -0800

On Thu, Feb 09, 2023 at 05:41:22AM +0000, Allison Henderson wrote:
> On Thu, 2023-02-02 at 15:14 -0800, Darrick J. Wong wrote:
> > On Thu, Feb 02, 2023 at 07:14:22AM +0000, Allison Henderson wrote:
> > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > > 
> > > > Add a discussion of pageable kernel memory, since online fsck
> > > > needs
> > > > quite a bit more memory than most other parts of the filesystem
> > > > to
> > > > stage
> > > > records and other information.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > > ---
> > > >  .../filesystems/xfs-online-fsck-design.rst         |  490
> > > > ++++++++++++++++++++
> > > >  1 file changed, 490 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > index 419eb54ee200..9d7a2ef1d0dd 100644
> > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > @@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
> > > >  However, any data structure builder that maintains a resource
> > > > lock
> > > > for the
> > > >  duration of the repair is *always* an offline algorithm.
> > > >  
> > > > +.. _secondary_metadata:
> > > > +
> > > >  Secondary Metadata
> > > >  ``````````````````
> > > >  
> > > > @@ -1746,3 +1748,491 @@ Scrub teardown disables all static keys
> > > > obtained by ``xchk_fshooks_enable``.
> > > >  
> > > >  For more information, please see the kernel documentation of
> > > >  Documentation/staging/static-keys.rst.
> > > > +
> > > > +.. _xfile:
> > > > +
> > > > +Pageable Kernel Memory
> > > > +----------------------
> > > > +
> > > > +Demonstrations of the first few prototypes of online repair
> > > > revealed
> > > > new
> > > > +technical requirements that were not originally identified.
> > > > +For the first demonstration, the code walked whatever filesystem
> > > > +metadata it needed to synthesize new records and inserted
> > > > records
> > > > into a new
> > > > +btree as it found them.
> > > > +This was subpar since any additional corruption or runtime
> > > > errors
> > > > encountered
> > > > +during the walk would shut down the filesystem.
> > > > +After remount, the blocks containing the half-rebuilt data
> > > > structure
> > > > would not
> > > > +be accessible until another repair was attempted.
> > > > +Solving the problem of half-rebuilt data structures will be
> > > > discussed in the
> > > > +next section.
> > > > +
> > > > +For the second demonstration, the synthesized records were
> > > > instead
> > > > stored in
> > > > +kernel slab memory.
> > > > +Doing so enabled online repair to abort without writing to the
> > > > filesystem if
> > > > +the metadata walk failed, which prevented online fsck from
> > > > making
> > > > things worse.
> > > > +However, even this approach needed improving upon.
> > > > +
> > > > +There are four reasons why traditional Linux kernel memory
> > > > management isn't
> > > > +suitable for storing large datasets:
> > > > +
> > > > +1. Although it is tempting to allocate a contiguous block of
> > > > memory
> > > > to create a
> > > > +   C array, this cannot easily be done in the kernel because it
> > > > cannot be
> > > > +   relied upon to allocate multiple contiguous memory pages.
> > > > +
> > > > +2. While disparate physical pages can be virtually mapped
> > > > together,
> > > > installed
> > > > +   memory might still not be large enough to stage the entire
> > > > record
> > > > set in
> > > > +   memory while constructing a new btree.
> > > > +
> > > > +3. To overcome these two difficulties, the implementation was
> > > > adjusted to use
> > > > +   doubly linked lists, which means every record object needed
> > > > two
> > > > 64-bit list
> > > > +   head pointers, which is a lot of overhead.
> > > > +
> > > > +4. Kernel memory is pinned, which can drive the system out of
> > > > memory, leading
> > > > +   to OOM kills of unrelated processes.
> > > > +
> > > I think I maybe might just jump to what ever the current plan is
> > > instead of trying to keep a record of the dev history in the
> > > document.
> > > I'm sure we're not done yet, dev really never is, so in order for
> > > the
> > > documentation to be maintained, it would just get bigger and bigger
> > > to
> > > keep documenting it this way.  It's not that the above isnt
> > > valuable,
> > > but maybe a different kind of document really.
> > 
> > OK, I've shortened this introduction to outline the requirements, and
> > trimmed the historical information to a sidebar:
> > 
> > "Some online checking functions work by scanning the filesystem to
> > build
> > a shadow copy of an ondisk metadata structure in memory and comparing
> > the two copies. For online repair to rebuild a metadata structure, it
> > must compute the record set that will be stored in the new structure
> > before it can persist that new structure to disk. Ideally, repairs
> > complete with a single atomic commit that introduces a new data
> > structure. To meet these goals, the kernel needs to collect a large
> > amount of information in a place that doesn’t require the correct
> > operation of the filesystem.
> > 
> > "Kernel memory isn’t suitable because:
> > 
> > *   Allocating a contiguous region of memory to create a C array is
> > very
> >     difficult, especially on 32-bit systems.
> > 
> > *   Linked lists of records introduce double pointer overhead which
> > is
> >     very high and eliminate the possibility of indexed lookups.
> > 
> > *   Kernel memory is pinned, which can drive the system into OOM
> >     conditions.
> > 
> > *   The system might not have sufficient memory to stage all the
> >     information.
> > 
> > "At any given time, online fsck does not need to keep the entire
> > record
> > set in memory, which means that individual records can be paged out
> > if
> > necessary. Continued development of online fsck demonstrated that the
> > ability to perform indexed data storage would also be very useful.
> > Fortunately, the Linux kernel already has a facility for
> > byte-addressable and pageable storage: tmpfs. In-kernel graphics
> > drivers
> > (most notably i915) take advantage of tmpfs files to store
> > intermediate
> > data that doesn’t need to be in memory at all times, so that usage
> > precedent is already established. Hence, the xfile was born!
> > 
> > Historical Sidebar
> > ------------------
> > 
> > "The first edition of online repair inserted records into a new btree
> > as
> > it found them, which failed because filesystem could shut down with a
> > built data structure, which would be live after recovery finished.
> > 
> > "The second edition solved the half-rebuilt structure problem by
> > storing
> > everything in memory, but frequently ran the system out of memory.
> > 
> > "The third edition solved the OOM problem by using linked lists, but
> > the
> > list overhead was extreme."
> Ok, I think that's cleaner
> 
> > 
> > > 
> > > 
> > > > +For the third iteration, attention swung back to the possibility
> > > > of
> > > > using
> > > 
> > > Due to the large volume of metadata that needs to be processed,
> > > ofsck
> > > uses...
> > > 
> > > > +byte-indexed array-like storage to reduce the overhead of in-
> > > > memory
> > > > records.
> > > > +At any given time, online repair does not need to keep the
> > > > entire
> > > > record set in
> > > > +memory, which means that individual records can be paged out.
> > > > +Creating new temporary files in the XFS filesystem to store
> > > > intermediate data
> > > > +was explored and rejected for some types of repairs because a
> > > > filesystem with
> > > > +compromised space and inode metadata should never be used to fix
> > > > compromised
> > > > +space or inode metadata.
> > > > +However, the kernel already has a facility for byte-addressable
> > > > and
> > > > pageable
> > > > +storage: shmfs.
> > > > +In-kernel graphics drivers (most notably i915) take advantage of
> > > > shmfs files
> > > > +to store intermediate data that doesn't need to be in memory at
> > > > all
> > > > times, so
> > > > +that usage precedent is already established.
> > > > +Hence, the ``xfile`` was born!
> > > > +
> > > > +xfile Access Models
> > > > +```````````````````
> > > > +
> > > > +A survey of the intended uses of xfiles suggested these use
> > > > cases:
> > > > +
> > > > +1. Arrays of fixed-sized records (space management btrees,
> > > > directory
> > > > and
> > > > +   extended attribute entries)
> > > > +
> > > > +2. Sparse arrays of fixed-sized records (quotas and link counts)
> > > > +
> > > > +3. Large binary objects (BLOBs) of variable sizes (directory and
> > > > extended
> > > > +   attribute names and values)
> > > > +
> > > > +4. Staging btrees in memory (reverse mapping btrees)
> > > > +
> > > > +5. Arbitrary contents (realtime space management)
> > > > +
> > > > +To support the first four use cases, high level data structures
> > > > wrap
> > > > the xfile
> > > > +to share functionality between online fsck functions.
> > > > +The rest of this section discusses the interfaces that the xfile
> > > > presents to
> > > > +four of those five higher level data structures.
> > > > +The fifth use case is discussed in the :ref:`realtime summary
> > > > <rtsummary>` case
> > > > +study.
> > > > +
> > > > +The most general storage interface supported by the xfile
> > > > enables
> > > > the reading
> > > > +and writing of arbitrary quantities of data at arbitrary offsets
> > > > in
> > > > the xfile.
> > > > +This capability is provided by ``xfile_pread`` and
> > > > ``xfile_pwrite``
> > > > functions,
> > > > +which behave similarly to their userspace counterparts.
> > > > +XFS is very record-based, which suggests that the ability to
> > > > load
> > > > and store
> > > > +complete records is important.
> > > > +To support these cases, a pair of ``xfile_obj_load`` and
> > > > ``xfile_obj_store``
> > > > +functions are provided to read and persist objects into an
> > > > xfile.
> > > > +They are internally the same as pread and pwrite, except that
> > > > they
> > > > treat any
> > > > +error as an out of memory error.
> > > > +For online repair, squashing error conditions in this manner is
> > > > an
> > > > acceptable
> > > > +behavior because the only reaction is to abort the operation
> > > > back to
> > > > userspace.
> > > > +All five xfile usecases can be serviced by these four functions.
> > > > +
> > > > +However, no discussion of file access idioms is complete without
> > > > answering the
> > > > +question, "But what about mmap?"
> > > I actually wouldn't spend too much time discussing solutions that
> > > didn't work for what ever reason, unless someones really asking for
> > > it.
> > >  I think this section would read just fine to trim off the last
> > > paragraph here
> > 
> > Since I wrote this, I've been experimenting with wiring up the tmpfs
> > file page cache folios to the xfs buffer cache.  Pinning the folios
> > in
> > this manner makes it so that online fsck can (more or less) directly
> > access the xfile contents.  Much to my surprise, this has actually
> > held
> > up in testing, so ... it's no longer a solution that "didn't really
> > work". :)
> > 
> > I also need to s/page/folio/ now that willy has finished that
> > conversion.  This section has been rewritten as such:
> > 
> > "However, no discussion of file access idioms is complete without
> > answering the question, “But what about mmap?” It is convenient to
> > access storage directly with pointers, just like userspace code does
> > with regular memory. Online fsck must not drive the system into OOM
> > conditions, which means that xfiles must be responsive to memory
> > reclamation. tmpfs can only push a pagecache folio to the swap cache
> > if
> > the folio is neither pinned nor locked, which means the xfile must
> > not
> > pin too many folios.
> > 
> > "Short term direct access to xfile contents is done by locking the
> > pagecache folio and mapping it into kernel address space.
> > Programmatic
> > access (e.g. pread and pwrite) uses this mechanism. Folio locks are
> > not
> > supposed to be held for long periods of time, so long term direct
> > access
> > to xfile contents is done by bumping the folio refcount, mapping it
> > into
> > kernel address space, and dropping the folio lock. These long term
> > users
> > must be responsive to memory reclaim by hooking into the shrinker
> > infrastructure to know when to release folios.
> > 
> > "The xfile_get_page and xfile_put_page functions are provided to
> > retrieve the (locked) folio that backs part of an xfile and to
> > release
> > it. The only code to use these folio lease functions are the xfarray
> > sorting algorithms and the in-memory btrees."
> Alrighty, sounds like a good upate then
> 
> > 
> > > > +It would be *much* more convenient if kernel code could access
> > > > pageable kernel
> > > > +memory with pointers, just like userspace code does with regular
> > > > memory.
> > > > +Like any other filesystem that uses the page cache, reads and
> > > > writes
> > > > of xfile
> > > > +data lock the cache page and map it into the kernel address
> > > > space
> > > > for the
> > > > +duration of the operation.
> > > > +Unfortunately, shmfs can only write a file page to the swap
> > > > device
> > > > if the page
> > > > +is unmapped and unlocked, which means the xfile risks causing
> > > > OOM
> > > > problems
> > > > +unless it is careful not to pin too many pages.
> > > > +Therefore, the xfile steers most of its users towards
> > > > programmatic
> > > > access so
> > > > +that backing pages are not kept locked in memory for longer than
> > > > is
> > > > necessary.
> > > > +However, for callers performing quick linear scans of xfile
> > > > data,
> > > > +``xfile_get_page`` and ``xfile_put_page`` functions are provided
> > > > to
> > > > pin a page
> > > > +in memory.
> > > > +So far, the only code to use these functions are the xfarray
> > > > :ref:`sorting
> > > > +<xfarray_sort>` algorithms.
> > > > +
> > > > +xfile Access Coordination
> > > > +`````````````````````````
> > > > +
> > > > +For security reasons, xfiles must be owned privately by the
> > > > kernel.
> > > > +They are marked ``S_PRIVATE`` to prevent interference from the
> > > > security system,
> > > > +must never be mapped into process file descriptor tables, and
> > > > their
> > > > pages must
> > > > +never be mapped into userspace processes.
> > > > +
> > > > +To avoid locking recursion issues with the VFS, all accesses to
> > > > the
> > > > shmfs file
> > > > +are performed by manipulating the page cache directly.
> > > > +xfile writes call the ``->write_begin`` and ``->write_end``
> > > > functions of the
> > > > +xfile's address space to grab writable pages, copy the caller's
> > > > buffer into the
> > > > +page, and release the pages.
> > > > +xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages
> > > xfile readers
> > 
> > OK.
> > 
> > > > directly before
> > > > +copying the contents into the caller's buffer.
> > > > +In other words, xfiles ignore the VFS read and write code paths
> > > > to
> > > > avoid
> > > > +having to create a dummy ``struct kiocb`` and to avoid taking
> > > > inode
> > > > and
> > > > +freeze locks.
> > > > +
> > > > +If an xfile is shared between threads to stage repairs, the
> > > > caller
> > > > must provide
> > > > +its own locks to coordinate access.
> > > Ofsck threads that share an xfile between stage repairs will use
> > > their
> > > own locks to coordinate access with each other.
> > > 
> > > ?
> > 
> > Hm.  I wonder if there's a misunderstanding here?
> > 
> > Online fsck functions themselves are single-threaded, which is to say
> > that they themselves neither queue workers nor start kthreads. 
> > However,
> > an xfile created by a running fsck function can be accessed from
> > other
> > thread if the fsck function also hooks itself into filesystem code.
> > 
> > The live update section has a nice diagram of how that works:
> > https://djwong.org/docs/xfs-online-fsck-design/#filesystem-hooks
> > 
> 
> Oh ok, I think I got hung up on who the callers were.  How about
> "xfiles shared between threads running from hooked filesystem functions
> will use their own locks to coordinate access with each other."

I don't want to mention filesystem hooks before the chapter that
introduces them.  How about:

"For example, if a scrub function stores scan results in an xfile and
needs other threads to provide updates to the scanned data, the scrub
function must provide a lock for all threads to share."

--D