Re: Fedora 34 Change: DNF/RPM Copy on Write enablement for all variants (System-Wide Change)

Matthew Almond via devel <devel@xxxxxxxxxxxxxxxxxxxxxxx> · Thu, 24 Dec 2020 21:54:43 +0000

On Wed, 2020-12-23 at 19:23 -0500, James Cassell wrote:
> > # Resolve packaging request into a list of packages and operations
> > # Download and '''decompress''' packages into a '''locally
> > optimized''' rpm file
> 
> Please verify the signature on the downloaded RPM before
> decompressing it.  (Do we do this already?)
> 

We have an opportunity to do the verification during download, but I'm
not keen on it for two major reasons:

1. The transcoder would need to open the rpmdb to perform the
   verification, adding a fair amount of complexity.
2. (More crucially) I observe that dnf downloads packages and
   signatures before asking whether to trust them. This order of events
   means we can't be confident that all signatures are in the rpmdb
   yet.

My proposal is for enabling CoW with dnf. The code change to do
transcoding is in librepo as part of the generic file download
mechanism. librepo does verify downloads against the digest recorded
in the repo metadata.

The transcoder produces a different series of bits to be written to
disk, so how could that verification work? It turns out the answer is
easy: the transcoder sees the original bits on its input, so we
calculate the digest of the bits received from the yum server and
record it in the footer of the transcoded rpm. I've modified
lr_checksum_fd() in librepo to look for this digest before using the
xattr cache or reading the whole file again. The whole-file digest can
only be located if the footer itself is complete.
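
A minimal sketch of the "digest the input, write something else" idea, in Python for illustration only. Assumptions: zlib decompression stands in for the real transcoder (rpm2extents does far more), the footer is a made-up trailer of a hex SHA-256 plus an 8-byte magic rather than the real extents footer, and verify_download() is a hypothetical helper that only mirrors the order of checks described for lr_checksum_fd(), not its code.

```python
import hashlib
import io
import zlib

# Hypothetical footer for this sketch only (NOT the rpm2extents format):
#   <64-byte hex SHA-256 of the bytes as downloaded> <8-byte magic>
MAGIC = b"DEMOFOOT"
HEX_LEN = 64  # length of a SHA-256 hex digest


def transcode_with_digest(src, dst, chunk_size=64 * 1024):
    """Write a transformed copy of src to dst, hashing the ORIGINAL bytes."""
    hasher = hashlib.sha256()
    decomp = zlib.decompressobj()  # stand-in for the real transcoder
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)                 # digest covers the bits as received
        dst.write(decomp.decompress(chunk))  # different bits are written to disk
    dst.write(decomp.flush())
    digest = hasher.hexdigest()
    dst.write(digest.encode("ascii") + MAGIC)  # footer goes last
    return digest


def digest_from_footer(f):
    """Return the recorded digest, or None if no complete footer is present."""
    size = f.seek(0, io.SEEK_END)
    if size < HEX_LEN + len(MAGIC):
        return None
    f.seek(size - HEX_LEN - len(MAGIC))
    tail = f.read(HEX_LEN + len(MAGIC))
    if tail[HEX_LEN:] != MAGIC:
        return None
    return tail[:HEX_LEN].decode("ascii", errors="replace")


def verify_download(f, expected_sha256):
    """Use the footer digest if present, otherwise hash the whole file."""
    recorded = digest_from_footer(f)
    if recorded is None:
        f.seek(0)
        recorded = hashlib.sha256(f.read()).hexdigest()
    return recorded == expected_sha256
```

A truncated transcoded file has no complete trailer, so verify_download() falls back to hashing the whole file, which then fails to match the repo's digest.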

The digest is actually a list of digests. The default value in
createrepo_c is SHA256 (irrespective of what digest algorithm is used
to identify/verify files in each rpm), and for now the dnf plugin passes
"SHA256" to the transcoder statically. This is ultimately repo-specific.
I hope to eliminate the hard-coding later if there's a signal within
librepo to choose the right digest algorithm for the specific repo.
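
Because the footer holds a list of digests, the transcoder can hash the incoming bytes with several algorithms in a single pass, and the verifier can pick whichever one the repo uses. A sketch under assumptions: MultiDigest and pick_digest are hypothetical names, and algorithm names follow Python's hashlib (for example "sha256") rather than any librepo or createrepo_c constant.

```python
import hashlib


class MultiDigest:
    """Compute several digests in one pass over a stream (sketch)."""

    def __init__(self, algorithms):
        # hashlib.new raises ValueError for an unknown algorithm name
        self._hashers = {name.lower(): hashlib.new(name.lower()) for name in algorithms}

    def update(self, chunk):
        for h in self._hashers.values():
            h.update(chunk)

    def as_list(self):
        """The (algorithm, hexdigest) pairs that would be stored in the footer."""
        return [(name, h.hexdigest()) for name, h in self._hashers.items()]


def pick_digest(recorded, repo_algorithm):
    """Return the recorded digest matching the repo's algorithm, or None."""
    wanted = repo_algorithm.lower()
    for name, value in recorded:
        if name == wanted:
            return value
    return None
```

Passing a single algorithm, as the plugin currently does with SHA256, is just the one-element case.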

The job of actually verifying the signature falls through to rpm as it
did before. As stated in the proposal: the headers (lead, signature,
main header) are completely untouched, so the gpg based signature is
still verified as before, and at the same point in time.

> > # Install and/or upgrade packages sequentially using RPM files, using
> > '''reference linking''' (reflinking) to reuse data already on disk.
> 
> Sounds like a great improvement!  Any real-world data on how much
> time it saves, how much it changes disk usage, or how much SSD writes
> it saves?
> 

Forthcoming! I've got some numbers I've used internally at Facebook to
talk about this. To do this I had to write another rpm plugin to
measure how much time was spent on decompressing and writing data. I'm
planning on improving this and open-sourcing that too. The goal here is
to produce some publicly reproducible numbers.

> > The outcome is intended to be the same, but the order of operations is
> > different.
> > 
> > # Decompression happens inline with download. This has a positive
> > effect on resource usage: downloads are typically limited by
> > bandwidth. Decompression and writing the full data into a single file
> > per rpm is essentially free. Additionally: if there is more than one
> > download at a time, a multi-CPU system can be better utilized. All
> > compression types supported in RPM work because this uses the rpm I/O
> > functions.
> 
> I referenced above, I think each chunk should also be verified before
> decompressing.
> 

This is certainly possible, but not implemented. My thinking here is
that the full rpm file digest enforced for files downloaded with
dnf/librepo also covers this. The only gain here would be for a
damaged rpm to fail faster during transcoding. I consider this a
pretty minor optimization.

> > # RPMs are cached on local storage between downloading and
> > installation time as normal. This allows DNF to defer actual RPM
> > installation to when all the RPM are available. This is unchanged.
> > # The file format for RPMs is different with Copy on Write. The
> > headers are identical, but the payload is different. There is also a
> > footer.
> > ## Files are converted (“transcoded”) locally during download using
> > <code>/usr/bin/rpm2extents</code> (part of rpm codebase). The format
> > is not intended to be “portable” - i.e. copying the files from the
> > cache is not supported.
> 
> I think these should be made to be portable.  How many variants of
> these are there?  Would it be difficult to make the transcoder also
> understand RPMs transcoded for a different
> platform/setup?  Eventually, I'd like to see additional signatures
> added to the RPM for each of the variants so RPM itself can do the
> verification at install time, avoiding a transcode to the "canonical"
> format.  (I suppose this might require a build-time or sign-time
> transcode to each of the other variants.)  Until then, I'd like to
> ensure that the package signatures are being verified in a secure
> manner, which would be necessary for the plugin to be able to install
> packages not built with multiple signatures/digests.
> 
> Would it be practical to just have a single format aligned to the
> largest page size known, leaving fs holes as necessary on systems
> with smaller page sizes?
> 

I'm not keen on making the transcoded rpms portable because they're
usually twice the size of the original/archive. If you want to share
these between systems, I think running a caching web proxy or an
explicit internal mirror is the more common way to do this.

Defaulting the alignment to 64k or some higher value would yield a lot
of wasted space per file unless holes were used. I've not experimented
with this, but it's an interesting idea.
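
The space trade-off is simple arithmetic: every file is padded up to the next alignment boundary, and identical contents are stored once because they are keyed by digest. The sketch below is hypothetical (it is not the rpm2extents layout code) and pads after every file, including the last. For one 100-byte file and one 5000-byte file, 4 KiB alignment needs 7188 padding bytes, while 64 KiB alignment needs 125972.

```python
import hashlib


def align_up(n, alignment):
    """Round n up to the next multiple of alignment (a power of two)."""
    return (n + alignment - 1) & ~(alignment - 1)


def layout(files, alignment=4096, start=0):
    """Place unique file contents at aligned offsets.

    files: iterable of bytes (file contents).
    Returns (table, padding): table maps sha256 hex -> (offset, size),
    padding is the number of filler bytes needed to keep files aligned.
    """
    table = {}
    offset = align_up(start, alignment)
    padding = offset - start
    for data in files:
        key = hashlib.sha256(data).hexdigest()
        if key in table:  # identical content is stored only once
            continue
        table[key] = (offset, len(data))
        end = offset + len(data)
        next_offset = align_up(end, alignment)
        padding += next_offset - end  # pad after every file, including the last
        offset = next_offset
    return table, padding
```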

I've covered the transcoded file digest(s) above. This is repo-driven,
not rpm-driven. RPM signatures get enforced as normal.

> 
> > ## Regular RPMs use a compressed .cpio based payload. In contrast,
> > extent based RPMs contain uncompressed data aligned to the fundamental
> > page size of the architecture, e.g. 4KiB on x86_64. This alignment is
> > required for <code>FICLONERANGE</code> to work. Only files are
> > represented in the payload, other directory entries like symlinks,
> > device nodes etc are constructed entirely from rpm header information.
> > Files are referenced by their digest, so identical files are
> > de-duplicated.
> 
> How are hardlinks in an RPM handled?  Do they stay as hardlinks or
> become reflinks only, losing the hardlink status?  They should stay
> hardlinks, in my opinion.

This is a great question: everything ends up being bit-for-bit
identical to systems without this feature enabled. Making this work was
an interesting challenge, and I'm pretty happy with how it turned out.
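
The reflink-or-copy step from the proposal (FICLONERANGE over page-aligned payload data, with plain copying across filesystems/subvolumes or on filesystems without reflink support) can be sketched in Python on Linux. This is not rpm-plugin-reflink's code: clone_or_copy is a hypothetical name, the ioctl number is the value from <linux/fs.h> (used only if the fcntl module does not export FICLONERANGE itself), and the set of errno values treated as "fall back to copying" is an assumption.

```python
import errno
import fcntl
import os
import struct

# _IOW(0x94, 13, struct file_clone_range) from <linux/fs.h>
FICLONERANGE = getattr(fcntl, "FICLONERANGE", 0x4020940D)

# Assumed set of errors meaning "reflink not possible here, copy instead".
_FALLBACK_ERRNOS = {errno.EXDEV, errno.EOPNOTSUPP, errno.EINVAL,
                    errno.ENOTTY, errno.ENOSYS}


def clone_or_copy(src_path, src_offset, length, dst_path):
    """Create dst_path from src_path[src_offset:src_offset + length].

    Tries a reflink first and falls back to an ordinary copy.
    Returns True if the range was reflinked, False if it was copied.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        if length == 0:
            return False  # src_length == 0 would mean "clone to EOF"
        # struct file_clone_range { __s64 src_fd; __u64 src_offset;
        #                           __u64 src_length; __u64 dest_offset; }
        arg = struct.pack("qQQQ", src.fileno(), src_offset, length, 0)
        try:
            fcntl.ioctl(dst.fileno(), FICLONERANGE, arg)
            return True
        except OSError as e:
            if e.errno not in _FALLBACK_ERRNOS:
                raise
        # Fallback: copy the bytes through user space.
        src.seek(src_offset)
        remaining = length
        while remaining:
            chunk = src.read(min(remaining, 1 << 20))
            if not chunk:
                raise EOFError("source ended before the requested range")
            dst.write(chunk)
            remaining -= len(chunk)
        return False
```

The kernel rejects a length that is not block-aligned (unless the range ends at the source's EOF) with EINVAL, so in this sketch such ranges are simply copied; that is one reason the payload is page-aligned.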

> 
> > ## The footer currently has three sections
> > ### Table of original (rpm) file digests, used to validate the
> > integrity of the download in dnf.
> > ### Table of digest → offset used when actually installing files.
> > ### Signature 8 bytes at the end of the file, used to differentiate
> > between traditional RPMs and extent based.
> 
> I think this magic number "signature" should vary based on the items
> that cause the format to change.
> 

The footer contains a list of digests for the source file verification,
a list of digests -> offsets, and the signature itself. Some kind of
versioning is possible, but I've not encountered a need to cross that
bridge yet (I'm trying to avoid premature optimization when I don't
have a good use case yet).
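
The three sections, with the 8-byte signature last, let a reader classify a file by looking only at its end. A sketch with an invented encoding: the magic value, the JSON sections and the two length fields are hypothetical and do not describe the actual rpm2extents footer.

```python
import json
import struct

# Hypothetical values; the real rpm2extents encoding is not reproduced here.
MAGIC = b"XTNTDEMO"
TRAILER = struct.Struct(">QQ8s")  # len(section 1), len(section 2), magic


def build_footer(source_digests, offsets):
    """Serialize the two tables followed by a fixed-size trailer."""
    s1 = json.dumps(source_digests).encode()  # original (download) digests
    s2 = json.dumps(offsets).encode()         # file digest -> payload offset
    return s1 + s2 + TRAILER.pack(len(s1), len(s2), MAGIC)


def parse_footer(blob):
    """Return (source_digests, offsets), or None for a traditional rpm."""
    if len(blob) < TRAILER.size:
        return None
    end = len(blob) - TRAILER.size
    n1, n2, magic = TRAILER.unpack_from(blob, end)
    if magic != MAGIC or n1 + n2 > end:
        return None
    s2 = blob[end - n2:end]
    s1 = blob[end - n2 - n1:end - n2]
    return json.loads(s1), json.loads(s2)
```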

> What happens if you try to use a transcoded RPM on a non-compatible
> system?
> 

It depends on how it got there and what you asked for. Here are some
examples:

1. cp foo.rpm /var/cache/dnf/<repo>/Packages/ && dnf install foo
   ...will fail the librepo full-file check, and the package will be
   re-downloaded.
2. dnf install /root/foo.rpm (or rpm -i /root/foo.rpm)
   ...will likely fail with a CPIO/payload error (not actually tested).

Note that tools like rpm2cpio and rpm2archive will also fail on
transcoded rpms. I have an open task to make the dnf plugin not
transcode with 'yumdownloader' or 'dnf download' (plugin), as those are
reasonable commands to run. I will look at making error messages better
and/or making some of these use cases work.
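
One way the two follow-ups could look, sketched in Python with hypothetical names throughout: should_transcode() skips transcoding for download-only commands, and check_portable() turns a transcoded file into an explicit error instead of a later CPIO/payload failure. The trailing magic is made up for the sketch; a real check would use whatever signature rpm2extents actually writes.

```python
import os

# Hypothetical 8-byte trailer; not the value rpm2extents actually writes.
MAGIC = b"XTNTDEMO"
# Hypothetical policy: commands whose output should stay a portable .rpm.
DOWNLOAD_ONLY_COMMANDS = {"download", "yumdownloader"}


def should_transcode(command):
    """Transcode only when packages are fetched to be installed locally."""
    return command not in DOWNLOAD_ONLY_COMMANDS


def check_portable(path):
    """Raise a clear error for a transcoded rpm instead of failing later."""
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)
        if size >= len(MAGIC):
            f.seek(size - len(MAGIC))
            if f.read() == MAGIC:
                raise ValueError(
                    f"{path} is a locally transcoded rpm from the dnf cache and "
                    "is not portable; download the original package instead"
                )
```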

> > === Notes ===
> > 
> > # The headers are preserved bit for bit during transcoding. This
> > preserves signatures. The signatures cover the main header blob, and
> > the main header blob ensures the integrity of data in two ways:
> > ## Each file with content has a digest. Originally this was md5, but
> > today it’s usually sha256. In normal RPM this is only used to verify
> > the integrity of files, e.g. <code>rpm -V</code>. With CoW we use this
> > as a content key.
> > ## There is/are one or two digests (<code>PAYLOADDIGEST</code> and
> > <code>PAYLOADDIGESTALT</code>) covering the payload archive
> > (compressed cpio). The header value is preserved, but transcoded RPMs
> > do not preserve the original structure so RPM’s pre-installation
> > verification (controlled by <code>%_pkgverify_level</code> will fail.
> > <code>dnf-plugin-cow</code> disables this check in dnf because it
> > verifies the whole file digest which is captured during
> > download/transcoding. The second one is likely used for delta rpm.
> > # This is untested, and possibly incompatible with delta RPM (drpm).
> > The process for reconstructing an rpm to install from a delta is
> > expensive from both a CPU and I/O perspective, while only providing
> > marginal benefits on download size. It is expected that having delta
> > rpm enabled (which is the default) will be handled gracefully.
> 
> https://github.com/rpm-software-management/rpm/pull/880 added
> DIGESTALT, apparently to help reduce this CPU usage problem.  I don't
> know if it's actually used by anything, but it is much newer than I'd
> have guessed (2019 October).

I don't see a straightforward way to use DIGESTALT. I think the
transcoded file-level digest is a decent way to detect a corrupt or
tampered file, and when the rpm is installed using dnf, you also get a
verification step that checks the files. DIGESTALT helps provide a way
to reject a local rpm before trying to install it.
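
Since each file's digest from the (untouched, signed) header doubles as its content key, installation can look up the data and re-check it in the same step. A minimal in-memory sketch, with extract_and_verify as a hypothetical name and SHA-256 assumed as the header's file digest algorithm; it only illustrates the idea and is not rpm's code.

```python
import hashlib


def extract_and_verify(blob, offsets, header_files):
    """Materialize files by content key and check each one, rpm -V style.

    blob:         transcoded payload bytes (in-memory stand-in for the file)
    offsets:      digest -> offset table, as stored in the footer
    header_files: (path, size, sha256 hex) tuples taken from the rpm header
    Returns {path: bytes}; raises ValueError on the first problem.
    """
    out = {}
    for path, size, digest in header_files:
        start = offsets.get(digest)  # the file digest is the lookup key
        if start is None:
            raise ValueError(f"no payload data for {path}")
        data = blob[start:start + size]
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"digest mismatch for {path}")
        out[path] = data
    return out
```

Two paths with identical content resolve to the same offset, which is the de-duplication described in the proposal.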

> 
> > # Disk space requirements are expected to be marginally higher than
> > before: all new packages or updates will consume their installed size
> > before installation instead of about half their size (regular rpms
> > with payloads still cost space).
> > # <code>rpm-plugin-reflink</code> will fall back to simple file
> > copying when the destination path is not on the same
> > filesystem/subvolume. A common example is <code>/boot</code> and/or
> > <code>/boot/efi</code>.
> > # The system will still work on other filesystem types, but will
> > ''always'' fall back to simple copying. This is expected to be
> > slightly slower than not enabling CoW because the source for copying
> > will be the decompressed data.
> 
> Any testing to see the speed impact?

Only accidentally ;) You're simply *moving* the decompression time to
an earlier step, and then copying a lot more data bit by bit, so the
net effect depends strongly on CPU speed relative to I/O speed.
We found that overall, it was *slightly* faster.
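
A back-of-the-envelope model of that trade-off, not a measurement: it assumes decompression during download is hidden behind a bandwidth-limited transfer (as the proposal states) and ignores overlap between steps. With a reflink-capable filesystem the data terms at install time largely disappear; the model below is for the copy fallback.

```latex
% S_u   : uncompressed payload size of a package
% r_dec : decompression throughput (CPU bound)
% r_rd  : read throughput for the already-decompressed cached rpm
% r_wr  : write throughput of the destination filesystem
\begin{align*}
T_{\text{plain}} &\approx \frac{S_u}{r_{\text{dec}}} + \frac{S_u}{r_{\text{wr}}} \\
T_{\text{copy}}  &\approx \frac{S_u}{r_{\text{rd}}} + \frac{S_u}{r_{\text{wr}}} \\
T_{\text{copy}} < T_{\text{plain}} &\iff r_{\text{rd}} > r_{\text{dec}}
\end{align*}
```

So in this model the copy fallback wins only when reading the decompressed data back is faster than decompressing it, which depends on the CPU relative to the storage.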

> 
> > # For systems that enable transparent filesystem compression: every
> > file will continue to be decompressed from the original rpm, and then
> > transparently re-compressed by the filesystem. There is no effective
> > change here. There is a future project to investigate alternate
> > distribution mechanics to provide parallel versions of file content
> > pre-compressed in a filesystem specific format, reducing both CPU
> > costs and I/O. It is expected that this will result in slightly higher
> > network utilization because filesystem compression is purposely
> > restricted to allow random I/O.
> > # Current implementation of <code>dnf-plugin-cow</code> is in Python,
> > but it looks possible to implement this in <code>libdnf</code> instead
> > which would make it work in <code>packagekit</code>.
> > 
> > === Performance Metrics ===
> > 
> > Ballpark performance difference is about half the duration for file
> > download+install time. A lot of rpms are very small, so it’s difficult
> > to see/measure. Larger RPMs give much clearer signal.
> > 
> > (Actual numbers/charts will be supplied in Jan 2021)
> 
> Seems like a very nice optimization!  Thanks for working on it!

Thanks for the feedback! I'll try to incorporate these points into the
wiki in the new year - Matthew.
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx