Re: rpm hashes

Panu Matilainen <pmatilai@xxxxxxxxxxxxxxx> · Wed, 20 May 2009 11:19:08 +0300 (EEST)

On Thu, 14 May 2009, Adam Jackson wrote:

On Thu, 2009-05-14 at 10:46 +0300, Panu Matilainen wrote:
On Wed, 13 May 2009, Adam Jackson wrote:
It would have been really, _really_ nice if sha256 was merely another
hash that could be in the payload, instead of forcing you to pick one or
the other.  For that matter, it would still be really really nice.

Could it have been done that way? Yes, and if it were just per-package
hash then certainly it would've been done that way. But remember this is
per-file data, storing two (and when the day comes when sha256 is
considered insufficient, three etc) hashes per file adds a non-trivial
amount of header bloat.

32 bytes per file, plus another four for the header tag, unless I have
my math wildly wrong and/or I'm misremembering how hashes are stored.
My F11 machine has 430910 files over 2167 packages, so that extra
metadata comes to a massive 14.8M, compared to 11.6G of actual payload.
I have trouble getting worked up over this.
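
The arithmetic in the paragraph above can be checked with a short sketch. The file count and payload size are the figures quoted there. The 32-byte digest plus 4 bytes of per-file overhead is the poster's own guess at how hashes are stored, not a statement of rpm's actual header layout, and "11.6G" is assumed here to mean GiB.

```python
# Figures quoted in the message; the GiB interpretation is an assumption.
FILES = 430_910
PAYLOAD_BYTES = 11.6 * 1024**3


def extra_header_bytes(files, digest_bytes, per_file_overhead=4):
    """Extra metadata if every file carries one additional digest.

    per_file_overhead=4 is the poster's assumption, not a documented
    property of the rpm header format.
    """
    return files * (digest_bytes + per_file_overhead)


sha256_extra = extra_header_bytes(FILES, 32)  # 32-byte SHA-256 digest
sha512_extra = extra_header_bytes(FILES, 64)  # 64-byte SHA-512 digest

for name, extra in (("sha256", sha256_extra), ("sha512", sha512_extra)):
    mib = extra / 1024**2
    pct = 100 * extra / PAYLOAD_BYTES
    print(f"{name}: {mib:.1f} MiB extra, {pct:.2f}% of payload")
```

Under these assumptions the sha256 case comes to roughly 14.8 MiB, matching the figure quoted above. This is a total across all installed packages, so it says nothing on its own about any single package's header size.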

People scream BLOAT! for lesser issues. It's data that gets transferred
over the wire(less) over and over again, stored on disk in rpmdb (for the
average desktop/server it's completely irrelevant, but not so for smaller
devices)... and the header data size is (artificially) limited to 16MB.
Increasing that limit is possible and will sooner or later be necessary 
(people are occasionally hitting it already), but it's another 
incompatibility: all the widely deployed versions of rpm will think of 
a package with > 16MB header as corrupted, refusing to read it at all.

The point about having to store arbitrarily many hashes is certainly
fair, but a) sha512 is only twice as large as sha256, and 0.2% overhead
is still not a lot, b) that seems like a distro policy question.

Having the md5 hashes too would've been nice for backwards compatibility
but actually using them for file conflict calculations would mean (in
addition to the header bloat):
- considerable increase in memory use

I just don't buy this at all.  The checksums are computed as part of the
stdio stream, and any competent implementation of a SHA-like algorithm
requires storage that's O(n) on the size of the hash, not on the size of
the file.  So you'd need whatever the overhead is for the additional
metadata on the package you're currently inspecting, plus no more than a
page for the additional work area for the second hash.  (I assume here
that fileconflict checks are done one package at a time, not by loading
all packages into memory and then checking them for conflicts, since the
latter would be unusable.)
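
The claim above concerns the working memory of the hash computation itself. As a minimal sketch of that point only, the following feeds one pass over a stream into two hash objects using Python's standard `hashlib`. The function name `digest_stream` and the chunk size are illustrative choices, and nothing here reflects how rpm itself computes digests.

```python
import hashlib
import io


def digest_stream(stream, algorithms=("md5", "sha256"), chunk_size=64 * 1024):
    """Compute several digests in a single pass over a binary stream.

    Memory use is bounded by chunk_size plus the fixed internal state of
    each hash object; it does not grow with the length of the stream.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# Example input standing in for a file's contents.
data = b"hello world\n" * 100_000
digests = digest_stream(io.BytesIO(data))
print(digests["md5"])
print(digests["sha256"])
```

This only covers computing the digests. It does not address how many stored per-file digests have to be held at once during a transaction, which is a separate cost.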

Well, the assumption is wrong: during file conflict checking, all
file-related data of non-installed packages is kept in memory. The full
headers that are fed into the transaction are discarded to - guess what -
save memory; only the absolutely necessary file data is kept. For installed
packages, rpm can and does fetch them one at a time from rpmdb as
necessary, but for to-be-installed packages, rpm doesn't have the header,
so it can't go back to them as needed.

Oh, I guess there's also a case where you have to check for
fileconflicts among multiple packages in the same transaction laying
down the same files.  Handwave, same problem really.

- falling back to md5 for conflict resolution would void the supposed
   extra security of the better hash

So there's two cases, if rpm would let you carry both hashes.

1 is where the file on disk has both MD5 and SHA256 sums, and the new
package has only MD5.  You already trust the package on disk, because
you already installed it; so compute the SHA256 of the file you're about
to lay down!  Now you have both hashes, and you can compare them both.
The odds of defeating this are the odds of finding a payload that
collides for both MD5 and SHA256, which can't possibly be lower than the
odds of finding a collision for just SHA256 itself.

2 is where the file on disk has only MD5, and the package you're about
to install has both.  If you have an rpm that only understands MD5, then
whatever, you just ignore the SHA256 hash.  If you have an rpm that
understands both, then you have options.  If you're being sensible, you
do the same thing as for case 1, which is to generate the SHA256 of the
disk file that's implicitly already trusted and compare both sums, and
presumably you only got to this point because you trust the GPG key that
signed the package you're about to install, so, good enough.  (There's a
flaw here if the file on disk is modified.  I could see arguments here
for any of rpmnew/rpmsave/fileconflict as the "right thing", which I
leave to someone more detail-oriented than I am.)

If you're in FIPS mode - that is, if you're _not_ being sensible - then
you fail the transaction, which you ought rightly do anyway since oh no
the package on disk is only hashed with MD5, you're already in trouble.

3) You're installing two new packages with a common file where one
only has md5 hashes and the other has md5 + a stronger hash. Okay, assume
a "FIPS mode" exists and it's mostly the same as above: either be anal about
it or not.
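
The cases above can be sketched as a small comparison routine. This is purely illustrative: `fill_digests`, `same_file` and the `strict` flag (standing in for the "FIPS mode" behaviour) are hypothetical names, not rpm APIs. The sketch also assumes the file contents are available for hashing on both sides, which, as the message goes on to note, rpm's architecture does not provide for payload data.

```python
import hashlib

ALGORITHMS = ("md5", "sha256")


def fill_digests(known, content):
    """Return a copy of `known` with any missing digest computed from content."""
    full = dict(known)
    for name in ALGORITHMS:
        if name not in full:
            full[name] = hashlib.new(name, content).hexdigest()
    return full


def same_file(a_known, a_content, b_known, b_content, strict=False):
    """Hypothetical conflict check comparing both digests of two files.

    strict=True models the 'fail the transaction' policy: refuse to
    proceed if either side was shipped without a sha256 digest.
    """
    if strict and ("sha256" not in a_known or "sha256" not in b_known):
        raise ValueError("strict mode: one side carries no sha256 digest")
    a = fill_digests(a_known, a_content)
    b = fill_digests(b_known, b_content)
    return all(a[name] == b[name] for name in ALGORITHMS)


content = b"#!/bin/sh\necho hi\n"
on_disk = {
    "md5": hashlib.md5(content).hexdigest(),
    "sha256": hashlib.sha256(content).hexdigest(),
}
incoming = {"md5": hashlib.md5(content).hexdigest()}  # md5-only package

print(same_file(on_disk, content, incoming, content))
```

With `strict=True` the same call raises instead of comparing, which corresponds to the policy of refusing md5-only data outright.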

But back to the existing implementation: sure it isn't optimal, sure it 
would be nice if it were backwards compatible all the way to RHEL 2.1 or 
whatever. It's a trade-off on several fronts, due to many different 
aspects: limitations of fundamental rpm architecture (inability to 
calculate the hash from payload on demand), efficiency (memory footprint, 
bandwidth etc), compatibility (see the point about header size, just 
stuffing more and more data there can make things even more 
incompatible)... and I'm a bit tired of people assuming no thought
whatsoever was given to the way it's done.

	- Panu -
