Re: Better repodata performance

--- Begin Message ---
Alexandre Oliva wrote:

> On Jan 30, 2005, Jeff Johnson <n3npq@xxxxxxxxx> wrote:
>
>> More seriously, I'm about a weekend's work away from adding a look-aside
>> cache to rpm-4.4.x (which has a reliable http/https stack using neon) that
>> could be invoked asynchronously to yum as
>>     rpm -q http://host/path/to/N-V-R.A.rpm
>> and then yum could read the header from the package as it was being
>> downloaded



> Err... By the time yum or any other depsolver decides to download a
> package, it's already got all the headers for all packages. And I

Yep, "already got", so let's go get the header again.

> hope you're not suggesting yum get rpm to download *all* packages
> just because it needs headers.  *That* would be a waste of bandwidth.


Depends on how yum implements it, but we agree that "all" is stupid, even
if we appear to disagree on whether headers being downloaded and then
downloaded again is stupid.




>> into /var/cache/yum/repo/packages, since you already know the header
>> byte range you are interested in from the xml metadata, thereby
>> saving the bandwidth used by reading the header twice.



> Hmm... I hope you're not saying yum actually fetches the header
> portion out of the rpm files for purposes of dep resolution. Although
> I realize the information in the .xml file makes it perfectly
> possible, it also makes it (mostly?) redundant. Having to download
> not only the big xml files but also all of the headers would suck in a
> big way!



The rpmlib API requires a header along for the ride. So yes, that is
exactly what is happening: yum is using byte ranges to pull headers out of
discovered packages, when (if a package is discovered, the package is
needed) both header and payload could be pulled together, and
asynchronously.
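
For illustration, here's a minimal sketch of that kind of byte-range fetch
(hypothetical URL and offsets; the real start/end values come from the
header-range information in the xml metadata):

    import urllib.request

    # Header byte range as advertised in the xml metadata
    # (the offsets here are made up for illustration).
    url = "http://host/path/to/N-V-R.A.rpm"
    start, end = 104, 34061

    req = urllib.request.Request(url)
    req.add_header("Range", "bytes=%d-%d" % (start, end))
    with urllib.request.urlopen(req) as resp:
        header_blob = resp.read()   # only the header region, not the payload

Those same bytes arrive again, for free, when the full package is
downloaded.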

> I was thinking to myself that having to download only the compressed
> xml files might be a win (bandwidth-wise) over going through all of the
> headers like good old yum 2.0 did, at least in the short term, and for
> a repository that doesn't change too much.
>
> But having to download the xml files *and* the rpms' headers upfront
> would make the repodata format a bit of a loser, because not only would
> you waste a lot of bandwidth on the xml files, which are much bigger than
> the header.info files, but also because fetching only the header
> portion out of the rpm files with byte-range downloads makes them
> non-cacheable by, say, squid.



The repo data is a win over previous incarnations of yum because it's one
file, not hundreds, that needs to be downloaded. That is easier to
implement and debug, and filtering the data (i.e. headers carry changelogs
and more that is useless baggage) is also a win.


At the end of the road, you need the package, because that is what rpmlib installs.
There is no way to avoid downloading the package once a decision to install has
been made.


Downloading the headers in between the primary repo data and the package is
what is unnecessary, but the header is still a useful object.

So the suggestion was to download the package, not the header, and then
extract the header from local, not remote, storage.
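
A minimal sketch of that local extraction, using the rpm Python bindings
(the cache path is hypothetical; assumes the package bytes have already
landed locally):

    import os
    import rpm

    ts = rpm.TransactionSet()
    # Relax signature checking: the header is readable as soon as its
    # bytes are on disk, even before the payload finishes arriving.
    ts.setVSFlags(rpm._RPMVSF_NOSIGNATURES)

    fd = os.open("/var/cache/yum/repo/packages/N-V-R.A.rpm", os.O_RDONLY)
    hdr = ts.hdrFromFdno(fd)    # header from local, not remote, storage
    os.close(fd)
    print(hdr[rpm.RPMTAG_NAME])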

> I'd be very surprised if yum 2.1 actually worked this way. I expect
> far better from Seth, and from what I read during the design period
> of the metadata format, I understood that the point of the xml files
> was precisely to avoid having to download the hdr files in the first
> place. So why would they be needed? To get rpmlib to verify the
> transaction, perhaps?



What do you expect? There is no way to create a transaction using rpmlib
without a header; a header is wired into the rpmlib API. So honk at failed
rpmlib design, not otherwise.
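
Concretely, a sketch of what "wired in" means (rpm Python bindings again;
same hypothetical cache path as above):

    import os
    import rpm

    ts = rpm.TransactionSet()
    fd = os.open("/var/cache/yum/repo/packages/N-V-R.A.rpm", os.O_RDONLY)
    hdr = ts.hdrFromFdno(fd)        # no header, no transaction element
    os.close(fd)

    # Every transaction element is built from a header object.
    ts.addInstall(hdr, "N-V-R.A.rpm", "i")
    print(ts.check())               # the dependency check walks headers too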



>> That's a far bigger bandwidth saving than attempting to fragment
>> primary.xml, which already has timestamp checks to avoid downloading
>> the same file repeatedly



> The problem is not downloading the same file repeatedly. The problem
> is that, after it is updated, you have to download the entire file
> again to get a very small amount of new information. For a biggish
> repository like FC updates, development, pre-extras, extras, or
> dag, freshrpms, at-rpms, newrpms, etc., that's a lot of wasted
> bandwidth.



Look, repos currently change daily, perhaps twice daily. Trying to optimize
incremental updates for something that changes perhaps twice a day is
fluff.

The rpm-metadata is already a huge win, as the previous incarnation checked
time stamps on hundreds or thousands of headers, not one primary file.
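
Those timestamp checks amount to a conditional GET on a single file; a
sketch of the idea (hypothetical URL and cache path):

    import email.utils
    import os
    import urllib.error
    import urllib.request

    url = "http://host/repo/repodata/primary.xml.gz"
    cached = "/var/cache/yum/repo/primary.xml.gz"

    req = urllib.request.Request(url)
    if os.path.exists(cached):
        stamp = email.utils.formatdate(os.path.getmtime(cached), usegmt=True)
        req.add_header("If-Modified-Since", stamp)

    try:
        with urllib.request.urlopen(req) as resp:
            with open(cached, "wb") as f:
                f.write(resp.read())        # repo changed: refetch one file
    except urllib.error.HTTPError as e:
        if e.code != 304:                   # 304 Not Modified: cache is good
            raise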


Sure, there are further improvements, but busting up the repo metadata
ain't gonna be where the win is; there's little gold left in that mine.


73 de Jeff


--- End Message ---
