On Jan 30, 2005, Jeff Johnson <n3npq@xxxxxxxxx> wrote:

> Alexandre Oliva wrote:
>> Err... By the time yum or any other depsolver decides to download a
>> package, it's already got all the headers for all packages. And I

> Yep, "already got" so let's go get the header again.

I see. Still, it's probably unwise to have the transaction verification procedure wait for all the big packages to download, or compete with them for limited bandwidth, when the headers alone would have sufficed.

An idea to overcome this without throwing away possible web-caching benefits would be to start a download of the entire rpm and, by the time you get to the end of the header, stop reading from that connection until you've completed the transaction verification. If you have a web proxy, it will likely keep downloading the entire package, and you'll end up getting the rest of it very quickly, but the downloads will be competing for bandwidth. If you don't have a web proxy, however, things may get messy: not only will you get competition for bandwidth, you'll also get competition for any limit on open connections imposed on you upstream (ISP, download server, etc). (My DSL provider, for example, won't let me establish more than 30 TCP connections simultaneously.)

>> hope you're not suggesting that yum get rpm to download *all* packages
>> just because it needs headers. *That* would be a waste of bandwidth

> Depends on how yum implements, but we agree that "all" is stupid,
> even if we appear to disagree whether headers being downloaded and
> then downloaded again is stupid.

We do agree on both counts.

>>> into /var/cache/yum/repo/packages since you already know the header
>>> byte range you are interested in from the xml metadata, thereby
>>> saving the bandwidth used by reading the header twice.

>> Hmm... I hope you're not saying yum actually fetches the header
>> portion out of the rpm files for purposes of dep resolution. Although
>> I realize the information in the .xml file makes it perfectly
>> possible, it also makes it (mostly?) redundant. Having to download
>> not only the big xml files but also all of the headers would suck in a
>> big way!

> The rpmlib API requires a header for a ride. So yes, that is exactly what
> is happening, yum is using byte ranges to pull headers from discovered
> packages where (if discovered packages are needed) both header+payload
> could be pulled together and asynchronously.

I hope you're really not saying that, if I request to install package foo, which depends on bar, it will also download the header for baz, a totally unrelated package. I can see that we'd need headers for foo and bar, but not for baz. I thought the point of the xml files and the info on provides, filelists, etc, was precisely to enable the depsolver to avoid having to download the headers for every package.

I'm wondering whether it would be possible for a depsolver to create a (smaller) .hdr file out of the info in the .xml files, and feed that to rpmlib for transaction-verification purposes. That would enable it to skip the download-header step before downloading the entire package.

> The repo data is a win over previous incarnations of yum because
> it's one, not hundreds, of files that needs to be downloaded.

It's clear to me that it's a win for a one-shot download.
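Just to make the byte-range approach concrete, here's a rough sketch of pulling only the header region of a remote package, given the header byte range the xml metadata advertises. This is not yum's actual code; the URL, offsets and cache path are made up, and the exact meaning of the start/end offsets is an assumption on my part:

    # Sketch only: fetch just the header region of a remote .rpm with an
    # HTTP Range request, using a byte range taken from the xml metadata.
    # Hypothetical URL, offsets and cache path; Python 2 / urllib2.
    import urllib2

    def fetch_header_bytes(pkg_url, hdr_start, hdr_end):
        req = urllib2.Request(pkg_url)
        # HTTP byte ranges are inclusive at both ends, hence the -1; whether
        # the metadata's end offset is inclusive or exclusive is assumed here.
        req.add_header('Range', 'bytes=%d-%d' % (hdr_start, hdr_end - 1))
        resp = urllib2.urlopen(req)
        return resp.read()

    # made-up package URL and offsets; the real ones come from the metadata
    hdr = fetch_header_bytes('http://example.com/repo/foo-1.0-1.i386.rpm',
                             440, 5468)
    open('/var/cache/yum/repo/packages/foo-1.0-1.i386.hdr', 'wb').write(hdr)

Of course this only helps if the server (or proxy) honors Range and replies with 206 Partial Content; if you get a 200 instead, you've just started downloading the whole package, which takes us back to the bandwidth competition described above.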
What's not clear is whether it's still a win once you factor in repeated downloads: fetching the entire .xml files 2-3 times a day, or every time they're updated (which rhn-applet would presumably do, although a simple listing of package N-V-Rs would be enough for it), may end up wasting more bandwidth than downloading the .hdr files once and for all.

> filtering the data (i.e. headers have changelogs and more that is
> useless baggage) is also a win.

Definitely. But couldn't we perhaps do it by intelligently filtering information out of the rpm header and, say, generating a single archive containing all of the info needed for depsolving and for rpmlib's transaction verification?

It might still be useful to have something like header.info, but compressed, and listing not only N-V-Rs but also the header byte ranges, so that one could download headers for individual packages instead of re-downloading the xml files (or equivalent) when a previously-downloaded version is still locally available and has undergone only very slight changes.

> So the suggestion was to download the package, not the header, and then
> extract the header from local, not remote storage.

I see. Good one, if we can't avoid downloading the package anyway (the download is wasted if, e.g., rpmlib ends up deciding it can't install the packages).

>> I'd be very surprised if yum 2.1 actually worked this way. I expect
>> far better from Seth, and from what I read during the design period
>> of the metadata format, I understood that the point of the xml files
>> was precisely to avoid having to download the hdr files in the first
>> place. So why would they be needed? To get rpmlib to verify the
>> transaction, perhaps?

> What do you expect? There is no way to create a transaction using rpmlib
> without a header, a header is wired in the rpmlib API. So honk at failed
> rpmlib design, not otherwise.

I was expecting depsolving not to require all the headers. And from what I gather from your reply, it indeed doesn't.

> Look, repos currently change daily, perhaps twice daily.

They actually change more often than that. rawhide changes daily, yes. FC updates sometimes change several times in a single day, and then sometimes stay put for a few days; once or twice a day, indeed. Other repos, such as dag and freshrpms, seem to me to change more often than that; at least once a day would be an accurate description for them.

> Trying to optimize incremental updates for something that changes
> perhaps twice a day is fluff.

Let me try some back-of-the-envelope calculations here. Consider an FC install that remains in place for 40 weeks (~9 months), has a user permanently running rhn-applet, and whose administrator runs up2date once a day on average. Further consider that updates are released, on average, once a day, and that, on average, only two of the 7 weekly update runs actually have new packages to install (i.e., updates are generally published in batches).

Let's consider two scenarios: 1) using up2date with yum-2.0 (headers/) repos (whoever claimed up2date supported rpmmd repodata/ misled me :-), and 2) using yum-2.1 (repodata/) repos.

1) yum 2.0

16MiB) initial download, distro's and empty updates' hdrs

8MiB) daily (on average) downloads of header.info for updates, downloaded by rhn-applet, considering an average size of almost 30KiB, for 40 weeks (both FC2 and FC3 updates for i386 have a header.info this big right now)

16MiB) .hdr files for updates, downloaded by the update installer.
Current FC2 i386 headers/ holds 9832KiB, whereas FC3 i386 headers/ holds 8528KiB, but that doesn't count superseded updates, whose .hdr files are removed. The assumption is that each header is downloaded once; 16MiB is a guesstimate that I believe to be inflated. It doesn't take into account duplicate downloads of header.info for updates, under the assumption that a web proxy would avoid downloading again what rhn-applet has already downloaded.

----
40MiB) just in metadata over a period of 9 months, total

2) yum 2.1

2.7MiB) initial download, distro's and empty updates' primary.xml.gz and filelists.xml.gz

68MiB) daily (on average) downloads of primary.xml.gz, downloaded by rhn-applet, considering an average size of 250KiB (FC2 updates' is 240KiB, whereas FC3's is 257KiB, plus about 1KiB for repomd.xml)

16MiB) .hdr files for updates, downloaded by the update installer (same as in case 1)

192MiB) filelists.xml.gz for updates, downloaded twice a week on average by the update installer, to solve filename deps.

----
278.7MiB) just in metadata over a period of 9 months, total

Looks like a waste of at least 238.7MiB per user per 9-month install. Sure, it's not a lot, only 26.5MiB a month, but it's almost 6 times as much data being transferred for the very same purpose. How is that a win? Multiply that by the number of users pounding on your mirrors and it adds up to hundreds of GiB a month. (A trivial script reproducing these figures, under the assumptions above, is appended after my signature.)

Of course there are some factors that can help minimize the wastage. For example, a web proxy serving multiple machines, one of which is updated before the others, will be able to serve the headers for yum 2.1 out of the cached .rpm files, so you transfer the headers by themselves only once for all machines, instead of once per machine. But then, yum 2.0 enables the web proxy to cache headers anyway, so this would be a win for both, and less so for yum 2.1 if you update multiple boxes in parallel.

Another factor is that you probably won't need filelists.xml.gz for every update. Maybe I don't quite understand how often it is needed, but even if I had to download it only once a month, that's still 64MiB over 9 months, more than the 40MiB total metadata downloaded over 9 months by yum 2.0.

> The rpm-metadata is already a huge win, as the previous incarnation
> checked time stamps on hundreds and thousands of headers, not one
> primary file.

I don't know how yum 2.0 did it, but up2date surely won't even try to download a .hdr file it already has in /var/spool/up2date, so this is not an issue.

> Sure there are further improvements, but busting up repo metadata
> ain't gonna be where the win is, there's little gold left in that
> mine.

repodata helps the initial download, granted, but it loses terribly in the long run.

-- 
Alexandre Oliva             http://www.ic.unicamp.br/~oliva/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
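The trivial script mentioned above, in case anyone wants to plug in their own numbers. The per-file sizes and frequencies are just the assumptions stated in the message; in particular, the ~2.4MiB size for updates' filelists.xml.gz is inferred from the 192MiB figure rather than measured, so this reproduces the arithmetic and nothing more:

    # Back-of-the-envelope metadata totals for the two scenarios above.
    # All sizes are the assumptions from the message; the 2.4MiB
    # filelists.xml.gz size is inferred from the 192MiB figure.
    WEEKS = 40
    DAYS = WEEKS * 7              # roughly 9 months
    KiB = 1.0
    MiB = 1024 * KiB              # everything below is tallied in KiB

    yum20 = {
        'initial distro+updates hdrs': 16 * MiB,
        'header.info, daily (rhn-applet)': 30 * KiB * DAYS,
        'update .hdr files': 16 * MiB,
    }

    yum21 = {
        'initial primary+filelists': 2.7 * MiB,
        'primary.xml.gz + repomd.xml, daily (rhn-applet)': 251 * KiB * DAYS,
        'update .hdr files': 16 * MiB,
        'filelists.xml.gz, twice a week': 2.4 * MiB * 2 * WEEKS,
    }

    for name, costs in (('yum 2.0', yum20), ('yum 2.1', yum21)):
        total = sum(costs.values())
        print('%s: %.1fMiB of metadata over %d weeks' % (name, total / MiB, WEEKS))

That prints roughly 40MiB for yum 2.0 and 279MiB for yum 2.1, matching the totals above give or take rounding.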