Re: downloading a complete web page without using a browser...

Samuel Sieb writes:
 > On 2021-07-03 8:02 p.m., dwoody5654@xxxxxxxxx wrote:
 > > the url I am trying to download does not have an extension ie. no
 > > '.htm' such as:
 > > https://my.acbl.org/club-results/details/338288

The extension doesn't matter to any of the utilities mentioned as far
as I know.  I'm pretty sure they get the MIME type from the HTTP
Content-Type header.
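
To illustrate: a client recovers the MIME type from the Content-Type
header value, not from the URL's extension.  A minimal sketch using
Python's stdlib (the header value is hard-coded here for illustration;
a real client would read it from the HTTP response):

```python
# Parse a Content-Type header to recover the MIME type.  The header
# value below is a stand-in for what a server like my.acbl.org would
# send; note the URL path plays no part in the result.
from email.message import Message

msg = Message()
msg["Content-Type"] = "text/html; charset=UTF-8"
print(msg.get_content_type())  # text/html
```

So a URL with no ".htm" suffix is still served (and handled) as
text/html if the server says so.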

 > > wget does not download the correct web page.
 > 
 > I tried it and it worked, sort of.  The problem is that you want to 
 > download everything to view it offline, but the site my.acbl.org has a 
 > robots.txt that says "no robots allowed".  So wget respects that and 
 > will not download any required files from that site other than the 
 > initial page.  curl probably has the same issue.

1.  The page's content is not represented in HTML AFAICT: it's a blob
    which is parsed and formatted by a battery of (java)scripts, some
    fetched from the Internet and some inline.  In other words, the
    HTML in that file is used as a container format to transport the
    scripts to the browser.  Neither wget nor curl supports JavaScript
    at all, as far as I know.

2.  96% of the page is in two blobs; AFAICT there were no IMG or other
    elements that request additional resources by URL.  That would
    explain why only the top page was downloaded.
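
A rough way to check point 2 yourself: scan the downloaded HTML for
elements that pull in further resources by URL (img/script src,
link href).  If the list comes back empty, wget -p has nothing extra
to fetch.  A stdlib sketch, with sample markup standing in for the
downloaded page:

```python
# List resource URLs referenced by an HTML document.  The sample
# markup below is a stand-in for the downloaded page: an inline
# script blob and no img/src/href references, so the list is empty.
from html.parser import HTMLParser

class ResourceLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag in ("img", "script", "iframe") and "src" in d:
            self.urls.append(d["src"])
        elif tag == "link" and "href" in d:
            self.urls.append(d["href"])

p = ResourceLister()
p.feed('<html><head><script>/* inline blob */</script></head>'
       '<body><p>no img or linked resources here</p></body></html>')
print(p.urls)  # []
```

An empty list means there are no "page requisites" for wget -p to
download in the first place.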

3.  curl does not document how it handles robots.txt.  Since, as far
    as I can tell, curl has no recursive or get-requisites option, it
    probably doesn't consult robots.txt at all.
    wget documents that wget -r (recursive download) respects
    robots.txt.  It does not document that wget -p (get page
    requisites, too) respects robots.txt, but a quick test suggests
    that it does.  I think this is a bug: any interactive program that
    supports non-text media downloads the required resources along
    with the HTML file itself.  (If someone agrees and wants to do
    something about it, this is a wget bug, not a Fedora bug.)
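
The robots.txt effect can be reproduced with Python's stdlib, which
applies the same exclusion rules a well-behaved robot does.  The rules
below mimic a "no robots allowed" robots.txt; the actual file on
my.acbl.org may differ:

```python
# Parse a blanket-deny robots.txt (hard-coded here for illustration)
# and ask whether a robot may fetch the page in question.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])
allowed = rp.can_fetch("Wget",
                       "https://my.acbl.org/club-results/details/338288")
print(allowed)  # False
```

A robot-respecting tool that gets False here skips the fetch, which
matches the behavior Samuel observed with wget.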

I don't have an alternative fetch tool to suggest, unfortunately.  I
think that you need to use a graphical browser somehow, or write a
script in your favorite P-language.

Steve
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure


