On Tue, 6 Jul 2021 14:31:29 +0900 stephen@xxxxxxxxxx wrote:

> Samuel Sieb writes:
> > On 2021-07-03 8:02 p.m., dwoody5654@xxxxxxxxx wrote:
> > > the url I am trying to download does not have an extension, i.e. no
> > > '.htm', such as:
> > > https://my.acbl.org/club-results/details/338288
>
> The extension doesn't matter to any of the utilities mentioned, as far
> as I know. I'm pretty sure they get the MIME type from the HTTP
> Content-Type header.
>
> > > wget does not download the correct web page.
> >
> > I tried it and it worked, sort of. The problem is that you want to
> > download everything to view it offline, but the site my.acbl.org has a
> > robots.txt that says "no robots allowed". So wget respects that and
> > will not download any required files from that site other than the
> > initial page. curl probably has the same issue.
>
> 1. The page does not have its content represented in HTML, AFAICT: it's
>    a blob which is parsed and formatted by a battery of (Java)scripts,
>    some of which are resources on the Internet, and some are inline.
>    In other words, the HTML in that file is used as a container format
>    to transport the scripts to the browser. Neither wget nor curl
>    supports JavaScript at all, as far as I know.
>
> 2. 96% of the page is in two blobs; AFAICT there were no IMG or other
>    elements that specify requirements by URL. If so, that would
>    explain why only the top page was downloaded.
>
> 3. curl does not document how it handles robots.txt. Since, as far as
>    I can tell, curl has no recursive or get-requirements option, it
>    probably doesn't handle it at all. wget documents that wget -r
>    (recursive download) respects robots.txt. It does not document that
>    wget -p (get page requisites, too) respects robots.txt, but a quick
>    test suggests that it does. I think this is a bug: any interactive
>    program that supports non-text media will download required
>    resources along with the access to the HTML file.
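For what it's worth, wget does have a documented knob for point 3: `-e robots=off` makes it ignore robots.txt. A minimal sketch of the kind of invocation one might try against the URL from the original post (the sketch only prints the command; uncomment the last line to actually run it, and note that per point 1 it still won't capture JavaScript-rendered content):

```shell
#!/bin/sh
# Sketch: fetch one page plus its requisites, ignoring robots.txt.
#   -p            download page requisites (images, CSS, ...)
#   -k            convert links so the saved copy works offline
#   -e robots=off do not honor robots.txt (use responsibly)
url="https://my.acbl.org/club-results/details/338288"
cmd="wget -p -k -e robots=off $url"
echo "$cmd"
# $cmd   # uncomment to actually perform the download
```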
> (If someone agrees and wants to do something about it, this is a wget
> bug, not a Fedora bug.)
>
> I don't have an alternative fetch tool to suggest, unfortunately. I
> think that you need to use a graphical browser somehow, or write a
> script in your favorite P-language.
>
> Steve

Thanks for the info.

I have been using a script called save-page-as.sh that runs Firefox. I
have changed the save-page-as setting in Firefox to use 'Web Page,
complete'. The save-page-as.sh script sends a Ctrl-S to Firefox and
saves the page. It works perfectly when run from the command line.

I have tried to trigger the save-page-as.sh script by sending an email
to my computer, but it does not run Firefox for some reason. My
searching says that Firefox can be run from a cron script by exporting
DISPLAY, and running from a cron script is, I would think, similar to
running a script from an email (using procmailrc), but I have had no
luck. env shows DISPLAY=:0.0. I have tried several variations:

export DISPLAY=:0
export DISPLAY=:0.0
export DISPLAY=:0.1

with no luck. Perhaps there is another setting that needs to be
included as well. Any thoughts?
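One possibility: X clients started outside the desktop session (cron, procmail) usually need XAUTHORITY as well as DISPLAY, because without the session's magic cookie the X server refuses the connection even when DISPLAY is correct. A hypothetical wrapper sketch, where the cookie path and the install location of save-page-as.sh are assumptions to check against your own system:

```shell
#!/bin/sh
# Hypothetical cron/procmail wrapper (a sketch, not a known fix).
# X clients need both DISPLAY and XAUTHORITY; the cookie path below
# is the classic location and is an assumption -- some display
# managers put it under /run/user/<uid> instead.
export DISPLAY=:0
export XAUTHORITY="$HOME/.Xauthority"   # assumed; verify with: echo $XAUTHORITY in a terminal
echo "DISPLAY=$DISPLAY XAUTHORITY=$XAUTHORITY"
# Then call the real script (path assumed):
# "$HOME/bin/save-page-as.sh" "$url"
```

Comparing `env` output inside a desktop terminal against `env` from the cron/procmail context is a quick way to spot which other session variables (e.g. the D-Bus session address) differ.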
David

> _______________________________________________
> users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct:
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives:
> https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
> Do not reply to spam on the list, report it:
> https://pagure.io/fedora-infrastructure