On Tue, 6 Jul 2021 14:31:29 +0900 stephen@xxxxxxxxxx wrote:

> Samuel Sieb writes:
> > On 2021-07-03 8:02 p.m., dwoody5654@xxxxxxxxx wrote:
> > > the url I am trying to download does not have an extension, i.e. no
> > > '.htm', such as:
> > > https://my.acbl.org/club-results/details/338288
>
> The extension doesn't matter to any of the utilities mentioned, as far
> as I know. I'm pretty sure they get the MIME type from the HTTP
> Content-Type header.
>
> > > wget does not download the correct web page.
> >
> > I tried it and it worked, sort of. The problem is that you want to
> > download everything to view it offline, but the site my.acbl.org has a
> > robots.txt that says "no robots allowed". So wget respects that and
> > will not download any required files from that site other than the
> > initial page. curl probably has the same issue.
>
> 1. The page does not have its content represented in HTML, AFAICT: it's
>    a blob which is parsed and formatted by a battery of (Java)scripts,
>    some of which are resources on the Internet, and some are inline.
>    In other words, the HTML in that file is used as a container format
>    to transport the scripts to the browser. Neither wget nor curl
>    supports JavaScript at all, as far as I know.
>
> 2. 96% of the page is in two blobs; AFAICT there were no IMG or other
>    elements that specify requirements by URL. If so, that would
>    explain why only the top page was downloaded.
>
> 3. curl does not document how it handles robots.txt. Since, as far as
>    I can tell, curl has no recursive or get-requirements option, it
>    probably doesn't handle it at all. wget documents that wget -r
>    (recursive download) respects robots.txt. It does not document that
>    wget -p (get page requisites, too) respects robots.txt, but a quick
>    test suggests that it does. I think this is a bug: any interactive
>    program that supports non-text media will download required
>    resources along with the access to the HTML file.
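For what it's worth, wget does have a documented knob for point 3: `-e robots=off` makes it ignore robots.txt. A minimal sketch of the kind of invocation one might try against the URL from the original post (the sketch only prints the command; uncomment the last line to actually run it, and note that per point 1 it still won't capture JavaScript-rendered content):

```shell
#!/bin/sh
# Sketch: fetch one page plus its requisites, ignoring robots.txt.
#   -p            download page requisites (images, CSS, ...)
#   -k            convert links so the saved copy works offline
#   -e robots=off do not honor robots.txt (use responsibly)
url="https://my.acbl.org/club-results/details/338288"
cmd="wget -p -k -e robots=off $url"
echo "$cmd"
# $cmd   # uncomment to actually perform the download
```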
> (If someone agrees and wants to do something about it, this is a wget
> bug, not a Fedora bug.)
>
> I don't have an alternative fetch tool to suggest, unfortunately. I
> think that you need to use a graphical browser somehow, or write a
> script in your favorite P-language.
>
> Steve

Thanks for the info.

I have been using a script called save-page-as.sh that runs Firefox. I
have changed the save-page-as setting in Firefox to use 'Web Page,
complete'. The save-page-as.sh script sends a Ctrl-S to Firefox and
saves the page. It works perfectly when run from the command line.

I have tried to trigger the save-page-as.sh script by sending an email
to my computer, but it does not run Firefox for some reason. My
searching says that Firefox can be run from a cron script by exporting
DISPLAY, and running from a cron script is, I would think, similar to
running a script from an email (using procmailrc), but I have had no
luck. env shows DISPLAY=:0.0. I have tried several variations:

export DISPLAY=:0
export DISPLAY=:0.0
export DISPLAY=:0.1

with no luck. Perhaps there is another setting that needs to be
included as well. Any thoughts?
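One possibility: X clients started outside the desktop session (cron, procmail) usually need XAUTHORITY as well as DISPLAY, because without the session's magic cookie the X server refuses the connection even when DISPLAY is correct. A hypothetical wrapper sketch, where the cookie path and the install location of save-page-as.sh are assumptions to check against your own system:

```shell
#!/bin/sh
# Hypothetical cron/procmail wrapper (a sketch, not a known fix).
# X clients need both DISPLAY and XAUTHORITY; the cookie path below
# is the classic location and is an assumption -- some display
# managers put it under /run/user/<uid> instead.
export DISPLAY=:0
export XAUTHORITY="$HOME/.Xauthority"   # assumed; verify with: echo $XAUTHORITY in a terminal
echo "DISPLAY=$DISPLAY XAUTHORITY=$XAUTHORITY"
# Then call the real script (path assumed):
# "$HOME/bin/save-page-as.sh" "$url"
```

Comparing `env` output inside a desktop terminal against `env` from the cron/procmail context is a quick way to spot which other session variables (e.g. the D-Bus session address) differ.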
David

> _______________________________________________
> users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct:
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives:
> https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
> Do not reply to spam on the list, report it:
> https://pagure.io/fedora-infrastructure