Re: Anyone know how to get around a java script on this web page?

On 2020-08-23 23:09, ToddAndMargo via users wrote:
On 2020-08-23 06:33, Jeremy Nicoll - ml fedora wrote:
On 2020-08-23 02:42, ToddAndMargo via users wrote:
Hi All,

This is a puzzle I have been working on for about two years.

Yes, and you've asked lots of questions on the curl mailing list and,
it seems, not read (or not understood) what you've been told.

There are a lot of great guys on that list.  They never were able to
figure that one out for me.

Rubbish.   People on that list are experts in using curl, and properly
understand what it does and doesn't do.

The problem is that you don't seem to understand the huge difference
between what curl (or wget) do, and what a browser does.


I posted back to that
list yesterday with what Fulko and Gordon taught me.
Fulko and Gordon are extremely smart guys.

You were told about (and indeed replied about) Firefox's developer
tools as far back as Aug 2018.  The problem is, you don't seem to
have gone and read the Mozilla documentation on how to use them,
far less explored their capabilities.


A while ago I wrote a description, for someone elsewhere, about
what a browser typically does to fetch a web page.  This is it:

------------------------------------------------------------------
When a browser fetches "a page" what happens (glossing over all
the stuff that can make this even more complicated) is:

- it asks the server for basic page html

- the server returns page meta data (length, when last changed,
  etc, and possibly things like redirects, if the website now lives
  somewhere else and the request should automatically go there instead)

     - with a browser the user never sees this stuff, but it's
       visible in the browser console in developer tools, and
       with curl, if you code your request in the right way, curl
       will put the returned headers etc in a file for you,
       separately from the html etc

- if the metadata etc meant that html should actually be returned
  the server would send it.  It might also send some "cookies" back
  to the user's browser.

      - with curl you can have any returned cookies put in a file
        too

- the browser would then do a preliminary parse of the source
  html, finding all the embedded references to things like css files,
  image files, javascript files etc, and make separate requests for
  all of them.

     - curl does not do any of that for you.  You need to read the
       html returned by a previous stage, and decide if you want
       to fetch anything else and explicitly ask for it

  For any of those requests that were to the original server, the
  browser would send back that server-specific cookie data, so the
  server can see the new requests are from the same user as the
  first one.

      - curl would only send cookie data back if you explicitly
        tell it to do so, and you have to tell it which data to
        send back

- for every file apart from the first one that's fetched from anywhere,
  the metadata and cookie logic is done for them too.  If they're not
  image files (ie they are css or scripts), they will also be parsed to
  dig out references to embedded files (for example scripts often use
  other people's scripts, which in turn use someone else's, and so on,
  and they all need to be fetched).

- eventually the browser will think it has all the parts that make up
  what it needs to display the page you wanted.

- at some point the browser does a detailed parse of the whole
  file assembled from the bits.  In a modern webpage there is very
  likely to be Javascript code that needs to execute before anything
  is shown to the user.  Sometimes some of that will generate more
  external file references (eg building the names of required files
  from pieces of information that were not present in any one part
  of any of the files fetched so far).

     - curl will of course not execute the Javascript, but you
       could in theory try to work out what it does.  Eg when
       looking at it using the Developer Tools in Firefox you can
       run the JS under a debugger and follow what it does, so you
       could eg see that the URL for another file that has to be
       fetched is built up in a particular way from snippets of
       the JS.  Then you could replicate that in future by
       extracting the contents of the snippets and joining them
       together in your own code.  For example the JS might fetch
       something from a URL made up of some base value, a date,
       and a literal, all of which would need to be in the code
       somewhere.

  In particular the use of cookies for successive fetches, allowing the
  server to see that the fetches were all from the same user, may
  eg mean that "deal of the week" info will somehow have been
  tailored to you.  The server will also know not just what country
  you are in but also what region (if you're not using a VPN to
  fool it), as the ip address of your computer will correspond to one
  of the ranges of addresses used by your ISP.

  Anyway the initial JS code might mean the browser has to fetch
  more files.  So it will, repeating most of the above logic for them
  too.

   Finally it works out what to display and shows it to you.

- after that, modern webpages are very JS intensive.  There's often
  JS logic that executes as you move the mouse around.  It's one of
  the ways that pages react to the mouse moving over certain bits
  of the page.  Some of it is in html itself, but other parts are coded
  in essence to say eg "if the mouse drifts over this bit" or "if the
  mouse drifts away from here" then run such-and-such a bit of
  JS.   Any of those little bits of JS can cause more data to be
  fetched from a server - that could be ads, or it could be something
  to do with the real site.

  Finally things like "next screen" buttons might execute JS before
  actually requesting something.  The JS might encapsulate data
  about you and your activity using the page, as well as just ask
  for more data.  Certainly cookie info set by the initial page fetch
  will be returned to the server... .


To replicate all of the above is difficult.  To do it accurately you
would need to build a lot of extra logic into your own scripts (the
ones that issue the curl commands).
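
As a very rough illustration, a minimal sketch of such a script might
look like this.  Only the curl options (-s, -D, -c, -b, -o) are the
real mechanics described above; the site, the file names and the way
the final URL is built are invented, and would have to come from your
own digging into the page and its JS.

    #!/bin/sh
    # 1. fetch the page, saving the response headers and any cookies
    curl -s -D headers.txt -c cookies.txt -o page.html \
         "https://www.example.com/start"

    # 2. dig a script reference out of the returned html (site-specific!)
    jsurl=$(grep -o 'src="[^"]*\.js"' page.html | head -1 \
            | sed 's/^src="//; s/"$//')

    # 3. fetch that script, sending the cookies back so the server
    #    sees the same "user" as before
    curl -s -b cookies.txt -o script.js "https://www.example.com$jsurl"

    # 4. if the JS builds a further URL from, say, a base, today's
    #    date and a literal, replicate that by hand once you've worked
    #    it out in the browser's JS debugger
    today=$(date +%Y%m%d)
    curl -s -b cookies.txt -o data.json \
         "https://www.example.com/feed/$today/latest"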


An alternative to trying to write a whole browser (in essence) is to
use "screen scraping" software.  It is specifically designed to use
the guts of a browser to fetch stuff and present it - in a machine-
readable way - to logic that can, say, extract an image of part of a
web page and then run OCR on it to work out what it says.


Another alternative is to use something like AutoIt or AutoHotKey to
write a script that operates a browser by pretending to be a human
using a computer - so eg it will send mouse clicks to the browser in
the same way that (when a user is using a computer) the OS sends info
about mouse clicks to a browser.
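
On Linux the same keystrokes-and-clicks idea can be sketched with
xdotool instead of AutoIt/AutoHotKey (the window title, coordinates
and timing below are placeholders and would need tuning against the
real page):

    #!/bin/sh
    # find and focus a Firefox window...
    win=$(xdotool search --name "Mozilla Firefox" | head -1)
    xdotool windowactivate --sync "$win"
    # ...then click wherever the wanted button happens to be on screen
    xdotool mousemove 640 400
    xdotool click 1
    sleep 2                 # give the page time to react
    xdotool key ctrl+s      # eg open the Save Page dialog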

---------------------------------------------------------------



Problem is that the revision is generated by a java script.

No it isn't (as someone else here has explained).

Also there is no Java involved.   The Java programming language
has nothing at all to do with the different programming language
named Javascript.

I am not sure where you are coming from.  I state (Brendan
Eich's) "java script" in the Subject line and all over
the place.  I nowhere stated or implied that it was
Java the programming language.  I wish they had called
them two different names.


Universally in computing if someone says

  "python script, or lua script, or rexx script"

they mean "a script written in python, or a script written in lua
or a script written in rexx"... so when you keep saying "java script"
it looks like you think you're talking about a script written in Java.

Some webpages etc DO use Java, just not very many.

If you're ever googling for info about what a javascript script does
then googling for "java script" is likely to show you info about how
things are done in Java, not in Javascript.

Everything in programming requires one to be precise.  Using the
wrong terminology will not help you.



And JSON is an extension of Java Script.

No it isn't.  It's a data exchange format invented for use in Javascript
though nowadays you'll find it used elsewhere too.  It has nothing to do
with Java.
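
For what it's worth, a scrap of JSON just looks like this - plain
text that Javascript, and plenty of other languages, can read (a
made-up example):

    {
      "product": "example",
      "version": "1.2.3",
      "released": "2020-08-20"
    }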


The "JS" stands for "Java Script"

Sort-of.  It's two letters of the single word "Javascript", which is
sometimes written as "JavaScript".




And curl and wget have no way of running java scripts.

No, but as people on the curl list have explained before, you can
fetch pages and parse out the relevant details and work out what
to fetch next.

Not until Fulko and Gordon did I know what to look for.

I think you need to understand far better what a browser does, and
play with the browser developer tools (on simpler sites than the
eset one) to see what they can do for you.

The people on the curl list expected you to go and do that.


The developer tools will show you, when you fetch "a" page, that
the browser in fact fetches a whole load of different files, and
if you poke around you will see how sets of those are fetched
after earlier files have been fetched (and parsed) and seen to
contain references to later ones.  You can also see eg how long
each of those fetches takes.

The tools also allow you to see the contents of scripts that are
fetched individually as well as isolated sections of JS that are
embedded in a page.

You can intercept what the bits of JS do and watch them execute
in a debugger, and alter them.   (To find out how, you need to
read the Mozilla (for Firefox), or whoever else's, docs AND you
need to experiment with simple pages that you already fully
understand - ideally your own - to see how the tools let you
explore and alter what a page does.)


One of the problems with many modern websites is their programmers
grab bits of JS from "toolkits" and "libraries" written by other
people, eg to achieve some amazing visual effect on a page.  They
might embed 100 KB of someone-else's JS and CSS, just to make one
tiny part of their site do something "clever".  Often almost all
of the embedded/associated JS on a page isn't actually used on
that page, but the site designers neither know nor care.

Another issue is that (say) a JS library might exist in more than
one form.  Often sites embed "minimised" versions of commonly-used
scripts - these have spaces and comments removed and variable names
etc reduced to one or two characters.  The script is then maybe
a tenth of the size of an easily-readable-by-a-human version (so
will download faster and waste less of the server's resources).
Your browser will understand a minimised script just as easily as
a verbose human-readable one ... but you won't.  Some commonly used
scripts (eg those for "jquery") exist in matching minimised and
readable forms, so you could download a readable equivalent (and I
think developer tools will sometimes do that for you).
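
For example, assuming the usual naming on the jQuery CDN (worth
checking against the real filenames there):

    # the minimised form a site will probably embed...
    curl -O https://code.jquery.com/jquery-3.5.1.min.js
    # ...and the human-readable equivalent of the same release
    curl -O https://code.jquery.com/jquery-3.5.1.js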

But... to understand what a script that uses (eg) jquery is doing,
you'll either need to look at it in fine detail (& understand what
it is capable of doing - so eg know all about the "document object
model" and how css and JS are commonly used on webpages) or at the
very least have skim-read lots of jquery documentation.

It won't necessarily be easy to work out which bits of JS on a
page are just for visual effects, and which bits are for necessary
function.


The guys on the curl group were not as explicit as Fulko
and Gordon.

You're expected when using a programmers' utility to read its
documentation and anything else that people mention.  You're
especially expected to understand how the whole browser
process works.

I did what the mensches on the curl list told
me to do and dug around a lot, but could not make
heads or tails out of the page.

Then you should have asked for more help.  But it's
important, when doing that, to show that you have made an
effort to understand; to tell people what you read (so eg
if you're going off on a wild goose chase people can point
you back at the relevant things) and what you've tried.

You might eg explain how after fetching html page a you
then discovered the names of scripts b and c on it, and
fetched those, but didn't know what to do next.

No-one, on a voluntary support forum, is going to do the
whole job for you.  They might, if they have the time and
the inclination, look at what you've managed to do so far
(and ideally the code you used to do it) and suggest how
to add to it to do the next stage.


The other aspect of that is that if you demonstrate some
level of skill as a programmer, people will know at what
level to pitch their replies.  If the person asking how
to do something cannot write working programs they have
no chance of automating any process using scripts,
parsing html or JS that they fetch etc.  On the other hand
if someone shows that they already understand all that,
they're more likely to get appropriate help.


I've written sequences of curl requests interspersed with
logic (in Regina REXX and ooREXX) that grabs successive
pages of a website, finding on each one the information
required to grab the next page in ... but it took days &
days to make the whole process work reliably.  One of my
sets of these processes grabbed crosswords from the puzzle
pages of a certain newspaper.  The structure of the pages
was different on Mon-Fri, Sat, and Sun.  And at certain
times of year, eg Easter, different again.  Over time the
editorial and presentational style of the site changed
too, so code that worked perfectly for weeks could easily
suddenly go wrong because the underlying html (or the css
around it) would suddenly change.  So code I wrote to
extract data from returned pages and scripts needed to
be 'aware' that layout of content in the html etc might
not be the same as any of the previously-seen layouts,
and stop and tell me if something didn't seem right.
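
Stripped of all the error handling and special cases, the core of
that kind of process is just a loop; a sketch only (the start page
and the pattern used to find each "next" link are invented, and a
real site needs far more care):

    #!/bin/sh
    url="https://www.example.com/puzzles/page1"
    n=1
    while [ -n "$url" ] && [ "$n" -le 10 ]; do
        # fetch this page, sending and updating cookies as we go
        curl -s -b cookies.txt -c cookies.txt -o "page$n.html" "$url"
        # find the link to the next page in what we just fetched;
        # when the site's html changes, this is the bit that breaks
        url=$(grep -o 'href="[^"]*next[^"]*"' "page$n.html" \
              | head -1 | sed 's/^href="//; s/"$//')
        n=$((n + 1))
    done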

Writing reliable logic to do this is not straightforward.

My idea of what is straightforward might not match yours.
(I've a computing degree and worked for years as, first, a
programmer (on microcomputers and a small mainframe), then
a systems programmer (installing & customising the OS for a
bank's mainframe), then a programmer again, leading a
programming team writing systemy automation programs for
that bank (ie we weren't part of the teams that wrote code
that moved money around).)

Even so, the websites I've explored using curl, parsing
what comes back and then issuing more curl requests, tended
to be less complex some years ago than the norm nowadays.
It's not necessarily impossible to do it, but it gets harder
and harder to understand the code on many sites, so working
out what "glue" logic is required to do this gets more
and more difficult.

Sometimes in the past, eg when smartphones were a whole lot
less capable, websites existed in simpler forms for phone
users.  Sometimes also much simpler ones for eg blind
users using screen-readers.  When that was the case, making
curl etc grab the blind-users' website pages considerably
simplified the whole process.  Nowadays, it's more common
for there only to be one version of a website, with much
more complex code on it which might adapt to the needs of
eg blind users.  It's therefore sometimes worthwhile looking
at the website's help pages, if it has them, particularly any
info about "accessibility", to see if there's any choice in
what you get.  (Though replicating that may need you to
simulate a login and/or use cookies saved from a previous
visit.)  And if all it does is eg make a page /look/
simpler, while the html & scripts sent to you are unchanged,
there'll be no advantage unless the route through the
page JS etc is simpler - ie if you're using a debugger to
work out what it does, that process may be simpler.  But
probably it won't be.
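
Where a site does still serve a genuinely simpler variant, sometimes
just claiming to be a phone is enough to get it, eg (the user-agent
string here is only an illustration):

    curl -s -A "Mozilla/5.0 (iPhone; CPU iPhone OS 13_0 like Mac OS X)" \
         -o mobile.html "https://www.example.com/"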


--
Jeremy Nicoll - my opinions are my own
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx


