On 2020-08-23 23:09, ToddAndMargo via users wrote:
On 2020-08-23 06:33, Jeremy Nicoll - ml fedora wrote:
On 2020-08-23 02:42, ToddAndMargo via users wrote:
Hi All,
This is a puzzle I have been working on for about two years.
Yes, and you've asked lots of questions on the curl mailing list and, it seems, not read (or not understood) what you've been told.
There are a lot of great guys on that list. They never were able to figure that one out for me.
Rubbish. People on that list are experts in using curl, and properly understand what it does and doesn't do.
The problem is that you don't seem to understand the huge difference between what curl (or wget) do, and what a browser does.
I posted back to that list yesterday with what Fulko and Gordon taught me. Fulko and Gordon are extremely smart guys.
You were told about (and indeed replied about) Firefox's developer tools as far back as Aug 2018. The problem is, you don't seem to have gone and read the Mozilla documentation on how to use them, far less explored their capabilities.
A while ago I wrote a description, for someone elsewhere, about what a browser typically does to fetch a web page. This is it:
------------------------------------------------------------------
When a browser fetches "a page" what happens (glossing over all the stuff that can make this even more complicated) is:
- it asks the server for basic page html
- the server returns page meta data (length, when last changed, etc and possibly things like redirects, if the website now lives somewhere else and should automatically go there instead)
- with a browser the user never sees this stuff, but it's visible in the browser console in developer tools; with curl, if you code your request in the right way, curl will put the returned headers etc in a file for you, separately from the html etc (there's a sketch of the relevant curl commands just after this description)
- if the metadata etc meant that html should actually be returned the server would send it. It might also send some "cookies" back to the user's browser.
- with curl you can have any returned cookies put in a file too
- the browser would then do a preliminary parse of the source html, finding all the embedded references to things like css files, image files, javascript files etc, and make separate requests for all of them.
- curl does not do any of that for you. You need to read the html returned by a previous stage, and decide if you want to fetch anything else and explicitly ask for it
- for any of those requests that were to the original server, the browser would send back that server-specific cookie data, so the server can see the new requests are from the same user as the first one.
- curl would only send cookie data back if you explicitly tell it to do so, and you have to tell it which data to send back
- for every file apart from the first one that's fetched from anywhere, the metadata and cookie logic is done for them too. If they're not image files (ie they are css or scripts) they also will be parsed to dig out references to embedded files (for example scripts often use other people's scripts, which in turn use someone else's, and so on, and they all need to be fetched).
- eventually the browser will think it has all the parts that make up what it needs to display the page you wanted.
- at some point the browser does a detailed parse of the whole file assembled from the bits. In a modern webpage there is very likely to be Javascript code that needs to execute before anything is shown to the user. Sometimes some of that will generate more external file references (eg building names of required files from pieces of information that were not present in any one part of any of the files fetched so far).
- curl will of course not execute the Javascript, but you could in theory try to work out what it does. Eg when looking at a page using Developer Tools in Firefox you can run the JS under a debugger and follow what it does, so you could eg see that the URL for another file that has to be fetched is built up in a particular way from snippets of the JS. Then you could replicate that in future by extracting the contents of the snippets and joining them together in your own code. For example the JS might fetch something from a URL made up of some base value, a date, and a literal, all of which would need to be in the code somewhere.
In particular the use of cookies for successive fetches, allowing the server to see that the fetches were all from the same user, may eg mean that "deal of the week" info will somehow have been tailored to you. The server will also know not just what country you are in but also what regions (if you're not using a VPN to fool it), as the ip address of your computer will correspond to one of the ranges of addresses used by your ISP.
Anyway the initial JS code might mean the browser has to fetch more files. So it will, repeating most of the above logic for them too.
Finally it works out what to display and shows it to you.
- after that, modern webpages are very JS intensive. There's often JS logic that executes as you move the mouse around. It's one of the ways that pages react to the mouse moving over certain bits of the page. Some of it is in html itself, but other parts are coded in essence to say eg "if the mouse drifts over this bit" or "if the mouse drifts away from here" then run such-and-such a bit of JS. Any of those little bits of JS can cause more data to be fetched from a server - that could be ads, or it could be something to do with the real site.
Finally things like "next screen" buttons might execute JS before actually requesting something. The JS might encapsulate data about you and your activity using the page, as well as just ask for more data. Certainly cookie info set by the initial page fetch will be returned to the server...
To replicate all of the above is difficult. To do it accurately you would need to write in your own scripts (that issue curl commands) a lot of extra logic.
An alternative to trying to write a whole browser (in essence) is to use "screen scraping" software. It is specifically designed to use the guts of a browser to fetch stuff and present it - in a machine-readable way - to logic that can, say, extract an image of part of a web page and then run OCR on it to work out what it says.
Another alternative is to use something like AutoIt or AutoHotKey to write a script that operates a browser by pretending to be a human using a computer - so eg it will send mouse clicks to the browser in the same way that (when a user is using a computer) the OS sends info about mouse clicks to a browser.
---------------------------------------------------------------
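To make the curl side of that concrete, here's a rough sketch of the sort of commands involved. The URLs, file names and grep patterns below are invented for illustration; the curl options themselves (-D to dump the response headers to a file, -c/-b for the cookie jar, -o for the output file) are the ones that matter:

  # 1) fetch the basic page html, saving the response headers and any
  #    cookies the server sets (example.com and the filenames are placeholders)
  curl -s -L -D headers.txt -c cookies.txt -o page.html \
       "https://www.example.com/somepage"

  # 2) curl will NOT fetch the css/images/scripts the page refers to;
  #    you have to read page.html yourself and decide what to ask for next,
  #    eg dig out a script reference with grep
  grep -o 'src="[^"]*\.js"' page.html

  # 3) fetch that script, sending the saved cookies back so the server
  #    sees the same "user" as before
  curl -s -b cookies.txt -o script.js "https://www.example.com/js/somescript.js"

  # 4) if the page's JS builds a URL from a base value, a date and a
  #    literal, you can rebuild it yourself once you know the pattern
  curl -s -b cookies.txt -o data.dat \
       "https://www.example.com/files/$(date +%Y%m%d)/update.dat"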
Problem is that the revision is generated by a java script.
No it isn't (as someone else here has explained).
Also there is no Java involved. The Java programming language has nothing at all to do with the different programming language named Javascript.
I am not sure where you are coming from. I state (Brendan Eich's) "java script" in the Subject line and all over the place. I nowhere stated or implied that it was Java the programming language. I wish they had called them two different names.
Universally in computing if someone says
"python script, or lua script, or rexx script"
they mean "a script written in python, or a script written in lua or a script written in rexx"... so when you keep saying "java script" it looks like you think you're talking about a script written in Java.
Some webpages etc DO use Java, just not very many.
If you're ever googling for info about what a javascript script does then googling for "java script" is likely to show you info about how things are done in Java, not in Javascript.
Everything in programming requires one to be precise. Using the wrong terminology will not help you.
And JSON is an extension of Java Script.
No it isn't. It's a data exchange format invented for use in Javascript though nowadays you'll find it used elsewhere too. It has nothing to do with Java.
The "JS" stands for "Java Script"
Sort-of. It's two letters of the single word "Javascript", which is sometimes written as "JavaScript".
And curl and wget have no way of running java scripts.
No, but as people on the curl list have explained before, you can fetch pages and parse out the relevant details and work out what to fetch next.
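For what that's worth, the manual version of "fetch, parse, work out what to fetch next" can be as crude as the sketch below. The URL and the href pattern are invented; a real page needs its own pattern, and relative links would need resolving properly:

  # fetch a page, pull one link out of the returned html, then fetch
  # whatever that link points at
  next=$(curl -s "https://www.example.com/start.html" \
         | grep -o 'href="[^"]*\.ver"' | head -1 \
         | sed 's/^href="//; s/"$//')
  curl -s -o latest.ver "https://www.example.com/$next"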
Not until Fulko and Gordon did I know what to look for.
I think you need to understand far better what a browser does, and play with the browser developer tools (on simpler sites than the eset one) to see what they can do for you.
The people on the curl list expected you to go and do that.
The developer tools will show you, when you fetch "a" page, that the browser in fact fetches a whole load of different files, and if you poke around you will see how sets of those are fetched after earlier files have been fetched (and parsed) and seen to contain references to later ones. You can also see eg how long each of those fetches takes.
The tools also allow you to see the contents of scripts that are fetched individually as well as isolated sections of JS that are embedded in a page.
You can intercept what the bits of JS do and watch them execute in a debugger, and alter them. (To find out how, you need to read the Mozilla (for Firefox) or whoever else's docs AND you need to experiment with simple pages that you already fully understand - ideally your own - to see how the tools let you explore and alter what a page does.)
One of the problems with many modern websites is their programmers grab bits of JS from "toolkits" and "libraries" written by other people, eg to achieve some amazing visual effect on a page. They might embed 100 KB of someone else's JS and CSS, just to make one tiny part of their site do something "clever". Often almost all of the embedded/associated JS on a page isn't actually used on that page, but the site designers neither know nor care.
Another issue is that (say) a JS library might exist in more than one form. Often sites embed "minimised" versions of commonly-used scripts - these have spaces and comments removed and variable names etc reduced to one or two characters. The script is then maybe a tenth of the size of an easily-readable-by-a-human version (so will download faster and waste less of the server's resources). Your browser will understand a minimised script just as easily as a verbose human-readable one ... but you won't. Some commonly used scripts (eg those for "jquery") exist in matching minimised and readable forms, so you could download a readable equivalent (and I think developer tools will sometimes do that for you).
But... to understand what a script that uses (eg) jquery is doing, you'll either need to look at it in fine detail (& understand what it is capable of doing - so eg know all about the "document object model" and how css and JS are commonly used on webpages) or at the very least have skim-read lots of jquery documentation.
It won't necessarily be easy to work out which bits of JS on a page are just for visual effects, and which bits are for necessary function.
The guys on the curl group were not as explicit as Fulko and Gordon.
You're expected when using a programmers' utility to read its documentation and anything else that people mention. You're especially expected to understand how the whole browser process works.
I did what the mensches on the curl list told me to do and dug around a lot, but could not make heads or tails out of the page.
Then you should have asked for more help. But it's important when doing that to show that you have made an effort to understand; to tell people what you read (so eg if you're going off on a wild goose chase people can point you back at the relevant things) and what you've tried.
You might eg explain how after fetching html page a you then discovered the names of scripts b and c on it, and fetched those, but didn't know what to do next.
No-one, on a voluntary support forum, is going to do the whole job for you. They might, if they have the time and the inclination, look at what you've managed to do so far (and ideally the code you used to do it) and suggest how to add to it to do the next stage.
The other aspect of that is that if you demonstrate some level of skill as a programmer, people will know at what level to pitch their replies. If the person asking how to do something cannot write working programs they have no chance of automating any process using scripts, parsing html or JS that they fetch etc. On the other hand if someone shows that they already understand all that, they're more likely to get appropriate help.
I've written sequences of curl requests interspersed with logic (in Regina REXX and ooREXX) that grabs successive pages of a website, finding on each one the information required to grab the next page in the sequence ... but it took days & days to make the whole process work reliably. One of my sets of these processes grabbed crosswords from the puzzle pages of a certain newspaper. The structure of the pages was different on Mon-Fri, Sat, and Sun. And at certain times of year, eg Easter, different again. Over time the editorial and presentational style of the site changed too, so code that worked perfectly for weeks could suddenly go wrong because the underlying html (or the css around it) had changed. So code I wrote to extract data from returned pages and scripts needed to be 'aware' that the layout of content in the html etc might not be the same as any of the previously-seen layouts, and stop and tell me if something didn't seem right.
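The skeleton of that kind of curl-plus-glue script (mine were in REXX; this sketch is shell, and the site details, file names and patterns are made up) is roughly:

  # loop over successive pages, extracting from each one the URL of the
  # next, and stop noisily if the page layout looks wrong
  url="https://www.example.com/puzzles/page1.html"   # invented starting point
  while [ -n "$url" ]; do
      curl -s -b cookies.txt -o current.html "$url"
      # sanity check: if the marker the script relies on isn't there, the
      # site layout has probably changed - stop and let a human look at it
      if ! grep -q 'class="puzzle"' current.html; then
          echo "page layout looks different: $url" >&2
          break
      fi
      # ... extract whatever you actually wanted from current.html here ...
      # find the link to the next page (the pattern is invented, and a real
      # site may use relative URLs that need resolving)
      url=$(grep -o 'href="[^"]*next[^"]*"' current.html \
            | head -1 | sed 's/^href="//; s/"$//')
  done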
Writing reliable logic to do this is not straightforward.
My idea of what is straightforward might not match yours. (I've a computing degree and worked for years as, first, a programmer (on microcomputers and a small mainframe), then a systems programmer (installing & customising the OS for a bank's mainframe), then a programmer again, leading a programming team writing systemy automation programs for that bank - ie we weren't part of the teams that wrote code that moved money around.)
Even so, the websites I've explored using curl - parsing what comes back and then issuing more curl requests - tended to be less complex some years ago than the norm nowadays. It's not necessarily impossible to do it, but it gets harder and harder to understand the code on many sites, so working out the "glue" logic required to do this gets more and more difficult.
Sometimes in the past, eg when smartphones were a whole lot less capable, websites existed in simpler forms for phone users. Sometimes also much simpler ones for eg blind users using screen-readers. When that was the case, making curl etc grab the blind-users' website pages considerably simplified the whole process. Nowadays, it's more common for there only to be one version of a website, with much more complex code on it which might adapt to the needs of eg blind users. It's therefore sometimes worthwhile looking at the website help pages, if it has them, particularly any info about "accessibility", to see if there's any choice in what you get. (Though replicating that may need you to simulate a login and/or use cookies saved from a previous visit.) And if all it does is eg make a page /look/ simpler, but the html & scripts sent to you are unchanged, there'll be no advantage unless the route through the page JS etc is simpler - ie if you're using a debugger to work out what it does, that process may be simpler. But probably it won't.
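If you want to check whether a particular site still hands out a simpler variant, one low-effort experiment (the user-agent string and URL here are only examples, and plenty of sites will simply ignore them) is:

  # ask for the page while claiming to be a phone browser, then compare
  # with what the default curl user agent gets back
  curl -s -A "Mozilla/5.0 (Linux; Android 4.4; Mobile)" \
       -o mobile.html "https://www.example.com/somepage"
  curl -s -o default.html "https://www.example.com/somepage"
  diff mobile.html default.html | head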