Re: Accessing forms through php

On Thu, June 23, 2005 3:24 am, José Miguel López-Coronado said:
> I have seen how to use cURL to retrieve results from a web site after
> processing a form. The problem is that I want to completely simulate the
> submitting of a form; I mean, I want to "enter" the page on the server
> instead of retrieving the results into my own page. I'm trying to use a
> PHP script to log in to a site without having to enter user and pass.
> Does anyone know how to do this using cURL, or any other option like
> HEADER or something like that?

You will need to call curl_exec() several times.

First, you'll use cURL to *read* the page with the FORM in it.  This may
(or may not) trigger some Cookies and/or some embedded tokens in the FORM
that you may need.

If it has no cookies and nothing fancy embedded in the FORM, then you may
be able to comment out this first chunk of code, after you figure out that
you don't need it, and you can proceed to Step 2.

Do *NOT* delete this code.  You may need it tomorrow if they change their
login routine.  Been there.  Done that.  Keep the code.
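Step 1 might look something like this sketch -- the URL is made up, and the hidden-field scraper is a quick regex hack rather than a real HTML parser:

```php
<?php
// Step 1 sketch: GET the page with the FORM, keeping headers so any
// Set-Cookie lines are visible. (The URL is a placeholder.)
function fetch_login_form(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // hand back the response
    curl_setopt($ch, CURLOPT_HEADER, true);         // include headers (cookies!)
    $response = curl_exec($ch);
    curl_close($ch);
    return $response === false ? '' : $response;
}

// Scrape any hidden INPUT tokens out of the FORM -- these often have to
// be echoed back with the login POST.
function extract_hidden_fields(string $html): array
{
    $fields = [];
    preg_match_all('/<input[^>]*type=["\']hidden["\'][^>]*>/i', $html, $inputs);
    foreach ($inputs[0] as $input) {
        if (preg_match('/name=["\']([^"\']+)["\']/i', $input, $n)
            && preg_match('/value=["\']([^"\']*)["\']/i', $input, $v)) {
            $fields[$n[1]] = $v[1];
        }
    }
    return $fields;
}

// Works on canned HTML, no network needed:
$html = '<form><input type="hidden" name="token" value="abc123"></form>';
var_export(extract_hidden_fields($html)); // the hidden token/value pair
```

The regex approach is brittle; for one site's login form it's usually enough, but don't expect it to survive arbitrary HTML.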

Second, you'll need to POST the username/password with cURL and, again,
capture all the results.

This step will almost certainly send a Cookie, or embed some kind of token
in the URLs and FORMs that you need to catch and pass on to all subsequent
HTTP/HTTPS requests.
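A sketch of that POST -- the URL and the 'username'/'password' field names are assumptions, so substitute whatever the real FORM uses, plus any hidden tokens captured in the first step:

```php
<?php
// Merge the scraped hidden tokens with the credentials. Field names here
// are placeholders -- read them out of the actual FORM.
function build_login_post(array $hiddenFields, string $user, string $pass): string
{
    $fields = $hiddenFields + ['username' => $user, 'password' => $pass];
    return http_build_query($fields); // urlencoded, ready for CURLOPT_POSTFIELDS
}

function post_login(string $url, string $postBody): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, true);  // we need the Set-Cookie headers back
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postBody);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response === false ? '' : $response;
}

echo build_login_post(['token' => 'abc123'], 'me', 's3cret');
// token=abc123&username=me&password=s3cret
```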

Third, you can request whatever it is you wanted in the first place, but
you need to pass in the Cookies and tokens you got from the second step.
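Sketched with a hand-rolled Cookie header (cookie names are placeholders):

```php
<?php
// Turn the cookies captured from the login POST into a single
// Cookie-header value.
function cookie_string(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = "$name=$value";
    }
    return implode('; ', $pairs);
}

// Fetch the page you actually wanted, sending the cookies along. (Any
// embedded tokens go into the URL or POST body as the site demands.)
function fetch_with_cookies(string $url, array $cookies): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIE, cookie_string($cookies));
    $response = curl_exec($ch);
    curl_close($ch);
    return $response === false ? '' : $response;
}

echo cookie_string(['PHPSESSID' => 'deadbeef', 'token' => 'abc123']);
// PHPSESSID=deadbeef; token=abc123
```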

Many of these pages might, or might not, return 302 headers for "Object
Moved".  If it's an MS/ASP/.NET site, it will have a bunch of them, mostly
bogus; you can usually bypass them.

Again, you want to keep the code around in case some day they change their
login procedure and those stupid re-directs actually have meaning.

Right now, they just waste resources and sell more hardware for Microsoft,
but is that really a shock?
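With CURLOPT_FOLLOWLOCATION left off, cURL hands each 302 back to you and you decide which Location headers actually matter. A minimal sketch for pulling the Location out of the raw headers:

```php
<?php
// Extract the first Location: header, if any, from a raw header blob
// (assumes CURLOPT_HEADER was on so the headers are in the response).
function location_from_headers(string $rawHeaders): ?string
{
    if (preg_match('/^Location:\s*(\S+)/im', $rawHeaders, $m)) {
        return $m[1];
    }
    return null; // no redirect -- probably the page you wanted
}

$raw = "HTTP/1.1 302 Object Moved\r\nLocation: /login/step2.asp\r\n\r\n";
echo location_from_headers($raw); // /login/step2.asp
```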

Reverse-engineering the login of a page can be challenging.  Sometimes
stuff you think is totally unnecessary turns out to be needed.

Here are some hard-won Tips:

1. The button clicked on for login (or other form submission) may or may
not have a NAME="..." attribute.

If it has a NAME="..." attribute, you *may* need to pass in its
VALUE="..." as one of your arguments in the POST/GET with cURL to make it
work.

While it's less likely you need it, I've seen at least one site where they
RELY on the default NAME="..." being "Submit" and its value of "Submit"
and, yes, you had to pass those in to get past the gate. [shudder]
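So when in doubt, send the button's name/value pair too; it costs nothing. A tiny sketch (all field names assumed):

```php
<?php
// Include the submit button's NAME/VALUE alongside the credentials --
// some gates really do check for it. (All names here are placeholders.)
$fields = [
    'username' => 'me',
    'password' => 's3cret',
    'Submit'   => 'Submit', // yes, really
];
echo http_build_query($fields);
// username=me&password=s3cret&Submit=Submit
```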

2. You need to catch/send Cookies.  I never did get cURL's automated
cookie jar feature (CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE) to work for
me.  But it helps to see the Cookie variable names and values to figure
out what the other programmer did (or didn't do) for their login process
anyway, so you might as well manage them by hand.
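Managing them by hand can be as simple as scraping the Set-Cookie lines out of the raw headers. A deliberately naive sketch that ignores path/domain/expiry attributes:

```php
<?php
// Collect name=value pairs from every Set-Cookie header in the raw
// response headers. Attributes like path= and expires= are ignored.
function parse_cookies(string $rawHeaders): array
{
    $cookies = [];
    preg_match_all('/^Set-Cookie:\s*([^=]+)=([^;\r\n]*)/im', $rawHeaders, $m, PREG_SET_ORDER);
    foreach ($m as $match) {
        $cookies[trim($match[1])] = $match[2];
    }
    return $cookies;
}

$raw = "HTTP/1.1 200 OK\r\nSet-Cookie: PHPSESSID=deadbeef; path=/\r\n\r\n";
var_export(parse_cookies($raw)); // the PHPSESSID cookie, by itself
```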

3. In some cases, I needed a whole new cURL handle to send the next
request.  Later research indicated that maybe cURL was still doing a POST
because the handle remembered it from the last request, and I should have
overridden that...  It was easier to just keep the code that gets a fresh
cURL handle.  In a high-performance situation, you might care to fix that
properly.  I didn't.
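If you'd rather reuse the handle, the fix is (probably) to flip it back to GET explicitly with CURLOPT_HTTPGET. This sketch shows both that reset and the lazy fresh-handle route:

```php
<?php
// Option A: reset a reused handle back to GET before the next request.
function refetch_as_get($ch, string $url): string
{
    curl_setopt($ch, CURLOPT_HTTPGET, true); // clears the lingering POST mode
    curl_setopt($ch, CURLOPT_URL, $url);
    $response = curl_exec($ch);
    return $response === false ? '' : $response;
}

// Option B: the lazy fix -- close the old handle and grab a fresh one
// with clean defaults. Slower, but it's the code I kept.
function fresh_handle(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return $ch;
}
```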

4. Keep debug output that dumps out the HTML you get at each step. 
Comment it out, but keep it. Document in the code itself what you found
were the relevant elements that you needed to achieve the next HTTP
interaction.  Also comment anything funky that you thought would be
relevant/needed data, but turned out to be useless, or even detrimental to
send back on the next request.  Yesterday's junk could be tomorrow's gold.

5. Dump out all headers and all HTML in your first debug code. When you
think you know what's relevant, add more debug code below that to dump out
the relevant stuff, and comment out the verbose debug code that dumps
everything out.  When you're *SURE* you have it right, and can login not
just today, but also tomorrow, and also from another computer or three,
then comment out the concise debug code.  Keep all of it.  You'll need it
again when they change their login.
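A small helper makes both levels of debug dumping painless. This sketch assumes CURLOPT_HEADER was on, and glosses over the multiple header blocks you get with redirects or proxies:

```php
<?php
// Split a response into headers and body at the first blank line.
function split_response(string $response): array
{
    $parts = explode("\r\n\r\n", $response, 2);
    return ['headers' => $parts[0], 'body' => $parts[1] ?? ''];
}

$debug = split_response("HTTP/1.1 200 OK\r\nX-Foo: bar\r\n\r\n<html>hi</html>");

// Verbose debug -- comment out once the login works, but KEEP it:
// echo "=== HEADERS ===\n{$debug['headers']}\n=== BODY ===\n{$debug['body']}\n";

// Concise debug -- comment this out too once you're *SURE*:
echo strlen($debug['body']), " bytes of body\n";
```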

6. Set up your own PHP script on your own server that spits out all the
headers sent by an HTTP request, and then be ready to copy/paste your
browser's headers in as the headers your script sends.  Sometimes the
silliest things are used by them to try to stop your "robot" from logging
in.  User-Agent springs to mind.
Anything your browser sends as a header is "fair game" for them to be
checking, even if they are violating HTTP standards.  Hey, this is the
real world. You *WILL* find a site that violates standards really fast if
you dig into this very much.
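A sketch of both halves -- a header-echo helper for your own server, and the cURL side mimicking what your browser sent. The HTTP_* handling follows the standard CGI conventions for $_SERVER; everything else is made up:

```php
<?php
// --- For YOUR server: dump every request header the client sent. ---
// Hit it with your real browser, then make your robot send the same ones.
function request_headers_from_server(array $server): array
{
    $headers = [];
    foreach ($server as $key => $value) {
        if (strpos($key, 'HTTP_') === 0) {
            // HTTP_USER_AGENT -> User-Agent
            $name = str_replace(
                ' ', '-',
                ucwords(strtolower(str_replace('_', ' ', substr($key, 5))))
            );
            $headers[$name] = $value;
        }
    }
    return $headers;
}

var_export(request_headers_from_server(['HTTP_USER_AGENT' => 'Mozilla/5.0']));

// --- For your robot: copy the browser's headers into the request. ---
// curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (...whatever yours says)');
// curl_setopt($ch, CURLOPT_HTTPHEADER, ['Accept-Language: en-us' /* etc. */]);
```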

7. It's kind of "fun," in a hacker sort of way, to work through the login
process of another site and see how it all fits together.  It's certainly
instructive!  It can also be challenging.  If you're stuck, take a break.
Staring at the code and their HTTP output for too long is unlikely to be
fruitful.

-- 
Like Music?
http://l-i-e.com/artists.htm

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

