One of the first questions that comes up when you start to
move beyond the basics of web scraping is: “How do I access information
behind a login screen?” The Web is increasingly moving toward
interaction, social media, and user-generated content. Forms and logins
are an integral part of these types of sites and almost impossible to avoid. Fortunately, they are also relatively easy to deal with.
Up until this point, most of our interactions with web servers in our example scrapers have consisted of using HTTP GET to request information. In this chapter, we’ll focus on the POST method, which pushes information to a web server for storage and analysis.
Forms basically give users a way to submit a POST request that the web server can understand and use. Just like link tags on a website help users format GET requests, HTML forms help them format POST
requests. Of course, with a little bit of coding, it is possible to
simply create these requests ourselves and submit them with a scraper.
Python Requests Library
Although it’s possible to navigate web forms using only the Python
core libraries, sometimes a little syntactic sugar makes life a lot
sweeter. When you start to do more than a basic GET request with urllib, it can help to look outside the Python core libraries.
The Requests library is excellent at handling complicated HTTP requests, cookies, headers, and much more.
Here’s what Requests creator Kenneth Reitz has to say about Python’s core tools:
Python’s standard urllib2
module provides most of the HTTP capabilities you need, but the API is
thoroughly broken. It was built for a different time—and a different
web. It requires an enormous amount of work (even method overrides) to
perform the simplest of tasks.
Things shouldn’t be this way. Not in Python.
As with any Python library, the Requests library can be installed with any third-party Python library manager, such as pip, or by downloading and installing the source file.
Submitting a Basic Form
Most web forms
consist of a few HTML fields, a submit button, and an “action” page,
where the actual form processing is done. The HTML fields usually
consist of text but might also contain a file upload or some other
non-text content.
Most popular websites block access to their login forms in their robots.txt file, so to play it safe I’ve
constructed a series of different types of forms and logins at pythonscraping.com that you can run your web scrapers against. The most basic of these forms is located at http://bit.ly/1AGKPRU.
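The markup of that form looks essentially like this (a reconstructed sketch; presentation details may differ from the live page, but the field names and action match the discussion below):

<form method="post" action="processing.php">
First name: <input type="text" name="firstname"><br>
Last name: <input type="text" name="lastname"><br>
<input type="submit" value="Submit">
</form>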
A couple of things to notice here: first, the names of the two input fields are firstname and lastname. This is important. The names of these fields determine the names of the variable parameters that will be POSTed to the server when the form is submitted. If you want to mimic the action that the form will take when POSTing your own data, you need to make sure that your variable names match up.
The second thing to note is that the action of the form is actually at processing.php (the absolute path is http://bit.ly/1d7TPVk). Any POST requests to the form should be made to this page, not to the page where the form itself resides. Remember: the purpose of
HTML forms is only to help website visitors format proper requests to
send to the page that does the real action. Unless you are researching how the request itself is formatted, you don’t need to bother much with the page that the form is found on.
Submitting a form with the
Requests library can be done in four lines, including the import and
the instruction to print the content (yes, it’s that easy):
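A minimal sketch (the field names match the form’s inputs; the processing URL is assumed to resolve from the form’s action page, http://bit.ly/1d7TPVk):

import requests

# The keys must match the form's input names, firstname and lastname
params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}
# URL assumed from the form's action, processing.php
r = requests.post("http://pythonscraping.com/files/processing.php", data=params)
print(r.text)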
After the form is submitted, the script should return with the page’s content:
Hello there, Ryan Mitchell!
This script can be applied to many simple forms encountered on the
Internet. The form to sign up for the O’Reilly Media newsletter, for
example, looks something like this (the following is an illustrative sketch; the live form’s exact markup, field names, and hidden fields will differ):
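<!-- hypothetical sketch of an email-signup form; not the actual O'Reilly markup -->
<form action="signup_processor" method="POST">
    <input type="email" name="email_address" placeholder="Enter your email address">
    <input type="submit" value="Sign Up">
</form>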
In this case, the page returned is simply another form to fill out before you can actually make it onto O’Reilly’s mailing list, but the same concept could be applied to that form as well. However, I would
request that you use your powers for good, and not spam the publisher
with invalid signups, if you want to try this at home.
Radio Buttons, Checkboxes, and Other Inputs
Obviously, not all web forms
are a collection of text fields followed by a submit button. Standard
HTML contains a wide variety of possible form input fields: radio
buttons, checkboxes, and select boxes, to name a few. In HTML5, there’s
the addition of sliders (range input fields), email, dates, and more.
With custom JavaScript fields, the possibilities are endless: colorpickers, calendars, and whatever else developers come up with next.
Regardless of the seeming complexity of any sort of form field, there
are only two things you need to worry about: the name of the element
and its value. The element’s name can be easily determined by looking at
the source code and finding the name attribute. The value can sometimes be trickier, as it might be populated by JavaScript immediately before form submission. Colorpickers, as an example of a fairly exotic form field, will likely have a value of something like #F03030.
If you’re unsure of the format of an input field’s value, there are a number of tools you can use to track the GET and POST requests your browser is sending to sites. The best and perhaps most obvious way to track GET requests, as mentioned before, is simply to look at the URL of a site. If the URL is something like:
http://domainname.com?thing1=foo&thing2=bar
You know that this corresponds to a form of this type (in the sketch below, the action page and input types are placeholders; only the parameter names are dictated by the URL):
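<!-- sketch: someProcessor.php and the input types are placeholders -->
<form method="get" action="someProcessor.php">
    <input type="text" name="thing1">
    <input type="text" name="thing2">
    <input type="submit" value="Submit">
</form>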
If you’re stuck with a complicated-looking POST form, and you want to see exactly which parameters your browser is sending to the server, the easiest way is to use your browser’s inspector or developer tool to view them, as shown in Figure 9-1.
Figure 9-1. The Form Data section, highlighted in a box, shows the POST parameters “thing1” and “thing2” with their values “foo” and “bar”
The Chrome developer tool can
be accessed via the menu by going to View → Developer → Developer
Tools. It provides a list of all queries that your browser produces
while interacting with the current website and can be a good way to view
the composition of these queries in detail.
Submitting Files and Images
Although file uploads are common on the Internet, they are not something often used in web scraping. It is possible, however, that you might want to write a
test for your own site that involves a file upload. At any rate, it’s a
useful thing to know how to do.
There is a practice file upload form at http://pythonscraping.com/files/form2.html. The form on the page has markup along these lines (reconstructed as a sketch; the action page name is assumed):
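<!-- reconstructed sketch: the uploadFile field name is referenced in the text below;
     the action page name is assumed -->
<form action="processing2.php" method="post" enctype="multipart/form-data">
Submit a jpg, png, or gif: <input type="file" name="uploadFile"><br>
<input type="submit" value="Upload File">
</form>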
Except for the <input> tag having the type attribute file,
it looks essentially the same as the text-based forms used in the
previous examples. Fortunately, the way the forms are used by the Python
Requests library is also very similar:
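A minimal sketch, assuming the form’s action resolves to processing2.php:

import requests

# The key must match the form's field name, uploadFile; the value is a
# file object opened in binary mode
files = {'uploadFile': open('../files/Python-logo.png', 'rb')}
# URL assumed from the form's action
r = requests.post("http://pythonscraping.com/files/processing2.php", files=files)
print(r.text)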
Note that in lieu of a simple string, the value submitted to the form field (with the name uploadFile) is now a Python File object, as returned by the open function. In this example, I am submitting an image file, stored on my local machine, at the path ../files/Python-logo.png, relative to where the Python script is being run from.
Yes, it’s really that easy!
Handling Logins and Cookies
So far, we’ve mostly discussed forms that allow you to submit information to a site or let you view needed information on the page immediately after the form. How is this different from a login form, which lets you exist in a
permanent “logged in” state throughout your visit to the site?
Most modern websites use cookies to keep track of who is logged in and who is not. Once a site authenticates your login credentials, it stores a cookie in your browser, which usually contains a
server-generated token, timeout, and tracking information. The site then
uses this cookie as a sort of proof of authentication, which is shown
to each page you visit during your time on the site. Before the
widespread use of cookies in the mid-90s, keeping users securely
authenticated and tracking them was a huge problem for websites.
Although cookies are a great solution for web developers, they can be
problematic for web scrapers. You can submit a login form all day long,
but if you don’t keep track of the cookie the form sends back to you
afterward, the next page you visit will act as though you’ve never
logged in at all.
I’ve created a simple login form at http://bit.ly/1KwvSSG (the username can be anything, but the password must be “password”).
This form is processed at http://bit.ly/1d7U2I1, which in turn contains a link to the “main site” page, http://bit.ly/1JcansT.
If you attempt to access the welcome page or the profile page without
logging in first, you’ll get an error message and instructions to log
in first before continuing. On the profile page, a check is done on your browser’s cookies to see whether the cookie set on the login page is present.
Keeping track of cookies is easy with the Requests library:
import requests

params = {'username': 'Ryan', 'password': 'password'}
# The welcome page acts as the processor for the login form
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")
# Send the cookies from the login response along with the next request
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies)
print(r.text)
Here I am sending the login parameters to the welcome page, which
acts as the processor for the login form. I retrieve the cookies from
the results of the last request, print the result for verification, and
then send them to the profile page by setting the cookies argument.
This works well for simple situations, but what if you’re dealing
with a more complicated site that frequently modifies cookies without
warning, or if you’d rather not even think about the cookies to begin
with? The Requests session object works perfectly in this case:
import requests

session = requests.Session()

params = {'username': 'username', 'password': 'password'}
# Posting through the session stores any cookies the server sets
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print("-----------")
print("Going to profile page...")
# No cookies argument needed; the session sends them automatically
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)
In this case, the session object (retrieved by calling requests.Session())
keeps track of session information, such as cookies, headers, and even
information about protocols you might be running on top of HTTP, such as
HTTPAdapters.
Requests is a fantastic library, second perhaps only to Selenium (which we’ll cover in Chapter 10) in
the completeness of what it handles without programmers having to think
about it or write the code themselves. Although it might be tempting to
sit back and let the library do all the work, it’s extremely important
to always be aware of what the cookies look like and what they are
controlling when writing web scrapers. It could save many hours of
painful debugging or figuring out why a website is behaving strangely!
HTTP Basic Access Authentication
Before the advent of cookies, one popular way to handle logins was with HTTP basic access authentication.
You still see it from time to time, especially on high-security or
corporate sites, and with some APIs. I’ve created a page at http://pythonscraping.com/pages/auth/login.php that has this type of authentication (Figure 9-2).
Figure 9-2. The user must provide a username and password to get to the page protected by basic access authentication
As usual with these examples, you can log in with any username, but the password must be “password.”
The Requests package contains an auth module specifically designed to handle HTTP authentication:
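A minimal sketch (any username works on this demo page, per the note above):

import requests
from requests.auth import HTTPBasicAuth

# The password must be "password"; the username here is arbitrary
auth = HTTPBasicAuth('ryan', 'password')
r = requests.post(url="http://pythonscraping.com/pages/auth/login.php", auth=auth)
print(r.text)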
Although this appears to be a normal POST request, an HTTPBasicAuth object is passed as the auth argument in the request. The resulting text will be the page protected by the username and password (or an Access Denied page, if the request failed).