Fetching a page

The web works with a metaphor of “pages”. When you put a URL into a browser, you see a “page” of content.

For example, if you visit https://github.com/presnick/runestone, you will see the home page for the open source project whose contents are used to run this online textbook.

The browser is just a computer program that fetches the contents and displays them in a nice way. If you want to see what the contents are, in plain text, right-click on the page and select View source, or whatever the equivalent is in your browser.
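To make the difference between the raw contents and the displayed page concrete, here is a minimal sketch using Python's standard html.parser module. The page source is a tiny made-up string, not fetched from any real site, and the class name TextOnly is only illustrative. A real browser does far more than this (layout, styling, images); the sketch only shows that the "source" is plain text with tags, and that the visible words are one part of it.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        # Keep only non-empty chunks of text
        if data.strip():
            self.pieces.append(data.strip())


# A made-up page source, similar in spirit to what View source shows
source = "<html><body><h1>Hello</h1><p>A <b>page</b> of content.</p></body></html>"

parser = TextOnly()
parser.feed(source)
visible = " ".join(parser.pieces)

print(source)   # the raw contents, tags and all
print(visible)  # roughly the words a browser would display
```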

You don’t need to use a browser to fetch the contents of a page, though.

Fetching with UNIX curl

At the Git Bash prompt, you can invoke the UNIX curl command:

curl https://github.com/presnick/ProgramsInformationPeople

Assuming you have a network connection, it will soon print out a whole lot of text. That’s the same kind of text that your browser gets and that was shown when you did View source. If you want to page through it more slowly, pipe the output to the less command:

curl https://github.com/presnick/ProgramsInformationPeople | less

Fetching in Python with requests.get

In Python, there’s a module called requests. If you haven’t already, install pip and use it to install the requests module. Information on how to do that is in the pip chapter.

Then, you can use the get function in the requests module to fetch the contents of a page.

The code below prints only the first 1000 characters. It turns out that somewhere later on the page there is an ellipsis character (a single character representing an ellipsis, …). If we try to print out the whole contents, we can get an error, depending on the Python version and the terminal’s encoding. (You’ll learn a little bit more about handling Unicode characters in another chapter.) For now, if you try this code, just extract the first 1000 characters when printing it out.

Note that you must save this code in a program file of your own and run it on your own machine, because it relies on the external requests module being installed.

import requests

# Fetch the page; page.text holds its contents as a string
page = requests.get("https://github.com/presnick/runestone")
# Print only the first 1000 characters
print(page.text[:1000])
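If you do want to print more than the first 1000 characters, one possible workaround is to replace any character the terminal might not handle before printing. The following is a minimal sketch, not this textbook's own approach: the helper name printable and the sample string are made up for illustration, and it assumes that replacing every non-ASCII character with ? is acceptable for your purposes.

```python
def printable(text, limit=None):
    """Return text with every non-ASCII character replaced by '?'.

    limit (hypothetical parameter) optionally keeps only the first
    limit characters, like the slice in the example above.
    """
    if limit is not None:
        text = text[:limit]
    # Encoding with errors="replace" swaps unencodable characters for '?'
    return text.encode("ascii", errors="replace").decode("ascii")


# A made-up sample containing the single-character ellipsis (U+2026)
sample = u"Loading\u2026 done"
print(printable(sample))  # prints: Loading? done
```

With the page fetched earlier, print(printable(page.text)) would print the whole contents, at the cost of losing the original non-ASCII characters.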