Searching Realtor.com with Python

In this tutorial I will show you how I used Python to get a list of addresses and prices from realtor.com for a given area.

Disclaimer

Before we start I need to point out that realtor.com does not want web scraping done against their site. They have gone so far as to use a system built by Distil Networks to identify and block bots. Their main concern is that someone will scrape all the data off their site, build a competing site, and earn ad revenue off their data. Here is a link describing what they are doing. While I am technically scraping (reading web pages with something other than a browser), I am not going to republish any of the information I gather, and I am only doing this to use the information they publish more efficiently. I believe that while the advice below may violate their view of the letter of the law, it does not violate the spirit of it.

Source

The source for this can be found on my GitHub gist. This was done for the rent-scout project; over time rent-scout will be updated and some of what I post below will become out of date.

The first url

Before you can begin parsing any website you need to find a url you can work with. The nice thing about urls is that they act like folder paths that take arguments, so you can string different arguments together to reach the page you want. I discovered that if I went to http://www.realtor.com/local/Colorado I got a table of links showing the top cities in that state along with the number of listings available for rent and for purchase. I am mostly interested in the ones available for purchase because I want them as rental properties. To read that page you can use the libraries lxml and requests, which you can install with pip install lxml requests. The following will read the page and print the html data in text form:
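A minimal version looks something like this; the User-Agent string is just an example of something browser-like (more on why it matters in the next note):

    import requests

    # create a session with a browser-like User-Agent header;
    # without one, realtor.com answers with an endless chain of redirects
    session = requests.Session()
    session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})

    # fetch the state page and dump the raw html as text
    page = session.get('http://www.realtor.com/local/Colorado')
    print(page.text)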

Troubleshooting Notes

I discovered very quickly that if I didn't create a session with a User-Agent header, the site would put me into infinite redirects. This is probably one of their tactics to dissuade beginners from writing scripts to scrape their site. Sessions have other benefits as well, one of which is speed: a session reuses the TCP connection instead of setting up a new one each time it sends a request. If you don't set the User-Agent to something browser-like, it will look something like this: python-requests/1.2.0. That is a dead giveaway to the server that you are not a browser, but Python's requests library trying to get the server to respond with an html page you can parse.

Parse that table

The steps for parsing the table of cities are: first build a tree from the html data, then find the XPaths to the items we want, and finally parse that data into a useful structure.

Build the tree

There is a tool in the html package from lxml that can build a tree from a string:
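Assuming page is the response fetched above, it is a one-liner:

    from lxml import html

    # parse the raw html text into an element tree we can run xpath queries against
    tree = html.fromstring(page.text)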

Find the XPaths

XPath is a way to navigate through the elements and attributes in an XML document. The easiest way to find an xpath to an item you are looking for is to use Google Chrome. First open the Developer Tools (View->Developer->Developer Tools), then right-click on an element you are interested in and select Inspect. That will highlight the line of html code the link comes from. Right-click on that line of code and select Copy->Copy XPath. If you do that to the link for the city Aurora on the page http://www.realtor.com/local/Colorado you will get something like this:
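    //*[@id="top-cities"]/tbody/tr[2]/td[1]/a

The exact row and column indices depend on where Aurora sits in the table when you look; the id at the front is the useful part.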

Parse the data

You will notice that the table sits inside a tag with the id top-cities, and that the table is made of tr and td tags: tr is a row, td is a column, and a is a link. The second column is a list of links to the homes for sale in each area. I found that one particularly interesting, so I used code along these lines to investigate:
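    # every link in the second column of the top-cities table
    # (the xpath assumes the layout described above)
    links = tree.xpath('//*[@id="top-cities"]//tr/td[2]/a')
    for link in links:
        print(link.attrib)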

The output is one attribute dictionary per city link. Schematically (the real paths, titles, and any extra attributes are whatever the page serves at the time) it looks like this:
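    {'href': '/realestateandhomes-search/<City>_CO', 'title': '<City> Homes for Sale'}
    {'href': '/realestateandhomes-search/<City>_CO', 'title': '<City> Homes for Sale'}
    ...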

Now this is something I can work with. The next step was to get the names of the cities out of the title attribute, and to do that I used code like this:
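    # the title attribute reads something like "<City> Homes for Sale", so strip
    # the trailing words to keep just the city name (the exact wording is an
    # assumption -- adjust it to match what the page actually says)
    cities = [link.get('title').replace(' Homes for Sale', '') for link in links]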

Next I wanted to grab the links, the number of properties listed for sale, and the number of properties listed for rent for each location. This is the sort of code I used to do that:
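The column positions and the assumption that the listing counts appear in the link text are guesses about the table layout, so adjust them to match what the inspector shows you:

    import re

    base_url = 'http://www.realtor.com'

    # assumed layout: column 2 holds the for-sale links, column 3 the rentals,
    # with the listing count somewhere in the link text
    sale_links = tree.xpath('//*[@id="top-cities"]//tr/td[2]/a')
    rent_links = tree.xpath('//*[@id="top-cities"]//tr/td[3]/a')

    for_sale = [int(re.sub(r'[^\d]', '', link.text_content())) for link in sale_links]
    for_rent = [int(re.sub(r'[^\d]', '', link.text_content())) for link in rent_links]

    # one search url per city, with pgsz set to the number of homes for sale
    # so that every listing shows up on a single page (see the note below)
    sale_urls = [base_url + link.get('href') + '?pgsz=' + str(count)
                 for link, count in zip(sale_links, for_sale)]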

Syntax Note

You will notice that I'm using a technique to build lists that you may not have used before. It is called a list comprehension and here is a link that explains how to use it: List Comprehension
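As a quick illustration, these two snippets build the same list:

    # the loop version
    squares = []
    for n in range(10):
        squares.append(n * n)

    # the same thing as a list comprehension
    squares = [n * n for n in range(10)]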

URL argument Note

The second thing you should notice about the code above is the argument I'm adding to the end of the url. With urls you can add arguments by listing them after a question mark as key=value pairs separated by ampersands, like this: ?key=value&key2=value2. I used this technique to set the variable pgsz to the number of houses available for sale in that area, which I discovered causes the website to display all the listings for that area on one page. To figure this out I hovered my mouse over the next-page button at the bottom of a results page and saw this argument in the link; I changed it and noted what it did. This makes the next step much easier.

Getting a link to the listing

The next step proved to be much harder. I wanted to find the links on the results page that went to individual listings, but there were duplicate links, and grabbing them with xpath turned out to be awkward. Enter another tool: BeautifulSoup (install it with pip install beautifulsoup4). It is a tool for searching through documents and finding what you are looking for, and it is considerably easier to use than XPath. To keep the testing simple I picked just one city, Aurora, to continue with. There are two steps to this part: getting the list of links to the listings and parsing data off those pages.

Getting a list of links to listings

Below is how I got the list of urls that went to individual listings:
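Roughly like this, where aurora_url stands in for the search url built for Aurora in the previous step:

    import re
    from bs4 import BeautifulSoup

    # fetch Aurora's search page (pgsz is already set so everything is on one page)
    page = session.get(aurora_url)
    soup = BeautifulSoup(page.text, 'lxml')

    # every listing link contains the word "detail"; cut off the tracking
    # arguments after the "?" and let the set weed out the duplicates
    listing_urls = set()
    for link in soup.find_all(href=re.compile('detail')):
        listing_urls.add(re.match('[^?]*', link.get('href')).group())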

Making urls unique

You will notice a couple of things. First is the regular expression '[^?]*'. What this does is match everything up to the first ? in the string. The website has many copies of the same link with different arguments (probably so they can track which links get clicked on), and I wanted a list of unique urls, so I used a set: when you add an item to a set it checks whether that item is already in the set, and only adds it if it is not. The second thing you'll notice is that I'm searching through the soup object by passing a regular expression as the href parameter of find_all. This is a little involved, so I'll break it down. find_all takes keyword arguments that it reads as a dictionary and uses to filter the tags in the soup. In our case we want all tags with an href attribute (links) where that link contains the word detail. I discovered the format of the links by clicking on links in my browser and looking at the urls: all the listing links had the word detail in them.

On regular expressions

Regular expressions are very powerful for searching, yet they can also be very confusing. For a good primer on how to use them, go to the Python Regular expression operations documentation page and read up. The trick to learning regular expressions is to use them: set up a small program and try parsing some text, because the only real way to learn something like this is to play around. Once you understand the basics you can google for how to do something specific, and you will often find plenty of examples.
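For example, here is the same pattern from the code above applied to a made-up listing path:

    import re

    # '[^?]*' matches everything up to (but not including) the first '?'
    match = re.match('[^?]*', '/some-listing/detail?src=tracking')
    print(match.group())   # prints: /some-listing/detail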

Parsing data out of those links

Once you have the links, the next step is rather trivial. There is much more that could be done here, but for now I chose to just visit each link and grab the address and asking price. Here is the sort of code I used; there isn't really anything new in it:
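The itemprop attributes below are an assumption about how the listing pages mark up the address and price; inspect a real listing page and swap in whatever it actually uses:

    from urllib.parse import urljoin

    results = []
    for url in listing_urls:
        listing_page = session.get(urljoin('http://www.realtor.com', url))
        listing_soup = BeautifulSoup(listing_page.text, 'lxml')

        # NOTE: these selectors are guesses -- check a listing page in the inspector
        address = listing_soup.find(itemprop='streetAddress')
        price = listing_soup.find(itemprop='price')
        if address and price:
            results.append((address.get_text(strip=True), price.get_text(strip=True)))

    for address, price in results:
        print(address, price)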

Next Steps

The next thing I'm going to do with this is gather more information about each of the listings and start building a database locally that I can use with other scripts, without going out to the website each time to gather the same information. Doing this will make it so I'm not being evil with the bandwidth on realtor.com's servers and making it more difficult for them to get the same information to other users. If you want to learn how to do this, copy the code into a text file, run it, and then start playing around with it. Change it to work differently on your system and see what it does. And of course, I am not responsible for anything you do with this new power.

3 Comments

  1. Clinton
    March 21, 2017

    So I am playing around with your code and found that for Aurora, it is only pulling in the first 200 links with the word ‘detail’ in it. Is there a reason it stops at 200 based on your coding, or is something that realtor.com has implemented to limit the number of links you can access?

    1. Clinton
      March 21, 2017

      and also if you play around with it too much you get the error “raise TooManyRedirects(‘Exceeded %s redirects.’ % self.max_redirects, response=resp)
      TooManyRedirects: Exceeded 30 redirects.” Looks like they have found a way to block your requests.

    2. weaveringrally
      March 22, 2017

      I noticed this too. I think they have more things making it difficult to do this. Let me know if you figure out any better ways.

