4 Steps to set up a free online Scraper with RServer and AWS

Feb 27, 2018


When you are regularly scraping websites, you might want to outsource that process. There are clear benefits of using cloud resources for scraping:

  • schedule your scrapers
  • scale the number of scrapers
  • no need for an internet connection on your local machine
  • free up computational resources on your local machine

Although I run only a couple of scrapers, the major advantage for me is that I can schedule them and do not need to worry about whether my laptop is connected to Wi-Fi when I’m not at home.

This article describes the architecture and steps to set up a free remote scraper using RServer (RStudio Server) and AWS. I will not go into too much detail but rather explain the concept behind it.

Also works with Python and on DigitalOcean

If you prefer Python with BeautifulSoup, or DigitalOcean instead of AWS, you can build a similar setup. The architecture would be the same, and the necessary steps are very similar.

Step 1: Create an AWS account

First, head over to Amazon Web Services and sign up for a free account. If you are signing up for the first time, you get a 12-month trial period with free access to cloud resources. This is also referred to as free tier usage.

These free resources do not include much processing power, but they are more than sufficient for our purposes.

Step 2: Install RServer on an EC2 instance

Next, create an EC2 instance and install RServer on it. There are great YouTube tutorials on how to do this, and it actually takes less than 10 minutes. Check out the two by Manuel using Ubuntu or CentOS. Both tutorials are great and differ only in the operating system of the EC2 instance.

With CentOS, you do not need to use any terminal commands for the installation. However, you might want to choose Ubuntu, as there are more help resources and tutorials available if you want to expand and configure your instance later.

Step 3: Install rvest and your favorite R packages

Now, you can log into your RServer with the IP address of your EC2 instance. Check out either tutorial from the previous step if you don’t know how.

In RServer, you have to install all packages you need for your script. To scrape a website, you will most likely use rvest. Additionally, I installed all packages from tidyverse to clean and pre-process the scraping results.
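As a minimal sketch of this step, the following R snippet installs only the packages that are not yet present. The package vector and the CRAN mirror URL are assumptions; adjust both to your own scripts and location.

```r
# Packages used in this article; extend the vector with whatever your scripts need.
required <- c("rvest", "tidyverse")

# Install only what is not already in the library, so re-running this is cheap.
to_install <- setdiff(required, rownames(installed.packages()))
if (length(to_install) > 0) {
  # The mirror is an assumption; any CRAN mirror works.
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```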

Note: Before you can install rvest, you might need to install the OpenSSL and libxml2 development libraries first. To do this, log into your instance from the terminal and install both packages:

# On Ubuntu (the official Ubuntu AMIs use the login user "ubuntu"; adjust if your AMI differs):
ssh -i ".ssh/rstudio.pem" ubuntu@<server>.compute.amazonaws.com
sudo apt-get update
sudo apt-get install libssl-dev
sudo apt-get install libxml2-dev

# On CentOS (the login user depends on the AMI, e.g. centos or ec2-user):
ssh -i ".ssh/rstudio.pem" ec2-user@<server>.compute.amazonaws.com
sudo yum install openssl-devel
sudo yum install libxml2-devel

Afterwards, you should be able to install rvest successfully and scrape webpages.
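To check that rvest works, a small scraper could look like the sketch below. It assumes rvest 1.0 or newer (for html_elements() and html_text2()). The function name extract_headlines, the h2 selector, and the inline HTML are made up for illustration; the inline page only stands in for a real website so the example runs offline.

```r
library(rvest)

# Collect the text of all nodes matching a CSS selector into a data frame,
# stamped with the time of the scrape.
extract_headlines <- function(page, css = "h2") {
  nodes <- html_elements(page, css)
  data.frame(
    headline = html_text2(nodes),
    scraped_at = rep(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), length(nodes)),
    stringsAsFactors = FALSE
  )
}

# Inline HTML stands in for a live page. For a real site, pass a URL instead,
# e.g. read_html("https://example.com/news") (hypothetical URL).
page <- read_html(
  "<html><body><h2>First story</h2><h2>Second story</h2></body></html>"
)

results <- extract_headlines(page)
print(results)
```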

Step 4: Install cronR on RServer

So far, you are able to scrape websites from your AWS instance. But to leverage the advantages of your cloud instance, install cronR. The cronR package allows you to schedule your scripts and scrapers as cron jobs.
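Scheduling with cronR might look roughly like this. The helper schedule_scraper, the job id, and the 03:00 run time are illustrative assumptions, and the sketch assumes the script file already exists on the instance.

```r
library(cronR)

# Register an R script as a daily cron job.
# `script` must be the path to an existing .R file on the instance.
schedule_scraper <- function(script, at = "03:00", id = "daily_scraper",
                             dry_run = FALSE) {
  # cron_rscript() builds the Rscript shell command; by default the output
  # is logged to a .log file next to the script.
  cmd <- cron_rscript(script)
  cron_add(
    command = cmd,
    frequency = "daily",
    at = at,
    id = id,
    description = "Run scraper once a day",
    dry_run = dry_run  # TRUE only prints the crontab entry without saving it
  )
}
```

A call such as schedule_scraper("/home/ubuntu/scrapers/scrape_prices.R") (a hypothetical path) would then add the job. cron_ls() lists the scheduled jobs and cron_rm("daily_scraper") removes this one again.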

That’s it! Now you can scrape websites with R autonomously using an AWS instance.

Just upload your scripts to RServer and schedule them with cronR. In addition, you can connect more services to enhance your workflow:

Optional: Connect RServer with GitHub

If you are already using GitHub, this step might seem natural to you. If you are not using GitHub, start considering it. I keep all my scrapers in a private repository and synchronize it with RServer and my local machine.

This way, I make sure that I’m always working with the most recent version of each script, which makes it easier to maintain my scrapers.
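One way to keep the copy on the instance current is to pull the repository from R before a scraper runs. The sketch below is one possible approach, not necessarily the setup used here. It assumes git is installed on the instance and the repository has already been cloned; sync_repo is a hypothetical helper name.

```r
# Pull the latest version of the scraper repository and stop if git fails.
sync_repo <- function(repo_dir) {
  # Expand "~" here because the quoted path is not expanded by the shell.
  repo_dir <- path.expand(repo_dir)
  status <- system2("git", c("-C", shQuote(repo_dir), "pull", "--ff-only"))
  if (status != 0) {
    stop("git pull failed in ", repo_dir, " (exit status ", status, ")")
  }
  invisible(status)
}
```

Calling sync_repo("~/scrapers") (a hypothetical directory) at the top of a scheduled script would update the scrapers before they run. The --ff-only flag makes the pull fail instead of creating a merge commit if the copy on the server has diverged.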

Optional: Add a database to store your results

Finally, you can set up a database that stores the results of your scraper. With your AWS free tier usage, you can set up a MySQL, PostgreSQL, MariaDB, or Oracle database.

I use a MySQL database to store my scraper results. Every time a scraper finishes, the results are added to the database. This way, I push my results to permanent storage that is unaffected if I stop or terminate the EC2 instance.
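Appending each run to a table can be sketched with the DBI package. To keep the example runnable anywhere, an in-memory SQLite database (via RSQLite) stands in for MySQL; the table name, column names, and connection details in the comment are placeholders, not the actual setup.

```r
library(DBI)

# Append one scraper run to the results table.
# With append = TRUE, dbWriteTable() creates the table on the first run.
store_results <- function(con, results, table = "scraper_results") {
  dbWriteTable(con, table, results, append = TRUE)
  invisible(nrow(results))
}

# In-memory SQLite stands in for the MySQL database here.
# For MySQL, a connection could look like this (all values are placeholders):
#   con <- dbConnect(RMariaDB::MariaDB(),
#                    host = Sys.getenv("DB_HOST"),
#                    username = Sys.getenv("DB_USER"),
#                    password = Sys.getenv("DB_PASSWORD"),
#                    dbname = "scraping")
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Two made-up runs to show that results accumulate across runs.
run1 <- data.frame(headline = "First story",
                   scraped_at = "2018-02-27 03:00:00",
                   stringsAsFactors = FALSE)
run2 <- data.frame(headline = "Second story",
                   scraped_at = "2018-02-28 03:00:00",
                   stringsAsFactors = FALSE)
store_results(con, run1)
store_results(con, run2)

stored <- dbReadTable(con, "scraper_results")
print(stored)
dbDisconnect(con)
```

Reading credentials from environment variables, as in the commented connection, keeps passwords out of scripts that are synchronized through a repository.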

Another benefit is that other services can directly access the most recent entries. For example, dashboards and visualizations in Google Data Studio, Tableau, Plotly, etc. are always up to date. And I can, of course, also access the database from my local machine with programs like Sequel Pro.


This article has also been published at HackerNoon (Medium.com).