How to automate Google Search Console data downloads with R

google search console r automation
April 20, 2020 in Google Search Console

If you use the Google Search Console, you probably know the performance report.

google search console performance report

It’s a useful report for SEO analysis because it contains many data about search queries and website pages but it has 3 major limits:

  • the data shown are sampled
  • you have to use the API if you want to have queries and pages in the same dataset
  • the report contains only the data for the last 16 months

In this post, I want to show you how I solved and automated the last two points using R and PostgreSQL database.

I know that there are multiple ways to do that, but I choose this one because I think right now it’s perfect for my needs and it’s easy enough for an R Programming beginner like myself.

The idea behind this project

Before R and PostgreSQL, I was automatically downloading the Google Search Console data using a Google Sheets add-on, but I wasn’t really happy because:

  • I wasn’t able to get all my websites’ data into one single table
  • Google Sheets (or Excel) is not that scalable, it’s really slow when you have to archive hundreds of thousands of rows

That’s why I decided to try something different.

My idea is to use searchConsoleR to download the data, cronR to schedule my R script and RPostgreSQL to upload the output of the script into a PostgreSQL, an open-source relational database. I’m also going to show you how you can connect the database with Google Data Studio.

If you are not familiar with Google Search Console and R I suggest you to read my searchConsoleR guide.

Install & Setup searchConsoleR

First of all, you have to install & load searchConsoleR in your R IDE.

Then you have to connect Google Search Console with R using the scr_auth(). You’ll be redirected to your browser and you’ll have to log in to your Google account.

You can get the list of all your Google Search Console properties using list_websites().

Then you have to subset the data frame and “delete” the permissionLevel column. Below you can find two different options, I used the second one because I wanted to get the data only for the websites that I own.

For reporting reasons, I decided to have only one database table containing all my websites data and that’s why I use a for loop. But you can also have one table per website.

First, you have to create an empty list where the script will store all the data. Then you have the for loop which iterates over the websites list.

Inside the loop, search_analytics() connect R to the Google Search Console API and download the data.

In the start and end arguments, you should use Sys.Date() – XX to automatically update the dates of the export (Sys.Date() return the date of today).

The last line of code inside the loop merges the different lists (the downloads) into one data frame.

RPostgreSQL: connect R with your PostgreSQL database

In this step, you have to create the connection between R and your database using the RPostgreSQL package.

In dbConnect() you have to add the username, password, host, port & name of your PostgreSQL database. If you don’t know where to find the information I suggest you reach out to your hosting provider.

Lastly, you have to create a table in your database (or get the name of an existing table) and upload the first data you downloaded before.

For example, I decided to upload the data for the last 16 months because from the next month the script will do that for me.

Schedule the R script using cronR

There are multiple ways to schedule an R script, I decided to manage the automation locally (on my computer) and not on a server because it’s easier to set up (at least for myself).

Load cronR package and then go to Tools > Addins > Browse Addins…

cronr setup step1

Here you have to select cronR and click on Execute.

cronr setup step2

Now you have to create the job – select your R Script location and set launch date, launch hour and schedule. Click on Create job.

As I said previously, the script I run with the cronR job downloads only the data for the last month. Before I set up the job, I manually ran the script once to get the data of the last 16 months.

Which script should cronR execute?

As I said previously, the script I run with the cronR job downloads only the data for the last month. Before I set up the job, I manually ran the script once to get the data of the last 16 months. Below, you can find the R script I run with the cronR job.

Connect Google Data Studio and PostgreSQL

You can also connect your database to Google Data Studio. Select the PostgreSQL connector and enter your database credentials and location information.

google search console postgresql

Run an SQL query with RPostgreSQL

If you want, you can also query your database directly in R using dbGetQuery().

In my example, I stored the result in a data frame called query_result.

Write a Comment

Comment

  1. Hi Ruben! Thanks for this post. I like your approach and I think this is really helpful, hope that many people will find out about this post and try R for SEO to enhance their analysis 🙂

    I have dealt with downloading data from Search Console since 2016, so I hope you don’t mind if I suggest some small possible tweaks to your solution that may enhance it 🙂 please don’t assume your code is wrong, it’s good and it’s working, you did a great job and I hope a lot of people will use it 🙂 I just want to suggest some small tweaks.

    First, I would recommend not using large date ranges as it affects the data if you query GSC API for a large project (a lot of traffic). For small or medium projects the differences are negligible, but in case of large websites with thousands or tens of thousands of clicks daily the data will become incomplete. My example: querying 60 days at once returns about 15k rows per a date, but querying the exact same date range by single days returns over 25k rows per date. I suppose the reason for this behaviour is the sampling and the row limit. What I would recommend is to loop over date range and query for each day independently (I provide a code example below).

    Second, your loop creates a list and each project’s data is a data frame inside. Later you call do.call(rbind, exports_list). It is working but you can enhance it because: 1. you’re growing a list in a way that is very slow and not very efficient at using resources (it can eat your memory up); 2. you don’t really need to keep it as a list – you seem to only want a data frame and the list is just an intermediary; 3. you can speed it up significantly by not using the for loop and do.call. Instead you use purrr::map_dfr or lapply (example below).

    The best way I’ve found out so far to deal with both those things at once is to loop on the date range and always query a single day with map_dfr. The downside of this approach is increase in API calls but I try to control it with Sys.sleep(). Something like this:

    # declare a function that creates a date range:
    gsc_date_range <- function(start, end) {
    seq.Date(from = as.Date(start),
    to = as.Date(end),
    by = "day")
    }

    # declare a function that downloads GSC data from a single date with chosen dimensions
    gsc_download <- function(single_date, url, dim1, dim2 = NULL) {
    gsc_data <- searchConsoleR::search_analytics(
    siteURL = url,
    startDate = single_date,
    endDate = single_date,
    dimensions = c("date",
    "device",
    dim1,
    dim2),
    walk_data = "byBatch",
    searchType = "web",
    rowLimit = 100000)
    Sys.sleep(3) # we're going to make a lot of API calls, so let's not annoy the API limits, 3 seconds of waiting between the calls seems pretty reasonable
    return(gsc_data)
    }

    # download the data
    df_page_query <- purrr::map_dfr(
    .x = gsc_date_range(
    start = Sys.Date() – 93,
    end = Sys.Date() – 3),
    .f = gsc_download,
    url = site,
    dim1 = "page",
    dim2 = "query")

    Hope you will find this helpful 🙂

    Let me repeat one more time – good job, keep it up!
    Best,
    Leszek

    • Thanks a lot Leszek! Your comment it’s really valuable and I’ll definitely update the post.

      1. I totally agree with you! I also did some tests when I was downloading the data using Google Sheets. I wrote this code for a bunch of small websites but I already planned to change it for my large websites (my idea is also to use Google BigQuery instead of PostgreSQL)
      2. Really interesting! I use a for loop simply because I’m currently studying R with Datacamp and I haven’t reached yet the course about the purrr package. But as I said, I’ll update my code once I’ll know more about this package.

      Best,
      Ruben

      • Glad to hear Ruben 🙂

        I also used DataCamp. Basically map() and map_dfr() are lapply() on steroids. They are hard to wrap your head around at first but come really handy later!

        By the way, I was chatting with my colleague today about downloading GSC data and shared your blog post. I also provided them with a GitHub gist with code and explanations for my approach, so I will take my freedom to share it with you 🙂 it is more readable there than in my comment. You can find it here: https://gist.github.com/Leszek-Sieminski/e8bb1f930ee70bd934e1d34a4295c910

        Once again, great work, can’t wait to see what else you will create!
        Best,
        Leszek