WebScrape - Data extraction from a public API using Python

Saulo Toledo Pereira
17 min read · Jun 23, 2022

Github project here.

Hi buddies, this post is the first part of a project that I am doing to increase my knowledge about getting and manipulating data. The idea came from a job interview that I did on my path to becoming a data scientist.

In the selection process, I was given a case problem that I had to solve and present the results for. I spent the weekend solving this case (the process started on Friday and the solution had to be delivered on Monday); perhaps due to my limited experience, it was very hard to do everything in that time. It was my first selection process for this area, so I tried as hard as I could because I really wanted to pass (spoiler alert: I didn't pass haha). The part I didn't do so well was the presentation of the solution, or the "deploy", where the recruiter was expecting something more "professional". With the little time I had, I presented a PowerPoint that was even praised, but it was not what they were expecting. I learned a lot doing this case, and from it I decided that one day I would do something similar.

I remember him talking a lot about how they used the R language for data manipulation and published to the intranet with Shiny, interactive graphics, and so on (that's what he expected from the presentation). So I set a personal goal to create a project in which I would do the whole process, from data extraction to presentation, but this time not in PowerPoint: in Power BI, in Google Data Studio, and also in a dashboard published on the web with Python libraries (it's not Shiny yet, but I believe they are similar hehe).

The steps of this project will be:

1 - WebScrape — Data extraction from a public API using python

Taking advantage of the fact that we are in an election year, I decided to get public data on the current deputies and their spending of public money from the beginning of the pandemic until today (2020–2022).
Keywords: Python, webscrape, API.

2 - Cloud — Autonomous database integration with python

In this step, we will create an Oracle Cloud Infrastructure (OCI) account, where we will be able to create an autonomous database and store the extracted data using python and database tools.
Keywords: Python, Oracle Cloud, Database, SQL, Alwaysfree, Dbeaver.
Article here.

3 - Power BI Desktop integration to Oracle Autonomous Database

In the Power BI article, I will show how to connect our database from Oracle Cloud to Power BI and create a simple dashboard using some calculations and metrics to present the data.
Keywords: Power BI, Oracle DB connection, metrics, dashboard.
Article here.

4 - Google Data Studio

In the Google Data Studio article, we will try to replicate the dashboard already made in Power BI, but this time we will publish it online at the end, since Google allows us to do this for free, unlike Microsoft Power BI, which only makes this option available on paid plans.
Keywords: Google Data Studio, dashboard, publish online.
Article soon.

5 - Python dashboard

This article is where I will go beyond the limits of my knowledge. I will show you, step by step, how to create a dashboard using the knowledge I have in Python, HTML, and CSS. We will upload this dashboard to a server created in Oracle Cloud, so it will be online and anyone can access it via browser.
Keywords: Python, Dash, Cloud server.
Article soon.

Github project here.

In this section, I want to show step by step how to get some data from a public API using the following checklist:

1- Deputies IDs
2 - Last page number
3 - Pagination URLs to get all data
4 - Change ID on URLs to get data from all deputies
5 - Change the year on URLs to get data from all years that we want
6 - Save all data that we got
7 - We finished ♪┏(・o・)┛♪┗ ( ・o・) ┓♪

Alright, let’s get started.

1 - Deputies IDs

Accessing the URL:

https://dadosabertos.camara.leg.br/swagger/api.html

We have all the API documentation for the data we want. On this page, we search for the deputies' expenses ("despesas" in Portuguese):

When we click on the expenses tab, we can see that to obtain the data, we must have the deputy’s “ID” as shown in the documentation.

In this case, we want the expense data of all the deputies, so we click on the deputies tab, because there we can find the URL for requests that return information about the deputies, including the "ID" of each one of them.

When we click on the ‘deputados’ tab we get the following:

1- A brief description of what the request will return to us
2- Button to ‘test’ the API and see what information it will return to us.

After clicking on 'Try it out' and scrolling down the page, you can see a button for execution:

Clicking on ‘Execute’, the API returns the following:

1- URL used to request the data
2 - Status code returned by the request; 200 in this case, which means that the request was successful.
3 - The data we need.

Great, now that we have the URL to make our requests we can start our Python code. In this first stage of the project, we will use only two libraries: Pandas and Requests. If you don’t have them installed, type in your terminal:

pip install pandas
pip install requests

In case you have any difficulties, here are the official pages with the full documentation and installation instructions.

With the libraries installed, we can start:

I'll put the code and then explain each line step by step.

First, let’s import the libraries that we will use:

import requests
import pandas as pd

To make the requests, we will need the URL that we saw in the previous step (item 1).

https://dadosabertos.camara.leg.br/api/v2/deputados?ordem=ASC&ordenarPor=nome

Now that we have the URL, we can make the request. For that we will use the requests library imported at the beginning of the code:

url = 'https://dadosabertos.camara.leg.br/api/v2/deputados?ordem=ASC&ordenarPor=nome'
request = requests.get(url)

To know if everything went well we will show the server response code:

print(request.status_code)

With everything right, we will use the pandas library to assemble a data frame with the data ('dados') returned in JSON format:

df_dep_info = pd.DataFrame(request.json()['dados'])

And finally, our result:

1- Response code (200 = ok = success! = uhuuu!)
2 - DataFrame created with the data that we got
3 - Number of rows and columns

With this request, we have already obtained information from all the deputies, including the ‘id’ of each one of them. With this ‘id’ we will be able to make the requests to also get the data about the expenses.
Now, back to the documentation, let's use the first ID that we got in the previous request for our tests:

1- In the 'id' field, we put the first ID that we collected in our first request: 204554;
2 - In the 'year' field, we initially want to get data from the beginning of the pandemic, so: 2020;
3 - In the 'page' field, we want the first page of expenses, so in this case we will put: 1;
4 - In the 'items' field, we can choose how many items the request will return per page; according to the documentation, if we leave the field blank, by default we will have 15 items per page. In this case, we will choose to return 100 items in each request.

After adjusting the parameters, click on 'Execute' and we have as a result:

This is the initial URL for us to make the expense requests:

https://dadosabertos.camara.leg.br/api/v2/deputados/204554/despesas?ano=2020&pagina=1&itens=100&ordem=ASC&ordenarPor=ano

Let’s test it now using our python code:
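A minimal version of that request could look like this (reusing the structure from before and naming the result df_expenses):

import requests
import pandas as pd

# expenses of the deputy with ID 204554, year 2020, page 1, 100 items per page
url = 'https://dadosabertos.camara.leg.br/api/v2/deputados/204554/despesas?ano=2020&pagina=1&itens=100&ordem=ASC&ordenarPor=ano'
request = requests.get(url)
print(request.status_code)  # 200 means the request was successful

# build a data frame from the 'dados' field of the JSON response
df_expenses = pd.DataFrame(request.json()['dados'])
print(df_expenses.shape)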

As a result, we have the following output:

We now have a data frame with 100 expense items for the deputy with ID 204554, in the year 2020. Excellent!

2 - Last page number

But how do we know we got all the data? We started at page 1, but what if there are more pages? How do we get the data from them too?
Well, let’s analyze the request URL:

In this URL we have the following parameters that we can manipulate to make the requests:

1- ID
2 - Years
3 - Page
4 - Items

So if there are more pages, we just change the ‘page’ parameter and make another request, but we’ll see that later.

Clicking on 'Execute', the request returns the following:

As a result, we have a dictionary with 100 items of expense data for the deputy with ID 204554, in the year 2020.

After showing the 100 items, we also received the following URLs:

1- The current URL
2 - The URL of the next page (if any)
3 - The URL of the first page
4 - The URL of the last page.

These last URLs are very important for us to get all the information about the expenses. The URL of the last page tells us how many times our code will need to be repeated; in this case:

We know that for the deputy with ID 204554, in the year 2020, we will have 5 pages of data to scrape.
Now that we have this information, we know that we will have to paginate the requests. We will first write a simple function whose only job is to find out which is the last page. With this information, we can use the following logic:

"Python, go to the first URL and get the value of the last page and call it as 'last_page';
Now get the all expense information from the first page and repeat this procedure changing the parameter 'page' of the URL from 1 to the value of 'last_page'".

Too confused?
Let’s go to the Code:
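A sketch of the first part of that code, assuming the 'links' field of the JSON response is loaded into a small data frame called links (that's what we select from next):

url = 'https://dadosabertos.camara.leg.br/api/v2/deputados/204554/despesas?ano=2020&pagina=1&itens=100&ordem=ASC&ordenarPor=ano'
request = requests.get(url)

# one row per pagination URL: self, next, first, last
links = pd.DataFrame(request.json()['links'])
print(links)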

This code returns the URLs already mentioned above; in this case, we will only use the link to the last page. So we need to select the 'href' column and then index 3, which represents our last-page URL:

last_page = links['href'][3]

As a result, we have the URL from the last page:

'https://dadosabertos.camara.leg.br/api/v2/deputados/204554/despesas?ano=2020&ordem=ASC&ordenarPor=ano&pagina=5&itens=100'

Now we have to isolate the page number in a variable, and to do this we will use regular expressions. The 're' module is part of Python's standard library, so there is nothing to install; we just import it:

import re

To isolate the last page number, we can select all the numbers in the URL using the following code:

last_page_number = re.findall('[0-9]+', last_page)

In the 6th line, we are using the 'findall' function from the 're' module. This function finds everything that matches the pattern you ask for; in this case, we asked for all groups of digits from 0 to 9 in 'last_page'.

As output, we have the following list:

Now our variable 'last_page_number' is a list with all the numbers from the URL (as strings):

last_page_number[0] : '2'
last_page_number[1] : '204554'
last_page_number[2] : '2020'
last_page_number[3] : '5'
last_page_number[4] : '100'

We know that the URL is standardized, so the parameters used will always be in the same order. In this case, we know that the first number in the list ('2') comes from the body of the URL (the 'v2' in /api/v2/):

https://dadosabertos.camara.leg.br/api/v2/deputados/

Right after we have the ID number of the deputy in question:
204554

After that the year:
2020

Our number of interest at the moment which is the last page number:
5

And finally the number of items that the page is showing:
100

Now we can return our last page number and our function is finished!
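Putting these steps together, a sketch of the whole function could look like this (this is my reconstruction of the steps above, not necessarily identical to the code in the repository):

import re
import requests
import pandas as pd

def get_last_page(url):
    # request the first page and read the pagination links
    request = requests.get(url)
    links = pd.DataFrame(request.json()['links'])
    # URL of the last page (index 3 = 'last', as we saw above)
    last_page = links['href'][3]
    # grab every number in that URL; index 3 is the 'pagina' value
    last_page_number = re.findall('[0-9]+', last_page)
    return last_page_number[3]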

Pause for breath o(* ̄o ̄)o

Recapping:

We want to get public expenses data from Brazilian deputies from 2020 to 2022. What do we need for this?

1- Deputies IDs ✔
2 - Last page number ✔
3 - Pagination URLs to get all data
4 - Change ID on URLs to get data from all deputies
5 - Change the year on URLs to get data from all years that we want
6 - Save all data that we got
7- We finished ♪┏(・o・)┛♪┗ ( ・o・) ┓♪

3 - Pagination URLs to get all data

Let’s study our URL again:

https://dadosabertos.camara.leg.br/api/v2/deputados/204554/despesas?ano=2020&pagina=1&itens=100&ordem=ASC&ordenarPor=ano

We can count three variables in bold:
1- Deputy ID
2- Year
3- Page

So far we’ve only made requests for a static URL, let’s add our first variable.

Our code so far should look like this:

Our changes take place from line 24. We create an integer variable with the value 1 that will be used to access the first page of the API:

page = 1

In order to use it in the URL, we will convert it to a string:

url = 'https://dadosabertos.camara.leg.br/api/v2/deputados/204554/despesas?ano=2020&pagina=' + str(page) + '&itens=100&ordem=ASC&ordenarPor=ano'

On the next line (line 26), we use our function with the initial URL as a parameter to get the last page number:

last_page = get_last_page(url)

As we checked before, we already know the steps that the function will take, and at the end the function will return the value 5 as a string. So last_page = ‘5’

An important step is to create an empty list that will store the data from the first page, then the second, and so on:

expenses_list = []

For pagination, we will use a simple loop from 1 to 'last_page' (as we already know, last_page = '5'):

for page in range(1, int(last_page)+1):

Converting our variable value from string to integer, we have a loop that will be repeated 5 times:
page = 1
page = 2
page = 3
page = 4
page = 5

Python's range goes from the start value, 1 in our case, up to but not including the end value. So:
- Start: 1
- End: 4 (last_page = 5)

Without any adjustment, this loop would be repeated only 4 times, so we need to add +1 at the end. Thus, the loop will be repeated 5 times.
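A quick check in the Python interpreter shows the difference:

print(list(range(1, 5)))      # [1, 2, 3, 4]    -> only 4 iterations
print(list(range(1, 5 + 1)))  # [1, 2, 3, 4, 5] -> the 5 pages we want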

In line 36 we’re taking the data and putting it in a list.

expenses_list.append(df_expenses)

After page 1, our loop will go through all the pages and save all the data we want in our 'expenses_list'.

df_expenses = pd.concat(expenses_list)

The line above transforms our list of page data frames into a single data frame. In the end, we have the following output:

A data frame with all the expense data for the deputy with ID 204554 in the year 2020. (◠﹏◠)
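Putting the pagination pieces together, a minimal sketch of this step could look like this (assuming the get_last_page function from before):

page = 1
url = 'https://dadosabertos.camara.leg.br/api/v2/deputados/204554/despesas?ano=2020&pagina=' + str(page) + '&itens=100&ordem=ASC&ordenarPor=ano'
last_page = get_last_page(url)  # '5' for this deputy and year

expenses_list = []
for page in range(1, int(last_page) + 1):
    # rebuild the URL for the current page and request it
    url = 'https://dadosabertos.camara.leg.br/api/v2/deputados/204554/despesas?ano=2020&pagina=' + str(page) + '&itens=100&ordem=ASC&ordenarPor=ano'
    request = requests.get(url)
    df_expenses = pd.DataFrame(request.json()['dados'])
    expenses_list.append(df_expenses)

# one data frame with all pages for this deputy and year
df_expenses = pd.concat(expenses_list)
print(df_expenses.shape)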

For now, our checklist looks like this:
1- Deputies IDs ✔
2 - Last page number ✔
3 - Pagination URLs to get all data ✔
4 - Change ID on URLs to get data from all deputies
5 - Change the year on URLs to get data from all years that we want
6 - Save all data that we got
7 - We finished ♪┏(・o・)┛♪┗ ( ・o・) ┓♪

Let’s add another variable to our URL, the ids:

4 - Change ID on URLs to get data from all deputies

We already have a data frame called df_dep_info with some information about the deputies, including an entire column with the ID of each one:

To run through a column of a data frame, we can use the following code:

for x in DataFrame['column name']:

So we have:

After some changes, our code should look like this:

For now, we have a loop inside a loop *_*. For each ID, our script makes a request, gets all the data through pagination, and saves it in the DataFrame df_expenses; then it does the same with the next ID until the list of IDs ends.
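A sketch of this nested loop, reusing the pieces we already have (the outer loop runs over the 'id' column of df_dep_info), could be:

expenses_list = []

for id in df_dep_info['id']:
    # first URL for this deputy, used only to discover the last page
    url = 'https://dadosabertos.camara.leg.br/api/v2/deputados/' + str(id) + '/despesas?ano=2020&pagina=1&itens=100&ordem=ASC&ordenarPor=ano'
    last_page = get_last_page(url)

    # paginate through all of this deputy's expenses
    for page in range(1, int(last_page) + 1):
        url = 'https://dadosabertos.camara.leg.br/api/v2/deputados/' + str(id) + '/despesas?ano=2020&pagina=' + str(page) + '&itens=100&ordem=ASC&ordenarPor=ano'
        request = requests.get(url)
        expenses_list.append(pd.DataFrame(request.json()['dados']))

df_expenses = pd.concat(expenses_list)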

Running this code (it may take a while depending on your computer; for reference, it took 10 minutes here), we have the following output:

Now we have a DataFrame with all information about expenses (170 thousand rows O_O), from all deputies in 2020.

But we have a little problem: we have all the data, but we don't know exactly who it belongs to, where one deputy's data ends, where another's begins, and so on. We have to add new columns with information about the deputy who had each expense.

Let’s understand how we are getting and storing our data:

That is our data frame df_dep_info;
For each ID from this data frame, we get a variable number of rows of data and save them in another data frame, df_expenses:

This image is just an example, the data are not real

In our data frame df_expenses, we don't have the ID column (it appears here just for a better understanding of how things are working).
I was thinking about how to do this for a while and I came up with the following logic:
We can create a list for each piece of data that we want to add to the final data frame. Let's see the schema:

First, let’s create an empty list to store the new information;

id_list = []

1- Then we will make the request using the first ID (100 in this case);

2- Now we collect the data and count how many rows appear on the page (3 rows, for example);

3- Now we take the ID we used to make the request and append it to our new list as many times as the number of rows we just counted;

After repeating this until the data runs out, we have the data frame with the expenses and a list whose ID entries match precisely the rows of the expenses data frame; we just join them and we have the result:

We can do this with all data we want:

Ok, but how can we do this in python? Let’s code:

In our case, we want more data besides the ID and name, so let's start by creating the lists to store each field (line 23):

id_list = []
name_list = []
partido_list = []
uf_list = []
urlFoto_list = []
email_list = []

To select the right deputy's data, we have to create a counter that tells us exactly the row that we want (line 33):

count = 0

And we are also going to increment it each time the script changes the deputy ID (line 40):

count += 1

After getting data, we have to count how many rows of data we have (line 51):

count_rows = len(r2.json()['dados'])

And finally, let’s create another loop to add the data to our lists (line 53) :

for i in range(count_rows):
    id_list.append(id)
    name_list.append(df_dep_info['nome'][count-1])
    partido_list.append(df_dep_info['siglaPartido'][count-1])
    uf_list.append(df_dep_info['siglaUf'][count-1])
    urlFoto_list.append(df_dep_info['urlFoto'][count-1])
    email_list.append(df_dep_info['email'][count-1])

But how do things work here?

We have a variable that starts from 0:
count = 0

each time the script changes the ID, our variable is incremented:
count += 1

After getting expense data, we count how many rows we got:
count_rows = len(r2.json()['dados'])

Now, our loop will repeat, inserting the data the same number of times as the rows counted in the previous step:
for i in range(count_rows):

And finally, our script does the following:

To get a specific value from a data frame, we have to use the following syntax:

DataFrame['column'][index]

Thus, to insert the first name in our empty list we have to:

name_list.append(df_dep_info['nome'][count-1])

This line gets the name from the row at index count-1 (index 0 for the first deputy) and adds it to name_list; due to the loop, this step is repeated as many times as the value of count_rows.

When we have all the data in the lists, we just need to put it all together:
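One way to do this is to add each list as a new column of df_expenses; the lists have exactly one entry per expense row, so everything lines up (the column names below mirror the ones in df_dep_info):

df_expenses['id'] = id_list
df_expenses['nome'] = name_list
df_expenses['siglaPartido'] = partido_list
df_expenses['siglaUf'] = uf_list
df_expenses['urlFoto'] = urlFoto_list
df_expenses['email'] = email_list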

Now we have all the data from all deputies for 2020 in a single DataFrame; all we have to do now is create another loop that will change the year.

5 - Change the year on URLs to get data from all years

For now, our checklist looks like this:
1- Deputies IDs ✔
2 - Last page number ✔
3 - Pagination URLs to get all data ✔
4 - Change ID on URLs to get data from all deputies ✔
5 - Change the year on URLs to get data from all years that we want
6 - Save all data that we got
7 - We finished ♪┏(・o・)┛♪┗ ( ・o・) ┓♪

Let's finish our code by creating the last loop and adding another variable to our URL:

for year in range(2020, 2022+1):
    url = 'https://dadosabertos.camara.leg.br/api/v2/deputados/' + str(id) + '/despesas?ano=' + str(year) + '&pagina=' + str(page) + '&itens=100&ordem=ASC&ordenarPor=ano'

With this, our code will get all data from all deputies, passing through 2020, 2021, and 2022. Let's see what our code looks like:

As output:

364979 rows ヽ(´▽`)/

Now we have all the data from all deputies from 2020 until now.

♪┏(・o・)┛♪┗ ( ・o・) ┓♪

Now we need to export this data to a CSV and we’re done!

6 - Save all data that we got

Our checklist:
1- Deputies IDs ✔
2 - Last page number ✔
3 - Pagination URLs to get all data ✔
4 - Change ID on URLs to get data from all deputies ✔
5 - Change the year on URLs to get data from all years that we want ✔
6 - Save all data that we got
7 - We finished ♪┏(・o・)┛♪┗ ( ・o・) ┓♪

Let’s organize the columns and export them to a CSV file:
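One possible way to reorder them, putting the deputy-identification columns first (assuming the new columns were named as in the previous step):

dep_cols = ['id', 'nome', 'siglaPartido', 'siglaUf', 'urlFoto', 'email']
other_cols = [c for c in df_expenses.columns if c not in dep_cols]
df_expenses = df_expenses[dep_cols + other_cols]
print(df_expenses.head())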

Output:

Now just export:
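A single line of pandas is enough here (the file name is just an example):

df_expenses.to_csv('deputies_expenses_2020_2022.csv', index=False)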

Our checklist:
1- Deputies IDs ✔
2 - Last page number ✔
3 - Pagination URLs to get all data ✔
4 - Change ID on URLs to get data from all deputies ✔
5 - Change the year on URLs to get data from all years that we want ✔
6 - Save all data that we got ✔
7 - We finished ♪┏(・o・)┛♪┗ ( ・o・) ┓♪

7- We finished ♪┏(・o・)┛♪┗ ( ・o・) ┓♪

Complete code:
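The complete code is in the GitHub project; a consolidated sketch of the whole flow, based on the snippets above, could look like this (the loop order and some variable names are my own arrangement):

import re
import requests
import pandas as pd


def get_last_page(url):
    # read the pagination links and return the last page number as a string
    request = requests.get(url)
    links = pd.DataFrame(request.json()['links'])
    last_page = links['href'][3]
    return re.findall('[0-9]+', last_page)[3]


# 1 - deputies' information (including the IDs)
url = 'https://dadosabertos.camara.leg.br/api/v2/deputados?ordem=ASC&ordenarPor=nome'
request = requests.get(url)
df_dep_info = pd.DataFrame(request.json()['dados'])

# lists that will receive the deputy information, one entry per expense row
expenses_list = []
id_list, name_list, partido_list = [], [], []
uf_list, urlFoto_list, email_list = [], [], []

count = 0  # row of df_dep_info for the deputy being processed

for id in df_dep_info['id']:
    count += 1
    for year in range(2020, 2022 + 1):
        url = 'https://dadosabertos.camara.leg.br/api/v2/deputados/' + str(id) + '/despesas?ano=' + str(year) + '&pagina=1&itens=100&ordem=ASC&ordenarPor=ano'
        last_page = get_last_page(url)

        for page in range(1, int(last_page) + 1):
            url = 'https://dadosabertos.camara.leg.br/api/v2/deputados/' + str(id) + '/despesas?ano=' + str(year) + '&pagina=' + str(page) + '&itens=100&ordem=ASC&ordenarPor=ano'
            r2 = requests.get(url)
            data = r2.json()['dados']
            expenses_list.append(pd.DataFrame(data))

            # repeat the deputy information once per expense row on this page
            count_rows = len(data)
            for i in range(count_rows):
                id_list.append(id)
                name_list.append(df_dep_info['nome'][count - 1])
                partido_list.append(df_dep_info['siglaPartido'][count - 1])
                uf_list.append(df_dep_info['siglaUf'][count - 1])
                urlFoto_list.append(df_dep_info['urlFoto'][count - 1])
                email_list.append(df_dep_info['email'][count - 1])

# put everything together and save it
df_expenses = pd.concat(expenses_list, ignore_index=True)
df_expenses['id'] = id_list
df_expenses['nome'] = name_list
df_expenses['siglaPartido'] = partido_list
df_expenses['siglaUf'] = uf_list
df_expenses['urlFoto'] = urlFoto_list
df_expenses['email'] = email_list
df_expenses.to_csv('deputies_expenses_2020_2022.csv', index=False)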

So that's it, this was the first part of this project; in the next article, we will integrate this data with Oracle Cloud using Python.
Until next time (◠﹏◠)

Next article here.

Github project here.

