Web scraping with Python using Beautiful Soup
Load in packages
#Packages
#--Web scraping packages
from bs4 import BeautifulSoup
import requests
#Pandas/numpy for data manipulation
import pandas as pd
import numpy as np
Load URLs we want to scrape into an array
#load URLs we want to scrape into an array
BASE_URL = [
'http://www.reuters.com/finance/stocks/company-officers/GOOG.O',
'http://www.reuters.com/finance/stocks/company-officers/AMZN',
'http://www.reuters.com/finance/stocks/company-officers/AAPL'
]
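The three URLs differ only in the ticker at the end, so the list can also be built from the symbols themselves. A minimal sketch, where `URL_PREFIX`, `TICKERS`, and `build_urls` are names I made up for illustration:

```python
# Hypothetical helper: build the company-officers URLs from ticker symbols.
URL_PREFIX = 'http://www.reuters.com/finance/stocks/company-officers/'
TICKERS = ['GOOG.O', 'AMZN', 'AAPL']

def build_urls(tickers, prefix=URL_PREFIX):
    """Return one company-officers URL per ticker symbol."""
    return [prefix + ticker for ticker in tickers]

# Same three URLs as the hand-written list above
BASE_URL = build_urls(TICKERS)
```

Adding another company then only means adding its ticker to `TICKERS`.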
Loop through our URLs, scrape table, pass information to array
#loading empty array for board members
board_members = []
#Loop through our URLs we loaded above
for b in BASE_URL:
    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    #identify table we want to scrape
    officer_table = soup.find('table', {"class" : "dataTable"})
    #try clause to skip any companies with missing/empty board member tables
    try:
        #loop through table, grab each of the 4 columns shown (try one of the links yourself to see the layout)
        for row in officer_table.find_all('tr'):
            cols = row.find_all('td')
            if len(cols) == 4:
                board_members.append((b, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip()))
    #soup.find returns None when the table is missing, so find_all raises AttributeError
    except AttributeError:
        pass
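If you want to check the table-parsing logic without hitting the site, the parsing step can be pulled into a function and run on a small HTML string. This is only a sketch: `extract_officers` and `SAMPLE_HTML` are hypothetical, and the sample markup is made up to mimic a four-column table with the `dataTable` class, not copied from Reuters.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (not real Reuters markup).
SAMPLE_HTML = """
<table class="dataTable">
  <tr><th>Name</th><th>Age</th><th>Since</th><th>Current Position</th></tr>
  <tr><td> Jane Doe </td><td>50</td><td>2010</td><td>Director</td></tr>
  <tr><td>Short row</td><td>only two cells</td></tr>
</table>
"""

def extract_officers(html, url):
    """Return (url, name, age, year, title) tuples from the officers table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find('table', {"class": "dataTable"})
    if table is None:
        # No matching table on the page: nothing to collect
        return []
    rows = []
    for row in table.find_all('tr'):
        cols = [td.text.strip() for td in row.find_all('td')]
        # Header rows (th cells) and incomplete rows are skipped
        if len(cols) == 4:
            rows.append((url, *cols))
    return rows

sample_rows = extract_officers(SAMPLE_HTML, 'http://example.com/ABC')
no_table_rows = extract_officers('<p>No table here</p>', 'http://example.com/XYZ')
```

Checking for `None` explicitly does the same job as the try clause above, but it won't hide unrelated errors.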
Create new array, check length to ensure things pulled in correctly
#convert output to new array, check length
board_array = np.asarray(board_members)
len(board_array)
49
Convert new array to dataframe
#convert new array to dataframe
df = pd.DataFrame(board_array)
Rename columns, preview output
#rename columns, check output
df.columns = ['URL', 'Name', 'Age','Year_Joined', 'Title']
df.head(10)
| | URL | Name | Age | Year_Joined | Title |
---|---|---|---|---|---|
0 | http://www.reuters.com/finance/stocks/company-... | Eric Schmidt | 61 | 2015 | Executive Chairman of the Board of Director |
1 | http://www.reuters.com/finance/stocks/company-... | Sergey Brin | 43 | 2015 | President, Director |
2 | http://www.reuters.com/finance/stocks/company-... | Lawrence Page | 44 | 2015 | Chief Executive Officer, Director |
3 | http://www.reuters.com/finance/stocks/company-... | Ruth Porat | 59 | 2015 | Chief Financial Officer, Senior Vice President |
4 | http://www.reuters.com/finance/stocks/company-... | Sundar Pichai | 45 | 2017 | Director, Chief Executive Officer, Google Inc. |
5 | http://www.reuters.com/finance/stocks/company-... | David Drummond | 54 | 2015 | Senior Vice President - Corporate Development,... |
6 | http://www.reuters.com/finance/stocks/company-... | John Hennessy | 64 | 2007 | Lead Independent Director |
7 | http://www.reuters.com/finance/stocks/company-... | Diane Greene | 61 | 2015 | Director |
8 | http://www.reuters.com/finance/stocks/company-... | L. John Doerr | 65 | 2016 | Independent Director |
9 | http://www.reuters.com/finance/stocks/company-... | Roger Ferguson | 65 | 2016 | Independent Director |
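One thing to watch: every scraped cell is text, so Age and Year_Joined arrive as strings rather than numbers. Below is a rough sketch of one way to handle that, using two rows from the preview above with a placeholder URL. It also skips the NumPy step by passing the list of tuples straight to pandas. The names `rows`, `columns`, and `sample_df` are just for illustration.

```python
import pandas as pd

# Two rows from the preview above, with a shortened placeholder URL.
rows = [
    ('http://example.com/GOOG.O', 'Eric Schmidt', '61', '2015',
     'Executive Chairman of the Board of Director'),
    ('http://example.com/GOOG.O', 'Sergey Brin', '43', '2015',
     'President, Director'),
]
columns = ['URL', 'Name', 'Age', 'Year_Joined', 'Title']

# A list of tuples can go straight into a DataFrame with named columns.
sample_df = pd.DataFrame(rows, columns=columns)

# errors='coerce' turns anything non-numeric (e.g. a blank cell) into NaN
# instead of raising an error.
for col in ['Age', 'Year_Joined']:
    sample_df[col] = pd.to_numeric(sample_df[col], errors='coerce')
```

After this, something like `sample_df['Age'].mean()` works as expected.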
Export data to CSV
#export data
df.to_csv('/Users/yourname/desktop/board_members.csv')
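By default `to_csv` also writes the row index (0, 1, 2, ...) as an unnamed first column. If you don't want that in the file, pass `index=False`. A small sketch, using a one-row stand-in DataFrame and a temporary folder in place of your own path:

```python
import os
import tempfile

import pandas as pd

# Stand-in data; in the walkthrough this would be the full df.
sample_df = pd.DataFrame({'Name': ['Eric Schmidt'], 'Age': [61]})

# Write to a temporary folder here; swap in your own path.
out_path = os.path.join(tempfile.mkdtemp(), 'board_members.csv')

# index=False leaves the 0, 1, 2, ... row labels out of the file.
sample_df.to_csv(out_path, index=False)

# Read the file back to confirm only the real columns were written.
check = pd.read_csv(out_path)
```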
That's it! If you're interested in seeing how I used this data, check out my visualization on the interconnectedness of companies through shared board members here.