Nowadays, frontend frameworks like React, Vue, and Angular are more common than ever. Using those frameworks means the application is rendered dynamically on the client. These single-page applications (SPAs) can be challenging to scrape: the content is not baked into the HTML received over the wire; instead, JavaScript loads data over AJAX and builds the DOM dynamically, so a plain HTTP request returns a page that is still missing its content.
Even though the content of these applications is rendered dynamically, it is still possible to extract it by using a real browser's capabilities.
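To see the problem concretely, here is a minimal sketch that fetches the login page without a browser. It uses the requests library, which is an assumption on my part since the rest of this tutorial doesn't use it:

import requests

# Fetch the login page with a plain HTTP request, no JavaScript engine.
response = requests.get('https://twitter.com/login')
html = response.text

# For a client-rendered SPA, the initial HTML is mostly script tags and
# an empty root container; the login form is built later by JavaScript,
# so searching the raw markup for it will likely come up empty.
print('<input' in html)  # likely False for a client-rendered page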
In this tutorial I'm going to walk through an example of scraping the Twitter application. For this example we are going to use Selenium, a browser automation and functional testing tool.
Downloading the Selenium WebDriver
First of all, as the documentation says, Selenium requires a driver to interface with the chosen browser. Firefox, for example, requires geckodriver, which needs to be installed before using it. You can download the driver for your favorite browser from the Selenium documentation's download page. In this case, I'm going to choose Chrome, so I need chromedriver.
After downloading the webdriver, store it in the same folder as the script you are going to run later (or somewhere on your PATH).
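To check that everything is wired up, a quick smoke test looks like this (assuming Selenium 3.x, where the driver path can be passed directly to the constructor, as the rest of this tutorial does):

from selenium import webdriver

# If chromedriver sits in the same folder as this script (or on PATH),
# this opens and immediately closes a Chrome window without errors.
driver = webdriver.Chrome('./chromedriver')
driver.quit()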
Now let's code.
The first thing to do in our script is to import the required libraries and modules.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import time
Now we need to create a webdriver instance, which we are going to use to navigate the app.
# Point Selenium at the chromedriver binary stored next to this script
driver = webdriver.Chrome('./chromedriver')
driver.get('https://twitter.com/login')

delay = 20  # maximum seconds to wait for the page to load

try:
    # Block until the login inputs (matched by their auto-generated
    # class string) are present in the DOM, or raise TimeoutException
    WebDriverWait(driver, delay).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, 'input.r-30o5oe.r-1niwhzg.r-17gur6a.r-1yadl64.r-deolkf.r-homxoj.r-poiln3.r-7cikom.r-1ny4l3l.r-1inuy60.r-utggzx.r-vmopo1.r-1w50u8q.r-1lrr6ok.r-1dz5y72.r-1ttztb7.r-13qz1uu')))
    print("Page is ready!")
    # Both login inputs share the same class list: index 0 is the
    # email field and index 1 is the password field
    a = driver.find_elements_by_xpath("//input[@class='r-30o5oe r-1niwhzg r-17gur6a r-1yadl64 r-deolkf r-homxoj r-poiln3 r-7cikom r-1ny4l3l r-1inuy60 r-utggzx r-vmopo1 r-1w50u8q r-1lrr6ok r-1dz5y72 r-1ttztb7 r-13qz1uu']")
    a[0].send_keys("Your email")
    a[1].send_keys("Your password")
    # Submit the form by clicking the login button
    driver.find_element_by_xpath("//div[@role='button']").click()
except TimeoutException:
    print("Loading took too much time!")
What the code above does: first, it creates a webdriver instance, passing the driver's path to the constructor. Next it issues a GET request to the Twitter login page and defines a delay variable that sets the maximum waiting time for the page to load. Then comes an error-handling block: at its beginning, a call to WebDriverWait() receives two arguments, the driver and the delay; this method lets us wait until a specific element is present on the page.
When the element is found and the page has loaded, the script prints "Page is ready!". But if the page takes longer than the delay to load, the except block takes over and prints "Loading took too much time!".
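One caveat: those long auto-generated class strings are brittle and break whenever Twitter redeploys. As a sketch of a sturdier variant, the wait can target a name attribute instead; the name used below is hypothetical, and you would need to inspect the page to find the real one:

# Hypothetical variant: wait for the email input by a name attribute
# (the name below is made up) rather than the auto-generated classes.
wait = WebDriverWait(driver, delay)
email_input = wait.until(
    EC.visibility_of_element_located((By.NAME, 'session[username_or_email]')))
email_input.send_keys("Your email")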
The rest of the script just uses Selenium's selector methods to point at the elements inside the page that we are looking for, as shown in the sketch below. That's pretty much it for this tutorial.
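To go beyond logging in, the same wait-then-select pattern extracts content. Here is a sketch, under the assumption that tweets render as article elements, a detail of Twitter's markup that may change at any time:

# Navigate to the home timeline after logging in.
driver.get('https://twitter.com/home')

# Wait until at least one tweet container is present in the DOM.
WebDriverWait(driver, delay).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'article')))

# Grab the rendered text of the first few tweets, then clean up.
tweets = driver.find_elements_by_css_selector('article')
for tweet in tweets[:5]:
    print(tweet.text)
driver.quit()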
Conclusion
Web scraping is an awesome way to get data from websites and web apps. I hope you enjoyed this tutorial; if you run into any trouble, don't hesitate to write a comment and I'll be happy to help.