How to scrape a table that is populated by JavaScript?

I am learning web scraping. The site I am scraping is http://lrc.bih.nic.in/ViewRor.aspx?DistCode=36&SubDivCode=2&CircleCode=9, a Bihar land-records portal.

Using Selenium, I am able to select the following from dropdowns:

जिला: (district), अनुमंडल: (subdivision), अंचल: (circle)

I can also select a village from मौजा का नाम चुने: ("choose the mauza name").

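For context, here is a minimal sketch of how one such dropdown can be driven with Selenium's Select helper (the element id below is hypothetical and used only for illustration; use the real id from the page's HTML):

    from selenium.webdriver.support.ui import Select

    # 'ctl00_ContentPlaceHolder1_DdlDist' is a hypothetical id for the जिला
    # (district) dropdown; inspect the page source for the real one.
    district = Select(driver.find_element_by_id('ctl00_ContentPlaceHolder1_DdlDist'))
    district.select_by_index(1)  # index 0 is typically the "choose..." placeholder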

Afterwards, I am able to click on the खाता खोजें ("search khata") button.


As a result, a table at the bottom of the page is populated by JavaScript.


The button’s HTML:

<input type="submit" name="ctl00$ContentPlaceHolder1$BtnSearch" value="खाता खोजें" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$ContentPlaceHolder1$BtnSearch&quot;, &quot;&quot;, true, &quot;S&quot;, &quot;&quot;, false, false))" id="ctl00_ContentPlaceHolder1_BtnSearch" style="width:146px;">

Pagination is done by:

    javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1','Page$11')
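Since pagination runs through the page's own __doPostBack function, it can also be triggered from Selenium by executing that same JavaScript. A sketch (the 'Page$2' argument is illustrative; substitute the page you need):

    # Fire the grid's pagination postback directly from Selenium.
    driver.execute_script("__doPostBack('ctl00$ContentPlaceHolder1$GridView1','Page$2')")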

I am not able to scrape this table.

What I have tried:

  1. PhantomJS is no longer supported by Selenium, so a headless PhantomJS driver is not an option.
  2. The table’s id, ctl00_ContentPlaceHolder1_GridView1, is not in the initial HTML source, because the table is injected by JavaScript after the postback. I tried a few approaches, with no luck so far:
    # p_element = driver.find_element_by_id('ctl00_ContentPlaceHolder1_GridView1')

    p_element = driver.find_element_by_xpath('//*[@id="aspnetForm"]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr[4]')
    print(p_element.text)

    path_for_table = '//*[@id="aspnetForm"]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr[4]'
    table_list = WebDriverWait(driver, 2).until(lambda driver: driver.find_element_by_xpath(path_for_table))
    print(table_list)
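For reference, the same wait expressed with Selenium's explicit expected conditions, targeting the table's id directly and with a longer timeout (a sketch, assuming the id appears once the postback completes):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Wait up to 10 seconds for the grid to be rendered after the postback.
    table = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'ctl00_ContentPlaceHolder1_GridView1'))
    )
    print(table.text)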

Pages I have looked at:

1 thought on “How to scrape a table that is populated by JavaScript?”

  1. First, let’s load the site. I am using BeautifulSoup for the parsing, along with Selenium.

    import bs4 as Bs
    from selenium import webdriver
    
    DRIVER_PATH = r'D:\chromedriver.exe'  # raw string, so the backslash is not treated as an escape
    driver = webdriver.Chrome(executable_path=DRIVER_PATH)
    
    driver.get('http://lrc.bih.nic.in/ViewRor.aspx?DistCode=36&SubDivCode=2&CircleCode=9')
    

    Then click on a village name (change the row according to your need):

    driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_GridView2"]/tbody/tr[3]/td[1]').click()
    

    Click on the "खाता खोजें" (search khata) button:

    driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_BtnSearch"]').click()
    

    Get the page’s source and parse it with BeautifulSoup:

    page_src = Bs.BeautifulSoup(driver.page_source, 'html.parser')
    

    Find the div with id ctl00_ContentPlaceHolder1_UpdatePanel2 and collect all the tds inside it:

    table_elements = page_src.find("div",{"id":"ctl00_ContentPlaceHolder1_UpdatePanel2"}).find_all("td")
    

    Take the first six tds as the header and extract the column names from them:

    columns = table_elements[:6]
    column_names = [e.text for e in columns]
    

    columns:

    [<td>क्रम</td>,
     <td>रैयतधारी का नाम</td>,
     <td style="white-space:nowrap;">पिता/पति का नाम</td>,
     <td>खाता संख्या</td>,
     <td>खेसरा संख्या</td>,
     <td>अधिकार<br/>अभिलेख</td>]
    

    column_names:

    ['क्रम',
     'रैयतधारी का नाम',
     'पिता/पति का नाम',
     'खाता संख्या',
     'खेसरा संख्या',
     'अधिकारअभिलेख']
    

    Next, get the body of the table, skipping the six header cells and the four trailing cells (the pager area):

    body_of_table = table_elements[6:-4] 
    

    Then split the body into chunks of six cells, one chunk per table row, and get the text out:

    chunks = [body_of_table[x:x+6] for x in range(0, len(body_of_table), 6)]
    data = [[e.text.strip('\n') for e in chunk] for chunk in chunks]
    

    data:

    [['1', 'अरूण कुमार', 'शिवलाल पासवान', '55', '406', 'देखें'],
     ['2', 'इन्द्रदेव प्रसाद', '\xa0', '98', '789', 'देखें'],
     ['3', 'ईश्वर मांझी', 'चमारी मांझी', '78', '42', 'देखें'],
     ['4', 'कवलसिया देवी', 'तुलसी मांझी', '120', '41', 'देखें'],
     ['5', 'कामदेव पांडे', 'शिवदानी पांडे', '210', '457, 459, 461, 474', 'देखें'],
     ['6', 'कामेश्वर मांझी', 'उत्ती मांझी', '78', '43', 'देखें'],
     ['7', 'कारू मांझी', 'राधे मांझी', '78', '42', 'देखें'],
     ['8', 'कारू मांझी', 'मेघन मांझी', '78', '42', 'देखें'],
     ['9', 'कौशल्या देवी', 'केदार महतो', '253', '757', 'देखें'],
     ['10', 'गणेश साव', 'छेदी साव', '156', '236', 'देखें'],
    
    ....
    

    Now import pandas and use it to create a dataframe out of this list of lists:

    import pandas as pd
    df = pd.DataFrame(data, columns = column_names)
    
    # set क्रम as index (set_index returns a new frame, so assign it back)
    df = df.set_index(df.columns[0])
    

    Final result:

    import time # using time.sleep for illustration only. You should use an explicit wait
    import bs4 as Bs
    import pandas as pd
    from selenium import webdriver
    
    DRIVER_PATH = r'D:\chromedriver.exe'
    driver = webdriver.Chrome(executable_path=DRIVER_PATH)
    
    driver.get('http://lrc.bih.nic.in/ViewRor.aspx?DistCode=36&SubDivCode=2&CircleCode=9')
    
    time.sleep(4)
    
    #click on a village name
    driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_GridView2"]/tbody/tr[3]/td[1]').click()
    
    time.sleep(2)
    
    # click on खाता खोजें
    driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_BtnSearch"]').click()
    
    time.sleep(2)
    
    
    # ----------- table extracting part ------------------
    
    
    # get page source
    page_src = Bs.BeautifulSoup(driver.page_source, 'html.parser')
    
    # find the id: ctl00_ContentPlaceHolder1_UpdatePanel2 and find all tds in it
    table_elements = page_src.find("div",{"id":"ctl00_ContentPlaceHolder1_UpdatePanel2"}).find_all("td")
    
    # get columns and get the text out of them
    columns = table_elements[:6]
    column_names = [e.text for e in columns]
    
    # get the body of the table
    body_of_table = table_elements[6:-4]
    
    # create chunks of 6 columns for each entry
    chunks = [body_of_table[x:x+6] for x in range(0, len(body_of_table), 6)]
    
    # get the text out
    data = [[e.text.strip('\n') for e in chunk] for chunk in chunks]
    
    df = pd.DataFrame(data, columns = column_names)
    
    # set क्रम as index (assign back: set_index returns a new frame)
    df = df.set_index(df.columns[0])
    
    print(df)
    

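    As an aside, pandas can often parse the rendered grid straight from the page source, skipping the manual td slicing. A sketch, assuming lxml (or html5lib) is installed; the pager row may be included and need dropping:

    import pandas as pd

    # Let pandas locate the GridView by its id and parse it into a dataframe.
    df_alt = pd.read_html(driver.page_source,
                          attrs={'id': 'ctl00_ContentPlaceHolder1_GridView1'})[0]
    print(df_alt.head())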


    To scrape the next pages:

    • Click on the next-page link using Selenium.
    • Wait for the page to load.
    • Rerun the table-extracting part (by putting it into a function).
    • Discard the column names (we already have them).
    • Append the new rows to the already created dataframe.
    • Repeat the above steps for all pages (you can add a while loop and keep clicking to the next page until an exception occurs; see try and except); a sketch of this loop follows below.
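    A minimal sketch of that loop, assuming the table-extracting part above has been wrapped in a hypothetical extract_rows(driver) helper that returns the list of row lists:

    from selenium.common.exceptions import NoSuchElementException

    page = 2
    while True:
        try:
            # The pager renders page numbers as links that fire __doPostBack.
            driver.find_element_by_link_text(str(page)).click()
        except NoSuchElementException:
            break  # no link for this page number: last page reached
        time.sleep(2)  # illustration only; prefer an explicit wait
        data.extend(extract_rows(driver))  # hypothetical helper, see steps above
        page += 1

    df = pd.DataFrame(data, columns=column_names)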