How to grab a web page’s content, just as the IMPORTXML formula would, with Google Apps Script?

Basically, I’m building a scraper to get the contents of a web page so that I can count the occurrences of a certain keyword and work out that keyword’s density on the page.
I can already count the number of occurrences. The issue I’m having is that UrlFetchApp.fetch pulls the raw source code, so I end up counting not only the words that would appear on the rendered front end of the page, but also everything else in the source. (A rough sketch of the density calculation I have in mind is below.)
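To be concrete, this is the kind of calculation I mean; keywordDensity is just my own hypothetical helper, not anything built into Apps Script, and the occurrence count is a simple substring count:

```
// Hypothetical helper: keyword density = occurrences / total word count.
function keywordDensity(text, kw) {
  var lowerText = String(text).toLowerCase();
  var lowerKw = String(kw).toLowerCase();

  // Rough total word count: split on runs of whitespace.
  var words = lowerText.split(/\s+/).filter(function (w) {
    return w.length > 0;
  });

  // Non-overlapping substring occurrences of the keyword.
  var occurrences = lowerText.split(lowerKw).length - 1;

  return words.length > 0 ? occurrences / words.length : 0;
}
```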

To get around that, I want to filter the page so that I only grab its actual visible content.

I would love to get the content just as =IMPORTXML("[page]","//div[@id='main']") does in Google Sheets, but I have no idea how to apply that logic in code. Since I’m using Google Apps Script, using querySelector or getElementsByTagName is not an option, unfortunately.

The reason I’m trying to do this in code rather than with the Google Sheets formula is that it will be applied to a large number of pages. (One way I could imagine approximating that XPath is sketched below.)
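This is a minimal sketch of that approximation using plain string slicing. It assumes the page contains a single <div id="main"> with no nested <div> inside it, so it’s a fragile stand-in for a real XPath engine, not an equivalent:

```
// Rough approximation of //div[@id='main'] via string slicing.
// Assumes one <div id="main"> and no nested <div> inside it.
function extractMainDiv(url) {
  var html = UrlFetchApp.fetch(url).getContentText();

  var start = html.search(/<div[^>]*id=["']main["'][^>]*>/i);
  if (start === -1) return '';

  var end = html.indexOf('</div>', start); // breaks if #main contains nested divs
  if (end === -1) return '';

  // Strip the remaining tags so only the visible text is left.
  return html.slice(start, end)
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}
```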

```
var response = UrlFetchApp.fetch(url, options);
var html = response.getContentText();

// Keep only what sits between the opening <body ...> tag and </body>.
var bodyStart = html.indexOf('<body');
var body = html.substring(html.indexOf('>', bodyStart) + 1, html.lastIndexOf('</body>'));

var lowerBody = body.toLowerCase(); // lower-case the contents so matching is not case sensitive
kw = String(kw).toLowerCase();      // same as above, but for the keyword

var newBody = [];
var filter = ['p', 'li', 'span', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']; // the tags I want to keep
      
      
for (var i = 0; i < filter.length; i++) {
  var tag = filter[i];

  // Build one pattern per tag: (?:\s[^>]*)? allows attributes on the
  // opening tag, and [\s\S]*? lets the match span line breaks.
  var pattern = new RegExp('<' + tag + '(?:\\s[^>]*)?>([\\s\\S]*?)</' + tag + '>', 'g');
  var matches = lowerBody.match(pattern);

  if (matches) { // match() returns null, not an empty array, when nothing is found
    newBody[i] = matches.map(function (val) {
      return val.replace(/<[^>]+>/g, ''); // strip the surrounding tags
    });
  } else {
    newBody[i] = [];
  }
}
```

I know my code is bad, so I’m willing to scrap all of it if there’s a better way of accomplishing this.
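One simpler alternative I’ve been considering is to skip the per-tag whitelist entirely and just strip scripts, styles, and every remaining tag, then count words in what’s left. A minimal sketch of that idea:

```
// Alternative sketch: strip non-visible markup instead of whitelisting tags.
function visibleText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop script blocks
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop style blocks
    .replace(/<[^>]+>/g, ' ')                    // drop all remaining tags
    .replace(/&nbsp;/gi, ' ')                    // common entity in page text
    .replace(/\s+/g, ' ')
    .trim();
}
```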
Thanks in advance.
