Web scraping/parsing in Node.js to detect the language of an HTML page?

I am using the Readability Parser API and the node-readability module to scrape and parse web pages on a server built on Node.js. I can extract a lot of information (title, links, date, content, length…) about articles published on publisher sites and blogs (my targets), but not the language they are written in. Any idea how I could do this?

There is the Google Translate API, but it is not free, and I don’t need any translation.
There is also the Alchemy Language Detection API and the node-language-detect module, but both seem to detect the language from a given text, whereas in my case some information about the language may already be available in the HTML code of the page (see http://www.w3.org/TR/i18n-html-tech-lang/).

39 thoughts on “Web scraping/parsing in Node.js to detect the language of a HTML page?”

  1. While inferring the language of a web page can be difficult (Bonjour!), HTML is there to help. Look for the lang attribute:

    <html lang="en-us">
    

    Note that any element can carry this attribute. Taking my opening sentence as an example:

    <p lang="en-us">While inferring the language of a web page can be difficult <span lang="fr">(Bonjour!)</span></p>
    

    More info here: https://stackoverflow.com/a/7076990/1216976
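
    As a rough illustration of reading the root lang attribute in Node.js, here is a minimal sketch. The function name detectHtmlLang is hypothetical, and a regular expression is used only to keep the example free of dependencies; a real HTML parser would cope better with unusual markup. Lowercasing the result is also just a choice made here, since language tags are case-insensitive.

    ```javascript
    // Hypothetical helper (not part of node-readability): returns the lang
    // attribute of the root <html> element, or null when none is declared.
    function detectHtmlLang(html) {
      // Locate the opening <html ...> tag, ignoring case.
      const tag = /<html\b[^>]*>/i.exec(html);
      if (!tag) return null;

      // Accept lang="..", lang='..' or an unquoted value. Requiring whitespace
      // before "lang" avoids matching xml:lang or data-lang.
      const attr = /\slang\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i.exec(tag[0]);
      if (!attr) return null;

      const value = (attr[1] ?? attr[2] ?? attr[3]).trim();
      // Normalize to lowercase so "en-US" and "en-us" compare equal.
      return value === '' ? null : value.toLowerCase();
    }

    console.log(detectHtmlLang('<html lang="en-us"><body></body></html>'));
    ```

    This only looks at the root element, so it would not notice a span marked up in another language further down the page.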

    Alternatively, you could check the Content-Language response header, but it is less specific: it describes the page as a whole rather than individual elements.
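
    For the header route, something along these lines might work. It is a sketch that assumes Node.js 18 or later (for the built-in fetch); parseContentLanguage and fetchContentLanguages are hypothetical names. The header can list several comma-separated language tags, and many servers do not send it at all, so an empty result is common.

    ```javascript
    // Hypothetical helper: split a Content-Language header value such as
    // "en-US, fr" into a list of lowercase language tags.
    function parseContentLanguage(headerValue) {
      if (!headerValue) return []; // header missing or empty
      return headerValue
        .split(',')
        .map((tag) => tag.trim().toLowerCase())
        .filter((tag) => tag.length > 0);
    }

    // Assumes the global fetch available in Node.js 18+.
    async function fetchContentLanguages(url) {
      const response = await fetch(url);
      // headers.get() returns null when the header is absent.
      return parseContentLanguage(response.headers.get('content-language'));
    }

    console.log(parseContentLanguage('en-US, fr'));
    ```

    A reasonable order might be to try the lang attribute first, then this header, and only then fall back to text-based detection such as node-language-detect.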
