Web scraping/parsing in Node.js to detect the language of an HTML page?

I am using the Readability Parser API and the node-readability module to scrape and parse web pages on a server built on Node.js. I can extract a lot of information (title, links, date, content, length…) from articles published on publisher sites and blogs (my target), but I cannot get the language they are written in. How could I detect it?

There is the Google Translate API, but it is not free, and I don’t need any translation.
There are also the Alchemy Language Detection API and the node-language-detect module, but both seem to detect the language from a given text, whereas in my case some information about the language may already be declared in the HTML code of the page (see http://www.w3.org/TR/i18n-html-tech-lang/).
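
To make concrete what I mean by language information in the HTML code, here is a minimal sketch that reads only the declared language, without any text-based detection. The function name detectDeclaredLanguage is hypothetical (it is not part of node-readability or any other module), and I am assuming the raw HTML string is available, plus optionally the response headers as an object with lower-cased keys. It uses regular expressions rather than a real HTML parser, so it will miss unusual markup, and a declared language can of course be absent or wrong.

```javascript
// Hypothetical helper, not part of any library: returns the language
// declared by the page itself, or null if nothing is declared.
function detectDeclaredLanguage(html, headers = {}) {
  // 1. lang (or xml:lang) attribute on the <html> element.
  const htmlTag = html.match(/<html\b[^>]*>/i);
  if (htmlTag) {
    const attr = htmlTag[0].match(/\s(?:xml:)?lang\s*=\s*["']?([A-Za-z0-9-]+)/i);
    if (attr) return attr[1].toLowerCase();
  }

  // 2. <meta http-equiv="Content-Language" content="..."> in the markup.
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    if (/http-equiv\s*=\s*["']?content-language/i.test(tag)) {
      const content = tag.match(/\scontent\s*=\s*["']?([A-Za-z0-9-]+)/i);
      if (content) return content[1].toLowerCase();
    }
  }

  // 3. Content-Language HTTP response header (assumed lower-cased key).
  const header = headers['content-language'];
  if (header) return header.split(',')[0].trim().toLowerCase();

  // Nothing declared: this is where a text-based detector would be needed.
  return null;
}

// Example usage with an inline document.
const sample = '<!DOCTYPE html><html lang="fr-CA"><head></head><body>Bonjour</body></html>';
console.log(detectDeclaredLanguage(sample));
```

When this returns null, falling back to a text-based detector on the extracted article content would presumably still be necessary.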
