I am using the Readability Parser API and the node-readability module for web scraping and parsing on a server built with Node.js. For the articles I target (publisher sites and blogs), I can get a lot of information (title, links, date, content, length…), but not the language they are written in. Any idea how I could do this?
There is the Google Translate API, but it is not free, and I don’t need any translation.
There is also the Alchemy Language Detection API and the node-language-detect module, but both seem to detect the language from a given text, whereas in my case some information about the language may already be available in the HTML code of the page (see http://www.w3.org/TR/i18n-html-tech-lang/).
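To illustrate the kind of check I have in mind, here is a minimal sketch that reads the declared language from the raw HTML, and from the response headers if available. The function name `detectDeclaredLanguage` and the plain `headers` object with lowercase keys are my own assumptions, not part of any of the modules above, and the regular expressions only cover the common cases (`lang` / `xml:lang` on `<html>`, a `Content-Language` meta tag, the `Content-Language` HTTP header).

```javascript
// Hypothetical helper (not part of node-readability or any other module):
// returns the language declared by the page itself, or null if none is found.
function detectDeclaredLanguage(html, headers = {}) {
  // 1. lang or xml:lang attribute on the opening <html> tag
  const htmlTag = html.match(/<html\b[^>]*>/i);
  if (htmlTag) {
    const attr = htmlTag[0].match(/\s(?:xml:)?lang\s*=\s*["']?([A-Za-z0-9-]+)/i);
    if (attr) return attr[1].toLowerCase();
  }

  // 2. <meta http-equiv="Content-Language" content="...">
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    if (/http-equiv\s*=\s*["']?content-language/i.test(tag)) {
      const content = tag.match(/\scontent\s*=\s*["']?([A-Za-z0-9-]+)/i);
      if (content) return content[1].toLowerCase();
    }
  }

  // 3. Content-Language HTTP response header (assumes lowercase header keys);
  //    if several languages are listed, only the first one is kept
  const header = headers['content-language'];
  if (header) return header.split(',')[0].trim().toLowerCase();

  // Nothing declared: a text-based detector would be needed as a fallback
  return null;
}

// Usage with an inline sample page
const sample = '<!DOCTYPE html><html lang="en-US"><head><title>t</title></head></html>';
console.log(detectDeclaredLanguage(sample));
```

Language tags are case-insensitive, so the sketch lowercases them for easier comparison. A declared language can be missing or wrong, so a `null` result (or a suspicious one) would presumably still need a text-based detector run on the extracted article content.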