Keyword extraction: Still worth a look

Keyword Extraction is the task of identifying relevant (or “salient”) concepts in a text. It is sometimes also called Keyphrase Extraction to emphasize that many concepts span more than one word. Similar to text summarization, it helps to get the profile of a text. While doc2vec and similar techniques for turning unstructured text into numerical, computable values have made this use case obsolete to some degree, Keyword Extraction is still a great way of presenting the gist of one or many texts to users. For instance, a search engine can present the keywords found in search results as a drill-down, a feature that is often used in scientific or bibliographical search engines.

A good background article on the process of keyword generation and selection, keyword ranking and filtering is provided by Amit Chaudhary and can be found here. This article covers all the technical basics and is a great introduction. Here we want to focus on how different keyword extraction solutions compare to each other and what this means for such a common and important task as keyword extraction.

For a quick comparison study, we used roughly the first 5,000 characters of a random Wikipedia article (“Graviton”, because it is an interesting topic, a long enough article, and it yields a lot of great keywords to a human reader) and stripped some, but not all, of the noise to get a clean text, e.g. removing the footnote references in square brackets but leaving in one instance of the iconic “citation needed” tag.
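This kind of clean-up can be sketched in a few lines of Python. The snippet below is only an illustration under assumptions, not the script used for the study: the function name `clean_snippet`, the regular expression and the sample sentence are made up, and it assumes that footnote references look like `[1]` or `[23]`.

```python
import re

# Assumption: footnote references are purely numeric, e.g. "[1]" or "[23]".
# "[citation needed]" contains no digits-only body, so it is left untouched.
FOOTNOTE_REF = re.compile(r"\[\d+\]")


def clean_snippet(raw_text: str, max_chars: int = 5000) -> str:
    """Remove numeric footnote markers and keep about the first max_chars characters."""
    without_refs = FOOTNOTE_REF.sub("", raw_text)
    # Collapse runs of spaces or tabs that the removal may leave behind.
    tidy = re.sub(r"[ \t]{2,}", " ", without_refs)
    return tidy[:max_chars].strip()


# Made-up sample sentence, not a quote from the Wikipedia article.
sample = (
    "The graviton is a hypothetical particle.[1] "
    "It is massless[citation needed] and has spin 2.[12]"
)
cleaned = clean_snippet(sample)
print(cleaned)
```

Removing only the numeric markers is a deliberate choice here: it strips the most frequent noise while leaving other bracketed text, such as “citation needed”, in place.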

We compared four solutions: two popular Python modules, Rake and Yake, and two leading Cloud NLP solutions, Amazon’s AWS Comprehend and Google Cloud Natural Language. Yake is supposed to perform better than Rake, but how do they compare to AWS and Google? As a side note: for larger collections, TextRank should be considered as well, but for our Graviton snippet, those four approaches are easily accessible and are probably widely considered to be state of the art.

We looked at the top 50 keyphrases returned by each solution (every solution provides some sort of score between 0 and 1, although AWS Comprehend assigns 0.99+ to almost all keyphrases it presents). The top-ranked keyphrase gets score 50, the second-ranked gets score 49 and so on until rank 50 with score 1. Each of the 164 unique keyphrases returned by at least one of the four solutions was put in a table, sorted by the sum of its rank scores in descending order, with the individual rank in each solution as columns. Here are the top entries:

[Image: table of the top-ranked keyphrases with their rank in each of the four solutions; no alt text was provided for this image]
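The rank scoring described above can be sketched as follows. This is a minimal illustration rather than the code behind the table: the function names, the neutral solution names and the short input lists are invented, and the example uses a top 3 instead of a top 50 to keep it readable.

```python
from collections import defaultdict

TOP_N = 50  # rank 1 gets score 50, rank 50 gets score 1


def rank_scores(ranked_phrases, top_n=TOP_N):
    """Map each phrase to its rank score: rank 1 -> top_n, rank 2 -> top_n - 1, ..."""
    scores = {}
    for index, phrase in enumerate(ranked_phrases[:top_n]):
        # If a phrase appears twice, keep its first (best) position.
        scores.setdefault(phrase, top_n - index)
    return scores


def combine(results, top_n=TOP_N):
    """Build table rows (phrase, summed score, {solution: rank}), best first.

    results maps a solution name to its keyphrases, ordered best first.
    """
    totals = defaultdict(int)
    ranks = defaultdict(dict)
    for solution, phrases in results.items():
        for phrase, score in rank_scores(phrases, top_n).items():
            totals[phrase] += score
            ranks[phrase][solution] = top_n - score + 1
    # Sort by summed score (descending), then alphabetically to break ties.
    return sorted(
        ((phrase, totals[phrase], ranks[phrase]) for phrase in totals),
        key=lambda row: (-row[1], row[0]),
    )


# Invented toy input, not actual output of any of the four solutions.
toy_results = {
    "solution_a": ["graviton", "quantum gravity", "spin"],
    "solution_b": ["graviton", "general relativity", "quantum gravity"],
}
table = combine(toy_results, top_n=3)
for phrase, total, per_solution in table:
    print(f"{phrase:20} {total:3} {per_solution}")
```

A keyphrase that a solution does not return simply contributes nothing to the sum, so phrases found by several solutions rise to the top of the table.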

A few takeaways from this quick comparison

–      Google Cloud and Yake return somewhat similar results and are probably using a similar approach (interesting, because Yake’s approach is really clever, yet it is simple statistics and no rocket science at all)

–      Overall, there is a remarkable amount of inconsistency: many keywords are returned by only one of the four solutions, and ranks are usually all over the place

–      Also, there is a considerable amount of noise: we highlighted in red two egregious examples from AWS Comprehend, but each solution has results that seem to be really off (of course, in the lower section of the table there are even more of these examples)
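Observations like these can also be put into numbers. The sketch below is one possible way to do that, not something that was computed for this comparison: it uses Jaccard similarity as an assumed overlap measure between the keyphrase sets of two solutions and lists the keyphrases returned by only one solution. All names and the toy input are invented.

```python
from collections import Counter
from itertools import combinations


def jaccard(phrases_a, phrases_b):
    """Share of keyphrases that two solutions have in common (0.0 to 1.0)."""
    set_a, set_b = set(phrases_a), set(phrases_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pairwise_overlap(results):
    """Jaccard similarity for every pair of solutions."""
    return {
        (first, second): jaccard(results[first], results[second])
        for first, second in combinations(sorted(results), 2)
    }


def returned_by_one_only(results):
    """Keyphrases that exactly one solution returned, in alphabetical order."""
    counts = Counter(
        phrase for phrases in results.values() for phrase in set(phrases)
    )
    return sorted(phrase for phrase, count in counts.items() if count == 1)


# Invented toy input, not actual output of any of the four solutions.
toy_results = {
    "solution_a": ["graviton", "string theory"],
    "solution_b": ["graviton", "quantum gravity"],
    "solution_c": ["graviton", "quantum gravity", "general relativity"],
}
overlap = pairwise_overlap(toy_results)
singletons = returned_by_one_only(toy_results)
for pair, value in overlap.items():
    print(pair, round(value, 2))
print(singletons)
```

Jaccard similarity ignores the ranks and only compares which keyphrases were returned; a rank-aware measure would be needed to capture how far apart the ranks are.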

In short, despite its importance, Keyword Extraction is far from being “already solved” and in fact calls for a close look at processes, applications, training data, QA and user feedback in order to come up with a useful solution.

We at Glanos can help you in this area.