How TextRank Enables Extractive Summarization (Video)
Learn more about AI and NLP at the next SpeechTEK conference.
Read the complete transcript of this clip:
Paco Nathan: There's a project called TextRank, and a friend of a friend came up with this algorithm and was subsequently hired by Google.
The idea is that if you take a text document, you can split it into paragraphs and sentences; we have tools for that. Now, for each sentence, parse it with NLP. For every word, you know what the word is, you know its root, you know its part of speech. You can now construct a graph: you can look at a noun and say, "Okay, what are its neighbors? Where are the adjectives? Where are the verbs? Where are the other nouns?"
And for the neighbors, link them together in a graph. And if you see repeat instances of the same word, link those too. And if you have a knowledge graph, some context coming in, and you see links in common there, add those too. Now you end up with this graph representing your text document. And again, the text document may be a transcript from a video.
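The graph construction described above can be sketched in a few lines. This is a toy version under simplifying assumptions: instead of a full NLP parse, it uses a sliding window as a stand-in for "neighbors," and a tiny stopword list in place of part-of-speech filtering. Repeat instances of a word naturally merge into one node. The speaker's own project (pytextrank) does all of this properly; this sketch just shows the shape of the idea.

```python
from collections import defaultdict

# Hypothetical minimal stopword list, standing in for real POS filtering.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}

def build_word_graph(text, window=2):
    """Return an undirected co-occurrence graph as {word: {neighbor, ...}}."""
    graph = defaultdict(set)
    for sentence in text.split("."):
        words = [w.strip(",;:").lower() for w in sentence.split()]
        words = [w for w in words if w and w not in STOPWORDS]
        # Link each word to the others that fall inside the sliding window.
        for i, w in enumerate(words):
            for neighbor in words[i + 1 : i + 1 + window]:
                if neighbor != w:
                    graph[w].add(neighbor)
                    graph[neighbor].add(w)
    return graph

doc = "Graphs model text. Text becomes a graph of words."
g = build_word_graph(doc)
```

Because edges are keyed on the word itself, the two occurrences of "text" above collapse into a single node that links both sentences together, which is exactly the "repeat instances" trick from the transcript.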
The trick in graph algorithms is that we use an approach called centrality. Basically, we can look at all the parts that are linked together and find out which ones are the hubs, the ones that are the most connected. And there are mathematical ways to describe that. One technique is something called eigenvector centrality. A lot of two-bit words. There's a variant of it called stochastic eigenvector centrality, and the other name for that is PageRank, the algorithm that Google invented.
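The centrality step can be sketched with the standard PageRank power iteration, run directly on a word graph of the kind described above (here a plain dict mapping each word to its set of neighbors). This is the textbook algorithm under the usual assumptions, not the code of any particular library:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph of {node: neighbors}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor shares its current rank equally among its links.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Toy graph: "cat" is the hub, linked to both other words.
g = {"cat": {"sat", "mat"}, "sat": {"cat"}, "mat": {"cat"}}
scores = pagerank(g)
```

The "stochastic" part is visible in the `rank[nb] / len(graph[nb])` term: each node's score is spread across its outgoing links, so the update is a random walk over the graph, and the hub ends up with the highest score.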
Basically, once you've constructed a graph out of a text and you run PageRank on it, the most highly ranked phrases are the ones that are referenced a lot throughout the document. You start pulling those out and you come out with what we call a feature vector: a list of key phrases that are highly ranked, but that are also basically characteristic of the document.
This is a really great way not just to parse a text but to really start to understand what it's about. And it's a way to bring in prior knowledge. By the way, there's an open-source Python project on GitHub that implements this; I'm one of the lead committers.
One of the interesting things is that you can now go back and re-evaluate the text. If you've got that feature vector and the graph, you can look at every sentence and ask: what's the vector distance from this sentence to my feature vector? Now I can go through the entire document, rank the sentences, take the top N, and put them back into their original order, and you come up with a summarization. This is a technique called extractive summarization. We use it for our search results, both on video and books. It works very well. Here's an example of taking an entire article and condensing it down to a paragraph, shown at the bottom right.
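The extractive step above can be sketched as follows. As a simplifying assumption, keyword overlap stands in for the vector distance between a sentence and the feature vector; a real system (such as pytextrank) would use ranked phrases and proper vector arithmetic. The two essential moves from the transcript are both here: rank the sentences by score, then restore the survivors to document order.

```python
def extractive_summary(sentences, keywords, top_n=2):
    """Keep the top_n sentences by keyword overlap, in original order."""
    def score(sentence):
        words = {w.strip(".,").lower() for w in sentence.split()}
        return len(words & keywords)

    # Rank sentence indices by score, keep the best top_n...
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:top_n])  # ...then restore document order.
    return [sentences[i] for i in keep]

sentences = [
    "TextRank builds a graph from a document.",
    "Lunch was pretty good today.",
    "PageRank scores the nodes in that graph.",
]
# Hypothetical feature vector: the top-ranked key words for this toy text.
summary = extractive_summary(sentences, {"graph", "pagerank", "textrank"})
```

Restoring the original order matters: the summary should read as a condensed version of the document, not as a list of hits sorted by relevance.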
And there are ways of using deep learning with this too. Extractive summarization is one step. You can go a few steps further in terms of abstractive or generative summarization. But there are very interesting ways of using AI to take a large quantity of media and condense it down. And that's a big win if you're an editor and you have to watch four thousand hours of video per year. That's like 10 months of having your finger on the fast-forward button.
So summarization is a big upcoming tool and technique.
Paco Nathan of O'Reilly Media's R&D Group discusses TextRank and extractive summarization in this clip from SpeechTEK 2018.