Topic Extraction and Its Usefulness in Data Science Ready Organizations
Date: August 3, 2017

Situation Analysis

Text data is a very valuable resource that is also abundant nowadays. Apart from the text within an organization’s documents, there is plenty of text freely available on the web, containing useful signals waiting to be harnessed. However, this text is often raw, without any labels to hint at how it would best be used, whether through a data analysis method or some ETL process. As a result, this resource is not properly utilized or, at the very least, does not fulfill its potential to positively affect the bottom line.

Relevant Problems and Their Implications

This situation may seem benign at first, but holding data that adds no value to your organization is wasteful. Also, as this data accumulates, it translates into storage costs. What’s more, the text data that is freely available on the web (e.g. the various social media feeds, the relevant blog posts, and the majority of industry-related news feeds) may not cost you anything, but it is a resource that is probably being harnessed by your competitors. Finally, text data that is utilized only superficially (e.g. if you ask a statistician or some junior analyst to process it) is bound to be a waste as far as human resources are concerned, since its lack of structure makes it hard for your staff to work with. In general, you are better off delegating a less challenging task to your analysts.

Apart from the obvious effects of these issues on your bottom line, leaving a resource as valuable as text data to chance has strategic shortcomings too. In the subtle race towards the adoption of AI-related tech, if you don’t make use of some advanced NLP method in your data pipeline(s), you are likely to be left behind.

How Topic Extraction Can Help

Topic extraction is a data science methodology that falls into the Natural Language Processing (NLP) category. It involves processing a corpus of text documents that are written in the same language and have some overlap in their vocabulary. The process consists of distilling the information in the text of the documents (something possible in various ways), putting this information into a numeric table (i.e. a matrix), using that matrix to find which words or phrases are most important, and then clustering the documents into groups (topics) based on it. Afterwards, the process finds the most relevant words or phrases of each group and provides a list of them, along with the documents that correspond to that group.
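
To make these steps more concrete, here is a minimal sketch in Python that uses only the standard library. It is an illustration under assumptions, not a reference implementation: the post does not name a weighting scheme or a clustering algorithm, so TF-IDF weights and a basic k-means with a deterministic farthest-point start are choices made for this example. The function names (tokenize, tfidf_matrix, kmeans, extract_topics), the tiny stop-word list, and the four sample documents are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real projects use larger ones).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_matrix(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed inverse document frequency: rarer words get larger weights.
        row = [tf[w] * (math.log((1 + n) / (1 + doc_freq[w])) + 1) for w in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


def sqdist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(rows, k, max_iter=50):
    """Basic k-means; starts from row 0 and then the rows farthest from the chosen ones."""
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(max(rows, key=lambda r: min(sqdist(r, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sqdist(r, centroids[j])) for r in rows]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def extract_topics(docs, k, top_n=3):
    """Cluster the documents and describe each cluster by its heaviest terms."""
    vocab, rows = tfidf_matrix(docs)
    labels, centroids = kmeans(rows, k)
    topics = []
    for j, centroid in enumerate(centroids):
        ranked = sorted(range(len(vocab)), key=lambda i: centroid[i], reverse=True)
        topics.append({
            "terms": [vocab[i] for i in ranked[:top_n]],
            "documents": [i for i, lab in enumerate(labels) if lab == j],
        })
    return topics


# Made-up sample corpus: two finance sentences and two sports sentences.
docs = [
    "The stock market rallied as investors bought bank shares",
    "Investors sold shares after the bank reported weak earnings",
    "The football team won the match with a late goal",
    "A late goal gave the team a win in the cup match",
]
topics = extract_topics(docs, k=2)
for number, topic in enumerate(topics):
    print(number, topic["terms"], topic["documents"])
```

Each entry in topics pairs a short list of characteristic terms with the indices of the documents assigned to that group, which mirrors the output described above. On a real corpus you would most likely rely on an established NLP library rather than hand-rolled code, and the number of topics (k) would need to be chosen with care.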

In essence, a topic extraction model organizes the documents, much like a person managing a private library would organize a collection of books onto shelves according to their themes. The topic extraction process, however, is automated, fairly fast, and works with all kinds of texts, making it a versatile solution for structuring a corpus of documents into a comprehensive taxonomy. As a bonus, getting involved in this methodology is an easy way into AI-related technologies (although the latter involve more than just NLP).

Things You Can Do Right Here Right Now

To make the most of all this, you can take one or more of the following steps. First of all, you can learn more about this topic through a reliable source, such as a book or a video. Moreover, you can contact Data Science Partnership (DSP) for a consultation session or two, exploring the various ways you can use your text data (or any other text available on the web) for topic extraction that benefits your organization. Finally, DSP can supply you with an expert in this field, so that you can incorporate this methodology into your pipeline without having to rely on any external vendors. Whatever your decision, you stand to gain a lot from this powerful data science methodology.


Zacharias Voulgaris

Zach is the Chief Technical Officer at Data Science Partnership. He studied Production Engineering and Management at the Technical University of Crete, shifted to Computer Science through a Master’s in Information Systems & Technology (City University of London), and then to Data Science through a PhD on Machine Learning (University of London). He has worked at Georgia Tech as a Research Fellow, at an e-marketing startup in Cyprus as an SEO manager, and as a Data Scientist at both Elavon (GA) and G2 (WA). He was also a Program Manager at Microsoft, on a data analytics pipeline for Bing.
