Extracting Value from the Unstructured Internet

Gaining insights from unstructured data is valuable but requires a daunting collection of skills and technology. Gartner says, “Unstructured data growth is rapidly outpacing structured data and is poorly controlled, stored and managed on file shares, on personal devices and in the cloud.” They conclude that it isn’t a storage problem; it’s an intelligence problem.

I shook my head when I read a recent article in Datamation, “9 Steps to Extract Insight from Unstructured Data”. The article makes a great case for why we invented Bitvore. I’ll summarize the challenge: figure out where and how to collect the information, establish a data lake, invent a taxonomy, master a variety of natural language analysis methods and algorithms, then begin. It’s no wonder few companies attempt it. Even after you get the budget, hire the team, and buy and integrate the tools, you are only at the starting line. And you need a strong ROI case to get this class of project budget approved. The article also does not address the additional challenge of analyzing continuous streams of information for new intelligence, rather than a static data set for historical insights. The cost and skill required keep this kind of project out of reach for all but a few companies.

However, solving the problem is very valuable. In the end, you have created a surveillance platform to “listen” to your market and “know” before anyone else. Some of the obvious beneficiaries of this solution are hedge funds, portfolio managers, commodities houses, insurance companies, and the government. These organizations all take actions based on knowing that certain external events are happening. Many companies have formed competitive intelligence teams and business development teams chartered with “knowing” what is happening around the company, though their toolkit is often manual dredging using search and information services like LexisNexis and Hoover’s. Marketing and sales also receive great value from understanding changes in markets, messages, feedback and customers.

Why is Bitvore such a giant leap forward? Bitvore essentially collapses the twisted maze of technologies and techniques into one integrated system. The result is a time-to-value for projects that is measured in days instead of months. This opens the door to attacking use cases that could never show a positive ROI with a custom-built in-house system. Bitvore continuously ingests, stores and analyzes many parallel sources and streams of unstructured and semi-structured information in the cloud. No data lake needs to be established. No new technologies need to be integrated. Even the core natural language processing techniques are built into the system.

An application of Bitvore at work on Wall Street today helps explain how the system operates. This application indexes content from 13,000 sources of news, social media and email. To gather the information, the system is “aimed” in two ways: with explicit URLs, hashtags and email addresses, and with guided crawling that uses keyword logic to scan search engines and social media for relevant information. Found data is automatically indexed with time stamps for both the indexing time and the publication time. In parallel with the indexing engine, analysis processes run against all the data, old and new. In essence, the indexing system hoovers up great slices of the Internet based on the needs of the application, and the analysis system inspects and enhances the information. New sources can be added and adjustments made on the fly without interrupting the continuous processing.

In this application, the system has several missions for the analysis. First, it watches for several hundred specific material situations that our customers want to know about right away. These material situations happen across the US dozens of times daily. When they occur, the customer has an information advantage over everyone who is unaware and can make better decisions. In addition, the information needs to be sorted by specific geographies down to the town, county and state. The industry sector also has to be identified. Much as with SIC codes, customers need to know if the information is about hospitals, transportation, corporations, schools, or dozens of other sectors. Additional classification determines if the information is about elections, crimes, lawsuits, regulations, taxation, opinion pieces or other topics. Last, the suppression system keeps the irrelevant and redundant from reaching the user. Suppression is critical to removing false positives and the “echo chamber” repetition that comes from news syndication and retweeting.
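The suppression of “echo chamber” repetition can be illustrated with a small sketch. This is not Bitvore’s implementation, which is not described here; it is one simple, assumed approach in which each item is normalized (retweet prefix and links dropped, punctuation collapsed) and then hashed, so that syndicated copies of the same text collide. The names `fingerprint` and `Suppressor` are hypothetical.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a normalized form of the text so near-identical copies collide."""
    text = text.lower()
    text = re.sub(r"^rt\s+@\w+:\s*", "", text)       # drop a leading retweet prefix
    text = re.sub(r"https?://\S+", "", text)         # drop links, which often differ per copy
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()  # collapse punctuation and whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class Suppressor:
    """Hypothetical gate that lets each distinct item through only once."""

    def __init__(self) -> None:
        self._seen = set()

    def allow(self, text: str) -> bool:
        fp = fingerprint(text)
        if fp in self._seen:
            return False  # redundant copy: suppress it
        self._seen.add(fp)
        return True


gate = Suppressor()
first = gate.allow("County hospital files for bankruptcy http://a.example/1")
echo = gate.allow("RT @newsbot: County hospital files for bankruptcy http://b.example/2")
```

Exact-match hashing only catches copies that normalize to the same string. A production system would likely need fuzzier matching for rewritten headlines, which is beyond this sketch.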

The analysis techniques come from an extensible palette of tools built into Bitvore. The tools range from relatively simple keyword logic (similar to Google’s “advanced” search) and regular expression analyzers to built-in trained Naive Bayesian classifiers and Vector Space analysis. One tool allows new algorithms to be added at any time using scripting in a host of popular languages including JavaScript, Python, PHP, R, and even Lisp. The job of the analyzers is to examine the source information and add metadata that enhances the original indexed information. Like the indexing system, analyzers can be added or modified at any time without stopping the continuous process. Even the script-driven analyzers can be added on the fly. The ability to adapt the system on the fly is paramount. Evolutionary discovery, testing and adaptation are a key workflow in operating a system that monitors changing live data streams.
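To make the idea of analyzers that add metadata concrete, here is a minimal sketch of the two simplest kinds mentioned above: keyword logic and regular expressions. Everything in it is an assumption for illustration, including the names `keyword_analyzer`, `regex_analyzer` and `Pipeline` and the `key:value` tag format. Bitvore’s real analyzer interface is not shown in this article.

```python
import re
from typing import Callable, Dict, Iterable, List, Set

# An analyzer reads the text of a document and returns the tags it wants to add.
Analyzer = Callable[[str], Set[str]]


def keyword_analyzer(tag: str, all_of: Iterable[str] = (),
                     any_of: Iterable[str] = (),
                     none_of: Iterable[str] = ()) -> Analyzer:
    """Keyword logic: every all_of word, at least one any_of word, no none_of word."""
    all_of, any_of, none_of = set(all_of), set(any_of), set(none_of)

    def analyze(text: str) -> Set[str]:
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        ok = (all_of <= words
              and (not any_of or bool(any_of & words))
              and not (none_of & words))
        return {tag} if ok else set()

    return analyze


def regex_analyzer(tag: str, pattern: str) -> Analyzer:
    """Tag the document when the regular expression matches anywhere in it."""
    compiled = re.compile(pattern)
    return lambda text: {tag} if compiled.search(text) else set()


class Pipeline:
    """Hypothetical registry; analyzers can be appended between runs."""

    def __init__(self) -> None:
        self.analyzers: List[Analyzer] = []

    def add(self, analyzer: Analyzer) -> None:
        self.analyzers.append(analyzer)

    def run(self, doc: Dict) -> None:
        for analyzer in self.analyzers:
            doc["tags"] |= analyzer(doc["text"])


pipeline = Pipeline()
pipeline.add(keyword_analyzer("topic:lawsuit", any_of=("lawsuit", "sued", "plaintiff")))
pipeline.add(keyword_analyzer("sector:hospital", any_of=("hospital",), none_of=("veterinary",)))
doc = {"text": "A patient sued Mercy Hospital in Pennsylvania on Tuesday.", "tags": set()}
pipeline.run(doc)

# A new analyzer is added later and the same document is simply re-analyzed.
pipeline.add(regex_analyzer("state:PA", r"\b(Pennsylvania|PA)\b"))
pipeline.run(doc)
```

Because tags accumulate in a set, re-running the pipeline after adding an analyzer enriches existing documents without disturbing tags already present, which mirrors the add-on-the-fly workflow described above.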

The metadata tags in Bitvore provide structure to the unstructured. Because of their flexible nature, metadata tags provide a fluid, malleable structure that can be expanded or reorganized as needed. The metadata yields the benefits of a structured system because the tags can be queried and return instant results. A query language is included so that any slice of the information can be extracted to create data sets, search results and alerts, and to trigger actions in other applications. For example, a simple query in the application above can set alerts on “any new lawsuit filed against a hospital in Pennsylvania” or “any regulatory change impacting the energy companies in my investment portfolio”.
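As a rough illustration of why tags make unstructured content queryable, the first alert above can be reduced to a set of required tags. This sketch does not use Bitvore’s query language, which is not shown in this article; the `key:value` tag format, the `run_query` function and the sample records are all hypothetical.

```python
from typing import Dict, List, Set


def run_query(docs: List[Dict], required: Set[str]) -> List[Dict]:
    """Return the documents that carry every required tag."""
    return [d for d in docs if required <= d["tags"]]


# Hypothetical indexed items, already enriched with metadata tags.
docs = [
    {"id": 1, "tags": {"topic:lawsuit", "sector:hospital", "state:PA"}},
    {"id": 2, "tags": {"topic:lawsuit", "sector:school", "state:PA"}},
    {"id": 3, "tags": {"topic:regulation", "sector:energy", "state:TX"}},
]

# "Any new lawsuit filed against a hospital in Pennsylvania"
alert = {"topic:lawsuit", "sector:hospital", "state:PA"}
hits = run_query(docs, alert)
```

The query never touches the original text. It works only on the tags the analyzers attached, which is what lets a flexible tag set behave like a structured index.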

While Wall Street’s interest in Bitvore is obvious, executives across industries are excited by the ability to immediately “know” about all external events that will impact them, whether those events stem from actions at their suppliers, customers, competitors or regulators. By harvesting and analyzing the Internet at scale, Bitvore will bring to an end the ad hoc searching and word-of-mouth information sharing that is the de facto solution for business intelligence gathering today.
