In a perfect world, all data for an organization is structured – sorted neatly into categories, labels, columns and boxes, synchronized and collected across the organization, and accessed easily.
The reality is that about 80% of business data is unstructured, including information in documents, spreadsheets, emails, presentations, audio and video, web searches, images and social media posts. While some of the information may have been generated internally, it is considered “unstructured” because the intelligence doesn’t fit neatly into a database.
While structured information can define or identify behaviors, unstructured data provides a more complete explanation, description or prediction of a particular behavior or change in demand.
Unstructured data is currently analyzed by extraction. For example, with fingerprint matching, the actual fingerprint image is totally unstructured. To analyze a fingerprint, key points are identified and then mapped. The map, which is structured data, is what is actually matched. Unstructured prints aren’t analyzed; the extracted information is.
Overall, most analysis of unstructured data combines extraction, text analysis and text abstraction with a relational database to create an integrated view of the data, enabling the organization to make smarter business decisions.
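As a rough illustration of that integrated view, the sketch below joins a structured customer table with numeric scores that have already been extracted from posts. The table names, columns and sample rows are assumptions made up for this example, and SQLite stands in for whatever relational database an organization actually uses.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Structured data the organization already holds (hypothetical schema).
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, segment TEXT)")
# Numeric values extracted from unstructured posts (hypothetical schema).
conn.execute("CREATE TABLE post_scores (customer_id INTEGER, score INTEGER)")

conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ana", "loyalty"), (2, "Ben", "new")],
)
conn.executemany(
    "INSERT INTO post_scores VALUES (?, ?)",
    [(1, 2), (1, 1), (2, -1)],
)

# Integrated view: one row per customer with the summed extracted score.
rows = conn.execute(
    """
    SELECT c.name, c.segment, SUM(p.score) AS total_score
    FROM customers AS c
    JOIN post_scores AS p ON p.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
    """
).fetchall()

for name, segment, total_score in rows:
    print(name, segment, total_score)
```

The join is the point here: once the extracted values sit in a table, they can be queried alongside the existing customer records.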
Retailers such as Chico’s FAS have been able to integrate social media communications with their customer data to offer targeted promotions to customers, while healthcare providers such as Seton Healthcare Family have been able to save money on patient readmissions by identifying readmission triggers, according to an Information Management article.
When analyzing unstructured data and integrating the information with its structured counterpart, keep the following in mind:
Do you need a simple number, a trend or something else? Knowing your end goal is essential to determining how to analyze your unstructured data. Given the volume of information potentially involved, not all of it will be useful or yield a win. For example, when companies want to determine social media sentiment, they want to know whether statuses, tweets or comments are negative or positive. If the goal is to find out whether the overall reaction to one of their marketing initiatives is negative, considering only words and hashtags related to that campaign is a more useful focus than analyzing all social media information from an arbitrary time range.
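A minimal sketch of narrowing the data to a single campaign might look like the following. The campaign terms and sample posts are hypothetical, and a simple case-insensitive substring match is assumed to be good enough for illustration.

```python
# Hypothetical campaign terms; real ones would come from the marketing team.
CAMPAIGN_TERMS = {"#summersale", "summer sale"}


def mentions_campaign(post, terms=CAMPAIGN_TERMS):
    """Return True if the post contains any campaign term (case-insensitive)."""
    text = post.lower()
    return any(term in text for term in terms)


posts = [
    "Loving the #SummerSale deals!",
    "My order arrived late again.",
    "The summer sale prices are not that great.",
]

# Keep only the posts that relate to the campaign being evaluated.
campaign_posts = [p for p in posts if mentions_campaign(p)]
print(campaign_posts)
```

Only the filtered posts would then go on to sentiment analysis, which keeps the result focused on the campaign rather than on everything customers said in the period.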
Once you know the end goal, you can decide how to structure the unstructured data to surface the information you need. With social media sentiment, certain words and phrases within posts are assigned good or bad values. A good word or phrase may get a “+1,” a bad one “-1,” and a neutral one “0.” The sentiment score is the sum of the word or phrase scores, which creates structured numeric data from the unstructured source text. Any analysis of these scores, such as checking whether the posts are mostly positive, is therefore performed on the structured numeric summaries rather than on the text itself.
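The scoring scheme described above can be sketched in a few lines. The word lists here are tiny, made-up stand-ins for a real sentiment lexicon, and only single words are scored, not phrases.

```python
import re

# Hypothetical word lists; a real lexicon would be far larger.
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"late", "broken", "terrible"}


def word_score(word):
    """+1 for a good word, -1 for a bad word, 0 for anything else."""
    if word in POSITIVE:
        return 1
    if word in NEGATIVE:
        return -1
    return 0


def sentiment_score(post):
    """Sum the per-word scores of a post."""
    words = re.findall(r"[a-z']+", post.lower())
    return sum(word_score(w) for w in words)


posts = [
    "I love this store, great service",
    "Package arrived late and broken",
    "Ordered a jacket yesterday",
]

# The structured numeric data derived from the unstructured text.
scores = [sentiment_score(p) for p in posts]

# Analysis runs on the numbers, not on the text itself.
share_positive = sum(s > 0 for s in scores) / len(scores)
print(scores, share_positive)
```

With these sample posts the scores are 2, -2 and 0, so one post in three counts as positive.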
Use sources of data that are entirely relevant (nothing tangentially related to the topic), including information from online reviews and customer feedback forms, as well as information from devices.
Choose data storage and information retrieval architecture based on scalability, volume, variety and philosophy. Some big data tools are designed to manage and analyze unstructured data, such as those based on Hadoop, a software platform that can store huge files and process the information.
Real-time access requires tracking activities as they happen and using predictive analytics to make predictions relevant to the business. For example, in e-commerce, real-time access can allow companies to provide quotes in real time. In a 24/7 business climate, real-time data collection is essential, and whatever technology platform an organization uses needs to make sure no data is lost.
Data warehouses store data with concrete structures and categories, which is useful when all the information is structured. However, repositories known as data lakes are easier to use for unstructured data because you can access data in its native format, preserving the metadata and anything else that may assist in analysis.
Create a copy of the original file and clean up the data; for example, expand any text that is informal or written in shorthand or symbols. Organizing data ensures that all valuable information is represented.
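As one possible illustration of this cleanup step, the sketch below expands shorthand in a copy of the records and leaves the originals untouched. The shorthand dictionary is a hypothetical example, and the whitespace-based matching is deliberately simple: a token with punctuation attached, such as "thx!", would not be expanded.

```python
# Hypothetical shorthand dictionary; extend it to suit the data at hand.
SHORTHAND = {
    "u": "you",
    "thx": "thanks",
    "w/": "with",
    "&": "and",
    "asap": "as soon as possible",
}


def clean_text(raw):
    """Expand known shorthand tokens; leave every other token as it is."""
    tokens = raw.split()
    expanded = [SHORTHAND.get(tok.lower(), tok) for tok in tokens]
    return " ".join(expanded)


original_records = [
    "thx for the help w/ my order & refund",
    "pls call me asap",
]

# Work on a cleaned copy so the original records stay available.
cleaned_records = [clean_text(r) for r in original_records]
print(cleaned_records)
```

Note that "pls" is not in the dictionary and so passes through unchanged, which is the kind of gap a review of the cleaned copy against the original would reveal.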
After data is identified and cleaned up, prioritize parts of it depending on what you are looking for. One method uses part-of-speech tagging to extract entities, such as “person,” “location,” or “organization.” Supervised and unsupervised machine learning methods, such as logistic regression, Naive Bayes and k-means, can also be used to find patterns in customer behavior, select targets for a campaign and classify documents.
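To make document classification concrete, here is a small Naive Bayes classifier written from scratch with add-one smoothing. It is a teaching sketch under simple assumptions (whitespace tokenization, a handful of invented training messages and labels), not a substitute for a tested machine learning library.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in label_counts.items():
        logp = math.log(n_docs / total_docs)  # prior for this label
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            logp += math.log(count / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Invented training messages and labels, for illustration only.
training = [
    ("refund my order please", "billing"),
    ("charged twice on my card", "billing"),
    ("app crashes on login", "technical"),
    ("cannot login to the app", "technical"),
]

model = train(training)
print(classify("the app crashes", model))
print(classify("charged my card twice", model))
```

With this training set the first message is classified as "technical" and the second as "billing".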
Temporal modeling techniques can be used to identify the most relevant topics and events that customers discuss via forms, social media or other platforms. Customers’ disposition may also be gauged with sentiment analysis of social media, reviews and feedback. These results can inform future product recommendations and reveal overall trends.
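A very simple way to see topics shift over time is to count keyword matches per period, as sketched below. This keyword counting is a stand-in for a proper temporal topic model, and the topic keywords, months and feedback lines are all invented for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical topics and the keywords assumed to signal them.
TOPICS = {
    "shipping": {"shipping", "delivery", "late"},
    "pricing": {"price", "expensive", "discount"},
}

# Invented (month, feedback text) pairs.
feedback = [
    ("2024-01", "delivery was late"),
    ("2024-01", "shipping took forever"),
    ("2024-01", "too expensive"),
    ("2024-02", "price went up, too expensive"),
    ("2024-02", "no discount offered"),
    ("2024-02", "late delivery"),
]


def topics_in(text):
    """Return the set of topics whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {topic for topic, keywords in TOPICS.items() if words & keywords}


# Count each topic at most once per feedback item, grouped by month.
counts = defaultdict(Counter)
for month, text in feedback:
    for topic in topics_in(text):
        counts[month][topic] += 1

# Most-discussed topic in each month.
top_by_month = {m: c.most_common(1)[0][0] for m, c in sorted(counts.items())}
print(top_by_month)
```

In this sample, shipping dominates the first month and pricing the second, which is the kind of shift that could feed into recommendations or trend reports.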
Using graphs and charts to visualize your analysis helps other parties use the information to make recommendations based on the data.