Lesson 5 - ANALYSIS OF UNSTRUCTURED DATA

How to analyze them?


These data cannot be analyzed with traditional tools. Firstly, there is a lot of data and it changes dynamically. Secondly, qualitative techniques, including automatic text analysis methods, are still underdeveloped: the collected information has to be pre-structured before text analysis is possible. Thirdly, there is currently a significant shortage of the analytical and managerial talent needed to make proper use of integrated structured and unstructured data[9]. Therefore, more and more IT professionals and business users closely monitor solutions that could cope with this mass of knowledge and data.

An important change in the analytical approach used to date is the introduction of intelligence and automation into the digitisation[10] of unstructured data. This means that documents are classified according to their content, i.e. information is captured in context, then validated and passed on smoothly, e.g. to core business applications and document flow processes. Smart digitisation with a high level of accuracy can instantly turn unstructured data into structured data[11]. As such, the data become part of a data model that can be indexed and integrated. Only with this approach does it become easier to derive added value from the content, and possible to search, preview and analyze it within particular organizational units.
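
The idea of classifying a document by its content and capturing information in context can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, keyword sets, field patterns and the sample text are hypothetical, and real digitisation products use far more robust methods.

```python
import re

# Hypothetical document classes and the keywords that signal them.
CLASS_KEYWORDS = {
    "invoice": {"invoice", "amount", "due"},
    "complaint": {"complaint", "refund", "damaged"},
}

def classify(text):
    """Return the class whose keywords occur most often, or 'unknown'."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        label: sum(1 for w in words if w in keywords)
        for label, keywords in CLASS_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def to_record(text):
    """Capture a few fields from the text and return a structured record."""
    date = re.search(r"\b\d{4}-\d{2}-\d{2}\b", text)   # assumed ISO date format
    amount = re.search(r"\b\d+\.\d{2}\b", text)        # assumed two-decimal amount
    return {
        "doc_class": classify(text),
        "date": date.group(0) if date else None,
        "amount": float(amount.group(0)) if amount else None,
    }

# Hypothetical sample document.
record = to_record("Invoice 2024-03-01: amount due 149.90 EUR")
print(record)
```

The resulting dictionary has fixed fields, so it could be indexed or loaded into a business application like any other structured row.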

The market offers a variety of solutions, and their basic standards have stabilized, which allows for reasonably reliable planning both of the technology and of the Total Cost of Ownership (TCO). Services such as outsourcing, e.g. of document management (ArchiDoc[12]), are also increasingly available. When establishing systems for the digitisation of unstructured data, attention is paid to roles, procedures and the assignment of information.

  • Information: Catalogue of business-relevant data used by the organization. Allows for the introduction of the same description language for all.
  • Processes: Uniform information flow processes. They govern the handling of data.
  • Organization: Defines the role played by individual organizational units in the processes and what information is necessary for them.

Next, the approach to unstructured data has to be determined in terms of:

  1. Search for information
  2. The use of information.

From the point of view of data analysis, the following are important:

  1. Data collection as part of an established data management system, by converting the recorded data and granting specific access to it.
  2. Preliminary analysis, which consists in discovering the necessary information by analyzing the context, text mining, extracting concepts and so on.
  3. Organizing the data by determining the categorization, ontology[13], taxonomy[14], abstracts or brief descriptions of the content, and the principles of deduplication. The organization of work with the data is very important at this stage, because it is what allows for further use of and quick access to information.
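
As one illustration of a deduplication principle, documents can be compared by a fingerprint computed after simple normalization. The sketch below assumes that two documents count as duplicates when they differ only in letter case and whitespace; this rule and the sample documents are hypothetical, and an organization may well define duplicates differently.

```python
import hashlib
import re

def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(documents):
    """Keep the first occurrence of each distinct fingerprint."""
    seen = set()
    unique = []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Hypothetical sample documents; the second differs only in case and spacing.
docs = ["Annual report 2023", "annual   report 2023 ", "Meeting notes"]
unique_docs = deduplicate(docs)
print(unique_docs)
```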

Therefore, further steps to organize unstructured data include:

  • analysis and visualization,
  • statistics of used terms, identification of similar documents,
  • graphical presentation of the relationships between the data already structured,
  • automatic classification and taxonomy, which involve organizing topics in hierarchies and grouping them into default or pre-defined classes,
  • predictive modelling, using only text, e.g. predicting a customer's attitude based on their comments, or text and other data, e.g. predicting future purchases based on opinions and demographic information.
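
Two of the steps listed above, term statistics and the identification of similar documents, can be sketched with the Python standard library. This is a minimal illustration, assuming raw term counts and cosine similarity as the similarity measure; the three sample comments are hypothetical, and production tools typically add weighting schemes and the pre-processing described in this lesson.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical customer comments.
docs = {
    "a": "delivery was late and the package was damaged",
    "b": "the package arrived damaged and late",
    "c": "great price and friendly support",
}
vectors = {name: term_counts(text) for name, text in docs.items()}
sim_ab = cosine_similarity(vectors["a"], vectors["b"])
sim_ac = cosine_similarity(vectors["a"], vectors["c"])
print(vectors["a"].most_common(3))
print(sim_ab > sim_ac)
```

Comments "a" and "b" share most of their terms, so their similarity is higher than that of "a" and "c", which share only one word.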

Pre-processing of the information contained in the text involves identifying the text units: paragraphs, expressions, words, phrases, etc. It is important to introduce a so-called stop list, that is, to exclude words and phrases that occur often but are useless in the analysis because they do not convey any meaning. For the sake of accuracy, it is also necessary to bring words to their basic grammatical form, standardize them, and account for synonyms by analyzing the text collection. The data obtained in this way are the basis for further analyses such as data mining connected with finding information, searching for patterns, clustering documents, and automatically generating abstracts or keywords.
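
The pre-processing steps described above can be sketched as a small pipeline: split the text into words, drop stop-list words, reduce words to a base form and map synonyms to one term. The stop list, the base-form lookup and the synonym table below are tiny hypothetical examples; in practice these resources are much larger and are usually supplied by a linguistic library or built from the text collection.

```python
import re

# Hypothetical stop list: frequent words that carry no meaning for the analysis.
STOP_LIST = {"the", "a", "an", "is", "are", "was", "and", "of", "to"}

# Hypothetical lookup from inflected forms to basic grammatical forms.
BASE_FORMS = {"deliveries": "delivery", "delayed": "delay", "delays": "delay"}

# Hypothetical synonym table mapping variants to one preferred term.
SYNONYMS = {"parcel": "package", "shipment": "delivery"}

def preprocess(text):
    """Tokenize, remove stop words, then normalize forms and synonyms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    result = []
    for token in tokens:
        if token in STOP_LIST:
            continue
        token = BASE_FORMS.get(token, token)
        token = SYNONYMS.get(token, token)
        result.append(token)
    return result

tokens = preprocess("The parcel was delayed and the deliveries are late")
print(tokens)
```

The normalized token list is the kind of input that later steps, such as clustering or keyword generation, would work on.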

Practices that help in the analysis of unstructured data include, for example, building a subject information model, a vocabulary or corporate taxonomy, a classification of the content used in the company, and a coherent system for tagging, storing and retrieving content. These are neither complicated nor time-consuming to implement. Rather, it is a process of organizing that should be started and then continued as the information environment changes.

It is recommended that an RCFA (Root Cause Failure Analysis) be conducted before the information seeking process is introduced. Because this method focuses on identifying the sources of problems or incidents, it makes it possible to adapt the nomenclature of groups (classes) to the organizational culture, namely to introduce appropriate ontologies and taxonomies that facilitate data retrieval. In this case, the root cause is identified as the inability to reach the required information on time. Identifying and removing root causes, with a focus on correction, prevents such problems from occurring in the future. Here RCFA is used as an interactive process and a tool for continuous improvement of the automatic digitisation of the collected data.

Automation is based on the tools of artificial intelligence[15], which deals with creating models of intelligent behaviour and programs that simulate this behaviour.

The use of information aims to feed analytical and reporting systems with relevant data, set alerts or early warning signals, perform predictive analytics, use, index and search the data, and conduct operational reporting.

As in any field, in the case of combined structured and unstructured data, analyses start with a clear definition of the purpose and the data sources used.

Table 1. Examples of application areas, objectives and data sources specified for them

missing table

In the next step, this definition is extended with the specific areas of analysis that are worth carrying out.

Table 2. Examples of areas of analyses conducted for the area of customer relationship management depending on the purpose.

missing table

Examples of advanced IT tools for exploration of unstructured data:

  1. Teradata Aster Discovery Platform[17] is a big data[18] analytic platform that can acquire and analyze data of any format from a variety of sources. These can be structured or unstructured data: plain text, billing data, data from the Internet, or any multi-structured data. The platform processes the data and, using analytical tools, finds the most valuable items. It also allows business hypotheses to be tested quickly and the results to be presented in a user-friendly visualization environment. The platform is available either as a cloud service or in the traditional model of deployment at the client's site.
  2. Unified Data Architecture[19] is one of the best and most complete solutions available on the market for advanced business analytics. Interconnecting the databases of the Teradata Aster Discovery Platform with the open-source Hadoop platform yields a unified, high-performance analytical environment for the enterprise. Organizations can ask questions covering a broad spectrum of analyses carried out on any type of data, at any time, and discover new and valuable dependencies, which results in higher productivity, lower costs and new business opportunities.
  3. Automatic Business Modeler (IBM)[20] maps business issues onto machine learning algorithms and their parameters and settings, runs optimization algorithms, and then presents the results in the form of business solutions. ABM also allows full automation of the necessary but time-consuming tasks associated with building predictive models, such as selecting variables for analysis, transforming variables or choosing the best model. It is available online in the SaaS (Software-as-a-Service) model.
  4. Hadoop[21] is a popular open-source platform that allows you to collect and analyze large data sets from sources such as social networks, website visit histories, server logs, transaction systems, video file collections, or sensor data from connected devices.