Lesson 2 - Unstructured data

Companies draw on many sources of data, both structured (for example, tables of invoices) and unstructured (for example, customer emails or reports). These sources include transactional data, social media discussions, product reviews, email, claims, internal reports, the intranet, forums, archives and public data.

Because this data is vast in many companies and comes from various sources, new approaches such as Big Data have appeared on the market to cope with its diversity and to provide analysis solutions. Algorithms for analysis have also been actively developed in recent years, notably machine learning: in artificial intelligence (a subject within computer science), the discipline concerned with the implementation of computer software that can learn autonomously. [http://www.britannica.com/technology/machine-learning]

Expert systems and data mining programs are the most common applications for improving algorithms through the use of machine learning. Among the most common approaches are the use of artificial neural networks (weighted decision paths) and genetic algorithms (symbols “bred” and culled by algorithms to produce successively fitter programs). [http://www.britannica.com/technology/machine-learning]

Many software packages are available for performing analyses and developing algorithms. Selected open source products are presented in the table below.

Image: Open source data mining tools comparison

Source: http://www.infoivy.com/2014/06/not-all-data-mining-packages-are.html

There is also a market for paid, closed-licence solutions such as those from ORACLE, SAS, IBM SPSS and STATISTICA, as well as software libraries for Java, Python and other programming languages.

Data analysis process

Data analysis should start by stating a research question (hypotheses) to be validated during the research. Research hypotheses can be stated at various stages of the innovation process. In many cases the main interest will be the development of existing or new products, but it could also be strategies, inputs or outputs.

In the next step, measurement indicators should be set up so that the results of the analysis can be assessed. Benchmarking methods, or KPIs set on the tasks supported by the research, can be used for this. The actual analysis process can then be set up, as shown in the next image, and implemented in software.

Image: Text mining process

[http://www.mu-sigma.com/analytics/thought_leadership/cafe-cerebral-text-... ]

For the analysis, relevant data should be collected, such as patents, articles, news feeds, internal and external reports and customer-based data. Data collection can be the most time- and resource-consuming task. Tools for scraping data from internet sources can help here. Such solutions are integrated into some data analysis packages (such as Rapidminer) and can often be used to capture data from well-structured websites. In other cases it is necessary to write special pieces of code to populate a database with relevant information, or to use external software such as IFTTT, which allows various online tasks to be automated (or other paid tools).

Image: IFTTT: service screenshot with premade task for social media.

Source: ifttt.com screenshot
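
As an illustration of the "special pieces of code" mentioned above, the following minimal sketch collects links from a well-structured page and stores them in a database. It uses only the Python standard library. The class `LinkCollector`, the function `store_links`, the table name `documents` and the sample page are all hypothetical names chosen for this example, and a real project would also have to fetch pages and respect each site's terms of use.

```python
import sqlite3
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, link text) pairs from the <a> tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (url, text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def store_links(conn, html):
    """Parse `html`, insert every link into `documents`, return links found."""
    parser = LinkCollector()
    parser.feed(html)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (url TEXT PRIMARY KEY, title TEXT)"
    )
    # INSERT OR IGNORE skips URLs that are already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO documents (url, title) VALUES (?, ?)", parser.links
    )
    conn.commit()
    return len(parser.links)


# Hypothetical sample page; a real one could be fetched with urllib.request.urlopen.
SAMPLE_PAGE = """
<ul>
  <li><a href="https://example.org/report-1">Annual report</a></li>
  <li><a href="https://example.org/news-7">Product <b>news</b></a></li>
</ul>
"""

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
store_links(conn, SAMPLE_PAGE)
rows = conn.execute("SELECT url, title FROM documents ORDER BY url").fetchall()
```

Using the URL as the primary key is one possible design choice: it lets the same page be processed repeatedly without creating duplicate records.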

The next step is to clean the data of unnecessary information. With scraped website data, it may be necessary to remove HTML code or unnecessary information such as the website name, dates, etc. In text analysis this task can be even more important and time-consuming, because frequently repeated keywords should be removed, and spotting them requires at least a partial analysis to be performed first.

A sample analysis process that includes data pre-processing is shown in the next image.

Image: Sample implementation of process in Rapidminer

Source: Rapidminer screenshot
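
The cleaning step described above can also be sketched in code. The example below is a minimal illustration, not the process from the screenshot: it strips HTML tags, splits the text into lowercase words, and then removes words that appear in most documents (such as a website name). The function names, the 80% threshold `min_share` and the sample documents are assumptions made for this sketch.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep only the text content of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def tokenize(text):
    """Split text into lowercase words made of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


def frequent_terms(token_lists, min_share=0.8):
    """Terms that occur in at least `min_share` of the documents."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    threshold = min_share * len(token_lists)
    return {term for term, n in doc_freq.items() if n >= threshold}


def clean_corpus(html_docs, min_share=0.8):
    """Strip HTML, tokenize, and drop terms repeated across most documents."""
    token_lists = [tokenize(strip_html(doc)) for doc in html_docs]
    noise = frequent_terms(token_lists, min_share)
    cleaned = [[t for t in tokens if t not in noise] for tokens in token_lists]
    return cleaned, noise


# Hypothetical scraped pages that all repeat the same site header.
RAW_DOCS = [
    "<h1>ACME News</h1><p>New battery design announced</p>",
    "<h1>ACME News</h1><p>Customers praise battery life</p>",
    "<h1>ACME News</h1><p>Patent filed for solar charger</p>",
]

cleaned, noise = clean_corpus(RAW_DOCS)
```

Counting how many documents contain each term is the "partial analysis" needed to spot repeated keywords. The right threshold depends on the data, so the value used here should be treated as a starting point only.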

The results can then be presented in various forms, such as tables, charts or graphs, as in the next image.

Image: Results of the text-mining analysis presented as an ISOM graph

Source: own research
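
Besides graphs, the simplest form of presentation mentioned above is a table. The following short sketch prints the most frequent terms as a plain-text table. The function `frequency_table` and the sample tokens are hypothetical and only stand in for the output of a real analysis.

```python
from collections import Counter


def frequency_table(tokens, top_n=5):
    """Return a plain-text table of the `top_n` most frequent terms."""
    counts = Counter(tokens).most_common(top_n)
    # Column width fits the longest term, or the header if that is longer.
    width = max([len("term")] + [len(term) for term, _ in counts])
    lines = [f"{'term':<{width}}  count", f"{'-' * width}  -----"]
    for term, n in counts:
        lines.append(f"{term:<{width}}  {n:>5}")
    return "\n".join(lines)


# Hypothetical tokens standing in for the output of an earlier analysis step.
SAMPLE_TOKENS = ["battery", "solar", "battery", "patent", "battery", "solar"]

table = frequency_table(SAMPLE_TOKENS, top_n=3)
print(table)
```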