Far from clean, structured tables, the challenges posed by semi-structured and unstructured data are attracting more attention than ever.
Behind this generic name lie texts, emails, web pages, social network posts, and multimedia files such as audio, images and video. It is, above all, irregular data whose information cannot simply be slotted into boxes in a systematic way. The term unstructured does not mean an absence of structure, but rather that the data has complex, non-standard structures from which information cannot be obtained with the simple queries we are used to.
Data and the business
Unstructured data has always been present in the business environment, but it has long remained in the shadow of structured data, which is easier to analyze and process.
Recently, however, there has been renewed interest, for several reasons:
- On the one hand, data has multiplied through automatic collection systems and user-generated content.
- On the other hand, numerous technological and algorithmic advances make it possible to build information systems that can process this data efficiently and in reasonable time.
The analysis of structured data is a well-known, well-mastered activity: comparing values, applying statistical tools and making predictions on numerical data are everyday operations. The analysis of unstructured data is far more difficult, since by its nature it is not possible to assign meaning to it systematically. The whole challenge therefore consists in teaching a machine to extract information from what is, to it, only a long string of 0s and 1s that a priori follows no particular rule.
Data exploitation
To simplify its management and use, the first step is to index the content, that is, to associate it with descriptive information within a computer system so that it can later be retrieved through search forms.
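To make the idea concrete, here is a minimal sketch of such an index: a toy inverted index in Python, with an invented two-document corpus standing in for real content.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical corpus: two short documents.
docs = {1: "quarterly sales report", 2: "sales meeting notes"}
index = build_index(docs)
print(index["sales"])  # {1, 2}: both documents mention "sales"
```

A search form then only needs to look up the query terms in the index, rather than scanning every document.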
Many companies still stop at this stage, which, although necessary, is far from sufficient. The machine plays only a limited storage role here, whereas the possibilities go far beyond merely speeding up a task that a human operator could carry out. The full computing power of IT should instead be harnessed to extract the information contained in the data and its associated metadata, allowing knowledge to emerge that would otherwise be impossible to access.
At this point, it is useful to distinguish the processing of text and language data from that of other types such as images, videos and audio tracks.
Data can in fact be processed along two axes: through its intrinsic content, and through its context and associated metadata. Textual data is simpler to process because its intrinsic content can be manipulated more easily.
The difficulty lies in identifying basic elements common to different pieces of data. It is easier to find a verb conjugated in different ways across a corpus of documents than to identify every possible cat in a collection of images, accounting for changes in size and angle, different postures and possible occlusions. Impressive progress has been made in this direction, but much of it remains in the research domain and requires heavy infrastructure, along with advanced machine learning and signal processing skills. This is where analyzing the metadata and context of the data plays an interesting, and often sufficient, role.
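The textual side of that comparison can be sketched in a few lines. Assuming the NLTK library is installed, a stemmer reduces conjugated forms to a common stem, which is precisely the kind of normalization that has no simple equivalent for images:

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Different conjugated forms collapse to the same stem,
# so one query can match all of them across a corpus.
for word in ["walks", "walked", "walking"]:
    print(word, "->", stemmer.stem(word))  # all yield "walk"
```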
Intrinsic information
Text remains the material of choice for the unstructured data processing solutions on the market. It lends itself easily to many manipulations and algorithms, and can be approached from both syntactic and semantic angles.
Various solutions exist to isolate the structured information present in text, such as names, dates or places. A more global approach is also possible, with semantic analysis models that can identify the most relevant documents during a search, group them by topic, tag them with keywords automatically, and detect the subjects covered as well as the dominant tone and sentiment.
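As a hedged sketch of the first kind of extraction, an off-the-shelf library such as spaCy can pull names, dates and places out of raw text (this assumes the small English model has been downloaded; the sample sentence is invented):

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Marie Curie moved to Paris in 1891 to study at the Sorbonne.")

# Each detected entity carries a label such as PERSON, GPE (place) or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```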
Non-textual data is not left out either: it is now possible to detect the presence of people or certain objects in images, or to distinguish a recorded speech from the cry of a whale, with varying success depending on the nature and complexity of the corpora considered and the objectives set.
Cameras have learned over time to recognize human faces, and automatic speech recognition systems have become increasingly reliable.
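As a hedged illustration of object recognition, a pretrained image classifier can be applied in a few lines with PyTorch and torchvision; the file name is hypothetical, and a real deployment would of course need models tuned to the corpus at hand:

```python
import torch
from torchvision import models
from PIL import Image

# Hypothetical input file; any photo would do.
img = Image.open("photo.jpg")

# A pretrained ImageNet classifier; weights download on first use.
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights).eval()

# Apply the preprocessing the weights were trained with.
batch = weights.transforms()(img).unsqueeze(0)

with torch.no_grad():
    class_id = model(batch).argmax().item()

# Map the predicted class index back to a human-readable label.
print(weights.meta["categories"][class_id])
```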
Metadata and context
For best results, additional information can be attached to the content through titles and subtitles, categories and keywords. Most of this metadata is entered by users themselves, but the task can quickly become very time-consuming and error-prone, with fields left empty or badly filled in, duplicates caused by missing or ignored standards, and so on. Standards can be enforced through repositories defined upstream, and the technology now exists to generate much of this metadata automatically from the content, by extracting the structured information described above. Tools can thus identify named entities and generate keywords or summaries.
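As an illustration of automatic keyword generation, one common approach is TF-IDF scoring, which surfaces the terms that characterize a document relative to the rest of the corpus; this sketch uses scikit-learn on an invented mini-corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; in practice these would be full documents.
corpus = [
    "invoice payment due supplier contract",
    "holiday request approval manager",
    "invoice overdue payment reminder supplier",
]

vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

# The top-scoring terms of the first document serve as candidate keywords.
row = scores[0].toarray().ravel()
top = row.argsort()[::-1][:3]
print([terms[i] for i in top])
```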
Context can also come from the relationships between documents. This is how Google defined the authority of the web pages it indexed: not from their content, but from their position within the network, a page gaining authority when it receives links from other authoritative pages.
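The algorithm behind this idea, PageRank, fits in a few lines. The three-page link graph below is made up, but the damped power iteration is the standard formulation:

```python
import numpy as np

# Toy link graph: page i links to the pages in links[i].
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, targets in links.items():
    for j in targets:
        M[j, i] = 1.0 / len(targets)

d = 0.85                       # damping factor used in the original paper
rank = np.ones(n) / n          # start from a uniform distribution
for _ in range(50):            # power iteration until convergence
    rank = (1 - d) / n + d * M @ rank

print(rank)  # pages linked to by authoritative pages score higher
```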
A strategic advantage
This brief overview of the possibilities offered by the management and analysis of unstructured data is far from exhaustive, but it aims to raise awareness that a great deal of knowledge is out there, yet remains unusable without adequate tools.
It is becoming essential to seize this knowledge, and to put in place the infrastructure and processes that make it possible to extract the crucial information any organization needs to grow. As the technology becomes better understood and more user-friendly, nothing stands in the way of its large-scale adoption.