WEKA

What Is Structured and Unstructured Data?

There are multiple ways to arrange electronic data in computer systems.

One approach is to use a filesystem and store the data as files. The files can be arranged in any directory or folder structure that is convenient for the user, and they can be of various types, for example CSV, Parquet, binary, text, HDF5 and more (see the separate “What is” article on filesystems). For example, to save a list of names and phone numbers, we can create a directory called “contacts” (a directory is also known as a folder in Windows) and, within it, a file called “phone numbers”. This could be a text file holding multiple names and phone numbers, for example “Alexander Hamilton 4083350085 George Washington 202-456-1414”. It could be a CSV file that holds the same details in a different representation, “Alexander, Hamilton, 4083350085, George, Washington, 202-456-1414”. It could be a Parquet file, which groups values by column rather than by row and would conceptually look like “Alexander, George, Hamilton, Washington, 4083350085, 202-456-1414”. Or it could be a binary file that is not human readable and holds the data as something like “8c1e1e126c10232c8af069894a0ab9e9”.
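As a rough illustration of the paragraph above, the following Python sketch writes the same two contacts into a “contacts” directory as a text file and as a CSV file. The file names `phone_numbers.txt` and `phone_numbers.csv` and the helper `write_contact_files` are assumptions made for this example only.

```python
import csv
import tempfile
from pathlib import Path

# The two contacts from the example above.
CONTACTS = [
    ("Alexander", "Hamilton", "4083350085"),
    ("George", "Washington", "202-456-1414"),
]


def write_contact_files(base_dir: Path) -> Path:
    """Create a "contacts" directory holding the same data in two file types.

    Hypothetical helper; the file names are illustrative assumptions.
    """
    contacts_dir = base_dir / "contacts"
    contacts_dir.mkdir(parents=True, exist_ok=True)

    # Plain text: one free-form line per person.
    text_lines = [f"{first} {last} {phone}" for first, last, phone in CONTACTS]
    text_path = contacts_dir / "phone_numbers.txt"
    text_path.write_text("\n".join(text_lines) + "\n", encoding="utf-8")

    # CSV: the same details, separated by commas.
    csv_path = contacts_dir / "phone_numbers.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(CONTACTS)

    return contacts_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        directory = write_contact_files(Path(tmp))
        for path in sorted(directory.iterdir()):
            print(path.name)
            print(path.read_text(encoding="utf-8"))
```

Both files carry the same information, but nothing in the filesystem itself records how each one is laid out; that knowledge lives in whichever program reads the file.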

To understand the content of the files, a different software program is required for each file type. This approach is called unstructured data, since the data is saved in multiple file types and structures. It makes it easy to place many data types in many directory structures, but it does not make it easy to search across all of the data to gain insights. In the example above, it is very hard to go through all of the different files and types and ask for all of the contact details of “Alexander Hamilton”. Doing so requires opening each file with the proper software and reading all of the data in it before closing it and moving on to the next file, which is slow and inefficient.
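The search problem described above can be sketched in Python as follows. This is a minimal illustration, not a recommended tool: the functions `search_text`, `search_csv` and `find_contact`, and the assumption that files are told apart by their `.txt` or `.csv` suffix, are all made up for this example.

```python
import csv
import tempfile
from pathlib import Path


def search_text(path: Path, name: str) -> list[str]:
    """Return the lines of a free-form text file that mention the name."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if name in line]


def search_csv(path: Path, name: str) -> list[str]:
    """Return the rows of a CSV file whose first two columns form the name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [" ".join(row) for row in csv.reader(handle)
                if " ".join(row[:2]) == name]


# Each file type needs its own reader; a type without one cannot be searched.
READERS = {".txt": search_text, ".csv": search_csv}


def find_contact(directory: Path, name: str) -> list[str]:
    """Open every file in turn and scan all of its content for the name."""
    matches = []
    for path in sorted(directory.rglob("*")):
        reader = READERS.get(path.suffix)
        if path.is_file() and reader is not None:
            matches.extend(reader(path, name))
    return matches


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_text(
            "Alexander Hamilton 4083350085\n", encoding="utf-8")
        (root / "b.csv").write_text(
            "Alexander,Hamilton,4083350085\n", encoding="utf-8")
        print(find_contact(root, "Alexander Hamilton"))
```

Every query walks the whole directory and reads every supported file from start to finish, and any file type without a matching reader is silently skipped.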

Another approach to arranging electronic data in computer systems is to enforce a strict structure on the data that is saved, so that it always has the same structure. This is done by placing it in a strict table format of columns and rows. Using the example above, we would define a table for contacts with the columns “First name”, “Last name” and “Phone number”, and each person’s details would be placed in that table as a row. This approach is called structured data. It is how databases store their data and, more specifically, how relational databases store their data.
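A minimal sketch of such a table, using SQLite through Python's standard `sqlite3` module, might look like the following. The choice of SQLite, the column names `first_name`, `last_name` and `phone_number`, and the two helper functions are assumptions for illustration; any relational database would express the same idea.

```python
import sqlite3

# The two contacts from the example above.
CONTACTS = [
    ("Alexander", "Hamilton", "4083350085"),
    ("George", "Washington", "202-456-1414"),
]


def build_contacts_db() -> sqlite3.Connection:
    """Create an in-memory database with one strictly defined contacts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE contacts (
            first_name   TEXT NOT NULL,
            last_name    TEXT NOT NULL,
            phone_number TEXT NOT NULL
        )
        """
    )
    # Every row supplies exactly the three columns the table defines.
    conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)", CONTACTS)
    conn.commit()
    return conn


def try_insert_incomplete_row(conn: sqlite3.Connection) -> bool:
    """Return True if a row without a phone number was accepted."""
    try:
        conn.execute("INSERT INTO contacts VALUES (?, ?)",
                     ("Alexander", "Hamilton"))
    except sqlite3.Error:
        return False  # rejected: the row does not match the table's columns
    return True


if __name__ == "__main__":
    db = build_contacts_db()
    for row in db.execute("SELECT * FROM contacts ORDER BY last_name"):
        print(row)
    print("incomplete row accepted:", try_insert_incomplete_row(db))
```

The second helper shows the strictness in practice: a row that does not fit the declared columns is refused instead of being stored in some other shape.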

The advantage of this approach is that the data sits inside the database tables in a way that is easy to search, with no need to open and close multiple files. Additionally, database software is built to search data efficiently by using indexes. Databases are also optimized for operations that involve multiple tables. For example, a query can look for all of the people called “Alexander”, take their “phone number” from the contacts table, and then look up that “phone number” in another table called billing to find their “monthly bill”.
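The multi-table lookup described above can be sketched with SQLite as follows. The index names, the `billing` table layout and the monthly bill amounts are made-up placeholders for this example, not values from any real system.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create contacts and billing tables, each with an index for lookups."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE contacts (first_name TEXT, last_name TEXT, phone_number TEXT);
        CREATE TABLE billing  (phone_number TEXT, monthly_bill REAL);
        -- Indexes let the database find matching rows without scanning a whole table.
        CREATE INDEX idx_contacts_first_name ON contacts (first_name);
        CREATE INDEX idx_billing_phone ON billing (phone_number);
        """
    )
    conn.executemany(
        "INSERT INTO contacts VALUES (?, ?, ?)",
        [("Alexander", "Hamilton", "4083350085"),
         ("George", "Washington", "202-456-1414")],
    )
    # Placeholder amounts, invented purely for illustration.
    conn.executemany(
        "INSERT INTO billing VALUES (?, ?)",
        [("4083350085", 42.5), ("202-456-1414", 17.0)],
    )
    conn.commit()
    return conn


def monthly_bills_for(conn: sqlite3.Connection, first_name: str) -> list[tuple]:
    """Join contacts to billing on the phone number for one first name."""
    query = """
        SELECT c.first_name, c.last_name, c.phone_number, b.monthly_bill
        FROM contacts AS c
        JOIN billing AS b ON b.phone_number = c.phone_number
        WHERE c.first_name = ?
    """
    return conn.execute(query, (first_name,)).fetchall()


if __name__ == "__main__":
    print(monthly_bills_for(build_db(), "Alexander"))
```

A single query replaces the open, scan and close loop needed for files, and the database decides how to use the indexes to answer it.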

Ultimately, the data of a database can reside on a filesystem as well, but since it is accessed only through the database software, searching it remains efficient and fast.
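To make this concrete, the sketch below stores a SQLite database as an ordinary file and then reads it back through the database software. SQLite is used here only as a convenient example, and the file name `contacts.db` and both helper functions are assumptions for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_db_file(db_path: Path) -> int:
    """Write a small contacts database to disk and return the file size in bytes."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE contacts (first_name TEXT, last_name TEXT, phone_number TEXT)"
    )
    conn.execute(
        "INSERT INTO contacts VALUES (?, ?, ?)",
        ("Alexander", "Hamilton", "4083350085"),
    )
    conn.commit()
    conn.close()
    return db_path.stat().st_size


def read_through_database(db_path: Path) -> list[tuple]:
    """Query the file via the database software instead of parsing it by hand."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT * FROM contacts").fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "contacts.db"
        print("file size in bytes:", create_db_file(path))
        print(read_through_database(path))
```

On disk the database is just another file, but applications never parse it themselves; they send queries to the database software, which knows the file's internal layout.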

In summary,

Filesystems allow for unstructured data of many types, which is useful when working across different environments and accommodating multiple different applications, while databases enforce strictness and structure on the data so that it can be searched very efficiently to gain insights from it.