Raw and clean data are common buzzwords you will hear when talking about data-driven decision making. Let’s break down their meaning.
Raw data refers to data that has been captured, or gathered, but has not been processed in any way. Other terms for raw data include primary data and source data.
Clean data refers to data that has been processed to remove outliers, inconsistencies, duplicates, and incomplete information.
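To make the distinction concrete, here is a minimal sketch of a cleaning step in Python. The records, field names, and the `clean` function are hypothetical and exist only for illustration; real cleaning rules depend on the data and on what the organization needs from it.

```python
# Hypothetical raw records, as they might arrive from a sign-up form.
raw_records = [
    {"name": "Ada Lovelace", "email": "ADA@example.com ", "age": "36"},
    {"name": "Ada Lovelace", "email": "ada@example.com", "age": "36"},     # duplicate
    {"name": "Alan Turing", "email": "", "age": "41"},                     # missing email
    {"name": "Grace Hopper", "email": "grace@example.com", "age": "n/a"},  # invalid age
    {"name": "Katherine Johnson", "email": "kj@example.com", "age": "101"},
]


def clean(records):
    """Return records that are complete, consistently formatted, and unique by email."""
    seen = set()
    cleaned = []
    for rec in records:
        email = rec["email"].strip().lower()  # normalize inconsistent formatting
        if not email or not rec["age"].isdigit():
            continue  # drop incomplete or invalid records
        if email in seen:
            continue  # drop duplicates
        seen.add(email)
        cleaned.append(
            {"name": rec["name"].strip(), "email": email, "age": int(rec["age"])}
        )
    return cleaned


clean_records = clean(raw_records)
for rec in clean_records:
    print(rec)
```

Only two of the five raw records survive in this sketch. Whether a value such as an age of 101 counts as an outlier to remove is a judgment call that this example deliberately leaves alone.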
Importance of Cleaning Data: An Analogy
Imagine you’re thirsty. Raw data is like being parched and surrounded by seawater. The ocean is vast and yet you can’t make use of a single drop to quench your thirst.
To produce potable water you would have to use a desalination process to turn that salty seawater into freshwater for consumption.
Can you see the connection to data? An organization can have vast amounts of data, but if the raw data is not processed into clean data that can be used to create value, then all of that data really won’t make any meaningful difference to the organization.
Structured data is highly organized and stored in a predefined format, or structure. Some qualities of structured data are:
It can be displayed in rows and columns and is often stored in a structured database.
It is easily searchable because it consists of clearly defined data types.
It is generally more accessible as it can connect to a variety of tools.
Examples of structured data include:
Spreadsheets (e.g., Excel).
Customer names and contact information.
Financial transactions (e.g., account details, bank information, sender and receiver data).
Medical information (e.g., patient data and medical history).
Barcodes and serial numbers.
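As an illustration of why structured data is easy to search, the sketch below stores a few financial transactions in an in-memory SQLite database using Python's standard `sqlite3` module. The table name, column names, and values are made up for this example.

```python
import sqlite3

# An in-memory database with a predefined structure: every row has the
# same columns, and every column has a declared type.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, sender TEXT, receiver TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (sender, receiver, amount) VALUES (?, ?, ?)",
    [("alice", "bob", 120.50), ("bob", "carol", 35.00), ("alice", "carol", 910.00)],
)

# Because the amount column is numeric, filtering and sorting take one query.
large = conn.execute(
    "SELECT sender, receiver, amount FROM transactions "
    "WHERE amount > ? ORDER BY amount DESC",
    (100,),
).fetchall()
print(large)
conn.close()
```

The query works without any parsing or guessing because the structure was defined up front.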
Unstructured data is data that comes in a wide array of formats and doesn’t have any predefined structure. Some qualities of unstructured data are:
It is often a mixture of different data types such as images, audio, video, and text files.
It cannot be easily displayed in rows or columns.
It is harder to search and to interpret than structured data.
Examples of unstructured data include:
Text files and documents (e.g., Word, Google Docs)
Video and audio files
Social media posts
Satellite imagery
Weather data
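By contrast, getting anything searchable out of unstructured data usually takes an extra extraction step. The sketch below pulls hashtags and mentions out of a made-up social media post with regular expressions; the post text and the patterns are assumptions for illustration and would miss many real-world cases.

```python
import re

# A hypothetical social media post: free text with no predefined fields.
post = "Flooding on Main St again after the storm! Stay safe @citycouncil #weather #flood"

# There are no columns to query, so structure has to be inferred from patterns.
hashtags = re.findall(r"#(\w+)", post)
mentions = re.findall(r"@(\w+)", post)

print(hashtags)
print(mentions)
```

Even after this step, the location ("Main St") and the event ("storm") remain buried in the text, which is part of what makes unstructured data harder to work with.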
Metadata
Metadata is additional data that describes the data itself; it is essentially data about data. Metadata gives context to the data, such as its version, timestamp, and format. Metadata is critical because it helps organize, classify, label, sort, and search data.
For example, the metadata of a digital picture could include information on the size of the image, the camera used to capture it, the date it was created, the file name, the GPS location of the image, and more.
The metadata of a digital image
Another example is the metadata of an email. This may include the name and email address of the sender and receiver, the date and time the email was sent, the subject line, and the number of attachments included.
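The email example can be sketched with Python's standard `email` package, which separates the headers (metadata) from the message body (data). The message below is invented for illustration, and the keys of the `metadata` dictionary are names chosen for this example, not a standard.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# A hypothetical email: headers first, then a blank line, then the body.
raw_email = """\
From: Ada Lovelace <ada@example.com>
To: Charles Babbage <charles@example.com>
Subject: Notes on the Analytical Engine
Date: Mon, 04 Mar 2024 09:30:00 +0000

Dear Charles, here are my notes.
"""

msg = message_from_string(raw_email)

# Collect the metadata without touching the body of the message.
metadata = {
    "sender": parseaddr(msg["From"]),      # (display name, address)
    "receiver": parseaddr(msg["To"]),
    "subject": msg["Subject"],
    "sent_at": parsedate_to_datetime(msg["Date"]),
    "attachments": sum(
        1 for part in msg.walk() if part.get_content_disposition() == "attachment"
    ),
}

for key, value in metadata.items():
    print(f"{key}: {value}")
```

With metadata like this, a mailbox can be sorted by date or filtered by sender without reading a single message body.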