Organizing Data



Raw Data vs. Clean Data

Raw and clean data are common buzzwords you will hear when talking about data-driven decision making. Let’s break down their meaning.

Raw data refers to data that has been captured, or gathered, but has not been processed in any way. Other terms for raw data include primary data and source data.

Clean data refers to data that has been processed to remove outliers, inconsistencies, duplicates, and incomplete records.
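To make the difference concrete, here is a minimal sketch of turning a handful of raw records into clean ones. The records, field names, and the `clean` function are hypothetical and exist only for illustration; real cleaning rules depend on the data and on what the organization needs from it.

```python
# Hypothetical raw records: the names and fields are illustrative only.
raw_records = [
    {"name": "Ada Lovelace ", "email": "ADA@example.com", "age": "36"},
    {"name": "ada lovelace", "email": "ada@example.com", "age": "36"},  # duplicate
    {"name": "Alan Turing", "email": "alan@example.com", "age": ""},    # incomplete
    {"name": "Grace Hopper", "email": "grace@example.com", "age": "85"},
]


def clean(records):
    """Normalize formatting, drop incomplete rows, and remove duplicates."""
    seen_emails = set()
    cleaned = []
    for record in records:
        # Fix inconsistencies in spacing and letter case.
        name = record["name"].strip().title()
        email = record["email"].strip().lower()
        age = record["age"].strip()

        # Skip rows with missing or malformed values.
        if not (name and email and age.isdigit()):
            continue

        # Skip duplicates, using the normalized email as the key (an assumption).
        if email in seen_emails:
            continue
        seen_emails.add(email)

        cleaned.append({"name": name, "email": email, "age": int(age)})
    return cleaned


clean_records = clean(raw_records)
```

In this sketch, four raw records become two clean ones: the duplicate and the incomplete row are dropped, and the remaining values share a consistent format.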

Importance of Cleaning Data: An Analogy

Imagine you’re parched and surrounded by seawater. Raw data is like that seawater: the ocean is vast, and yet you can’t use a single drop to quench your thirst.

To produce potable water you would have to use a desalination process to turn that salty seawater into freshwater for consumption. 

Can you see the connection to data? An organization can have vast amounts of data, but unless the raw data is processed into clean data that can be used to create value, that data won’t make a meaningful difference to the organization.



Structured and Unstructured Data

Another important distinction is between structured and unstructured data.

Structured data is highly organized and stored in a predefined format, or structure. Some qualities of structured data are:
  • It can be displayed in rows and columns, and is often stored in a relational database.
  • It is easily searchable because it consists of clearly defined data types.
  • It is generally more accessible because a wide variety of tools can work with it.

Examples of structured data include:
  • Spreadsheets (e.g. Excel).
  • Customer names and contact information.
  • Financial transactions (e.g. account details, bank information, sender and receiver data).
  • Medical information (e.g. patient data and medical history).
  • Barcodes and serial numbers.
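
As a small illustration of why rows and columns make structured data easy to search, the sketch below loads a tiny table and filters it by one column. The table contents and column names are hypothetical; Python's standard `csv` module is used here only as a convenient stand-in for a spreadsheet or database.

```python
import csv
import io

# Hypothetical customer table; the column names are illustrative.
table = io.StringIO(
    "customer_id,name,city\n"
    "1,Ada Lovelace,London\n"
    "2,Alan Turing,Manchester\n"
    "3,Grace Hopper,New York\n"
)

# Each row becomes a dictionary keyed by the column headers.
rows = list(csv.DictReader(table))

# Because every row has the same named columns, a search is a simple filter.
in_london = [row["name"] for row in rows if row["city"] == "London"]
```

The same question asked of a folder of free-form documents would first require working out where, if anywhere, each document mentions a city.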

Unstructured data is data that comes in a wide array of formats and doesn’t have any predefined structure. Some qualities of unstructured data are:
  • It is often a mixture of different data types, such as images, audio, video, and text files.
  • It cannot be easily displayed in rows or columns.
  • It is harder to search and harder to interpret.

Examples of unstructured data include:
  • Text files and documents (e.g. Word, Google Docs)
  • Video and audio files
  • Social media posts
  • Satellite imagery
  • Weather data




Metadata

Metadata is additional data that describes the data itself; it is essentially data about data. Metadata gives context to the data, such as its version, timestamp, and format. Metadata is critical because it helps organize, classify, label, sort, and search data.

For example, the metadata of a digital picture could include information on the size of the image, the camera used to capture it, the date it was created, the file name, the GPS location of the image, and more.  
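
File metadata of this kind can be read with Python's standard library. The sketch below uses a small temporary text file rather than a picture, since camera and GPS details are stored inside the image itself and typically need a third-party library to read; the dictionary keys are illustrative names, not a standard.

```python
import os
import tempfile
from datetime import datetime, timezone

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello, metadata")
    path = handle.name

# os.stat returns information about the file, not its contents.
info = os.stat(path)

file_metadata = {
    "file_name": os.path.basename(path),
    "size_bytes": info.st_size,
    "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
}

os.remove(path)  # clean up the temporary file
```

Note that none of these values come from the text written into the file; they describe the file from the outside.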


Another example is the metadata of an email. This may include the name and email address of the sender and receiver, the date and time the email was sent, the subject line, and the number of attachments included.
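
The email case can be sketched with Python's standard `email` module, which separates the headers (metadata) from the body (the data itself). The message below is made up for illustration, and the keys of the `metadata` dictionary are hypothetical names chosen to match the description above.

```python
from email import message_from_string

# A made-up message: headers first, then a blank line, then the body.
raw_email = (
    "From: Ada Lovelace <ada@example.com>\n"
    "To: Alan Turing <alan@example.com>\n"
    "Date: Mon, 01 Jan 2024 09:30:00 +0000\n"
    "Subject: Quarterly report\n"
    "\n"
    "Hi Alan, the report is ready.\n"
)

message = message_from_string(raw_email)

# The headers are the metadata: they describe the message.
metadata = {
    "sender": message["From"],
    "receiver": message["To"],
    "sent": message["Date"],
    "subject": message["Subject"],
}

# The payload is the data the metadata describes.
body = message.get_payload()
```

With the metadata pulled out this way, a mailbox can be sorted by date or searched by sender without reading the body of a single message.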



