Aggregate data refers to data that is combined from multiple sources, such as multiple individuals or organizations. Separate data sources are brought together into a single aggregated dataset so that data processing and statistical analysis can be performed across all of the contributing sources.
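As a simple illustration, the sketch below combines two individual data sources into one aggregated dataset and runs a summary analysis on the result. It is a minimal example using the pandas library; the file names and column names (clinic_a.csv, clinic_b.csv, age, visits) are illustrative assumptions, not real data sources.

```python
import pandas as pd

# Hypothetical example: two clinics each hold their own visit records.
# The file names and column names here are illustrative assumptions.
clinic_a = pd.read_csv("clinic_a.csv")   # columns: patient_id, age, visits
clinic_b = pd.read_csv("clinic_b.csv")   # same columns, different patients

# Combine the individual sources into a single aggregated dataset.
aggregated = pd.concat([clinic_a, clinic_b], ignore_index=True)

# Statistical analysis can now be performed across all contributing sources,
# e.g. the average number of visits by age group.
aggregated["age_group"] = pd.cut(aggregated["age"], bins=[0, 18, 40, 65, 120])
print(aggregated.groupby("age_group", observed=True)["visits"].mean())
```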
Data aggregation is a popular technique used by researchers, policymakers, and businesses. For example, researchers may use aggregate data to investigate the relationship between two variables within the group that contributed to the aggregated dataset. Businesses use aggregate data to develop customer analytics that can be used to understand trends and drive sales and marketing activity.
Data anonymization is often part of creating an aggregated dataset. This involves removing, or stripping out, personally identifiable information (PII) to reduce the potential for re-identifying a person or business that contributed data to the aggregated dataset.
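A minimal sketch of this step is shown below, again using pandas with hypothetical column names. It drops direct identifiers before aggregation and generalizes a remaining quasi-identifier (exact age) into a broader band to further reduce re-identification risk.

```python
import pandas as pd

# Hypothetical individual-level records; all values and column names are illustrative.
records = pd.DataFrame({
    "name":        ["A. Singh", "B. Lee", "C. Tremblay"],
    "email":       ["a@x.com", "b@y.com", "c@z.com"],
    "postal_code": ["M5V 2T6", "H2Y 1C6", "V6B 1A1"],
    "age":         [34, 51, 28],
    "purchases":   [3, 7, 2],
})

# Strip out direct identifiers (PII) before aggregation.
pii_columns = ["name", "email", "postal_code"]
de_identified = records.drop(columns=pii_columns)

# Generalize a remaining quasi-identifier: replace exact age with an age band.
de_identified["age_band"] = pd.cut(de_identified["age"], bins=[0, 30, 50, 120])
de_identified = de_identified.drop(columns=["age"])

# The aggregated output no longer exposes any single contributor's details.
print(de_identified.groupby("age_band", observed=True)["purchases"].sum())
```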
That said, even after PII has been removed, there may still be some potential for re-identification, and therefore some remaining privacy concerns about how the aggregated data is managed. These issues will be discussed in greater detail in the module on privacy that will be published later in 2024.
Aggregated Mobility Data to Track COVID-19
A good example of aggregate data is how mobility data was used by public health agencies to fight the spread of COVID-19. In the early stages of the pandemic, the Public Health Agency of Canada (PHAC) worked with telecommunications companies to obtain licensed access to aggregated, de-identified mobile device location data. This data gave PHAC insights into population movement, helping it understand the spread of the COVID-19 virus and how well public health measures such as social distancing were being followed. In response to complaints, the Privacy Commissioner of Canada later investigated how this data was used and, in 2023, found that, given the de-identification and safeguards PHAC had put in place, the risk of re-identifying individuals did not exceed an acceptable threshold.
What are the benefits of aggregate data?
It allows for the integration of similar data from multiple sources.
It can increase the value of the data by enabling data-driven analytics to identify trends in a way that is not possible using individual data sources.
What are the challenges of aggregate data?
A high level of data aggregation can hide important differences among the individuals or groups that contributed to the aggregated dataset.
The organizations doing the aggregating often derive most, if not all, of the benefit from the aggregated dataset, because they control the results of the analysis rather than the individuals or groups who contributed the data.
Big Data
Big Data is a term that is used to describe large datasets collected from a number of diverse sources. More specifically, big data refers to the situation where the scale and complexity of the data require new tools, techniques and algorithms to manage and process it. Big data is simply too large or complex to be processed by traditional software.
The best way to identify big data is through the 5Vs of big data:
Volume - What is the size and amount of data?
Value - What value is being produced from insights derived from patterns found in the data?
Variety - How much diversity is found within the dataset? Is there a range of different data types - structured, semi-structured and unstructured data? Does the dataset contain raw data?
Velocity - How fast does data flow through the system as it is being captured, stored and processed?
Veracity - Does the data accurately represent the real world?
Data becomes big data when it has most, if not all, of the 5Vs of big data.
What makes big data important?
In some applications, big data can provide insights that are not available from smaller datasets. This happens because larger, more diverse datasets can provide a wider view than is possible using data from a single individual or business.
This connects back to one of the 5Vs of big data: value. Taken together, the larger volume and velocity of big data can create new value, provided the data also accurately represents the real world (i.e. veracity).
Now, this doesn’t mean that smaller datasets aren’t useful. Smaller datasets are cheaper and easier to manage and remain an important resource. In fact, the decision to use big data is one that should not be taken lightly, given the increased complexity that can come from collecting larger volumes of data.
What are some examples of big data being used in the real world?
Marketing
Big data and other technologies revolutionized the marketing industry by transforming traditional survey and advertising techniques into digital marketing. Big data collected from millions of customers can be analyzed to better understand customer behaviour and develop more effective marketing campaigns. For example, Amazon collects vast amounts of data on customer purchases globally and then uses the patterns it finds to iterate on its products, services, and advertisements. This approach can also be seen in the entertainment industry, where Netflix and Spotify use data from viewers or listeners to offer personalized recommendations.
Transportation
Big data enables our transportation systems to work effectively. For example, Google Maps uses big data analytics to identify routes with less traffic. It also provides real-time updates on traffic incidents to keep you informed of any accidents or road closures along your route. Ride-sharing services like Uber and Lyft use massive amounts of data about their drivers, riders, vehicles, trips, and locations to predict supply and demand and determine the cost of a ride on their service.
Now that we’ve gone through some of the opportunities associated with big data, it’s important to recognize it can also have some drawbacks.
Big data requires more capable and complex systems to adequately store and manage the larger datasets. This large-scale infrastructure, and its associated cost, can limit the accessibility of big data to smaller organizations and businesses.
There can be increased security and privacy risks due to the large volume and high speed of data collection. This is especially true for big data that contains valuable private or personal information.
Ensuring the high quality of big datasets can be difficult given the large amount of diverse data being collected. Automated tools and software are likely needed for quality control.
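As a rough sketch of what such automated quality checks might look like, the example below uses pandas to flag missing values, duplicate records, and out-of-range values in an incoming batch of data. The field names and thresholds are illustrative assumptions, not a prescribed quality-control standard.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame) -> dict:
    """Run a few simple automated data-quality checks on an incoming batch."""
    return {
        # Completeness: how many values are missing in each column?
        "missing_values": df.isna().sum().to_dict(),
        # Uniqueness: are any records exact duplicates?
        "duplicate_rows": int(df.duplicated().sum()),
        # Validity: a simple range check on a hypothetical 'age' field.
        "invalid_ages": int(((df["age"] < 0) | (df["age"] > 120)).sum()),
    }

# Hypothetical incoming batch of records.
batch = pd.DataFrame({
    "age": [34, -2, 51, 51, 200],
    "visits": [3, 1, 4, 4, None],
})
print(basic_quality_report(batch))
```

In practice, checks like these would run automatically as new data arrives, so that problem records can be flagged or corrected before they distort any analysis built on the larger dataset.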