Basic concepts of data

Types of data

Aggregate data

Data that’s grouped together and expressed in summarised form. The purpose of aggregation is often to get more information about particular groups based on characteristics like age, profession, or income. Aggregation also helps to protect people’s privacy, which makes it safer to share data publicly.

Data is usually aggregated using statistical standards and classifications. These tell us how the data is grouped together.

All of the data Figure.NZ publishes is aggregate data.
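As a rough illustration of aggregation, the Python sketch below groups a handful of raw records by age band and publishes only a count and a median income for each group. The records, field names, and 10-year age bands are all invented for this example; they are not taken from any real dataset or official classification.

```python
from collections import defaultdict
from statistics import median

# Hypothetical raw records: one row per (fictional) person.
raw_records = [
    {"age": 23, "income": 41000},
    {"age": 27, "income": 52000},
    {"age": 34, "income": 61000},
    {"age": 38, "income": 58000},
    {"age": 45, "income": 75000},
    {"age": 49, "income": 69000},
]


def age_band(age):
    """Place an age in a 10-year band, e.g. 23 -> '20-29' (illustrative grouping)."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"


def aggregate(records):
    """Summarise records per age band as a count and a median income."""
    groups = defaultdict(list)
    for record in records:
        groups[age_band(record["age"])].append(record["income"])
    return {
        band: {"count": len(incomes), "median_income": median(incomes)}
        for band, incomes in sorted(groups.items())
    }


summary = aggregate(raw_records)
for band, stats in summary.items():
    print(band, stats)
```

The summary describes each group without listing any individual's age or income, which is the sense in which aggregation supports privacy.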

Public data

Public information or datasets that were created with public funding. Typically, this means datasets that were collected by or for a government agency.

Open data

Data that is free for everyone to use, provided you follow the licence terms. Most often, open data is released by government agencies. However, private companies sometimes also release some of their data, in aggregate form, as open data. They may choose to do this so that other people can learn from how people interact with their products and services.

Raw data

Data as collected, in its original non-aggregated form. Each data point may refer to a single person, organisation, or item. Raw data is rarely released to the public because it could be used to identify individuals, which may breach their privacy. Some countries release a form of raw data called ‘microdata’ for researchers to use. This normally has names and other unique identifiers removed, and there are strict controls on how it can be accessed.

Collecting data

Census

A national collection of information about everyone in the population. Censuses try to get an accurate snapshot of what’s going on at a particular point in time so it can be compared to previous years. Censuses normally provide more detail and reliability than surveys, which take a sample of the population. Because censuses collect more data, they can safely release much more detailed aggregate data than a survey, such as information about smaller areas or more specific occupations.

In New Zealand, the census usually happens once every 5 years.

Survey

An investigation about the characteristics of a population, done by collecting data from a sample of that population. Surveys in New Zealand include the New Zealand Health Survey and the Household Labour Force Survey (and lots of others!). You may get a letter from Stats NZ saying you’re required to participate in one of these surveys. They may post you information or someone may come to your door. There are also business surveys like the Business Operations Survey and Agricultural Production Survey which help us to understand these sectors.

Sample

The part of a population that’s observed when gathering survey data. Statisticians try to ensure that the sample is representative of the whole population.

Each sample will have a sample size. Sample size is the number of people, organisations, or other units included in the sample.

Because a survey only covers a portion of the population, its results come with a measure of sampling error. Sampling error is the variability that arises because a sample is surveyed, rather than the entire population. It helps us to understand how reliable the data is, and what range of uncertainty we should take into account when using the results.
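One common way to put a number on this uncertainty for a percentage is an approximate 95% margin of error. The sketch below assumes a simple random sample and the normal approximation; real surveys often use more complex designs and methods, so treat it as illustrative only. The function name and the sample figures are made up for this example.

```python
import math


def proportion_margin_of_error(successes, sample_size, z=1.96):
    """Return (estimate, margin of error) for a sample proportion.

    Assumes a simple random sample; z=1.96 corresponds to roughly 95% confidence.
    """
    p = successes / sample_size
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    return p, z * standard_error


# Hypothetical survey: 300 of 1,000 respondents answered "yes".
p, moe = proportion_margin_of_error(300, 1000)
print(f"Estimate: {p:.1%}, plus or minus {moe:.1%}")

# The same proportion from a sample ten times larger has a smaller margin.
p_big, moe_big = proportion_margin_of_error(3000, 10000)
print(f"Estimate: {p_big:.1%}, plus or minus {moe_big:.1%}")
```

The second call shows the general pattern: a larger sample size gives a narrower range of uncertainty around the same estimate.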

Administrative data

Data collected as part of government agencies or other organisations going about their daily business. Where census and survey data is designed and collected deliberately, administrative data is a byproduct. For example, you’re required to register your dog with your council. When you register your dog, you provide information about the breed of the dog. This data is aggregated and released to form a dataset about the breeds of dogs in New Zealand.

Administrative data is great ‘real-world’ data, but it needs care and consideration when you use it. You need to understand where the data came from and what limitations it might have because of that, like who might not be represented or included, and how it was classified.

Sharing data

Data collection

A series of one or more datasets collected at the same time by the same entity or entities. This might be a survey, or a by-product of a service like registering a dog. For example, the Census is a data collection. So is the New Zealand Health Survey. The data being collected and released about COVID-19 is also a data collection.

Dataset

A set of information that measures something. Datasets are usually captured in spreadsheets, tables or maps. A data collection is often made up of multiple datasets. For example, the Census has a dataset on the number of people who live in different parts of New Zealand.

Datapoint

An individual piece of data. A dataset is made up of multiple datapoints. For example, the population of Auckland is one datapoint in a dataset of the population of different regions of New Zealand.

Metadata

Data that describes and gives information about other data. It summarises what you need to know to understand and use the data with confidence. It may include things such as definitions for terms, limitations of use, information about how the data was collected, sampling errors, and more.

Variable

Any characteristic, number, or quantity that can be measured or counted. Age, gender, country of birth, eye colour, and vehicle type are all examples of variables.
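To see how the terms in this section fit together, here is one possible way to lay out a small dataset in Python. The structure, key names, region labels, and figures are all invented for illustration; real datasets are published in many different formats.

```python
# A hypothetical dataset: metadata, variables, and datapoints kept together.
dataset = {
    "metadata": {
        "title": "Example population by region",
        "source": "Made-up figures for illustration only",
        "unit": "people",
        "notes": "Definitions, limitations, and collection method would go here.",
    },
    # Variables: the characteristics recorded for each datapoint.
    "variables": ["region", "population"],
    # Datapoints: the individual pieces of data.
    "datapoints": [
        {"region": "Region A", "population": 1200},
        {"region": "Region B", "population": 800},
        {"region": "Region C", "population": 500},
    ],
}

# Look up a single datapoint by one of its variables.
region_a = next(d for d in dataset["datapoints"] if d["region"] == "Region A")
print(region_a["population"], dataset["metadata"]["unit"])

# Combine all datapoints into a total.
total = sum(d["population"] for d in dataset["datapoints"])
print(total)
```

Keeping the metadata alongside the datapoints means anyone reading the numbers can also see the unit and any limitations.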

Analysing and using data

Median

The value separating the higher half of a set of numbers from the lower half. It can be thought of as the middle value. For example, in the dataset {1, 3, 3, 6, 7, 8, 9}, the median is 6: it’s the fourth largest and also the fourth smallest number in the set. The median helps us understand what the experience at the very middle of our dataset is. This is particularly useful when we have outliers (values that are either very high or very low), as the median is barely affected by these, if at all.
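The example above can be checked with Python's standard `statistics` module. The second list, which adds an eighth value, is an addition for illustration: when there is an even number of values, `statistics.median` averages the two middle ones.

```python
from statistics import median

# The set from the example: seven values, so the fourth is the middle one.
values = [1, 3, 3, 6, 7, 8, 9]
odd_median = median(values)
print(odd_median)  # 6

# With an even number of values there is no single middle value,
# so the median is the average of the two middle values (6 and 7).
even_values = [1, 3, 3, 6, 7, 8, 9, 10]
even_median = median(even_values)
print(even_median)  # 6.5
```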

Mean

The average of a set of numbers, worked out by adding the values together and dividing the total by the number of values. The mean gives us a single number that summarises the whole set. However, means can be skewed by outliers, making it seem like the typical experience is higher or lower than it is. The mean is a very useful number, but it’s a good idea to use it together with a median to check whether it’s being skewed.
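The short Python sketch below shows how a single outlier can pull the mean a long way while the median hardly moves. The income figures are invented for this example.

```python
from statistics import mean, median

# Hypothetical incomes for five people.
incomes = [40000, 45000, 50000, 55000, 60000]
print(mean(incomes), median(incomes))  # 50000 50000

# Add one very high income (an outlier).
with_outlier = incomes + [1_000_000]
mean_with_outlier = mean(with_outlier)
median_with_outlier = median(with_outlier)

print(round(mean_with_outlier))  # 208333
print(median_with_outlier)       # 52500.0
```

Here the mean more than quadruples, while the median shifts only from 50,000 to 52,500. A large gap between the two is a hint that outliers are skewing the mean.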

Rate

A rate is a measure of frequency. A rate can either be:

  • The frequency that something happens over a specific interval of time. For example, a rate of 30 km per hour.
  • The frequency that something happens in a particular group or population. For example, the rate of breast cancer in New Zealand is 1 in 2,118 for women and people with breasts in their 20s.

Statistics like population mortality rates are usually shared as a rate per 1,000 people. Be careful when you see or use a rate. It’s important to check what the constant (the ‘per 1,000’ bit) is, as different statistics use different measures, like per 100, per 1,000, per 100,000. Or even per 2,118, in the case of the breast cancer statistic!
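To compare rates that use different constants, it can help to convert them to the same one. The sketch below uses a hypothetical helper, `rate_per`, to re-express the ‘1 in 2,118’ figure per 1,000 and per 100,000 people.

```python
def rate_per(count, population, per=1000):
    """Express `count` events in `population` as a rate per `per` units."""
    return count / population * per


# "1 in 2,118" re-expressed against two more common constants.
per_thousand = rate_per(1, 2118, per=1000)
per_hundred_thousand = rate_per(1, 2118, per=100_000)

print(f"{per_thousand:.2f} per 1,000")
print(f"{per_hundred_thousand:.1f} per 100,000")
```

All three figures describe the same frequency. Only the constant changes, which is why it matters to check it before comparing two rates.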

Index

Indexes are a way of measuring how much a group of datapoints has changed. The change is measured against a base level, such as a particular year of collection. The base level is usually assigned the value of 100, and subsequent collections are assessed based on how far above or below the base level they are.

One example of this in New Zealand is the Consumer Price Index (CPI). The CPI measures changes in the prices of goods and services in New Zealand. It helps us to understand the amount of inflation in New Zealand. Changes to the CPI are used to help decide things like benefit payment amounts. That’s why you might hear people talk about benefits and superannuation being ‘index-linked’.
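The base-level idea can be sketched in a few lines of Python. This is a simple index for a single made-up series, with the first period as the base. It is not how the CPI itself is calculated, which combines prices for many goods and services. The function name and figures are assumptions for illustration.

```python
def to_index(values, base_position=0):
    """Rescale a series so the value at `base_position` becomes 100."""
    base = values[base_position]
    return [round(value / base * 100, 1) for value in values]


# Hypothetical prices over four periods; the first period is the base.
prices = [80.0, 84.0, 88.0, 76.0]
index_values = to_index(prices)
print(index_values)  # [100.0, 105.0, 110.0, 95.0]
```

An index value of 105 means the measure is 5% above the base level, and 95 means it is 5% below.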

Trend

A tendency of a series of datapoints to move in a certain direction over time. Trends help us understand the longer-term direction of changes in data.

Time series

A dataset that shows changes over time. On a chart, a time series is usually drawn as a line connecting lots of individual datapoints, rather than as bars or columns. You may sometimes see a ‘trend line’ on a time series chart. This is a line that tells you what the overall trend of the data is. A flat line means that overall, the data isn’t increasing or decreasing, whereas one sloping upwards indicates that overall, values are increasing.

When you see a time series chart, you look at the shape of the line to understand what is happening.
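A trend line can be worked out in several ways. The sketch below uses one common choice, a straight line fitted by ordinary least squares, and reports its slope. The series is invented and the function name is hypothetical. A chart you come across may have used a different method.

```python
def trend_line(values):
    """Fit a straight line to evenly spaced values by ordinary least squares.

    Returns (slope, intercept), where the time periods are 0, 1, 2, ...
    Needs at least two values.
    """
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical time series: it wobbles, but rises overall.
series = [10, 12, 11, 14, 15, 17]
slope, intercept = trend_line(series)
print(f"Slope: {slope:.2f} per period")

# A series that never changes has a flat trend line.
flat_slope, _ = trend_line([5, 5, 5, 5])
print(flat_slope)  # 0.0
```

A positive slope corresponds to a trend line sloping upwards, a negative slope to one sloping downwards, and a slope near zero to a flat line.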