Open data collection in NZ
What is open data?
Open data is data that someone else has collected and made free for you to take and use, usually on the condition that you acknowledge the people who collected it.
You can tell if something is open data because it is shared with the public under an ‘open’ license. The most common license you’ll see is Creative Commons.
Where does it come from?
Most of the open data in New Zealand comes from New Zealand Government agencies or organisations such as Stats NZ, the Ministry of Health, and the Ministry of Education.
Some private companies release open data too.
How does the Government collect data?
There are 3 main ways the Government collects data:
- Census
- Surveys
- Administration
Let’s take a look at what each of these is, and what that means for the data.
Census
A census is an official count or survey of everyone in a population. Censuses help us to get an accurate picture of the population and dwellings (places people live) so the Government can make better decisions about how to share resources and provide support. You can read more about what a census is and how it works here.
The accuracy of a census relies on everyone participating. If some people, or groups of people, are missed, it becomes harder to make good and accurate funding decisions and decisions about issues like electorate boundaries.
Census data is comprehensive and usually of high quality. It’s very valuable because asking everyone in New Zealand gives you an extremely accurate picture of what’s going on, rather than an estimate based on statistical methods such as sampling. The most recent NZ census had a lower-than-usual participation rate, which reduced the data quality, especially for some groups such as Māori.
Because it’s only collected every 5 years, census data loses value the longer it’s been since the last census. Censuses are very expensive to run, so we have to balance the value of the data with the cost of running a census.
Examples of census data:
- Weekly rent paid by households in rented occupied private dwellings 2018
- Usually resident population by total personal income 2018
- Dwelling mould indicator by household tenure by region 2018
Things to remember when using census data:
- It may be up to 5 years old.
- Numbers are rounded to a multiple of 3 (base 3), which matters most for small numbers.
- In the most recent census, some groups were less well represented.
- It covers a narrower range of topics than surveys do.
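To see why base-3 rounding matters most for small numbers, here is a minimal sketch. It assumes a random rounding rule in which counts that are already multiples of 3 are left unchanged, and other counts move to the nearest multiple of 3 with probability 2/3 or to the next nearest with probability 1/3. That rule and the function names are assumptions for illustration only, so check the dataset's own documentation for the method actually used.

```python
import random

def random_round_base3(count, rng=random):
    """Round a non-negative count to a multiple of 3 (illustrative rule only)."""
    remainder = count % 3
    if remainder == 0:
        return count  # already a multiple of 3, left unchanged
    lower = count - remainder  # multiple of 3 below the count
    upper = lower + 3          # multiple of 3 above the count
    # The nearest multiple is `lower` when remainder is 1, `upper` when it is 2.
    nearest, other = (lower, upper) if remainder == 1 else (upper, lower)
    # Assumed rule: nearest multiple with probability 2/3, the other with 1/3.
    return nearest if rng.random() < 2 / 3 else other

def max_relative_change(count):
    """Largest possible change from rounding, as a share of the true count."""
    return 0.0 if count % 3 == 0 else 2 / count
```

Under this assumed rule, a true count of 4 could be published as 3 or 6, a change of up to 50%, while a true count of 4,000 can move by at most 2, or 0.05%.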
Survey
A survey is a set of questions filled in by a proportion of the population. It might be completed online, on a form sent out to you, or with an interviewer who visits you in person.
Some examples in NZ are the NZ Health Survey, the Business Operations Survey, and the Agricultural Production Survey.
Surveys are very common because they are a more cost-effective way to collect data than a census, and give us a good indication of what’s going on in a particular area.
The group of people surveyed is called a sample. The organisation running the survey does its best to ensure the sample represents the overall population, but sometimes needs to weight the results to correct for over- or under-representation of some groups. Surveys often use census statistics to understand what their sample should look like.
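As a rough illustration of weighting, the sketch below uses made-up numbers (not real NZ figures) and a simple post-stratification approach, which is only one of several weighting methods an organisation might use. Each group's weight is its share of the population divided by its share of the sample. The function and group names are hypothetical.

```python
def poststratification_weights(population_counts, sample_counts):
    """Return a weight per group: population share divided by sample share."""
    pop_total = sum(population_counts.values())
    sample_total = sum(sample_counts.values())
    weights = {}
    for group, pop_count in population_counts.items():
        pop_share = pop_count / pop_total
        sample_share = sample_counts[group] / sample_total
        weights[group] = pop_share / sample_share
    return weights

# Hypothetical figures: under-30s are 40% of the population but only 20% of the sample.
population = {"under_30": 400_000, "30_and_over": 600_000}
sample = {"under_30": 200, "30_and_over": 800}

weights = poststratification_weights(population, sample)
# Under-represented groups get a weight above 1, over-represented groups below 1.
```

Here each under-30 response counts for about 2 people's worth and each response from someone 30 or over counts for about 0.75, so the weighted sample matches the population's age mix.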
It’s important that the survey sample is representative because the organisation will extrapolate the survey responses into figures for the whole population, using statistical methods. To make this as accurate as possible, the organisation also tries to make sure the sample is large enough to extrapolate the results with a reasonable degree of certainty.
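The sketch below shows one simple way extrapolation and sample size interact. It assumes a simple random sample and uses the textbook normal approximation for a proportion; real surveys typically use weights and more involved methods. All figures and the function name `extrapolate_count` are hypothetical.

```python
import math

def extrapolate_count(sample_yes, sample_size, population_size, z=1.96):
    """Scale a sample proportion up to a population count with an approximate 95% range."""
    p = sample_yes / sample_size
    # Standard error of a proportion under simple random sampling (assumed here).
    se = math.sqrt(p * (1 - p) / sample_size)
    estimate = p * population_size
    lower = max(0.0, p - z * se) * population_size
    upper = min(1.0, p + z * se) * population_size
    return estimate, lower, upper

# Hypothetical survey: 300 of 1,000 respondents say yes, in a population of 5,000,000.
small = extrapolate_count(300, 1_000, 5_000_000)
# The same proportion from a sample ten times larger gives a narrower range.
large = extrapolate_count(3_000, 10_000, 5_000_000)
```

Both samples give the same estimate of about 1.5 million people, but the range around the larger sample's estimate is narrower, which is why sample size matters.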
Because of these processes, some survey data comes with confidence intervals. Confidence intervals are a way for the statisticians to tell us how confident they are in the extrapolations they made. You’ll see a confidence interval as a range. A lot goes into them, but the key thing you need to know is that the smaller the range is compared to the figure, the more confident the statisticians are in the data. So, if a figure is 100 and the confidence interval is 99–101, the statisticians are more confident than if it were 75–125.
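The comparison above can be put into numbers. This minimal sketch divides the width of the interval by the figure itself. `relative_width` is a hypothetical helper name, and the ratio is a rough rule of thumb rather than a formal statistical test.

```python
def relative_width(estimate, lower, upper):
    """Width of the confidence interval as a fraction of the estimate."""
    if estimate == 0:
        raise ValueError("relative width is undefined for an estimate of 0")
    return (upper - lower) / abs(estimate)

narrow = relative_width(100, 99, 101)  # interval of 99 to 101
wide = relative_width(100, 75, 125)    # interval of 75 to 125
```

The first interval spans 2% of the figure and the second spans 50%, so the first figure can be treated with much more confidence.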
Examples of survey data:
- Dental health of children in year 8, by DHB, ethnic group, and water fluoridation status 2018
- All indicators of adults health by DHB (prevalence) 2011–2017
- Who pays for private health insurance cover, by income 2011–2015 average
Things to remember when using survey data:
- Surveys use a sample that is meant to represent the whole population.
- Read the methodology to understand how the survey was run.
- Check for a confidence interval.
- Surveys cover a wider range of topics, in more depth, than a census.
Administration
The previous two types of data collection, census and survey, are both deliberately designed. In other words, someone decided we needed to collect data on those topics, and they specifically designed and created a survey or census to get that data. But our final type of data is not designed data.
Administrative data is data collected by the government or other organisations as part of going about their normal business. For example, every time someone is admitted to hospital, they fill in an admissions form and the hospital records why they were admitted. Every time you register your dog with your council, you provide information like their breed. Every time water quality is checked, it’s recorded.
In other words, administrative data is a byproduct. Nobody decided to collect it specifically to have a dataset, but in the process of doing their jobs, data was created. Administrative open data sets are produced by aggregating and confidentialising these records.
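As a sketch of what aggregating and confidentialising can look like, the example below counts hypothetical dog registration records by breed and hides small counts. Suppressing small counts is only one possible confidentialisation method, and the threshold of 6, the field names, and the function name are all assumptions for illustration.

```python
from collections import Counter

def aggregate_and_suppress(records, key, min_count=6):
    """Count records per category and hide counts below `min_count`."""
    counts = Counter(record[key] for record in records)
    # Small counts could identify individuals, so replace them with a marker.
    return {
        category: (count if count >= min_count else "suppressed")
        for category, count in counts.items()
    }

# Hypothetical registration records: owner details are dropped by the aggregation.
registrations = (
    [{"owner": f"owner_{i}", "breed": "Labrador"} for i in range(8)]
    + [{"owner": f"owner_{i}", "breed": "Huntaway"} for i in range(8, 10)]
)

published = aggregate_and_suppress(registrations, key="breed")
```

The published table contains only a count per breed, with the two Huntaway registrations hidden because the group is too small to release safely under the assumed threshold.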
Because of this, there are many, many administrative data sets. Some of them are released to the public. A benefit is that these data sets often cover topics we don’t have survey or census data about. A downside is that any issues or biases in the system are carried through into the dataset. For example, someone who is multiracial may only record one ethnicity on their hospital admissions form because they’re in a rush. Another well-known example is rape statistics: the NZ Police release these statistics with a warning because they know the offences are heavily under-reported.
Examples of administration data:
- Dog control statistics 2001–2019
- KiwiSaver Statistics: Annual statistics 2019
- Energy Efficiency - Clothes dryers sold by star rating 2002–2019
Things to remember when using administration data:
- The data is collected as a byproduct of providing services/doing daily business.
- Always check who collected the data and how it was collected.
- There is a wide, wide range of data available.
What do I need to know when using open data?
- Check the license to make sure it’s open data.
- Check it’s the most recent version of the dataset.
- Check how the data has been collected so you understand the quality and are aware of any potential issues.
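The checklist above could be turned into a small routine like the one below. This is only a sketch: the metadata field names (`license`, `last_updated`, `methodology_url`), the one-year age limit, and the short list of license identifiers are assumptions for illustration, not the format of any particular data portal.

```python
from datetime import date

# SPDX-style identifiers for a few well-known open licenses (not a complete list).
OPEN_LICENSES = {"CC-BY-4.0", "CC-BY-SA-4.0", "CC0-1.0"}

def review_dataset(metadata, today, max_age_days=365):
    """Return a list of warnings for a hypothetical dataset metadata record."""
    warnings = []
    if metadata.get("license") not in OPEN_LICENSES:
        warnings.append("license is missing or not a recognised open license")
    age_days = (today - metadata["last_updated"]).days
    if age_days > max_age_days:
        warnings.append(f"last updated {age_days} days ago; look for a newer release")
    if not metadata.get("methodology_url"):
        warnings.append("no methodology link; find out how the data was collected")
    return warnings

# Hypothetical record: open license, but old and with no methodology link.
example = {"license": "CC-BY-4.0", "last_updated": date(2019, 1, 1), "methodology_url": ""}
issues = review_dataset(example, today=date(2020, 6, 1))
```

A routine like this can only flag things to look into. You still need to read the license and the methodology yourself.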