Clean the data

The first stage of the data cleaning process is to remove anything that does not contribute to the desired work outcome.

Unwanted data also includes anything that fails to meet quality standards.

The Australian Bureau of Statistics (ABS) is responsible for setting many industry statistical standards for validation. The ABS Data Quality Framework (ABS DQF) provides the standards for assessing and reporting on the quality of statistics. It provides conceptual information in the form of questions for each quality dimension to determine whether the data is suitable.

Learn more about data quality here.

Assess the validity

Data validity is the degree to which information conforms to the defined format or set of rules.

Validity criteria and their indicators are listed below.

  • Values match the data type
      ◦ Numerals have been inserted wherever numbers are required
      ◦ Whole numbers have been inserted unless decimals are required
      ◦ The word true or false has been entered where one or the other is required
      ◦ Numbers and words have been inserted in the correct order where a residential address is required
  • Data falls within the correct ranges
      ◦ A day between Monday and Sunday has been entered where a day of the week is required
      ◦ A date in DD/MM/YYYY format has been entered, with days between 1 and 31, months between 1 and 12, and years in the correct century
  • Mandatory data has been included
      ◦ There are no empty cells
  • Data is expressed in the correct format
      ◦ There are no spaces between numbers unless required
      ◦ Data is entered in the correct order
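
Checks like these can be automated. The following is a minimal sketch using Python and pandas; the sample data and column names are hypothetical:

```python
import pandas as pd

# Hypothetical sample data containing common validity problems
df = pd.DataFrame({
    "quantity": ["12", "7", "three"],                          # should be whole numbers
    "order_date": ["15/03/2022", "32/01/2022", "01/13/2022"],  # should be DD/MM/YYYY
    "day": ["Monday", "Funday", "Sunday"],                     # should be a day of the week
})

# Values match the data type: flag entries that are not numeric
df["quantity_valid"] = pd.to_numeric(df["quantity"], errors="coerce").notna()

# Data falls within the correct ranges: flag dates that are not valid DD/MM/YYYY
df["date_valid"] = pd.to_datetime(
    df["order_date"], format="%d/%m/%Y", errors="coerce"
).notna()

# Data falls within the correct ranges: flag values that are not days of the week
valid_days = {"Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"}
df["day_valid"] = df["day"].isin(valid_days)

print(df)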
Read

Information about different types of data validation is available here.

Assess the accuracy

Data accuracy is a measure of truth or correctness.

The ABS DQF recommends asking yourself the following questions to assess data accuracy:

  • How was the data collected?
  • Has the data been adjusted in any way – if so, how much was adjusted and on what data items?
  • What is the sample size? 
  • What is the collection size? 
  • What are the standard errors for the key data items? 
  • Are there sensitive questions or topics that are collected that may cause bias?
  • What steps have been taken to minimise processing errors? 
  • Are any parts of the population unaccounted for in the data collected?

Data is accurate if it reflects actual circumstances.

Assess the completeness

Completeness is a measure of sufficiency: whether the data contains everything necessary to test your hypothesis.

Data is complete when every field has been populated as required. Incomplete data has missing variables and may result in a biased analysis.
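
As a quick way to gauge completeness, you can count empty cells with pandas. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data with empty cells
df = pd.DataFrame({
    "customer": ["Acme", None, "Birch Ltd"],
    "sales": [1200.0, 950.0, None],
})

# Mandatory data has been included: count empty cells per column
print(df.isna().sum())

# List the incomplete rows that could bias the analysis
print(df[df.isna().any(axis=1)])
```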

Read

Instructions to manage incomplete data are available here.

Assess the consistency

Consistent data matches information from other sources. If two pieces of data are inconsistent, determine which is accurate, for example by looking for the most recent entry or by establishing the reliability of each source.
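
One simple way to resolve inconsistencies programmatically, assuming each record carries a last-updated timestamp, is to keep only the most recent entry. A minimal pandas sketch with hypothetical records:

```python
import pandas as pd

# Hypothetical: the same customer appears twice with conflicting
# addresses; prefer the most recently updated entry.
records = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "address": ["12 High St", "34 Low Rd", "7 Mill Ln"],
    "last_updated": pd.to_datetime(["2022-01-10", "2022-06-02", "2022-03-15"]),
})

# Sort by timestamp, then keep the last (newest) entry per customer
resolved = (records.sort_values("last_updated")
                   .drop_duplicates("customer_id", keep="last"))
print(resolved)
```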

Assess the uniformity

Uniform data uses the same units of measurement throughout, for example, dollars and cents, kilograms, miles, metres, centimetres, or degrees Celsius.

Make sure data is converted to a single unit of measurement (preferably metric) to ensure uniformity.
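
A minimal sketch of unit conversion with pandas, assuming a hypothetical data set that mixes miles and kilometres:

```python
import pandas as pd

# Hypothetical mixed-unit distance data
df = pd.DataFrame({
    "distance": [5.0, 3.2, 10.0],
    "unit": ["miles", "km", "miles"],
})

# Convert everything to a single metric unit (kilometres)
MILES_TO_KM = 1.609344
df["distance_km"] = df.apply(
    lambda row: row["distance"] * MILES_TO_KM if row["unit"] == "miles"
    else row["distance"],
    axis=1,
)
df["unit"] = "km"
print(df)
```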

Assess the relevance

The ABS DQF recommends asking yourself the following questions to assess data relevance:

  • About whom, or what, was the data collected?
  • What levels of geography are data available for?
  • What key data items are available?
  • If rates and percentages have been calculated, are the numerators and denominators for the same data source(s)?

Relevant data is also recently generated, current, and applicable to the correct location or population.

Data entry errors are among the most common and happen when data is entered into a system incorrectly. Transcription errors occur when data is copied from one source to another and mistakes are made in the process. Incorrect calculations are another source of bad data; they occur when formulas are entered incorrectly or data points are miscalculated.

Mismatched data sets can often lead to inaccurate statistical analysis. There are a few ways to deal with mismatched data sets. One way is to discard any data that doesn't match up, but this can lead to problems if there is a large amount of mismatched data. Another way to deal with mismatched data sets is to try and find relationships between the different variables. This can be a difficult and time-consuming task, but it can often lead to more accurate results.

A common way to deal with the gaps left by mismatched data sets is a technique called imputation. This involves replacing any missing values with estimated values.
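
A minimal sketch of imputation with pandas, using simple mean imputation on a hypothetical sales column:

```python
import pandas as pd

# Hypothetical sales figures with two missing values
df = pd.DataFrame({"monthly_sales": [1200.0, None, 980.0, None, 1100.0]})

# Simple mean imputation: replace missing values with the column mean.
# Other estimators (median, interpolation, model-based imputation)
# follow the same pattern.
df["monthly_sales"] = df["monthly_sales"].fillna(df["monthly_sales"].mean())
print(df)
```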

Watch

Watch the video to learn more.

Make sure all data suits the work outcome. For example, if an estimate of costs in Australia is required, convert foreign currency into Australian dollars. Likewise, if results are required for multiple locations, split the data into individual groups.
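
A minimal pandas sketch of both steps; the cost data is hypothetical and the exchange rate is a placeholder, not a real rate:

```python
import pandas as pd

# Hypothetical cost data in mixed currencies and locations
df = pd.DataFrame({
    "location": ["Sydney", "Auckland", "Melbourne"],
    "cost": [500.0, 300.0, 450.0],
    "currency": ["AUD", "NZD", "AUD"],
})

# Convert foreign currency to Australian dollars (placeholder rate)
NZD_TO_AUD = 0.92
df.loc[df["currency"] == "NZD", "cost"] *= NZD_TO_AUD
df["currency"] = "AUD"

# Split results into individual groups per location
for location, group in df.groupby("location"):
    print(location)
    print(group)
```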

Data formatting can also involve changes to suit system requirements.

Read

Information about data formatting is included here and here.

If your analysis is required in the form of a spreadsheet, identify appropriate formatting conventions by reading organisational policies and procedures.

If data is missing, investigate by looking for other sources of information. If a sales figure for the month of May is missing, you may be able to find it in a Profit and Loss Statement. Likewise, if a customer name is missing, you may be able to find it in a Customer Management System (CMS).
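
A minimal pandas sketch of filling a gap from a second source; the figures and column names here are hypothetical:

```python
import pandas as pd

# Main data set with a missing May sales figure
sales = pd.DataFrame({
    "month": ["April", "May", "June"],
    "sales": [4200.0, None, 4650.0],
})

# Hypothetical second source, e.g. figures taken from a
# Profit and Loss Statement
pnl = pd.DataFrame({
    "month": ["May"],
    "sales_from_pnl": [4380.0],
})

# Fill the gap from the alternative source, then tidy up
sales = sales.merge(pnl, on="month", how="left")
sales["sales"] = sales["sales"].fillna(sales["sales_from_pnl"])
sales = sales.drop(columns="sales_from_pnl")
print(sales)
```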
