Businesses can leverage data from virtually countless sources, including internal data, customer service encounters, and data from across the internet, to guide decisions and enhance their operations.
However, you can't immediately run machine learning and analytics tools on raw data. Your data must first undergo preprocessing for machines to correctly "read" or comprehend it.
Data preprocessing is a phase in the data mining and data analysis process that converts raw data into a format that computers and machine learning algorithms can understand and evaluate.
Text, photos, video, and other kinds of unprocessed, real-world data are disorganised: they lack a regular, consistent structure and often contain faults and inconsistencies.
Machines work best with neatly organised input, so structured data such as whole numbers and percentages is simple to compute with. Unstructured data such as text and images, however, must first be cleaned and prepared before analysis, and this is where data preprocessing comes in.
There are four tasks involved in data preprocessing.
The first task is known as data cleaning, which comprises:
- removing incorrect data
- correcting incomplete data
- correcting inaccurate data in data sets
- replacing missing values
Some strategies used in data cleaning include handling missing values and fixing “noisy” data. The purpose of fixing “noisy” data is to ensure that there are no unnecessary data points, no irrelevant data, and no data that is hard to group together. Common techniques for doing this include:
- Binning, which sorts wide-ranging values into smaller, similar groups (demonstrated in the sketch below).
- Regression, which helps decide which variables are needed in your analysis, so you are not overwhelmed with large amounts of unnecessary data.
- Clustering, which is generally used in unsupervised learning to find outliers when grouping data.
This is considered one of the most crucial stages as it ensures that your data is appropriately prepared for your downstream requirements.
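As a minimal sketch of these ideas, the snippet below uses pandas on a small made-up table (the column names and thresholds are assumptions, not part of the original example): missing values are filled, an implausible row is dropped, and a continuous column is binned into groups.

```python
import pandas as pd
import numpy as np

# Hypothetical raw dataset with missing and noisy values.
raw = pd.DataFrame({
    "age":    [25, np.nan, 34, 51, 29, 200],      # 200 is an implausible outlier
    "income": [32000, 41000, np.nan, 58000, 39000, 44000],
})

# Handle missing values: fill numeric gaps with the column median.
clean = raw.fillna(raw.median(numeric_only=True))

# Remove clearly incorrect rows (a simple rule-based noise filter).
clean = clean[clean["age"].between(0, 120)]

# Binning: sort the continuous "age" column into a few similar groups.
clean["age_band"] = pd.cut(clean["age"], bins=[0, 30, 45, 120],
                           labels=["young", "mid", "senior"])

print(clean)
```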
The second task is known as data integration, which combines data from multiple sources into a consistent, unified view. Collecting data from multiple places increases the likelihood of inconsistent and redundant data, which is why data integration is crucial for ensuring data integrity.
Ways in which data integration can be achieved include:
- Data consolidation - data is collected and grouped into a single data store, which is also known as data warehousing (see the sketch after this list).
- Data propagation - includes copying data from one location to another. This can either be synchronous or asynchronous data propagation.
  - Synchronous data propagation is when data is copied from one location to another, and the two locations are kept in sync with each other. This is typically done by having the two locations communicate with each other, so that when one location changes, the other is updated as well.
  - Asynchronous data propagation is when data is copied from one location to another, but the two locations are not kept in sync with each other. This can happen if the two locations are not able to communicate with each other or if there is a delay in the communication.
- Data virtualization - an interface provides a real-time, unified view of data from several sources, giving a single point of access where the data can be viewed.
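A small, hypothetical sketch of data consolidation (the table names and columns are invented for illustration): two sources describing the same customers are merged into a single store, which also surfaces the gaps and overlaps that later cleaning steps would need to handle.

```python
import pandas as pd

# Two hypothetical sources describing the same customers.
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Ana", "Ben", "Cara"]})
billing = pd.DataFrame({"customer_id": [2, 3, 4],
                        "total_spend": [120.0, 80.5, 42.0]})

# Consolidation: merge the sources into one store keyed on customer_id.
warehouse = crm.merge(billing, on="customer_id", how="outer")

# Integration surfaces gaps and redundancy that need follow-up cleaning.
print(warehouse)
```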
The third task is known as data reduction, which condenses the volume of data while still maintaining the integrity of the original data set. This is an important task, as data can still be hard to analyse even after cleaning. Data reduction not only makes the analysis process easier and more accurate, it also cuts down on data storage.
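The sketch below illustrates two simple reduction moves on a made-up table (the column names and the 10% fraction are assumptions): dropping attributes that add nothing to the analysis, and keeping a representative random subset of rows.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical wide, large table.
df = pd.DataFrame({
    "user_id": np.arange(100_000),
    "clicks": rng.integers(0, 50, 100_000),
    "constant_flag": 1,                      # carries no information
    "debug_note": "ok",                      # not needed downstream
})

# Attribute reduction: drop columns that add nothing to the analysis.
reduced = df.drop(columns=["constant_flag", "debug_note"])

# Numerosity reduction: keep a representative 10% random sample of rows.
reduced = reduced.sample(frac=0.10, random_state=42)

print(reduced.shape)   # far smaller, same essential signal
```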
The transformation of the data into a format suitable for data modelling is the last stage of data preprocessing. Data transformation converts the data into the format(s) you'll need for analysis and other downstream operations; data cleaning has already begun this process of modifying the data.
The following techniques facilitate data transformation:
- Combining all data into a uniform format
- Constructing new attributes from the given set of attributes
- Scaling data into a regularised range so it is easily comparable, for example 0 - 1 or 2 - 3 (see the sketch after this list)
- Feature selection, where the variables, characteristics and categories most important to the analysis are chosen. This can involve either supervised or unsupervised learning.
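As a small sketch of attribute construction and min-max scaling (the columns and the derived BMI attribute are assumptions for illustration only):

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [150, 162, 178, 185],
                   "weight_kg": [55, 70, 82, 95]})

# Attribute construction: derive a new attribute from existing ones.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Min-max scaling: squeeze each column into the 0 - 1 range so the
# attributes are directly comparable despite different original units.
scaled = (df - df.min()) / (df.max() - df.min())

print(scaled.round(2))
```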
Below is a simplified example of how data preprocessing may work.
We have chosen four well-known technology companies you might be familiar with.
The dataset provides two variables:
- founder name
- company name
| | Founder Name | Company Name |
|---|---|---|
| #1 | Steve Jobs | Amazon |
| #2 | Adolf Dassler | Adidas |
| #3 | Mark Zuckerberg | |
| #4 | Jeff Bezos | Apple |
Can you spot the data mismatch?
The data provided in #1 and #4 is inaccurate: the founders have been assigned to the wrong companies.
To correct this, we can use data cleaning and data transformation to manually fix the problem.
| | Founder Name | Company Name |
|---|---|---|
| #1 | Steve Jobs | Apple |
| #2 | Adolf Dassler | Adidas |
| #3 | Mark Zuckerberg | |
| #4 | Jeff Bezos | Amazon |
Once the issue is fixed, data reduction can be performed. In this case we are going to sort the information in alphabetical order (A - Z) according to the company name.
| | Founder Name | Company Name |
|---|---|---|
| #1 | Adolf Dassler | Adidas |
| #2 | Jeff Bezos | Amazon |
| #3 | Steve Jobs | Apple |
| #4 | Mark Zuckerberg | |
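If this toy example were handled programmatically, it might look something like the pandas sketch below (the column names and the fix mapping are assumptions; the missing company name is deliberately left blank, as in the tables above).

```python
import pandas as pd

# The toy dataset with the mismatched rows.
companies = pd.DataFrame({
    "founder_name": ["Steve Jobs", "Adolf Dassler", "Mark Zuckerberg", "Jeff Bezos"],
    "company_name": ["Amazon", "Adidas", None, "Apple"],
})

# Data cleaning / transformation: swap the two misassigned companies.
fixes = {"Steve Jobs": "Apple", "Jeff Bezos": "Amazon"}
companies["company_name"] = (companies["founder_name"].map(fixes)
                             .fillna(companies["company_name"]))

# Sort alphabetically by company name, as in the final table above.
companies = companies.sort_values("company_name", na_position="last").reset_index(drop=True)

print(companies)
```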
Data preprocessing can be a lengthy and tedious task. However, once you have this set up you will be on your way to ensuring that your data analysis is accurate and reliable.
Data sampling is the process of looking at a small subset of all the data to find important or specific information in the broader data set. The aim of data sampling is to obtain samples that can represent the given population. An example may include finding the percentage of people who own a car in a city that has access to public transport.
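A tiny numerical sketch of the car-ownership example (the population size and ownership rate are invented): a random sample is drawn, and the sample proportion is used to estimate the city-wide rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 1 means the resident owns a car, 0 means they do not.
population = rng.choice([0, 1], size=500_000, p=[0.45, 0.55])

# Draw a random sample and estimate the city-wide car-ownership rate from it.
sample = rng.choice(population, size=1_000, replace=False)
print(f"Estimated ownership: {sample.mean():.1%}")
print(f"True ownership:      {population.mean():.1%}")
```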
There are a few techniques that can be used to help sample from a target population.
These can be categorised into two groups:
- Probability sampling
  - Simple random sampling
  - Stratified sampling
  - Cluster sampling
- Non-probability sampling
  - Convenience sampling
  - Voluntary response sampling
  - Snowball sampling
Probability sampling
One of the key categories of sampling techniques is probability sampling. With probability sampling, every member of the population has a chance of being chosen. It is primarily employed in quantitative research, when you wish to obtain results that are representative of the entire population.
Some examples of probability sampling include:
- Simple random sampling
- Stratified sampling
- Cluster sampling
Simple random sampling
Simple random sampling, or SRS, is one of the simplest sampling techniques: subjects are chosen at random, and every member of the population has an equal chance of being selected.
Each member is typically given a number, and a lucky draw is then conducted to pick the sample.
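A minimal sketch of simple random sampling with pandas (the frame of 10,000 numbered customers and the sample size of 200 are assumptions):

```python
import pandas as pd

# Hypothetical frame of 10,000 customers, each numbered once.
customers = pd.DataFrame({"customer_id": range(1, 10_001)})

# Simple random sampling: every customer has the same chance of selection.
srs = customers.sample(n=200, random_state=7)

print(len(srs), srs["customer_id"].head().tolist())
```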
Stratified sampling
In a stratified sampling procedure, the population is divided into smaller groups (also known as strata), according to specific criteria (age, gender, income or profession). You can choose a sample for each subgroup using random or systematic sampling after creating the subgroups. You can reach more exact conclusions with this strategy since it guarantees that each subgroup is fairly represented.
An example may include a company that has 200 male employees and 400 female employees; based on this, the population is divided into two subgroups, male and female, and a sample is drawn from each.
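That example could be sketched as follows, assuming a hypothetical staff list and a 10% sampling fraction per stratum:

```python
import pandas as pd

# Hypothetical staff list: 200 male and 400 female employees.
staff = pd.DataFrame({
    "employee_id": range(600),
    "gender": ["male"] * 200 + ["female"] * 400,
})

# Stratified sampling: take the same 10% fraction from each subgroup,
# so both strata stay fairly represented in the sample.
sample = (staff.groupby("gender", group_keys=False)
               .sample(frac=0.10, random_state=1))

print(sample["gender"].value_counts())   # 20 male, 40 female
```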
Cluster sampling
Cluster sampling divides the population into smaller groups; each group is usually diverse and likely to share traits with the population as a whole.
Rather than choosing a sample from every subgroup, clustering is done by choosing entire subgroups at random.
Here is an example: a business has over 50 offices in 5 cities across New Zealand, all with roughly the same number of employees in similar job roles. The researcher randomly selects 4 to 5 offices, each an entire subgroup, and uses them as the sample.
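A rough sketch of that scenario (the office count, staff numbers, and the choice of 5 offices are all invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical business: 50 offices, each with its own employees.
employees = pd.DataFrame({
    "office_id": np.repeat(np.arange(1, 51), 30),   # 30 staff per office
    "employee_id": np.arange(50 * 30),
})

# Cluster sampling: pick whole offices at random, then keep every
# employee in the chosen offices rather than sampling within each one.
chosen_offices = rng.choice(employees["office_id"].unique(), size=5, replace=False)
cluster_sample = employees[employees["office_id"].isin(chosen_offices)]

print(sorted(chosen_offices), len(cluster_sample))
```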
Non-Probability sampling
With non-probability sampling, not every member of the population has a chance of being selected for the sample. Although easier and less expensive, this sampling technique carries a significant risk of bias. It is frequently employed in qualitative and exploratory research with the goal of gaining a basic understanding of the community.
Some examples of non-probability sampling include:
- Convenience sampling
- Voluntary response sampling
- Snowball sampling
Convenience sampling
In this sampling technique, the researcher only picks the people who are closest to them. Although it is simple to collect data in this manner, it is impossible to determine if the sample is representative of the total population. The only criteria are that participants be available and willing. Because convenience samples are susceptible to biases like gender, colour, age, and religion, they are not representative of the population.
An example may include an owner standing outside their business and asking customers or people walking by to complete a survey.
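As a loose illustration (the passer-by data is entirely made up), a convenience sample amounts to taking whoever is easiest to reach, such as the first people recorded, which is exactly what makes it prone to bias:

```python
import pandas as pd

# Hypothetical stream of passers-by, recorded in the order they appear.
passers_by = pd.DataFrame({"person_id": range(1, 501),
                           "hour_of_day": [9 + (i // 60) for i in range(500)]})

# Convenience sampling: survey whoever happens to walk past first.
# The sample is easy to collect but skewed towards the morning crowd.
convenience_sample = passers_by.head(50)

print(convenience_sample["hour_of_day"].value_counts())
```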
Voluntary response sampling
Like convenience sampling, voluntary response sampling depends only on participants' willingness to participate. However, with voluntary response sampling, people will choose to volunteer themselves rather than being selected by the researcher.
An example may include the business owner sending out surveys to their employees and giving them a choice as to whether they would like to take part or not.
Snowball sampling
In a snowball sampling procedure, research participants recruit other participants for the study. This method is used when it is difficult to find the people a study needs. Snowball sampling gets its name from the way the sample grows larger and larger as it goes, much like a rolling snowball.
An example may include a researcher wanting to know about the experiences of full-time/single motherhood in a city. Since there is no detailed list of where to find them, the only way to get the sample is to get in touch with one single mother, who will then put you in touch with others in the particular area.
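A simple sketch of the idea, using an invented referral map and a breadth-first walk outward from a single seed participant:

```python
from collections import deque

# Hypothetical referral links: each participant names others they know.
referrals = {
    "alice": ["brooke", "carla"],
    "brooke": ["dana"],
    "carla": ["eve", "dana"],
    "dana": [],
    "eve": ["fran"],
    "fran": [],
}

# Snowball sampling: start from one known participant and follow
# referrals outward until the sample stops growing or hits a size cap.
def snowball(seed, max_size=10):
    sample, queue, seen = [], deque([seed]), {seed}
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        for contact in referrals.get(person, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return sample

print(snowball("alice"))   # ['alice', 'brooke', 'carla', 'dana', 'eve', 'fran']
```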
| Welcome to Getting Started with Enterprise Data Science | |
|---|---|
| Total Tasks: | 3 Modules |
| Description: | Explore how people, processes, and technology interact in a data science project lifecycle, tackling real-world industry challenges. |
| Total Time Budget: | 1 hour |