Introduction to Data Science

Welcome to Foundations of Data Science. This module will provide you with opportunities to gain an understanding of foundational concepts, including the data analytics pipeline, the management of large-scale data, and how analytics and machine learning capabilities are built. You will also learn how data scientists develop machine learning and modelling platforms using libraries.

Some of the content of this module has been curated by IBM and is delivered on the IBM KeySkills platform.

Who is IBM? 

IBM (International Business Machines Corporation) is an American multinational technology company with operations in over 170 countries. It is one of the world's largest providers of information technology products and services, including computer hardware, software, hosting, and consulting. IBM is also a major research organisation, holding the record for the most U.S. patents generated by a business for 26 consecutive years.

IBM offers a range of tools and solutions for data science, as well as professional certification options.

How to navigate this module 

Each of the following three topics contains learning content and activities, and concludes with a link to an IBM KeySkills course.

  • Data Science Methodology  
  • Data Visualisation with Python 
  • Data Modelling and Machine Learning

You will have received an enrolment key and login credentials for each of these courses at the beginning of the module.

When you have completed a course on the IBM KeySkills platform, return here to continue your learning.

These topics and learning opportunities have been curated to provide the experience necessary to achieve the stated learning outcomes of the module.

On successful completion of this module, students will be able to:

  1. Demonstrate an understanding of common mathematical models used in machine learning
  2. Demonstrate an understanding of the core concepts of machine learning algorithms and models
  3. Create and apply machine learning pipelines to solve common practical problems

Data science is the study of data.

It involves developing methods of recording, storing, and analysing data to effectively extract useful information to make informed decisions. The objective is to make sense of large quantities of data to find patterns and insights that can help make better decisions for a business or organisation.

Data science can help us understand problems and solutions more effectively, and it can help find solutions to problems that we couldn't have thought of before.

Organisations use data science to improve their products, gain insight into day-to-day business processes, or improve product recommendations. It is ultimately used to solve real problems and guide businesses in the right direction.

The goal is to turn data into information, and information into insight.
Carly Fiorina

Data science enables businesses to make sense of huge amounts of data from numerous sources and gain insights that support more informed decisions.

Different industries need data science for a variety of reasons. Some industries need data science to improve their product or service offerings. Data science is widely employed in many different business sectors, including:

  • Marketing
  • Healthcare
  • Banking
  • Finance

Retail industry - grocery stores may use data science to improve their product recommendations to customers.

Health industry - providers may use data science to identify patterns and trends in patient data, and to make more informed decisions about where to allocate resources and how to improve patient care.

Businesses in general may use data science to improve their business operations.

For more information about data science and how it is used, check out the following video.

There are many roles within data science, some of which include:

  • Data Scientist
  • Data Analyst
  • Data Engineer
  • Database Administrator
  • Data Visualisation Engineer
  • Big Data Engineer
  • Market Analysis Specialist

Individuals working with data have a variety of roles and responsibilities, which may include:

  • Developing and using data-analysis techniques to understand and predict patterns in data
  • Creating and using models to make predictions or forecasts
  • Helping to design and implement data-collection and analysis programs
  • Creating and managing databases of data
  • Creating and using software to communicate and analyse data
  • Working with other members of the data-analysis team to produce reliable and accurate insights

Data scientists typically have a strong technical background, along with experience in data analysis, modelling, and data-collection methods. They also need strong problem-solving and collaboration skills.

There is no one process that all data scientists follow, as the process can vary depending on the data science problem being solved. However, some common steps that data scientists may take include:

  1. Collecting data
    The first step in data science is collecting the data that will be used to solve the problem. This can involve gathering data from a variety of sources, such as surveys, data sets from scientific experiments, or data from social media.
  2. Analysing the data
    Once the data has been collected, the data scientist will need to analyse it in order to find the information that is relevant to the problem. This may involve looking at the data structure, analysing the data using statistical methods, and looking for patterns.
  3. Building models
    Models are used to make predictions about the data. These predictions may be about the data itself, or about how it will behave in future circumstances.
  4. Evaluating the models
    Once the models have been built, the data scientist will need to evaluate them in order to decide whether they are accurate. This may involve testing the models against new data or comparing them to models that are known to be accurate.
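
To make these four steps concrete, below is a minimal sketch of the workflow in Python using the pandas and scikit-learn libraries. The file name customer_data.csv, the churned column, and the choice of logistic regression are hypothetical illustrations, not part of this module's materials.

```python
# A minimal sketch of the four-step process above, using pandas and
# scikit-learn. The file and column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Collecting data: load a data set (here, a hypothetical CSV file).
df = pd.read_csv("customer_data.csv")

# 2. Analysing the data: inspect its structure, summary statistics,
#    and simple patterns such as correlations.
print(df.shape)
print(df.describe())
print(df.corr(numeric_only=True))

# 3. Building models: fit a simple classifier that predicts a label
#    (hypothetical "churned" column) from the other columns, which this
#    sketch assumes are all numeric.
X = df.drop(columns=["churned"])
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Evaluating the models: test the model against data it has not seen.
predictions = model.predict(X_test)
print("Accuracy on held-out data:", accuracy_score(y_test, predictions))
```

Holding back a test set, as in step 4 of the sketch, is one common way of checking a model against new data before trusting its predictions.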

What is big data?

Gartner's widely cited definition describes big data as:

Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making and process automation.

Every day, companies receive huge amounts of data from their interactions and processes. This massive amount of data may be produced by a range of sources, including:

  • social media platforms
  • weblogs
  • Internet of Things (IoT) devices, such as the sensors found in smart devices, and many more.

Traditional database management systems are not able to handle this vast amount of data, which is where big data comes into play. Big data is a collection of both old and new technologies that allows firms to gather useful information and insight.

In practice, big data is the ability to manage a significant amount of diverse data at the speed necessary to allow for real-time analysis and response.

A wealth of information and patterns can be discovered through the use of big data. Businesses may use this information to:

  • Increase revenue
  • Improve products and services
  • Better understand customer behaviour
  • Manage supply chains
  • Detect fraud
  • Identify operational issues
  • Create personalised marketing campaigns

For more information on what big data is, check out the following video.


Big data characteristics

Big data is typically broken down into five characteristics:

  1. Volume - refers to the amount of data generated
  2. Velocity - refers to the speed at which new data is generated and the speed at which it moves around; think of social media messaging
  3. Variety - refers to the different types of data that can be used
  4. Veracity - refers to the trustworthiness or reliability of your data
  5. Value - refers to getting the most value out of your data

Hadoop and cloud support

With data growing at an exponential rate, businesses need to use, store, and process more information than ever before. Hadoop is an open-source software framework that is effective at handling massive amounts of data.

It enables a network of computers to work together to address problems that call for large amounts of data and processing power. Due to its tremendous scalability, it can support computing on anything from a single server to a cluster of thousands of computers.

There are three main components of Hadoop:

  1. Hadoop Distributed File System (HDFS) - the storage layer, which spreads data across the machines in the cluster
  2. MapReduce - the programming model for processing that data in parallel
  3. YARN - the resource manager that schedules jobs across the cluster
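
To give a feel for how MapReduce works, below is a minimal word-count sketch written for Hadoop Streaming, which lets any program that reads standard input and writes standard output serve as a map or reduce step. The file name and the local test pipeline are illustrative assumptions; the exact job-submission command depends on your Hadoop installation.

```python
# wordcount_streaming.py - a minimal word-count sketch for Hadoop Streaming.
# Run the same file as the mapper ("map" argument) and the reducer
# ("reduce" argument), or test locally with an ordinary shell pipeline:
#   cat input.txt | python wordcount_streaming.py map \
#     | sort | python wordcount_streaming.py reduce
import sys

def mapper():
    # Map step: emit "word<TAB>1" for every word in the input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce step: input arrives sorted by key, so all counts for the
    # same word are adjacent and can be summed in a single pass.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

The same division of labour scales from a single machine, as in the local pipeline above, to the thousands-of-node clusters that HDFS and YARN are designed to support.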

Additional information on Hadoop and its components can be found here.
