Optimise big data sample results and documentation

In this topic, you will learn how to further optimise the big data sample results and prepare relevant documentation.

After validating the output of the captured big data sample, the final phase is to perform data cleansing tasks on the sample dataset.

First, let us understand some important concepts behind data quality and data cleaning.

What is data quality?

To determine the data quality, it is important to examine some of the characteristics of the data in the sample dataset.

A diagram showing criteria of data quality

Data quality is traditionally split into six dimensions: 46

  • Accuracy
  • Completeness 
  • Consistency
  • Timeliness 
  • Validity
  • Integrity.

However, when looking at a single dataset, the quality considerations may be narrowed down to two main dimensions: validity and accuracy. 46

Validity

This is the degree to which the data conforms to defined business rules or constraints. 47

To determine the validity of the data, ask the following questions (a hedged example follows the list).

  • Is the data in a specific format, type or size? 46
  • Does it conform to business rules or best practices? 46
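
As a minimal sketch of a validity check (the 'Sample of Sales' table, [Order_Date] column and date window here are assumptions for illustration, not part of this dataset), a DAX measure can count records that break a format or range rule:

  -- Hypothetical validity check: counts rows where Order_Date is
  -- blank or falls outside the expected reporting window.
  Invalid Order Dates =
  COUNTROWS(
      FILTER(
          'Sample of Sales',
          ISBLANK('Sample of Sales'[Order_Date])
              || 'Sample of Sales'[Order_Date] < DATE(2020, 1, 1)
              || 'Sample of Sales'[Order_Date] > DATE(2021, 12, 31)
      )
  )

A result of zero suggests the column conforms to the expected type and range.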

Accuracy

This is the degree to which the data is close to the true values. 47

To determine the accuracy of data, ask the following questions. 

  • Is the data correct?
  • Does the data contain true values and figures?
  • How well does the piece of information reflect reality? 46

Watch the following video to understand what data quality is and some common factors that can cause data quality issues.

After watching the video, complete the following questions.

What is ‘data cleaning’?

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. 46

Note: Data cleaning is also known by other terms such as data cleansing, data wrangling, data remediation, and data munging. All these terms refer to the same data cleaning process.

Cleaning the data in the sample dataset will:

  • increase overall productivity
  • allow for the highest quality of information
  • lead to better reporting outcomes
  • ensure efficient business practices
  • ensure that the right decisions are made based on accurate data.

Watch the following short video to further understand what data cleaning is, with some examples of what is involved in the data cleaning process.

After watching the video, answer the following question.

It is important to always follow the standard practices in your industry and your organisation’s relevant policies and procedures when performing data cleansing tasks.

Organisational procedures

Organisations may have their own data cleansing and quality assurance policies in place that would need to be adhered to.

An organisation’s data cleansing and quality policy helps to: 48

  • ensure that all business data adheres to the principles of data quality and is fit for its desired purpose
  • establish processes across all designated company systems to ensure that all business data is accurate, adequate, relevant and timely.

Industry practices

A diagram showing parts of data cleaning

Consider the following industry practices for data cleaning. 49

  • Parsing – this involves converting values from the sample dataset into a data structure that is more suitable for representation.
  • Correcting – this involves fixing problems in the data (e.g. misspellings, typos and other errors).
  • Standardising – this involves following a standard set of rules for formatting the data (see the sketch after this list).
  • Matching – this involves searching for and matching similar records within and across the parsed, corrected and standardised data.
  • Consolidating – this involves analysing and identifying relationships between matched records and then merging them into one representation.
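
As a hedged sketch of the correcting and standardising steps (it borrows the 'Sample of Customer' table and columns used in the tasks later in this topic), stray whitespace can be trimmed and a single casing convention enforced before matching:

  -- Sketch only: TRIM removes stray whitespace (correcting) and
  -- UPPER enforces one casing convention (standardising).
  Customer Standardised =
  SELECTCOLUMNS(
      'Sample of Customer',
      "Customer ID", TRIM('Sample of Customer'[Customer_ID]),
      "Customer Segment", UPPER(TRIM('Sample of Customer'[Segment]))
  )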

Watch the following video for a briefing on the main approach to data cleansing.

Perform data cleansing

Data cleansing is first performed at the source system (which is the sample dataset) and then on the target output tables (fact and dimension tables) in the data model.

Perform data cleaning tasks at the source (sample dataset)

  • Fixing structural issues in the dataset – There may be inconsistent data types and formats specified in the sample dataset that should be corrected. These corrections, when done at the source sample dataset, will flow through to the rest of the fact and dimension tables in the model.
  • Removal of unwanted/irrelevant data – There may be some data fields that do not fit into the scope of the specific data analysis project. These data fields can be removed from the source sample dataset to ensure all data in the sample dataset is relevant.

Perform data cleaning tasks at the target output (fact and dimension tables)

  • Removal of duplicate records – After segregating data from the sample dataset into multiple dimension tables, these tables are likely to contain duplicate values. Therefore, de-duplication is one of the most important tasks in the data cleansing process. 
  • Removal of irrelevant or incomplete data – There can be blank records or null values included in the data, which need to be removed or filtered out from the dimension tables.

Note: It is important to ensure that the dimension tables have the correct data type and data format flowing through from the source table. If not, this needs to be corrected as part of the data cleaning process.
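
As one hedged illustration (the table and column names are hypothetical), a target-table query can coerce data types explicitly rather than relying on what flows through from the source:

  -- Sketch: convert a text date and a numeric amount to the
  -- data types expected in the fact table.
  Sales Fact =
  SELECTCOLUMNS(
      'Sample of Sales',
      "Order Date", DATEVALUE('Sample of Sales'[Order_Date]),
      "Amount", CONVERT('Sample of Sales'[Amount], CURRENCY)
  )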

Data cleansing process

When preparing to perform data cleansing tasks on a big data sample, one should consider the following:

  1. Plan for the task: List the details of the data cleansing tasks that need to be carried out in a logical order. Be aware of, and make notes on, the industry practices and organisational procedures that need to be followed during the process.

    Note: It is important to make notes of all the data cleaning tasks that were performed on the sample dataset so that these corrections can be made to the source applications.

  2. Carry out the data cleansing tasks according to the plan and document all evidence of the process that captures the applied steps and modified queries.

Watch the following video to learn how to do this process.

Perform optimisations on the big data sample

Other optimisation tasks that can be performed on the big data sample may include:

  • ensuring appropriate relationships are formed between the dimension tables and the fact tables in the dataset 
  • ensuring that the report is optimised for time intelligence by creating a Date table as required for any transactional datasets.

It is important to implement these optimisations according to industry standards and relevant organisational procedures.

Perform optimisations in the data model

This includes ensuring that:

  • a star schema is implemented as the data model
  • the most appropriate relationship types (e.g. one-to-one, one-to-many) are established between the dimension tables and the fact tables.

Watch the following video to learn about this process.

Optimise report for time intelligence for transactional datasets

A Date table should be created to optimise reports for time intelligence, especially when dealing with transactional datasets.

The following is an example of a procedure that can be used to create a Date table.

Procedure to create a Date table

Refer to the following guidelines on how to create date tables in Power BI – Create date tables in Power BI Desktop.

  1. Use the following DAX script to create a new Date table:

    Date = ADDCOLUMNS(
      CALENDARAUTO(),  -- generates a contiguous date range covering the model
      "Year", YEAR([Date]),
      "Month", FORMAT([Date], "mmmm"),
      "Month Number", MONTH([Date]),
      "Quarter", FORMAT([Date], "\QQ")
    )
  2. Create a relationship between the appropriate Date column in the Fact table and the Date table. Use the drag-and-drop features in the Power BI Model view.
  3. It is good practice to hide the corresponding Date column in the Fact table so that it is not used accidentally.
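
Once the Date table is related to the fact table, time-intelligence functions work against it. As a minimal sketch (a Sales fact table with an [Amount] column is assumed here):

  -- Sketch: year-to-date total calculated over the dedicated Date table.
  Sales YTD =
  TOTALYTD(SUM(Sales[Amount]), 'Date'[Date])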

Watch the following video that will show the step-by-step process of how to create a date table in Power BI.

Finally, you should collate your validated output – that is, place the cleaned data into a file, ready for analysis. You should complete one last check of the data to confirm it is free of corruption.

As was mentioned previously, it is important to conduct validation checks and run through the previously formulated test case scenarios each time there’s a change in the data model or the underlying sample dataset.

Due to performing data cleansing tasks, there can be significant changes in the sample dataset and the data model. Therefore, a validation check should be performed, and the test results should be documented to identify any new issues in the sample dataset.

Confirm the absence of big data corruption in the sample

The results of the validated output of testing should include:

  • the actual results and the final test outcome (Pass/Fail) of the new test run against the test case scenario document
  • the date when the validation was performed
  • any additional information or issues related to the validation activity, recorded in the comments/notes section.

The results from this validation should confirm the absence of big data corruption in the sample dataset.

Collate validated output

The associated supporting evidence gathered during the validation of the output may include:

  • evidence of the test scripts run during the validation check
  • evidence of the performance data recorded as part of the validation checks 
  • completed test run document with final test outcome clearly recorded as Pass/Fail.

To check your understanding, complete the following questions.

Recommend configuration optimisations

When recommending configuration optimisations for sample testing big data, it is common to look at performance optimisation throughout the entire data flow. This includes:

  • Data extraction – This refers to the process of collecting and retrieving data from different sources. One can explore the benefits of using specialised extraction tools (e.g. ETL tools) to further optimise and simplify the process. 
  • Movement of data – This refers to the movement of data from one system to another and how it can be optimised. For example, data may need to flow from the data store (central repository) to the analytic platform and then onto the publishing service. 
  • Data modelling 
    • Have we tuned performance, e.g. scaling processing resources? 
    • Have we optimised the model for Power BI?
      • For example, Power BI cannot connect to a multi-dimensional model (only a flattened extract), whereas a Tabular model provides an optimal model for Power BI to utilise.
  • Data visualisation 
    • Avoid creating a table with so much data that it requires a scroll bar; having too many visuals on the same page may also cause rendering issues.
    • Optimise memory use by choosing appropriate visualisation options.
    • Certain DAX functions that are memory intensive can be optimised for efficiency using various query optimisation techniques (see the sketch after this list).
    • Avoid unnecessary bi-directional or cross-directional filters.
    • Disable unwanted options, such as the automatic date hierarchy option.
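
As a hedged example of one such query optimisation technique (the measure, table and column names are assumptions), DAX variables stop the same expression being evaluated more than once within a measure:

  -- Sketch: each VAR is evaluated once and reused, instead of
  -- re-computing the sums inside the RETURN expression.
  Sales YoY % =
  VAR CurrentSales = SUM(Sales[Amount])
  VAR PriorSales =
      CALCULATE(SUM(Sales[Amount]), SAMEPERIODLASTYEAR('Date'[Date]))
  RETURN
      DIVIDE(CurrentSales - PriorSales, PriorSales)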

Overview of ‘required personnel’

When communicating the final results of the big data sample test, it is essential that you speak with the required personnel in your organisation who are expecting the final results of the sample test. The ‘required personnel’ may include: 

  • Your immediate line manager (e.g. supervisor, project manager) 
  • Project sponsor (e.g. Chief Data Officer) 
  • Key stakeholders (e.g. Heads of departments)

When communicating the final big data sample test results, the general rules to keep in mind are: 50

  • make the results visually appealing
  • present shapes, not numbers 
  • use data visualisation software 
  • know your audience – and give them what they want 
  • practise communication skills.

Prepare visual representations

It is essential that you are able to develop some form of a visual representation that can be used to report on the key outcomes of the sample testing.

This could be in the form of a PowerPoint presentation, a set of screenshots, a demonstration of visual dashboards etc., that can be used to outline your testing methods and results. Communicating your findings using some form of visual representation is essential to ensure that the information presented is used to inform business decisions and activities.

Be aware of the discipline of communicating big data results to an audience. Watch the following video that explains the importance of data visualisation.

What is involved when communicating sample test results?

When communicating sample test results, it is important to: 

  • use effective and clear language to communicate your ideas
  • provide evidence of the test results – this can be in the form of:
    • test results reports 
    • validation test results 
    • numerical information presented using appropriate graphical representations 
    • screenshots from the analytical platform or demonstrations using the analytical platform. 
  • ask the right questions (e.g. open and closed probing questions) 
    • to obtain feedback 
    • to clarify requirements.

Let us now investigate some of the standard protocols that should be followed when communicating with an audience.

Open and closed probing questioning

It is important to ask the right type of question at the right time.

Typically, when you need to obtain feedback, begin with a question. It is important that you use appropriate question types to obtain feedback and particularly to clarify information where required. Questions may be structured as open or closed.

Open questions

These types of questions:

  • encourage longer, descriptive or explanatory answers
  • are useful when you are seeking detailed information about a topic of discussion.

Closed questions

These types of questions:

  • require a simple ‘yes’ or ‘no’ answer
  • are useful when you are seeking a clear, rapid response.

Watch the following video to understand more about when to use different questioning types.

Active listening

Once you have communicated the overall test results, it is important to invite questions from the audience and obtain their feedback on what you have presented to them.

When feedback is provided, it is always important to actively listen to what is communicated by others and respond appropriately.

Watch the following video to understand what active listening is and how you can demonstrate active listening when communicating with others.

After watching the video, answer the following questions.

Topic summary

Congratulations on completing your learning for this topic, Optimise big data sample results and documentation.

In this topic you have learnt how to: 

  • perform data cleansing on big data sample following testing according to industry practices and organisational procedures 
  • collate validated output of testing, confirming absence of big data corruption in sample
  • recommend configuration optimisation changes based on performance testing results 
  • communicate final sample results to required personnel.

Check your learning

The final task for this topic is a set of practical activities that will help you prepare for your formal assessment.

Practical activities:

You should continue with the following tasks, using the files and documents you worked on at the end of the previous topic.

  • The Power BI data model – containing a sample of the Sales 2020-2021 (.xlsx file) dataset, which was isolated for sub-standard data
  • The Test case template (.xlsx file) – which was used to design, formulate, document and record the results after implementing the test scenarios and test cases.

Task 1: Perform data cleansing on big data sample

Read the following industry practices and organisational procedures for performing data cleansing:

Industry practices and procedures

Step 1: Perform data cleaning on the source table of the dataset.

This includes corrections such as:

  • changing data type and data format
  • removal of unwanted columns.

Step 2: Perform data cleaning on target output tables.

This includes corrections such as:

  • removal of duplicate records

    This can be done by revising the queries used to create target output tables to return only distinct rows. For example, to ensure that the Customer table returns distinct values, the following query modification can be made:

    Customer = DISTINCT(
      SELECTCOLUMNS('Sample of Customer',
        "Customer ID", [Customer_ID],
        "Customer Segment", [Segment]
      )
    )
  • removal of irrelevant or incomplete data

    This can be done by filtering out records that contain blank values before the columns are selected. For example:

    Customer = DISTINCT(
      SELECTCOLUMNS(
        FILTER('Sample of Customer',
          -- keep only rows where both key fields are present
          [Customer_ID] <> BLANK() && [Segment] <> BLANK()
        ),
        "Customer ID", [Customer_ID],
        "Customer Segment", [Segment]
      )
    )
  • ensuring the data type and data format are corrected if they have not flowed through from the source table.

According to industry practices and organisational procedures: 

  1. Plan for the task by first making a list of the data cleansing tasks that need to be carried out in a logical order. 
  2. Carry out the data cleansing tasks according to the plan by following a step-by-step process and using any modified queries.

Task 2: Perform optimisations

Perform the following optimisations on the big data sample. 

  • Optimise the data model to establish appropriate relationships between the dimension tables and the fact tables.
  • Create a Date table to optimise the report for time intelligence.

Assessments

Now that you have completed all topics for this module, you are ready to complete the following assessment event: 

  • Assessment 4 (Project).