Validate big data sample process and business logic


Once the sample dataset is validated, the next step is to align the source fields (column names) of the dataset to specific entities and relevant parts of the organisation.

This is done for a number of reasons.

  • It helps you to understand the business processes from which the data is generated and acquired
  • It helps to identify any specific data segments/categories that are not required for reporting and hence do not need to be included in the data model
  • It ensures that the data model you create is completely relevant to the organisation’s requirements for analysing data.

Understanding the parts of an organisation

To understand the data, you must understand the business's operational workflows that generate the data. These workflows are associated with various departments and operational areas of an organisation and may vary from business to business.

A sample of a business operational workflow is as follows.

A diagram showing operational workflow

Adapted from E-Commerce Workflow Diagram by SmartDraw, LLC

Once you understand the business process, it is possible to further break down the sample dataset into segments that can be aligned with relevant parts of the organisation.

A business matrix can be used to capture the alignment of the business entities at a high level. Following is an example of a simple business matrix based on the previously provided sample of the business operational workflow.

The columns of the matrix list the business entities and the rows list the business processes.

| Business processes | Date | Order | Customer | Product | Credit card | Warehouse | Shipping |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Submit order | X | X | X | X |  |  |  |
| Customer pays through credit card transaction | X | X | X |  | X |  |  |
| Order sent to warehouse for shipping |  | X |  | X |  | X |  |
| Shipping is scheduled and sent to the customer |  |  | X | X |  | X | X |

This information is then recorded in more detail using a document called source to target mapping.

What is a ‘Source to target mapping’?

Source to target mapping can be defined as a set of instructions that define how the structure and content in a source system would be transferred and stored in the target system. 34

To achieve business requirements for data analysis, the target system needs to be structured and modelled in a way that would help to build visualisations for reporting.

For large projects dealing with complex data, you may consider using software tools to automate source to target mapping.

Industry practice and guidelines

To do a source to target mapping, there are certain industry practices and guidelines that should be considered when dealing with table attributes, data types, schema design, relationships etc.

Read the article Power BI Best Practices - Excelerator BI to learn more about the best practices that can be used when planning to do a source to target mapping document.

Plan to use column naming standards

All target table fields should conform to a naming standard as it impacts how the field names are displayed in the Power BI reports.

Example naming standard – Column headings should not contain hyphens or underscores (e.g. customer_name or customer-name). Instead, if a heading contains two words, separate them with a single space and capitalise the first letter of each word, e.g. Customer Name.
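As an illustration, the naming standard can be applied while mapping source fields into calculated tables in DAX. The following is a minimal sketch only; the 'Sample of Customer' table and its columns are used here purely as an example of renaming snake_case source fields to the standard (this table also appears in the mapping example later in this topic).

Customer =
SELECTCOLUMNS (
    'Sample of Customer',
    "Customer ID", 'Sample of Customer'[Customer_ID],        -- Customer_ID renamed to "Customer ID"
    "Customer Phone", 'Sample of Customer'[Customer_Phone]   -- Customer_Phone renamed to "Customer Phone"
)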

Plan for a star schema design

Go through the short learning module Describe Power BI Desktop models - Learn from Microsoft’s official website, where you will learn about Power BI Desktop models. Focus on learning what the star schema design model is and what type of tables it is made of.

Watch the following video to understand why star schema should be used when designing data models in Power BI.

After watching the video, answer the following question.

Recording source to target mapping details

To understand how to properly record source to target mapping information, you need to first understand the terms used in a typical source to target mapping template.

  • Source system – refers to the sample dataset table name that was obtained from the raw big data source. You have learnt how to assemble this source data into the data analysis platform in the previous topic. 
  • Source Field – refers to the column names in the sample dataset table. 
  • Target system/output – refers to the fact tables or dimension tables as a result of applying a star schema model.
  • Target Field – refers to the column names in the target output tables (fact or dimension).
  • Transformation Logic – refers to any applicable queries, scripts or formatting changes that need to be made to transform the source fields into the required target fields. 
  • Comments on issues noted – this is an optional column that can be used to record any issues identified during the mapping that need to be addressed during implementation.

Organisations often have a template that can be used to record source to target mapping details. It is important to record all source to target fields/columns clearly using the Source to Target Mapping table.

An example of a source to target mapping table is shown below.

| Source System (Table Name) | Source Field (Column Name) | Transformation Logic (Query, Formatting) – if applicable | Target Output (Table Name) | Target Field (Column Name) | Comments on issues noted |
| --- | --- | --- | --- | --- | --- |
| 'Sample of Customer' | Row_ID |  |  |  | Not required in the report. |
| 'Sample of Customer' | Customer_ID | Customer = SELECTCOLUMNS('Sample of Customer', "Customer ID", [Customer_ID]) | Customer | Customer ID | Discrepancy exists between source field name and target field name. |
| 'Sample of Customer' | Customer_Firstname, Customer_Lastname | Customer = SELECTCOLUMNS('Sample of Customer', "Customer Name", [Customer_Firstname] & " " & [Customer_Lastname]) | Customer | Customer Name | These need to be merged into one column separated by a space delimiter (" "). |
| 'Sample of Customer' | Customer_Address | Customer = SELECTCOLUMNS('Sample of Customer', "Customer Address", [Customer_Address]) | Customer | Customer Address | Discrepancy exists between source field name and target field name. |
| 'Sample of Customer' | Customer_Phone | Customer = SELECTCOLUMNS('Sample of Customer', "Customer Phone", [Customer_Phone]) | Customer | Customer Phone | Discrepancy exists between source field name and target field name. |
| 'Sample of Customer' | Customer_DOB | Customer = SELECTCOLUMNS('Sample of Customer', "Customer DOB", [Customer_DOB]) | Customer | Customer DOB | Change data type to Date and format as dd/mm/yyyy. |
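When the mapping is implemented, the transformation logic recorded in the individual rows above can often be combined. The following is a hedged sketch of what the combined calculated table for the Customer dimension might look like; the data type and dd/mm/yyyy format change for Customer DOB would still be applied separately in the model.

Customer =
SELECTCOLUMNS (
    'Sample of Customer',
    "Customer ID", 'Sample of Customer'[Customer_ID],
    "Customer Name", 'Sample of Customer'[Customer_Firstname] & " " & 'Sample of Customer'[Customer_Lastname],
    "Customer Address", 'Sample of Customer'[Customer_Address],
    "Customer Phone", 'Sample of Customer'[Customer_Phone],
    "Customer DOB", 'Sample of Customer'[Customer_DOB]
)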

Data segregation

Data segregation refers to separating the data from a larger group into smaller groups. During the implementation of a star schema design model, we will need to segregate the sample dataset into multiple target tables (e.g. fact tables and dimension tables).

For example, you can create multiple calculated tables using DAX functions referencing the source system/table.
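For instance, a flat source table can be segregated into a dimension table and a fact table using calculated tables. The sketch below assumes a hypothetical flat table called 'Sample of Sales' with product and sales columns; all names are illustrative only, and each calculated table is created separately in Power BI Desktop (they are shown together here for brevity).

-- Dimension table: one row per product
Product =
DISTINCT (
    SELECTCOLUMNS (
        'Sample of Sales',
        "Product ID", 'Sample of Sales'[Product_ID],
        "Product Name", 'Sample of Sales'[Product_Name]
    )
)

-- Fact table: keys plus the values that will be aggregated
Sales =
SELECTCOLUMNS (
    'Sample of Sales',
    "Product ID", 'Sample of Sales'[Product_ID],
    "Order Date", 'Sample of Sales'[Order_Date],
    "Sales", 'Sample of Sales'[Sales_Amount]
)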

To learn more about how to create calculated tables, refer to the article Using calculated tables in Power BI Desktop.

Watch the following video to understand how a flat file (source data) is transformed into a data model by segregating the source dataset into multiple target output tables.

Data aggregation

Data aggregation is any process whereby data is gathered and expressed in a summary form. 35

When you are testing big data samples, it is important to apply data aggregation rules where it is appropriate to do so. Consider the following data aggregation examples. 

  • Sum of all sales figures 
  • The average age of customers.

Aggregated data does not need to be only numeric. Any non-numeric data can also be aggregated to provide a count of values. For example, consider the following. 

  • Count of total customer segments 
  • Distinct count of products by category.

Implementing data aggregation rules

Data aggregation rules can be applied to data, or datasets, using scripts. Specifically in Power BI, this can be done by using DAX measures.
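For example, the aggregation rules listed earlier in this topic could be implemented as DAX measures similar to the following sketch (the 'Sales' and 'Customer' tables and their columns are illustrative assumptions, not fields from the sample dataset):

-- Numeric aggregations
Total Sales = SUM ( 'Sales'[Sales] )
Average Customer Age = AVERAGE ( 'Customer'[Age] )

-- Non-numeric aggregations (counts)
Customer Segment Count = DISTINCTCOUNT ( 'Customer'[Segment] )
Distinct Product Count = DISTINCTCOUNT ( 'Sales'[Product ID] )   -- shown against Category in a visual to count products per category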

To learn more about Power BI aggregate functions, refer to the article Work with aggregates (sum, average, and so on) in Power BI.

Watch the following video to understand how summarisations are used in Power BI to perform data aggregations.

The final activity for this topic is to complete the following questions.


Identifying anomalies

In the context of data analysis, a data anomaly is often much more than just an error or random glitch. Simply put, data anomalies can be identified as any data values that fall outside the normal pattern of activity. 36

Identifying data anomalies helps to better understand the behaviour of the business operations and may also indicate problems in the data that would require further investigation.

The detected anomalies in the data may be a result of:

  • errors in data entry
  • errors in system-generated data 
  • an error during data integration.

Identifying anomalies in the data is not an easy or straightforward task. To begin analysing and interpreting anomalies, you first need the correct tools or software that can detect them. You can also use customised scripts to validate and check the data for any anomalies.
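As an illustration, a simple custom check could flag values that sit unusually far from the average. The following calculated column is a hedged sketch only; the 'Sales' table, the Sales Amount column and the three-standard-deviation threshold are illustrative assumptions rather than a prescribed rule.

-- Calculated column on a hypothetical 'Sales' table that flags rows more than
-- three standard deviations from the column average.
Anomaly Flag =
VAR AvgSales = AVERAGE ( 'Sales'[Sales Amount] )
VAR SdSales = STDEV.P ( 'Sales'[Sales Amount] )
RETURN
    IF (
        ABS ( 'Sales'[Sales Amount] - AvgSales ) > 3 * SdSales,
        "Check",   -- value falls outside the normal pattern and needs investigation
        "OK"
    )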

Detecting anomalies in transactional (time-series) datasets

Transactional or time-series data refers to records or transactions that also have a timestamp attached to them. For example, online banking transaction data would always have a timestamp that indicates the date and time of when the transaction occurred.

Refer to the following article on What is Time Series Data? and watch the following video to learn more about time-series data.

Refer to the Anomaly detection tutorial - Power BI from Microsoft’s official website on how anomalies can be detected using Power BI Desktop.

Watch the following video to understand how to use the anomaly detection feature in Power BI to detect anomalies in time-series data.

Detecting inconsistencies in non-transactional datasets

When there are inconsistencies present in an organisation’s database (from where the dataset is acquired), this can be apparent in the dataset being tested. Examples of such inconsistencies may be a result of:

  • data insertion anomalies 
  • data deletion anomalies 
  • data modification anomalies.

To detect such inconsistencies in non-transactional data, we will need to use different validation checks. These validation checks may vary according to the type of data that you are analysing.

It is important to further explore and understand the data in order to formulate relevant validation checks that can uncover any inconsistencies. These validation checks can be done by:

  • running custom test scripts (a sketch of such a check follows this list) 
  • using simple visualisations in Power BI Desktop to generate simple reports where these anomalies can be clearly detected, such as:
    • finding unusual metric values 
    • finding changes in important metrics.
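As an example of a custom test script, the following DAX measures sketch two common validation checks on a hypothetical 'Customer' dimension table; the table and column names are assumptions for illustration only.

-- Duplicate-key check: a result greater than zero means Customer ID values are repeated
Duplicate Customer IDs =
COUNTROWS ( 'Customer' ) - DISTINCTCOUNT ( 'Customer'[Customer ID] )

-- Completeness check: number of rows with a missing phone number
Blank Customer Phones =
COUNTROWS ( FILTER ( 'Customer', ISBLANK ( 'Customer'[Customer Phone] ) ) )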

Consult required personnel to clarify detected anomalies

It is necessary to consult with the appropriate people in your organisation to clarify and resolve any anomalies you have identified in the dataset.

Overview of ‘required personnel’

When clarifying detected anomalies, it is essential that you speak with the relevant people. ‘Relevant’ people include anyone with the skills and knowledge necessary to support you in clarifying and resolving the particular problems you have identified. This may include:

  • your immediate line manager (e.g. supervisor, data analyst lead, chief data officer) who would have more insight into the business process and have the capability to escalate the issue further to other stakeholders of the business to find out the root cause of the detected anomalies
  • subject matter experts (e.g. in the health industry, talk to health industry professionals to understand the data and what it is telling us) 
  • application owners (e.g. experts on the application/platform and its underlying structures) 
  • process owners (e.g. financial leads who understand the financial processes)
  • external specialists who manage the organisation’s big data
  • the owner of the data (the ‘client’), for whom you are performing testing processes
  • people in your organisation with expertise in relation to testing big data samples.

What is involved in the consultation process?

You can consult the relevant personnel about identified anomalies using a variety of communication methods such as emails, phone conversations and meetings (e.g. online or face-to-face).

Regardless of the communication method used, it is important to: 

  • use effective and clear language to communicate your ideas
  • ask the right questions (e.g. open questions) to: 
    • seek advice 
    • request resolutions for detected anomalies 
  • provide evidence of the detected anomalies – this can be in the form of: 
    • error reports 
    • validation test results
    • numerical information 
    • screenshots from the analytical platform.

For example, let’s investigate the standard protocols that should be followed when consulting your supervisor via email.

Email etiquette

When writing an email, ensure that you: 

  • address the email to the correct person’s (or persons’) email address
  • use the standard email template as recommended by your organisation 
  • include the following main elements within the structure of the email: 
    • subject-line 
    • openers (greeting) 
    • body 
    • closings (sign-off)
  • include all necessary information regarding the data anomalies 
  • check for any errors in grammar, spelling and punctuation.

Refer to the article How to Write a Proper Email: Make the Right Impression | Grammarly to gain more insights and tips on using proper email etiquette.

Watch the following video to learn the recommended protocols that you need to consider when writing a formal email.

Resolving anomalies

After consulting with the appropriate personnel regarding the identified anomalies, the next step is to resolve the anomalies according to the advice received.

Note: Some issues in the data, although initially identified as anomalies, may turn out to be genuinely correct data after further investigation. In such cases, this information should be noted in the test results reports as it may be required for further reference during the sample testing process.

If a detected anomaly is identified as an actual error in the database, it should ideally be corrected in the actual raw big data source. However, for the purpose of proceeding with the sample data testing, one would need to:

  • Perform the corrections in the sample dataset by correcting, adding, and manipulating any missing or incorrect data.
  • Make notes of any suggestions to resolve these issues in the actual raw big data source in the validation test results or other relevant documentation.

Note: This information will be useful when communicating suggestions to your supervisor, client or relevant stakeholder once the sample testing process is completed.

Resolving issues in the sample dataset

At this point, you need to manually correct these issues in the sample dataset.

If the dataset is in .csv or .xlsx format, it can be edited using Microsoft Excel.

Once the sample dataset has been manipulated to correct any verified anomalies, the sample dataset loaded into Power BI Desktop needs to be refreshed to reflect the fixes applied.


Once the data aggregation and segregation rules have been implemented and the identified anomalies have been resolved in the sample dataset, it is time to test the report’s performance.

An introduction to performance testing data throughput was provided in Topic 2.5 of this module. In addition to throughput testing, you will learn about two other types of performance tests:

  • Data processing testing 37 
    • This is a type of test that focuses on how the data that is loaded onto the analytical platform is processed. It involves verifying the speed with which 38 
      • queries and DAX scripts are run to perform required data aggregations 
      • MapReduce jobs are executed 
      • the underlying data model is populated.
  • Sub-component performance testing 
    • As analytic platforms are made up of multiple components, it is important to test each of these components in isolation. For example, within a sub-component, how quickly 38 
      • the message is indexed and consumed 
      • MapReduce jobs are performed 
      • queries are executed 
      • search options are performed.

Now, we will focus on how to carry out these tests using various platforms and tools.

Testing data throughput performance

To collect the performance data of the visualisations of a report, we will use the Power BI Performance Analyzer, which is a feature included in Power BI Desktop.

When using Power BI Performance Analyzer, it is important to note the following about the captured performance data.

  • The recorded performance data for each visual includes details of the following three elements: 39 
    • DAX Query – Every visual generates one or more DAX EVALUATE statements to request data. This element indicates the time elapsed between sending the request and receiving a result. 
    • Visual display – This indicates the time required for the visual to render the information from the underlying data and present/display it on the user interface of the report. 
    • Other – This indicates the time spent waiting on any other type of operation to complete. For example, preparation of queries, waiting for other visuals to complete, background processing etc. 
  • The throughput of each report visual is measured in milliseconds (ms) which indicates the difference between the start and end timestamp for each operation. 40

Refer to the following articles to learn more about how to use Power BI Performance Analyzer to capture the performance data of visualisations in a report.

Watch the following video to learn how to:

  • use the Performance Analyzer tool in Power BI Desktop 
  • understand the throughput data captured 
  • save the recorded performance data into a JSON file.

Testing data processing performance

Here, we will further analyse the already captured performance data from the Performance Analyzer tool in order to gain insights into data processing performance.

The tool that we will use here is DAX Studio. Refer to the following article DAX Studio for Power BI: Getting Started - Excelerator BI, to understand how to download, install and start using DAX Studio.

To do the analysis, follow these steps.

  • Load the performance data – In DAX Studio, select the ‘Load Perf Data’ option. Then navigate to and select the previously saved performance data (.JSON format) file to be loaded into DAX Studio. 
  • Analyse the PBI Performance data that is listed in a grid which includes information such as: 41
    • Visual – lists the type of visual used (e.g. Slicer, Matrix, Measures etc.) 
    • QueryStart – the start timestamp 
    • QueryEnd – the end timestamp 
    • Rows – rows of data requested by the query 
    • Query Ms – query execution time
    • Render Ms – time taken for the visual to render the information 
    • Total Ms – total query execution time in milliseconds 
    • Query – the actual query that runs in the background during Power BI report interactions (a hedged example of such a query is shown after this list).
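To make the Query column more concrete, the background queries generated by report visuals are DAX queries that can also be run directly in DAX Studio. The sketch below shows the general shape only; the queries Power BI actually generates are usually more verbose, and the table, column and measure names here are illustrative assumptions.

EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Category],
    "Total Sales", SUM ( 'Sales'[Sales] )   -- one row per category with its aggregated sales
)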

Watch the following video that demonstrates how to load Performance Analyzer metrics into DAX Studio for further analysis.

Testing sub-component performance

Here, we will further dive into the sub-component performance details of the current data model.

To do the analysis:

  • Select the View Metrics option in DAX Studio to analyse a data model connected to Power BI Desktop
  • Analyse the sub-components of the current data model, which includes: 42 
    • Table metrics – information about tables showing the amount of memory used by tables/columns in the model
    • Column metrics – lists all column information without grouping them inside tables
    • Relationships – shows information about the relationships between tables that currently exist in the data model
    • Partitions – Shows partition information for each table. By default, every table has at least one partition. However, in larger datasets, there may be multiple partitions that contain this information.

Read through the following article, Model Metrics, which explains the performance metrics in data models.

Topic summary

Congratulations on completing your learning for this topic, Validate big data sample process and business logic.

In this topic, you have learnt how to: 

  • align datasets to relevant parts of the organisation 
  • implement data aggregation and segregation rules 
  • consult with required personnel to clarify and resolve identified anomalies 
  • conduct performance testing for data throughput, data processing and sub-component performance.

Check your learning

The final activity for this topic is a set of practical activities that will help you prepare for your formal assessment.

Practical activities:

Task 1: Align datasets to relevant parts of the organisation

You have been provided with the following:

Investigate the operational data flow to understand how the data is derived. Then, complete the Source to Target Mapping tab in the Excel worksheet file to align datasets to relevant parts of the operational data flow.

Task 2: Implement data segregation rules

According to your previously documented source to target mapping, create new tables in Power BI data model view, to implement data segregation rules.

Task 3: Implement data aggregation rules

Use the correct visualisations and measures to implement the data aggregation rules for the dataset according to the following reporting requirements.

Note: Ensure that each visualisation acquires data from the target output tables you’ve created in Task 2.

  1. Create a new measure to get the Cost value from Sales and Profit columns considering the business logic: Profit = Sales – Cost 
  2. Add a Card visual to display: Cost, Sales and Profit. 
  3. Add a Clustered bar chart visual to display: Cost, Revenue and Profit. 
  4. Add a Slicer visual to filter the report data by year and quarter. 
  5. Add a Map visual to display Sales, Profit and Cost details by State and Order Type
    • The Sales amounts should be made visible in Bubble size
    • The Order Type should be displayed in the Map Legend as InStore and Online

Hint: Before adding the Map visual you’ll have to transform the data and split the current Order ID column by Delimiter (-), which will then provide three separate columns that contain the Order Type (as InStore or Online), Year (2020, 2021) and Order ID (a 6-digit number).

Task 4: Identify anomalies

  1. Create a new page in the Power BI Desktop report view called “Anomaly detection”.
  2. Create three line charts to display each of the following time series data from the dataset. 
    • Cost by Year, Quarter and Month
    • Sales by Year, Quarter and Month
    • Profit by Year, Quarter and Month
  3. Use the Find Anomalies feature in Power BI Desktop to detect any anomalies in this data.

Task 5: Conduct performance testing for data throughput

5.1 Use the Performance Analyzer tool to record data throughput as follows:

  1. Start recording performance data using the Performance Analyzer.
  2. Clear any previously captured data.
  3. Refresh all visuals and capture performance data.
  4. Perform some filters within the data visualisations in the report.

5.2 Review data processing (query) performance as follows:

  1. Export the previously recorded data from the Performance Analyzer into a JSON file, with a meaningful name that relates to the type of data output report tested. 
  2. Open DAX Studio. 
  3. Load the performance data JSON File into DAX Studio.
  4. Review the PBI Performance data.

5.3 Review sub-component performance

  1. In DAX Studio go to Advanced > View Metrics.
  2. Expand the relevant datasets (tables) for Sales Output view to obtain sub-component performance data.
  3. Review sub-component performance metrics for the relevant tables associated with the data output report.

Assessments

Now that you have completed Topics 3 and 4 of this module, you are ready to complete the following assessment event: 

  • Assessment 3 (Project).

What’s next?

Next, we will dive into the more practical tasks related to validating the output of captured big data.
