Big data sample testing fundamentals


Considering any legislative requirements should be the starting point for any big data project. This helps to specify certain boundaries for the project and identify any limitations for using specific types of data.

Data protection and privacy concerns

Following are some of the issues that need to be considered by businesses when working with big data. 4


Regulatory requirements dictate that personal data must be processed for specified and lawful purposes and that the processing must be adequate, relevant and not excessive.

When dealing with confidential and sensitive information included in the big data being tested, businesses need to be transparent on how the information is handled and protected and demonstrate that confidentiality is maintained at all stages.

Identifying personal information

Watch the following video to understand how the Australian Public Service (APS) is committed to protecting the personal information of Australians. Pay close attention to how you can identify ‘personal information’ and how this information should be handled.

 

After watching the video, answer the following questions.

When carrying out data analytics activities, especially in Australia, there are various data protection and privacy laws and regulations that businesses need to comply with when testing big data sources.

This is often done by: 5

  • using a ‘privacy-by-design’ approach
    • involves embedding privacy into the project, instead of making it an afterthought
  • considering social responsibility in data analytics projects
    • involves having an ethics-based approach to ensure that the processing of personal information as part of an organisation’s data analytics is carried out in a fair, transparent, responsible and ethical manner
  • conducting Privacy Impact Assessments (PIAs)
    • for the purpose of identifying risks and making appropriate recommendations.

Watch the following video to understand why Privacy Impact Assessments should be carried out and when they should be carried out.

Refer to the APS Big Data Strategy document published by the Office of the Australian Information Commissioner (OAIC) which outlines how to build privacy into data analytic activities. 6

Complete the following task based on the information from the APS Big Data Strategy document.

Data protection, privacy laws and regulations

a group of people using cellphones while walking on the street

Privacy Act in Australia

The Privacy Act 1988 makes provisions to protect the privacy of individuals in Australia. Therefore, businesses that deal with the personal information of individuals in Australia must ensure that they comply with this legislation.

Watch the following video to understand what ‘personal information’ is and how personal information needs to be protected according to the Privacy Act in Australia.

After watching the video answer the following question.

Australian Privacy Principles (APPs)

Under the Privacy Act 1988 the Australian Privacy Principles (APPs) provide guidelines for the collection, management and use of personal information.

Refer to the Guide to data analytics and the Australian Privacy Principles published by the Office of the Australian Information Commissioner (OAIC) for more detailed information about how the APPs specifically apply to data analytics activities, especially ‘data integration’ as it relates to the topic covered in this module.

Data integration refers to the process of bringing together multiple datasets to provide a new dataset. 5

Watch the following video to understand more about the APPs. Pay attention to those specific APPs that are relevant to data integration activities (e.g. testing big data collected from various sources).

General Data Protection Regulation (GDPR)

The European Union General Data Protection Regulation (the GDPR) contains new data protection requirements that came into effect from 25 May 2018. According to the General Data Protection Regulation (GDPR):

Australian businesses of any size may need to comply if they have an establishment in the EU, if they offer goods and services in the EU, or if they monitor the behaviour of individuals in the EU. 7

For more information about GDPR, refer to the GDPR Australia How to Comply with General Data Protection Regulation.

Watch the following video to understand what GDPR is and how it impacts privacy and data protection requirements for businesses that operate in Australia.

Privacy laws and regulations for specific industries

In addition to the Privacy Act, the APS Big Data Strategy document published by the Office of the Australian Information Commissioner (OAIC) outlines a number of other legislative controls that apply to businesses that operate in various industries. 6

Some examples are given below.

Data protection and privacy laws of states and territories

Here are some examples of data protection and privacy legislation as it applies to different states and territories in Australia.

The final activity for this topic is a set of questions that will help you prepare for your formal assessment.

a person digesting big data sources on multiple screens

Big data sources can be of a variety of types. These data sources can be categorised based on the:

  • format of the raw data
  • features of the systems that generate the raw data
  • where the data is generated from (i.e. either internally or externally)
  • whether it has a time dimension attached to it or not (i.e. either transactional or non-transactional).

Now let’s take a closer look at these categories of big data sources.

Formats of common big data sources

Big data sources generally fall into one of three main types:

  • Structured
  • Unstructured
  • Semi-structured
A diagram depicting the three main formats of big data sources

Structured

Structured data refers to data that is identifiable and organised in a defined way. 6 Therefore this type of data is easier to search and analyse as its elements are addressable for effective analysis.

Structured data is usually stored in a database consisting of one or more tables. Tables contain one or more columns that specify what type of data should be stored in each column. The actual data records are then stored within the rows of a table.

To make things clear, consider the following terms and their definitions.10

  • Database – a container that stores organised data.
  • Table – a structured list of data of a specific type. (e.g. Customer table, Order table, Suppliers table)
  • Column – a single field in a table. All tables are made up of one or more columns. (e.g. In a Customer table, column names could be Customer ID, Customer Name, Phone and Email)
  • Row – a record in a table
  • Datatype – this refers to the type of allowed/restricted data in a column. (e.g. characters, numbers etc)
  • Schema – information about database and table layout and properties.
  • Key column(s) – a column (or set of columns) whose values uniquely identify every row in a table.

Consider the following Customer table as an example of how structured data can be stored in a database.

Notice that:

  • the table column headings (e.g. Customer ID, Customer Name, Phone, Email) indicate what type of information about customers is stored in the table
  • the table rows contain records of specific customer details
  • the table contains a key column (e.g. Customer ID) to uniquely identify each record in the table.
Customer
Customer ID Customer Name Phone Email
2658 Fiona Black 0426 142 866 fiona.black@mail.com
1952 Benjamin Low 0496 100 200 benlow@mail.com
2587 Grace Murray 0423 654 258 gmurray@mail.com
9542 John Kelly 0438 962 800 kellyjohn@mail.com

The following image shows an example of a database schema or model.

Notice how the key columns are used to form relationships or links with other tables in the schema.

an example of a database schema or model

Some of the identifiable features of structured data are that:8

  • it is organised into a well-designed format (e.g. tabular)
  • it is generally stored in relational databases (e.g. SQL, MySQL, Oracle) and enterprise management systems (e.g. ERP, CRM)
  • it is less flexible and less scalable as it is confined to its pre-defined structure
  • it is easier to query and process this data to obtain meaningful results
  • it allows for parallel processing of data and is often used to capture transactional data generated from multi-tasking systems such as point-of-sales terminals, retail systems etc.

Unstructured

Unstructured data refers to data that has little identifiable structure 6 and cannot be contained in a row and column database.

Unstructured data can be textual (e.g. email body, log files, documents, social media posts) or non-textual (e.g. audio, video, images).

Some of the identifiable features of unstructured data are that: 8

  • it does not have a pre-defined format or structure
  • it is usually stored in a data lake*
  • it cannot be stored in traditional relational databases
  • it does not contain any identifiable tags or keywords
  • it is harder and more complex to query and process this data in order to obtain meaningful results
  • it requires a specialised skillset to be able to analyse and model this type of data.

* A data lake is a centralised repository that allows general storage of all types of data from any source.

Semi-structured

Semi-structured data refers to data that does not conform to a formal structure based on standardised data models. 6

Semi-structured data often appears to be unstructured at first glance. Therefore, it is often hard to clearly differentiate between these two types. 9

However, some of the identifiable features of semi-structured data are that: 8

  • it is not as fully formatted as structured data
  • it is organised to some extent (e.g. CSV, XML, JSON, HTML), but not to the level of structured data
  • it is not stored using tabular formats or in traditional relational databases
  • it is more flexible and scalable than structured data, but less flexible and scalable than unstructured data
  • it is best managed in NoSQL databases
  • it contains identifiable tags or keywords that help query and process the data more easily than unstructured data, although not as easily as structured data.

The following images show examples of XML, JSON and CSV files. Notice the use of various tags and keywords used to provide some form of structure for the data contained in these files.

An example of an XML file

<customers>
  <customer>
    <customerid>2658</customerid>
    <customername>Fiona Black</customername>
    <phone>042 614 2866</phone>
    <email>fiona.black@mail.com</email>
  </customer>
  <customer>
    <customerid>1952</customerid>
    <customername>Benjamin Low</customername>
    <phone>049 610 0200</phone>
    <email>benlow@mail.com</email>
  </customer>
  <customer>
    <customerid>2587</customerid>
    <customername>Grace Murray</customername>
    <phone>042 365 4258</phone>
    <email>gmurray@mail.com</email>
  </customer>
  <customer>
    <customerid>9542</customerid>
    <customername>John Kelly</customername>
    <phone>043 896 2800</phone>
    <email>kellyjohn@mail.com</email>
  </customer>
</customers>

An example of a JSON file

{
  "Customers": [
    {
      "Customer_ID": "2658",
      "Customer_Name": "Fiona Black",
      "Phone": "042 614 2866",
      "Email": "fiona.black@mail.com"
    },
    {
      "Customer_ID": "1952",
      "Customer_Name": "Benjamin Low",
      "Phone": "049 610 0200",
      "Email": "benlow@mail.com"
    },
    {
      "Customer_ID": "2587",
      "Customer_Name": "Grace Murray",
      "Phone": "042 365 4258",
      "Email": "gmurray@mail.com"
    },
    {
      "Customer_ID": "9542",
      "Customer_Name": "John Kelly",
      "Phone": "043 896 2800",
      "Email": "kellyjohn@mail.com"
    }
  ]
}

An example of a CSV file

Customer_ID, Customer_Name, Phone, Email
2658, "Fiona Black", 042 614 2866, fiona.black@mail.com
1952, "Benjamin Low", 049 610 0200, benlow@mail.com
2587, "Grace Murray", 042 365 4258, gmurray@mail.com
9542, "John Kelly", 043 896 2800, kellyjohn@mail.com

Watch the following video to understand more about big data types.

After watching the video, complete the following activities.

Features of common big data sources

Big data sources may be defined as follows:

  • Batched – Data is collected in ‘batches’ over time, then fed into a system for testing
  • Real-time – Data is fed into the system piece by piece and tested in real time
  • Interactive – Data can be manipulated by analysts during collection and testing.

Let us look into the features of each of these types of big data sources.


Batched

Data sourced from batch processing systems typically involves very large datasets. Often, testers need to wait for hours or days until a certain amount of raw data is collected before processing that data. This type of data source is characterised as non-time-sensitive, as computations performed on the data require significant time to generate results.

The common procedure when working with batched sources is to schedule extract, transform and load (ETL) jobs to run at a set time or when data reaches a certain threshold.

Examples: Payroll data, billing data, customer orders, archived data.

This type of big data source typically deals with structured or semi-structured data formats.

Real-time

Raw data from this source:

  • includes smaller chunks of data that are changing/added to the system rapidly
  • is processed immediately (within seconds) using simple calculations
  • uses in-memory computing for fast processing.
  • typically deals with unstructured data formats.

Examples: Radar systems, Twitter feeds, Bank ATMs.

Interactive

Raw data from this source:

  • is used to find patterns in data
  • provides the ability to work with more data
  • is processed within minutes and with low latency
  • can scale reliably when new data becomes available
  • can be explored using various platforms that can run ad-hoc queries.

Internal and external big data sources

It is important that you are able to identify internal and external sources of big data for testing.


Internal sources of big data are those generated, owned and controlled by your organisation. Typically, internal data sources are easier to collect and are more relevant to an organisation. 13

Some examples of internal data sources include:

  • Transactional data and point of sale (POS) information – which includes both current and historical data that relates to the organisation’s own purchases and details of their customers’ purchasing trends.
  • Customer relationship management system (CRM) – this holds details of customers which can be further analysed to understand where customers are located geographically and their company affiliations.
  • Internal documents – Today, internal documents are mostly in digitised form (PDFs, Word Documents, Spreadsheets etc.) and they provide a wealth of information about business activities, policies and procedures.
  • Archives – Historical data can contain a wealth of information on how the business used to operate and can help with further investigation into patterns or resolving issues.
  • Device sensors – Any Internet of Things (IoT) devices used within the organisation can generate data on network connectivity as it relates to the business. For example, a transport company can use sensor data received from their vehicles to source valuable information such as mileage, fuel consumption, travel route etc.

External sources of big data are those generated outside of your organisation.

Data obtained from external sources may include:

  • public data (e.g. publicly available data for machine learning, official statistics, forecasts)
  • unstructured data (e.g. social media posts)
  • other types of data gathered by third-party organisations (e.g. Peak body data, Government data, data collected from Google)

Your organisation may obtain permission to use these sources, but does not own or control them.

Transactional and non-transactional big data sources

It is important that you are able to differentiate between transactional and non-transactional data sources.


Transactional

Data sources that have a time dimension and become historical once the operation is complete.

Important facts about transactional data are as follows: 11

  • They consist of records that relate to any type of financial transaction related to a business
  • They are generated by various systems and applications such as point-of-sale servers, ATMs, payment gateways etc
  • Each record contains a transaction ID that is accompanied by a series of items that are part of the transaction
  • They lose their relevance over time and become historical once a transaction is completed. Therefore, processing and making sense of this type of data needs to be done quickly in order to maintain a competitive edge
  • They are used to understand peak transaction volumes, peak ingestion rates, and peak data arrival rates
  • For example, transactional data may contain:
    • sales order data (e.g. revenue, cost, discounts, products purchased, quantity)
    • financial data (e.g. purchases, sales, payments, claims)
    • account transfer data (e.g. withdrawals, credits, debits)
    • work related employee data (e.g. time-in, time-out, hours worked).

Refer to the article What is Transactional Data? | TIBCO Software which explains transactional data in more detail.

Non-transactional

Data sources that are relevant to an organisation for a longer duration.

Example: customer details, product details.

Important facts about non-transactional data are as follows: 12

  • Non-transactional data is relevant to an organisation over a longer period of time than transactional data.
  • For example, non-transactional data may contain customer details (e.g. name, address, preferences, contact details), product details (e.g. product ID, product name, category), supplier details, account information etc.

The final activity for this topic is a set of questions that will help you prepare for your formal assessment.

a programmer writing code

To write queries and scripts for big data testing you will need to use some form of programming language. Most applications and big data platforms available today support multiple programming languages with in-built scripting editors to make writing scripts and queries easier, especially for those testers who do not have previous experience with programming languages.

It is also important to note that the type of programming language used in a big data testing scenario would also depend on the type, source and structure of the raw data that needs to be processed.

For example, consider the following scenarios and the choice of programming language:

  • If the data is sourced from a traditional relational database containing transactional data that has a well-defined structure, the most basic and standard programming language you would need to use is the Structured Query Language (SQL).
  • If the raw big data is semi-structured or unstructured, then a programming language such as Scala can be used. Scala is identified as one of the top open-source programming languages used for big data projects.

There are many other programming languages used for big data testing and analysis such as Python, R, Java etc.

However, for the purpose of this module, you will be introduced mainly to DAX (Data Analysis Expressions) and the Power Query Formula Language (informally known as ‘M code’) as you will be working with these languages to perform big data testing tasks using Microsoft Power BI platform from the next topic.

Power Query Formula Language (M code)

The first step when testing big data samples is to import data into a data model in Power BI. At this stage, the Query Editor in Power BI provides a powerful graphical interface that allows you to perform complex data modifications without having to look into the M code that the Query Editor is building behind the scenes.

Power Query is used to perform the extract, transform, and load (ETL) processing of data. This can be done using the graphical user interface in Power BI without having to worry about writing complex M code.

To see the M code behind a query in Power BI, we can use the Advanced Editor in the Query Editor.

In this course, we will not go into detail about how to write Power Query formulas as it can be easily done via the graphical user interface in Power BI, which you will get to experience in the next topic.

Most importantly you need to understand and learn the basic industry protocols and procedures to write DAX queries and scripts for big data testing. So let’s begin.

DAX rules and requirements

DAX formulas are similar to those used in Excel tables. However, there are some key differences.

DAX formulas are often written as calculations and come in three forms.15


Measures – used to perform aggregations (e.g. calculate totals, count distinct values etc.)

Calculated columns – used to add additional columns to tables in a dataset.

Calculated tables – used to create a new table with required columns.

Let’s understand the basic rules that you need to follow when writing DAX formulas and specific object naming requirements.

Basic rules

When writing DAX formulas for calculated columns or measures, it is important to follow these general rules or protocols (a short example follows the list).

  • A DAX formula always starts with an equals (=) sign
  • After the equals sign comes the expression, which may include elements such as statements, functions, operators, values etc. It is important to note that expressions:
    • are always read from left to right
    • use parentheses to control the order in which the elements are grouped
  • A DAX function always references a complete column, measure or table. This is commonly known as the fully qualified name of that object. For example:
    • The fully qualified name of a column refers to the table name, followed by the column name in square brackets. For example, 'AUS Order'[Products].
    • The fully qualified name of a measure refers to the table name, followed by the measure name in square brackets. For example, 'AUS Order'[Profit].
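
For example, the following hypothetical measure (using the 'AUS Order' table and Cost column from the naming examples below) shows these rules in practice:

Total Cost = SUM('AUS Order'[Cost])

The formula starts with the measure name and the equals sign, SUM() is the DAX function, and the fully qualified column 'AUS Order'[Cost] is written with single quotation marks because the table name contains a space.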

Object naming requirements

  • All object names are case-insensitive. (e.g. The table names 'SALES' and 'Sales' would represent the same table name).
    • Note that the term object here refers to tables, columns, measures etc
  • The names of objects must be unique within their current context. For example:
    • measure names must be unique within a model
    • column names must be unique within a table
    • table names must be unique within a database
  • Leading or trailing spaces that are enclosed by name delimiters, brackets or single apostrophes are valid in the names of tables, columns, and measures.
  • If the name of a table contains any of the following, you must enclose the table name in single quotation marks.
    • Spaces (e.g. 'Order Detail')
    • Reserved keywords (e.g. 'Analysis Services' cannot be used as a table name unless it is enclosed in quotation marks)
    • Disallowed characters .,;':/\*|?&%$!+=()[]{}<>
    • Characters outside the ANSI alphanumeric character range

Take a look at the following examples of object names used within DAX formulas.

Example Object type Comments on the naming requirement
Customer Table name This table name does not contain any spaces or other special characters. Therefore, the name does not need to be enclosed in quotation marks.
'AUS Orders' Table name This table name contains a space therefore it needs to be enclosed in single quotation marks.
'Discount%' Table name This table name contains a special character. Therefore it needs to be enclosed in single quotation marks.
Orders[Cost] Fully qualified column name The table name precedes the column name and the column name is enclosed in square brackets.
Orders[Profit] Fully qualified measure name The table name precedes the measure name and the measure name is enclosed in square brackets.
[Cost] Unqualified column name

This is just the column name in square brackets. This can be used in certain situations in formulas such as:

  • in calculated columns within the same table
  • in an aggregation function that scans over the same table.
'AUS Order'[Cost] Fully qualified column in table with spaces The table name contains spaces, so it must be enclosed in single quotation marks.

Operators

There are different types of operators used when writing DAX formulas. Some of the frequently used DAX operators are listed in the following table.16

Operator type Metacharacter symbols Meaning
Arithmetic operators + Addition
- Subtraction
/ Division
* Multiplication
Comparison operators = Equal to
== Strict equal to
> Greater than
< Less than
<= Less than or equal to
>= Greater than or equal to
<> Not equal to
Text concatenation operator & Connects two values to produce one continuous text value
Logical operators && AND condition
|| OR condition
Parenthesis operator () Used for precedence order and grouping of arguments

For more information on DAX operators, refer to the DAX operators - DAX | Microsoft Docs.
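
As an illustration only (assuming the Orders table used later in this topic and the DAX IF() function), several of these operators can be combined within a single calculated column:

Margin Flag = IF((Orders[Sales] - Orders[Cost]) >= 2 && Orders[Quantity] > 1, "High margin", "Low margin")

Here the parentheses control the grouping, the subtraction and comparison operators build the two conditions, and && requires both conditions to be true.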

Now that you are aware of the general rules for writing DAX formulas, let’s look into the structure and basic syntax of a DAX formula.

DAX formula syntax

Before you can write your own DAX functions you need to be familiar with the various elements that make up a DAX formula. 14

Following are the basic elements that make up a DAX measure.

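As a minimal example (using the Orders table and CostAmount column described below), such a measure can be written as:

Total Costs = SUM(Orders[CostAmount])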

Simply put, what this DAX statement does is create a measure named Total Costs that calculates (=) the SUM() of the values in the CostAmount column in the Orders table.

Let’s understand each of the elements in this DAX measure.

  1. The name of the new measure (e.g. Total Costs)
  2. The beginning of the formula is indicated by an equals sign operator (=)
  3. DAX function (e.g. SUM()) – used to add all values in a given table’s column (e.g. Orders[CostAmount])
  4. Parentheses (), which surround an expression that contains one or more arguments. Most functions require at least one argument. An argument passes a value to a function.
  5. The name of the table that the formula applies to (e.g. Orders)
  6. The name of the column that the formula applies to (e.g. CostAmount)
  7. Calculated measure name
  8. DAX formula

Let us take a look at some examples of how the DAX formula syntax can be used to create a calculated column, measure and table. In each of these examples notice how the syntax rules are followed.

 
  1. DAX column/measure name (e.g. Profit, Total Profit, Customer)
  2. DAX Formula
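
As a minimal sketch only (assuming the Orders and Customers tables used later in this topic, and reusing the names listed above), the three forms could look like this:

// calculated column (added to the Orders table)
Profit = Orders[Sales] - Orders[Cost]

// measure
Total Profit = SUM(Orders[Sales]) - SUM(Orders[Cost])

// calculated table (a one-column table of distinct customer IDs)
Customer = DISTINCT(Customers[Customer ID])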

DAX Functions

DAX functions help to perform commonly used data calculations on data models, databases, tables etc.

There are a variety of DAX functions built into the DAX language that can be used for various purposes when writing DAX formulas. It is important to note that each DAX function has specific syntax rules that must be followed to ensure the script executes without any errors.

For more information on DAX functions and their specific syntax rules, refer to the DAX function reference - DAX | Microsoft Docs.

The following table explores and provides examples of some of the DAX functions, their purpose and how they can be used when writing queries.

Function type DAX functions What it does Example
Filter functions
FILTER()
Used to get an output of a table that contains only the filtered rows.

The following example filters all customer records from the Customer table where the Location = "NSW"

Sample expression:

FILTER(Customer, Customer[Location] = "NSW")
Aggregation functions
SUM()
Used to add all the numbers that are in a column.

The following example adds all the numbers in the Cost column from the Sales table.

SUM(Sales[Cost])
DISTINCTCOUNT()
Used to return the number of distinct values in a given column. The following example returns the number of distinct values in the Category column of the Products table.
DISTINCTCOUNT(Products[Category])
Table manipulation functions
SELECTCOLUMNS()
Used to add calculated columns to the given table

The following example creates a new table called Orders with specified calculated columns from the 'Company Data' table.

Orders = SELECTCOLUMNS('Company Data',
  "Order ID", [Order_ID],
  "Revenue", [Sales],
  "Cost", [Cost],
  "Profit", [Sales] - [Cost]
)
EXCEPT()
Used to return the rows of one table minus all the rows of another table.

The following example returns Table3 that contains records from Table2 minus the records in Table1.

Table3 = EXCEPT(Table2, Table1)
DISTINCT(Table[Column])
Used to return a one-column table that contains only the distinct values (no duplicates) from the specified column.

The following example returns a one-column table containing only the distinct values from the Category column of the Products table, with any duplicate Category records removed.

DISTINCT(Products[Category])
DISTINCT(Table)
Used to return only unique values in a table by removing duplicate rows.

The following example returns a new table that only contains distinct values from the Orders table, where any duplicate records are removed.

DISTINCT(Orders)

DAX Statements

DAX statements are used to query information from structured data sources such as tables, databases and models. 17

There are different types of statements used for various purposes when writing DAX formulas. It is important to note that each statement type has specific syntax rules that must be followed to ensure the script executes without any errors.

For more information on DAX statements and their specific syntax rules, refer to the following documentation from Microsoft: Statements (DAX) - DAX | Microsoft Docs.

Let’s explore a few examples of DAX statements.

Statement Description Example
EVALUATE() This statement is required to execute a DAX query.
EVALUATE(Customer)
ORDER BY This is an optional statement that defines one or more expressions used to sort results of a DAX query. Sorting can be specified either as ascending (ASC) order or descending (DESC) order.
ORDER BY Customer[Location] DESC
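
As a brief sketch (using the Customer table from the earlier FILTER() example and assuming it also contains a Customer Name column), the two statements can be combined into a single query:

EVALUATE
FILTER(Customer, Customer[Location] = "NSW")
ORDER BY Customer[Customer Name] ASC

This returns only the NSW customer rows and sorts the result by customer name in ascending (ASC) order.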

Putting it all together

Watch the following video from Microsoft Power BI which demonstrates how the statements, functions, values, operators etc. you’ve learnt in this topic work together in DAX queries and the output it generates in Power BI Desktop.

Note: You will learn to use Power BI Desktop later on in the next topic. Therefore, your focus at this point should be on learning the syntax and the basic rules for using DAX scripts.

In the following series of examples and activities, you will see how statements, functions, values, operators etc. all work together in DAX queries.

For each question within the following activities, make an attempt to write DAX queries yourself before looking at the solutions and answers provided.

Consider the following 'Customers' table:

Customers
Customer ID Firstname Lastname Country State Company Customer Address
1010 Roosevelt Romero US TXS Dropellet Intl RRomero@Dropellet.com
1011 Melody Cruz US MIN Moderock Intl MCruz@Moderock.com
1012 Francis Moreno US MCH Claster Intl FMoreno@Claster.com
1013 Andy Jordan AUS NSW Tuttadit Pvt Ltd. AJordan@Tuttadit.com
1012 Jeffry Warren AUS QLD Ybuwyn Pvt Ltd. JWarren@Ybuwyn.com
1015 Cameron Rhodes AUS NSW Tribop Pvt Ltd CRhodes@Tribop.com

From the Customers table, you want to create the following new table called 'Customer Summary'.

Customer Summary
Customer ID Customer Name Customer Address
1010 Roosevelt Romero RRomero@Dropellet.com
1011 Melody Cruz MCruz@Moderock.com
1012 Francis Moreno FMoreno@Claster.com
1013 Andy Jordan AJordan@Tuttadit.com
1012 Jeffry Warren JWarren@Ybuwyn.com
1015 Cameron Rhodes CRhodes@Tribop.com
  1. What DAX statement would you use to do this?
  2. Create a script that can result in listing all customer records that belong to the country code "AUS". The script should also list the result in descending order of the Customer ID.
  3. What DAX function would you use to create the following table called 'US Customers' that lists only the customer records that belong to the country code "US".

    US Customers
    Customer ID Firstname Lastname Country State Company Customer Address
    1010 Roosevelt Romero US TXS Dropellet Intl RRomero@Dropellet.com
    1011 Melody Cruz US MIN Moderock Intl MCruz@Moderock.com
    1012 Francis Moreno US MCH Claster Intl FMoreno@Claster.com
  4. What DAX function would you use to create another table called 'All except US Customers' that lists all customer records from the Customers table, except for those records in the 'US Customers' table.
  1. Customer Summary = SELECTCOLUMNS(Customers, 
      "Customer ID", [Customer ID], 
      "Customer Name", [Firstname] & " " & [Lastname], 
      "Customer Address", [Customer Address]
    )
  2. EVALUATE
    FILTER(Customers, [Country] = "AUS")
    ORDER BY Customers[Customer ID] DESC
    
  3. US Customers = FILTER(Customers, Customers[Country] = "US")
    
  4. All Except US Customers = EXCEPT(Customers, 'US Customers')
    

Consider the following Orders table.

Orders
Order ID Product ID Product Name Sales Cost Quantity
001 101002 Colored Pencils $ 3.50 $ 1.00 5
001 101004 Flourescent Highlighters $ 5.00 $ 2.50 2
002 101001 Crayons $ 7.50 $ 3.00 1
002 101003 Dustless Chalk Sticks $ 2.50 $ 1.50 2
002 101004 Flourescent Highlighters $ 5.00 $ 2.50 1
003 101003 Dustless Chalk Sticks $ 2.50 $ 1.50 10
003 101004 Flourescent Highlighters $ 5.00 $ 2.50 10
  1. Create a measure called Profit that adds up the Sales price for all orders.
  2. Create a measure called 'Distinct Products' that counts the number of different items in the Product ID column.
  3. What DAX statement would you use to create a new table called Products that includes only the Product ID, Product Name, Sales and Cost columns as follows.

    Products
    Product ID Product Name Sales Cost
    101002 Colored Pencils $ 3.50 $ 1.00
    101004 Flourescent Highlighters $ 5.00 $ 2.50
    101001 Crayons $ 7.50 $ 3.00
    101003 Dustless Chalk Sticks $ 2.50 $ 1.50
    101004 Flourescent Highlighters $ 5.00 $ 2.50
    101003 Dustless Chalk Sticks $ 2.50 $ 1.50
    101004 Flourescent Highlighters $ 5.00 $ 2.50
  4. Create another DAX calculated table called 'Distinct Products' that only includes unique product records and no duplicates. This should result in a table output as follows.

    Distinct Products
    Product ID Product Name Sales Cost
    101001 Crayons $ 7.50 $ 3.00
    101002 Colored Pencils $ 3.50 $ 1.00
    101003 Dustless Chalk Sticks $ 2.50 $ 1.50
    101004 Flourescent Highlighters $ 5.00 $ 2.50
  1. Profit = SUM(Orders[Sales])
  2. Distinct Products = DISTINCTCOUNT(Orders[Product ID])
  3. Products = SELECTCOLUMNS(Orders,
      "Product ID", [Product ID],
      "Product Name", [Product Name],
      "Sales", [Sales],
      "Cost", [Cost]
    )
  4. Distinct Products = DISTINCT(Products)

Additional resources

Refer to the following sources to learn more about DAX syntax and how it can be used.

Adding comments in DAX scripts

Adding comments will not affect the performance of the scripts/queries but would make it easier for others to understand the written code and make changes later on if needed.

These examples show the different methods of commenting.

Example 1: In-line comments

Orders = SELECTCOLUMNS('Company Data', --this is an inline comment
  "Customer ID", [Customer ID],
  "Customer Name", [Customer Name],
  "Customer Address", [Customer Address]
)

Method: Start your comment with two hyphens/dashes (--).

See example 1. The comment is embedded in-line using the hyphens indicating that anything after the hyphens is the comment text.

Example 2: Single-line comments

Orders = SELECTCOLUMNS('Company Data',
  "Customer ID", [Customer ID],
  // this is a single-line comment
  "Customer Name", [Customer Name],
  "Customer Address", [Customer Address]
)

Method: Begin the comment with the // symbol.

See example 2. The // symbol indicates that the entire line contains comment text.

Example 3: Commenting-out multiple lines of code

Orders = SELECTCOLUMNS('Company Data',
  "Customer ID", [Customer ID]
  /*
    this is a multi-line comment
    "Customer Name", [Customer Name],
    "Customer Address", [Customer Address]
  */
)

Method: Place the code in between /* and */ symbols.

See example 3. The two column lines between the /* and */ symbols won’t execute because they are commented out.

The final activity for this topic is a set of questions that will help you prepare for your formal assessment.

a couple of professionals validating data

To be able to work with and analyse big data, the data needs to be:

  • first, gathered from a variety of sources (i.e. extracted)
  • second, organised together (i.e. transformed)
  • third, loaded into a single centralised repository (i.e. loaded).

These basic processes involved in testing and preparing big data are commonly known as extract, transform and load (ETL).

Note: Throughout this module, you will come across this term ETL often. Therefore, it is important that you understand what it means and why it is important in big data testing.

Watch the following video to understand the ETL process and its benefits.

After watching the video, answer the following questions.

Data validation requires conducting routine checks to ensure that data is correct, meaningful and free of errors.

Therefore, the following protocols are used to support this data validation process.

  • Testing methodologies
  • Test scripting

Let’s try to understand each of these data validation protocols in detail.

Testing methodologies

The methodology of big data testing mainly involves three stages18:

  • Data staging validation
  • MapReduce validation
  • Output validation
A flow chart diagram depicting the three stages of the big data testing methodology

Data staging validation

This involves testing whether the data extracted from various sources is uploaded to the analytics platform correctly.

At this stage, you will also need to determine whether to test the complete set of data or a representative sample dataset.

Some of the validation checks performed in this stage include:

  • verifying that the data is obtained from the correct source system (e.g. files, database, CRM etc.)
  • verifying that the data obtained from the source is accurate and not corrupted in any way (e.g. by performing checksums, hashes, counts of records, format checking etc.)
  • verifying that the data files required for testing are uploaded or imported into the analytical platform correctly, without any errors or missing data
  • verifying that the source data synchronises with the data that is uploaded into the analytical platform.

These validation checks are commonly performed during extract, transform load (ETL) testing.
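
For example, some of the record-count and missing-value checks above can be expressed as simple DAX measures once the data has been imported (a sketch only, assuming a Customer table like the one shown earlier in this topic):

// total rows imported, to compare against the record count reported by the source system
Imported Row Count = COUNTROWS(Customer)

// key values that did not load correctly
Missing Customer IDs = COUNTBLANK(Customer[Customer ID])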

MapReduce (process) validation

This involves compressing the large data set into a more practical, aggregated and compact data sample.

Some of the validation checks performed in this stage include: 19

  • verifying the processing of data and the business logic
  • aligning the dataset to relevant parts of the organisation (e.g. source to target mapping)
  • validating the MapReduce* process to ensure the correct generation of the “key-value” pairs
  • validating the data after the MapReduce process
  • validating the data aggregation and segregation rules implemented
  • validating that any anomalies in the data are resolved
  • validating the generated data output to ensure it is correct, valid and that all reporting requirements are met.

* MapReduce – Originally, this term came from an algorithm called MapReduce, invented by Google, that divides a task into small parts and assigns them to many computers. The results are later collected in one place and integrated to form the result dataset. This process involves two steps: the Map task and the Reduce task. 20

  • The Map task takes a set of data and converts it into another set of data, where the individual elements are broken down into tuples (key-value pairs)
  • The Reduce task takes the output from the Map as an input and combines those data tuples into a smaller set of tuples.

Watch the following video to learn more about the MapReduce process.

Output validation

This involves unloading the previously validated data and loading it into the target repository system. It also ensures that the reports generated from the loaded data are displayed as per organisational requirements.

Some of the validation checks performed in this stage include:

  • validating the dashboard report model to ensure correct visualisations are used to display the required data
  • validating that the data transformation rules are applied correctly and results are as expected
  • validating the integrity of the data to ensure it is loaded into the target system successfully
  • validating the output against the specific source system data to ensure there is no data corruption
  • validating reports against the required data and ensuring that all indicators are displayed correctly.

Test scripting

How and when validation checks are performed

You have previously learnt that the process of validating data involves routine checks.

One approach that you can use to perform these checks at each stage of the testing process would be to: 21

  • manually test the output against the source data to check it is correct
  • manually validate that the key metrics are calculating as expected.

It is important to understand that each time you release a new version of the big data test solution, it needs to be validated. However, carrying out these validations manually would be a complex, time-consuming and burdensome task.

Automated testing

A more practical and efficient approach for carrying out validation checks is to use automated testing (also known as test scripting)21.

Running test queries and scripts on the data at each stage to ensure data quality and correctness can help detect errors that otherwise would not be obvious in manual tests.

Once these test scripts are written, they can be easily manipulated and reused to perform tests for other similar test scenarios.

These data validation scripts can be written using a variety of scripting languages, depending on the tools and platforms used.
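
As an illustration only (the table names here are hypothetical), a reusable DAX validation measure might reconcile what was loaded against the source extract:

// difference between the rows extracted from the source and the rows loaded into the model;
// anything other than zero suggests records were dropped or duplicated during loading
Row Count Variance = COUNTROWS('Source Extract') - COUNTROWS('Loaded Data')

Once written, the same measure can be reused or adapted for each new release of the test solution.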

The final activity for this sub-topic will help you prepare for your formal assessment.

a group of employees testing data

There are various protocols and techniques used when performance testing the output from big data sources.

Testing big data throughput

What is throughput?

Throughput refers to the number of transactions produced over time during a test.

Transactions, in this case, refer to the queries and scripts executed in the background by the data analysis platform to generate reports.

It is important to first have a throughput goal, that is, a specific number of requests per hour that the application needs to be able to handle.

Testing throughput

Power BI Performance Analyser helps to monitor the performance of the elements of a report, such as its visuals, DAX formulas and queries. Once the Performance Analyser measures the processing time required to run queries and load visuals in the report, this information can be analysed to identify any issues and to further investigate any delays that affect data throughput.

Read through Microsoft’s guidelines on Power BI Performance Analyser to find out about the protocols and techniques involved when using this tool to do the following:

  • Capture performance information
  • Compute the durations
  • Refreshing visuals
  • Refreshing an individual visual’s performance data
  • Saving performance information

We will discuss more on how you can analyse the performance data that you save from the Performance Analyser tool later on in a different topic.

The final activity for this sub-topic will help you prepare for your formal assessment.

During big data sample testing, you will find a variety of issues in each big data validation phase.

All of these issues would then need to be processed and reported appropriately. There are various protocols and techniques that are followed and used in organisations in order to process and report identified issues.

Protocols and techniques for processing and reporting issues

Let’s explore some of these protocols and techniques.

Report issues related to raw big data

In situations where an issue is related to the raw big data sources, these need to be communicated directly with the client or other relevant stakeholders who are involved in the project.

In a situation where the testers have found major issues with the big data sample received for testing and the testing process cannot feasibly progress, this needs to be communicated immediately to the client. Various communication protocols can be used in this scenario to inform the client, such as email or meetings (either face-to-face or online).

It is important to note that organisations have their own standard communication protocols and reporting methods as outlined in their policies and procedure documents (e.g. email format, procedure for client contact). Therefore, as a general practice, the formal report created by the testers should also follow organisational standard reporting formats.

Report issues related to systems/platforms

If there is an issue with the testing platform (e.g. performance issues, bugs, service unavailable), this needs to be raised with the relevant technical support personnel. Generally, in an organisational setup, this is first addressed by the internal technical support team, who would then escalate the issue to vendor support.

If testers are directly dealing with the vendor there may be vendor-specific communication protocols that they would have to follow in order to report issues.

For example, if there’s an issue with the services used in Microsoft Power BI, this should be reported to Microsoft support using the communication protocol specified by the vendor. Refer to Microsoft’s guidelines on this process at Power BI Desktop diagnostics collection - Power BI | Microsoft Docs.

Use ticketing systems to track and manage issues

Organisations use various ticketing systems (e.g. JIRA, ServiceNow) so that they can:

  • track each reported issue in detail
  • pass it onto the correct person who can solve the problem
  • have a documented trail of evidence related to the issue and all steps taken to solve the issue
  • track the time taken to solve the problem
  • have visibility of the number of issues found in the project.

The protocols and techniques used when reporting an issue are to:

  • create a new ticket in the system (this will generate a unique ticket ID for the issue)
  • describe what the issue is and include all necessary details relevant to the issue (e.g. how the issue was found, at what stage of testing was it found, etc.)
  • categorise the issue according to its priority or urgency (high, medium, low)
  • assign the report to the appropriate person who will have the expertise to look into the issue further or direct it to the client, third party contact etc. as appropriate
  • have some indication of the status of the issue (e.g. open, pending, in progress, verified, solved).

Therefore, whichever ticketing system you use in an organisation, it will usually follow the same protocols and techniques for processing and reporting issues during the big data sample testing process.

Generate formal test results reports

During and after the testing of big data samples, the results of the conducted tests can be collated into formal reports. These reports can then be presented and shared with clients, relevant stakeholders and management.

A formal test results report should include:

  • relevant evidence of all tests carried out on the big data sample
  • details of the raw big data sample
  • sampling strategies used
  • details of tools and platforms used
  • validation tests performed and their results at each stage of the testing process using test case scenario documents and results sheets
  • recommendations and suggested solutions for issues
  • final outcome of the sample tests.

It is also important to use clear, specific and industry-related terminology to represent test results.

It is important to note that organisations have their own standard report formats and protocols for generating reports such as report templates, graph visualisation templates, style guides etc. Therefore, as a general practice, the formal report created by the testers should also follow organisational standard reporting formats.

The final activity for this sub-topic will help you prepare for your formal assessment.

a group of employees in the office listening to a manager about policies and procedures

You must work within your organisation’s policies and procedures for big data testing. Consider this:

  • A ‘policy’ is a document which explains why a task should be undertaken
  • A ‘procedure’ is a document which explains how the task should be completed

Policies and procedures for big data testing are formulated around the following key processes.

Key processes for policies and procedures

For this unit, you must be particularly familiar with the following policies and procedures.

Assembling and obtaining raw big data

This is the first step in the process of testing big data samples. Raw big data is obtained from internal and/or external sources. It is then assembled into a software program (e.g. Apache Hadoop, Power BI Desktop etc) for validation and testing. You should follow your organisation’s policies and procedures when undertaking these tasks.

Policy for obtaining raw big data Helps to identify the internal or external sources of raw big data required for the job, any legislative requirements related to the data samples and relevant procedures to acquire the raw big data.
Procedure for assembling raw big data Provides guidelines on how to compile the big data sample for testing from the previously acquired raw dataset.

Isolating sub-standard data

Procedure for isolating sub-standard data Provides guidelines on how to separate those types of data that do not meet the quality standards set by the organisation and are unsuitable for further use (e.g. in analysis). You should isolate this data – that is, remove it from the dataset.
Procedure for correcting data acquisition paths Provides guidelines and steps on how to change, update and adjust the data source settings.

Testing data sources and storing test results

Procedure for storing test results and associated supporting evidence Provides guidelines on how and where to securely save the output and outcomes of the sample data tested. This also includes information on how to store all documents that contain proof of tests performed on the dataset so it is secure and can be retrieved when needed.
Procedure for testing transactional data sources Provides guidelines on how to test data sources that have a time dimension and become historical once the operation is complete.
Procedure for testing non-transactional data sources Provides guidelines on how to test data sources that are relevant to an organisation for a longer duration.

Data cleansing

Data cleansing is done soon after extract, transform and load (ETL) testing – these policies and procedures will explain how to validate the big data sample to ensure it is correct.

ETL Testing Procedure Involves a set of procedures for extracting data from different sources, then transforming the data and loading the data into the target system. Often done using software tools.
Data Cleansing Policy Provides guidelines on how to eliminate bad records, fix any formatting issues, perform de-duplication, and correct any inconsistencies present in the data after the data is collected.
Output Quality Assurance Policy Provides best practices and guidelines to ensure that the test results of the big data sample are presented clearly and consistently using standard representations (e.g. charts, graphs etc.). May include guidelines to ensure reports can be generated in a variety of formats that support different devices.

The final activity for this sub-topic will help you prepare for your formal assessment.

Topic Summary

Congratulations on completing your learning for this topic Big data sample testing fundamentals.

In this topic, you have learnt about the following fundamental concepts behind big data testing:

  • Legislative requirements
  • Features and formats of various big data sources
  • Writing queries and scripts
  • Big data validation protocols
  • Performance testing
  • Processing and reporting issues
  • Organisational policies and procedures

Check your learning

The final activity for this topic is a set of questions that will help you prepare for your formal assessment.

Assessments

Now that you have completed the basic knowledge required for this module, you are ready to complete the following two assessment events:

  • Assessment 1 (Online Quiz)
  • Assessment 2 (Short answer questions).

What’s next?

Next, we will dive into the more practical tasks related to validating assembled or obtained big data samples.
