Data Distribution and Control

Choosing data at random from a set of reference values is fundamental to test data generation engines. But a purely random choice often neglects representativeness. To make generated data more realistic, the underlying engine should let the designer control the probability distribution.

 

What Is a Probability Distribution

The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible values. For example, consider the content of a shopping cart from an online retailer such as Amazon.com. Each product in a cart is selected by a customer from the available catalog. The list of values is thus the list of products in the catalog. Now, if you consider all the carts created by customers of the website in a day, you can associate each product with a probability of being selected by a customer, just by counting the number of times it has been selected and dividing that count by the total number of selections. This is how we arrive at our probability distribution of products in a cart during a day.
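The counting approach described above can be sketched in a few lines of Python. This is only an illustration: the function name `empirical_distribution` and the sample carts are hypothetical, and each cart is assumed to be a plain list of product names.

```python
from collections import Counter

def empirical_distribution(carts):
    """Estimate each product's selection probability from observed carts.

    `carts` is assumed to be an iterable of lists of product names.
    """
    # Count how many times each product was selected across all carts.
    counts = Counter(product for cart in carts for product in cart)
    total = sum(counts.values())
    # Divide each count by the total number of selections.
    return {product: n / total for product, n in counts.items()}

# Hypothetical carts collected during one day.
carts = [["book", "pen"], ["book"], ["lamp", "book"]]
distribution = empirical_distribution(carts)
```

With these three carts, "book" accounts for three of the five selections, so it gets the highest probability, and the probabilities sum to 1.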

Adding Control

Probability distribution control over a generated list of values is a mechanism that lets the designer of a generator control the number of occurrences of each possible value in the list by associating it with a weight.

Principles of Distribution Control

The probability of a value is computed by dividing its weight by the sum of the weights of all the possible values in the list, its own included. The basis of this mechanism is the weighted list generation rule. In this rule, the designer defines a list of values and chooses a weight for each of them. The generation engine then makes a random choice among those values, while ensuring that the number of times it chooses a value is in proportion to the weight of that value.

Take for example the following weighted list of colors:

Value   Weight  Probability
blue    1       0.17
red     2       0.33
black   3       0.50

The probability of occurrence of the value "red" is 2 / (1 + 2 + 3) ≈ 0.33 (about 33%).
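The same arithmetic can be checked for every color with a short Python sketch. The helper name `probabilities` and the dictionary representation of the weighted list are assumptions made for illustration, not part of any particular engine.

```python
def probabilities(weights):
    """Convert a mapping of value -> weight into value -> probability."""
    # The denominator is the sum of ALL weights, including the value's own.
    total = sum(weights.values())
    return {value: w / total for value, w in weights.items()}

# The weighted list of colors from the example above.
colors = {"blue": 1, "red": 2, "black": 3}
probs = probabilities(colors)
```

Here `probs["red"]` is 2 / 6, `probs["black"]` is 3 / 6 and `probs["blue"]` is 1 / 6, which round to the 0.33, 0.5 and 0.17 shown in the list.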

Basic Application of Distribution Control

The basic application of distribution control targets a single field: a Weighted List generation rule defines all the values allowed for that field, each with an appropriate weight.
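As a rough sketch of what such a rule might do for one field, the Python below draws values with the standard library's `random.choices`. The function `generate_field` and its parameters are hypothetical. Note one assumption in particular: independent weighted draws follow the weights only on average, whereas a real engine may enforce the counts more strictly.

```python
import random
from collections import Counter

def generate_field(weights, n, seed=None):
    """Draw n values for a single field according to a weighted list.

    `weights` maps each allowed value to its weight (hypothetical format).
    """
    rng = random.Random(seed)  # a seed makes the generated data reproducible
    values = list(weights)
    # Each draw is independent; a value's chance is weight / sum of weights.
    return rng.choices(values, weights=[weights[v] for v in values], k=n)

colors = {"blue": 1, "red": 2, "black": 3}
sample = generate_field(colors, 6000, seed=42)
counts = Counter(sample)  # occurrences of each color in the generated field
```

Over a large sample, the counts should land near the 1 : 2 : 3 ratio of the weights, with "black" the most frequent and "blue" the least.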