This page covers commonly used statistical functions and how to implement them using the existing functions within Rockset.
Median
The median is the middle value in a dataset when arranged in ascending or descending order.
SELECT
ARRAY_SORT(ARRAY_AGG(foo.data)) [(COUNT(*) + 1) / 2] as median
FROM
UNNEST(array [1,2,3,4,5] AS data) AS foo
Mode
The mode is the value that appears most frequently in a dataset.
SELECT
foo.data as mode
FROM
UNNEST(array [3,3,3,1,5] AS data) AS foo
GROUP BY
1
ORDER BY
COUNT(foo.data) desc
LIMIT
1
Counter
The example below shows how to output a map of keys and their associated counts or frequencies.
SELECT
MAP_AGG(x.data, x.counts) as counter
FROM
(
SELECT
foo.data,
COUNT(foo.data) AS counts
FROM
UNNEST(array ['apple', 'apple', 'banana'] AS data) AS foo
GROUP BY
1
) x
Percentile
A percentile is a statistical value that indicates the percentage of data points below a given value in a dataset. The example below outputs the percentile of 4.7
in a given dataset of [1,2,3,4,5]
.
SELECT
COUNT_IF(foo.data <= 4.7) * 100.0 / COUNT(*) as percentile
FROM
UNNEST(array [1,2,3,4,5] AS data) AS foo
Standard Deviation
Standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of values.
The population standard deviation represents the dispersion of values in an entire population, whereas sample standard deviation is the estimated dispersion of values in a subset or sample of a population.
The sample standard deviation STDDEV_SAMP is already a built-in function in Rockset.
SELECT
STDDEV_SAMP(foo.data) as sample_stddev
FROM
UNNEST(array [1,2,3,4,5] AS data) AS foo
The population standard deviation can be calulated using the equation below.
SELECT
SQRT(POWER(STDDEV_SAMP(foo.data), 2) * (COUNT(*)-1)/COUNT(*)) as pop_stddev
FROM
UNNEST(array [1,2,3,4,5] AS data) AS foo
Variance
Variance is a statistical measure that quantifies the average squared deviation of values from their mean, providing a measure of the spread or dispersion of the data.
The population variance represents the average squared deviation of values from the mean for an entire population, whereas sample variance is the estimated average squared deviation of values from the mean for a subset or sample of a population.
The sample variance can be calculated by squaring the sample standard deviation as below.
SELECT
POWER(STDDEV_SAMP(foo.data), 2) as sample_variance
FROM
UNNEST(array [1,2,3,4,5] AS data) AS foo
The population variance can be calculated using the equation below.
SELECT
POWER(STDDEV_SAMP(foo.data), 2) * (COUNT(*) -1) / COUNT(*) as pop_variance
FROM
UNNEST(array [1,2,3,4,5] AS data) AS foo
Z-Score
Z-score is a statistical measure that represents the number of standard deviations a data point is from the mean of a distribution. The example below outputs the z-score of 4.7
in a given dataset of [1,2,3,4,5]
.
SELECT
(4.7 - AVG(foo.data))/STDDEV_SAMP(foo.data) as z_score
FROM
UNNEST(array [1,2,3,4,5] AS data) AS foo