Statistical Functions

This page covers commonly used statistical functions and how to implement them using the existing functions within Rockset.

Median

The median is the middle value in a dataset when arranged in ascending or descending order.

SELECT
    ARRAY_SORT(ARRAY_AGG(foo.data)) [(COUNT(*) + 1) / 2] as median
FROM
    UNNEST(array [1,2,3,4,5] AS data) AS foo
3

Mode

The mode is the value that appears most frequently in a dataset.

SELECT
    foo.data as mode
FROM
    UNNEST(array [3,3,3,1,5] AS data) AS foo
GROUP BY
    1
ORDER BY
    COUNT(foo.data) desc
LIMIT
    1
3

Counter

The example below shows how to output a map of keys and their associated counts or frequencies.

SELECT
    MAP_AGG(x.data, x.counts) as counter
FROM
    (
        SELECT
            foo.data,
            COUNT(foo.data) AS counts
        FROM
            UNNEST(array ['apple', 'apple', 'banana'] AS data) AS foo
        GROUP BY
            1
    ) x
{"apple":2,"banana":1}

Percentile

A percentile is a statistical value that indicates the percentage of data points below a given value in a dataset. The example below outputs the percentile of 4.7 in a given dataset of [1,2,3,4,5].

SELECT
    COUNT_IF(foo.data <= 4.7) * 100.0 / COUNT(*) as percentile
FROM
    UNNEST(array [1,2,3,4,5] AS data) AS foo
80

Standard Deviation

Standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of values.

The population standard deviation represents the dispersion of values in an entire population, whereas sample standard deviation is the estimated dispersion of values in a subset or sample of a population.

The sample standard deviation STDDEV_SAMP is already a built-in function in Rockset.

SELECT
    STDDEV_SAMP(foo.data) as sample_stddev
FROM
    UNNEST(array [1,2,3,4,5] AS data) AS foo
1.5811388300841898

The population standard deviation can be calulated using the equation below.

SELECT
    SQRT(POWER(STDDEV_SAMP(foo.data), 2) * (COUNT(*)-1)/COUNT(*)) as pop_stddev
FROM
    UNNEST(array [1,2,3,4,5] AS data) AS foo
1.4142135623730951

Variance

Variance is a statistical measure that quantifies the average squared deviation of values from their mean, providing a measure of the spread or dispersion of the data.

The population variance represents the average squared deviation of values from the mean for an entire population, whereas sample variance is the estimated average squared deviation of values from the mean for a subset or sample of a population.

The sample variance can be calculated by squaring the sample standard deviation as below.

SELECT
    POWER(STDDEV_SAMP(foo.data), 2) as sample_variance
FROM
    UNNEST(array [1,2,3,4,5] AS data) AS foo
2.5000000000000004

The population variance can be calculated using the equation below.

SELECT
    POWER(STDDEV_SAMP(foo.data), 2) * (COUNT(*) -1) / COUNT(*) as pop_variance
FROM
    UNNEST(array [1,2,3,4,5] AS data) AS foo
2.0000000000000004

Z-Score

Z-score is a statistical measure that represents the number of standard deviations a data point is from the mean of a distribution. The example below outputs the z-score of 4.7 in a given dataset of [1,2,3,4,5].

SELECT
    (4.7 - AVG(foo.data))/STDDEV_SAMP(foo.data) as z_score
FROM
    UNNEST(array [1,2,3,4,5] AS data) AS foo
1.075174404457249