You are currently browsing the Philipp Meier's weblog posts tagged: sql

marketing

## Group by logarithmic count in SQL

The count() aggregate function in sql can be uses to get a histogram of the distribution of values in a table. Assume there is a table with customer data and you want to get a distribution of the distance to each customer:

``` select distance, count(distance) from cust\_with\_dist group by distance; ```

This will show the distance group be the exact values, which is not always desirable, f.e. when the dist column contains floating point data, this would yield the following result:

``` distance | count ----------+------- 0.0 | 142 1.2 | 165 5.5 | 24 7.6 | 17 8.3 | 11 9.3 | 11 9.4 | 6 ... ```

Better would be to arrange the distance column in whole numbers:

``` select distance, count(distance) from (select round(distance) as distance from cust\_with\_dist) as foo group by distance ```

This gives:

``` distance | count ----------+------- 0 | 142 1 | 165 6 | 24 8 | 28 9 | 17 ```

Better for now, but if there is a large scale of values for the distance, e.g. 0 up to 1000km it dosn’t make sense to group the values from 100 to 200 into 1km intervals. A better grouping will be exponential or logarithmic grouping, e.g. into group up to 1, 10, 100, 1000km or 1, 5, 25, 125, 625km. This can be accomplished by transforming the dist column before grouping:

``` select dist, count(dist) from (select round(pow(10,ceil(log(10,distance+0.0001)))) as dist from cust\_with\_dist) as foo group by dist ```

The distance is added 0.0001 to prevent log(0) exceptions, by using the ceil() function the data is grouped into intervals “up-to”, the floor() function would give “as least”. The result is

``` dist | count ------+------- 0 | 142 10 | 235 100 | 111 1000 | 7 ```

If you exchange “10” by “5” you’ll get distances intervals of 0, 5, 125, 625 and so on.  Bear
tools