Data Classification: Histograms, Classification Schemes & Choropleth Mapping

Data Types

Data falls into two broad kinds, each with two sub-types. Qualitative (descriptive) data includes nominal and ordinal data. Nominal data is categorical data with no inherent ranking; ordinal data is categorical data that does have a rank. Quantitative (numeric) data includes interval and ratio data. Ratio data has an absolute zero point, so both the differences and the ratios between values are meaningful. Interval data is similar, but it lacks an absolute zero point, so only the differences are meaningful. The death rate is ratio data: it can’t be negative, and a rate of 2 is twice as high as a rate of 1.

Rates by Province/State

We can normalize our data by demographic information at different administrative levels (e.g. province, municipality) because rates vary between administrative divisions. If we want to classify rates, the first step is to look at a histogram. A histogram shows the frequency distribution of a given variable: the data is grouped into a set of bins, and the values in each bin are counted. Histograms are useful for spotting outliers in a dataset, which can influence our choice of classification scheme.
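As a rough illustration of the binning and counting step, here is a minimal sketch in plain Python. The `rates` list is invented for this example (it is not the lecture's dataset), and the `histogram` helper is a hypothetical name, not a library function.

```python
# Hypothetical rates, invented for illustration only.
rates = [0.3, 0.8, 1.1, 1.4, 1.9, 2.2, 2.5, 3.1, 3.8, 10.6]


def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


counts = histogram(rates, 4)
for i, c in enumerate(counts, start=1):
    # A crude text histogram: one '#' per observation in the bin.
    print(f"bin {i}: {'#' * c}")
```

With this made-up data, most values pile into the first bin and a single value sits alone in the last one, which is the kind of outlier a histogram makes easy to spot.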

Classification Methods

We’ll cover five classification methods:

1) Equal Interval

  • Data is split into bins of equal width regardless of distribution

2) Quantiles

  • Data is split by percentiles

3) Natural Breaks

  • Data is split using the Jenks algorithm

4) Manual Breaks

  • We define our own splits

5) Standard Deviation

  • Data is split into bins based on distance from the mean

A note on color choices: sequential colormaps are the best choice for representing ratio data (e.g. PKR). I suggest checking out ColorBrewer for help picking color schemes.

Equal Interval

The simplest classification scheme is to break the data into classes of equal width. For example, the minimum is 0.3 and the maximum is 10.6, so we can split that range into four bins, each about 2.6 units wide.
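The arithmetic behind those bins can be sketched as follows. The minimum, maximum, and number of classes come from the text; the function name `equal_interval_breaks` is hypothetical. The exact width is (10.6 - 0.3) / 4 = 2.575, which rounds to 2.6.

```python
def equal_interval_breaks(lo, hi, n_classes):
    """Return the n_classes + 1 class edges for equal-width bins."""
    width = (hi - lo) / n_classes
    return [lo + i * width for i in range(n_classes + 1)]


# Minimum and maximum taken from the text; four classes.
breaks = equal_interval_breaks(0.3, 10.6, 4)
width = breaks[1] - breaks[0]

print("class width:", round(width, 3))
print("class edges:", [round(b, 3) for b in breaks])
```

Note that the edges depend only on the minimum and maximum, so a single extreme value stretches every class.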

Nunavut has the highest police-involved death rate of any administrative sub-division in Canada or the United States.

Quantiles

Quantiles are the simplest classification scheme that is based on the data’s distribution. The data is ranked and broken up by percentiles:

  • class 1 contains 0-20%, class 2 contains 20-40%, class 3 contains 40-60%, class 4 contains 60-80%, and class 5 contains 80-100%
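A minimal sketch of the five-class split above, using Python's standard `statistics.quantiles`. The `rates` list is invented for illustration, and the choice of the "inclusive" interpolation method is an assumption; GIS packages may place the cut points slightly differently.

```python
import bisect
import statistics

# Hypothetical rates, invented for illustration only.
rates = [0.3, 0.8, 1.1, 1.4, 1.9, 2.2, 2.5, 3.1, 3.8, 10.6]

# n=5 returns the four cut points at the 20th, 40th, 60th and 80th percentiles.
cuts = statistics.quantiles(rates, n=5, method="inclusive")

# Class 1 holds the lowest 20% of values, class 5 the highest 20%.
classes = [bisect.bisect_left(cuts, r) + 1 for r in rates]

for r, c in zip(rates, classes):
    print(f"rate {r:>5}: class {c}")
```

Each class ends up with the same number of observations, so the outlier at 10.6 shares a class with a much smaller value.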

Natural Breaks

Data is split using the Jenks algorithm, which optimizes the split into “natural” classes. The algorithm maximizes within-group similarity and between-group dissimilarity.
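To make the objective concrete, the sketch below tries every possible way of cutting a small sorted dataset into classes and keeps the split with the smallest within-class sum of squared deviations. This is an exhaustive search standing in for the Jenks algorithm, not the algorithm itself, and it is only practical for tiny datasets. The data and function names are hypothetical.

```python
from itertools import combinations


def sum_sq_dev(group):
    """Sum of squared deviations of a group from its own mean."""
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)


def natural_breaks_bruteforce(values, n_classes):
    """Return the contiguous grouping with the least within-class variation."""
    data = sorted(values)
    best_cost, best_groups = float("inf"), None
    # Every choice of cut positions splits the sorted data into contiguous groups.
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        edges = (0, *cuts, len(data))
        groups = [data[a:b] for a, b in zip(edges, edges[1:])]
        cost = sum(sum_sq_dev(g) for g in groups)
        if cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups


# Hypothetical rates with three visible clusters, invented for illustration.
rates = [0.3, 0.5, 0.6, 2.9, 3.0, 3.2, 10.4, 10.6]
groups = natural_breaks_bruteforce(rates, 3)
print(groups)
```

The class boundaries fall in the gaps between clusters, which is what "natural" means here.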

Manual Breaks

We can define our own break values to classify data. This allows us to choose more meaningful break values if necessary (round numbers, clean fractions, etc.). The choice of manual breaks can influence the way the data is perceived.
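The sketch below shows how two different sets of manual breaks sort the same values into very different classes. Both the `rates` list and the break values are invented for illustration, and `classify` is a hypothetical helper.

```python
import bisect

# Hypothetical rates, invented for illustration only.
rates = [0.3, 0.8, 1.1, 1.4, 1.9, 2.2, 2.5, 3.1, 3.8, 10.6]


def classify(values, breaks):
    """Return a 1-based class per value; a value equal to a break goes in the lower class."""
    return [bisect.bisect_left(breaks, v) + 1 for v in values]


# Two hypothetical sets of round-number breaks.
fine_breaks = [1, 2, 5]   # classes: <=1, 1-2, 2-5, >5
coarse_breaks = [4, 8]    # classes: <=4, 4-8, >8

fine = classify(rates, fine_breaks)
coarse = classify(rates, coarse_breaks)

print("fine:  ", fine)
print("coarse:", coarse)
```

With the coarse breaks, nine of the ten values share the lowest class, so a map drawn from them would look almost uniform apart from the single outlier.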

Standard Deviation

This distribution-based classification method shows how far a value is from the mean, measured in standard deviations. It can be very informative to a knowledgeable user, but it is not accessible to the general public. The standard deviation classification method converts the data to interval data (deviations above or below the mean). Diverging colormaps are often a better choice for interval data, as they can better highlight which values are above or below the zero point.
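A minimal sketch of the conversion, assuming class breaks at whole standard deviations and the population standard deviation (`statistics.pstdev`); both choices are assumptions for illustration, and the `rates` list is invented.

```python
import math
import statistics

# Hypothetical rates, invented for illustration only.
rates = [0.3, 0.8, 1.1, 1.4, 1.9, 2.2, 2.5, 3.1, 3.8, 10.6]

mean = statistics.mean(rates)
sd = statistics.pstdev(rates)  # population standard deviation (an assumption)

# Deviation from the mean in units of standard deviation; zero is the mean itself.
z_scores = [(r - mean) / sd for r in rates]

# Class label = lower edge of the one-SD-wide band that holds the value:
# -1 means between 1 and 0 SD below the mean, 0 means between 0 and 1 SD above, etc.
classes = [math.floor(z) for z in z_scores]

for r, z, c in zip(rates, z_scores, classes):
    print(f"rate {r:>5}: z = {z:+.2f}, class {c}")
```

The resulting values are negative below the mean and positive above it, which is why a diverging colormap centered on zero suits them.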

Poll Questions:

7) If you want to highlight the severity of police-involved deaths in Canada, which classification method would be best?

A) Equal Interval
B) Quantiles
C) Natural Breaks
D) Manual Breaks
E) Standard Deviation

8) Which classification method might the RCMP choose to minimize the apparent severity of police-involved deaths in Canada?

A) Equal Interval
B) Quantiles
C) Natural Breaks
D) Manual Breaks
E) Standard Deviation