Data Classification

ILOs


Maps are a graphic form for representing objects and phenomena on the surface of the earth.  Human vision has limited capability to extract information from a map that simply depict the raw data.  For example, to show population distribution at the county level for the entire United States, we can use one graphic symbol for each unique population value, resulting in hundreds of different symbols that our eyes simply can't tell apart.  In order to overcome this problem, we can group counties that share similar population values together and assign each group a graphic symbol.  We can limit the number of groups so that we can effectively communicate the information.

This grouping process is called data classification.  There are many techniques for classification.  In fact, classification is a sub-field in mathematics.  At the introductary level, we will mainly deal with three types, i.e., equal interval, quantile, and standard deviation.  It is important to know that classification generalizes the data and inevitably reduce and distort  the information content.  Different maps can be generated simply by using different classification methods and different number of classes.  With data classification, a map can tell some truths as well as lies.

Chapter 4 in Slocum gives excellent descriptions on each of the classification methods.  This note gives some hightlights and complementary discussions.

Below is the table of 1990 population density (number of person per square mile) for selected counties in Central Michigan.  We will use this data set to demonstrate the three data classification procedures.
 
 
"Name" "Popden90"
Lake 14.937
Osceola 35.153
Arenac 40.514
Clare 43.379
Gladwin 42.399
Bay 249.510
Midland 143.315
Isabella 94.548
Newaygo 44.350
Mecosta 65.327
Saginaw 259.817
Montcalm 73.595
Gratiot 68.200
Kent 574.013
Genesee 662.925
Shiawassee 129.034
Ionia 98.281
Clinton 100.742
Livingston 197.534
Ingham 502.594
Eaton 160.411
Barry 86.774

Equal interval

Steps for computing class limits:

1. Determine the class interval

Assume we want to group the 22 counties into 5 groups,

Class interval = range / (number of classes) = (662.925 - 14.937)/5 = 647.988/5 = 129.598

2. Determine the upper limit of each class

The first class starts with the lowest value, which is 14.937.  The upper limit of the first class is calculated as 14.937 + 129.598 = 144.535.

3. Determine the lower limit of each class

For the second class and the subsequent classes, the low limits should be slightly greater than the previou upper limit.  In the case for the second class, the low limit is 0.001 greater than 144.535.  The upper limit is calculated as 144.536 + 129.598 = 274.134.  We will keep using 0.001 as the margin between the upper limit and the lower limit of the subsequent class.

Results of 2 and 3
 
Classes
Class limits
1
14.937 - 144.535
2
144.536 - 274.134
3
274.135 - 403.733
4
403.734 - 533.332
5
533.333 - 662.931

4. Specify the actual class limits for mapping

This step involves the rounding of the class limits.  It is necessary only when the limits calculated in steps 2 and 3 have higher precision than the original data.

5. Determine the class assignment

Use the class limits, we can now assign each county to a particular class.  The result is in the table.
 
"Name" "Popden90" Equal interval
Lake 14.937 1
Osceola 35.153 1
Arenac 40.514 1
Clare 43.379 1
Gladwin 42.399 1
Bay 249.510 2
Midland 143.315 1
Isabella 94.548 1
Newaygo 44.350 1
Mecosta 65.327 1
Saginaw 259.817 2
Montcalm 73.595 1
Gratiot 68.200 1
Kent 574.013 5
Genesee 662.925 5
Shiawassee 129.034 1
Ionia 98.281 1
Clinton 100.742 1
Livingston 197.534 2
Ingham 502.594 4
Eaton 160.411 2
Barry 86.774 1

Below is the corresponding map.

What are the advantages and disadvantages of equal interval classification?  (page. 67)

Quantile

Quantile classification puts equal number of observations (county population density) in each class according to the ranks.

1. Calculate the number of observations in each class

Number of observations = (Total number of observations) / (number of classes)
                                          = 22 / 5 = 4.4
Since the number of observations is not an integer, we can assign the number of classes in the following manner: class 1 - four observations; class 2 - five observations; class 3 - four observations; class 4 - five observations; class 5 - four observations.

2. Rank the data
3. Assign classes

Below is the result.
 
"Name" "Popden90" Quantile class
Lake 14.937 1
Osceola 35.153 1
Arenac 40.514 1
Gladwin 42.399 1
Clare 43.379 2
Newaygo 44.350 2
Mecosta 65.327 2
Gratiot 68.200 2
Montcalm 73.595 2
Barry 86.774 3
Isabella 94.548 3
Ionia 98.281 3
Clinton 100.742 3
Shiawassee 129.034 4
Midland 143.315 4
Eaton 160.411 4
Livingston 197.534 4
Bay 249.510 4
Saginaw 259.817 5
Ingham 502.594 5
Kent 574.013 5
Genesee 662.925 5

And the corresponding map:

What's advantage and disadvantages of quantile classification? (page 67-69)
 

Standard Deviation

In this method, the standard deviation is used to compute the class limits.  One may use any units of standard deviations for calculating the class limits.  The best way to understand this method is to use the following illustration:
 
 

For our example, the standard deviation is ±177.686 and the mean is 167.607.  We can first subtract the mean by 1 unit of standard deviation.  167.607 - 177.686 = -10.08, which is way below the minimum population density in our data set.  There is no need to subtract -10.08 to get another class further to the left of the mean.

We now will add one standard deviation to the mean.  167.607 + 177.686 = 345.293, which still smaller than the maximum population density.  We hence add another unit of standard deviation to 345.293 to produce another class.  Use the procedure we generated the table below:

class 1: -10.08  to 167.607 (one standard deviation below the mean)
class 2: 167.607 (the mean)
class 3: 167.607 to 345.293 (one standard deviation above the mean)
class 4: 345.293 to 522.98 (two standard deviation above the mean)
class 5: 522.98 to 700.266 (three standard deviation above the mean)

Note that the number of classes is determined by the characteristics of the data and is not artificially determined.  One can use a fraction of the standard deviation to calculate the classes, which will produce more classes.

Below is the corresponding map:

What's advantages and disadvantages of standard deviation classification method?  (page 69)
 

Exercises

1. Classify the following data set into three groups using equal interval, quantile, standard deviation (1 standard deviation) respectively.
 
 
Data Equal interval Quantile Standard deviation
1.2
3.4
1
5
0.5
2.3
5.6
3.9
6.7
2.6
8.2
1.5

2. Make a map showing the population density by counties (L:\esri\esridata\us\counties.shp) in the United States, using three different types of classification method.  Use gray level (gray monochromatic) for the legend.  Do not use color.  Edit the maps with careful design.  Publish the maps to the Web.