Classification

Data Classification

ILOs

Understand the need for data classification
Be able to classify data with three methods: equal interval, quantile, and standard deviation
Understand the advantages and disadvantages of different methods of classification (equal interval, quantile, standard deviation)
Be able to use ArcView to generate maps with different classification methods
Be able to appreciate the results of different kinds of classifications

Maps are a graphic form for representing objects and phenomena on the surface of the earth. Human vision has limited capability to extract information from a map that simply depict the raw data. For example, to show population distribution at the county level for the entire United States, we can use one graphic symbol for each unique population value, resulting in hundreds of different symbols that our eyes simply can't tell apart. In order to overcome this problem, we can group counties that share similar population values together and assign each group a graphic symbol. We can limit the number of groups so that we can effectively communicate the information.

This grouping process is called data classification. There are many techniques for classification. In fact, classification is a sub-field in mathematics. At the introductary level, we will mainly deal with three types, i.e., equal interval, quantile, and standard deviation. It is important to know that classification generalizes the data and inevitably reduce and distort the information content. Different maps can be generated simply by using different classification methods and different number of classes. With data classification, a map can tell some truths as well as lies.

Chapter 4 in Slocum gives excellent descriptions on each of the classification methods. This note gives some hightlights and complementary discussions.

Below is the table of 1990 population density (number of person per square mile) for selected counties in Central Michigan. We will use this data set to demonstrate the three data classification procedures.

"Name" "Popden90"

Lake 14.937

Osceola 35.153

Arenac 40.514

Clare 43.379

Gladwin 42.399

Bay 249.510

Midland 143.315

Isabella 94.548

Newaygo 44.350

Mecosta 65.327

Saginaw 259.817

Montcalm 73.595

Gratiot 68.200

Kent 574.013

Genesee 662.925

Shiawassee 129.034

Ionia 98.281

Clinton 100.742

Livingston 197.534

Ingham 502.594

Eaton 160.411

Barry 86.774

Equal interval

Steps for computing class limits:

1. Determine the class interval

Assume we want to group the 22 counties into 5 groups,

Class interval = range / (number of classes) = (662.925 - 14.937)/5 = 647.988/5 = 129.598

2. Determine the upper limit of each class

The first class starts with the lowest value, which is 14.937. The upper limit of the first class is calculated as 14.937 + 129.598 = 144.535.

3. Determine the lower limit of each class

For the second class and the subsequent classes, the low limits should be slightly greater than the previou upper limit. In the case for the second class, the low limit is 0.001 greater than 144.535. The upper limit is calculated as 144.536 + 129.598 = 274.134. We will keep using 0.001 as the margin between the upper limit and the lower limit of the subsequent class.

Results of 2 and 3

Classes Class limits

1
14.937 - 144.535

2
144.536 - 274.134

3
274.135 - 403.733

4
403.734 - 533.332

5
533.333 - 662.931

4. Specify the actual class limits for mapping

This step involves the rounding of the class limits. It is necessary only when the limits calculated in steps 2 and 3 have higher precision than the original data.

5. Determine the class assignment

Use the class limits, we can now assign each county to a particular class. The result is in the table.

"Name" "Popden90" Equal interval

Lake 14.937 1

Osceola 35.153 1

Arenac 40.514 1

Clare 43.379 1

Gladwin 42.399 1

Bay 249.510 2

Midland 143.315 1

Isabella 94.548 1

Newaygo 44.350 1

Mecosta 65.327 1

Saginaw 259.817 2

Montcalm 73.595 1

Gratiot 68.200 1

Kent 574.013 5

Genesee 662.925 5

Shiawassee 129.034 1

Ionia 98.281 1

Clinton 100.742 1

Livingston 197.534 2

Ingham 502.594 4

Eaton 160.411 2

Barry 86.774 1

Below is the corresponding map.

What are the advantages and disadvantages of equal interval classification? (page. 67)

Quantile

Quantile classification puts equal number of observations (county population density) in each class according to the ranks.

1. Calculate the number of observations in each class

Number of observations = (Total number of observations) / (number of classes)
= 22 / 5 = 4.4
Since the number of observations is not an integer, we can assign the number of classes in the following manner: class 1 - four observations; class 2 - five observations; class 3 - four observations; class 4 - five observations; class 5 - four observations.

2. Rank the data
3. Assign classes

Below is the result.

"Name" "Popden90" Quantile class

Lake 14.937 1

Osceola 35.153 1

Arenac 40.514 1

Gladwin 42.399 1

Clare 43.379 2

Newaygo 44.350 2

Mecosta 65.327 2

Gratiot 68.200 2

Montcalm 73.595 2

Barry 86.774 3

Isabella 94.548 3

Ionia 98.281 3

Clinton 100.742 3

Shiawassee 129.034 4

Midland 143.315 4

Eaton 160.411 4

Livingston 197.534 4

Bay 249.510 4

Saginaw 259.817 5

Ingham 502.594 5

Kent 574.013 5

Genesee 662.925 5

And the corresponding map:

What's advantage and disadvantages of quantile classification? (page 67-69)

Standard Deviation

In this method, the standard deviation is used to compute the class limits. One may use any units of standard deviations for calculating the class limits. The best way to understand this method is to use the following illustration:

For our example, the standard deviation is ±177.686 and the mean is 167.607. We can first subtract the mean by 1 unit of standard deviation. 167.607 - 177.686 = -10.08, which is way below the minimum population density in our data set. There is no need to subtract -10.08 to get another class further to the left of the mean.

We now will add one standard deviation to the mean. 167.607 + 177.686 = 345.293, which still smaller than the maximum population density. We hence add another unit of standard deviation to 345.293 to produce another class. Use the procedure we generated the table below:

class 1: -10.08 to 167.607 (one standard deviation below the mean)
class 2: 167.607 (the mean)
class 3: 167.607 to 345.293 (one standard deviation above the mean)
class 4: 345.293 to 522.98 (two standard deviation above the mean)
class 5: 522.98 to 700.266 (three standard deviation above the mean)

Note that the number of classes is determined by the characteristics of the data and is not artificially determined. One can use a fraction of the standard deviation to calculate the classes, which will produce more classes.

Below is the corresponding map:

What's advantages and disadvantages of standard deviation classification method? (page 69)

Exercises

1. Classify the following data set into three groups using equal interval, quantile, standard deviation (1 standard deviation) respectively.

Data Equal interval Quantile Standard deviation

1.2

3.4

1

5

0.5

2.3

5.6

3.9

6.7

2.6

8.2

1.5

2. Make a map showing the population density by counties (L:\esri\esridata\us\counties.shp) in the United States, using three different types of classification method. Use gray level (gray monochromatic) for the legend. Do not use color. Edit the maps with careful design. Publish the maps to the Web.

"Name"	"Popden90"
Lake	14.937
Osceola	35.153
Arenac	40.514
Clare	43.379
Gladwin	42.399
Bay	249.510
Midland	143.315
Isabella	94.548
Newaygo	44.350
Mecosta	65.327
Saginaw	259.817
Montcalm	73.595
Gratiot	68.200
Kent	574.013
Genesee	662.925
Shiawassee	129.034
Ionia	98.281
Clinton	100.742
Livingston	197.534
Ingham	502.594
Eaton	160.411
Barry	86.774

Classes	Class limits
1	14.937 - 144.535
2	144.536 - 274.134
3	274.135 - 403.733
4	403.734 - 533.332
5	533.333 - 662.931

"Name"	"Popden90"	Equal interval
Lake	14.937	1
Osceola	35.153	1
Arenac	40.514	1
Clare	43.379	1
Gladwin	42.399	1
Bay	249.510	2
Midland	143.315	1
Isabella	94.548	1
Newaygo	44.350	1
Mecosta	65.327	1
Saginaw	259.817	2
Montcalm	73.595	1
Gratiot	68.200	1
Kent	574.013	5
Genesee	662.925	5
Shiawassee	129.034	1
Ionia	98.281	1
Clinton	100.742	1
Livingston	197.534	2
Ingham	502.594	4
Eaton	160.411	2
Barry	86.774	1

"Name"	"Popden90"	Quantile class
Lake	14.937	1
Osceola	35.153	1
Arenac	40.514	1
Gladwin	42.399	1
Clare	43.379	2
Newaygo	44.350	2
Mecosta	65.327	2
Gratiot	68.200	2
Montcalm	73.595	2
Barry	86.774	3
Isabella	94.548	3
Ionia	98.281	3
Clinton	100.742	3
Shiawassee	129.034	4
Midland	143.315	4
Eaton	160.411	4
Livingston	197.534	4
Bay	249.510	4
Saginaw	259.817	5
Ingham	502.594	5
Kent	574.013	5
Genesee	662.925	5

Data	Equal interval	Quantile	Standard deviation
1.2
3.4
1
5
0.5
2.3
5.6
3.9
6.7
2.6
8.2
1.5