
We will cover:

	Types of Statistics
	Descriptive & Graphical
	Inferential
	Variable Relationships
	Statistical Modeling
	Data Dimension Reduction
	Nonparametric Methods
	Quality Control
	Time Series
	Survival Modeling
	ROC Curve

Data Types

Generally speaking, the choice of statistical technique is determined by the type of data, so a basic understanding of data types is helpful for choosing statistical procedures. In SPSS, a column holds a variable and a row holds a case. There are two major types of data:

Qualitative variables: The data values are non-numeric categories.
Examples: Blood type, Gender.

Quantitative variables: The data values are counts or numerical measurements. A quantitative variable can be either discrete, such as the number of students receiving an 'A' in a class, or continuous, such as GPA or salary.

Another way of classifying data is by the measurement scales. In statistics, there are four generally used measurement scales:

Nominal data: data values are non-numeric group labels. For example, a Gender variable can be coded as male = 0 and female = 1; the numbers serve only as labels.

Ordinal data (which we sometimes call 'Discrete Data'): data values are categorical and may be ranked in some numerically meaningful way. For example, responses from strongly disagree to strongly agree may be coded as 1 to 5.

Continuous data:

	Interval data: data values range over a real interval, which can be as large as negative infinity to positive infinity. The difference between two values is meaningful, but the ratio of two values is not. Examples: temperature, IQ. A statement such as "today is 1.2 times hotter than yesterday" is neither useful nor meaningful.
	Ratio data: Both the difference and the ratio of two values are meaningful. Examples: salary, weight.
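
The difference between interval and ratio scales can be checked with a small calculation. The Python sketch below is only an illustration, not part of SPSS: the temperatures and weights are made-up values, and c_to_f is a helper introduced here. It shows that a temperature ratio changes when the unit changes, while a weight ratio does not.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Made-up temperatures for two consecutive days (interval scale).
yesterday_c, today_c = 20.0, 24.0

# Differences keep their meaning: they only pick up the scale factor 9/5.
diff_c = today_c - yesterday_c
diff_f = c_to_f(today_c) - c_to_f(yesterday_c)

# Ratios do not: the same two days give different "times hotter" figures.
ratio_c = today_c / yesterday_c
ratio_f = c_to_f(today_c) / c_to_f(yesterday_c)

# Made-up weights (ratio scale): the ratio is the same in any unit.
KG_TO_LB = 2.20462
ratio_kg = 90.0 / 60.0
ratio_lb = (90.0 * KG_TO_LB) / (60.0 * KG_TO_LB)

print("temperature ratio in C:", round(ratio_c, 3))
print("temperature ratio in F:", round(ratio_f, 3))
print("weight ratio in kg and lb:", round(ratio_kg, 3), round(ratio_lb, 3))
```

The "1.2 times hotter" figure exists only on the Celsius scale, which is why ratios of interval data should not be interpreted.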

NOTE: The statistical procedures mentioned below are demonstrated using movie clips in the Statistical Procedures Page.

Quality Control

Statistical quality control techniques are commonly used for monitoring process quality. A process typically has two major sources of variation: variation due to special causes and variation due to system causes. Control charts are commonly used for monitoring the variation due to special causes. Capability analysis is typically used to evaluate the performance of the existing system. Once the capability of a system has been assessed, one can design further investigations to study the factors (causes) that may produce the system variation. SPSS provides various tools for quality control, including:

	Capability Analysis for evaluating the performance of a quality characteristic measured on an interval scale. Typical capability indices include Cp, Cpu, Cpl, Cpk, CpM.
	Variable control charts: X-bar/R charts and X-bar/S charts for monitoring variable data (interval data). The X-bar chart monitors the average performance of the quality characteristic over time. The R- and S-charts monitor the variability of the quality characteristic over time. The assumption is that the quality characteristic follows a normal distribution. Caution should be taken when the normality assumption is badly violated.
	Variable control charts: Individuals and Moving Range charts for monitoring a variable (interval) measurement where each sample consists of only one individual unit. The individuals chart plots each measurement, and the moving range chart plots the range of at least two consecutive individual measurements.
	Attribute control charts: p- and np-charts for monitoring the proportion (p-chart) or number (np-chart) of defectives of the quality characteristic in each random sample over time. The typical assumption is that the number of defectives in a random sample of n items follows a binomial distribution.
	Attribute control charts: c- and u-charts for monitoring the number of defects in the sample (c-chart) or the mean number of defects per unit of the sample (u-chart). The typical assumption is that the number of defects in a sample follows a Poisson distribution.
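
To make the capability indices and the p-chart limits concrete, here is a minimal Python sketch using textbook formulas. It is an illustration under stated assumptions, not a reproduction of SPSS output: the function names, measurements, and specification limits are hypothetical, and the indices use the overall sample standard deviation, whereas software may estimate sigma from within-subgroup variation instead.

```python
import math
import statistics


def capability_indices(measurements, lsl, usl):
    """Return Cp, Cpu, Cpl and Cpk from the overall sample standard deviation."""
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)
    cpu = (usl - mean) / (3 * sd)  # room between the mean and the upper limit
    cpl = (mean - lsl) / (3 * sd)  # room between the mean and the lower limit
    return {
        "Cp": (usl - lsl) / (6 * sd),
        "Cpu": cpu,
        "Cpl": cpl,
        "Cpk": min(cpu, cpl),
    }


def p_chart_limits(defectives, sample_size):
    """Return (LCL, center line, UCL) for a p-chart with equal sample sizes."""
    p_bar = sum(defectives) / (len(defectives) * sample_size)
    half_width = 3 * math.sqrt(p_bar * (1 - p_bar) / sample_size)
    # Proportions cannot fall outside [0, 1], so the limits are clipped.
    return max(0.0, p_bar - half_width), p_bar, min(1.0, p_bar + half_width)


# Hypothetical measurements with specification limits 9.4 and 10.6.
idx = capability_indices([9.8, 10.1, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9], 9.4, 10.6)

# Hypothetical counts of defectives in five samples of 100 items each.
lcl, center, ucl = p_chart_limits([4, 6, 5, 3, 7], sample_size=100)

print({name: round(value, 3) for name, value in idx.items()})
print(round(lcl, 4), round(center, 4), round(ucl, 4))
```

Because the hypothetical process is centered between the specification limits, Cpk comes out essentially equal to Cp; an off-center process would give a smaller Cpk.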

Time Series

Two commonly used time series modeling techniques are included in SPSS:

	Exponential smoothing for modeling the time series using exponentially weighted averages of past observations.
	ARIMA model for modeling the time series using autoregressive and moving average techniques. Seasonal effects can be included. The ARIMA model also allows for covariates.
	SPSS also provides an Expert Modeler to assist users in choosing the 'best' ARIMA model for the time series.
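
The idea behind exponential smoothing can be shown in a few lines of Python. This is a sketch of simple (single) exponential smoothing only, not of the SPSS procedure: the function name and the sales figures are hypothetical, and initialising the level to the first observation is one common choice among several.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    Each level is alpha * (new observation) + (1 - alpha) * (previous level),
    so older observations receive exponentially decreasing weight.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    levels = [series[0]]  # assumption: start the level at the first observation
    for x in series[1:]:
        levels.append(alpha * x + (1 - alpha) * levels[-1])
    return levels


sales = [10.0, 12.0, 11.0, 13.0]  # hypothetical series
levels = simple_exponential_smoothing(sales, alpha=0.5)
forecast = levels[-1]  # the one-step-ahead forecast is the last level

print(levels)
print(forecast)
```

A larger alpha makes the forecast follow recent observations more closely; a smaller alpha gives a smoother, slower-reacting series.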

Survival Modeling

Survival modeling techniques are commonly used for modeling lifetime data or reliability data that may involve censored observations. SPSS provides four procedures for survival modeling:

	Life table: A life table is created by subdividing the study period into smaller time intervals and counting the number of cases that survive at least to each interval. The counts are used to estimate the overall probability of the event occurring at different time points, displayed in tabular form.
	Kaplan-Meier model: This is a nonparametric technique. It is also known as the product-limit method because it estimates a conditional survival probability at each time an event occurs and multiplies these conditional probabilities to estimate the survival rate at that time. It is often used for comparing the effects of treatments on survival time.
	Cox Regression model: This is a semiparametric modeling technique that takes covariates into account to build a predictive survival model. It is also known as the proportional hazards model because it assumes that the effect of each covariate on the hazard function is the same at all time points, so the hazards for different factor levels stay proportional over time.
	Cox regression with Time-dependent covariates: This extends the original Cox regression model by allowing covariates that are time-dependent.
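
The product-limit calculation behind the Kaplan-Meier method is short enough to write out. The Python sketch below is an illustration with hypothetical follow-up times, not the SPSS implementation; the function name and the 1/0 coding of events are assumptions made for this example.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  -- follow-up time for each case
    events -- 1 if the event was observed, 0 if the case was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every case that leaves the risk set at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Conditional probability of surviving past t, given survival to t.
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Hypothetical data: five cases, two of them censored (event = 0).
curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

Censored cases never lower the curve themselves; they only reduce the number at risk for later event times, which is how the method uses partially observed cases.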

To perform a survival analysis in SPSS, go to Analyze, scroll down to Survival Analysis, and select the procedure appropriate for your survival data.

ROC Curve

The ROC curve is useful for evaluating and comparing the performance of classification models where the response variable is binary (often labeled Positive and Negative). It is a two-dimensional curve with sensitivity on the Y-axis and (1-specificity) on the X-axis. The sensitivity and (1-specificity) measures are computed from a sequence of cut-off points applied to the model for classifying observations as Positive or Negative.

Before creating the ROC curve, users should already have built more than one predictive model, chosen the ROC curve for comparing the models' performance, and saved the predicted responses from the competing models.
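
The construction described above can be sketched in Python as follows. This is an illustration rather than the SPSS procedure: the function names, the labels, and the saved predictions of the two competing models (model_a and model_b) are all hypothetical, and the area under the curve is added here as one common way to summarise each curve in a single number.

```python
def roc_points(labels, scores):
    """Return [(1 - specificity, sensitivity)] over all cut-off points.

    labels -- 1 for Positive, 0 for Negative
    scores -- predicted probability (or score) of Positive from one model
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Use each distinct score as a cut-off, from strictest to most lenient.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]               # hypothetical observed responses
model_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]  # hypothetical saved predictions
model_b = [0.6, 0.4, 0.7, 0.5, 0.8, 0.1]

auc_a = auc(roc_points(labels, model_a))
auc_b = auc(roc_points(labels, model_b))
print(round(auc_a, 3), round(auc_b, 3))
```

A curve closer to the upper-left corner, and therefore a larger area, indicates a model that separates Positive from Negative cases better across the cut-off points.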

In this workshop, we attempt to cover most of the statistical procedures available in SPSS 16. The bottom line: when you have questions about your design and analysis, contact a statistical consultant for help.

Carl Lee, Felix Famoye

Statistical Modeling

	For nominal data, if the response is a binary variable (that is, it has only two possible values, such as graduating in four years or not), use a logistic regression model. If the response has more than two categories, use multinomial logistic regression.
	For ordinal (discrete) data, if the response is a count that follows a Poisson distribution, use a Poisson regression model. More generally, log-linear models can be used for ordinal data.
	In many applications, the relationship between the response variable and the predictors is not linear but can be linearized. Generalized linear modeling techniques are useful in these cases.
	Some applications involve a particular structure in the relationship between the response and predictor variables. Mixed models may be useful for some of these problems.
	Many medical and reliability data sets include values that are not completely observed by the end of the study (right-censored), or cases whose process was already under way before the study started (left-truncated). The analysis must account for this censoring or truncation. Survival modeling techniques are useful for these types of data.
	Most statistical techniques require certain assumptions. Typically, for a continuous response, the assumptions include normality of the response variable, homogeneity of variance, and linearity of the relationship between Y and the X's. Apply an appropriate data transformation as needed when building statistical models.

Nonparametric Methods

	Chi-square test
	Two Independent Samples Comparison: The parametric counterpart is the independent-samples t-test.
	K Independent Samples Comparison: The parametric counterpart is Analysis of Variance.
	Two Related Samples: The parametric counterpart is the paired t-test.
	K Related Samples: The parametric counterpart is Repeated Measures Analysis.
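
As an illustration of how a two-independent-samples nonparametric comparison works, the Python sketch below computes the Mann-Whitney U statistic, one commonly used test of this kind. The function name and data are hypothetical. U is obtained here by counting pairs, which is equivalent to the usual rank-sum formula; the sketch stops at the statistic and does not compute a p-value.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts the pairs (a, b) in which a exceeds b, with ties counted
    as one half. The test statistic is usually taken as min(U_a, U_b).
    """
    u_a = sum(1.0 if a > b else 0.5 if a == b else 0.0
              for a in sample_a
              for b in sample_b)
    return u_a, len(sample_a) * len(sample_b) - u_a


# Hypothetical scores from two independent groups.
group_a = [12, 15, 11, 18]
group_b = [14, 20, 22, 19]

u_a, u_b = mann_whitney_u(group_a, group_b)
print(u_a, u_b)
```

Only the ordering of the values enters the calculation, so the result does not depend on the normality assumption required by the t-test.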

Descriptive & Graphical

	For nominal data: Frequencies, Crosstabs, bar charts, and pie charts are common tools.
	For ordinal data: Frequencies, Crosstabs, descriptive statistics, bar charts, pie charts, and stem-and-leaf plots are common tools.
	For continuous data: Descriptive statistics, histograms, boxplots, and scatterplots for two variables are common tools.

Variable Relationships

	For nominal data, use crosstabs and choose tests appropriate for nominal data.
	For ordinal data, use crosstabs or a bivariate correlation such as the Spearman correlation coefficient.
	For continuous data, use a bivariate correlation such as the Pearson correlation.
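
The difference between the two correlation coefficients can be seen in a short Python sketch. The data and function names are hypothetical and the code is an illustration, not SPSS output. The Spearman coefficient is computed as the Pearson correlation of the ranks, with tied values given their average rank.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank from 1 to n; tied values share the average of their ranks."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: y increases with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because y always increases with x, the Spearman coefficient equals 1, while the Pearson coefficient is slightly lower because the relationship is not a straight line.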

SPSS On-Line Training Workshop