Basic statistics

Posted by Sathish Kumar K (PSGCAS)

Definitions

Statistics
Collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions.
Variable
Characteristic or attribute that can assume different values
Random Variable
A variable whose values are determined by chance.
Population
All subjects possessing a common characteristic that is being studied.
Census
The collection of data from every element in a population.
Sample
A subgroup or subset of the population.
Parameter
Characteristic or measure obtained from a population.
Statistic (not to be confused with Statistics)
Characteristic or measure obtained from a sample.
Descriptive Statistics
Collection, organization, summarization, and presentation of data.
Inferential Statistics
Generalizing from samples to populations using probabilities. Performing hypothesis testing, determining relationships between variables, and making predictions.
Qualitative Variables (Data)
Variables (data) which assume non-numerical values.
Quantitative Variables (Data)
Variables (data) which assume numerical values.
Discrete Variables (Data)
Variables (data) which assume a finite or countable number of possible values. Usually obtained by counting.
Continuous Variables (Data)
Variables (data) which can assume any value within an interval, so the number of possible values is infinite. Usually obtained by measurement.
Nominal Level
Level of measurement which classifies data into mutually exclusive, all inclusive categories in which no order or ranking can be imposed on the data.
Ordinal Level
Level of measurement which classifies data into categories that can be ranked. Precise differences between the ranks do not exist.
Interval Level
Level of measurement which classifies data that can be ranked and differences are meaningful. However, there is no meaningful zero, so ratios are meaningless.
Ratio Level
Level of measurement which classifies data that can be ranked, differences are meaningful, and there is a true zero. True ratios exist between the different units of measure.
Random Sampling
Sampling in which the data is collected using chance methods or random numbers.
Systematic Sampling
Sampling in which data is obtained by selecting every kth object.
Convenience Sampling
Sampling in which data that is readily available is used.
Stratified Sampling
Sampling in which the population is divided into groups (called strata) according to some characteristic. Each of these strata is then sampled using one of the other sampling techniques.
Cluster Sampling
Sampling in which the population is divided into groups (usually geographically). Some of these groups are randomly selected, and then all of the elements in those groups are selected.
Self-Selected Survey
Sampling in which the respondents themselves decide whether or not to be included.
Observational Study
A study in which the subjects are observed and studied, but no attempt is made to manipulate or modify the subjects.
Experiment
A study in which a treatment is applied, and then its effects on the subjects are studied.
Sampling Error
The difference between the sample result and the true population result that occurs because of chance variation.
Non-sampling Error
An error that occurs because sample data is incorrectly collected, recorded, or analyzed.

Introduction



Population vs Sample

The population includes all objects of interest whereas the sample is only a portion of the population. Parameters are associated with populations and statistics with samples. Parameters are usually denoted using Greek letters (mu, sigma) while statistics are usually denoted using Roman letters (x, s).

There are several reasons why we don't work with populations. They are usually large, and it is often impossible to get data for every object we're studying. Sampling does not usually occur without cost, and the more items surveyed, the larger the cost.

We compute statistics and use them to estimate parameters. The computation is the first part of the statistics course (Descriptive Statistics) and the estimation is the second part (Inferential Statistics).
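The relationship between a parameter and a statistic can be sketched in Python. This is only an illustration: the population values are made up, and the names `population`, `sample`, `mu`, and `x_bar` are assumptions introduced here.

```python
import random
import statistics

# Hypothetical population: 1,000 made-up heights in cm (illustrative data only).
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.gauss(170, 10) for _ in range(1000)]

# Parameter: computed from every element of the population.
mu = statistics.mean(population)

# Statistic: computed from a random sample of 50 elements.
sample = rng.sample(population, 50)
x_bar = statistics.mean(sample)

# The gap between the statistic and the parameter is the sampling error.
sampling_error = x_bar - mu
print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}, sampling error = {sampling_error:.2f}")
```

In practice mu is unknown because the whole population cannot be measured, which is why x_bar is used to estimate it.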

Discrete vs Continuous

Discrete variables are usually obtained by counting. There are a finite or countable number of choices available with discrete data. You can't have 2.63 people in the room.

Continuous variables are usually obtained by measuring. Length, weight, and time are all examples of continuous variables. Since continuous variables are real numbers, we usually round them. This implies a boundary depending on the number of decimal places. For example: a recorded value of 64 is really any value in the range 63.5 <= x < 64.5.
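The boundary idea can be checked with a few lines of Python. This sketch assumes values are rounded to the nearest whole unit with halves rounded up; `round_half_up` is a hypothetical helper name, not something from the original notes.

```python
import math

def round_half_up(x: float) -> int:
    """Round to the nearest integer, with .5 going up (hypothetical helper)."""
    return math.floor(x + 0.5)

# Every value with 63.5 <= x < 64.5 is recorded as 64.
for x in (63.49, 63.5, 63.9, 64.0, 64.49, 64.5):
    print(x, "->", round_half_up(x))
```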

Levels of Measurement

There are four levels of measurement: Nominal, Ordinal, Interval, and Ratio. These go from lowest level to highest level. Data is classified according to the highest level which it fits. Each additional level adds something the previous level didn't have.

  • Nominal is the lowest level. Only names are meaningful here.
  • Ordinal adds an order to the names.
  • Interval adds meaningful differences, but there is no starting point (0).
  • Ratio adds a zero so that ratios are meaningful.

Types of Sampling

There are five types of sampling: Random, Systematic, Convenience, Cluster, and Stratified.

  • Random sampling is analogous to putting everyone's name into a hat and drawing out several names. Each element in the population has an equal chance of being selected. While this is the preferred way of sampling, it is often difficult to do. It requires that a complete list of every element in the population be obtained. Computer generated lists are often used with random sampling. You can generate random numbers using the TI-82 calculator.
  • Systematic sampling is easier to do than random sampling. In systematic sampling, the list of elements is "counted off". That is, every kth element is taken. This is similar to lining everyone up and numbering off "1,2,3,4; 1,2,3,4; etc". When done numbering, all people numbered 4 would be used.
  • Convenience sampling is very easy to do, but it's probably the worst technique to use. In convenience sampling, readily available data is used, such as the first people the surveyor runs into.
  • Cluster sampling is accomplished by dividing the population into groups, usually geographically. These groups are called clusters or blocks. Some of the clusters are randomly selected, and every element in the selected clusters is used.
  • Stratified sampling also divides the population into groups called strata. However, this time it is by some characteristic, not geographically. For instance, the population might be separated into males and females. A sample is taken from each of these strata using either random, systematic, or convenience sampling.
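The five techniques can be sketched in Python. This is a rough illustration under stated assumptions: the population is just the numbers 1 to 100, the clusters are consecutive blocks of 20, the strata are odd and even numbers, and the systematic sample uses a random starting point (the notes above simply take everyone numbered k). All variable names are made up for the example.

```python
import random

rng = random.Random(0)              # fixed seed for repeatable output
population = list(range(1, 101))    # hypothetical elements numbered 1..100

# Random sampling: every element has the same chance of being selected.
random_sample = rng.sample(population, 10)

# Systematic sampling: take every kth element after a random start.
k = 10
start = rng.randrange(k)
systematic_sample = population[start::k]

# Convenience sampling: whatever is readily available, here the first 10.
convenience_sample = population[:10]

# Cluster sampling: pick some groups at random and keep all of their elements.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]  # 5 clusters of 20
chosen = rng.sample(clusters, 2)
cluster_sample = [x for cluster in chosen for x in cluster]

# Stratified sampling: split by a characteristic, then sample within every stratum.
strata = {
    "odd": [x for x in population if x % 2 == 1],
    "even": [x for x in population if x % 2 == 0],
}
stratified_sample = [x for group in strata.values() for x in rng.sample(group, 5)]

print(random_sample, systematic_sample, cluster_sample, stratified_sample, sep="\n")
```

Note the difference between the last two: cluster sampling uses all elements from a few groups, while stratified sampling uses a few elements from all groups.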

Generating Random Numbers

You can generate random numbers on the TI-82 calculator using the following sequence. N is the number of different values which could occur and S is the smallest of those values.

     int (N*rand+S)

INT is found under the MATH menu (math num 4). RAND is also found under the MATH menu (math prb 1).

Simulate the rolling of a die (1-6): int (6*rand+1)

Simulate the flipping of a coin (0-1): int (2*rand)

This works because the rand function returns a random number between 0 and 1 (including 0 but not including 1). When it is multiplied by N, the result is at least 0 and less than N. Then S is added, so the result is at least S and less than S+N. Taking the integer part leaves a whole number from S through S+N-1.
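For readers without the calculator, the same idea can be sketched in Python. The function name `rand_int` is an assumption made for this example; `random.random()` plays the role of rand, since it also returns a value that is at least 0 and less than 1.

```python
import math
import random

def rand_int(n: int, s: int) -> int:
    """Python analogue of the TI-82 expression int(N*rand+S) (illustrative only)."""
    # n * random.random() is at least 0 and less than n;
    # adding s and taking the floor gives a whole number from s through s+n-1.
    return math.floor(n * random.random() + s)

die = rand_int(6, 1)   # simulates a die: 1 through 6
coin = rand_int(2, 0)  # simulates a coin: 0 or 1
print(die, coin)
```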

If you have two values (A and B) that you need random numbers between, then you can generate them using the following formulas.

     N=B-A+1
     int (N*rand+A)

Notice it is B-A+1, not B-A. Everyone agrees there are 10 whole numbers from 1 to 10 (inclusive). But, if you take 10-1, you get 9, not 10. Also, in the formula above, replace the N by the actual number of different values.
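The B-A+1 count can be checked with a short Python sketch. The name `rand_between` is hypothetical; Python's standard library already offers `random.randint(a, b)`, which also includes both endpoints.

```python
import math
import random

def rand_between(a: int, b: int) -> int:
    """Whole number from a through b inclusive, following N = B-A+1 (illustrative only)."""
    n = b - a + 1  # number of different values, counting both endpoints
    return math.floor(n * random.random() + a)

# Draw many values from 1 through 10 and list the distinct results.
draws = [rand_between(1, 10) for _ in range(5000)]
print(sorted(set(draws)))
```

Using B-A instead of B-A+1 here would give only 9 different values, and 10 would never appear.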

Since the calculator remembers the last formula put in, and evaluates it when you hit enter, to generate more random numbers, just hit enter again. Each time you hit enter, you will get another random number.
