sampling design

Posted by Sathish Kumar K (PSGCAS)

Sampling Design

How do we draw samples?

SCIENTIFIC/PROBABILITY SAMPLES

Simple Random Sample
This has the best properties.

    RANDOM = each element of the population has an equal chance of inclusion in the sample.

    1. Begin with a SAMPLING FRAME = a list of every element in the population.

    2. Find a random number table or use Excel to generate random numbers. You need as many randomly generated numbers as elements in your sample (n).

    3. Pick the first number (throw darts, close your eyes and point, play "musical number generating") and find that element.

    4. Pick another number, and choose that element, until you have your full sample.

      You can do this two ways: with and without replacement. WITH REPLACEMENT = after an item is chosen, it is possible to choose it again, because you throw it back in the mix. WITHOUT REPLACEMENT = after you pick an element it is impossible to choose it again, so you have one fewer to choose from.
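The steps above can be sketched in a few lines of Python. This is only an illustration: the frame of 100 numbered students, the function name `simple_random_sample` and the fixed seed are assumptions made for the example, not part of the original notes.

```python
import random

def simple_random_sample(frame, n, replacement=False, seed=None):
    """Draw n elements from a sampling frame (a list of every population element)."""
    rng = random.Random(seed)
    if replacement:
        # With replacement: every draw is made from the full frame, so repeats are possible.
        return [rng.choice(frame) for _ in range(n)]
    # Without replacement: an element can be chosen at most once.
    return rng.sample(frame, n)

# Hypothetical numbered sampling frame, N = 100
frame = [f"student_{i}" for i in range(1, 101)]

without = simple_random_sample(frame, 10, replacement=False, seed=1)
with_repl = simple_random_sample(frame, 10, replacement=True, seed=1)
```

Here the computer plays the role of the random number table: `rng.sample` never repeats an element, while `rng.choice` can return the same element more than once.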

    There are two desirable qualities associated with SRS:

  • EQUAL PROBABILITY = every element has an equal probability of inclusion (the real definition of random).
  • INDEPENDENT SELECTION = every possible combination of elements has an equal probability of constituting a sample. If we want to have ten items in the sample it is as likely to have items 1-10 as 2-11 as 1,3,5,7,9,11,13,15,17,19, etc. That means choosing one element first doesn't have any influence on what other elements get chosen. A violation of this would be a matched pair sample, where you choose a husband and wife together. Inclusion of the husband definitely affects choosing the wife.

    Advantages of the SRS method of sampling:

    • Assures good representativeness of sample (particularly if large).
    • allows us to make generalizations/inferences. In fact, most of the statistical stuff we'll do later assumes that we've actually done a simple random sample, even if we haven't.
    • avoids biases that are possible in some of the other methods we'll talk about.

    Disadvantages of SRS method:

    • Have to have a list/sampling frame.
    • Have to number the list.
    • both are hard to do when the population is large.

Systematic Sample/Skip Interval Sample

    1. Begin with a numbered sampling frame again.
    2. Choose your random number.
    3. Choose your SAMPLING INTERVAL = number in population divided by number desired in sample, or N/n.
    4. Select the element that corresponds to the random number. Then instead of picking a second random number, etc., count out the interval (N/n) and choose that element. When you get to the end of the list go back to the beginning until you have your full sample.

    • Note, if you get a fraction, round up. If you round down, you might not get to the end of the list, and those elements at the end will not have any probability of inclusion. With rounding up, you will always get through the whole list.
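A minimal Python sketch of the skip-interval procedure, assuming a hypothetical frame of 100 numbered elements. The function name and the guard against a wrapped position being picked twice are additions for this example only.

```python
import math
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start, then every k-th element, wrapping at the end of the list."""
    N = len(frame)
    k = math.ceil(N / n)  # sampling interval N/n, rounded up if fractional
    if (n - 1) * k >= N:
        # Guard added for this sketch: wrapping would revisit a position.
        raise ValueError("sample too large relative to the frame for this interval")
    start = random.Random(seed).randrange(N)  # random starting position (0-based)
    return [frame[(start + i * k) % N] for i in range(n)]

# Hypothetical numbered frame, N = 100, desired sample n = 10, so the interval is 10
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, seed=3)
```

Only one random number is needed (the starting point); everything after that is counting.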

    Advantages of Systematic Sampling method:

    • Easier to do than SRS. You don't have to keep running back to the random number generator.

    Disadvantages of Systematic Sampling:

    • Still need a list/sampling frame that is numbered.
    • Might run into periodicity problem. If the list happened to be arranged by class (1,2,3,4…), you might end up picking all first years. Have to make sure the list is not so structured.

Stratified Sampling

    1. Get a sampling frame.
    2. Arrange it by desired trait. For instance, if we care about class at Wellesley, we might arrange the list by class to ensure all classes get represented.
    3. Decide if you want proportional numbers of each group or if you want something else. If proportional, just do a systematic sample with the newly arranged list.
    4. If you want an "oversample" of some group, say you want to know about the first year experience, so you want a disproportionate number of first years, then create your interval separately for each group (say, every third person for the first years, but every fifth for the rest).

    Why might you want this? We will find out that no matter how big a population is, there are minimum sample sizes that allow for good inference. In a regular sample, you might not get enough of a subgroup (first years) to do good statistical inference, so you need to oversample.
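The oversampling idea can be sketched as follows, assuming a hypothetical frame of 60 first years and 180 other students. The intervals of 3 and 5 come from the example above; the group sizes and names are invented for illustration.

```python
import random

def stratified_systematic(frame_by_group, intervals, seed=None):
    """Run a separate systematic sample inside each group, using that group's own interval."""
    rng = random.Random(seed)
    sample = {}
    for group, members in frame_by_group.items():
        k = intervals[group]
        start = rng.randrange(k)          # random start within the first interval
        sample[group] = members[start::k]  # every k-th member from the start
    return sample

# Hypothetical frame, already arranged by the trait of interest (class year)
frame_by_group = {
    "first_year": [f"fy_{i}" for i in range(60)],
    "other": [f"ot_{i}" for i in range(180)],
}
intervals = {"first_year": 3, "other": 5}  # every third first year, every fifth of the rest

sample = stratified_systematic(frame_by_group, intervals, seed=0)
```

With these made-up numbers, first years are a quarter of the frame but 20 of the 56 people sampled, which is the disproportionate share the oversample is meant to produce.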

    Advantages of Stratified Sampling method:

    • Increases chances that relevant traits will be represented in the sample.
    • Allows for easy oversampling.

    Disadvantages of Stratified Sampling:

    • Once again, you need a good list.
    • You have to know in advance two things: what trait you think is important and what the underlying distribution of that trait is in the population. For instance, if we wanted to talk about why some students are happy and some unhappy at Wellesley, and we decided to stratify our sample by happiness levels, we'd have to make some big assumptions about how many people at Wellesley fall into these categories.

Cluster Sampling

    This is the most commonly used scientific sampling method in the social sciences (opinion polling, for example).
    Use it when you don't have or need a sampling frame:

    • a list doesn't exist,
    • the list would be too hard to get,
    • or if the population is directly identifiable without a list (for instance, four digit extensions at Wellesley).

    You can do cluster sampling when the elements of the population naturally "cluster" into identifiable patterns, like neighborhoods, organizations, etc. The assumption here is that individuals within a cluster will be fairly homogenous. You have to come up with your clusters carefully!

    1. Take the whole population and divide it into a bunch of smaller clusters. Number the clusters.
    2. Do a simple random or systematic sample of the clusters.
    3. Divide the chosen clusters into smaller ones and number them.
    4. Repeat step 2 on the new clusters, and so on, until you get to the individual elements in your sample.
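The repeated "sample the clusters, then divide and sample again" procedure is naturally recursive. The sketch below assumes a hypothetical population of 4 cities, each with 3 blocks of 5 people; the nesting, the names and the number drawn at each stage are all invented for illustration.

```python
import random

def multistage_sample(clusters, picks_per_stage, rng):
    """Sample clusters, then sub-clusters, until individual elements are reached.

    clusters: nested dicts whose innermost values are lists of individual elements.
    picks_per_stage: how many units to draw at each stage, outermost stage first.
    """
    n = picks_per_stage[0]
    if isinstance(clusters, list):
        # Final stage: draw individual elements from the chosen cluster.
        return rng.sample(clusters, n)
    chosen = rng.sample(sorted(clusters), n)  # simple random sample of cluster names
    result = []
    for name in chosen:
        result.extend(multistage_sample(clusters[name], picks_per_stage[1:], rng))
    return result

# Hypothetical population: city -> block -> people
population = {
    f"city_{c}": {
        f"block_{c}_{b}": [f"person_{c}_{b}_{p}" for p in range(5)]
        for b in range(3)
    }
    for c in range(4)
}

# 2 cities, then 2 blocks in each, then 3 people in each block
sample = multistage_sample(population, [2, 2, 3], random.Random(7))
```

Note that no list of all 60 people is ever needed, only a list of clusters at each stage.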

    Advantages of Cluster Sampling method:

    • Less costly.
    • Don't need a list.
    • At start everyone has an approximately equal chance of selection despite the number of steps involved.

    Disadvantages of the Cluster Sample:

    • More possibility of introducing error (in drawing the boundaries, etc.), and the error increases with the number of steps involved.
    • Have to figure out a balance between number of stages and the number you want in your final sample. For instance, we could get a sample of 2000 Americans by picking 2000 clusters and one person from each, or we could pick 1000 each from 2 clusters. If the clusters aren't drawn well, the second method would be unrepresentative. But if the single person drawn from the first method was weird, it wouldn't matter how good the clusters were.

NON-SCIENTIFIC/NON-PROBABILITY SAMPLES

Convenience Sample

    These are the ones like "man on the street interviews," or whoever walks by. If you looked at folks' clothes in the science center, you did a convenience sample.

    Advantages of convenience samples:

    • easy
    • cheap
    • some possibility of substantive inference, if you can justify, but not statistical inference.
    • Ex: many psych. studies are done with college students as subjects. If the researcher can make the case that the college students are like other people in the relevant characteristics, then it's OK, but you can't use the concept of statistical inference that we'll get to later.

    Disadvantages of ALL non-scientific samples:

    • Can't do statistical inference.

Quota Samples

    When you set beforehand the numbers of specific types of elements you want in the sample, like three M&Ms of every color, even though we know that is not reflective of the underlying population. Or 50% white males, or 50% defective parts when only 10% are really defective. This is like the stratified OVERsample, but it has even more "casualness" to it. You keep drawing until you get enough of the particular type and discard the ones you don't need.
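The "keep drawing and discard what you don't need" rule can be sketched like this, using the defective-parts example. The stream of 2000 parts with roughly 10% defective and the quota of 20 per type are assumptions for illustration.

```python
import random

def quota_sample(stream, quotas, key):
    """Keep drawing from stream until every quota is filled; discard the extras."""
    kept = {category: [] for category in quotas}
    for item in stream:
        category = key(item)
        if category in kept and len(kept[category]) < quotas[category]:
            kept[category].append(item)
        # Items whose quota is already full are simply discarded.
        if all(len(kept[c]) == quotas[c] for c in quotas):
            break
    return kept

rng = random.Random(3)
# Hypothetical stream of parts, about 10% of which are defective
parts = [
    (f"part_{i}", "defective" if rng.random() < 0.10 else "good")
    for i in range(2000)
]

sample = quota_sample(parts, {"defective": 20, "good": 20}, key=lambda p: p[1])
```

The result is half defective by construction, whatever the true defect rate is, which is why no statistical inference about that rate can be drawn from it.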

Judgmental Sample

    Recruit subjects according to a specific criterion of interest. For instance, Kristin Luker wanted to talk about abortion activists, so she sought out people who were really involved in California politics over abortion. She didn't want to know about everyone's opinions on abortion, just about the activists. Or you might start with one person that fits the bill and ask for recommendations of other people like her. This is big in studies of political elites (ask a staffer to recommend some friends, etc.).

Self-selection samples

    Call-in, costly, etc. polls. Enough said!

Sample Design

Sample design covers the method of selection, the sample structure and plans for analysing and interpreting the results. Sample designs can vary from simple to complex and depend on the type of information required and the way the sample is selected.

Sample design affects the size of the sample and the way in which analysis is carried out. In simple terms the more precision the market researcher requires, the more complex will be the design and the larger the sample size.

The sample design may make use of the characteristics of the overall market population, but it does not have to be proportionally representative. It may be necessary to draw a larger sample than would be expected from some parts of the population; for example, to select more from a minority grouping to ensure that sufficient data is obtained for analysis on such groups.

Many sample designs are built around the concept of random selection. This permits justifiable inference from the sample to the population, at quantified levels of precision. Random selection also helps guard against sample bias in a way that selecting by judgement or convenience cannot.


Sampling Scheme

A sampling scheme is a detailed description of what data will be obtained and how this will be done. In PPC we are faced with two different situations for developing sampling schemes. The first is when we are conducting a controlled experiment. There are very efficient and exact methods for developing sampling schemes for designed experiments and the reader is referred to the Process Improvement chapter for details.

The second situation is when we are conducting a passive data collection (PDC) study to learn about the inherent properties of a process. These types of studies are usually for comparison purposes when we wish to compare properties of processes against each other or against some hypothesis. This is the situation that we will focus on here.

Once we have selected our response parameters, it would seem to be a rather straightforward exercise to take some measurements, calculate some statistics and draw conclusions. There are, however, many things which can go wrong along the way that can be avoided with careful planning and knowing what to watch for. There are two overriding principles that will guide the design of our sampling scheme.

The first principle is that of precision. If the sampling scheme is properly laid out, the difference between our estimate of some parameter of interest and its true value will be due only to random variation. The size of this random variation is measured by a quantity called standard error. The magnitude of the standard error is known as precision. The smaller the standard error, the more precise are our estimates.

The precision of any estimate will depend on:
  • the inherent variability of the process estimator
  • the measurement error
  • the number of independent replications (sample size)
  • the efficiency of the sampling scheme.
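
As a rough illustration of the sample-size factor in the list above, the sketch below estimates the standard error of a sample mean as s / sqrt(n) for two sample sizes. The simulated process (mean 50, standard deviation 4) and the sample sizes 25 and 400 are assumptions chosen only for the example.

```python
import math
import random
import statistics

def standard_error(values):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

rng = random.Random(42)
# Hypothetical process measurements
process = [rng.gauss(50.0, 4.0) for _ in range(10_000)]

se_small = standard_error(rng.sample(process, 25))    # few replications
se_large = standard_error(rng.sample(process, 400))   # many replications
```

With the process variability held fixed, the larger sample gives the smaller standard error, that is, the more precise estimate.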
The second principle is the avoidance of systematic errors. Systematic sampling error occurs when the levels of one explanatory variable are the same as some other unaccounted for explanatory variable. This is also referred to as confounded effects. Systematic sampling error is best seen by example.

Example 1: We want to compare the effect of two different coolants on the resulting surface finish from a turning operation. It is decided to run one lot, change the coolant and then run another lot. With this sampling scheme, there is no way to distinguish the coolant effect from the lot effect or from tool wear considerations. There is systematic sampling error in this sampling scheme.
Example 2: We wish to examine the effect of two pre-clean procedures on the uniformity of an oxide growth process. We clean one cassette of wafers with one method and another cassette with the other method. We load one cassette in the front of the furnace tube and the other cassette in the middle. To complete the run, we fill the rest of the tube with other lots. With this sampling scheme, there is no way to distinguish between the effect of the different pre-clean methods and the cassette effect or the tube location effect. Again, we have systematic sampling errors.

Selecting the Most Appropriate Sampling Strategy

There are four primary sampling strategies:

  • Random sampling
  • Stratified random sampling
  • Systematic sampling
  • Rational sub-grouping

Before determining which strategy will work best, the analyst must determine what type of study is being conducted. There are normally two types of studies: population and process. With a population study, the analyst is interested in estimating or describing some characteristic of the population (inferential statistics).

With a process study, the analyst is interested in predicting a process characteristic or change over time. It is important to make the distinction for proper selection of a sampling strategy. The “I Love Lucy” television show's “Candy Factory” episode can be used to illustrate the difference. For example, a population study, using samples, would seek to determine the average weight of the entire daily run of candies. A process study would seek to know whether the weight was changing over the day.

Random Sampling

Random samples are used in population sampling situations when reviewing historical or batch data. The key to random sampling is that each unit in the population has an equal probability of being selected in the sample. Using random sampling protects against bias being introduced in the sampling process, and hence, it helps in obtaining a representative sample.

In general, random samples are taken by assigning a number to each unit in the population and using a random number table or Minitab to generate the sample list. Absent knowledge about the factors for stratification for a population, a random sample is a useful first step in obtaining samples.

For example, an improvement team in a human resources department wanted an accurate estimate of what proportion of employees had completed a personal development plan and reviewed it with their managers. The team used its database to obtain a list of all associates. Each associate on the list was assigned a number. Statistical software was used to generate a list of numbers to be sampled, and an estimate was made from the sample.

Stratified Random Sampling

Like random samples, stratified random samples are used in population sampling situations when reviewing historical or batch data. Stratified random sampling is used when the population has different groups (strata) and the analyst needs to ensure that those groups are fairly represented in the sample. In stratified random sampling, independent samples are drawn from each group. The size of each sample is proportional to the relative size of the group.

For example, the manager of a lending business wanted to estimate the average cycle time for a loan application process. She knows there are three types (strata) of loans (large, medium and small). Therefore, she wanted the sample to have the same proportion of large, medium and small loans as the population. She first separated the loan population data into three groups and then pulled a random sample from each group.
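
The loan example can be sketched with proportional allocation: each stratum contributes a share of the sample equal to its share of the population. The counts of 100 large, 300 medium and 600 small loans and the total sample of 50 are hypothetical.

```python
import random

def proportional_stratified_sample(strata, n, seed=None):
    """Draw an independent random sample from each stratum, sized by its population share."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        # Rounding can make the overall total differ slightly from n in general.
        size = round(n * len(units) / total)
        sample[name] = rng.sample(units, size)
    return sample

# Hypothetical loan population separated into three strata
loans = {
    "large": [f"L{i}" for i in range(100)],
    "medium": [f"M{i}" for i in range(300)],
    "small": [f"S{i}" for i in range(600)],
}

sample = proportional_stratified_sample(loans, 50, seed=5)
```

With these numbers the sample holds 5 large, 15 medium and 30 small loans, the same 10/30/60 split as the population.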

Systematic Sampling

Systematic sampling is typically used in process sampling situations when data is collected in real time during process operation. Unlike population sampling, a frequency for sampling must be selected. It also can be used for a population study if care is taken that the frequency is not biased.

Systematic sampling involves taking samples according to some systematic rule - e.g., every fourth unit, the first five units every hour, etc. One danger of using systematic sampling is that the systematic rule may match some underlying structure and bias the sample.

For example, the manager of a billing center is using systematic sampling to monitor processing rates. At random times around each hour, five consecutive bills are selected and the processing time is measured.

Rational Sub-Grouping

Rational sub-grouping is the process of putting measurements into meaningful groups to better understand the important sources of variation. Rational sub-grouping is typically used in process sampling situations when data is collected in real time during process operations. It involves grouping measurements produced under similar conditions, so that the variation within a subgroup (sometimes called short-term variation) is small. This makes it easier to understand the sources of variation between subgroups, sometimes called long-term variation.

The goal should be to minimize the chance of special causes of variation within a subgroup and maximize the chance of special causes between subgroups. Sub-grouping over time is the most common approach, but sub-grouping can also be done by other suspected sources of variation (e.g., location, customer, supplier).

For example, an equipment leasing business was trying to improve equipment turnaround time. They selected five samples per day from each of three processing centers. Each processing center was formed into a subgroup.

When using sub-grouping, form subgroups with items produced under similar conditions. To ensure items in a subgroup were produced under similar conditions, select items produced close together in time.
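
To make the within-subgroup and between-subgroup distinction concrete, here is a small sketch loosely based on the leasing example. The turnaround times are invented, with five measurements per processing center.

```python
import statistics

# Hypothetical turnaround times (days), five samples per processing center
subgroups = {
    "center_A": [4.1, 3.9, 4.3, 4.0, 4.2],
    "center_B": [5.0, 5.2, 4.9, 5.1, 5.3],
    "center_C": [3.5, 3.6, 3.4, 3.7, 3.3],
}

subgroup_means = {name: statistics.mean(x) for name, x in subgroups.items()}

# Within-subgroup (short-term) variation: average of the subgroup variances
within_var = statistics.mean([statistics.variance(x) for x in subgroups.values()])

# Between-subgroup (long-term) variation: variance of the subgroup means
between_var = statistics.variance(list(subgroup_means.values()))
```

In this made-up data the between-subgroup variance is much larger than the within-subgroup variance, which would point to the processing center itself as an important source of variation.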


Basic statistics

Definitions

Statistics
Collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions.
Variable
Characteristic or attribute that can assume different values
Random Variable
A variable whose values are determined by chance.
Population
All subjects possessing a common characteristic that is being studied.
Census
The collection of data from every element in a population.
Sample
A subgroup or subset of the population.
Parameter
Characteristic or measure obtained from a population.
Statistic (not to be confused with Statistics)
Characteristic or measure obtained from a sample.
Descriptive Statistics
Collection, organization, summarization, and presentation of data.
Inferential Statistics
Generalizing from samples to populations using probabilities. Performing hypothesis testing, determining relationships between variables, and making predictions.
Qualitative Variables (Data)
Variables (data) which assume non-numerical values.
Quantitative Variables (Data)
Variables (data) which assume numerical values.
Discrete Variables (Data)
Variables (data) which assume a finite or countable number of possible values. Usually obtained by counting.
Continuous Variables (Data)
Variables (data) which can assume any value within an interval, so the number of possible values is uncountably infinite. Usually obtained by measurement.
Nominal Level
Level of measurement which classifies data into mutually exclusive, all inclusive categories in which no order or ranking can be imposed on the data.
Ordinal Level
Level of measurement which classifies data into categories that can be ranked. Precise differences between the ranks do not exist.
Interval Level
Level of measurement which classifies data that can be ranked and differences are meaningful. However, there is no meaningful zero, so ratios are meaningless.
Ratio Level
Level of measurement which classifies data that can be ranked, differences are meaningful, and there is a true zero. True ratios exist between the different units of measure.
Random Sampling
Sampling in which the data is collected using chance methods or random numbers.
Systematic Sampling
Sampling in which data is obtained by selecting every kth object.
Convenience Sampling
Sampling in which data that is readily available is used.
Stratified Sampling
Sampling in which the population is divided into groups (called strata) according to some characteristic. Each of these strata is then sampled using one of the other sampling techniques.
Cluster Sampling
Sampling in which the population is divided into groups (usually geographically). Some of these groups are randomly selected, and then all of the elements in those groups are selected.
Self-Selected Survey
Sampling in which the respondents themselves decide whether or not to be included.
Observational Study
A study in which the subjects are observed and studied, but no attempt is made to manipulate or modify the subjects.
Experiment
A study in which a treatment is applied, and then its effects on the subjects are studied.
Sampling Error
The difference between the sample result and the true population result that occurs because of chance variation.
Non-sampling Error
An error that occurs because sample data is incorrectly collected, recorded, or analyzed.

Introduction



Population vs Sample

The population includes all objects of interest whereas the sample is only a portion of the population. Parameters are associated with populations and statistics with samples. Parameters are usually denoted using Greek letters (mu, sigma) while statistics are usually denoted using Roman letters (x, s).

There are several reasons why we don't work with populations. They are usually large, and it is often impossible to get data for every object we're studying. Sampling does not usually occur without cost, and the more items surveyed, the larger the cost.

We compute statistics, and use them to estimate parameters. The computation is the first part of the statistics course (Descriptive Statistics) and the estimation is the second part (Inferential Statistics)
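
A small sketch of the parameter/statistic distinction, assuming a simulated population of 50,000 heights. In practice the population values would not be available, which is the whole reason for sampling; they are computed here only so the estimates can be compared with them.

```python
import random
import statistics

rng = random.Random(11)
# Hypothetical population (heights in cm)
population = [rng.gauss(170.0, 8.0) for _ in range(50_000)]

mu = statistics.mean(population)        # parameter: population mean
sigma = statistics.pstdev(population)   # parameter: population standard deviation

sample = rng.sample(population, 100)
x_bar = statistics.mean(sample)         # statistic: sample mean, estimates mu
s = statistics.stdev(sample)            # statistic: sample standard deviation, estimates sigma
```

Note the two different functions: `pstdev` divides by N for a whole population, while `stdev` divides by n - 1 for a sample.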

Discrete vs Continuous

Discrete variables are usually obtained by counting. There are a finite or countable number of choices available with discrete data. You can't have 2.63 people in the room.

Continuous variables are usually obtained by measuring. Length, weight, and time are all examples of continuous variables. Since continuous variables are real numbers, we usually round them. This implies a boundary depending on the number of decimal places. For example, 64 is really anything in the range 63.5 <= x < 64.5.

Levels of Measurement

There are four levels of measurement: Nominal, Ordinal, Interval, and Ratio. These go from lowest level to highest level. Data is classified according to the highest level which it fits. Each additional level adds something the previous level didn't have.

  • Nominal is the lowest level. Only names are meaningful here.
  • Ordinal adds an order to the names.
  • Interval adds meaningful differences, but there is no starting point (0).
  • Ratio adds a zero so that ratios are meaningful.

Types of Sampling

There are five types of sampling: Random, Systematic, Convenience, Cluster, and Stratified.

  • Random sampling is analogous to putting everyone's name into a hat and drawing out several names. Each element in the population has an equal chance of being selected. While this is the preferred way of sampling, it is often difficult to do. It requires that a complete list of every element in the population be obtained. Computer generated lists are often used with random sampling. You can generate random numbers using the TI-82 calculator.
  • Systematic sampling is easier to do than random sampling. In systematic sampling, the list of elements is "counted off". That is, every kth element is taken. This is similar to lining everyone up and numbering off "1,2,3,4; 1,2,3,4; etc". When done numbering, all people numbered 4 would be used.
  • Convenience sampling is very easy to do, but it's probably the worst technique to use. In convenience sampling, readily available data is used. That is, the first people the surveyor runs into.
  • Cluster sampling is accomplished by dividing the population into groups -- usually geographically. These groups are called clusters or blocks. The clusters are randomly selected, and each element in the selected clusters is used.
  • Stratified sampling also divides the population into groups called strata. However, this time it is by some characteristic, not geographically. For instance, the population might be separated into males and females. A sample is taken from each of these strata using either random, systematic, or convenience sampling.

Generating Random Numbers

You can generate random numbers on the TI-82 calculator using the following sequence. N is the number of different values that could be generated and S is the smallest of those values.

     int (N*rand+S)

INT is found under the MATH menu (math num 4). RAND is also found under the MATH menu (math prb 1).

Simulate the rolling of a die (1-6): int (6*rand+1)

Simulate the flipping of a coin (0-1): int (2*rand)

This works because the rand function returns a random number between 0 and 1 (including 0 but not including 1). When it is multiplied by N, it becomes between 0 and N, and then S is added, so it becomes between S and S+N (including S but not S+N). Taking the integer part then leaves one of the N whole numbers S, S+1, ..., S+N-1.

If you have two values (A and B) that you need random numbers between, then you can generate them using the following formulas.

     N=B-A+1
     int (N*rand+A)

Notice it is B-A+1 not B-A. Everyone agrees there are 10 numbers between 1 and 10 (inclusive). But, if you take 10-1, you get 9, not 10. Also, in the formula above, replace the N by the actual number of different values.

Since the calculator remembers the last formula put in, and evaluates it when you hit enter, to generate more random numbers, just hit enter again. Each time you hit enter, you will get another random number.
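
For readers without the calculator, the same formula can be mirrored in Python. This is a sketch: the function names are invented here, and `math.floor` stands in for the calculator's int.

```python
import math
import random

def random_int(n_values, start, rng=random):
    """Mirror of the calculator formula int(N*rand+S)."""
    # rng.random() returns a float in [0, 1), like rand on the calculator.
    return math.floor(n_values * rng.random() + start)

def random_between(a, b, rng=random):
    """Random whole number from a to b inclusive, using N = B - A + 1."""
    return random_int(b - a + 1, a, rng)

rng = random.Random(0)
die_rolls = [random_int(6, 1, rng) for _ in range(1000)]   # values 1 to 6
coin_flips = [random_int(2, 0, rng) for _ in range(1000)]  # values 0 or 1
```

In everyday Python code, `random.randint(a, b)` does the same job as `random_between` directly; the longer version is shown only because it follows the calculator formula step by step.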

Definition of statistics


Statistics is the formal science of making effective use of numerical data relating to groups of individuals or experiments. It deals with all aspects of this, including not only the collection, analysis and interpretation of such data, but also the planning of the collection of data, in terms of the design of surveys and experiments.

Definition of Statistics

Statistics, like many other sciences, is a developing discipline. It is not static. It has gradually developed during the last few centuries. At different times it has been defined in different ways. Some definitions of the past look very strange today, but those definitions had their place in their own time. Defining a subject has always been a difficult task. A good definition of today may be discarded in the future. It is difficult to define statistics.
(1) The kings and rulers of ancient times were interested in their manpower. They conducted censuses of the population to get information about it, and used that information to calculate their strength and ability for wars. In those days statistics was defined as

“the science of kings, political and science of statecraft”
(2) A.L. Bowley defined statistics as

“statistics is the science of counting”
This definition places the entire stress on counting. The common man also thinks of statistics as nothing but counting. This used to be the situation, but a very long time ago. Statistics today is not the mere counting of people, animals, trees and fighting forces. It has now grown into a rich set of methods for data analysis and interpretation.
“statistics are the numerical statement of facts capable of analysis and interpretation and the science of statistics is the study of the principles and the methods applied in collecting, presenting, analysis and interpreting the numerical data in any field of inquiry.”

HI MY FRIENDS

I am Sathish Kumar K. I finished my PG in Statistics at PSGCAS in 2010. Statistics is an application-oriented subject. It is applicable in every field where data is available. Only a statistician can take a correct decision in industry or any other field. I am going to work at the National Cancer Registry Programme (ICMR), Bangalore. So friends, join me and we will discuss statistics and develop our knowledge.

Keep in touch with me...