Sampling Design
How do we draw samples?
SCIENTIFIC/PROBABILITY SAMPLES
Simple Random Sample
This has the best properties.
RANDOM = each element of the population has an equal chance of inclusion in the sample.
1. Begin with a SAMPLING FRAME = a list of every element in the population.
2. Find a random number table or use Excel to generate random numbers. You need as many randomly generated numbers as elements in your sample (n).
3. Pick the first number (throw darts, close your eyes and point, play "musical number generating") and find that element.
4. Pick another number, and choose that element, until you have your full sample.
You can do this two ways: with and without replacement. WITH REPLACEMENT = after an item is chosen, it is possible to choose it again; you throw it back in the mix. WITHOUT REPLACEMENT = after you pick an element, it is impossible to choose it again; you have one fewer to choose from.
There are two desirable qualities associated with a simple random sample (SRS):
- EQUAL PROBABILITY = every element has an equal probability of inclusion (the real definition of random).
- INDEPENDENT SELECTION = every possible combination of elements has an equal probability of constituting a sample. If we want ten items in the sample, we are as likely to get items 1-10 as 2-11 as 1,3,5,7,9,11,13,15,17,19, etc. That means choosing one element first doesn't have any influence on what other elements get chosen. A violation of this would be a matched pair sample, where you choose a husband and wife together. Inclusion of the husband definitely affects choosing the wife.
Advantages of the SRS method of sampling:
- Assures good representativeness of the sample (particularly if large).
- Allows us to make generalizations/inferences. In fact, most of the statistical stuff we'll do later assumes that we've actually done a simple random sample, even if we haven't.
- Avoids biases that are possible in some of the other methods we'll talk about.
Disadvantages of SRS method:
- Have to have a list/sampling frame.
- Have to number the list.
- Both are hard to do when the population is large.
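The simple random sample procedure in this section (numbered frame, random numbers, with or without replacement) can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function name simple_random_sample and the 100-element frame are made up, and Python's standard random module stands in for the random number table.

```python
import random

def simple_random_sample(frame, n, replacement=False, rng=None):
    """Draw n elements from a sampling frame, each with equal probability.

    Hypothetical helper for illustration; not part of any library.
    """
    rng = rng or random.Random()
    if replacement:
        # WITH REPLACEMENT: a chosen element is thrown back in the mix,
        # so the same element can show up more than once.
        return rng.choices(frame, k=n)
    # WITHOUT REPLACEMENT: every pick leaves one fewer element to choose from,
    # so no element can be chosen twice.
    return rng.sample(frame, n)

frame = list(range(1, 101))   # assumed numbered sampling frame: elements 1..100
rng = random.Random(0)        # fixed seed only so the run can be repeated
without = simple_random_sample(frame, 10, rng=rng)
with_repl = simple_random_sample(frame, 10, replacement=True, rng=rng)
```

Note that the sample without replacement can never be larger than the frame, while the sample with replacement can.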
Systematic Sample/Skip Interval Sample
1. Begin with a numbered sampling frame again.
2. Choose your random number.
3. Choose your SAMPLING INTERVAL = number in population divided by number desired in sample, or N/n.
- Note: if you get a fraction, round up. If you round down, you might not get to the end of the list, and those elements at the end will not have any probability of inclusion. With rounding up, you will always get through the whole list.
4. Select the element that corresponds to the random number. Then, instead of picking a second random number, count out the interval (N/n) and choose that element. When you get to the end of the list, go back to the beginning until you have your full sample.
Advantages of Systematic Sampling method:
- Easier to do than SRS. You don't have to keep running back to the random number generator.
Disadvantages of Systematic Sampling:
- Still need a list/sampling frame that is numbered.
- Might run into a periodicity problem. If the list happened to be arranged in a repeating pattern by class (1,2,3,4…) and your interval lined up with that pattern, you might end up picking all first years. Have to make sure the list is not so structured.
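Here is a rough Python sketch of the skip interval procedure: one random start, an interval of N/n rounded up, and wrapping back to the beginning of the list. The function name systematic_sample is hypothetical. The rule for what to do when wrapping lands on an element that was already chosen (step forward to the next unchosen one) is an assumption, since the notes do not cover that case.

```python
import math
import random

def systematic_sample(frame, n, rng=None):
    """Skip interval sample of n elements from a numbered frame (illustrative)."""
    rng = rng or random.Random()
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("sample size must be between 1 and the frame size")
    k = math.ceil(N / n)          # sampling interval N/n, rounded up
    pos = rng.randrange(N)        # the single random starting position
    taken = set()
    picked = []
    while len(picked) < n:
        # Assumption: if wrapping around lands on an element that is already
        # in the sample, move forward to the next element not yet chosen.
        while pos in taken:
            pos = (pos + 1) % N
        taken.add(pos)
        picked.append(frame[pos])
        pos = (pos + k) % N       # count out the interval, wrapping at the end
    return picked

frame = list(range(1, 101))       # assumed frame: elements numbered 1..100
sample = systematic_sample(frame, 10, rng=random.Random(1))
small = systematic_sample(list(range(10)), 6, rng=random.Random(2))
```

With N = 100 and n = 10 the interval is exactly 10, so every chosen element sits 10 places after the previous one. With N = 10 and n = 6 the interval rounds up to 2, and the wrap-around rule kicks in.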
Stratified Sampling
1. Get a sampling frame.
2. Arrange it by desired trait. For instance, if we care about class at Wellesley, we might arrange the list by class to ensure all classes get represented.
3. Decide if you want proportional numbers of each group or if you want something else. If proportional, just do a systematic sample with the newly arranged list.
4. If you want an "oversample" of some group (say you want to know about the first year experience, so you want a disproportionate number of first years), then create your interval separately for each group (say, every third person for the first years, but every fifth for the rest).
Why might you want this? We will find out that no matter how big a population is, there are minimum sample sizes that allow for good inference. In a regular sample, you might not get enough of a subgroup (first years) to do good statistical inference, so you need to oversample.
Advantages of Stratified Sampling method:
- Increases chances that relevant traits will be represented in the sample.
- Allows for easy oversampling.
Disadvantages of Stratified Sampling:
- Once again, you need a good list.
- You have to know in advance two things: what trait you think is important and what the underlying distribution of that trait is in the population. For instance, if we wanted to talk about why some students are happy and some unhappy at Wellesley, and we decided to stratify our sample by happiness levels, we'd have to make some big assumptions about how many people at Wellesley fall into these categories.
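The oversampling idea in this section (a separate interval for each group) might look something like the following in Python. Treat it as a sketch: the function name stratified_sample, the made-up frame of 40 students, and the intervals (every third first year, every fifth of everyone else) are all assumptions chosen to mirror the example in the notes.

```python
import random
from collections import defaultdict

def stratified_sample(frame, trait, intervals, default_interval, rng=None):
    """Systematic sample within each stratum, one interval per stratum.

    `trait` maps an element to its stratum. `intervals` gives the interval for
    strata we want to oversample; other strata use `default_interval`.
    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    strata = defaultdict(list)
    for element in frame:                 # arrange the frame by the trait
        strata[trait(element)].append(element)
    sample = []
    for group, members in strata.items():
        k = intervals.get(group, default_interval)
        # Random start within the first interval of this stratum.
        start = rng.randrange(min(k, len(members)))
        sample.extend(members[start::k])  # then every k-th element
    return sample

# Assumed frame: 40 students as (id, class year), 10 per class year.
students = [(i, 1 + i // 10) for i in range(40)]
sample = stratified_sample(
    students,
    trait=lambda s: s[1],
    intervals={1: 3},         # every third first year (the oversample)
    default_interval=5,       # every fifth student in the other classes
    rng=random.Random(0),
)
first_years = [s for s in sample if s[1] == 1]
```

Using the same interval for every stratum gives the proportional version, since each group then contributes in proportion to its size.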
Cluster Sampling
This is the most commonly used scientific sampling method in the social sciences (opinion polling, etc.).
Use it when you don't have or need a sampling frame:
- a list doesn't exist,
- the list would be too hard to get,
- or if the population is directly identifiable without a list (for instance, four digit extensions at Wellesley).
You can do cluster sampling when the elements of the population naturally "cluster" into identifiable patterns, like neighborhoods, organizations, etc. The assumption here is that individuals within a cluster will be fairly homogeneous. You have to come up with your clusters carefully!
1. Take the whole population and divide it into a bunch of smaller clusters. Number the clusters.
2. Do a simple random or systematic sample of the clusters.
3. Divide the chosen clusters into smaller ones and number them.
4. Repeat step 2, and so on, until you get to the individual elements in your sample.
Advantages of Cluster Sampling method:
- Less costly.
- Don't need a list.
- At start everyone has an approximately equal chance of selection despite the number of steps involved.
Disadvantages of the Cluster Sample:
- More possibility of introducing error (drawing the boundaries, etc.).
- The possibility of error increases with the number of steps involved.
- Have to figure out a balance between number of stages and the number you want in your final sample. For instance, we could get a sample of 2000 Americans by picking 2000 clusters and one person from each, or we could pick 1000 each from 2 clusters. If the clusters aren't drawn well, the second method would be unrepresentative. But if the single person drawn from the first method was weird, it wouldn't matter how good the clusters were.
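A minimal Python sketch of cluster sampling, cut down to two stages: a simple random sample of clusters, then a simple random sample of individuals inside each chosen cluster. The function name two_stage_cluster_sample and the ten made-up neighborhoods are assumptions for illustration; a real design may have more stages.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng=None):
    """Pick clusters at random, then pick elements within each chosen cluster.

    `clusters` maps a cluster label to the list of its elements.
    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    # Stage 1: simple random sample of the (numbered/labeled) clusters.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        # Stage 2: simple random sample inside the chosen cluster.
        take = min(n_per_cluster, len(members))
        sample.extend(rng.sample(members, take))
    return chosen, sample

# Assumed population: 10 neighborhoods with 20 people each.
clusters = {
    f"neighborhood_{i}": [f"n{i}_person_{j}" for j in range(20)]
    for i in range(10)
}
chosen, sample = two_stage_cluster_sample(clusters, 3, 5, rng=random.Random(0))
```

The two arguments n_clusters and n_per_cluster are where the balance discussed in this section shows up: many clusters with few people each, or few clusters with many people each.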
NON-SCIENTIFIC/NON-PROBABILITY SAMPLES
Convenience Sample
These are the ones like "man on the street interviews," or whoever walks by. If you looked at folks' clothes in the science center, you did a convenience sample.
Advantages of convenience samples:
- Easy.
- Cheap.
- Some possibility of substantive inference, if you can justify it, but not statistical inference.
Ex: many psych. studies are done with college students as subjects. If the researcher can make the case that the college students are like other people in the relevant characteristics, then it's OK, but you can't use the concept of statistical inference that we'll get to later.
Disadvantages of ALL non-scientific samples:
- Can't do statistical inference.
Quota Samples
This is when you set beforehand the numbers of specific types of elements you want in the sample, like three M&Ms of every color, even though we know that is not reflective of the underlying population. Or 50% white males, or 50% defective parts when only 10% are really defective. This is like the stratified OVERsample, but it has even more "casualness" to it. You keep drawing until you get enough of the particular type and discard the ones you don't need.
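The "keep drawing until you get enough of each type and discard the rest" rule can be sketched in Python like this. The function name quota_sample and the stream of candy colors are made up for illustration, and the order of the stream is just whatever happens to come along, which is exactly why this is not a probability sample.

```python
from collections import Counter

def quota_sample(stream, category, quotas):
    """Keep elements until every quota is filled; discard the extras.

    `quotas` maps a category to the number wanted of that category.
    Hypothetical helper for illustration only.
    """
    counts = {c: 0 for c in quotas}
    sample = []
    for element in stream:
        c = category(element)
        if c in counts and counts[c] < quotas[c]:
            sample.append(element)   # still need this type: keep it
            counts[c] += 1
        # Otherwise the element is simply discarded.
        if counts == quotas:         # every quota is full: stop drawing
            break
    return sample

# Assumed stream of candies in the order they were drawn.
stream = ["red", "red", "green", "red", "red", "blue",
          "green", "blue", "green", "blue", "red"]
quotas = {"red": 3, "green": 3, "blue": 3}
sample = quota_sample(stream, category=lambda color: color, quotas=quotas)
tally = Counter(sample)
```

Here the fifth candy (a fourth red) is thrown away, and drawing stops as soon as the third blue fills the last quota.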
Judgmental Sample
Recruit subjects according to a specific criterion of interest. For instance, Kristin Luker wanted to talk about abortion activists, so she sought out people who were really involved in California politics over abortion. She didn't want to know about everyone's opinions on abortion, just about the activists. Or you might start with one person who fits the bill and ask for recommendations of other people like her. This is big in studies of political elites (ask a staffer to recommend some friends, etc.).
Self-selection samples
Call-in, costly, etc. polls. Enough said!
Sample Design
Sample design covers the method of selection, the sample structure and plans for analysing and interpreting the results. Sample designs can vary from simple to complex and depend on the type of information required and the way the sample is selected.
Sample design affects the size of the sample and the way in which analysis is carried out. In simple terms the more precision the market researcher requires, the more complex will be the design and the larger the sample size.
The sample design may make use of the characteristics of the overall market population, but it does not have to be proportionally representative. It may be necessary to draw a larger sample than would be expected from some parts of the population; for example, to select more from a minority grouping to ensure that sufficient data is obtained for analysis on such groups.
Many sample designs are built around the concept of random selection. This permits justifiable inference from the sample to the population, at quantified levels of precision. Random selection also helps guard against sample bias in a way that selecting by judgement or convenience cannot.
Sampling Scheme
A sampling scheme defines what data will be obtained and how
A sampling scheme is a detailed description of what data will be obtained and how this will be done. In PPC we are faced with two different situations for developing sampling schemes. The first is when we are conducting a controlled experiment. There are very efficient and exact methods for developing sampling schemes for designed experiments and the reader is referred to the Process Improvement chapter for details.
Passive data collection
The second situation is when we are conducting a passive data collection (PDC) study to learn about the inherent properties of a process. These types of studies are usually for comparison purposes when we wish to compare properties of processes against each other or against some hypothesis. This is the situation that we will focus on here.
There are two principles that guide our choice of sampling scheme
Once we have selected our response parameters, it would seem to be a rather straightforward exercise to take some measurements, calculate some statistics and draw conclusions. There are, however, many things which can go wrong along the way that can be avoided with careful planning and knowing what to watch for. There are two overriding principles that will guide the design of our sampling scheme.
The first is precision
The first principle is that of precision. If the sampling scheme is properly laid out, the difference between our estimate of some parameter of interest and its true value will be due only to random variation. The size of this random variation is measured by a quantity called standard error. The magnitude of the standard error is known as precision. The smaller the standard error, the more precise are our estimates.
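As a small illustration of the standard error mentioned above, the sketch below computes it for the most common case, the mean of a set of independent measurements, where the standard error is the sample standard deviation divided by the square root of the sample size. That choice of estimator is an assumption (the text does not name one), and the measurement values are made up.

```python
import math
import statistics

def standard_error_of_mean(values):
    """Standard error of the sample mean: s / sqrt(n).

    Assumes independent measurements; illustrative helper only.
    """
    return statistics.stdev(values) / math.sqrt(len(values))

# Made-up measurements of some response parameter.
small = [4.1, 3.9, 4.3, 4.0]
# The same readings repeated, standing in for a study four times as large.
large = small * 4

se_small = standard_error_of_mean(small)
se_large = standard_error_of_mean(large)
```

With similar scatter in the data, the larger set of measurements gives a smaller standard error, which is to say a more precise estimate.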
Precision of an estimate depends on several factors
The precision of any estimate will depend on several factors.
The second is systematic sampling error (or confounded effects)
The second principle is the avoidance of systematic errors. Systematic sampling error occurs when the levels of one explanatory variable are the same as some other unaccounted for explanatory variable. This is also referred to as confounded effects. Systematic sampling error is best seen by example.
Example 1: We want to compare the effect of two different coolants on the resulting surface finish from a turning operation. It is decided to run one lot, change the coolant and then run another lot. With this sampling scheme, there is no way to distinguish the coolant effect from the lot effect or from tool wear considerations. There is systematic sampling error in this sampling scheme.
Example 2: We wish to examine the effect of two pre-clean procedures on the uniformity of an oxide growth process. We clean one cassette of wafers with one method and another cassette with the other method. We load one cassette in the front of the furnace tube and the other cassette in the middle. To complete the run, we fill the rest of the tube with other lots. With this sampling scheme, there is no way to distinguish between the effect of the different pre-clean methods and the cassette effect or the tube location effect. Again, we have systematic sampling errors.