Statistical analysis: sampling design

Posted by Sathish Kumar K (PSGCAS)

Sampling Design

How do we draw samples?

SCIENTIFIC/PROBABILITY SAMPLES

Simple Random Sample
This has the best properties.

RANDOM = each element of the population has an equal chance of inclusion in the sample.

1. Begin with a SAMPLING FRAME = a list of every element in the population.

2. Find a Random Number Table or use Excel to generate random numbers. You need as many randomly generated numbers as there are elements in your sample (n).

3. Pick the first number (throw darts, close your eyes and point, play "musical number generating") and find that element.

4. Pick another number, and choose that element, until you have your full sample.

You can do this in two ways: with or without replacement. WITH REPLACEMENT = after an item is chosen, you throw it back into the mix, so it is possible to choose it again. WITHOUT REPLACEMENT = after you pick an element, it is impossible to choose it again, so you have one fewer to choose from.
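
The four steps above, and the with/without replacement choice, can be sketched in a few lines of Python. This is only an illustration: the frame of 100 numbered students, the function name simple_random_sample and the seed are assumptions made for the example, not part of the original notes.

```python
import random

def simple_random_sample(frame, n, replacement=False, seed=None):
    """Draw n elements from a sampling frame, each with equal probability."""
    rng = random.Random(seed)
    if replacement:
        # With replacement: every draw is made from the full frame,
        # so the same element can turn up more than once.
        return [rng.choice(frame) for _ in range(n)]
    # Without replacement: a chosen element cannot be chosen again.
    return rng.sample(frame, n)

# Hypothetical sampling frame: a numbered list of 100 students.
frame = [f"student_{i}" for i in range(1, 101)]

srs_without = simple_random_sample(frame, 10, replacement=False, seed=1)
srs_with = simple_random_sample(frame, 10, replacement=True, seed=1)
```

The sample drawn without replacement always contains ten distinct students; the one drawn with replacement may contain repeats.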

There are two desirable qualities associated with SRS:

EQUAL PROBABILITY = every element has an equal probability of inclusion (the real definition of random).
INDEPENDENT SELECTION = every possible combination of elements has an equal probability of constituting a sample. If we want to have ten items in the sample it is as likely to have items 1-10 as 2-11 as 1,3,5,7,9,11,13,15,17,19, etc. That means choosing one element first doesn't have any influence on what other elements get chosen. A violation of this would be a matched pair sample, where you choose a husband and wife together. Inclusion of the husband definitely affects choosing the wife.

Advantages of the SRS method of sampling:

Assures good representativeness of sample (particularly if large).
Allows us to make generalizations/inferences. In fact, most of the statistical stuff we'll do later assumes that we've actually done a simple random sample, even if we haven't.
Avoids biases that are possible in some of the other methods we'll talk about.

Disadvantages of SRS method:

Have to have a list/sampling frame.
Have to number the list.
Both are hard to do when the population is large.

Systematic Sample/Skip Interval Sample

1. Begin with a numbered sampling frame again.
2. Choose your random number.
3. Choose your SAMPLING INTERVAL = number in population divided by number desired in sample, or N/n.
4. Select the element that corresponds to the random number. Then instead of picking a second random number, etc., count out the interval (N/n) and choose that element. When you get to the end of the list go back to the beginning until you have your full sample.

Note, if you get a fraction, round up. If you round down, you might not get to the end of the list, and those elements at the end will not have any probability of inclusion. With rounding up, you will always get through the whole list.
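
A minimal sketch of the skip-interval procedure, assuming a numbered frame held in a Python list. The function name systematic_sample, the frame of 105 elements and the guard against very large samples are assumptions of this sketch, not part of the original notes.

```python
import math
import random

def systematic_sample(frame, n, seed=None):
    """Systematic (skip-interval) sample with a random start and wrap-around."""
    rng = random.Random(seed)
    N = len(frame)
    interval = math.ceil(N / n)   # sampling interval N/n, rounded up
    if (n - 1) * interval >= N:
        # Assumption of this sketch: n is small enough relative to N that
        # wrapping around the list never lands on an element twice.
        raise ValueError("sample too large for this simple wrap-around rule")
    start = rng.randrange(N)      # random starting position (0-based index)
    # Count out the interval from the start, wrapping back to the beginning
    # of the list when the end is reached.
    return [frame[(start + i * interval) % N] for i in range(n)]

# Hypothetical numbered frame of 105 elements; we want a sample of 10.
frame = list(range(1, 106))
sample = systematic_sample(frame, 10, seed=0)
```

Here N/n = 10.5, so the interval is rounded up to 11 and only one random number (the start) is needed.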

Advantages of Systematic Sampling method:

Easier to do than SRS. You don't have to keep running back to the random number generator.

Disadvantages of Systematic Sampling:

Still need a list/sampling frame that is numbered.
Might run into a periodicity problem. If the list happened to be arranged in a repeating pattern by class (1, 2, 3, 4, 1, 2, 3, 4, …) and the interval matched that pattern, you might end up picking all first years. You have to make sure the list is not structured that way.
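
The periodicity problem is easy to reproduce. In this small sketch the frame is a made-up list of class years that repeats every four entries; the interval values and the starting position are chosen only for illustration.

```python
# Hypothetical frame whose order repeats by class year: 1, 2, 3, 4, 1, 2, 3, 4, ...
frame = [(i % 4) + 1 for i in range(40)]

start = 0  # suppose the random start lands on a first year

# An interval of 4 matches the period of the list exactly.
picked = frame[start::4]

# An interval of 5 does not match the period, so the classes get mixed.
mixed = frame[start::5]

print(set(picked))  # only one class year appears
print(set(mixed))   # all four class years appear
```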

Stratified Sampling

1. Get a sampling frame.
2. Arrange it by desired trait. For instance, if we care about class at Wellesley, we might arrange the list by class to ensure all classes get represented.
3. Decide if you want proportional numbers of each group or if you want something else. If proportional, just do a systematic sample with the newly arranged list.
4. If you want an "oversample" of some group (say you want to know about the first-year experience, so you want a disproportionate number of first years), then create your interval separately for each group (say, every third person for the first years, but every fifth for the rest).

Why might you want this? We will find out that no matter how big a population is, there are minimum sample sizes that allow for good inference. In a regular sample, you might not get enough of a subgroup (first years) to do good statistical inference, so you need to oversample.
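
As a rough sketch of the oversampling idea, the code below groups a frame by class year and applies a separate skip interval to each group. The frame of 480 students, the function name stratified_sample and the seed are hypothetical; the intervals (every third first year, every fifth of everyone else) follow the example in step 4.

```python
import random
from collections import defaultdict

def stratified_sample(frame, trait, intervals, seed=None):
    """Systematic sample within each stratum, with one skip interval per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for element in frame:
        strata[trait(element)].append(element)   # arrange the frame by trait
    sample = []
    for group, members in strata.items():
        k = intervals[group]
        start = rng.randrange(k)                  # random start inside the first interval
        sample.extend(members[start::k])          # every k-th member of this stratum
    return sample

# Hypothetical frame: 120 students in each of four class years.
frame = [(f"student_{year}_{i}", year) for year in (1, 2, 3, 4) for i in range(120)]

# Oversample first years: every third first year, every fifth of the rest.
intervals = {1: 3, 2: 5, 3: 5, 4: 5}
sample = stratified_sample(frame, trait=lambda s: s[1], intervals=intervals, seed=42)
counts = {year: sum(1 for _, y in sample if y == year) for year in (1, 2, 3, 4)}
```

With these made-up numbers the sample holds 40 first years and 24 students from each of the other classes, so the first years are deliberately overrepresented.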

Advantages of Stratified Sampling method:

Increases chances that relevant traits will be represented in the sample.
Allows for easy oversampling.

Disadvantages of Stratified Sampling:

Once again, you need a good list.
You have to know in advance two things: what trait you think is important and what the underlying distribution of that trait is in the population. For instance, if we wanted to talk about why some students are happy and some unhappy at Wellesley, and we decided to stratify our sample by happiness levels, we'd have to make some big assumptions about how many people at Wellesley fall into these categories.

Cluster Sampling

This is the most commonly used scientific sampling method in the social sciences, like opinion polling, etc.
Use it when you don't have or need a sampling frame:

a list doesn't exist,
the list would be too hard to get,
or if the population is directly identifiable without a list (for instance, four digit extensions at Wellesley).

You can do cluster sampling when the elements of the population naturally "cluster" into identifiable patterns, like neighborhoods, organizations, etc. The assumption here is that individuals within a cluster will be fairly homogeneous. You have to come up with your clusters carefully!

1. Take the whole population and divide it into a bunch of smaller clusters. Number the clusters.
2. Do a simple random or systematic sample of the clusters.
3. Divide the chosen clusters into smaller ones and number them.
4. Repeat step 2, and so on, until you get to the individual elements in your sample.
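
The staged procedure above can be sketched with just two stages: sample some clusters, then sample elements inside each chosen cluster. The neighborhoods, households, function name and sample sizes below are invented for the illustration; a real design could add further stages in the same way.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: simple random sample of clusters. Stage 2: SRS inside each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # sample the cluster names
    sample = []
    for name in chosen:
        members = clusters[name]
        # Only the chosen clusters need to be listed out element by element.
        sample.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return chosen, sample

# Hypothetical population: 20 neighborhoods of 50 households each.
clusters = {
    f"neighborhood_{c}": [f"household_{c}_{h}" for h in range(50)]
    for c in range(20)
}

chosen, sample = two_stage_cluster_sample(clusters, n_clusters=4, n_per_cluster=10, seed=7)
```

Note that no list of all 1,000 households is needed up front, only a list of the neighborhoods and then of the households in the four neighborhoods that were picked.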

Advantages of Cluster Sampling method:

Less costly.
Don't need a list.
At the start, everyone has an approximately equal chance of selection, despite the number of steps involved.

Disadvantages of the Cluster Sample:

More possibility of introducing error (in drawing the boundaries, etc.), and this possibility increases with the number of steps involved.
Have to figure out a balance between number of stages and the number you want in your final sample. For instance, we could get a sample of 2000 Americans by picking 2000 clusters and one person from each, or we could pick 1000 each from 2 clusters. If the clusters aren't drawn well, the second method would be unrepresentative. But if the single person drawn from the first method was weird, it wouldn't matter how good the clusters were.

NON-SCIENTIFIC/NON-PROBABILITY SAMPLES

Convenience Sample

These are the ones like "man on the street interviews," or whoever walks by. If you looked at folks' clothes in the science center, you did a convenience sample.

Advantages of convenience samples:

easy
cheap
some possibility of substantive inference, if you can justify it, but not statistical inference.

Ex: many psych. studies are done with college students as subjects. If the researcher can make the case that the college students are like other people in the relevant characteristics, then it's OK, but you can't use the concept of statistical inference that we'll get to later.

Disadvantages of ALL non-scientific samples:

Can't do statistical inference.

Quota Samples

A quota sample is one where you set beforehand the numbers of specific types of elements you want in the sample, like three M&Ms of every color, even though we know that is not reflective of the underlying population. Or 50% white males, or 50% defective parts when only 10% are really defective. This is like the stratified OVERsample, but it has even more "casualness" to it. You keep drawing until you get enough of each particular type and discard the ones you don't need.
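
The "keep drawing and discard" logic can be sketched as follows, using the defective-parts example. The function names, the simulated 10% defect rate, the quota of 25 per type and the cap on the number of draws are all assumptions made for the sketch.

```python
import random

def quota_sample(draw, quotas, max_draws=100_000):
    """Keep drawing until every quota is filled; discard draws that are not needed."""
    counts = {kind: 0 for kind in quotas}
    sample = []
    for _ in range(max_draws):
        if counts == quotas:
            break                      # every quota has been met
        item, kind = draw()
        if kind in quotas and counts[kind] < quotas[kind]:
            sample.append(item)
            counts[kind] += 1
        # Otherwise the item is simply discarded.
    return sample, counts

rng = random.Random(3)

def draw_part():
    # Hypothetical process in which about 10% of parts are defective.
    kind = "defective" if rng.random() < 0.10 else "good"
    return (f"part_{rng.randrange(10**6)}", kind), kind

# Quota: 50% defective parts, even though only about 10% really are.
sample, counts = quota_sample(draw_part, {"defective": 25, "good": 25})
```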

Judgmental Sample

Recruit subjects according to a specific criterion of interest. For instance, Kristin Luker wanted to talk about abortion activists, so she sought out people who were really involved in California politics over abortion. She didn't want to know about everyone's opinions on abortion, just about the activists. Or you might start with one person that fits the bill and ask for recommendations of other people like her. This is big in studies of political elites (ask a staffer to recommend some friends, etc.).

Self-selection samples

Call-in, costly, etc. polls. Enough said!

Sample Design

Sample design covers the method of selection, the sample structure and plans for analysing and interpreting the results. Sample designs can vary from simple to complex and depend on the type of information required and the way the sample is selected.

Sample design affects the size of the sample and the way in which analysis is carried out. In simple terms the more precision the market researcher requires, the more complex will be the design and the larger the sample size.

The sample design may make use of the characteristics of the overall market population, but it does not have to be proportionally representative. It may be necessary to draw a larger sample than would be expected from some parts of the population; for example, to select more from a minority grouping to ensure that sufficient data is obtained for analysis on such groups.

Many sample designs are built around the concept of random selection. This permits justifiable inference from the sample to the population, at quantified levels of precision. Random selection also helps guard against sample bias in a way that selecting by judgement or convenience cannot.

Sampling Scheme
A sampling scheme defines what data will be obtained and how

A sampling scheme is a detailed description of what data will be obtained and how this will be done. In PPC we are faced with two different situations for developing sampling schemes. The first is when we are conducting a controlled experiment. There are very efficient and exact methods for developing sampling schemes for designed experiments and the reader is referred to the Process Improvement chapter for details.

Passive data collection

The second situation is when we are conducting a passive data collection (PDC) study to learn about the inherent properties of a process. These types of studies are usually for comparison purposes when we wish to compare properties of processes against each other or against some hypothesis. This is the situation that we will focus on here.

There are two principles that guide our choice of sampling scheme

Once we have selected our response parameters, it would seem to be a rather straightforward exercise to take some measurements, calculate some statistics and draw conclusions. There are, however, many things which can go wrong along the way that can be avoided with careful planning and knowing what to watch for. There are two overriding principles that will guide the design of our sampling scheme.

The first is precision

The first principle is that of precision. If the sampling scheme is properly laid out, the difference between our estimate of some parameter of interest and its true value will be due only to random variation. The size of this random variation is measured by a quantity called standard error. The magnitude of the standard error is known as precision. The smaller the standard error, the more precise are our estimates.

Precision of an estimate depends on several factors

The precision of any estimate will depend on:

the inherent variability of the process estimator
the measurement error
the number of independent replications (sample size)
the efficiency of the sampling scheme

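
To make the link between sample size and precision concrete, here is a small sketch. It assumes the estimate is a sample mean of independent measurements, in which case the usual estimate of the standard error is s / sqrt(n); the text above does not give a formula, and the measurement values are made up.

```python
import math
import statistics

def standard_error(measurements):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    n = len(measurements)
    s = statistics.stdev(measurements)   # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(n)

# Hypothetical measurements from the same process: 4 replications, then 16.
small = [9.8, 10.1, 10.3, 9.9]
large = small + [10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.2]

se_small = standard_error(small)
se_large = standard_error(large)
```

With similar variability in both sets, the larger sample gives the smaller standard error, that is, the more precise estimate.
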
The second is systematic sampling error (or confounded effects)

The second principle is the avoidance of systematic errors. Systematic sampling error occurs when the levels of one explanatory variable are the same as some other unaccounted for explanatory variable. This is also referred to as confounded effects. Systematic sampling error is best seen by example.

Example 1: We want to compare the effect of two different coolants on the resulting surface finish from a turning operation. It is decided to run one lot, change the coolant and then run another lot. With this sampling scheme, there is no way to distinguish the coolant effect from the lot effect or from tool wear considerations. There is systematic sampling error in this sampling scheme.

Example 2: We wish to examine the effect of two pre-clean procedures on the uniformity of an oxide growth process. We clean one cassette of wafers with one method and another cassette with the other method. We load one cassette in the front of the furnace tube and the other cassette in the middle. To complete the run, we fill the rest of the tube with other lots. With this sampling scheme, there is no way to distinguish between the effect of the different pre-clean methods and the cassette effect or the tube location effect. Again, we have systematic sampling errors.

Selecting the Most Appropriate Sampling Strategy

There are four primary sampling strategies:

Random sampling
Stratified random sampling
Systematic sampling
Rational sub-grouping

Before determining which strategy will work best, the analyst must determine what type of study is being conducted. There are normally two types of studies: population and process. With a population study, the analyst is interested in estimating or describing some characteristic of the population (inferential statistics). With a process study, the analyst is interested in predicting a process characteristic or change over time. It is important to make the distinction for proper selection of a sampling strategy.

The “I Love Lucy” television show's “Candy Factory” episode can be used to illustrate the difference. For example, a population study, using samples, would seek to determine the average weight of the entire daily run of candies. A process study would seek to know whether the weight was changing over the day.

Random Sampling

Random samples are used in population sampling situations when reviewing historical or batch data. The key to random sampling is that each unit in the population has an equal probability of being selected in the sample. Using random sampling protects against bias being introduced in the sampling process, and hence, it helps in obtaining a representative sample.

In general, random samples are taken by assigning a number to each unit in the population and using a random number table or Minitab to generate the sample list. Absent knowledge about the factors for stratification for a population, a random sample is a useful first step in obtaining samples.

For example, an improvement team in a human resources department wanted an accurate estimate of what proportion of employees had completed a personal development plan and reviewed it with their managers. The team used its database to obtain a list of all associates. Each associate on the list was assigned a number. Statistical software was used to generate a list of numbers to be sampled, and an estimate was made from the sample.

Stratified Random Sampling

Like random samples, stratified random samples are used in population sampling situations when reviewing historical or batch data. Stratified random sampling is used when the population has different groups (strata) and the analyst needs to ensure that those groups are fairly represented in the sample. In stratified random sampling, independent samples are drawn from each group. The size of each sample is proportional to the relative size of the group.

For example, the manager of a lending business wanted to estimate the average cycle time for a loan application process. She knows there are three types (strata) of loans (large, medium and small). Therefore, she wanted the sample to have the same proportion of large, medium and small loans as the population. She first separated the loan population data into three groups and then pulled a random sample from each group.

Systematic Sampling

Systematic sampling is typically used in process sampling situations when data is collected in real time during process operation. Unlike population sampling, a frequency for sampling must be selected. It also can be used for a population study if care is taken that the frequency is not biased.

Systematic sampling involves taking samples according to some systematic rule - e.g., every fourth unit, the first five units every hour, etc. One danger of using systematic sampling is that the systematic rule may match some underlying structure and bias the sample.

For example, the manager of a billing center is using systematic sampling to monitor processing rates. At random times around each hour, five consecutive bills are selected and the processing time is measured.

Rational Sub-Grouping

Rational sub-grouping is the process of putting measurements into meaningful groups to better understand the important sources of variation. Rational sub-grouping is typically used in process sampling situations when data is collected in real time during process operations. It involves grouping measurements produced under similar conditions, sometimes called short-term variation. This type of grouping assists in understanding the sources of variation between subgroups, sometimes called long-term variation.

The goal should be to minimize the chance of special causes in variation in the subgroup and maximize the chance for special causes between subgroups. Sub-grouping over time is the most common approach; sub-grouping can be done by other suspected sources of variation (e.g., location, customer, supplier, etc.).

For example, an equipment leasing business was trying to improve equipment turnaround time. They selected five samples per day from each of three processing centers. Each processing center was formed into a subgroup.

When using sub-grouping, form subgroups with items produced under similar conditions. To ensure items in a subgroup were produced under similar conditions, select items produced close together in time.
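
A small sketch of what sub-grouping buys you, loosely modeled on the leasing example: one subgroup per processing center, five measurements each. The turnaround times are invented, and using the range as the measure of spread is a simplification chosen for this sketch.

```python
import statistics

# Hypothetical turnaround times (days): five samples from each processing center.
subgroups = {
    "center_A": [4.1, 3.9, 4.0, 4.2, 3.8],
    "center_B": [4.0, 4.1, 3.9, 4.0, 4.0],
    "center_C": [5.1, 4.9, 5.0, 5.2, 4.8],
}

# Within-subgroup (short-term) variation: the spread inside each subgroup.
within = {name: max(v) - min(v) for name, v in subgroups.items()}

# Between-subgroup (long-term) variation: the spread of the subgroup means.
means = {name: statistics.mean(v) for name, v in subgroups.items()}
between = max(means.values()) - min(means.values())
```

In this made-up data each center is consistent internally, while the third center runs about a day slower than the other two, so the difference between subgroups stands out against the small variation within them.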
