How to calculate sample size correctly? Statistical sampling

Plan:

1. Problems of mathematical statistics.

2. Types of samples.

3. Selection methods.

4. Statistical distribution of the sample.

5. Empirical distribution function.

6. Polygon and histogram.

7. Numerical characteristics of the variation series.

8. Statistical estimates of distribution parameters.

9. Interval estimates of distribution parameters.

1. Problems and methods of mathematical statistics

Mathematical statistics is a branch of mathematics devoted to methods of collecting, processing, and analyzing statistical observation data for scientific and practical purposes.

Suppose we need to study a set of homogeneous objects with respect to some qualitative or quantitative feature that characterizes them. For example, for a batch of parts, conformity of a part to the standard can serve as a qualitative feature, and the controlled dimension of the part as a quantitative feature.

Sometimes a complete survey is carried out, i.e. every object is examined for the required characteristic. In practice, a complete survey is rarely used. For example, if the population contains a very large number of objects, a complete survey is physically impossible. If examining an object destroys it or requires large material costs, a complete survey makes no sense. In such cases, a limited number of objects (a sample) is randomly selected from the entire population and studied.

The main task of mathematical statistics is to study the entire population from sample data, i.e. to study the probabilistic properties of the population (its distribution law, numerical characteristics, and so on) in order to make management decisions under uncertainty.

2. Types of samples

The general population is the set of objects from which the sample is drawn.

The sample population (sample) is the set of randomly selected objects.

The size (volume) of a population is the number of objects in it. The size of the general population is denoted by N, and the size of the sample by n.

Example:

If out of 1000 parts 100 parts are selected for examination, then the volume of the general population N = 1000, and sample size n = 100.

There are two ways to form a sample: after an object is selected and observed, it may or may not be returned to the population. Accordingly, samples are divided into repeated and non-repeated.

A repeated sample is one in which the selected object is returned to the population before the next one is selected (sampling with replacement).

A non-repeated sample is one in which the selected object is not returned to the population (sampling without replacement).

In practice, non-repeated random sampling is usually used.
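The difference between the two kinds of sample can be illustrated with a minimal Python sketch. The function name `draw_sample` and the numbered list of parts are illustrative assumptions, not part of the original text.

```python
import random

def draw_sample(population, n, with_replacement=False, seed=None):
    """Draw n objects from the population (hypothetical helper)."""
    rng = random.Random(seed)
    if with_replacement:
        # repeated sample: each object is returned before the next draw,
        # so the same object may appear more than once
        return rng.choices(population, k=n)
    # non-repeated sample: a selected object cannot be drawn again
    return rng.sample(population, n)

parts = list(range(1, 1001))  # N = 1000 numbered parts
repeated = draw_sample(parts, 100, with_replacement=True, seed=1)
non_repeated = draw_sample(parts, 100, seed=1)
print(len(repeated), len(set(non_repeated)))
```

In the non-repeated sample all 100 objects are necessarily distinct; in the repeated sample duplicates are possible.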

To judge the characteristic of interest in the population with sufficient confidence from sample data, the sample objects must represent the population correctly: the sample must reproduce the proportions of the population. In other words, the sample should be representative.

By the law of large numbers, a sample can be expected to be representative if it is drawn at random.

If the population is large enough and the sample is only a small part of it, the distinction between repeated and non-repeated samples becomes negligible; in the limiting case of an infinite population and a finite sample, the difference disappears.

Example:

The American magazine Literary Digest conducted a statistical study to forecast the outcome of the 1936 US presidential election. The contenders were F. D. Roosevelt and A. M. Landon. Telephone directories were taken as the source for the general population of Americans studied. From them, 4 million addresses were randomly selected, and the magazine's editors sent postcards to these addresses asking the recipients to state their attitude towards the candidates. After processing the survey results, the magazine published a forecast that Landon would win the election by a large margin. The forecast was wrong: Roosevelt won.
This is an example of a non-representative sample. In the United States in the first half of the twentieth century, telephones were owned mainly by the wealthier part of the population, which supported Landon's views.

3. Selection methods

In practice, various selection methods are used. They can be divided into two types:

1. Selection that does not require dividing the population into parts: a) simple random non-repeated selection; b) simple random repeated selection.

2. Selection in which the population is divided into parts: a) typical selection; b) mechanical selection; c) serial selection.

Simple random selection is selection in which objects are drawn one at a time, at random, from the entire population.

Typical selection is selection in which objects are taken not from the entire population but from each of its "typical" parts. For example, if a part is produced on several machines, the selection is made not from the whole set of parts produced by all the machines but from the output of each machine separately. This selection is used when the characteristic being examined varies noticeably between the "typical" parts of the general population.

Mechanical selection is selection in which the general population is "mechanically" divided into as many groups as there are objects to be included in the sample, and one object is selected from each group. For example, to select 20% of the parts produced by a machine, every 5th part is taken; to select 5% of the parts, every 20th, and so on. Sometimes such selection does not ensure a representative sample (if every 20th turned roller is selected and the cutter is replaced immediately after each selection, then all the selected rollers will have been turned with blunt cutters).

Serial selection is selection in which objects are taken from the general population not one at a time but in "series", each of which is examined completely. For example, if products are manufactured by a large group of automatic machines, the output of only a few machines is examined in full.

In practice, combined selection is often used, in which the above methods are combined.
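The three methods that divide the population into parts can be sketched in Python as follows. This is only an illustration under assumed data: the machine names, part labels, and function names are hypothetical.

```python
import random

def mechanical_selection(items, step, start=0):
    """Take every step-th object, e.g. step=5 selects 20% of the items."""
    return items[start::step]

def typical_selection(groups, per_group, rng):
    """Draw per_group objects at random from each 'typical' part."""
    return {name: rng.sample(members, per_group)
            for name, members in groups.items()}

def serial_selection(groups, n_series, rng):
    """Pick whole series at random; every object of a chosen series is examined."""
    chosen = rng.sample(sorted(groups), n_series)
    return {name: groups[name] for name in chosen}

rng = random.Random(0)
# hypothetical data: 4 machines, 50 parts each
machines = {f"machine_{m}": [f"m{m}_part{i}" for i in range(50)]
            for m in range(4)}
all_parts = [p for parts in machines.values() for p in parts]

every_fifth = mechanical_selection(all_parts, step=5)    # 20% of 200 parts
typical = typical_selection(machines, per_group=5, rng=rng)
serial = serial_selection(machines, n_series=2, rng=rng)
print(len(every_fifth), sorted(serial))
```

Combined selection would chain these helpers, for instance choosing series first and then applying mechanical selection inside each chosen series.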

4. Statistical distribution of the sample

Let a sample be drawn from the general population, in which the value $x_1$ was observed $n_1$ times, $x_2$ was observed $n_2$ times, ..., $x_k$ was observed $n_k$ times, and $n = n_1 + n_2 + \dots + n_k$ is the sample size. The observed values $x_i$ are called variants, and the sequence of variants written in ascending order is called a variation series. The numbers of observations $n_i$ are called frequencies (absolute frequencies), and their ratios to the sample size, $w_i = n_i / n$, are called relative frequencies or statistical probabilities.

If the number of variants is large or the sample is taken from a continuous population, the variation series is built not from individual point values but from intervals of values. Such a variation series is called an interval series. The lengths of the intervals must be equal.

The statistical distribution of a sample is a list of the variants and their corresponding frequencies or relative frequencies.

The statistical distribution can also be specified as a sequence of intervals and their corresponding frequencies (the frequency of an interval is the sum of the frequencies of the variants falling within it).

A point variation series of frequencies can be represented by a table:

$x_i$    $x_1$    $x_2$    ...    $x_k$
$n_i$    $n_1$    $n_2$    ...    $n_k$

Similarly, one can write a point variation series of relative frequencies $w_i = n_i / n$.

Moreover:

$$\sum_{i=1}^{k} n_i = n, \qquad \sum_{i=1}^{k} w_i = 1.$$

Example:

The number of letters in a certain text X turned out to be 1000. The first letter encountered was "я", the second was "и", the third was "а", and the fourth was "ю". Then came the letters "о", "е", "у", "э", "ы".

Let's write down the positions these letters occupy in the Russian alphabet: 33, 10, 1, 32, 16, 6, 21, 31, 29.

After arranging these numbers in ascending order, we get the variation series: 1, 6, 10, 16, 21, 29, 31, 32, 33.

Frequencies of these letters in the text: "а" - 75, "е" - 87, "и" - 75, "о" - 110, "у" - 25, "ы" - 8, "э" - 3, "ю" - 7, "я" - 22.

Let's create a point variation series of frequencies:

$x_i$    1     6     10    16     21    29    31    32    33
$n_i$    75    87    75    110    25    8     3     7     22

Example:

A frequency distribution of a sample of size n = 20 is given:

$x_i$    2    6     12
$n_i$    3    10    7

Construct the point variation series of relative frequencies.

Solution:

Let's find the relative frequencies $w_i = n_i / n$: $w_1 = 3/20 = 0.15$, $w_2 = 10/20 = 0.5$, $w_3 = 7/20 = 0.35$.

$x_i$    2       6      12
$w_i$    0.15    0.5    0.35
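The same calculation can be sketched in Python. The helper name `relative_frequencies` is an illustrative assumption; exact fractions are used so that the check that the relative frequencies sum to one is not affected by rounding.

```python
from fractions import Fraction

def relative_frequencies(freqs):
    """Map each variant x_i to w_i = n_i / n (hypothetical helper)."""
    n = sum(freqs.values())  # sample size
    return {x: Fraction(count, n) for x, count in freqs.items()}

freqs = {2: 3, 6: 10, 12: 7}  # variants and frequencies from the example
w = relative_frequencies(freqs)
for x, w_i in w.items():
    print(x, float(w_i))
print("sum of relative frequencies:", sum(w.values()))
```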

When constructing an interval distribution, there are rules for choosing the number of intervals or the width of each interval. The criterion is a trade-off: as the number of intervals grows, the series describes the sample in more detail, but the amount of data and the time needed to process it increase. The difference $x_{\max} - x_{\min}$ between the largest and smallest variants is called the range of the sample.

To determine the number of intervals k, the empirical Sturges formula is typically used (with rounding to the nearest convenient integer): $k = 1 + 3.322 \lg n$, where $\lg$ is the base-10 logarithm.

Accordingly, the width h of each interval can be calculated using the formula:

$$h = \frac{x_{\max} - x_{\min}}{k}.$$
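A minimal Python sketch of these two steps follows. The function names are illustrative, and the numbers n = 20, 212, and 232 are taken from the voltage example later in this text.

```python
import math

def sturges_intervals(n):
    """Number of intervals k = 1 + 3.322 * lg(n), rounded to an integer."""
    return round(1 + 3.322 * math.log10(n))

def interval_width(x_min, x_max, k):
    """Width h of each of k equal intervals covering the sample range."""
    return (x_max - x_min) / k

k = sturges_intervals(20)
h = interval_width(212, 232, k)
print(k, h)
```

Rounding to the nearest integer is only one reading of "nearest convenient integer"; in practice k is sometimes adjusted so that the interval boundaries are round numbers.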

5. Empirical distribution function

Let's consider a sample from the general population. Let the statistical frequency distribution of the quantitative characteristic X be known. We introduce the notation: $n_x$ is the number of observations in which a value of the characteristic less than x was observed; n is the total number of observations (the sample size). The relative frequency of the event X < x is $n_x / n$. If x changes, the relative frequency changes too, i.e. the relative frequency $n_x / n$ is a function of x. Since it is found empirically, it is called empirical.

The empirical distribution function (sample distribution function) is the function that gives, for each x, the relative frequency of the event X < x:

$$F^*(x) = \frac{n_x}{n},$$

where $n_x$ is the number of variants less than x,

n is the sample size.

In contrast to the empirical distribution function of a sample, the distribution function F(x) of the general population is called the theoretical distribution function.

The difference between the empirical and theoretical distribution functions is that the theoretical function F(x) determines the probability of the event X < x, whereas the empirical function F*(x) determines the relative frequency of the same event. By the law of large numbers (Bernoulli's theorem), the relative frequency of the event X < x, i.e. F*(x), tends in probability to the probability F(x) of this event. That is, for large n, F*(x) and F(x) differ little from each other.

Thus, it is advisable to use the empirical distribution function of the sample to approximate the theoretical (cumulative) distribution function of the general population.

F*(x) has all the properties of F(x):

1. The values of F*(x) belong to the interval [0, 1].

2. F*(x) is a non-decreasing function.

3. If $x_1$ is the smallest variant, then F*(x) = 0 for $x \le x_1$; if $x_k$ is the largest variant, then F*(x) = 1 for $x > x_k$.

That is, F*(x) serves to estimate F(x).

If the sample is given by a variation series, then the empirical function has the form:

$$F^*(x) = \begin{cases} 0, & x \le x_1, \\ \dfrac{n_1}{n}, & x_1 < x \le x_2, \\ \dfrac{n_1 + n_2}{n}, & x_2 < x \le x_3, \\ \dots \\ 1, & x > x_k. \end{cases}$$

The graph of the empirical function is called a cumulative curve (cumulate).

Example:

Construct the empirical function for the given sample distribution:

$x_i$    2     6     10
$n_i$    12    18    30

Solution:

The sample size is n = 12 + 18 + 30 = 60. The smallest variant is 2, so F*(x) = 0 for x ≤ 2. The event X < 6 (i.e. $x_1 = 2$) was observed 12 times, so F*(x) = 12/60 = 0.2 for 2 < x ≤ 6. The event X < 10 (i.e. $x_1 = 2$, $x_2 = 6$) was observed 12 + 18 = 30 times, so F*(x) = 30/60 = 0.5 for 6 < x ≤ 10. Since x = 10 is the largest variant, F*(x) = 1 for x > 10. The desired empirical function has the form:

$$F^*(x) = \begin{cases} 0, & x \le 2, \\ 0.2, & 2 < x \le 6, \\ 0.5, & 6 < x \le 10, \\ 1, & x > 10. \end{cases}$$

Cumulates:


The cumulate makes it possible to read information off the graph, for example to answer the question: "Determine the number of observations in which the value of the characteristic was less than 6, and the number in which it was not less than 6." Since F*(6) = 0.2, the number of observations in which the value of the characteristic was less than 6 is 0.2 · n = 0.2 · 60 = 12. The number of observations in which the value was at least 6 is (1 - 0.2) · n = 0.8 · 60 = 48.
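The empirical function from this example can be sketched in Python. The name `empirical_cdf` is an illustrative assumption; the strict inequality mirrors the definition of F*(x) as the relative frequency of the event X < x.

```python
def empirical_cdf(freqs):
    """Return F*(x) = n_x / n for a sample given as {variant: frequency}."""
    n = sum(freqs.values())

    def F(x):
        # n_x counts observations whose value is strictly less than x
        n_x = sum(count for value, count in freqs.items() if value < x)
        return n_x / n

    return F

freqs = {2: 12, 6: 18, 10: 30}
n = sum(freqs.values())
F = empirical_cdf(freqs)

less_than_6 = round(F(6) * n)         # observations with value < 6
at_least_6 = round((1 - F(6)) * n)    # observations with value >= 6
print(F(2), F(6), F(10), F(11), less_than_6, at_least_6)
```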

If an interval variation series is given, then to compile the empirical distribution function, the midpoints of the intervals are found and from them the empirical distribution function is obtained similarly to the point variation series.

6. Polygon and histogram

For clarity, various graphs of the statistical distribution are constructed: polygons and histograms.

The frequency polygon is a broken line whose segments connect the points $(x_1; n_1), (x_2; n_2), \dots, (x_k; n_k)$, where $x_i$ are the variants and $n_i$ are the corresponding frequencies.

The relative frequency polygon is a broken line whose segments connect the points $(x_1; w_1), (x_2; w_2), \dots, (x_k; w_k)$, where $x_i$ are the variants and $w_i$ are the corresponding relative frequencies.

Example:

Construct a polygon of relative frequencies from the given sample distribution:

Solution:

In the case of a continuous characteristic, it is advisable to construct a histogram. To do this, the interval containing all the observed values of the characteristic is divided into several partial intervals of length h, and for each partial interval $n_i$ is found, the sum of the frequencies of the variants falling into the i-th interval. (For example, a person's height or weight is a continuous characteristic.)

The frequency histogram is a stepped figure consisting of rectangles whose bases are the partial intervals of length h and whose heights are equal to the ratio $n_i / h$ (frequency density).

The area of the i-th partial rectangle is $h \cdot n_i / h = n_i$, the sum of the frequencies of the variants in the i-th interval. Hence the area of the frequency histogram equals the sum of all frequencies, i.e. the sample size.

Example:

The results of voltage measurements (in volts) in an electrical network are given. Construct the variation series, the frequency polygon, and the frequency histogram if the voltage values are as follows: 227, 215, 230, 232, 223, 220, 228, 222, 221, 226, 226, 215, 218, 220, 216, 220, 225, 212, 217, 220.

Solution:

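Let's create a variation series. We have n = 20, $x_{\min} = 212$, $x_{\max} = 232$.

Let's apply the Sturges formula to calculate the number of intervals: $k = 1 + 3.322 \lg 20 \approx 5.32$, so we take k = 5 intervals of width $h = (232 - 212)/5 = 4$.

The interval variation series of frequencies has the form (each interval includes its left end; the last interval also includes its right end):

Interval    Frequency $n_i$    Frequency density $n_i / h$
212-216     3                  0.75
216-220     3                  0.75
220-224     7                  1.75
224-228     4                  1
228-232     3                  0.75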
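The grouping of the 20 voltage values into intervals can be checked with a short Python sketch. The function name `interval_series` and the convention that each interval includes its left end (with the maximum assigned to the last interval) are assumptions chosen to match the densities in the table.

```python
data = [227, 215, 230, 232, 223, 220, 228, 222, 221, 226,
        226, 215, 218, 220, 216, 220, 225, 212, 217, 220]

def interval_series(values, k):
    """Group values into k equal intervals; return edges, counts, densities."""
    lo, hi = min(values), max(values)
    h = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # left-closed intervals; the maximum value goes into the last interval
        i = min(int((v - lo) / h), k - 1)
        counts[i] += 1
    edges = [lo + i * h for i in range(k + 1)]
    densities = [c / h for c in counts]
    return edges, counts, densities

edges, counts, densities = interval_series(data, k=5)
# midpoints of the intervals, used as abscissas of the frequency polygon
midpoints = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
print(counts, densities, midpoints)
```

The sum of the rectangle areas, h times the sum of the densities, equals the sample size 20, in agreement with the property of the frequency histogram.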

Let's build a frequency histogram:

Let's construct a frequency polygon by first finding the midpoints of the intervals:


The relative frequency histogram is a stepped figure consisting of rectangles whose bases are the partial intervals of length h and whose heights are equal to the ratio $w_i / h$ (relative frequency density).

The area of the i-th partial rectangle is equal to the relative frequency of the variants falling into the i-th interval. That is, the area of the relative frequency histogram equals the sum of all relative frequencies, i.e. one.

7. Numerical characteristics of the variation series

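Let's consider the main characteristics of the general and sample populations.

The general mean $\bar{x}_G$ is the arithmetic mean of the values of the characteristic in the general population.

For distinct values $x_1, x_2, x_3, \dots, x_N$ of the characteristic in a general population of size N we have:

$$\bar{x}_G = \frac{1}{N}\sum_{i=1}^{N} x_i.$$

If the values $x_1, \dots, x_k$ have corresponding frequencies $N_1, \dots, N_k$, with $N_1 + N_2 + \dots + N_k = N$, then

$$\bar{x}_G = \frac{1}{N}\sum_{i=1}^{k} x_i N_i.$$

The sample mean $\bar{x}_S$ is the arithmetic mean of the values of the characteristic in the sample. For distinct values $x_1, \dots, x_n$ of a sample of size n:

$$\bar{x}_S = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

If the values $x_1, \dots, x_k$ have corresponding frequencies $n_1, \dots, n_k$, with $n_1 + n_2 + \dots + n_k = n$, then

$$\bar{x}_S = \frac{1}{n}\sum_{i=1}^{k} x_i n_i.$$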

Example:

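Calculate the sample mean for the sample: $x_1 = 51.12$; $x_2 = 51.07$; $x_3 = 52.95$; $x_4 = 52.93$; $x_5 = 51.1$; $x_6 = 52.98$; $x_7 = 52.29$; $x_8 = 51.23$; $x_9 = 51.07$; $x_{10} = 51.04$.

Solution:

The sample mean is the sum of the ten values divided by the sample size: $\bar{x}_S = 517.78 / 10 = 51.778$.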

The general variance $D_G$ is the arithmetic mean of the squared deviations of the values of the characteristic X in the general population from the general mean $\bar{x}_G$.

For distinct values $x_1, x_2, x_3, \dots, x_N$ of the characteristic in a general population of size N we have:

$$D_G = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x}_G)^2.$$

If the values $x_1, \dots, x_k$ have corresponding frequencies $N_1, \dots, N_k$, with $N_1 + N_2 + \dots + N_k = N$, then

$$D_G = \frac{1}{N}\sum_{i=1}^{k} N_i (x_i - \bar{x}_G)^2.$$

The general standard deviation is the square root of the general variance: $\sigma_G = \sqrt{D_G}$.

The sample variance $D_S$ is the arithmetic mean of the squared deviations of the observed values of the characteristic from the sample mean $\bar{x}_S$.

For distinct values $x_1, x_2, x_3, \dots, x_n$ of the characteristic in a sample of size n we have:

$$D_S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}_S)^2.$$

If the values $x_1, \dots, x_k$ have corresponding frequencies $n_1, \dots, n_k$, with $n_1 + n_2 + \dots + n_k = n$, then

$$D_S = \frac{1}{n}\sum_{i=1}^{k} n_i (x_i - \bar{x}_S)^2.$$

The sample standard deviation is the square root of the sample variance: $\sigma_S = \sqrt{D_S}$.


Example:

The sample population is specified by the distribution table. Find the sample variance.


Solution:

Theorem: The variance is equal to the difference between the mean of the squares of the characteristic values and the square of their mean: $D = \overline{x^2} - (\bar{x})^2$.
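The theorem can be checked numerically with a small Python sketch. The function names are illustrative, and the data are the ten values from the sample mean example earlier in this section; the two ways of computing the variance should agree up to floating-point rounding.

```python
def mean(values):
    return sum(values) / len(values)

def variance_by_definition(values):
    """Mean of squared deviations from the mean (divides by n)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def variance_by_theorem(values):
    """Mean of the squares minus the square of the mean."""
    return mean([x * x for x in values]) - mean(values) ** 2

sample = [51.12, 51.07, 52.95, 52.93, 51.1,
          52.98, 52.29, 51.23, 51.07, 51.04]
d_def = variance_by_definition(sample)
d_thm = variance_by_theorem(sample)
print(mean(sample), d_def, d_thm)
```

The second form is convenient for hand calculation, but it subtracts two large, nearly equal numbers, so in floating-point code the definition-based form is usually the safer choice.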

Example:

Find the variance of this distribution.



Solution:

8. Statistical estimates of distribution parameters

Let the general population be studied using a certain sample. In this case, it is possible to obtain only an approximate value of the unknown parameter Q, which serves as its estimate. Obviously, estimates may vary from one sample to another.

A statistical estimate Q* of an unknown parameter Q of the theoretical distribution is a function f of the observed sample values. The task of statistical estimation of unknown parameters from a sample is to construct, from the available observation data, a function that gives the most accurate approximations of the true values of these parameters, which are unknown to the researcher.

Statistical estimates are divided into point and interval estimates, depending on how they are expressed (a number or an interval).

A point estimate of a parameter Q of the theoretical distribution is a statistical estimate determined by a single number $Q^* = f(x_1, x_2, \dots, x_n)$, where $x_1, x_2, \dots, x_n$ are the results of observations of the quantitative characteristic X in a sample.

Such estimates obtained from different samples usually differ from each other. The absolute difference $|Q^* - Q|$ is called the sampling (estimation) error.

In order for statistical estimates to produce reliable results about the parameters being estimated, they must be unbiased, efficient, and consistent.

A point estimate whose mathematical expectation is equal (not equal) to the estimated parameter is called unbiased (biased). For an unbiased estimate, M(Q*) = Q.

The difference M(Q*) - Q is called the bias, or systematic error. For unbiased estimates, the bias is 0.

An efficient estimate Q* is one that, for a given sample size n, has the smallest possible variance: $D(Q^*) = D_{\min}$ (n = const). An efficient estimator has the smallest variance among the unbiased and consistent estimators.

A consistent estimate Q* is one that, as $n \to \infty$, tends in probability to the estimated parameter Q, i.e. as the sample size n increases, the estimate tends in probability to the true value of Q.

The requirement of consistency agrees with the law of large numbers: the more initial information there is about the object under study, the more accurate the result. If the sample size is small, a point estimate of the parameter can lead to serious errors.

Any sample (of size n) can be thought of as an ordered set $x_1, x_2, \dots, x_n$ of independent, identically distributed random variables.

Sample means of different samples of the same size n from the same population will differ. That is, the sample mean can be regarded as a random variable, which means we can speak of the distribution of the sample mean and of its numerical characteristics.

The sample mean satisfies all the requirements imposed on statistical estimates, i.e. it gives an unbiased, efficient, and consistent estimate of the general mean.

It can be proven that $M(D_S) = \frac{n-1}{n} D_G$, where $D_S$ is the sample variance and $D_G$ is the general variance. Thus, the sample variance is a biased estimate of the general variance and underestimates it. That is, with a small sample size it produces a systematic error. To obtain an unbiased, consistent estimate, it is enough to take the value $s^2 = \frac{n}{n-1} D_S$, which is called the corrected variance. That is,

$$s^2 = \frac{n}{n-1} D_S = \frac{1}{n-1}\sum_{i=1}^{k} n_i (x_i - \bar{x}_S)^2.$$

In practice, the corrected variance is used to estimate the general variance when n < 30. Otherwise (n > 30), the difference between $D_S$ and $s^2$ is hardly noticeable. Therefore, for large n the bias can be neglected.
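The relation between the biased sample variance and the corrected variance can be illustrated with Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. The data are the ten values from the sample mean example earlier in this text; the variable names are illustrative.

```python
import statistics

sample = [51.12, 51.07, 52.95, 52.93, 51.1,
          52.98, 52.29, 51.23, 51.07, 51.04]
n = len(sample)

d_s = statistics.pvariance(sample)  # biased sample variance (divides by n)
s2 = statistics.variance(sample)    # corrected variance (divides by n - 1)

# the corrected variance equals n / (n - 1) times the biased one
print(d_s, s2, n / (n - 1) * d_s)
```

For n = 10 the correction factor is 10/9, about 11%; for n = 100 it is already only about 1%, which is why the bias is usually neglected for large samples.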

It can also be proven that the relative frequency $n_i / n$ is an unbiased and consistent estimate of the probability $P(X = x_i)$. The empirical distribution function F*(x) is an unbiased and consistent estimate of the theoretical distribution function F(x) = P(X < x).

Example:

Find unbiased estimates of the expected value and variance from the sample table.

x i
n i

Solution:

Sample size n =20.

An unbiased estimate of the mathematical expectation is the sample mean.


To calculate the unbiased variance estimate, we first find the sample variance:

Now let's find the unbiased estimate:

9. Interval estimates of distribution parameters

An interval estimate is a statistical estimate determined by two numbers, the ends of the interval.

The number $\delta > 0$ for which $|Q - Q^*| < \delta$ characterizes the accuracy of the interval estimate.

A confidence interval is an interval $(Q^* - \delta, Q^* + \delta)$ that covers the unknown value of the parameter Q with a given probability $\gamma$. The complement of the confidence interval in the set of all possible values of the parameter Q is called the critical region. If the critical region lies on only one side of the confidence interval, the confidence interval is called one-sided: left-sided if the critical region lies only on the left, and right-sided if only on the right. Otherwise, the confidence interval is called two-sided.

The reliability, or confidence probability (confidence level), of the estimate of Q by Q* is the probability $\gamma$ with which the inequality $|Q - Q^*| < \delta$ holds: $P(|Q - Q^*| < \delta) = \gamma$.

Most often, the confidence probability is set in advance (0.95; 0.99; 0.999) and is required to be close to one.

The probability $\alpha = 1 - \gamma$ is called the probability of error, or the significance level.

Let $|Q - Q^*| < \delta$. Then $Q^* - \delta < Q < Q^* + \delta$. This means that with probability $\gamma$ it can be asserted that the true value of the parameter Q belongs to the interval $(Q^* - \delta, Q^* + \delta)$. The smaller $\delta$, the more accurate the estimate.

The boundaries (ends) of the confidence interval are called confidence limits, or critical limits.

The values of the confidence limits depend on the distribution law of the estimate Q*.

The value $\delta$, equal to half the width of the confidence interval, is called the accuracy of the estimate.

Methods for constructing confidence intervals were first developed by the American statistician J. Neyman. The accuracy of the estimate $\delta$, the confidence probability $\gamma$, and the sample size n are interrelated. Therefore, knowing two of these quantities, one can always calculate the third.

Finding a confidence interval for the mathematical expectation of a normal distribution when the standard deviation is known.

Let a sample be taken from a normally distributed general population. Let the general standard deviation $\sigma$ be known, and the mathematical expectation a of the theoretical distribution be unknown.

The following formula holds:

$$P\left(\bar{x}_S - \delta < a < \bar{x}_S + \delta\right) = 2\Phi(t) = \gamma, \qquad \delta = \frac{t\sigma}{\sqrt{n}},$$

where $\bar{x}_S$ is the sample mean and $\Phi(t) = \frac{1}{\sqrt{2\pi}}\int_0^t e^{-z^2/2}\,dz$ is the Laplace function.

That is, from a given deviation $\delta$ one can find the probability with which the unknown general mean belongs to the interval $(\bar{x}_S - \delta, \bar{x}_S + \delta)$, and vice versa. The formula shows that as the sample size increases with the confidence probability fixed, $\delta$ decreases, i.e. the accuracy of the estimate increases. As the reliability (confidence probability) increases, $\delta$ increases, i.e. the accuracy of the estimate decreases.
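Since the accuracy, the confidence probability, and the sample size are linked, the relation can be solved for n. For a normal population with known standard deviation, the half-width of the interval is t·σ/√n, where t is the standard normal quantile at level (1 + γ)/2, so the smallest adequate sample size is the ceiling of (t·σ/δ)². The sketch below assumes this setting; the function name `required_sample_size` is illustrative.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, delta, gamma):
    """Smallest n with t * sigma / sqrt(n) <= delta (sigma assumed known)."""
    # two-sided interval: t is the standard normal quantile at (1 + gamma) / 2
    t = NormalDist().inv_cdf((1 + gamma) / 2)
    return math.ceil((t * sigma / delta) ** 2)

# sigma = 2, reliability 0.95
print(required_sample_size(sigma=2, delta=1.0, gamma=0.95))
print(required_sample_size(sigma=2, delta=0.5, gamma=0.95))
```

Halving the required accuracy δ roughly quadruples the sample size, because n grows with the square of 1/δ.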

Example:

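As a result of the tests, the following values were obtained: -25, 34, -20, 10, 21. It is known that they follow a normal distribution with a standard deviation of 2. Find the estimate a* of the mathematical expectation a and construct a 90% confidence interval for it.

Solution:

Let's find an unbiased estimate, the sample mean:

$$a^* = \frac{-25 + 34 - 20 + 10 + 21}{5} = 4.$$

Then, for a confidence probability of 0.9, the table of the normal distribution gives $t \approx 1.645$, and the accuracy of the estimate is

$$\delta = \frac{t\sigma}{\sqrt{n}} = \frac{1.645 \cdot 2}{\sqrt{5}} \approx 1.47.$$

The confidence interval for a is: 4 - 1.47 < a < 4 + 1.47, or 2.53 < a < 5.47.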
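The calculation in this example can be reproduced with a short Python sketch. The function name is illustrative; the quantile comes from `statistics.NormalDist` in the standard library rather than from a printed table.

```python
import math
from statistics import NormalDist, mean

def mean_ci_known_sigma(values, sigma, gamma):
    """Two-sided confidence interval for the mean when sigma is known."""
    t = NormalDist().inv_cdf((1 + gamma) / 2)   # about 1.645 for gamma = 0.9
    delta = t * sigma / math.sqrt(len(values))  # accuracy of the estimate
    m = mean(values)
    return m - delta, m + delta

low, high = mean_ci_known_sigma([-25, 34, -20, 10, 21], sigma=2, gamma=0.9)
print(round(low, 2), round(high, 2))  # 2.53 5.47
```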

Finding a confidence interval for the mathematical expectation of a normal distribution when the standard deviation is unknown.

Let it be known that the general population follows a normal distribution whose parameters a and $\sigma$ are unknown. The accuracy $\delta$ of the confidence interval that covers the true value of the parameter a with reliability $\gamma$ is in this case calculated by the formula:

$$\delta = \frac{t_\gamma s}{\sqrt{n}},$$

where n is the sample size, s is the corrected sample standard deviation, and $t_\gamma$ is Student's coefficient (found from the given values of n and $\gamma$ in the table "Critical points of the Student distribution").

Example:

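As a result of the tests, the following values were obtained: -35, -32, -26, -35, -30, -17. It is known that they follow a normal distribution. Find the confidence interval for the mathematical expectation a of the population with a confidence probability of 0.9.

Solution:

Let's find an unbiased estimate of the mathematical expectation, the sample mean:

$$\bar{x}_S = \frac{-35 - 32 - 26 - 35 - 30 - 17}{6} = -\frac{175}{6} \approx -29.2.$$

Let's find the corrected standard deviation:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x}_S)^2} \approx 6.85.$$

Then the accuracy of the estimate is $\delta = t_\gamma s / \sqrt{n} \approx 5.62$.

The confidence interval takes the form (-29.2 - 5.62; -29.2 + 5.62), or (-34.82; -23.58).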
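A minimal Python sketch of this calculation follows. The standard library has no Student distribution, so the coefficient is passed in as a table value: 2.015 is the two-sided critical point for reliability 0.9 and 5 degrees of freedom. The function name is illustrative.

```python
import math
import statistics

def mean_ci_unknown_sigma(values, t_gamma):
    """Sample mean, corrected standard deviation, and half-width delta."""
    n = len(values)
    m = statistics.mean(values)
    s = statistics.stdev(values)          # corrected: divides by n - 1
    delta = t_gamma * s / math.sqrt(n)
    return m, s, delta

# t_gamma = 2.015: table value for reliability 0.9 and n - 1 = 5 degrees of freedom
m, s, delta = mean_ci_unknown_sigma([-35, -32, -26, -35, -30, -17], t_gamma=2.015)
print(round(m, 1), round(s, 2), round(delta, 2))
```

With the coefficient 2.015 the half-width comes out marginally larger than the 5.62 quoted in the text, presumably because the tabulated coefficient was rounded there.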

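Finding the confidence interval for the variance and standard deviation of a normal distribution

Let a random sample of size n < 30 be taken from a normally distributed general population, and let the sample variances be calculated for it: the biased variance $D_S = \sigma_S^2$ and the corrected variance $s^2$. Then, to find interval estimates with a given reliability $\gamma$ for the general variance $D_G$ and the general standard deviation $\sigma_G$, the following formulas are used:

$$\frac{\sqrt{n}\,\sigma_S}{\chi_2} \le \sigma_G \le \frac{\sqrt{n}\,\sigma_S}{\chi_1}$$

or, equivalently,

$$\frac{\sqrt{n-1}\,s}{\chi_2} \le \sigma_G \le \frac{\sqrt{n-1}\,s}{\chi_1}.$$

The values $\chi_1^2$ and $\chi_2^2$ are found from a table of critical points of the Pearson ($\chi^2$) distribution with n - 1 degrees of freedom: they cut off probability $(1 - \gamma)/2$ in the lower and upper tails, respectively.

The confidence interval for the variance is obtained from these inequalities by squaring all parts of the inequality.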

Example:

The quality of 15 bolts was checked. Assuming that the manufacturing error follows a normal distribution and that the sample standard deviation $\sigma_S$ is 5 mm, determine with reliability $\gamma = 0.95$ the confidence interval for the unknown parameter $\sigma$.

We represent the boundaries of the interval in the form of a double inequality:

The ends of the two-sided confidence interval for the variance can be determined without performing arithmetic operations for a given level of confidence and sample size using the appropriate table (Limits of confidence intervals for variance depending on the number of degrees of freedom and reliability). To do this, the ends of the interval obtained from the table are multiplied by the corrected variance s 2.

Example:

Let's solve the previous problem in a different way.

Solution:

Let's find the corrected variance:

$$s^2 = \frac{n}{n-1}\,\sigma_S^2 = \frac{15}{14} \cdot 25 \approx 26.79.$$

Using the table "Limits of confidence intervals for the variance depending on the number of degrees of freedom and reliability", we find the limits of the confidence interval for the variance at k = 14 and $\gamma = 0.95$: lower limit 0.513 and upper limit 2.354.

Let's multiply these limits by $s^2$ and take the square root (since we need a confidence interval for the standard deviation, not for the variance):

$$\sqrt{0.513 \cdot 26.79} \le \sigma \le \sqrt{2.354 \cdot 26.79}, \quad \text{i.e.} \quad 3.71 \le \sigma \le 7.94.$$

As the examples show, the width of the confidence interval depends on the method of construction; the methods give similar but not identical results.
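The two approaches can be compared in a short Python sketch for the bolt example (n = 15, sample standard deviation 5 mm, reliability 0.95). Two things are assumptions here: the first function uses the standard equal-tailed chi-square interval, and the quantiles 5.629 and 26.119 are standard table values for 14 degrees of freedom at levels 0.025 and 0.975. The multipliers 0.513 and 2.354 are the ones quoted in the text. Function names are illustrative.

```python
import math

def sigma_ci_chi2(n, sample_sd, chi2_lower, chi2_upper):
    """Equal-tailed interval for sigma from chi-square quantiles (n - 1 d.f.)."""
    ss = n * sample_sd ** 2  # n * D_S, which equals (n - 1) * s^2
    return math.sqrt(ss / chi2_upper), math.sqrt(ss / chi2_lower)

def sigma_ci_table(n, sample_sd, lower_mult, upper_mult):
    """Interval for sigma from tabulated multipliers of the corrected variance."""
    s2 = n / (n - 1) * sample_sd ** 2  # corrected variance
    return math.sqrt(lower_mult * s2), math.sqrt(upper_mult * s2)

# chi-square quantiles for 14 degrees of freedom (standard table values)
ci_chi2 = sigma_ci_chi2(15, 5, chi2_lower=5.629, chi2_upper=26.119)
# multipliers quoted in the text for k = 14 and reliability 0.95
ci_table = sigma_ci_table(15, 5, lower_mult=0.513, upper_mult=2.354)
print([round(v, 2) for v in ci_chi2], [round(v, 2) for v in ci_table])
```

The two intervals overlap heavily but are not identical, which agrees with the remark that the result depends on the method of construction.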

For samples of sufficiently large size (n > 30), the limits of the confidence interval for the general standard deviation can be determined by the formula $s(1 - q) < \sigma < s(1 + q)$, where q is a number that is tabulated and given in the corresponding reference table.

The formula has this form when q < 1.

Example:

Let's solve the previous problem in a third way.

Solution:

Previously we found s = 5.17. From the table, q(0.95; 15) = 0.46.

Then:

$$5.17\,(1 - 0.46) < \sigma < 5.17\,(1 + 0.46), \quad \text{i.e.} \quad 2.79 < \sigma < 7.55.$$

Sample

A sample, or sample population, is a set of cases (subjects, objects, events, specimens) selected from the general population by a certain procedure to take part in a study.

Sample characteristics:

  • Qualitative characteristics of the sample: who exactly we select and which sampling methods we use to do so.
  • Quantitative characteristics of the sample: how many cases we select, in other words, the sample size.

Necessity of sampling

  • The object of study is very extensive. For example, the consumers of a global company's products are spread over a huge number of geographically dispersed markets.
  • There is a need to collect primary information.

Sample size

Sample size is the number of cases included in the sample. For statistical reasons, it is recommended that the number of cases be at least 30-35.

Dependent and independent samples

When comparing two (or more) samples, an important property is whether they are dependent. If a one-to-one pairing can be established between the two samples (that is, each case from sample X corresponds to one and only one case from sample Y and vice versa), and the basis of this pairing matters for the trait being measured, such samples are called dependent. Examples of dependent samples:

  • pairs of twins,
  • two measurements of a trait before and after an experimental intervention,
  • husbands and wives,
  • and so on.

If there is no such relationship between the samples, they are considered independent.

Accordingly, dependent samples always have the same size, while the sizes of independent samples may differ.

Samples are compared using various statistical tests.

Representativeness

The sample may be considered representative or non-representative.

Example of a non-representative sample

  1. A study with experimental and control groups, which are placed in different conditions.
    • Study with experimental and control groups using a pairwise selection strategy
  2. A study using only one group - an experimental group.
  3. A study using a mixed (factorial) design - all groups are placed in different conditions.

Sampling types

Samples are divided into two types:

  • probabilistic
  • non-probabilistic

Probability samples

  1. Simple probability sampling:
    • Simple random sampling with replacement. This method assumes that each respondent is equally likely to be included in the sample. Based on the list of the general population, cards with respondent numbers are prepared. They are put in a deck and shuffled; a card is drawn at random, its number is recorded, and the card is returned to the deck. The procedure is repeated as many times as the required sample size. Disadvantage: the same unit can be selected more than once.

The procedure for constructing a simple random sample includes the following steps:

1. Obtain a complete list of the members of the population and number it. Such a list is called a sampling frame.

2. Determine the intended sample size, that is, the expected number of respondents.

3. Take as many numbers from a random number table as there are sample units required. If the sample should contain 100 people, 100 random numbers are taken from the table. These random numbers can also be generated by a computer program.

4. Select from the sampling frame the observations whose numbers match the recorded random numbers.
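The four steps above can be sketched in Python, with a pseudo-random generator standing in for the random number table. The respondent labels and the function name are hypothetical, and the sketch draws the numbers without repetition.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw a simple random sample of n units from a sampling frame."""
    rng = random.Random(seed)
    # step 1: number the complete list of population members (1-based)
    numbered = dict(enumerate(frame, start=1))
    # steps 2-3: draw n distinct random numbers, one per required sample unit
    numbers = rng.sample(range(1, len(frame) + 1), n)
    # step 4: select the units whose numbers were drawn
    return [numbered[i] for i in sorted(numbers)]

frame = [f"respondent_{i}" for i in range(1, 501)]  # hypothetical frame, N = 500
sample = simple_random_sample(frame, 100, seed=42)
print(len(sample), sample[:3])
```

Fixing the seed makes the draw reproducible, which is useful for documenting how a sample was obtained.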

  • Simple random sampling has obvious advantages. This method is extremely easy to understand. The results of the study can be generalized to the population being studied. Most approaches to statistical inference involve collecting information using a simple random sample. However, the simple random sampling method has at least four significant limitations:

1. It is often difficult to create a sampling frame that would allow simple random sampling.

2. Simple random sampling may result in a large population, or a population distributed over a large geographic area, which significantly increases the time and cost of data collection.

3. The results of simple random sampling are often characterized by low precision and a larger standard error than the results of other probability methods.

4. As a result of using SRS, a non-representative sample may be formed. Although samples obtained by simple random sampling, on average, adequately represent the population, some of them are extremely misrepresentative of the population being studied. This is especially likely when the sample size is small.

  • Simple random sampling without replacement. The sampling procedure is the same, except that the cards with respondent numbers are not returned to the deck.
  1. Systematic probability sampling. This is a simplified version of simple probability sampling. Respondents are selected from the list of the general population at a fixed interval (K), with a randomly chosen starting point. The most reliable results are obtained with a homogeneous population; otherwise the step size may coincide with some internal cyclic pattern in the list and bias the sample. Disadvantages: the same as in a simple probability sample.
  2. Serial (cluster) sampling. The selection units are naturally occurring groups, or series (a family, a school, a team, etc.). Every element of a selected series is examined. The series themselves can be selected by random or systematic sampling. Disadvantage: the units within a series may be more homogeneous than the general population.
  3. Regional (stratified) sampling. When the population is heterogeneous, it is recommended to divide it into homogeneous parts before applying probability sampling with any selection technique; such a sample is called a regional, or stratified, sample. The groups may be natural formations (for example, city districts) or may be defined by any characteristic that forms the basis of the study. The characteristic on which the division is based is called the stratification (zoning) characteristic.
  4. Convenience sampling. The procedure consists of contacting "convenient" sampling units: a group of students, a sports team, friends and neighbors. If the goal is to learn how people react to a new concept, this type of sampling is quite reasonable. Convenience sampling is often used to pretest questionnaires.

Non-probability samples

Selection in such a sample is carried out not according to the principles of randomness, but according to subjective criteria - availability, typicality, equal representation, etc.

  1. Quota sampling - the sample is constructed as a model that reproduces the structure of the general population in the form of quotas (proportions) of the characteristics being studied. The number of sample elements with different combinations of studied characteristics is determined so that it corresponds to their share (proportion) in the general population. So, for example, if our general population consists of 5,000 people, of which 2,000 are women and 3,000 are men, then in the quota sample we will have 20 women and 30 men, or 200 women and 300 men. Quota samples are most often based on demographic criteria: gender, age, region, income, education, and others. Disadvantages: usually such samples are not representative, because it is impossible to take into account several social parameters at once. Pros: readily available material.
  2. Snowball method. The sample is constructed as follows. Each respondent, starting with the first, is asked for contact information of his friends, colleagues, acquaintances who would fit the selection conditions and could take part in the study. Thus, with the exception of the first step, the sample is formed with the participation of the research objects themselves. The method is often used when it is necessary to find and interview hard-to-reach groups of respondents (for example, respondents with a high income, respondents belonging to the same professional group, respondents with any similar hobbies/interests, etc.)
  3. Spontaneous sampling – sampling of the so-called “first person you come across”. Often used in television and radio polls. The size and composition of spontaneous samples is not known in advance, and is determined only by one parameter - the activity of respondents. Disadvantages: it is impossible to establish which population the respondents represent, and as a result, it is impossible to determine representativeness.
  4. Route survey. Often used when the unit of study is the family. All streets are numbered on a map of the locality where the survey will be carried out. Large numbers are then selected using a table (or generator) of random numbers. Each large number is read as three components: street number (the first 2-3 digits), house number, and apartment number. For example, in the number 14832, 14 is the street number on the map, 8 is the house number, and 32 is the apartment number.
  5. Regional sampling with selection of typical objects. If, after zoning, a typical object is selected from each group, i.e. an object that is close to the average in terms of most of the characteristics studied in the study, such a sample is called regionalized with the selection of typical objects.

  6. Modal sampling.
  7. Expert sampling.
  8. Heterogeneous sampling.
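
The quota example given earlier (a general population of 5,000 with 2,000 women and 3,000 men) can be checked with a few lines of Python. The function name `quota_allocation` and the dictionary layout are assumptions for illustration only. Note that with less convenient numbers, simple rounding may produce quotas that do not add up exactly to the requested sample size.

```python
def quota_allocation(group_sizes, sample_size):
    """Split sample_size across groups in proportion to their population share."""
    total = sum(group_sizes.values())
    return {
        group: round(sample_size * size / total)
        for group, size in group_sizes.items()
    }

population = {"women": 2000, "men": 3000}
quotas_50 = quota_allocation(population, 50)    # {'women': 20, 'men': 30}
quotas_500 = quota_allocation(population, 500)  # {'women': 200, 'men': 300}
```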

Group Building Strategies

The selection of groups for participation in a psychological experiment is carried out using various strategies to ensure that internal and external validity are maintained to the greatest possible extent.

Randomization

Randomization, or random selection, is used to create simple random samples. The use of such a sample is based on the assumption that each member of the population is equally likely to be included in the sample. For example, to make a random sample of 100 university students, you can put pieces of paper with the names of all university students in a hat, and then take 100 pieces of paper out of it - this will be a random selection (Goodwin J., p. 147).

Pairwise selection

Pairwise selection is a strategy for constructing sample groups in which the groups are made up of subjects who are equivalent on secondary parameters that matter for the experiment. This strategy is effective for experiments with experimental and control groups; the best option is to involve twin pairs (mono- and dizygotic), as it allows you to create...

Stratometric sampling

Stratometric sampling is randomization with the allocation of strata (or clusters). With this method, the general population is divided into groups (strata) by certain characteristics (gender, age, political preferences, education, income level, etc.), and subjects with the corresponding characteristics are selected from each group.

Approximate Modeling

Approximate modeling means drawing limited samples and generalizing conclusions from such a sample to a wider population. For example, when 2nd-year university students take part in a study, its results are extended to "people aged 17 to 21". The admissibility of such generalizations is extremely limited.

Approximate modeling is the formation of a model that, for a clearly defined class of systems (processes), describes its behavior (or desired phenomena) with acceptable accuracy.

One of the main components of a well-designed study is defining the sample and understanding what makes it representative. Think of a cake: you don't have to eat the whole dessert to know how it tastes. A small piece is enough.

So, the cake is the general population (that is, all respondents who are eligible for the survey). It can be defined geographically (for example, only residents of the Moscow region), by gender (women only), or by age (Russians over 65 years old).

Counting the general population exactly is difficult: it requires data from a population census or from preliminary assessment surveys. Therefore the general population is usually estimated, and the sample population, or sample, is calculated from the resulting number.

What is a representative sample?

A sample is a clearly defined number of respondents. Its structure should match the structure of the general population as closely as possible on the main selection characteristics.

For example, if the potential respondents are the entire population of Russia, of whom 54% are women and 46% are men, then the sample should contain the same proportions. If the parameters match, the sample can be called representative. This means that inaccuracies and errors in the study are reduced to a minimum.

The sample size is determined by balancing the requirements of accuracy and economy. These requirements pull in opposite directions: the larger the sample, the more accurate the result, but the higher the cost of the study. Conversely, the smaller the sample, the lower the cost, but the less accurately and more randomly the properties of the general population are reproduced.

Therefore, to calculate the sample size, sociologists devised a formula and built special calculators around it.

Confidence probability and confidence error

What do the terms "confidence probability" and "confidence error" mean? Confidence probability indicates how reliable the measurement is, and confidence error is the possible error in the research results. For example, for a population of more than 500,000 people (say, the residents of Novokuznetsk), the sample will be 384 people at a confidence probability of 95% and an error of 5% (that is, a confidence interval of 95±5%).

What follows from this? If 100 studies are conducted with such a sample (384 people), then in about 95 of them the answers obtained will, according to the laws of statistics, fall within ±5% of the true value for the population. The sample is thus representative, with a minimal probability of statistical error.
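
The text does not reproduce the formula behind the calculator. One widely used choice is Cochran's formula for a proportion with a finite population correction, and the sketch below assumes that formula, the most conservative proportion p = 0.5, and a hypothetical function name `sample_size`. It is meant only to show where a figure like 384 can come from.

```python
import math
from statistics import NormalDist

def sample_size(population, confidence=0.95, error=0.05, p=0.5):
    """Sample size for estimating a proportion (Cochran's formula, assumed here)."""
    # Two-sided normal quantile for the chosen confidence probability.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Size needed for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / error ** 2
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

n = sample_size(500_000)
```

For a population of 500,000 with the default 95% confidence probability and 5% error, this returns 384, the figure quoted above.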

In the theory of the sampling method, various selection methods and types of sampling have been developed to ensure representativeness. The selection method is the procedure for selecting units from the general population. There are two selection methods: repeated and non-repetitive.

In repeated sampling, each randomly selected unit is returned to the general population after being surveyed and can be selected again at a later draw. This method follows the "returned ball" scheme: the probability of being included in the sample stays the same for every unit of the population, regardless of the number of units already selected.

In non-repetitive sampling, each randomly selected unit is not returned to the general population after being surveyed. This method follows the "non-returned ball" scheme: the probability of being selected increases for each remaining unit of the general population as selection proceeds.

Depending on how the sample population is formed, the following main types of sampling are distinguished:

  • simple random (purely random);
  • mechanical;
  • typical (stratified, zoned);
  • serial (cluster, nested);
  • combined;
  • multi-stage;
  • multiphase;
  • interpenetrating.

A simple random sample is formed in strict accordance with the scientific principles and rules of random selection. To obtain such a sample, the population is strictly divided into sampling units, and then a sufficient number of units is selected at random, in repeated or non-repetitive order.

Random selection is like drawing lots. In practice it is most often carried out with special tables of random numbers. If, for example, 40 units are to be selected from a population of 1587 units, then 40 four-digit numbers not exceeding 1587 are taken from the table.

When a simple random sample is organized as a repeated sample, the standard error is calculated by formula (6.1). With non-repetitive sampling, the formula for the standard error becomes

$$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)}, \qquad (6.2)$$

where $\mu$ is the standard error, $\sigma^2$ is the variance of the characteristic, $n$ is the sample size, $N$ is the size of the general population, and $1 - n/N$ is the proportion of units of the general population that were not included in the sample. Since this proportion is always less than one, the error in non-repetitive selection is, other things being equal, always smaller than in repeated selection. Non-repetitive selection is also easier to organize than repeated selection, and it is used much more often. However, the standard error for non-repetitive sampling can be determined by the simpler formula (6.1). Such a replacement is acceptable if the proportion of units of the general population not included in the sample is large, so that the factor $1 - n/N$ is close to unity.
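
To see how much the factor 1 - n/N matters, the two standard errors can be compared numerically. The sketch below assumes the usual textbook forms, the square root of variance/n for repeated selection and the same quantity multiplied by (1 - n/N) under the root for non-repetitive selection. The function name `standard_error` and the numbers are illustrative assumptions.

```python
import math

def standard_error(variance, n, N=None):
    """Standard error of a simple random sample mean.

    With N given, the non-repetitive (without replacement) correction
    factor (1 - n/N) is applied; otherwise selection is treated as repeated.
    """
    se = math.sqrt(variance / n)
    if N is not None:
        se *= math.sqrt(1 - n / N)
    return se

se_repeated = standard_error(25.0, 100)             # 0.5
se_non_repetitive = standard_error(25.0, 100, 1000)  # smaller: 10% of the population is sampled
se_large_population = standard_error(25.0, 100, 1_000_000)  # almost equal to se_repeated
```

When the sample is a tiny fraction of the population, the two values nearly coincide, which is why the simpler formula is often acceptable.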

Forming a sample in strict accordance with the rules of random selection is practically very difficult, and sometimes impossible, since when using tables of random numbers it is necessary to number all units of the general population. Quite often, the population is so large that it is extremely difficult and impractical to carry out such preliminary work, so in practice other types of samples are used, each of which is not strictly random. However, they are organized in such a way as to ensure maximum approximation to the conditions of random selection.

In purely mechanical sampling, the entire general population must first be presented as a list of selection units compiled in some order that is neutral with respect to the characteristic being studied, for example alphabetically. The list is then divided into as many equal parts as there are units to be selected. Next, one unit is selected from each part of the list according to a pre-established rule unrelated to the variation of the characteristic under study. This type of sampling does not always guarantee random selection, and the resulting sample may be biased. First, the ordering of units in the general population may contain a non-random element. Second, selecting from each part of the list with an incorrectly chosen starting point can also lead to a bias error. In practice, however, a mechanical sample is easier to organize than a random one, and it is the type most often used in sample surveys. The standard error of mechanical sampling is determined by the formula for simple random non-repetitive sampling (6.2).
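
A minimal sketch of mechanical (systematic) selection is shown below. The "pre-established rule" is assumed here to be a random starting position within the first part of the list followed by a fixed step; the function name `mechanical_sample` and the alphabetical frame are hypothetical.

```python
import random

def mechanical_sample(frame, n, seed=None):
    """Take one unit from each of n equal parts of an ordered list."""
    rng = random.Random(seed)
    step = len(frame) // n         # length of each part (the selection interval)
    start = rng.randrange(step)    # assumed rule: random position inside the first part
    return [frame[start + i * step] for i in range(n)]

# Hypothetical frame: 1,000 units listed in a neutral (e.g. alphabetical) order.
frame = [f"unit_{i:04d}" for i in range(1000)]
sample = mechanical_sample(frame, 100, seed=7)
```

If the list has a periodic pattern whose period matches `step`, every selected unit falls at the same point of the cycle, which is the source of bias described above.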

Typical (zoned, stratified) sampling has two goals:

  • to ensure that the typical groups of the general population are represented in the sample according to the characteristics of interest to the researcher;
  • to increase the accuracy of the sample survey results.

In typical sampling, the general population is divided into typical groups before the sample is formed. The correct choice of the grouping characteristic is very important here. The typical groups may contain the same or different numbers of selection units. In the first case the sample is formed with an equal share of selection from each group; in the second, with a share proportional to the group's share in the general population. A sample formed with an equal share of selection is essentially equivalent to a series of simple random samples from smaller populations, each of which is a typical group. Selection from each group is carried out randomly (repeated or non-repetitive) or mechanically.

With a typical sample, whether the share of selection is equal or unequal, the influence of intergroup variation of the characteristic on the accuracy of the results is eliminated, since every typical group is guaranteed to be represented in the sample. The standard error of the sample then depends not on the total variance $\sigma^2$ but on the average of the group variances $\overline{\sigma_i^2}$. Since the average of the group variances does not exceed the total variance (and is smaller whenever the group means differ), the standard error of a typical sample is, all other things being equal, smaller than that of a simple random sample.

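The standard error of a typical sample is determined by the following formulas.

With the repeated selection method:

$$\mu = \sqrt{\frac{\overline{\sigma_i^2}}{n}}$$

With the non-repetitive selection method:

$$\mu = \sqrt{\frac{\overline{\sigma_i^2}}{n}\left(1 - \frac{n}{N}\right)}$$

where $\overline{\sigma_i^2}$ is the average of the group variances in the sample population.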
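
As a rough numerical illustration, the standard error of a typical (stratified) sample can be computed from the group variances. The sketch assumes the usual textbook form in which the average of the group variances, weighted by the group sample sizes, replaces the total variance; the function name `typical_standard_error` and the numbers are invented for the example.

```python
import math

def typical_standard_error(group_variances, group_sample_sizes, N=None):
    """Standard error of a typical (stratified) sample.

    Uses the average within-group variance, weighted by group sample sizes.
    With N given, the non-repetitive correction (1 - n/N) is applied.
    """
    n = sum(group_sample_sizes)
    mean_within = sum(
        v * m for v, m in zip(group_variances, group_sample_sizes)
    ) / n
    se = math.sqrt(mean_within / n)
    if N is not None:
        se *= math.sqrt(1 - n / N)
    return se

# Two hypothetical typical groups with variances 4 and 16, 50 units sampled from each.
se_repeated = typical_standard_error([4.0, 16.0], [50, 50])
se_non_repetitive = typical_standard_error([4.0, 16.0], [50, 50], N=1000)
```

Here the average group variance is 10, so the repeated-selection error is the square root of 10/100.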

Serial (cluster) sampling is a way of forming the sample in which groups of units (series, nests), rather than the individual units to be surveyed, are selected at random. Within the selected series (nests), all units are examined. Serial sampling is easier in practice to organize and conduct than sampling individual units. However, with this type of sampling, first, the representation of each series is not ensured and, second, the influence of inter-series variation of the studied characteristic on the survey results is not eliminated. When this variation is significant, it increases the random error of representativeness. The researcher must take this into account when choosing the type of sample. The standard error of serial sampling is determined by the formulas:

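With the repeated selection method:

$$\mu = \sqrt{\frac{\delta^2}{r}}$$

where $\delta^2$ is the inter-series variance of the sample population and $r$ is the number of selected series.

With the non-repetitive selection method:

$$\mu = \sqrt{\frac{\delta^2}{r}\left(1 - \frac{r}{R}\right)}$$

where $R$ is the number of series in the general population.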
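
The serial-sampling error can be illustrated with a short calculation. The sketch assumes equal-sized series, takes the inter-series variance as the mean squared deviation of the series means from their overall mean, and uses the correction factor (1 - r/R) for non-repetitive selection; some textbooks use (R - r)/(R - 1) instead. The function name `serial_standard_error` and the data are hypothetical.

```python
import math

def serial_standard_error(series_means, R=None):
    """Standard error of a serial (cluster) sample of equal-sized series."""
    r = len(series_means)
    overall_mean = sum(series_means) / r
    # Inter-series variance: spread of the series means around the overall mean.
    between = sum((m - overall_mean) ** 2 for m in series_means) / r
    se = math.sqrt(between / r)
    if R is not None:
        # Assumed non-repetitive correction; R is the number of series in the population.
        se *= math.sqrt(1 - r / R)
    return se

means = [10.0, 12.0, 14.0, 16.0]           # means of four selected series
se_repeated = serial_standard_error(means)
se_non_repetitive = serial_standard_error(means, R=16)
```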

In practice, particular methods and types of sampling are chosen depending on the purpose and objectives of the survey and on the possibilities for organizing and conducting it. Most often a combination of selection methods and types of sampling is used. Such samples are called combined. Various combinations are possible: mechanical and serial sampling, typical and mechanical, serial and simple random, etc. Combined sampling is used to achieve the greatest representativeness at the lowest labor and monetary cost of organizing and conducting the survey.

With a combined sample, the standard error is made up of the errors at each stage and can be determined as the square root of the sum of the squared errors of the corresponding samples. So, if mechanical and typical samples were used together, the standard error can be determined by the formula

$$\mu = \sqrt{\mu_1^2 + \mu_2^2}$$

where $\mu_1$ and $\mu_2$ are the standard errors of the mechanical and typical samples, respectively.

The distinguishing feature of multi-stage sampling is that the sample population is formed gradually, stage by stage. At the first stage, first-stage units are selected by a predetermined method and type of selection. At the second stage, second-stage units are selected from each first-stage unit included in the sample, and so on. There can be more than two stages. At the last stage the sample population is formed, and its units are surveyed. For example, in a sample survey of household budgets, territorial subjects of the country are selected at the first stage, districts within the selected regions at the second, enterprises or organizations within each municipality at the third, and, finally, families within the selected enterprises at the fourth.

Thus, the sample population is formed at the last stage. Multistage sampling is more flexible than other types, although it generally produces less accurate results than a single-stage sample of the same size. However, it has one important advantage, which is that the sampling frame for multi-stage selection needs to be built at each stage only for those units that were included in the sample, and this is very important, since often there is no ready-made sampling frame.

The standard error of multi-stage sampling for groups of equal size is determined by the formula

$$\mu = \sqrt{\frac{\mu_1^2}{n_1} + \frac{\mu_2^2}{n_2} + \frac{\mu_3^2}{n_3} + \dots}$$

where $\mu_1, \mu_2, \mu_3, \dots$ are the standard errors at the different stages and $n_1, n_2, n_3, \dots$ are the numbers of units sampled at the corresponding selection stages.

If the groups are unequal in volume, then theoretically this formula cannot be used. But if the total proportion of selection at all stages is constant, then in practice the calculation using this formula will not lead to a distortion of the error value.

The essence of multiphase sampling is that a subsample is formed from the initially formed sample population, the next subsample is formed from that subsample, and so on. The initial sample population is the first phase, the subsample drawn from it is the second, etc. Multiphase sampling is advisable when:

  • different sample sizes are required to study different characteristics;
  • the variability of the studied characteristics is not the same and the required accuracy differs;
  • less detailed information must be collected for all units of the initial sample population (first phase), and more detailed information for the units of each subsequent phase.

One of the undoubted advantages of multiphase sampling is that information obtained in the first phase can be used as additional information in the subsequent phases, information from the second phase in the phases after it, and so on. Using information in this way increases the accuracy of the survey results.

When organizing multiphase sampling, you can use a combination of different methods and types of selection (typical sampling with mechanical sampling, etc.). Multiphase selection can be combined with multistage selection. At each stage, sampling can be multiphase.

The standard error in multiphase sampling is calculated for each phase separately in accordance with the formulas of the selection method and type of sampling with which its sample population was formed.

Interpenetrating samples are two or more independent samples from the same general population, formed by the same method and type of selection. It is advisable to use them when preliminary results of a sample survey are needed within a short time. Interpenetrating samples are effective for assessing survey results: if the results agree across the independent samples, this indicates that the survey data are reliable. They can also sometimes be used to check the work of different researchers by having each of them survey a different sample.

The standard error for interpenetrating samples is determined by the same formula as the typical proportional sample (5.3). Interpenetrating samples, compared to other types, require more labor and money, so the researcher must take this into account when designing a sample survey.

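The maximum (marginal) error for each selection method and type of sampling is determined by the formula $\Delta = t\mu$, where $\mu$ is the corresponding standard error and $t$ is the confidence coefficient that corresponds to the chosen confidence probability.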
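
The step from a standard error to a maximum error is a single multiplication by the confidence coefficient t. The sketch below assumes that t is taken from the standard normal distribution for the chosen confidence probability (about 1.96 for 95%); the function name `marginal_error` and the input value are illustrative.

```python
from statistics import NormalDist

def marginal_error(standard_error, confidence=0.95):
    """Maximum (marginal) sampling error: confidence coefficient times standard error."""
    # Two-sided normal quantile, e.g. about 1.96 for a 95% confidence probability.
    t = NormalDist().inv_cdf(0.5 + confidence / 2)
    return t * standard_error

delta_95 = marginal_error(0.5)          # roughly 0.98
delta_99 = marginal_error(0.5, 0.99)    # wider: higher confidence probability
```

Raising the confidence probability increases t and therefore widens the interval around the sample estimate.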

It is often necessary to analyze a specific social phenomenon and obtain information about it. Such tasks arise constantly in statistics and statistical research. It is often impossible to examine a social phenomenon in its entirety. For example, how can one find out the opinion of all residents of a certain city on some issue? Asking absolutely everyone is almost impossible and very time-consuming. In such cases sampling is needed. It is the concept on which almost all studies and analyses are based.

What is sampling

When analyzing a specific social phenomenon, it is necessary to obtain information about it. In almost any study, not every unit of the population under study is examined; only a certain part of it is taken into account. This process is sampling: only certain units of the population are examined.

Of course, a lot depends on the type of sample, but there are basic rules. The main one is that selection from the population must be random: units should not be chosen because they meet some criterion. Roughly speaking, if the sample is to be drawn from the population of a certain city and only men are selected, the study will be in error, because selection was made on the basis of gender rather than at random. Almost all sampling methods are based on this rule.

Sampling rules

In order for the selected set to reflect the main qualities of the entire phenomenon, it must be built according to specific laws, where the main attention must be paid to the following categories:

  • sample (sample population);
  • population;
  • representativeness;
  • representativeness error;
  • population unit;
  • sampling methods.

The features of sample observation and sampling are as follows:

  1. All results are based on mathematical laws and rules; if the research is carried out correctly and the calculations are correct, the results will not be distorted by subjective factors.
  2. Results are obtained much faster and with fewer resources, because only part of the population is studied rather than all of it.
  3. It can be used to study a wide range of subjects: from specific questions, such as the age and gender of a group of interest, to public opinion or the level of material well-being of the population.

Selective observation

Sample observation is a statistical observation in which not the entire population is studied, but only a part of it selected in a certain way, and the results obtained from this part are extended to the entire population. This part is called the sample population. It is the only practical way to study a large array of research objects.

But sample observation is needed only when the population is too large to examine in full. For example, a study of the ratio of men to women in the world will rely on sample observation: for obvious reasons, it is impossible to count every inhabitant of our planet.

If the same study concerns not all the inhabitants of the earth but a particular class 2 "A" in a specific school in a certain city, sample observation is unnecessary. It is quite possible to examine the entire population: count the boys and girls in the class, and that gives the ratio.

Sample and population

In fact, it is not as difficult as it sounds. Any object of study involves two sets: the general population and the sample population. All units belong to the general population; the sample population consists of those units of the general population that were selected. If everything is done correctly, the selected part is a reduced model of the entire (general) population.

Only two kinds of general population can be distinguished: definite and indefinite, depending on whether the total number of units is known. If the population is definite, sampling is easier, because you know what percentage of the total number of units will be sampled.

This matters a great deal in research. Suppose, for example, that the percentage of low-quality confectionery products at a specific plant must be determined, and that the population is known: the plant produces exactly 1000 confectionery products per year. If 100 products are drawn at random from this thousand and sent for examination, the error will be small. Roughly speaking, 10% of all products were examined, and the results, allowing for the representativeness error, can be extended to the quality of all products.

If instead 100 confectionery products are sampled from an indefinite population that in reality contains, say, 1 million units, the result of the sample and of the study itself may be far less reliable. Therefore, knowing the size of the population is in most cases extremely important and strongly influences the result of the study.

Representativeness of the population

So one of the most important questions is what the sample should look like. At this stage the sample size is calculated and units are selected from the general population. A sample has been selected correctly if the relevant features and characteristics of the general population are preserved in it. This property is called representativeness.

In other words, if the selected part retains the same tendencies and characteristics as the entire population, the sample is called representative. But a representative sample cannot be drawn from every population: there are research objects for which a sample simply cannot be representative. This is where the concept of representativeness error arises; it is discussed in more detail below.

How to make a sample

So, in order to maximize representativeness, there are three basic sampling rules:


Error of representativeness

The main measure of the quality of a sample is the "representativeness error": the discrepancy between the indicators obtained from sample observation and those from complete observation. By the size of the error, representativeness is classed as reliable, ordinary, or approximate, corresponding to deviations of up to 3%, from 3 to 10%, and from 10 to 20%, respectively. In statistics, however, it is desirable that the error not exceed 5-6%; otherwise there is reason to speak of insufficient representativeness. The size of the representativeness error depends on several factors:

  1. The probability with which an accurate result must be obtained.
  2. Number of units in the sample population. As mentioned earlier, the fewer units the sample contains, the greater the representativeness error will be, and vice versa.
  3. Homogeneity of the study population. The more heterogeneous a population is, the greater the representativeness bias will be. The ability of a population to be representative depends on the homogeneity of all its constituent units.
  4. The method of selecting units in the sample population.
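The way these four factors interact can be illustrated with a minimal sketch. It assumes simple random selection and uses the standard textbook formula for the maximum error of a sample mean, which this text does not state explicitly. The function name `margin_of_error` and its parameters are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def margin_of_error(std_dev, n, confidence=0.95, population_size=None):
    """Maximum sampling error of the mean under simple random selection.

    std_dev: standard deviation of the characteristic (heterogeneity)
    n: number of units in the sample
    confidence: probability with which the result must hold
    population_size: if given, selection is treated as non-repeated
        (without replacement) and a finite-population correction is applied
    """
    # Confidence coefficient t: two-sided quantile of the standard normal law
    # (an assumption that is reasonable for large samples).
    t = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Variance of the sample mean for repeated selection.
    variance_of_mean = std_dev ** 2 / n
    if population_size is not None:
        # Non-repeated selection reduces the error by the factor (1 - n/N).
        variance_of_mean *= 1 - n / population_size
    return t * math.sqrt(variance_of_mean)


print(margin_of_error(std_dev=10, n=100))   # about 1.96
print(margin_of_error(std_dev=10, n=400))   # smaller: more units
print(margin_of_error(std_dev=20, n=100))   # larger: more heterogeneity
print(margin_of_error(std_dev=10, n=100, population_size=1000))
```

A higher required probability, a smaller sample, or a more heterogeneous population each increases the error, while non-repeated selection from a finite population decreases it.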

In specific studies, the acceptable percentage error of the mean is usually set by the researcher, based on the observation program and on data from previous studies. As a rule, a maximum sampling error (error of representativeness) of 3-5% is considered acceptable.

More is not always better

It is also worth remembering that the main aim when organizing sample observation is to keep the sample size to an acceptable minimum. One should not strive to reduce the limits of sampling error excessively, since this leads to an unjustified increase in the sample size and, consequently, in the cost of the observation.

At the same time, the permitted error of representativeness should not be made excessively large either. Although this reduces the sample size, it also makes the results less reliable.
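The trade-off between error and sample size can be made concrete with a short sketch. It assumes simple random selection and the standard textbook sample-size formula for a mean, which this text does not give. The name `required_sample_size` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(std_dev, max_error, confidence=0.95, population_size=None):
    """Smallest sample size that keeps the error of the mean within max_error."""
    # Confidence coefficient t from the standard normal law (an assumption).
    t = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Repeated selection: n = t^2 * sigma^2 / delta^2.
    n = (t * std_dev / max_error) ** 2
    if population_size is not None:
        # Non-repeated selection: n = t^2 sigma^2 N / (delta^2 N + t^2 sigma^2),
        # written here in the equivalent form n / (1 + n / N).
        n = n / (1 + n / population_size)
    return math.ceil(n)


print(required_sample_size(std_dev=10, max_error=2))  # 97
print(required_sample_size(std_dev=10, max_error=1))  # 385
```

Under these assumptions, halving the permitted error roughly quadruples the required sample size, which is why an overly strict error limit quickly becomes expensive.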

What questions are usually posed to the researcher?

If any research is carried out, it is for some purpose and to obtain some results. When conducting a sample study, the initial questions typically asked are:


Methods for selecting research units in the sample

Not every sample is representative. Sometimes the same characteristic is expressed differently in the whole and in its part. To meet the requirements of representativeness, it is advisable to use various sampling techniques, and the choice of technique depends on the specific circumstances. These sampling techniques include:

  • random selection;
  • mechanical selection;
  • typical selection;
  • serial (cluster) selection.

Random selection is a set of measures for selecting units from the population at random, so that every unit of the population has the same probability of being included in the sample. This technique is advisable only when the population is homogeneous and has a small number of inherent characteristics; otherwise some characteristics risk not being reflected in the sample. The principles of random selection underlie all other sampling methods.
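A minimal sketch of random selection, assuming the population is available as a list and that selection is non-repeated. The function name `random_selection` and the `seed` parameter are hypothetical and serve only the illustration.

```python
import random


def random_selection(population, sample_size, seed=None):
    """Select sample_size distinct units, each with equal probability."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, sample_size)


units = list(range(1, 501))  # unit numbers 1..500
sample = random_selection(units, 25, seed=42)
print(len(sample))  # 25
```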

With mechanical selection, units are selected at a fixed interval. For example, to form a sample of specific crimes, one can pull every 5th, 10th or 15th card from all statistical cards of registered crimes, depending on their total number and the sample size available. The disadvantage of this method is that a complete record of population units must exist before selection, the units must then be ranked, and only after that can they be sampled at a fixed interval. This takes a long time, which is why the method is not often used.
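Mechanical selection can be sketched as follows. The example assumes the units are already recorded and ranked in a list. Choosing a random starting position within the first interval is a common convention and an assumption here, not something this text prescribes. The name `mechanical_selection` is hypothetical.

```python
import random


def mechanical_selection(ordered_units, interval, seed=None):
    """Take every interval-th unit from an already ranked list."""
    rng = random.Random(seed)
    # Assumption: start at a random position inside the first interval.
    start = rng.randrange(interval)
    return ordered_units[start::interval]


cards = list(range(1, 101))  # 100 ranked statistical cards
sample = mechanical_selection(cards, interval=10, seed=1)
print(len(sample))  # 10
```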

Typical (zoned) selection is a type of sampling in which the general population is divided into homogeneous groups according to a certain characteristic. Sometimes researchers use other terms instead of “groups”: “districts” and “zones”. Then, from each group, a certain number of units are randomly selected in proportion to the group’s share in the total population. Typical selection is often carried out in several stages.
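A single-stage version of typical selection with proportional allocation might look like the sketch below. The function `typical_selection` and the `group_of` callback are hypothetical names. Because group quotas are rounded, the total sample can differ slightly from the requested size.

```python
import random
from collections import defaultdict


def typical_selection(units, group_of, sample_size, seed=None):
    """Draw from each homogeneous group in proportion to its population share."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for unit in units:
        groups[group_of(unit)].append(unit)

    sample = []
    for members in groups.values():
        # Quota proportional to the group's share of the whole population.
        quota = round(sample_size * len(members) / len(units))
        quota = min(quota, len(members))
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical population: 60 units in zone "A", 40 units in zone "B".
units = [("A", i) for i in range(60)] + [("B", i) for i in range(40)]
sample = typical_selection(units, group_of=lambda u: u[0], sample_size=10, seed=0)
print(sum(1 for u in sample if u[0] == "A"))  # 6
print(sum(1 for u in sample if u[0] == "B"))  # 4
```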

Serial sampling is a method in which units are selected in groups (series), and all units of a selected group (series) are examined. Its advantage is that selecting series is sometimes easier than selecting individual units, for example, when studying persons serving a sentence. Within the selected areas and zones, all units without exception are studied, for example, all persons serving sentences in a particular institution.
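Serial selection can be sketched in the same style. The example assumes the series themselves are chosen by simple random selection, which is an assumption of this illustration. The names `serial_selection` and `series_of` are hypothetical.

```python
import random
from collections import defaultdict


def serial_selection(units, series_of, n_series, seed=None):
    """Choose whole series at random and examine every unit inside them."""
    rng = random.Random(seed)
    series = defaultdict(list)
    for unit in units:
        series[series_of(unit)].append(unit)

    # Assumption: the series are drawn by simple random selection.
    chosen = rng.sample(list(series), n_series)
    # Every unit of each chosen series enters the sample.
    return [unit for key in chosen for unit in series[key]]


# Hypothetical population: 4 institutions with 5 persons each.
units = [(inst, i) for inst in "ABCD" for i in range(5)]
sample = serial_selection(units, series_of=lambda u: u[0], n_series=2, seed=0)
print(len(sample))  # 10
```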