Statistics

Measures of Central Tendency

Calculate the mean, median, and mode of the following data set: 4, 8, 6, 5, 3, 2, 8, 9, 2, 5.

Calculate the mean, median, and mode of the following data set: 12, 15, 22, 9, 7, 12, 19, 20, 12.

Calculate the mean, median, and mode of the following data set:

$$3, 7, 9, 12, 15, 18, 20, 20, 21$$

Calculate the mean, median, and mode of the following data set:

$$11, 15, 17, 18, 18, 21, 23, 24, 25, 29$$

Calculate the weighted mean of the following grades and their respective weights:

Grades: $$70, 80, 90, 100$$

Weights: $$0.1, 0.3, 0.4, 0.2$$

Calculate the mean, median, and mode of the following data set:

$$8, 12, 15, 16, 19, 20, 22, 24, 27$$

Calculate the mean, median, and mode of the following data set:

$$1, 3, 3, 6, 8, 9, 9, 9, 12$$

Calculate the mean, median, and mode of the following data set:

$$30, 31, 32, 34, 36, 39, 42, 44, 45$$

Given the following ages of a group of people: $$25, 30, 30, 32, 35, 36, 38, 40, 42, 45$$, calculate the mean and median age.

Calculate the mean, median, and mode of the following data set:

$$5, 10, 10, 15, 20, 25, 30, 30, 35, 40, 40$$

Calculate the mean of a data set, given that the sum of the numbers is $$270$$ and there are $$10$$ numbers.

Calculate the mean, median, and mode of the following data set:

$$19, 23, 24, 24, 25, 28, 29, 30, 30, 32, 33, 34$$

Calculate the mean, median, and mode for the following grouped data:

$$\begin{array}{|c|c|} \hline \text{Interval} & \text{Frequency} \\ \hline 10-19 & 5 \\ 20-29 & 12 \\ 30-39 & 8 \\ 40-49 & 3 \\ \hline \end{array}$$

Calculate the mean, median, and mode for the following grouped data:

$$\begin{array}{|c|c|} \hline \text{Interval} & \text{Frequency} \\ \hline 0-9 & 7 \\ 10-19 & 15 \\ 20-29 & 20 \\ 30-39 & 10 \\ 40-49 & 3 \\ \hline \end{array}$$

Introduction

Definition of measures of central tendency

Importance of central tendency in data analysis

Mean, median, and mode

Mean (Arithmetic Mean)

Definition: $$\text{Mean} = \frac{\sum_{i=1}^n x_i}{n}$$

Calculation for ungrouped and grouped data

Properties of the mean

Median

Definition: The middle value when data is arranged in ascending or descending order

Calculation for ungrouped data (even and odd number of data points)

Calculation for grouped data

Mode

Definition: The value that occurs most frequently in the dataset

Calculation for ungrouped and grouped data

Properties of the mode

Skewness and Central Tendency Measures

Skewness: measure of the asymmetry of the probability distribution

Relation between skewness and mean, median, and mode

Interpretation of skewness

Exercises

Solutions

Exercises

Calculate the mean, median, and mode for the following data: $$4, 7, 9, 5, 7, 2, 8$$

Calculate the mean, median, and mode for the following data: $$12, 18, 20, 12, 25, 30, 12, 18$$

Calculate the mean, median, and mode for the following grouped data:

$$\begin{array}{|c|c|} \hline \text{Interval} & \text{Frequency} \\ \hline 10-19 & 5 \\ 20-29 & 12 \\ 30-39 & 8 \\ 40-49 & 3 \\ \hline \end{array}$$

Solutions

Mean: $$\text{Mean} = \frac{4 + 7 + 9 + 5 + 7 + 2 + 8}{7} = \frac{42}{7} = 6$$

Median: Arrange the data in ascending order: $$2, 4, 5, 7, 7, 8, 9$$. The middle (fourth) value is 7.

Mode: The number 7 appears twice, more frequently than any other value, so the mode is 7.

Mean: $$\text{Mean} = \frac{12 + 18 + 20 + 12 + 25 + 30 + 12 + 18}{8} = \frac{147}{8} = 18.375$$

Median: Arrange the data in ascending order: $$12, 12, 12, 18, 18, 20, 25, 30$$. The middle values are 18 and 18, so the median is 18.

Mode: The number 12 appears three times, more frequently than any other value (18 appears only twice), so the mode is 12.

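Mean for grouped data: Using the class midpoints 14.5, 24.5, 34.5, and 44.5, $$\text{Mean} \approx \frac{5(14.5) + 12(24.5) + 8(34.5) + 3(44.5)}{28} = \frac{776}{28} \approx 27.71$$

Median for grouped data: With 28 observations, the median falls in the class 20-29, which is preceded by 5 observations. Taking 20 as the lower limit of that class and 10 as the class width, $$\text{Median} \approx 20 + \frac{14 - 5}{12} \times 10 = 27.5$$

Mode for grouped data: The modal class is 20-29 (frequency 12, with neighbouring frequencies 5 and 8), so $$\text{Mode} \approx 20 + \frac{12 - 5}{2(12) - 5 - 8} \times 10 \approx 26.36$$

If the class boundary 19.5 is used instead of the lower limit 20, the median and mode estimates each decrease by 0.5.

Continuing the lesson on measures of central tendency, we now look more closely at skewness and its relationship to the mean, median, and mode.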

Skewness and Central Tendency Measures (continued)

Types of Skewness:

Symmetrical distribution: mean, median, and mode coincide (for a unimodal distribution)

Positively skewed (right-skewed) distribution: typically mean > median > mode

Negatively skewed (left-skewed) distribution: typically mean < median < mode

Example 7: Given the following data: $$1, 2, 2, 2, 3, 4, 7, 9$$

Mean: $$\text{Mean} = \frac{1 + 2 + 2 + 2 + 3 + 4 + 7 + 9}{8} = \frac{30}{8} = 3.75$$

Median: Arrange the data in ascending order (already arranged). Since there are 8 data points (even), the median is the average of the two middle values (2 and 3): $$\text{Median} = \frac{2 + 3}{2} = 2.5$$

Mode: The number 2 appears three times, more frequently than any other value, so the mode is 2.

Since the mean > median > mode, the data is positively skewed (right-skewed).
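The comparison in Example 7 can be checked with a short Python sketch. It uses the standard-library `statistics` module; the helper name `skew_direction` is an assumption made for illustration, and the mean/median/mode comparison is only a rule of thumb, not a formal skewness statistic.

```python
from statistics import mean, median, multimode

def skew_direction(data):
    """Classify skew by comparing mean, median, and mode (rule of thumb)."""
    m, med, modes = mean(data), median(data), multimode(data)
    if len(modes) != 1:
        # The rule needs a single mode to compare against.
        return "undetermined (no unique mode)"
    mode = modes[0]
    if m > med > mode:
        return "positively skewed"
    if m < med < mode:
        return "negatively skewed"
    if m == med == mode:
        return "symmetric"
    return "undetermined"

example = [1, 2, 2, 2, 3, 4, 7, 9]
print(mean(example), median(example), multimode(example))  # 3.75 2.5 [2]
print(skew_direction(example))  # positively skewed
```

The function returns "undetermined" whenever the three measures are not strictly ordered, which is a deliberate choice: the rule of thumb says nothing in that case.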

Exercises (continued)

Determine the skewness of the following data: $$5, 7, 10, 10, 12, 15, 20, 22$$

Determine the skewness of the following data: $$-4, -2, 0, 2, 4, 4, 4, 6$$

Solutions (continued)

Mean: $$\text{Mean} = \frac{5 + 7 + 10 + 10 + 12 + 15 + 20 + 22}{8} = \frac{101}{8} = 12.625$$

Median: Arrange the data in ascending order (already arranged). Since there are 8 data points (even), the median is the average of the two middle values, the fourth and fifth (10 and 12): $$\text{Median} = \frac{10 + 12}{2} = 11$$

Mode: The number 10 appears twice, more frequently than any other value, so the mode is 10.

Since the mean > median > mode, the data is positively skewed (right-skewed).

Mean: $$\text{Mean} = \frac{-4 + (-2) + 0 + 2 + 4 + 4 + 4 + 6}{8} = \frac{14}{8} = 1.75$$

Median: Arrange the data in ascending order (already arranged). Since there are 8 data points (even), the median is the average of the two middle values, the fourth and fifth (2 and 4): $$\text{Median} = \frac{2 + 4}{2} = 3$$

Mode: The number 4 appears three times, more frequently than any other value, so the mode is 4.

Since the mean < median < mode, the data is negatively skewed (left-skewed).

Measures of Dispersion

Calculate the range, variance, and standard deviation for the following dataset: $$\{10, 15, 20, 25, 30\}$$.

Given the data set $$\{5, 7, 9, 12, 15, 18\}$$, calculate the interquartile range (IQR).

Find the variance and standard deviation of the following dataset: $$\{2, 4, 6, 8, 10\}$$.

Calculate the mean deviation for the dataset: $$\{3, 5, 8, 10, 12\}$$.

The variance of a dataset is $$25$$. What is the standard deviation?

Calculate the range of the following dataset: $$\{1, 6, 9, 12, 15, 20\}$$.

Given the dataset $$\{1, 3, 3, 6, 8, 9\}$$, calculate the interquartile range (IQR).

A dataset has a standard deviation of $$3$$. What is the variance?

Calculate the mean deviation for the dataset: $$\{12, 15, 18, 21, 24\}$$.

Calculate the range, variance, and standard deviation for the following dataset: $$\{5, 8, 10, 12, 15\}$$.

Measures of dispersion are used to describe the spread or variability of data in a dataset. They help to understand how data is distributed around the mean (central tendency) of the dataset. In this lesson, we will discuss four common measures of dispersion: range, interquartile range (IQR), variance, and standard deviation.

1. Range

The range is the simplest measure of dispersion. It is calculated by finding the difference between the highest and lowest values in the dataset. Mathematically, the range can be expressed as:

Range = Maximum value - Minimum value

The range gives a quick idea of the spread of the data but does not provide information about how the data is distributed between the minimum and maximum values.

2. Interquartile Range (IQR)

The interquartile range (IQR) is a measure of dispersion that describes the range within which the central 50% of the data lies. It is calculated by finding the difference between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile).

IQR = Q3 - Q1

The IQR is less sensitive to extreme values (outliers) compared to the range and provides a better idea of the data spread around the central tendency.
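As a rough illustration, the IQR can be computed in Python as follows. Textbooks and software differ in how they define quartiles; this sketch assumes the "median of halves" convention (Q1 is the median of the lower half, Q3 the median of the upper half, with the middle value left out when the count is odd). The function name `iqr_median_of_halves` is hypothetical.

```python
from statistics import median

def iqr_median_of_halves(data):
    """IQR = Q3 - Q1, with quartiles taken as medians of the two halves."""
    values = sorted(data)
    if len(values) < 2:
        raise ValueError("need at least two data points")
    half = len(values) // 2
    lower = values[:half]    # lower half of the sorted data
    upper = values[-half:]   # upper half; skips the middle value if n is odd
    return median(upper) - median(lower)

print(iqr_median_of_halves([5, 7, 9, 12, 15, 18]))  # Q1 = 7, Q3 = 15, IQR = 8
```

Other conventions (for example, interpolated percentiles) can give a slightly different IQR for the same data, so it is worth stating which one is in use.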

3. Variance

Variance is a measure of dispersion that describes the average of the squared differences of each data point from the mean. It provides an idea of how much the data points deviate from the mean. The formula for the variance (denoted as $\sigma^2$ for population variance and $s^2$ for sample variance) is:

For population variance: $$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

For sample variance: $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

where $$x_i$$ represents each data point, $$\mu$$ is the population mean, $$\bar{x}$$ is the sample mean, N is the population size, and n is the sample size. Note that the sample variance uses (n-1) in the denominator as a correction factor, known as Bessel's correction, to provide an unbiased estimate of the population variance.

4. Standard Deviation

The standard deviation is the most widely used measure of dispersion. It is the square root of the variance and is denoted by $$\sigma$$ for the population standard deviation and s for the sample standard deviation. The standard deviation has the same unit as the data points, making it easier to interpret than variance.

For population standard deviation: $$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

For sample standard deviation: $$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

The standard deviation measures the typical distance of the data points from the mean. A larger standard deviation indicates greater dispersion in the data.
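The variance and standard deviation formulas above can be sketched in Python and compared with the standard-library `statistics` functions. The data set is taken from the exercises; the helper name `sum_of_squared_deviations` is an assumption for illustration.

```python
from math import isclose, sqrt
from statistics import mean, pstdev, stdev

def sum_of_squared_deviations(data):
    """Sum of (x_i - mean)^2, the numerator shared by both variance formulas."""
    m = mean(data)
    return sum((x - m) ** 2 for x in data)

data = [2, 4, 6, 8, 10]
ss = sum_of_squared_deviations(data)           # 40
population_variance = ss / len(data)           # divide by N      -> 8.0
sample_variance = ss / (len(data) - 1)         # divide by n - 1  -> 10.0

# The standard deviation is the square root of the matching variance.
assert isclose(sqrt(population_variance), pstdev(data))
assert isclose(sqrt(sample_variance), stdev(data))
print(population_variance, sample_variance)
```

Whether to divide by N or n - 1 depends on whether the data are treated as the whole population or as a sample from it.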

In summary, measures of dispersion are essential for understanding the variability of data in a dataset. The range and IQR provide a quick idea of data spread, while variance and standard deviation provide more detailed information about data distribution around the mean.

Probability

Given the following bivariate data set, calculate the covariance.

X: 2, 4, 6, 8, 10

Y: 5, 8, 11, 14, 17

Calculate the correlation coefficient of the example in Exercise 1.

Given the following bivariate data set, calculate the covariance.

X: 1, 2, 3, 4, 5

Y: 10, 8, 6, 4, 2

Calculate the correlation coefficient for the bivariate data in Exercise 3.

The correlation coefficient between the heights and weights of a group of students is 0.8. Does this indicate a strong or weak relationship between the two variables?

For the given bivariate data set, calculate the least squares regression line:

X: 1, 2, 3, 4, 5

Y: 2, 4, 6, 8, 10

For the given bivariate data set, calculate the least squares regression line:

X: 1, 3, 5, 7, 9

Y: 2, 6, 10, 14, 18

Calculate the coefficient of determination for the bivariate data in Exercise 6.

The heights (in inches) and weights (in pounds) of five people are recorded as follows:

Heights: 62, 65, 68, 70, 72

Weights: 120, 130, 150, 160, 170

Calculate the correlation coefficient between heights and weights.

Calculate the least squares regression line for the bivariate data in Exercise 9.

\section{Lesson: Probability}
Probability is a measure of the likelihood of a particular event occurring. It is a fundamental concept in statistics and is used to study the randomness and uncertainty of events. This lesson will cover the basics of probability, including the definition, probability rules, and probability distributions.
\subsection{1. Definition of Probability}
Probability is a numerical value between 0 and 1 that represents the likelihood of a specific event happening. An event with a probability of 0 means that it is impossible, while an event with a probability of 1 means that it is certain. The probability of an event A can be denoted as P(A).
When all outcomes are equally likely, the probability of an event can be calculated by dividing the number of favorable outcomes by the total number of possible outcomes:
\[
P(A) = \frac{\text{number of favorable outcomes}}{\text{total number of possible outcomes}}
\]
\subsection{2. Probability Rules}
There are three basic rules of probability that help to analyze events:
\subsubsection{Rule 1: The probability of an event is always between 0 and 1.}
\[
0 \leq P(A) \leq 1
\]
\subsubsection{Rule 2: The probability of the entire sample space (i.e., the set of all possible outcomes) is equal to 1.}
\[
P(S) = 1
\]
where S represents the sample space.
\subsubsection{Rule 3: The probability of the complement of an event is equal to 1 minus the probability of the event.}
\[
P(A^c) = 1 - P(A)
\]
where $A^c$ represents the complement of event A (i.e., the event that A does not occur).
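A small Python sketch can illustrate the counting definition and the three rules. It assumes a sample space of two fair six-sided dice (an example chosen here, not taken from the lesson) and uses exact fractions; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space S: all 36 equally likely outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) = favorable outcomes / total outcomes (equally likely case)."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

def sum_is_seven(outcome):
    return outcome[0] + outcome[1] == 7

p_a = prob(sum_is_seven)                          # 6/36 = 1/6
p_s = prob(lambda outcome: True)                  # whole sample space
p_not_a = prob(lambda o: not sum_is_seven(o))     # complement of A

assert 0 <= p_a <= 1         # Rule 1
assert p_s == 1              # Rule 2
assert p_not_a == 1 - p_a    # Rule 3
print(p_a, p_not_a)          # 1/6 5/6
```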
\subsection{3. Probability Distributions}
A probability distribution is a function that describes the probabilities of all possible outcomes in a sample space. There are two types of probability distributions: discrete and continuous.
\subsubsection{Discrete Probability Distributions}
Discrete probability distributions are used when the sample space consists of a finite or countably infinite number of discrete outcomes. The probability mass function (PMF) is used to describe the probabilities of the outcomes in a discrete probability distribution.
\[
P(X = x) = P(x)
\]
where X is a discrete random variable, and P(x) represents the probability that X takes on the value x.
\subsubsection{Continuous Probability Distributions}
Continuous probability distributions are used when the sample space consists of an uncountably infinite number of continuous outcomes. The probability density function (PDF) is used to describe the probabilities of the outcomes in a continuous probability distribution.
\[
f_X(x)
\]
where X is a continuous random variable, and $f_X(x)$ represents the probability density function of X.
For continuous probability distributions, the probability of a single value is always zero. To find the probability of an interval, we use the cumulative distribution function (CDF):
\[
P(a \leq X \leq b) = F_X(b) - F_X(a)
\]
where $F_X(x)$ is the cumulative distribution function of X.
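As an illustration of the interval formula, the following Python sketch uses `statistics.NormalDist` from the standard library. The choice of a standard normal distribution and the endpoints are assumptions made for the example.

```python
from statistics import NormalDist

def interval_probability(dist, a, b):
    """P(a <= X <= b) = F(b) - F(a) for a continuous distribution."""
    return dist.cdf(b) - dist.cdf(a)

standard_normal = NormalDist()  # mean 0, standard deviation 1
p = interval_probability(standard_normal, -1.96, 1.96)
print(round(p, 3))  # about 0.95

# A single point has probability zero: F(a) - F(a) = 0.
print(interval_probability(standard_normal, 1.0, 1.0))  # 0.0
```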
\subsection{4. Conditional Probability}
Conditional probability is the probability of an event occurring given that another event has already occurred. It is denoted as $P(A | B)$, which represents the probability of event A occurring given that event B has occurred.
The formula for conditional probability is:
\[
P(A | B) = \frac{P(A \cap B)}{P(B)}
\]
where $P(A \cap B)$ is the probability of both events A and B occurring together, and $P(B)$ is the probability of event B occurring.
\subsection{5. Independent and Dependent Events}
Two events are said to be independent if the occurrence of one event does not affect the probability of the occurrence of the other event. If events A and B are independent, then:
\[
P(A \cap B) = P(A)P(B)
\]
If the above equation does not hold, events A and B are dependent.
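The conditional probability formula and the independence check can be tried out on a small example. This Python sketch assumes two fair six-sided dice; the events and helper names are illustrative choices, not part of the lesson.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # two fair dice

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def conditional(a, b):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) = 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(lambda o: a(o) and b(o)) / p_b

def independent(a, b):
    """True when P(A and B) equals P(A) * P(B)."""
    return prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)

def first_is_six(o):
    return o[0] == 6

def sum_is_seven(o):
    return sum(o) == 7

def sum_is_eight(o):
    return sum(o) == 8

print(conditional(first_is_six, sum_is_seven))  # 1/6, same as P(first is six)
print(independent(first_is_six, sum_is_seven))  # True
print(independent(first_is_six, sum_is_eight))  # False
```

Knowing the sum is 7 tells us nothing about the first die, while knowing the sum is 8 rules out a first roll of 1, which is why only the second pair of events is dependent.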
\subsection{6. Bayes' Theorem}
Bayes' theorem is a fundamental concept in probability theory that relates the conditional probabilities of two events. It is used to update the probability of an event based on new evidence or information.
Bayes' theorem is given by:
\[
P(A | B) = \frac{P(B | A)P(A)}{P(B)}
\]
where $P(A | B)$ is the posterior probability of event A given that event B has occurred, $P(B | A)$ is the likelihood of event B given that event A has occurred, $P(A)$ is the prior probability of event A, and $P(B)$ is the marginal probability of event B.
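A minimal Python sketch of Bayes' theorem follows. The numbers (a 1% prior, a 95% likelihood, and a 5% false-positive rate) are hypothetical, and the sketch assumes the marginal probability $P(B)$ is obtained from the law of total probability, $P(B) = P(B | A)P(A) + P(B | A^c)P(A^c)$.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) given P(A), P(B | A), and P(B | not A)."""
    # Marginal P(B) via the law of total probability.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical screening-test numbers, for illustration only.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays well below the likelihood because the prior is small; this is the kind of update the theorem formalizes.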
\section{Summary}
In this lesson, we have covered the basic concepts of probability, including the definition, probability rules, probability distributions, conditional probability, independent and dependent events, and Bayes' theorem. Understanding these concepts is essential for the study of statistics and dealing with uncertainty and randomness in real-world problems.
\subsection{7. Permutations and Combinations}
Permutations and combinations are essential concepts in probability that deal with the arrangement of objects or the selection of objects from a set.
\subsubsection{Permutations}
A permutation is an arrangement of objects in a specific order. The number of permutations of n objects taken r at a time is denoted by $P(n, r)$ or $_nP_r$ and can be calculated using the formula:
\[
P(n, r) = \frac{n!}{(n-r)!}
\]
where n! (n factorial) is the product of all positive integers up to n.
\subsubsection{Combinations}
A combination is a selection of objects without considering the order. The number of combinations of n objects taken r at a time is denoted by $C(n, r)$ or $_nC_r$ and can be calculated using the formula:
\[
C(n, r) = \frac{n!}{r!(n-r)!}
\]
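Both formulas can be checked in Python. The sketch below implements them directly with factorials and compares the results with `math.perm` and `math.comb` (available in Python 3.8 and later); the function names `permutations_count` and `combinations_count` are chosen here for illustration.

```python
from math import comb, factorial, perm

def permutations_count(n, r):
    """P(n, r) = n! / (n - r)!"""
    return factorial(n) // factorial(n - r)

def combinations_count(n, r):
    """C(n, r) = n! / (r! (n - r)!)"""
    return factorial(n) // (factorial(r) * factorial(n - r))

n, r = 5, 2
print(permutations_count(n, r), perm(n, r))  # 20 20
print(combinations_count(n, r), comb(n, r))  # 10 10

# Each combination of r objects can be ordered in r! ways.
assert permutations_count(n, r) == combinations_count(n, r) * factorial(r)
```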
\subsection{8. The Law of Large Numbers}
The Law of Large Numbers is a fundamental theorem in probability theory that states that the average of a large number of independent and identically distributed random variables converges to the expected value as the number of variables goes to infinity. In simple terms, as the sample size increases, the sample mean approaches the population mean.
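A quick simulation gives a feel for this behaviour. The Python sketch below assumes repeated rolls of a fair six-sided die (expected value 3.5) and a fixed random seed for reproducibility; both are choices made for this example.

```python
import random

def sample_mean_of_die_rolls(n_rolls, seed=0):
    """Average of n_rolls simulated fair die rolls."""
    rng = random.Random(seed)  # seeded generator, so runs are repeatable
    total = sum(rng.randint(1, 6) for _ in range(n_rolls))
    return total / n_rolls

# The sample mean tends to settle near the expected value 3.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, sample_mean_of_die_rolls(n))
```

The exact values printed depend on the seed, but the deviation from 3.5 should generally shrink as the number of rolls increases.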
\subsection{9. The Central Limit Theorem}
The Central Limit Theorem is another crucial theorem in probability theory. It states that the distribution of the suitably standardized sum (or average) of a large number of independent, identically distributed random variables with finite variance approaches a normal distribution, regardless of the shape of the original distribution. This theorem is the foundation for many statistical methods, such as hypothesis testing and confidence intervals.
\section{Conclusion}
This lesson provided an overview of essential probability concepts, including probability rules, probability distributions, conditional probability, independent and dependent events, Bayes' theorem, permutations, combinations, the Law of Large Numbers, and the Central Limit Theorem. These concepts form the foundation of statistical analysis and are crucial for understanding and interpreting data in various fields, such as science, engineering, economics, and social sciences.

Descriptive Statistics - Bivariate Data

A random sample of 100 students' test scores has a mean of 78 and a standard deviation of 12. Estimate the population mean with a 95% confidence interval.

A random sample of 225 car owners is asked about their monthly fuel consumption. The sample has a mean of 110 liters and a standard deviation of 15 liters. Estimate the population mean with a 99% confidence interval.

In a survey, it is found that 175 out of 500 people prefer brand A over brand B. Estimate the population proportion with a 90% confidence interval.

A random sample of 36 light bulbs has a mean lifetime of 1200 hours and a standard deviation of 200 hours. Estimate the population mean with a 95% confidence interval.

A random sample of 49 apples has a mean weight of 150 grams and a standard deviation of 20 grams. Estimate the population mean with a 99% confidence interval.

In a survey of 800 people, 520 say they prefer coffee over tea. Estimate the population proportion with a 90% confidence interval.

A random sample of 81 oranges has a mean diameter of 7.5 cm and a standard deviation of 1.5 cm. Estimate the population mean with a 95% confidence interval.

A random sample of 121 smartphones has a mean battery life of 16 hours and a standard deviation of 4 hours. Estimate the population mean with a 99% confidence interval.

In a survey of 1000 people, 250 say they prefer online shopping over in-store shopping. Estimate the population proportion with a 95% confidence interval.

A random sample of 64 laptops has a mean price of $950 and a standard deviation of $150. Estimate the population mean with a 95% confidence interval.

\section{Lesson: Descriptive Statistics with Bivariate Data}
In this lesson, we will explore descriptive statistics for bivariate data. Bivariate data consists of pairs of observations (x, y), where each pair represents values for two related variables. Descriptive statistics for bivariate data help us understand the relationship between these two variables.
\subsection{1. Scatter Plots}
A scatter plot is a graphical representation of bivariate data that helps visualize the relationship between two variables. Each point on the scatter plot represents a pair of observations (x, y).
\subsection{2. Covariance}
Covariance is a measure of the joint variability of two variables. It indicates the direction of the linear relationship between the variables. The covariance between variables X and Y can be calculated using the formula:
\[
cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}
\]
where $x_i$ and $y_i$ are individual observations, $\bar{x}$ and $\bar{y}$ are the respective means of variables X and Y, and n is the number of observations.
\subsection{3. Correlation Coefficient}
The correlation coefficient, also known as Pearson's correlation coefficient, is a measure of the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship. The correlation coefficient between variables X and Y can be calculated using the formula:
\[
r_{XY} = \frac{cov(X, Y)}{s_X s_Y}
\]
where $r_{XY}$ is the correlation coefficient between variables X and Y, $cov(X, Y)$ is the covariance between X and Y, and $s_X$ and $s_Y$ are the standard deviations of variables X and Y, respectively.
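The covariance and correlation formulas can be sketched in Python as follows. The data are taken from one of the bivariate exercises (X: 1 to 5, Y: 10 down to 2); the function names are illustrative, and the sample versions (dividing by n - 1) are used throughout.

```python
from math import isclose
from statistics import mean, stdev

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson's r = cov(X, Y) / (s_X * s_Y)."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

xs = [1, 2, 3, 4, 5]
ys = [10, 8, 6, 4, 2]
print(covariance(xs, ys))             # -5.0
print(round(correlation(xs, ys), 6))  # -1.0 (points lie exactly on a falling line)
```

The sign of the covariance gives the direction of the relationship, while dividing by the two standard deviations rescales it to the range from -1 to 1.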
\subsection{4. Least-Squares Regression Line}
The least-squares regression line, also known as the line of best fit, is a straight line that best represents the relationship between two variables in a scatter plot. It minimizes the sum of the squared differences between the observed values and the values predicted by the line.
The equation of the least-squares regression line is:
\[
\hat{y} = a + bx
\]
where $\hat{y}$ is the predicted value of the dependent variable Y for a given value of the independent variable x, a is the y-intercept, and b is the slope of the line.
The slope b and the y-intercept a can be calculated using the formulas:
\[
b = \frac{cov(X, Y)}{s_X^2}
\]
and
\[
a = \bar{y} - b \bar{x}
\]
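The slope and intercept formulas can be applied in a few lines of Python. This sketch uses data from one of the bivariate exercises; the function name `least_squares_line` is an assumption. Because both $cov(X, Y)$ and $s_X^2$ divide by $n - 1$, that factor cancels in the slope and is omitted in the code.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (a, b) for the fitted line y_hat = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b = sxy / sxx            # equals cov(X, Y) / s_X^2
    a = y_bar - b * x_bar    # line passes through (x_bar, y_bar)
    return a, b

a, b = least_squares_line([2, 4, 6, 8, 10], [5, 8, 11, 14, 17])
print(a, b)        # 2.0 1.5
print(a + b * 12)  # predicted y at x = 12: 20.0
```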
\section{Summary}
In this lesson, we have covered the basics of descriptive statistics with bivariate data, including scatter plots, covariance, correlation coefficient, and the least-squares regression line. These concepts help us understand the relationship between two variables and make predictions based on the observed data.

Estimation

A researcher claims that the average time high school students spend on homework is more than 2 hours per day. You take a sample of 36 students and find a mean of 2.5 hours with a standard deviation of 0.8 hours. Test this claim at a 0.05 significance level.

A car manufacturer claims that their new electric car has a mean range of 300 miles on a full charge. You take a random sample of 25 cars and find a mean range of 290 miles with a standard deviation of 20 miles. Test this claim at a 0.01 significance level.

A researcher claims that the average height of male students in a high school is different from 5'8" (68 inches). You take a random sample of 100 male students and find a mean height of 5'7" (67 inches) with a standard deviation of 3 inches. Test this claim at a 0.05 significance level.

A factory claims that the mean weight of their chocolate bars is at least 3.5 ounces. A sample of 16 chocolate bars is taken, and the mean weight is found to be 3.2 ounces with a standard deviation of 0.4 ounces. Test this claim at a 0.1 significance level.

A school district claims that the proportion of students who pass the standardized math exam is 0.8. In a sample of 200 students, 150 pass the exam. Test this claim at a 0.05 significance level.

A company claims that at least 90% of their customers are satisfied with their products. In a sample of 100 customers, 85 report being satisfied. Test this claim at a 0.1 significance level.

A company claims that their new lightbulbs last an average of 10,000 hours. A sample of 9 lightbulbs is tested, and the mean lifetime is found to be 9,600 hours with a standard deviation of 800 hours. Test this claim at a 0.05 significance level.

A restaurant claims that their average waiting time for a table is less than 15 minutes. A random sample of 20 waiting times is taken, and the mean waiting time is found to be 13 minutes with a standard deviation of 4 minutes. Test this claim at a 0.1 significance level.

A manufacturer claims that their batteries have a mean lifetime of at least 50 hours. A sample of 15 batteries is tested, and the mean lifetime is found to be 48 hours with a standard deviation of 6 hours. Test this claim at a 0.05 significance level.

A school district claims that less than 10% of their students drop out before graduating. In a sample of 200 students, 25 drop out before graduating. Test this claim at a 0.05 significance level.

\section{Lesson: Estimation}
In this lesson, we will explore the concept of estimation in statistics. Estimation is the process of approximating a population parameter using sample data. The two main types of estimation are point estimation and interval estimation.
\subsection{1. Point Estimation}
Point estimation is the process of using a single value, calculated from the sample data, to estimate an unknown population parameter. The most common point estimators are the sample mean, sample proportion, and sample variance.
\subsubsection{Sample Mean}
The sample mean $\bar{x}$ is used to estimate the population mean $\mu$. It is calculated as:
\[
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
\]
where $x_i$ are the individual observations and n is the number of observations.
\subsubsection{Sample Proportion}
The sample proportion $\hat{p}$ is used to estimate the population proportion p. It is calculated as:
\[
\hat{p} = \frac{x}{n}
\]
where x is the number of successes in the sample and n is the number of observations.
\subsubsection{Sample Variance}
The sample variance $s^2$ is used to estimate the population variance $\sigma^2$. It is calculated as:
\[
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
\]
\subsection{2. Interval Estimation}
Interval estimation is the process of using a range of values, calculated from the sample data, to estimate an unknown population parameter. The most common type of interval estimation is the confidence interval.
\subsubsection{Confidence Interval}
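A confidence interval is a range of values, computed from the sample, that is constructed to contain the population parameter with a stated level of confidence: if the sampling were repeated many times, that proportion of the resulting intervals would contain the parameter. The confidence level is usually expressed as a percentage (e.g., 95\% confidence level).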
For the population mean, the confidence interval can be calculated as:
\[
\bar{x} \pm z \frac{\sigma}{\sqrt{n}}
\]
where $\bar{x}$ is the sample mean, z is the z-score corresponding to the desired level of confidence, $\sigma$ is the population standard deviation, and n is the number of observations. If the population standard deviation is unknown, the sample standard deviation s can be used instead, and a t-score from the t-distribution is used in place of the z-score.
For the population proportion, the confidence interval can be calculated as:
\[
\hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\]
where $\hat{p}$ is the sample proportion, z is the z-score corresponding to the desired level of confidence, and n is the number of observations.
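Both interval formulas can be sketched in Python using `statistics.NormalDist` for the z-score. The inputs are taken from the confidence-interval exercises; the function names are hypothetical. Note the assumption in the mean interval: the sample standard deviation is plugged in with a z-score, which is a large-sample approximation; for small samples a t-score would be used instead, as described above.

```python
from math import sqrt
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided z-score, e.g. about 1.96 for confidence = 0.95."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def mean_ci(x_bar, sd, n, confidence):
    """x_bar +/- z * sd / sqrt(n) (z-based, large-sample approximation)."""
    margin = z_critical(confidence) * sd / sqrt(n)
    return x_bar - margin, x_bar + margin

def proportion_ci(successes, n, confidence):
    """p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = z_critical(confidence) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 100 test scores, mean 78, standard deviation 12, 95% confidence.
print(mean_ci(78, 12, 100, 0.95))      # roughly (75.65, 80.35)
# 175 of 500 people prefer brand A, 90% confidence.
print(proportion_ci(175, 500, 0.90))   # roughly (0.315, 0.385)
```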
\section{Summary}
In this lesson, we have covered the basics of estimation in statistics, including point estimation, interval estimation, and confidence intervals. Estimation is a fundamental concept in statistics, as it allows us to use sample data to make inferences about unknown population parameters.

Hypothesis Testing

\section{Lesson: Hypothesis Testing}
In this lesson, we will explore the concept of hypothesis testing in statistics. Hypothesis testing is a systematic procedure for deciding whether the results of a study support a particular theory or practical innovation.
\subsection{1. Null and Alternative Hypotheses}
In hypothesis testing, we start by stating two opposing hypotheses:
\textbf{Null Hypothesis ($H_0$):} This hypothesis states that there is no significant difference or effect. It is the hypothesis we aim to test.\\
\textbf{Alternative Hypothesis ($H_a$ or $H_1$):} This hypothesis states the opposite of the null hypothesis and represents the effect or difference we expect to observe.
\subsection{2. Test Statistic}
A test statistic is a numerical value calculated from the sample data that is used to determine if the null hypothesis should be rejected. The choice of the test statistic depends on the type of data and the assumed probability distribution.
\subsubsection{Z-test}
A Z-test is used when the population variance is known, and the sample size is large (usually n > 30). The Z-test statistic is calculated as:
\[
Z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}}
\]
where $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized population mean, $\sigma$ is the population standard deviation, and n is the number of observations.
\subsubsection{T-test}
A T-test is used when the population variance is unknown, and the sample size is small (usually $n \leq 30$). The T-test statistic is calculated as:
\[
t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}
\]
where $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized population mean, s is the sample standard deviation, and n is the number of observations.
\subsection{3. P-value and Significance Level}
The P-value is the probability of obtaining a test statistic as extreme as or more extreme than the one calculated from the sample data, assuming the null hypothesis is true. The P-value is compared to a pre-determined significance level ($\alpha$) to decide whether to reject or fail to reject the null hypothesis.
\textbf{If the P-value $\leq \alpha$:} Reject the null hypothesis; the result is statistically significant at level $\alpha$.\\
\textbf{If the P-value $> \alpha$:} Fail to reject the null hypothesis; the data do not provide sufficient evidence against it.
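To make the procedure concrete, here is a minimal Python sketch of an upper-tailed Z-test: it computes the test statistic, the P-value, and the decision (reject $H_0$ when the P-value is at most $\alpha$). The inputs come from the homework-time exercise (n = 36, sample mean 2.5, standard deviation 0.8, hypothesized mean 2). Using the sample standard deviation in place of $\sigma$ is an assumption that is reasonable only for large samples, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_upper_tail(x_bar, mu0, sd, n, alpha):
    """Test H0: mu = mu0 against Ha: mu > mu0 using a z statistic."""
    z = (x_bar - mu0) / (sd / sqrt(n))
    p_value = 1 - NormalDist().cdf(z)  # area to the right of z
    reject_h0 = p_value <= alpha
    return z, p_value, reject_h0

z, p_value, reject_h0 = z_test_upper_tail(x_bar=2.5, mu0=2.0, sd=0.8, n=36, alpha=0.05)
print(round(z, 2), p_value, reject_h0)  # z is about 3.75; P-value far below 0.05
```

For a two-sided alternative the P-value would instead be twice the tail area beyond $|z|$, and for a lower-tailed test it would be the area to the left of z.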
\subsection{4. Types of Errors}
In hypothesis testing, there are two types of errors:
\textbf{Type I Error:} Rejecting the null hypothesis when it is actually true (a false positive). Its probability is the significance level $\alpha$.\\
\textbf{Type II Error:} Failing to reject the null hypothesis when it is actually false (a false negative). Its probability is denoted $\beta$.
\section{Summary}
In this lesson, we have covered the basics of hypothesis testing, including null and alternative hypotheses, test statistics, P-values, significance levels, and types of errors. Hypothesis testing is a fundamental concept in statistics, as it allows us to make inferences and decisions based on sample data.