## What is Statistics?

Previously, in our Introduction to Probability article, we saw that probability starts from a known model and works out how it behaves. Can we work our way backwards? Given some behavior (or dataset), how can we characterize the model it came from?

### Inferential vs. Descriptive Statistics

Statistics aims to help figure out the original model that some sample of observational data came from. This method of characterizing or making decisions about a population by using information from a sample drawn from that population is known as Inferential Statistics.

Descriptive Statistics, on the other hand, aims to simply describe the data we have at hand. This is used when the cases you are studying represent the entire population, and you do not need to generalize beyond them. Take, for example, the midterm results for a class. The teacher and students would be interested in the average, median, and range of test scores, and would not necessarily need to extrapolate the results to characterize some larger population.

### Statistic vs. Parameter

A statistic is a number or metric that describes a sample, such as \(\bar{x}\), which denotes the mean of a sample. A parameter, on the other hand, is a metric that describes a population, such as \( \mu \), which denotes the population mean.

### Population vs. Samples

The observational data we have to work with is known as our sample. We call it a sample because it's a sampling of some population that we're trying to characterize.

A brief example of this would be a survey run by the US Census Bureau. Instead of asking every single US resident how many children they have, the bureau can mail a survey to only about 1% of the population and extrapolate the findings to estimate the average number of children across the country.

## Sampling Methods

Sampling is the process of selecting, manipulating, and analyzing a representative subset of data points to identify patterns and trends in the larger dataset. There are a variety of sampling methods, so be sure to understand the dataset well enough to select the correct one!

- Simple Random Sampling: Randomly picking samples from the whole population.
- Stratified Sampling: The population is divided into subgroups (strata) that share some common factor, then a sample is taken from each subgroup.
- Cluster Sampling: The larger dataset is divided into subsets (clusters) based on some factor, then a random selection of whole clusters is analyzed.
- Multistage Sampling: The larger population is divided into a number of clusters, then the process is repeated within each selected cluster until a final sample is taken.
- Systematic Sampling: The sample is created by setting an interval at which to extract data from the larger population (e.g., every tenth row from a dataset).
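
Several of the methods above can be sketched with Python's standard library. This is only an illustration under assumed data: the population of 100 people, the `region` field, and the 10% sampling rate are all hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 people, each tagged with a region.
population = [{"id": i, "region": "north" if i < 60 else "south"} for i in range(100)]

# Simple random sampling: every member has the same chance of selection.
simple = rng.sample(population, k=10)

# Stratified sampling: group by region, then sample 10% from each group.
strata = {}
for person in population:
    strata.setdefault(person["region"], []).append(person)
stratified = []
for group in strata.values():
    stratified.extend(rng.sample(group, k=len(group) // 10))

# Cluster sampling: split into 10 clusters, then keep 2 whole clusters at random.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [p for c in rng.sample(clusters, k=2) for p in c]

# Systematic sampling: take every tenth member after a random start.
start = rng.randrange(10)
systematic = population[start::10]

print(len(simple), len(stratified), len(cluster_sample), len(systematic))
```

Note that the stratified sample keeps the 60/40 regional split of the population, which a simple random sample only matches on average.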

## Measures of Central Tendencies

Let's now go over the most common metrics used in statistics to measure central tendencies.

### Arithmetic Mean

The arithmetic mean (or average) is simply the sum of all the values divided by the number of values.

$$ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i $$

The mean of a population is denoted as \( \mu \), while the sample mean is denoted by \( \bar{x} \).

If this is your first time seeing some of these Greek symbols, don't worry! The \( \Sigma \) symbol simply means add all the elements \( x_i \) from \( i=1 \) up to \( n \).

### Geometric Mean

The geometric mean is similar to the arithmetic mean, but instead of adding, you multiply, and instead of dividing by the number of values, you take the nth root of the product. The geometric mean can be useful for a set of numbers whose values are meant to be multiplied or are exponential in nature, such as growth rates.

$$ \sqrt[n]{\prod_{i=1}^{n} x_i} $$

The \( \prod \) symbol is the capital Greek letter Pi. It means to multiply all elements together instead of adding them as with \( \sum \).
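
As a small sketch, the formula can be checked against the standard library. The growth factors here are hypothetical, and `math.prod` and `statistics.geometric_mean` assume Python 3.8 or later.

```python
import math
import statistics

# Hypothetical yearly growth factors: +10%, +50%, -20%.
growth = [1.10, 1.50, 0.80]

# Geometric mean by the formula: nth root of the product.
by_formula = math.prod(growth) ** (1 / len(growth))

# The standard library offers the same computation.
by_library = statistics.geometric_mean(growth)

print(round(by_formula, 4), round(by_library, 4))
```

Growing by the geometric mean every year gives the same overall growth as the three original factors applied in sequence, which is why it suits multiplicative data.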

### Median

The arithmetic mean is not always a good representation of your sample. Take, for example, the set \( \{1,2,3,4,5,2000\} \). The average of these numbers is \(335.83\), which is far above five of the six values in the set. Because of this, analysts often prefer the median as the measure of central tendency for datasets with extreme values.

The median is the middle value of your list once it is ranked in ascending or descending order. If there is an even number of values, take the average of the middle two. The median is a better measure of central tendency when you have extreme values in your dataset, as in the case above, where it is \(3.5\).
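
A quick check of the example set, using Python's `statistics` module, shows how far apart the two measures land:

```python
import statistics

values = [1, 2, 3, 4, 5, 2000]
mean = statistics.mean(values)      # pulled upward by the outlier 2000
median = statistics.median(values)  # average of the two middle values, 3 and 4
print(round(mean, 2), median)  # 335.83 3.5
```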

### Mode

The mode is simply the most commonly occurring value in your dataset. The mode is commonly used with ordinal or categorical data, where you care about which value gets the most votes. For example, if you took a survey of favorite flavors of ice cream, where 1 = chocolate chip, 2 = cookies and cream, 3 = vanilla, the mode of your dataset would give you your most popular flavor.
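
The ice cream survey might look like this in code. The list of responses is made up for illustration; only the flavor coding comes from the example above.

```python
from collections import Counter
import statistics

# Flavor coding from the text; the responses themselves are hypothetical.
flavors = {1: "chocolate chip", 2: "cookies and cream", 3: "vanilla"}
responses = [1, 3, 2, 1, 1, 3, 2, 1]

mode_code = statistics.mode(responses)
counts = Counter(responses)
print(flavors[mode_code], counts[mode_code])  # chocolate chip 4
```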

## Measures of Dispersion

Now that we know the common metrics for central tendencies, let's talk about dispersion. Dispersion refers to how "spread out" your data values are. Measuring it is important because the implications can differ depending on how tight or spread out your data is.

For example, compare a population of people with BMIs between 20 and 25 to another population with BMIs between 15 and 35. Although both populations may have the same mean, median, and mode for BMI, the dispersion is much greater in the second population, so its societal and nutritional needs are different.

### Range

The range of a dataset is simply the maximum value minus the minimum. Although simple to calculate, the range has its downsides: it depends only on the two most extreme values, so a single outlier can distort it, and it says nothing about how the values in between are distributed.

Range can be useful in measuring the differences of highs and lows within some dataset, like a classroom's test scores.

### Interquartile Range

The interquartile range looks at the spread of the middle 50% of your dataset. In other words, it is the 75th percentile value minus the 25th percentile value. Take the following dataset as an example:

$$ \{1,2,2,3,4,5,6,7,7\} $$

To obtain the interquartile range:

- Rank your dataset from smallest to largest. Here \( n=9 \). Typically in data science, you'll see that \( n \) signifies the number of elements in your dataset.
- Calculate the rank of the 25th percentile with \( \frac{nk}{100} \), where \( k=25 \): \( \frac{9 \times 25}{100} = 2.25 \).
- If the rank is not an integer, round it down to the nearest integer. Thus, the 25th percentile rank is \( 2 \), and the value at that rank is \( 2 \).
- Calculate the rank of the 75th percentile with \( k=75 \): \( \frac{9 \times 75}{100} = 6.75 \).
- Round down again, so the 75th percentile rank is \( 6 \), and the value at that rank is \( 5 \).
- Take the difference between the two values: \( 5-2 = 3 \).
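
The steps above can be sketched as a small function. The helper name `percentile_rank_value` is hypothetical, and it implements only the round-down rank convention described in this article. Statistical libraries often interpolate between ranks instead, so their quartiles may differ slightly.

```python
import math

def percentile_rank_value(sorted_values, k):
    """Value at rank floor(n * k / 100), using 1-based ranks as in the steps above."""
    rank = math.floor(len(sorted_values) * k / 100)
    rank = max(rank, 1)  # guard: ranks start at 1
    return sorted_values[rank - 1]

data = sorted([1, 2, 2, 3, 4, 5, 6, 7, 7])
q1 = percentile_rank_value(data, 25)  # rank floor(2.25) = 2 -> value 2
q3 = percentile_rank_value(data, 75)  # rank floor(6.75) = 6 -> value 5
iqr = q3 - q1
print(iqr)  # 3
```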

### Variance

The variance of a population is denoted as \( \sigma^2 \) or \( Var(x) \).

$$ \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2 $$

Notice that the difference between each value and the mean is squared. If these differences weren't squared, their sum would be exactly 0, since the values above the mean and those below it cancel out. Squaring makes every term non-negative, which allows us to capture how much dispersion there is around the mean.
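
A minimal sketch of both points, using a hypothetical population of eight values: the raw deviations sum to zero, while the squared deviations give the variance.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical population
mu = sum(data) / len(data)

# Raw deviations from the mean cancel out.
deviations = [x - mu for x in data]
deviation_sum = sum(deviations)

# Squared deviations do not cancel, so their average measures spread.
variance = sum(d ** 2 for d in deviations) / len(data)

print(deviation_sum, variance)  # 0.0 4.0
print(statistics.pvariance(data))  # same value from the standard library
```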

### Coefficient of Variation

Consider you have two datasets representing the thickness of a collection of books: one in centimeters, the other in millimeters.

$$ \{ 2.3\text{cm}, 4.3\text{cm}, 4.3\text{cm}, 5.3\text{cm}, 6.3\text{cm} \} $$

$$ s = 1.33 $$

$$ \{ 23\text{mm}, 43\text{mm}, 43\text{mm}, 53\text{mm}, 63\text{mm} \} $$

$$ s = 13.3 $$

The two datasets describe the same books, but you'll notice their standard deviations differ by a factor of 10: the first set in centimeters has \( s = 1.33 \) while the second set has \( s = 13.3 \). (These figures divide by \( n \); dividing by \( n-1 \), as in the sample formula below, gives about 1.48 and 14.8, with the same tenfold gap.) Is there a way we can measure dispersion and compare without having to consider units?

To do so, we can use the Coefficient of Variation (CV), which is as follows:

$$ CV = \frac{s}{\bar{x}} \times 100 $$

Because the standard deviation is divided by the mean of the dataset, the units cancel and CV is unitless, so it can be compared across datasets measured in different units.
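
Here is a rough sketch of the book example. The function name is hypothetical, and the millimeter list assumes the second dataset is the first one multiplied by 10. It uses `statistics.pstdev`, which divides by \( n \), because that reproduces the 1.33 figure quoted above. Swapping in `statistics.stdev` changes the CV value but not the conclusion.

```python
import statistics

def coefficient_of_variation(values):
    """CV as a percentage: standard deviation divided by the mean, times 100."""
    # pstdev divides by n, matching the standard deviations quoted in the text.
    return statistics.pstdev(values) / statistics.mean(values) * 100

thickness_cm = [2.3, 4.3, 4.3, 5.3, 6.3]
thickness_mm = [23, 43, 43, 53, 63]

cv_cm = coefficient_of_variation(thickness_cm)
cv_mm = coefficient_of_variation(thickness_mm)

print(round(statistics.pstdev(thickness_cm), 2))  # 1.33
print(round(cv_cm, 1), round(cv_mm, 1))           # 29.5 29.5
```

The standard deviations differ tenfold, but the two CV values agree because the unit cancels in the ratio.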

#### Sample Variance and Standard Deviation

The sample variance is \( s^2 \) while the standard deviation is \( s \).

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 $$

Notice that we're dividing by \( n-1 \) for the sample variance, which is different from the \( n \) we divided by when calculating the population variance. The reason is as follows. The sample mean \( \bar{x} \) is computed from the same data, so the values sit slightly closer to \( \bar{x} \) than to the true population mean \( \mu \). Dividing by \( n \) would therefore tend to underestimate the population variance, and dividing by \( n-1 \) corrects for this. Having said this, the greater \( n \) is, the more negligible the \( -1 \) becomes.
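
To make the difference concrete, here is a short sketch that computes both versions on the book thicknesses from the previous section, treating them as a sample:

```python
import statistics

sample = [2.3, 4.3, 4.3, 5.3, 6.3]
n = len(sample)
x_bar = sum(sample) / n
squared_deviations = sum((x - x_bar) ** 2 for x in sample)

divide_by_n = squared_deviations / n               # population-style formula
divide_by_n_minus_1 = squared_deviations / (n - 1)  # sample formula

print(round(divide_by_n, 2), round(divide_by_n_minus_1, 2))  # 1.76 2.2
print(round(statistics.variance(sample), 2))                 # 2.2
```

`statistics.variance` uses the \( n-1 \) divisor, while `statistics.pvariance` uses \( n \).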

Now that we have the metrics to describe samples and populations, let's look at distributions.

## Normal Distribution

In statistics, you'll come across a variety of graphs displaying distributions. Some of these distributions are used to describe the population, others are used to show model behavior. Let's look at statistics' most common distribution, the Normal Distribution.

The normal distribution (also known as the Gaussian distribution, in honor of the mathematician and physicist Carl Friedrich Gauss) is one of the most commonly used distributions in statistics. The distribution has a bell-shaped curve, and can take on many different forms, depending on the mean \(\mu\) and standard deviation \(\sigma\).

The figure below shows the spread of pizza delivery times across three major cities. The distributions all have a mean of 0, but their standard deviations vary.

The Normal Distribution is so common because many everyday measurements, such as heights or measurement errors, roughly follow this pattern.

The characteristics of a normal distribution include:

- Symmetry
- Unimodality
- Continuous range from \(-\infty\) to \(+\infty\).
- Total area under the curve of 1.

For any normal distribution, the following rule of thumb holds:

- About 68% of the data will fall within one standard deviation (\(\sigma\)) of the mean (\(\mu\)).
- About 95% of the data will fall within two standard deviations (\(\sigma\)) of the mean (\(\mu\)).
- About 99.7% of the data will fall within three standard deviations (\(\sigma\)) of the mean (\(\mu\)).
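
These percentages can be checked numerically. The sketch below uses `statistics.NormalDist` (Python 3.8 or later) and its cumulative distribution function to measure the share of a standard normal distribution within one, two, and three standard deviations of the mean.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability mass between mu - k*sigma and mu + k*sigma for k = 1, 2, 3.
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(k, round(share * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```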

### Standard Normal Distribution aka Z Distribution

A Standard Normal Distribution is a normal distribution with \(\mu=0\) and \(\sigma=1\).

You can convert any normal distribution to the standard normal distribution by converting all values to z-scores:

$$ Z = \frac{x-\mu}{\sigma} $$
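
As a small illustration, the conversion is a one-line function. The exam scores, mean, and standard deviation below are hypothetical.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical exam: mean 70, standard deviation 10.
print(z_score(85, mu=70, sigma=10))  # 1.5
print(z_score(60, mu=70, sigma=10))  # -1.0
```

A z-score of 1.5 means the value sits one and a half standard deviations above the mean, regardless of the original units.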

## Central Limit Theorem

The Central Limit Theorem states that the sampling distribution of the sample mean approximates a normal distribution, regardless of the distribution of the population from which the samples are drawn, as long as the sample size is sufficiently large and the population has a finite variance. The Central Limit Theorem can be summed up with:

$$ \bar{X} \dot\sim N(\mu, \frac{\sigma^2}{n}) $$

The \(\dot\sim\) reads as "approximately distributed." This statement reads, "the sample mean \(\bar{X}\) is approximately normally distributed with mean \(\mu\) and variance \(\sigma^2/n\)," where \(n\) is the sample size.

In other words, if we repeatedly drew samples of a given size, calculated the mean of each sample, and plotted the distribution of those means, the result would be approximately normal, no matter the distribution from which the samples originated.
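
A simulation makes this tangible. The sketch below is one possible setup, not the only one: it assumes an exponential population with rate 1 (so \(\mu = 1\) and \(\sigma^2 = 1\)), a sample size of 50, and 5000 repetitions, all chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed for repeatability
n = 50                   # size of each sample
repetitions = 5000       # number of samples drawn

# Exponential population with rate 1: mean 1, variance 1, strongly skewed.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(repetitions)
]

mean_of_means = statistics.fmean(sample_means)
variance_of_means = statistics.variance(sample_means)

print(mean_of_means)       # close to mu = 1
print(variance_of_means)   # close to sigma^2 / n = 0.02
```

Even though the underlying population is far from bell-shaped, a histogram of `sample_means` would look roughly normal and centered on 1.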

## Hypothesis Testing

Hypothesis Testing is the formal method of testing whether some observations are statistically significant.

The null hypothesis, denoted as \( H_0 \), is the statement that there is no real difference or effect, for example that two groups share the same mean. The alternative hypothesis, denoted as \( H_A \), is the statement that contradicts the null hypothesis.

With the null and alternative hypotheses defined, here are the steps to testing for statistical significance.

- Come up with a hypothesis that can be statistically tested.
- Formally state the null and alternative hypotheses.
- Decide on an appropriate statistical test for your data and hypotheses.
- Perform statistical calculations and obtain a p-value.
- Either reject the null hypothesis (finding significance), or fail to reject the null hypothesis.

The result of a statistical test is a p-value, which is defined as the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. A low p-value indicates that the observed data would be unlikely if only chance were at work, which counts as evidence against the null hypothesis.

Most researchers use a conventional, if somewhat arbitrary, p-value cutoff (denoted as \( \alpha \)) of 0.05. Anything below this threshold is deemed "significant."
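
To see the steps end to end, here is a sketch of one simple test, a two-sided one-sample z-test. Every number in it is hypothetical: the null mean of 100, the known standard deviation of 15, the sample size of 36, and the observed sample mean of 106. Real analyses usually need a test chosen to fit the data.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical setup: H0 says the population mean is 100, with known sigma = 15.
mu_0, sigma = 100, 15
n, sample_mean = 36, 106   # assumed sample size and observed sample mean
alpha = 0.05

# Test statistic: how many standard errors the sample mean lies from mu_0.
z = (sample_mean - mu_0) / (sigma / sqrt(n))

# Two-sided p-value: probability of a result at least this extreme under H0.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(round(z, 2), round(p_value, 4))  # 2.4 0.0164
print(decision)                        # reject H0
```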

### P-Values

Of course, just because your statistical test doesn't report a p-value of less than 0.05 doesn't necessarily mean there is no real difference. In this case, the statistical report would simply state there's not enough evidence to reject the null hypothesis *based on your observed data*.

If there is a real difference, but it was missed by your observational data and test, a Type II error, also called a false negative or \( \beta \) error, has occurred. The opposite mistake, rejecting a null hypothesis that is actually true, is a Type I error, or false positive.

| | \( H_0 \) true | \( H_A \) true |
|---|---|---|
| Fail to reject \( H_0 \) | True Negative | False Negative (Type II error) |
| Reject \( H_0 \) | False Positive (Type I error) | True Positive |
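
The false positive cell can be explored with a simulation. In this sketch the null hypothesis is true by construction (the data really come from a standard normal distribution), so every rejection is a Type I error. The sample size, number of trials, and choice of a z-test with known \(\sigma = 1\) are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

rng = random.Random(1)  # fixed seed for repeatability
alpha, n, trials = 0.05, 30, 2000
standard_normal = NormalDist()

false_positives = 0
for _ in range(trials):
    # H0 is true here: the data really come from N(0, 1).
    sample = [rng.gauss(0, 1) for _ in range(n)]
    z = fmean(sample) / (1 / sqrt(n))
    p_value = 2 * (1 - standard_normal.cdf(abs(z)))
    if p_value < alpha:
        false_positives += 1  # rejecting a true H0 is a Type I error

type_i_rate = false_positives / trials
print(type_i_rate)  # should land near alpha
```

The rejection rate hovers around 0.05, which is what \( \alpha \) controls: the share of true null hypotheses that get rejected by chance.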