Link: Supervised learning

The main idea of linear regression

  1. Use least squares to fit a line to the data (see the sketch after this list)
  2. Calculate R-squared (R²)
  3. Calculate a p-value for R²
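
As a quick illustration of step 1, here is a minimal sketch of a least-squares line fit with NumPy; the data arrays `x` and `y` are made-up example values, not from these notes.

```python
import numpy as np

# Made-up example data (not from the original notes)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Least-squares fit of a straight line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {slope:.3f} * x + {intercept:.3f}")
```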

Terminology

SS(mean): the sum of squares around the mean

Var(mean): the variation around the mean

In this way, Var() can be viewed as the average sum of squares: Var(mean) = SS(mean) / n, where n is the sample size.

SS(fit): the sum of squares around the least-squares fit

Var(fit): the variation around the fitted line
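
A small sketch of SS(mean) and Var(mean), using the same made-up data as above (an assumption for illustration, not data from the notes):

```python
import numpy as np

# Made-up example data (not from the original notes)
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
n = len(y)

# SS(mean): sum of squared differences between the data and their mean
ss_mean = np.sum((y - y.mean()) ** 2)

# Var(mean): the average of those squared differences, SS(mean) / n
var_mean = ss_mean / n

print(f"SS(mean) = {ss_mean:.3f}, Var(mean) = {var_mean:.3f}")
```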

How does it work

  1. Measure the variation of the data around the mean: SS(mean) and Var(mean)

  2. Measure the variation of the data around the least-squares fit: SS(fit) and Var(fit)

  3. Plug in and we get R² = (Var(mean) - Var(fit)) / Var(mean) = (SS(mean) - SS(fit)) / SS(mean) (see the sketch after this list)

    Note that when SS(fit) is 0, then R² = 1

  4. Use a p-value to determine whether R² is statistically significant. When there are only two data points for the line, SS(fit) = 0, so R² = 1 no matter what the data look like. That is why we need a p-value to identify cases like this. The p-value for R² comes from F, the F-value.
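
A minimal sketch of steps 1-3, assuming the same made-up data as above (the variable names are illustrative, not from the notes):

```python
import numpy as np

# Made-up example data (not from the original notes)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Step 1: variation around the mean
ss_mean = np.sum((y - y.mean()) ** 2)

# Step 2: variation around the least-squares fit
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept
ss_fit = np.sum((y - y_hat) ** 2)

# Step 3: R² = (SS(mean) - SS(fit)) / SS(mean)
r_squared = (ss_mean - ss_fit) / ss_mean
print(f"R² = {r_squared:.3f}")
```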

F-value

F = [(SS(mean) - SS(fit)) / (p_fit - p_mean)] / [SS(fit) / (n - p_fit)]

where p_mean is the number of parameters in the mean line (1, just the mean) and p_fit is the number of parameters in the least-squares fit (2, the intercept and the slope). The terms (p_fit - p_mean) and (n - p_fit) are also known as the degrees of freedom.

The larger the F-value is, the more of the variation in y is explained by the fit (i.e., by x).
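
A sketch of computing the F-value from the quantities above, continuing the made-up example (for a straight-line fit, p_fit = 2 and p_mean = 1):

```python
import numpy as np

# Made-up example data (not from the original notes)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
n = len(y)

p_mean = 1  # parameters in the mean line (just the mean)
p_fit = 2   # parameters in the fitted line (intercept and slope)

slope, intercept = np.polyfit(x, y, deg=1)
ss_mean = np.sum((y - y.mean()) ** 2)
ss_fit = np.sum((y - (slope * x + intercept)) ** 2)

# F = [(SS(mean) - SS(fit)) / (p_fit - p_mean)] / [SS(fit) / (n - p_fit)]
f_value = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))
print(f"F = {f_value:.3f}")
```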

Steps to convert an F-value to a p-value (see the simulation sketch after this list):

  1. Generate a set of random data (with the same sample size as the original)

  2. Calculate the mean and SS(mean), and the least-squares fit and SS(fit)

  3. Plug into the F-value formula to get an F-value, and add it to a histogram

  4. Repeat many times with new random datasets

  5. Do the same for the original dataset and get its F-value. p-value = the number of F-values at least as extreme as the original one, divided by the total number of F-values
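
A rough simulation sketch of these steps, assuming the made-up data from above and normally distributed random datasets (the choice of random-number distribution is an assumption, not specified in the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_value(x, y, p_mean=1, p_fit=2):
    """F-value for a straight-line fit compared with the mean."""
    n = len(y)
    slope, intercept = np.polyfit(x, y, deg=1)
    ss_mean = np.sum((y - y.mean()) ** 2)
    ss_fit = np.sum((y - (slope * x + intercept)) ** 2)
    return ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# Made-up original data (not from the original notes)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
f_original = f_value(x, y)

# Steps 1-4: F-values for many random datasets of the same size
f_random = np.array([
    f_value(x, rng.normal(size=len(y)))  # random y, same x and sample size
    for _ in range(10_000)
])

# Step 5: p-value = fraction of random F-values at least as extreme as the original
p_value = np.mean(f_random >= f_original)
print(f"F = {f_original:.3f}, simulated p-value = {p_value:.4f}")
```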

F-distributions

In the example above, the red F-distribution comes from a smaller sample size than the blue one, while the other parameters are the same. The blue curve falls off more steeply, so when we calculate the probability of getting an F-value at least as extreme as ours, the p-value is smaller. Therefore, a larger sample size leads to a smaller p-value (for the same amount of variation explained).
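
A sketch of this effect using the analytic F-distribution from SciPy instead of simulation; the specific F-value and sample sizes are made-up for illustration:

```python
from scipy.stats import f

f_observed = 9.0   # made-up F-value for illustration
p_fit, p_mean = 2, 1
dfn = p_fit - p_mean              # numerator degrees of freedom

# Same F-value, two different sample sizes
for n in (5, 50):
    dfd = n - p_fit               # denominator degrees of freedom
    p_value = f.sf(f_observed, dfn, dfd)  # P(F >= f_observed)
    print(f"n = {n:2d}: p-value = {p_value:.4f}")
```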