What is regression
Regression is a way to predict a number. You take one or more input measurements and use them to estimate another number. In plain terms, regression answers questions like: how much will sales be given ad spend, or what price will a house sell for based on size and location.
It is not about proving cause. It is about finding a pattern that lets you make predictions.
Basic idea
You have:
- A dependent variable. This is the thing you want to predict. Call it y.
- One or more independent variables. These are the predictors. Call them x.
A simple regression equation looks like: y = a + b x + e
- a is the intercept, the baseline value.
- b is the slope, how much y changes when x goes up by one.
- e is the error term, the part you cannot predict.
In multiple regression you have: y = a + b1 x1 + b2 x2 + ... + e
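As a minimal sketch of how the multiple regression equation turns inputs into a prediction, the Python snippet below evaluates a + b1 x1 + b2 x2 for one observation. The function name `predict` and all coefficient values are made up for illustration, not estimated from real data. The error term e is left out because it is the part you cannot predict.

```python
def predict(intercept, slopes, xs):
    """Return a + b1*x1 + b2*x2 + ... for one observation."""
    if len(slopes) != len(xs):
        raise ValueError("need exactly one slope per predictor")
    return intercept + sum(b * x for b, x in zip(slopes, xs))


# Hypothetical coefficients: a = 10, b1 = 2, b2 = -0.5
y_hat = predict(10.0, [2.0, -0.5], [3.0, 4.0])
print(y_hat)  # 10 + 2*3 - 0.5*4 = 14.0
```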
Common types of regression
- Simple linear regression. One x, one y, linear relation.
- Multiple linear regression. Many x variables, outcome is still a number.
- Logistic regression. Used when the outcome is binary, like default or no default. The model predicts a probability.
- Regularized regression. Ridge and Lasso add penalties to avoid overfitting.
- Nonlinear regression and splines. Used if relationships are not straight lines.
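To make the idea of a penalty concrete, here is a rough sketch of Ridge regression with a single predictor. It assumes the intercept is not penalized, in which case the Ridge slope is the least squares slope with the penalty added to the denominator. The function name `ridge_slope`, the parameter name `penalty`, and the data are illustrative assumptions. Lasso uses an absolute-value penalty instead and is not shown here.

```python
def ridge_slope(xs, ys, penalty):
    """Slope of a one-predictor Ridge fit with an unpenalized intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    # penalty = 0 gives the ordinary least squares slope.
    return sxy / (sxx + penalty)


# Made-up data that lie exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

for penalty in (0.0, 10.0, 100.0):
    # The slope shrinks toward zero as the penalty grows.
    print(penalty, ridge_slope(xs, ys, penalty))
```

A larger penalty trades some fit on the training data for a smaller, more stable slope.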
How regression finds the line
Most regression methods pick parameters so the errors are small. The usual method is least squares. It chooses a and b to minimize the sum of squared errors, where each error is the vertical distance between a point and the line. Squaring means large misses count much more than small ones.
For a single x, the slope b equals covariance(x,y) divided by variance(x). The slope therefore reflects how strongly x and y move together, relative to how much x varies on its own.
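The covariance formula can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a single predictor and made-up data. The function name `fit_simple_ols` is hypothetical, and the intercept uses the standard least squares result a = mean(y) - b * mean(x).

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least squares line for one predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sample covariance of x and y, and sample variance of x.
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_mean) ** 2 for x in xs) / (n - 1)
    slope = cov_xy / var_x
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Made-up data that roughly follow y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_ols(xs, ys)
print(round(intercept, 2), round(slope, 2))  # about 1.09 and 1.97
```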
Key assumptions to check
Regression works best when these are roughly true:
- Linearity. The relation between x and y should be roughly straight, or made straight with transformations.
- Independence. The errors of different observations should not be correlated with each other. Repeated measurements over time often break this.
- Constant variance. The spread of errors should be similar across values of x.
- No extreme multicollinearity. Predictor variables should not be near-perfect copies of each other.
- Errors roughly normal. This matters for some tests and confidence intervals.
If linearity fails, the predictions themselves suffer. If the other assumptions fail, predictions can still be okay, but standard errors, tests, and confidence intervals become unreliable.
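One informal way to probe the independence assumption, when observations are ordered in time, is the lag-1 autocorrelation of the residuals. The sketch below is a rough check under that assumption, not a formal test. The function name `lag1_autocorrelation` and the residual values are made up for illustration.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one just before it."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    numerator = sum(centered[t] * centered[t - 1] for t in range(1, n))
    denominator = sum(c ** 2 for c in centered)
    return numerator / denominator


# Made-up residuals that drift from negative to positive over time.
drifting = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
# Made-up residuals that flip sign at every step.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(lag1_autocorrelation(drifting))     # positive: neighbours look alike
print(lag1_autocorrelation(alternating))  # negative: neighbours disagree
```

Values near zero are what independent errors would tend to produce. Values far from zero suggest the ordering of the data carries information the model has missed.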
Diagnostics and metrics
- Residuals. The errors y - y_hat. Plot them to look for patterns.
- R-squared. Fraction of variance in y explained by the model. Higher is better, but it never goes down when you add a variable, even a useless one.
- Adjusted R-squared. Penalizes adding useless predictors.
- RMSE (root mean squared error). Square root of the average squared error, in the units of y. Large errors count more heavily than in MAE.
- MAE (mean absolute error). Average absolute error.
- AIC / BIC. Used to compare models. Both penalize complexity, and lower values are better.
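Most of the metrics above follow directly from the residuals. The sketch below computes R-squared, adjusted R-squared, RMSE, and MAE from actual and predicted values. The function name `regression_metrics`, the parameter `n_predictors`, and the numbers are illustrative assumptions. AIC and BIC are left out because they depend on the likelihood of the fitted model.

```python
import math


def regression_metrics(ys, y_hats, n_predictors):
    """Return common fit metrics from actual values and predictions."""
    n = len(ys)
    residuals = [y - y_hat for y, y_hat in zip(ys, y_hats)]
    y_mean = sum(ys) / n
    ss_res = sum(r ** 2 for r in residuals)       # unexplained variation
    ss_tot = sum((y - y_mean) ** 2 for y in ys)   # total variation in y
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-squared needs more observations than predictors plus one.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "mae": mae}


# Made-up actual values and predictions from a one-predictor model.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]

metrics = regression_metrics(actual, predicted, n_predictors=1)
print(metrics)
```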
Check the residual plot for patterns, a QQ plot for normality, and the variance inflation factor (VIF) for multicollinearity.
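For the special case of exactly two predictors, the variance inflation factor reduces to 1 / (1 - r^2), where r is the correlation between the two predictors. The sketch below assumes that two-predictor case. With more predictors, each VIF comes from regressing one predictor on all the others, which is not shown. The function names and data are made up for illustration.

```python
import math


def pearson_r(xs, zs):
    """Pearson correlation between two equally long lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    z_mean = sum(zs) / n
    sxz = sum((x - x_mean) * (z - z_mean) for x, z in zip(xs, zs))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    szz = sum((z - z_mean) ** 2 for z in zs)
    return sxz / math.sqrt(sxx * szz)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Made-up predictors with a correlation of 0.8.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
print(vif_two_predictors(x1, x2))  # about 2.78
```

A VIF of 1 means the predictors are uncorrelated. The value grows without bound as the predictors approach perfect copies of each other.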
Overfitting and underfitting
- Underfit. Model is too simple. It misses patterns.
- Overfit. Model is too complex. It learns noise in the training data and fails on new data.
Detect overfitting with cross validation or a holdout set. Reduce it with a simpler model or with regularization like Ridge or Lasso.
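Cross validation can be sketched as follows: split the data into k folds, fit on all folds but one, measure the error on the fold left out, and repeat. This is a minimal illustration for a one-predictor line, with made-up data and hypothetical function names. It assigns every k-th point to the same fold without shuffling, which is an assumption that only suits data with no meaningful order.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope for one predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def k_fold_rmse(xs, ys, k):
    """RMSE over all held-out points across k folds."""
    n = len(xs)
    squared_errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th point, starting at fold
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        intercept, slope = fit_line(train_x, train_y)
        for i in held_out:
            prediction = intercept + slope * xs[i]
            squared_errors.append((ys[i] - prediction) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up data that roughly follow y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0, 13.1]
print(k_fold_rmse(xs, ys, k=3))
```

If the cross-validated error is much larger than the error on the training data, the model is probably learning noise.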
Practical steps to build a regression model
- Define the question and the target variable.
- Gather data and choose predictors.
- Clean data: handle missing values and outliers.
- Split data into train and test sets.
- Fit a model on training data.
- Check diagnostics and assumptions.
- Evaluate on test data using RMSE or MAE.
- Iterate and simplify the model when possible.
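The split, fit, and evaluate steps above can be sketched end to end in plain Python. Everything here is an assumption made for illustration: the data are synthetic (generated as y = 20 + 2.5 x plus noise), the 80/20 split is one common choice, and the function name `fit_line` is hypothetical. In practice a library such as scikit-learn or statsmodels would usually do the fitting.

```python
import math
import random


def fit_line(xs, ys):
    """Least squares intercept and slope for one predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic data: y = 20 + 2.5 x + noise (made-up numbers).
data = []
for _ in range(50):
    x = random.uniform(0, 100)
    data.append((x, 20 + 2.5 * x + random.gauss(0, 5)))

# Split into train and test sets.
random.shuffle(data)
train, test = data[:40], data[40:]

# Fit on training data only.
intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])

# Evaluate on the held-out test data.
errors = [y - (intercept + slope * x) for x, y in test]
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)
print(f"slope={slope:.2f} rmse={rmse:.2f} mae={mae:.2f}")
```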
Simple finance example
Predicting next month's sales from ad spend:
- y = next month's sales.
- x = ad spend this month.
Fit y = a + b x. If b = 2.5, each extra dollar of ad spend predicts $2.50 more in sales. But check the residuals. If their spread increases with x, you may need a transformation, such as taking the log of y. If the data come from the same stores over time, check for autocorrelation.
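A crude way to see whether residuals grow with ad spend is to compare their average size in the lower and upper halves of x. The sketch below is an informal check, not a formal test, and it does not replace looking at the residual plot. The function name `spread_ratio` and all numbers are made up for illustration.

```python
def spread_ratio(xs, residuals):
    """Mean absolute residual in the upper half of x divided by the lower half."""
    pairs = sorted(zip(xs, residuals))  # order the residuals by x
    half = len(pairs) // 2              # a middle point is dropped if n is odd
    lower = [abs(r) for _, r in pairs[:half]]
    upper = [abs(r) for _, r in pairs[-half:]]
    return (sum(upper) / half) / (sum(lower) / half)


# Made-up ad spend (in thousands) and residuals from a fitted line.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
residuals = [0.1, -0.1, 0.2, -0.2, 0.8, -0.9, 1.0, -1.1]

print(spread_ratio(ad_spend, residuals))  # well above 1: errors grow with x
```

A ratio near 1 is consistent with constant variance. A ratio well above 1 is a hint that a transformation may help.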
For credit scoring, use logistic regression to predict default probability. For pricing assets, use multiple regression with risk factors, but be careful: a correlation between a factor and returns does not imply that the factor causes returns.
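To show how logistic regression turns a linear score into a default probability, here is a minimal sketch of the logistic function. The coefficients, the choice of debt-to-income ratio as the predictor, and the function name `default_probability` are all hypothetical. A real credit model would estimate its coefficients from data.

```python
import math


def default_probability(intercept, slope, x):
    """Logistic model: p = 1 / (1 + exp(-(a + b x)))."""
    z = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, with x as a debt-to-income ratio.
a, b = -3.0, 4.0

for ratio in (0.25, 0.75, 1.25):
    # The output always stays between 0 and 1 and rises with the ratio.
    print(ratio, round(default_probability(a, b, ratio), 3))
```

With these made-up coefficients, a ratio of 0.75 gives a score of zero and therefore a probability of exactly 0.5.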
When regression is the wrong tool
- You need causal proof. Regression alone does not prove cause.
- Outcome is not numeric. If the outcome is one of several categories, use a classification method. Logistic regression covers the binary case.
- Data are highly nonlinear and you do not want to transform variables. Then consider tree-based models.
Short checklist
- Do the predictors make sense?
- Are residuals patternless?
- Is the model validated on held-out data?
- Is the model simple enough to interpret?
Regression is simple, powerful, and the right place to start. It helps turn data into numbers you can act on. Use it to predict, to test ideas, and to guide further analysis.