Polynomial Curve Fitting and Linear Regression from Scratch (Four Ways to Solve a Least Squares Problem)
This blog is a follow-up to my reading of Pattern Recognition and Machine Learning by Chris Bishop. In the first example of the opening chapter he explores curve fitting. The equations come at you directly, and once you understand them they are easy and beautiful. I come from more of a sklearn and pytorch setting, where the world of Machine Learning is kind of a black box and you just use pytorch nn bindings, sklearn linear regression, or xgboost for work. But Gilbert Strang's introductory linear algebra lectures, or a book like PRML, will clean your mind and act as a detox for a brain filled with hundreds of abstractions and libraries. That detox will make you understand the TRUE WORLD made by us humans. While implementing this I realised that we just need to find the pseudo-inverse of the design matrix, and the resulting weights carry straight over to the test data. It is not all about fancy SGD optimization; sometimes it is just about finding the inverse of a matrix to solve your real-world problems.
1.1. Example: Polynomial Curve Fitting
import torch

x_train = torch.linspace(0, 1, 20)
y_train = torch.sin(2*3.14*x_train) + torch.normal(0, 0.2, x_train.shape)  ## added noise as per the book
### testing data
x_test = torch.linspace(0, 1, 200)
y_test = torch.sin(2*3.14*x_test)
Plot the training data
import matplotlib.pyplot as plt
plt.scatter(x_train, y_train)
Fit a linear model (linear in the weights) on the training data
The part below comes directly from Gilbert Strang's lectures at MIT ([Lecture 9: Four Ways to Solve Least Squares Problems](https://youtu.be/ZUU57Q3CFOU?si=WSYz5c10C4AKISjD)). Believe me, the world is just filled with least squares problems everywhere.
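Four common routes to the same least-squares answer (not necessarily the lecture's exact breakdown) are: solve the normal equations \( X^T X W = X^T Y \) directly, apply the Moore-Penrose pseudo-inverse (computed from the SVD), orthogonalize the columns first (Gram-Schmidt / QR), or minimize the squared error iteratively with gradient descent; regularization, a fifth variant, gets its own section later in this post. Below is a minimal sketch on a made-up toy system (`A` and `b` are hypothetical names, not the curve-fitting data) checking that all four land on the same \( w \):

## Hypothetical toy least-squares problem: 20 equations, 3 unknowns
torch.manual_seed(0)
A = torch.randn(20, 3)
b = torch.randn(20, 1)

## 1) Normal equations: solve (A^T A) w = A^T b
w_normal = torch.linalg.solve(A.T @ A, A.T @ b)

## 2) Pseudo-inverse (SVD under the hood)
w_pinv = torch.linalg.pinv(A) @ b

## 3) Orthogonalize first: A = QR, then solve R w = Q^T b
Q, R = torch.linalg.qr(A)
w_qr = torch.linalg.solve(R, Q.T @ b)

## 4) Gradient descent on ||Aw - b||^2
w_gd = torch.zeros(3, 1)
for _ in range(5000):
    w_gd -= 1e-2 * 2 * A.T @ (A @ w_gd - b)

print(torch.allclose(w_normal, w_pinv, atol=1e-4),
      torch.allclose(w_normal, w_qr, atol=1e-4),
      torch.allclose(w_normal, w_gd, atol=1e-3))

The derivation below works out the normal-equation route, which is what the curve-fitting code then solves via the pseudo-inverse.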
To solve for the optimal weight matrix \( W \), given the expression \( (XW - Y)^T (XW - Y) \), we proceed by first expanding the expression and then minimizing it with respect to \( W \).
Step 1: Expand the expression
Given: \[ L(W) = (XW - Y)^T (XW - Y) \]
Expanding this: \[ L(W) = (XW)^T(XW) - (XW)^T Y - Y^T (XW) + Y^T Y \]
Simplifying further: \[ L(W) = W^T X^T XW - W^T X^T Y - Y^T XW + Y^T Y \]
Since \( W^T X^T Y \) and \( Y^T XW \) are both scalars and each is the transpose of the other, they are equal. Therefore: \[ L(W) = W^T X^T XW - 2W^T X^T Y + Y^T Y \]
Step 2: Minimize the expression with respect to \( W \)
To find the optimal \( W \), take the derivative of \( L(W) \) with respect to \( W \) and set it to zero: \[ \frac{\partial L(W)}{\partial W} = 2X^T XW - 2X^T Y = 0 \]
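Here the derivative uses two standard matrix-calculus identities, with \( A = X^T X \) (which is symmetric) and \( b = X^T Y \): \[ \frac{\partial}{\partial W}\left(W^T A W\right) = 2AW, \qquad \frac{\partial}{\partial W}\left(W^T b\right) = b. \]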
Simplifying: \[ X^T XW = X^T Y \]
Finally, solve for \( W \): \[ W = (X^T X)^{-1} X^T Y \]
Final Answer
The optimal weight matrix \( W \) is given by: \[ \boxed{W = (X^T X)^{-1} X^T Y} \]
This result is commonly known as the Normal Equation in linear regression.
powers = torch.arange(5).float()
x_poly = x_train.unsqueeze(-1).pow(powers)  ## NxM design matrix (powers 0..4)
w = torch.linalg.pinv(x_poly) @ y_train.view(-1, 1)  ## MxN @ Nx1 == Mx1
w.shape, x_test.shape
(torch.Size([5, 1]), torch.Size([200]))
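A side note on why the code uses `torch.linalg.pinv` even though the derivation ends at \( (X^T X)^{-1} X^T Y \): when \( X \) has full column rank, the Moore-Penrose pseudo-inverse equals \( (X^T X)^{-1} X^T \), and computing it via the SVD (as `pinv` does) is simply the numerically safer route. As an optional cross-check, a sketch reusing `x_poly`, `y_train`, and `w` from the cell above, PyTorch's built-in least-squares solver should land on essentially the same coefficients:

## Optional cross-check: torch.linalg.lstsq solves the same least-squares problem directly
w_lstsq = torch.linalg.lstsq(x_poly, y_train.view(-1, 1)).solution
print((w - w_lstsq).abs().max())  ## should be ~0 up to float32 round-off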
x_poly_test = x_test.unsqueeze(-1).pow(powers)
y_pred = x_poly_test @ w
import matplotlib.pyplot as plt

plt.scatter(x_train, y_train, facecolor="none", edgecolor="b", s=50, label="training data")
plt.plot(x_test, y_test, c="g", label=r"$\sin(2\pi x)$")
plt.plot(x_test, y_pred, c="r", label="fitting")
plt.legend()
plt.show()
for i, degree in enumerate([0, 1, 3, 9]):
    plt.subplot(2, 2, i + 1)
    powers = torch.arange(degree + 1).float()  ## powers 0..M for an order-M polynomial
    x_poly = x_train.unsqueeze(-1).pow(powers)  ## NxM
    w = torch.linalg.pinv(x_poly) @ y_train.view(-1, 1)  ## MxN @ Nx1 == Mx1
    x_poly_test = x_test.unsqueeze(-1).pow(powers)
    y_pred = x_poly_test @ w

    plt.scatter(x_train, y_train, facecolor="none", edgecolor="b", s=50, label="training data")
    plt.plot(x_test, y_test, c="g", label=r"$\sin(2\pi x)$")
    plt.plot(x_test, y_pred, c="r", label="fitting")
    plt.ylim(-1.5, 1.5)
    plt.annotate("M={}".format(degree), xy=(-0.15, 1))
1.1. Example: Polynomial Curve Fitting with Regularization
One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form shown below.
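In the book's notation, with \( \lambda \) governing the relative importance of the penalty compared with the sum-of-squares error term, the modified error function is \[ \widetilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl\{y(x_n, \mathbf{w}) - t_n\bigr\}^2 + \frac{\lambda}{2}\lVert \mathbf{w} \rVert^2 . \] In the matrix notation used below this reads \( (XW - Y)^T (XW - Y) + \lambda W^T W \); the derivation that follows takes \( \lambda = 1 \) to keep the algebra uncluttered.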
Techniques such as this are known in the statistics literature as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regularizer is called ridge regression (Hoerl and Kennard, 1970). In the context of neural networks, this approach is known as weight decay.
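Incidentally, this quadratic penalty is what PyTorch's optimizers expose as `weight_decay` (for plain SGD it is the same L2 penalty). A minimal sketch of where it would appear; the model and learning-rate values here are hypothetical, just to show the argument:

## weight_decay applies the L2 penalty's gradient, i.e. the quadratic regularizer above
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-3)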
To solve for the optimal weight matrix \( W \) given the objective function \( (XW - Y)^T (XW - Y) + W^T W \), follow these steps:
Objective Function
The objective function is: \[ L(W) = (XW - Y)^T (XW - Y) + W^T W \]
Step 1: Expand the Expression
Expanding \( (XW - Y)^T (XW - Y) \): \[ (XW - Y)^T (XW - Y) = (XW)^T XW - (XW)^T Y - Y^T XW + Y^T Y \]
Since \( (XW)^T Y \) and \( Y^T XW \) are scalars and equal, this simplifies to: \[ (XW - Y)^T (XW - Y) = W^T X^T XW - 2W^T X^T Y + Y^T Y \]
Adding \( W^T W \): \[ L(W) = W^T X^T XW - 2W^T X^T Y + Y^T Y + W^T W \]
Combining like terms: \[ L(W) = W^T (X^T X + I) W - 2W^T X^T Y + Y^T Y \]
Step 2: Minimize the Expression with Respect to \( W \)
Take the derivative of \( L(W) \) with respect to \( W \) and set it to zero: \[ \frac{\partial L(W)}{\partial W} = 2(X^T X + I)W - 2X^T Y = 0 \]
Solving for \( W \): \[ (X^T X + I)W = X^T Y \] \[ W = (X^T X + I)^{-1} X^T Y \]
Final Answer
The optimal weight matrix \( W \) is: \[ \boxed{W = (X^T X + I)^{-1} X^T Y} \]
This result includes the regularization term \( W^T W \), i.e. a regularization coefficient \( \lambda = 1 \); for a general \( \lambda \) the same steps give \( W = (X^T X + \lambda I)^{-1} X^T Y \). This kind of penalty is what is known as Ridge Regression or Tikhonov Regularization.
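Before the full loop, here is a minimal sketch of the shrinkage effect, reusing `x_train` and `y_train` from above with a ninth-order polynomial and \( \lambda = 1 \) (matching the identity term in the formula): the ridge coefficients should come out much smaller in magnitude than the plain least-squares ones.

## Shrinkage check: ninth-order polynomial, with and without the ridge term (lambda = 1)
powers = torch.arange(10).float()            ## powers 0..9  ->  M = 9
X = x_train.unsqueeze(-1).pow(powers)        ## N x 10 design matrix
Y = y_train.view(-1, 1)

w_plain = torch.linalg.pinv(X) @ Y                              ## unregularized least squares
w_ridge = torch.linalg.inv(X.T @ X + torch.eye(10)) @ X.T @ Y   ## ridge / weight decay

print(w_plain.abs().max(), w_ridge.abs().max())  ## ridge keeps the coefficients small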
for i, degree in enumerate([0, 1, 3, 9]):
    plt.subplot(2, 2, i + 1)
    powers = torch.arange(degree + 1).float()  ## powers 0..M for an order-M polynomial
    x_poly = x_train.unsqueeze(-1).pow(powers)  ## NxM
    N, M = x_poly.shape
    w = (torch.linalg.inv(x_poly.T @ x_poly + torch.eye(M)) @ x_poly.T) @ y_train.view(-1, 1)  ## MxN @ Nx1 == Mx1
    x_poly_test = x_test.unsqueeze(-1).pow(powers)
    y_pred = x_poly_test @ w

    plt.scatter(x_train, y_train, facecolor="none", edgecolor="b", s=50, label="training data")
    plt.plot(x_test, y_test, c="g", label=r"$\sin(2\pi x)$")
    plt.plot(x_test, y_pred, c="r", label="fitting")
    plt.ylim(-1.5, 1.5)
    plt.annotate("M={}".format(degree), xy=(-0.15, 1))