Gradient descent is a foundational optimization technique widely used in machine learning to find the minimum of a function. The core idea is to iteratively adjust variables in the opposite direction of the gradient (slope) to minimize the function’s output.
The Gradient Descent Formula
The general update formula for gradient descent is:
\[
x_{n+1} = x_n - \alpha f'(x_n)
\]
where:
\(x_n\) is the current guess
\(f'(x_n)\) is the gradient at \(x_n\)
\(\alpha\) is the learning rate (a small positive constant that controls the step size)
The idea is that the steeper the slope, the larger the update, moving us closer to the minimum.
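To make the update concrete, here is a one-line helper (our own illustrative sketch; it is not used in the rest of the notes):

# One gradient descent update: move against the slope, scaled by the learning rate
gd_step <- function(x, f_prime, alpha) {
  x - alpha * f_prime(x)
}

# Example: one step on f(x) = x^2 (so f'(x) = 2x) from x = 3 with alpha = 0.1
gd_step(3, function(x) 2 * x, 0.1)  # returns 2.4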
Single-Variable Gradient Descent
Let’s replicate the slides seen in class and minimize a single-variable function, \(f(x) = x^2\). Imagine we start with an initial guess of \(x = -10\) (we know from class that the true minimum is at \(x = 0\)). We will iteratively improve this guess by calculating the derivative (gradient) and adjusting the guess.
Understanding the Derivative’s Role
In gradient descent, the derivative (or gradient) tells us the slope of the function at any given point. Here’s the intuition:
If the derivative is positive: The function is sloping upwards, so we should move “downhill” by decreasing our guess.
If the derivative is negative: The function is sloping downwards, so we should move “downhill” by increasing our guess.
This approach ensures that we move towards the minimum at each step.
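For instance, on \(f(x) = x^2\) (the example used below), the sign of the derivative steers the update the right way, for any small positive learning rate \(\alpha\):
\[
x = -4:\; f'(-4) = -8 < 0 \;\Rightarrow\; x_{new} = -4 - \alpha(-8) = -4 + 8\alpha \quad \text{(the guess increases)},
\]
\[
x = 4:\; f'(4) = 8 > 0 \;\Rightarrow\; x_{new} = 4 - 8\alpha \quad \text{(the guess decreases)}.
\]
In both cases the guess moves towards the minimum at \(x = 0\).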
Example: Minimizing \(f(x) = x^2\)
Using the function \(f(x) = x^2\), we first compute its derivative, \(f'(x) = 2x\):
# Define the function and its derivative
f <- function(x) { x^2 }
f_prime <- function(x) { 2 * x }
This is how they look:
library(ggplot2)

# Create a sequence of x values
x_vals <- seq(-10, 10, length.out = 100)

# Compute y values for both functions
data <- data.frame(
  x = x_vals,
  f_x = f(x_vals),
  f_prime_x = f_prime(x_vals)
)
head(data)
library(patchwork)

A <- ggplot(data, aes(x = x)) +
  geom_line(aes(y = f_x, color = "f(x) = x^2"), linewidth = 1) +
  # geom_line(aes(y = f_prime_x, color = "f'(x) = 2x"), linewidth = 1) +
  labs(title = "Plot of f(x) = x^2", x = "x", y = "f(x)") +
  geom_hline(yintercept = 0, color = "black") +  # Add horizontal line at y = 0
  geom_vline(xintercept = 0, color = "black") +
  scale_color_manual("", values = c("f(x) = x^2" = "blue", "f'(x) = 2x" = "red")) +
  theme_minimal()

B <- ggplot(data, aes(x = x)) +
  # geom_line(aes(y = f_x, color = "f(x) = x^2"), linewidth = 1) +
  geom_line(aes(y = f_prime_x, color = "f'(x) = 2x"), linewidth = 1) +
  labs(title = "Plot of f'(x) = 2x", x = "x", y = "f'(x)") +
  ylim(-20, 100) +
  geom_hline(yintercept = 0, color = "black") +  # Add horizontal line at y = 0
  geom_vline(xintercept = 0, color = "black") +
  scale_color_manual("", values = c("f(x) = x^2" = "blue", "f'(x) = 2x" = "red")) +
  theme_minimal()

A + B
# Set initial values
alpha <- 0.8   # Learning rate
x <- -10       # Initial guess
x_store <- NULL
As previously explained, the general update formula for gradient descent is:
\[
x_{n+1} = x_n - \alpha f'(x_n)
\]
To find the minimum of a function, gradient descent updates the current guess \(x_n\) by a step that depends on the gradient (slope) at \(x_n\), as the loop below shows.
# Gradient descent loop
for (i in 1:10) {
  gradient <- f_prime(x)
  # Update the current guess
  x <- x - alpha * gradient
  # Store values for plotting
  x_store[i] <- x
  cat("Step", i, ": x =", x, "f(x) =", f(x), "\n")
}
Step 1 : x = 6 f(x) = 36
Step 2 : x = -3.6 f(x) = 12.96
Step 3 : x = 2.16 f(x) = 4.6656
Step 4 : x = -1.296 f(x) = 1.679616
Step 5 : x = 0.7776 f(x) = 0.6046618
Step 6 : x = -0.46656 f(x) = 0.2176782
Step 7 : x = 0.279936 f(x) = 0.07836416
Step 8 : x = -0.1679616 f(x) = 0.0282111
Step 9 : x = 0.100777 f(x) = 0.010156
Step 10 : x = -0.06046618 f(x) = 0.003656158
This code iteratively adjusts \(x\) based on the gradient at each step, converging towards the minimum.
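For this particular function the update also has a simple closed form, which explains the alternating signs and shrinking magnitudes in the trace above:
\[
x_{n+1} = x_n - \alpha \cdot 2x_n = (1 - 2\alpha)\,x_n,
\]
so with \(\alpha = 0.8\) every step multiplies the guess by \(1 - 1.6 = -0.6\): the sign flips and the magnitude shrinks by 40% per iteration.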
Plotting it all together
library(dplyr)  # provides the pipe %>%

Approximations <- data.frame(x = x_store, y = f(x_store), dy = f_prime(x_store)) %>%
  tibble::rownames_to_column()  # adds a "rowname" column used for the colour gradient
Approximations$rowname <- as.numeric(Approximations$rowname)

A <- ggplot(data, aes(x = x)) +
  geom_line(aes(y = f_x), color = "blue", linewidth = 1) +
  labs(title = "Plot of f(x) = x^2", x = "x", y = "f(x)") +
  geom_hline(yintercept = 0, color = "black") +  # Add horizontal line at y = 0
  geom_vline(xintercept = 0, color = "black") +
  geom_point(data = Approximations, aes(x = x, y = y, fill = rowname),
             size = 3, pch = 21, colour = "black", show.legend = FALSE) +
  scale_fill_gradient(low = "yellow", high = "red", na.value = NA) +
  theme_minimal()

B <- ggplot(data, aes(x = x)) +
  geom_line(aes(y = f_prime_x), color = "blue", linewidth = 1) +
  labs(title = "Plot of f'(x) = 2x", x = "x", y = "f'(x)") +
  ylim(-20, 100) +
  geom_hline(yintercept = 0, color = "black") +
  geom_vline(xintercept = 0, color = "black") +
  geom_point(data = Approximations, aes(x = x, y = dy, fill = rowname),
             size = 3, pch = 21, colour = "black", show.legend = FALSE) +
  scale_fill_gradient(low = "yellow", high = "red", na.value = NA) +
  theme_minimal()

A + B
Note
A larger learning rate \(\alpha\) can make the steps too large, causing the algorithm to “overshoot” the minimum, while a smaller learning rate may result in slow convergence.
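To see the overshooting half of this concretely, here is a quick sketch (the value \(\alpha = 1.1\) is our own choice for illustration): on \(f(x) = x^2\), any \(\alpha > 1\) makes \(|1 - 2\alpha| > 1\), so the iterates grow instead of shrinking.

# Our own illustration: with alpha = 1.1 each step multiplies x by (1 - 2*alpha) = -1.2
x_demo <- 1
for (i in 1:5) {
  x_demo <- x_demo - 1.1 * f_prime(x_demo)  # f_prime(x) = 2x, defined above
  cat("Step", i, ": x =", x_demo, "\n")
}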
Your turn!
Use gradient descent to minimize the following functions. For each function, start with different initial guesses and observe how the algorithm converges to a minimum.
\[
f(x) = x^2 + 3x + 5
\]
# Define the function and its derivative
f <- function(x) { x^2 + 3 * x + 5 }
f_prime <- function(x) { 2 * x + 3 }
Why not plot it?
# Create a sequence of x values
x_vals <- seq(-15, 15, length.out = 100)

# Compute y values for both functions
data <- data.frame(
  x = x_vals,
  f_x = f(x_vals),
  f_prime_x = f_prime(x_vals)
)
head(data)
library(patchwork)

A <- ggplot(data, aes(x = x)) +
  geom_line(aes(y = f_x, color = "f(x) = x^2 + 3*x + 5"), linewidth = 1) +
  labs(title = "Plot of f(x) = x^2 + 3*x + 5", x = "x", y = "f(x)") +
  geom_hline(yintercept = 0, color = "black") +  # Add horizontal line at y = 0
  geom_vline(xintercept = 0, color = "black") +
  scale_color_manual("", values = c("f(x) = x^2 + 3*x + 5" = "blue")) +
  theme_minimal()

B <- ggplot(data, aes(x = x)) +
  geom_line(aes(y = f_prime_x, color = "f'(x) = 2*x + 3"), linewidth = 1) +
  labs(title = "Plot of f'(x) = 2*x + 3", x = "x", y = "f'(x)") +
  ylim(-30, 100) +
  geom_hline(yintercept = 0, color = "black") +
  geom_vline(xintercept = 0, color = "black") +
  scale_color_manual("", values = c("f'(x) = 2*x + 3" = "red")) +
  theme_minimal()

A + B
# Parameters
alpha <- 0.01  # Learning rate
x <- 2         # Initial guess
# Gradient descent loop
for (i in 1:10) {
  x <- x - alpha * f_prime(x)
  cat("Step", i, ": x =", x, "f(x) =", f(x), "\n")
}
Step 1 : x = 1.93 f(x) = 14.5149
Step 2 : x = 1.8614 f(x) = 14.04901
Step 3 : x = 1.794172 f(x) = 13.60157
Step 4 : x = 1.728289 f(x) = 13.17185
Step 5 : x = 1.663723 f(x) = 12.75914
Step 6 : x = 1.600448 f(x) = 12.36278
Step 7 : x = 1.538439 f(x) = 11.98211
Step 8 : x = 1.477671 f(x) = 11.61652
Step 9 : x = 1.418117 f(x) = 11.26541
Step 10 : x = 1.359755 f(x) = 10.9282
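Notice that with \(\alpha = 0.01\) ten steps barely move the guess: the analytic minimum is at \(f'(x) = 2x + 3 = 0\), i.e. \(x = -1.5\) with \(f(-1.5) = 2.75\). One possible sketch (the iteration cap and tolerance below are our own choices) is to keep updating until the steps become negligible:

# Run until the update is tiny (at most 5000 iterations) -- illustrative settings
x <- 2
for (i in 1:5000) {
  x_new <- x - alpha * f_prime(x)
  if (abs(x_new - x) < 1e-8) break  # stop once the step size is negligible
  x <- x_new
}
cat("Stopped after", i, "steps: x =", x, "f(x) =", f(x), "\n")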
Part 3: Local vs Global Minima
Gradient descent may not always reach the global minimum, especially if the function has multiple minima. The algorithm might “get stuck” in a local minimum, particularly if the initial guess is close to one.
Consider the function \(f(x) = x^4 + x^3 - 2x^2\), which has both local and global minima.
# Define the function and its derivative
f <- function(x) { x^4 + x^3 - 2 * x^2 }
f_prime <- function(x) { 4 * x^3 + 3 * x^2 - 4 * x }

# Parameters
alpha <- 0.01   # Learning rate
x <- 1          # Initial guess
x_store <- NULL

# Gradient descent loop
for (i in 1:10) {
  x <- x - alpha * f_prime(x)
  x_store[i] <- x
  cat("Step", i, ": x =", x, "f(x) =", f(x), "\n")
}
Step 1 : x = 0.97 f(x) = -0.08383419
Step 2 : x = 0.9440661 f(x) = -0.1467667
Step 3 : x = 0.9214345 f(x) = -0.1948753
Step 4 : x = 0.9015272 f(x) = -0.2322205
Step 5 : x = 0.8838971 f(x) = -0.2615931
Step 6 : x = 0.8681921 f(x) = -0.2849583
Step 7 : x = 0.8541308 f(x) = -0.303729
Step 8 : x = 0.841485 f(x) = -0.3189397
Step 9 : x = 0.8300673 f(x) = -0.3313601
Step 10 : x = 0.8197226 f(x) = -0.3415714
Approximations <- data.frame(x = x_store, y = f(x_store), dy = f_prime(x_store)) %>%
  tibble::rownames_to_column()  # adds a "rowname" column used for the colour gradient
Approximations$rowname <- as.numeric(Approximations$rowname)

# Create a sequence of x values
x_vals <- seq(-15, 15, length.out = 100)

# Compute y values for both functions
data <- data.frame(
  x = x_vals,
  f_x = f(x_vals),
  f_prime_x = f_prime(x_vals)
)
head(data)

A_big <- ggplot(data, aes(x = x)) +
  geom_line(aes(y = f_x), color = "blue", linewidth = 1) +
  labs(title = "Plot of f(x) = x^4 + x^3 - 2*x^2", x = "x", y = "f(x)") +
  geom_hline(yintercept = 0, color = "black") +  # Add horizontal line at y = 0
  geom_vline(xintercept = 0, color = "black") +
  geom_point(data = Approximations, aes(x = x, y = y, fill = rowname),
             size = 3, pch = 21, colour = "black", show.legend = FALSE) +
  scale_fill_gradient(low = "yellow", high = "red", na.value = NA) +
  theme_minimal()

# Create a narrower sequence of x values (zoomed in)
x_vals <- seq(-5, 5, length.out = 100)

# Compute y values for both functions
data <- data.frame(
  x = x_vals,
  f_x = f(x_vals),
  f_prime_x = f_prime(x_vals)
)
head(data)

A <- ggplot(data, aes(x = x)) +
  geom_line(aes(y = f_x), color = "blue", linewidth = 1) +
  labs(title = "Plot of f(x) = x^4 + x^3 - 2*x^2", x = "x", y = "f(x)") +
  ylim(-5, 5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_vline(xintercept = 0, color = "black") +
  geom_point(data = Approximations, aes(x = x, y = y, fill = rowname),
             size = 3, pch = 21, colour = "black", show.legend = FALSE) +
  scale_fill_gradient(low = "yellow", high = "red", na.value = NA) +
  theme_minimal()

B <- ggplot(data, aes(x = x)) +
  geom_line(aes(y = f_prime_x), color = "blue", linewidth = 1) +
  labs(title = "Plot of f'(x) = 4*x^3 + 3*x^2 - 4*x", x = "x", y = "f'(x)") +
  ylim(-20, 90) +
  geom_hline(yintercept = 0, color = "black") +
  geom_vline(xintercept = 0, color = "black") +
  geom_point(data = Approximations, aes(x = x, y = dy, fill = rowname),
             size = 3, pch = 21, colour = "black", show.legend = FALSE) +
  scale_fill_gradient(low = "yellow", high = "red", na.value = NA) +
  theme_minimal()

(A_big | A)
We see this function has two minima! One is a local minimum and the other is the global minimum. Depending on where we start (the initialization point), we will end up in one or the other.
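For reference, the two minima can be located analytically by setting the derivative to zero:
\[
f'(x) = 4x^3 + 3x^2 - 4x = x\,(4x^2 + 3x - 4) = 0 \;\Rightarrow\; x = 0 \;\text{ or }\; x = \frac{-3 \pm \sqrt{73}}{8} \approx 0.69,\; -1.44,
\]
so the local minimum is near \(x \approx 0.69\), the global minimum is near \(x \approx -1.44\) (where \(f(x) \approx -2.83\)), and \(x = 0\) is a local maximum.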
Exercise! Try different initial guesses and record which ones lead to the global minimum.
# Parameters
alpha <- 0.03   # Learning rate
x <- -0.5       # Initial guess
x_store <- NULL
number_iterations <- 100

# Gradient descent loop
for (i in 1:number_iterations) {
  x <- x - alpha * f_prime(x)
  x_store[i] <- x
  cat("Step", i, ": x =", x, "f(x) =", f(x), "\n")
}
Step 1 : x = -0.5675 f(x) = -0.7231592
Step 2 : x = -0.642653 f(x) = -0.920852
Step 3 : x = -0.7250915 f(x) = -1.156317
Step 4 : x = -0.813674 f(x) = -1.424506
Step 5 : x = -0.9062562 f(x) = -1.712375
Step 6 : x = -0.9996069 f(x) = -1.998821
Step 7 : x = -1.08963 f(x) = -2.258633
Step 8 : x = -1.171997 f(x) = -2.470269
Step 9 : x = -1.243079 f(x) = -2.62357
Step 10 : x = -1.300817 f(x) = -2.722108
Step 11 : x = -1.345069 f(x) = -2.778691
Step 12 : x = -1.377285 f(x) = -2.808136
Step 13 : x = -1.39977 f(x) = -2.822284
Step 14 : x = -1.414967 f(x) = -2.828682
Step 15 : x = -1.425001 f(x) = -2.831453
Step 16 : x = -1.43152 f(x) = -2.832617
Step 17 : x = -1.43571 f(x) = -2.833097
Step 18 : x = -1.438384 f(x) = -2.833291
Step 19 : x = -1.440082 f(x) = -2.83337
Step 20 : x = -1.441158 f(x) = -2.833402
Step 21 : x = -1.441838 f(x) = -2.833414
Step 22 : x = -1.442267 f(x) = -2.833419
Step 23 : x = -1.442538 f(x) = -2.833421
Step 24 : x = -1.442709 f(x) = -2.833422
Step 25 : x = -1.442817 f(x) = -2.833422
Step 26 : x = -1.442885 f(x) = -2.833422
Step 27 : x = -1.442928 f(x) = -2.833422
Step 28 : x = -1.442955 f(x) = -2.833422
Step 29 : x = -1.442972 f(x) = -2.833422
Step 30 : x = -1.442982 f(x) = -2.833422
Step 31 : x = -1.442989 f(x) = -2.833422
Step 32 : x = -1.442993 f(x) = -2.833422
Step 33 : x = -1.442996 f(x) = -2.833422
Step 34 : x = -1.442998 f(x) = -2.833422
Step 35 : x = -1.442999 f(x) = -2.833422
Step 36 : x = -1.442999 f(x) = -2.833422
Step 37 : x = -1.443 f(x) = -2.833422
Step 38 : x = -1.443 f(x) = -2.833422
Step 39 : x = -1.443 f(x) = -2.833422
Step 40 : x = -1.443 f(x) = -2.833422
Step 41 : x = -1.443 f(x) = -2.833422
Step 42 : x = -1.443 f(x) = -2.833422
Step 43 : x = -1.443 f(x) = -2.833422
Step 44 : x = -1.443 f(x) = -2.833422
Step 45 : x = -1.443 f(x) = -2.833422
Step 46 : x = -1.443 f(x) = -2.833422
Step 47 : x = -1.443 f(x) = -2.833422
Step 48 : x = -1.443 f(x) = -2.833422
Step 49 : x = -1.443 f(x) = -2.833422
Step 50 : x = -1.443 f(x) = -2.833422
Step 51 : x = -1.443 f(x) = -2.833422
Step 52 : x = -1.443 f(x) = -2.833422
Step 53 : x = -1.443 f(x) = -2.833422
Step 54 : x = -1.443 f(x) = -2.833422
Step 55 : x = -1.443 f(x) = -2.833422
Step 56 : x = -1.443 f(x) = -2.833422
Step 57 : x = -1.443 f(x) = -2.833422
Step 58 : x = -1.443 f(x) = -2.833422
Step 59 : x = -1.443 f(x) = -2.833422
Step 60 : x = -1.443 f(x) = -2.833422
Step 61 : x = -1.443 f(x) = -2.833422
Step 62 : x = -1.443 f(x) = -2.833422
Step 63 : x = -1.443 f(x) = -2.833422
Step 64 : x = -1.443 f(x) = -2.833422
Step 65 : x = -1.443 f(x) = -2.833422
Step 66 : x = -1.443 f(x) = -2.833422
Step 67 : x = -1.443 f(x) = -2.833422
Step 68 : x = -1.443 f(x) = -2.833422
Step 69 : x = -1.443 f(x) = -2.833422
Step 70 : x = -1.443 f(x) = -2.833422
Step 71 : x = -1.443 f(x) = -2.833422
Step 72 : x = -1.443 f(x) = -2.833422
Step 73 : x = -1.443 f(x) = -2.833422
Step 74 : x = -1.443 f(x) = -2.833422
Step 75 : x = -1.443 f(x) = -2.833422
Step 76 : x = -1.443 f(x) = -2.833422
Step 77 : x = -1.443 f(x) = -2.833422
Step 78 : x = -1.443 f(x) = -2.833422
Step 79 : x = -1.443 f(x) = -2.833422
Step 80 : x = -1.443 f(x) = -2.833422
Step 81 : x = -1.443 f(x) = -2.833422
Step 82 : x = -1.443 f(x) = -2.833422
Step 83 : x = -1.443 f(x) = -2.833422
Step 84 : x = -1.443 f(x) = -2.833422
Step 85 : x = -1.443 f(x) = -2.833422
Step 86 : x = -1.443 f(x) = -2.833422
Step 87 : x = -1.443 f(x) = -2.833422
Step 88 : x = -1.443 f(x) = -2.833422
Step 89 : x = -1.443 f(x) = -2.833422
Step 90 : x = -1.443 f(x) = -2.833422
Step 91 : x = -1.443 f(x) = -2.833422
Step 92 : x = -1.443 f(x) = -2.833422
Step 93 : x = -1.443 f(x) = -2.833422
Step 94 : x = -1.443 f(x) = -2.833422
Step 95 : x = -1.443 f(x) = -2.833422
Step 96 : x = -1.443 f(x) = -2.833422
Step 97 : x = -1.443 f(x) = -2.833422
Step 98 : x = -1.443 f(x) = -2.833422
Step 99 : x = -1.443 f(x) = -2.833422
Step 100 : x = -1.443 f(x) = -2.833422
Approximations <- data.frame(x = x_store, y = f(x_store), dy = f_prime(x_store)) %>%
  tibble::rownames_to_column()  # adds a "rowname" column used for the colour gradient
Approximations$rowname <- as.numeric(Approximations$rowname)

# Create a sequence of x values
x_vals <- seq(-15, 15, length.out = 100)

# Compute y values for both functions
data <- data.frame(
  x = x_vals,
  f_x = f(x_vals),
  f_prime_x = f_prime(x_vals)
)
head(data)

A_big <- ggplot(data, aes(x = x)) +
  geom_line(aes(y = f_x), color = "blue", linewidth = 1) +
  labs(title = "Plot of f(x) = x^4 + x^3 - 2*x^2", x = "x", y = "f(x)") +
  geom_hline(yintercept = 0, color = "black") +  # Add horizontal line at y = 0
  geom_vline(xintercept = 0, color = "black") +
  geom_point(data = Approximations, aes(x = x, y = y, fill = rowname),
             size = 3, pch = 21, colour = "black", show.legend = FALSE) +
  scale_fill_gradient(low = "yellow", high = "red", na.value = NA) +
  theme_minimal()

# Create a narrower sequence of x values (zoomed in)
x_vals <- seq(-5, 5, length.out = 100)

# Compute y values for both functions
data <- data.frame(
  x = x_vals,
  f_x = f(x_vals),
  f_prime_x = f_prime(x_vals)
)
head(data)

A <- ggplot(data, aes(x = x)) +
  geom_line(aes(y = f_x), color = "blue", linewidth = 1) +
  labs(title = "Plot of f(x) = x^4 + x^3 - 2*x^2", x = "x", y = "f(x)") +
  ylim(-5, 5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_vline(xintercept = 0, color = "black") +
  geom_point(data = Approximations, aes(x = x, y = y, fill = rowname),
             size = 3, pch = 21, colour = "black", show.legend = FALSE) +
  scale_fill_gradient(low = "yellow", high = "red", na.value = NA) +
  theme_minimal()

B <- ggplot(data, aes(x = x)) +
  geom_line(aes(y = f_prime_x), color = "blue", linewidth = 1) +
  labs(title = "Plot of f'(x) = 4*x^3 + 3*x^2 - 4*x", x = "x", y = "f'(x)") +
  ylim(-20, 90) +
  geom_hline(yintercept = 0, color = "black") +
  geom_vline(xintercept = 0, color = "black") +
  geom_point(data = Approximations, aes(x = x, y = dy, fill = rowname),
             size = 3, pch = 21, colour = "black", show.legend = FALSE) +
  scale_fill_gradient(low = "yellow", high = "red", na.value = NA) +
  theme_minimal()

(A_big | A)
What happens with the initial point x = 0? And if you increase the learning rate a lot, does the algorithm reach a minimum faster? Which minimum? What other parameter can you modify? The number of ITERATIONS!
As you have seen:
The choice of learning rate \(\alpha\) is crucial:
If \(\alpha\) too large, the algorithm might oscillate and fail to converge.
If \(\alpha\) is too small, convergence is slow and requires more iterations.
Example of a high learning rate
Setting \(\alpha = 0.5\)
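The example chunk itself is not shown here, so below is our own minimal sketch of what \(\alpha = 0.5\) does on the quartic from above: the first step jumps across the valley, and within a few iterations the values blow up.

# Our own illustration of an overly large learning rate on f(x) = x^4 + x^3 - 2*x^2
alpha <- 0.5
x <- 1  # same starting point as the alpha = 0.01 run
for (i in 1:10) {
  x <- x - alpha * f_prime(x)
  cat("Step", i, ": x =", x, "f(x) =", f(x), "\n")
}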
Part 4: Multivariable Gradient Descent
Multivariable gradient descent is an extension of the single-variable case. Instead of using a single derivative, we calculate the gradient vector, which consists of the partial derivatives of the function with respect to each variable.
Consider the function
\[
f(x, y) = x\sin(y) + x^2
\]
The gradient of this function is
\[
\nabla f(x, y) = \left\langle \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right\rangle = \left\langle \sin(y) + 2x,\; x\cos(y) \right\rangle
\]
Let’s apply gradient descent starting from an initial guess of \(\vec{x} = (1, 2)\) with a learning rate of \(\alpha = 0.01\).
# Define the function and its partial derivatives
f <- function(x, y) { x * sin(y) + x^2 }
f_x <- function(x, y) { sin(y) + 2 * x }
f_y <- function(x, y) { x * cos(y) }
Another way to do this in R is to use the Deriv() function from the Deriv package:
library(Deriv)

# Define the function f(x, y) = x*sin(y) + x^2
f <- function(x, y) { x * sin(y) + x^2 }

# Compute the gradient symbolically (exact expressions)
grad_f <- Deriv(expression(x * sin(y) + x^2), c("x", "y"))
print(grad_f)
expression(c(x = 2 * x + sin(y), y = x * cos(y)))
# expression(c(x = 2 * x + sin(y), y = x * cos(y))) -- it computes the partial derivatives for you!
gradient <- function(x, y) { eval(grad_f) }
As you can see, these are exactly the partial derivatives written out above, computed for you automatically.
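A quick sanity check (our own addition): evaluating the symbolic gradient at the starting point \((1, 2)\) should match the hand-written partial derivatives.

# Both should give approximately 2.909 and -0.416
gradient(x = 1, y = 2)
c(f_x(1, 2), f_y(1, 2))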
# Initial parameters
alpha <- 0.01      # Learning rate
iterations <- 100  # Number of iterations
x <- 1             # Initial guess for x
y <- 2             # Initial guess for y

# Store results for plotting
results <- data.frame(Iteration = 0, x = x, y = y, f_value = f(x, y))
# Gradient descent loop
for (i in 1:iterations) {
  grad <- gradient(x = x, y = y)  # Evaluate gradient
  x <- x - alpha * grad[1]
  y <- y - alpha * grad[2]
  results <- rbind(results, data.frame(Iteration = i, x = x, y = y, f_value = f(x, y)))
  # would be the same as:
  # x <- x - alpha * f_x(x, y)
  # y <- y - alpha * f_y(x, y)
}

# Display first few iterations
head(results)
library(plot3D)

# Generate grid data for the 3D surface plot
x_vals <- seq(-2, 2, length.out = 50)
y_vals <- seq(-1, 3, length.out = 50)
z_vals <- outer(x_vals, y_vals, Vectorize(f))  # evaluate f on the (x, y) grid

# 3D plot
persp3D(x = x_vals, y = y_vals, z = z_vals,
        col = "lightblue", theta = 30, phi = 20, expand = 0.6, shade = 0.5,
        main = "Gradient Descent Path on f(x, y) = x*sin(y) + x^2",
        xlab = "x", ylab = "y", zlab = "f(x, y)")

# Overlay gradient descent path
points3D(results$x, results$y, results$f_value, col = "red", pch = 19, add = TRUE)
lines3D(results$x, results$y, results$f_value, col = "red", add = TRUE)
# Plot the value of f(x, y) over iterations
ggplot(results, aes(x = Iteration, y = f_value)) +
  geom_line(color = "blue") +
  labs(title = "Convergence of Gradient Descent on f(x, y) = x*sin(y) + x^2",
       x = "Iteration", y = "f(x, y)") +
  theme_minimal()
Final Remarks
Gradient descent is a versatile optimization technique, but it’s not without limitations:
It may converge to local minima rather than the global minimum. It is sensitive to the choice of learning rate and initial guess.
Variants of gradient descent, like stochastic gradient descent (SGD) and momentum-based methods, are often used to address these issues in large-scale machine learning tasks. Understanding and experimenting with gradient descent is crucial for developing an intuition about optimization in machine learning and algorithms. Use the exercises above to deepen your understanding of this fundamental technique.
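As a flavour of these variants, here is a minimal sketch of a momentum-style update applied to the quartic from Part 3 (the momentum coefficient beta = 0.9 and the other settings are our own illustrative choices, not part of the course material). The velocity term accumulates past gradients, so consecutive steps in the same direction reinforce each other; compare where it ends up with the plain runs above.

# Gradient descent with momentum (illustrative sketch)
f_prime_quartic <- function(x) { 4 * x^3 + 3 * x^2 - 4 * x }

alpha <- 0.01  # learning rate
beta  <- 0.9   # momentum coefficient: how much of the previous velocity is kept
x <- 1         # starting point
v <- 0         # velocity: running accumulation of past gradients

for (i in 1:200) {
  v <- beta * v + f_prime_quartic(x)  # accumulate gradient history
  x <- x - alpha * v                  # step along the accumulated direction
}
cat("x =", x, "f(x) =", x^4 + x^3 - 2 * x^2, "\n")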
Try it out with our own MSE!
library(mlbench)

# Load the dataset
data("PimaIndiansDiabetes")

# Select the first 80 rows and extract the variables
data_subset <- PimaIndiansDiabetes[1:80, ]
x <- data_subset$age
y <- data_subset$glucose
# Gradient descent parameters
alpha <- 0.0001     # Learning rate
iterations <- 1000  # Number of iterations
m <- -3             # Initial guess for slope
c <- 40             # Initial guess for intercept

# Vectors to store m, c, and MSE values for plotting
m_path <- numeric(iterations)
c_path <- numeric(iterations)
mse_history <- numeric(iterations)
# Define the MSE function
mse <- function(m, c, x, y) {
  y_pred <- m * x + c
  mean((y - y_pred)^2)  # same thing: (1 / n) * sum((y - (m * x + c))^2)
}

# Define the gradients of MSE with respect to m and c
mse_gradient_m <- function(m, c, x, y) {
  -2 / length(y) * sum(x * (y - (m * x + c)))
}
mse_gradient_c <- function(m, c, x, y) {
  -2 / length(y) * sum(y - (m * x + c))
}

# Remember: you can also apply the Deriv function
grad_f <- Deriv(expression((1 / n) * sum((y - (m * x + c))^2)), c("m", "c"))
print(grad_f)
expression({
.e2 <- y - (c + m * x)
c(m = sum(-(2 * (x * .e2)))/n, c = sum(-(2 * .e2))/n)
})
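Written out, this is just the chain rule applied to the squared error, and it matches the two gradient functions defined above:
\[
\frac{\partial\,\mathrm{MSE}}{\partial m} = -\frac{2}{n}\sum_{i} x_i \bigl(y_i - (m x_i + c)\bigr),
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial c} = -\frac{2}{n}\sum_{i} \bigl(y_i - (m x_i + c)\bigr).
\]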
# Perform gradient descent
for (i in 1:iterations) {
  # Compute gradients
  grad_m <- mse_gradient_m(m, c, x, y)
  grad_c <- mse_gradient_c(m, c, x, y)

  # Update parameters
  m <- m - alpha * grad_m
  c <- c - alpha * grad_c

  # Store values for plotting
  m_path[i] <- m
  c_path[i] <- c
  mse_history[i] <- mse(m, c, x, y)

  # Print progress every 100 iterations
  if (i %% 100 == 0) {
    cat("Iteration:", i, "Slope (m):", m, "Intercept (c):", c, "MSE:", mse_history[i], "\n")
  }
}
# Plot the data points and the fitted line
ggplot(data_subset, aes(x = age, y = glucose)) +
  geom_point() +
  geom_abline(intercept = c, slope = m, color = "blue", linewidth = 1) +
  labs(title = "Linear Fit using Gradient Descent", x = "Age", y = "Glucose") +
  theme_minimal()
# Plot MSE over iterations
mse_df <- data.frame(iteration = 1:iterations, MSE = mse_history)
ggplot(mse_df, aes(x = iteration, y = MSE)) +
  geom_line(color = "red") +
  labs(title = "MSE over Iterations", x = "Iteration", y = "Mean Squared Error") +
  theme_minimal()
# Create a grid of values for m and c
m_values <- seq(-5, 5, length.out = 50)
c_values <- seq(0, 200, length.out = 50)

# Initialize a matrix to store MSE values
mse_matrix <- outer(m_values, c_values, Vectorize(function(m, c) mse(m, c, x, y)))

# Plot the MSE surface
persp3D(x = m_values, y = c_values, z = mse_matrix,
        theta = 45, phi = 0,
        xlab = "Slope (m)", ylab = "Intercept (c)", zlab = "MSE",
        main = "MSE Surface with Gradient Descent Path", ticktype = "detailed")

# Add the gradient descent path
points3D(m_path, c_path,
         z = sapply(1:iterations, function(i) mse(m_path[i], c_path[i], x, y)),
         col = "red", pch = 20, add = TRUE, cex = 0.5)
Exercise Questions
1. Experiment with the learning rate. Try changing the value of alpha (the learning rate) to see its effect on convergence. Question: What happens if alpha is too high? Does the MSE converge smoothly, or does it oscillate? Question: What happens if alpha is too low? How does it affect the number of iterations required to reach a stable value?
2. Change the initial guess. Try different initial values for m, for example m = 5 or m = -3. Question: Does the algorithm converge to the same solution? How does the initial value of m affect the convergence?
3. Optimize only m, then both m and c. The code above already updates both the slope (m) and the intercept (c) in each iteration. Modify it to keep c fixed at its initial value and update only m. Question: How does optimizing both m and c simultaneously compare to optimizing only m?