Breast Cancer Diagnosis Predictions via Logistic Regression
The Diagnostic Wisconsin Breast Cancer Database (DWBCD) is a dataset compiled from 569 patients, each of whom underwent a fine needle aspiration biopsy so that the abnormal-appearing tissue could be studied for features that might indicate the presence of cancerous cells. All features were measured from a digitized image of the aspirated breast mass and consist of characteristics such as area, perimeter, texture, concavity, and other physical descriptors. In this article, we will implement a popular machine learning technique for classifying a target variable given continuous input data.
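The article does not show how the data was obtained; one convenient option (an assumption on our part) is the copy of the WDBC dataset bundled with scikit-learn, where the label orientation has to be flipped to match our convention of '1' meaning cancer is present.

from sklearn.datasets import load_breast_cancer

# Load the 569-sample WDBC dataset shipped with scikit-learn (an assumed source;
# the article does not specify how the data was loaded).
data = load_breast_cancer()
X = data.data  # 30 real-valued features computed from each digitized biopsy image
# scikit-learn encodes 0 = malignant, 1 = benign, so flip the labels to match
# our convention of y = 1 when cancerous cells are present.
y = (data.target == 0).astype(int)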
The target variable in our case is whether cancer is present in a patient or not, and as such will be modeled by a binary variable: '1' if present and '0' if not present. This means that our target variable follows a Bernoulli distribution, whose single parameter 'p' (i.e. the probability of observing '1') is influenced by the continuous input features. As a result, we must find a way of modeling this parameter as a function of both our features and the weights we want to optimize during the training phase of constructing our model. A popular class of functions for modeling the output of some combination of weights and features is the sigmoids.
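Concretely, writing x for a feature vector, w for the weight vector, and σ for the sigmoid we are about to choose (notation introduced here for clarity), the model takes the form

$$ y \mid \mathbf{x} \sim \mathrm{Bernoulli}(p), \qquad p = \sigma(\mathbf{w}^\top \mathbf{x}). $$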
Why the sigmoid class? These functions offer several convenient properties. Their derivatives (and integrals) have simple closed-form expressions; for the logistic sigmoid in particular, the derivative can be written directly in terms of the function itself. This helps reduce computational cost, since previously computed and cached values can be reused instead of recomputed numerically, and such savings matter in machine learning contexts where we often deal with 'big data.' Other useful properties include strict convexity on one half of the domain (and strict concavity on the other half), and, specifically for the logistic sigmoid in the context of logistic regression, the fact that it yields the conditional class probabilities of our predictions: because its range is constrained to (0, 1), the output can be interpreted directly as a probability. For these reasons, we will choose the logistic sigmoid as our representative from this class of functions moving forward.
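For reference, the logistic sigmoid and the derivative property mentioned above are

$$ \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\big(1 - \sigma(x)\big), \qquad \sigma(x) \in (0, 1). $$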
Now that we know our Bernoulli parameter will be modeled by a logistic sigmoid acting on a combination of our feature inputs and weights, we can define our model more explicitly. One assumption we will make is that the data is independently and identically distributed (i.i.d.); we do this largely for convenience, since the assumption greatly simplifies the model while, in practice, having a negligible effect on accuracy.
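Under the i.i.d. assumption, the probability of the observed targets given the data and the weights factorizes over the n samples (this is the equation the next paragraph refers to):

$$ p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{\,1 - y_i}, \qquad p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i). $$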
As stated before, the logistic sigmoid allows us to model our conditional class probabilities, and as a result we are interested in maximizing the probability of our target variables given the observed data and the (to-be-optimized) weights of our model. This hints that our loss function should be derived from the likelihood above, with some further modification. An interesting result is that taking the negative logarithm of that likelihood yields a famous expression known as the binary cross entropy function.
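Written out, the negative log-likelihood (the binary cross entropy) is

$$ E(\mathbf{w}) = -\ln p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = -\sum_{i=1}^{n} \Big[\, y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \,\Big]. $$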
With such a loss function, finding the weights that minimize the negative log-likelihood (equivalent to maximizing the probability of the observed target training variables given the observed training data) gives us weights we can combine with our test features and pass through the sigmoid to obtain a value which, as discussed, serves as a prediction for our target variable y. If this value is greater than 0.5, it is more likely that this sample of test features characterizes the presence of cancerous cells. The output of the sigmoid also offers a useful interpretation: a value close to 0.5 tells us the model is less confident than it would be at, say, 0.99. In the end, we are modeling a Bernoulli distribution, which means these confidence values must ultimately be quantized to either zero or one.
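In other words, for a test sample x the quantized prediction is

$$ \hat{y} = \begin{cases} 1 & \text{if } \sigma(\mathbf{w}^\top \mathbf{x}) > 0.5, \\ 0 & \text{otherwise,} \end{cases} $$

and since σ(t) > 0.5 exactly when t > 0, this amounts to checking the sign of wᵀx.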
Up to now, we have discussed where our prediction model and loss function come from and why. We can collect these findings by organizing them into the following functions to be utilized in our regression procedure.
def sigmoid(x)
def negative_log_likelihood(X, y, w)
def compute_regularized_loss(X, y, w, reg)
def predict(X_test, w_opt)
def get_gradient(X_train, y_train, w, mini_batch_indices, reg)
The first function simply computes the element-wise logistic sigmoid and is essential for making predictions. The second function computes the negative log-likelihood exactly as described in the discussion above. The third function shows that we will be working with the same loss, but with an additional L2-regularization term to counteract potential overfitting (and because regularization heuristically leads to better results). The fourth function takes the optimal weights found during training, applies them linearly to our features, sends the output through the sigmoid, and quantizes the result to either zero or one in order to obtain a prediction for the breast cancer diagnosis. Looking at the fifth function, you will notice that we are interested in computing the gradient of our regularized loss. If you have read the article on linear regression, note the difference here: in our case there is no closed-form solution for the optimal weights, so we must resort to numerical optimization methods to find them. You might also notice the use of mini-batches. We will test both the calculation of the full gradient and the calculation of partial gradients in order to examine the efficiency-accuracy tradeoff. The gradient-descent optimization procedure (along with a per-line explanation) for finding the optimal weights can be found in the following.
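The article lists only these signatures; the bodies below are one possible implementation consistent with the descriptions above, not necessarily the author's exact code (the small epsilon guard in the logarithm and the factor of 0.5 on the L2 penalty are our choices).

import numpy as np

def sigmoid(x):
    # Element-wise logistic sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def negative_log_likelihood(X, y, w):
    # Binary cross entropy between the labels y and the predictions sigmoid(X @ w)
    p = sigmoid(X @ w)
    eps = 1e-12  # guard against log(0); our choice, not necessarily the author's
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def compute_regularized_loss(X, y, w, reg):
    # Negative log-likelihood plus an L2 penalty on the weights
    return negative_log_likelihood(X, y, w) + 0.5 * reg * np.dot(w, w)

def predict(X_test, w_opt):
    # Quantize the predicted probabilities to a 0/1 diagnosis
    return (sigmoid(X_test @ w_opt) > 0.5).astype(int)

def get_gradient(X_train, y_train, w, mini_batch_indices, reg):
    # Gradient of the regularized loss, evaluated on the selected mini-batch only
    X_b = X_train[mini_batch_indices]
    y_b = y_train[mini_batch_indices]
    return X_b.T @ (sigmoid(X_b @ w) - y_b) + reg * w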
def logistic_regression(X_train, y_train, num_steps, learning_rate, mini_batch_size, reg):
    # Number of training instances
    n_train = X_train.shape[0]
    # Initialize the parameters to zeros
    w = np.zeros(X_train.shape[1])
    for step in range(num_steps):
        # Shuffle the data
        permuted_idx = np.random.permutation(n_train)
        # Go over each mini-batch and update the parameters
        for idx in range(0, n_train, mini_batch_size):
            # Get the random indices to be included in the mini-batch
            mini_batch_indices = permuted_idx[idx:idx+mini_batch_size]
            gradient = get_gradient(X_train, y_train, w, mini_batch_indices, reg)
            # Update the parameters
            w = w - learning_rate * gradient
    return w
Now that we have our loss function, prediction, and training mechanism, we can proceed to test our regression algorithm. We start by splitting the data with the following 70/30 training-to-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
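This helper comes from scikit-learn's model_selection module, so the line above assumes the import:

from sklearn.model_selection import train_test_split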
We then train our model using mini-batches of size 50 (recall that the total number of samples is 569), and also train using the entire training set as a single batch, in order to obtain two sets of weights. In both cases we train with a total of 8000 steps, a learning rate of 1e-5, and a regularization strength of 0.1. To test the accuracy of the predictions made with each set of weights, we use the following functions imported from sklearn.metrics.
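Put together, the two training runs described above could look as follows (a sketch: the variable names w_full and w_minibatch are chosen to match the evaluation code below, and treating the full training set as a single mini-batch reproduces ordinary full-batch gradient descent):

num_steps, learning_rate, reg = 8000, 1e-5, 0.1

# Full-batch training: a single mini-batch spanning the entire training set
w_full = logistic_regression(X_train, y_train, num_steps, learning_rate,
                             mini_batch_size=X_train.shape[0], reg=reg)

# Mini-batch training with batches of 50 samples
w_minibatch = logistic_regression(X_train, y_train, num_steps, learning_rate,
                                  mini_batch_size=50, reg=reg)

# Evaluation helpers from sklearn.metrics, as mentioned above
from sklearn.metrics import accuracy_score, f1_score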
accuracy = accuracy_score(y_test, predict(X_test, w_full))
f1 = f1_score(y_test, predict(X_test, w_full))
accuracy_m = accuracy_score(y_test, predict(X_test, w_minibatch))
f1_m = f1_score(y_test, predict(X_test, w_minibatch))
Executing our model, we find the following accuracy results.
>> Accuracy of Model: 0.9239766081871345
>> f1-Score of Model: 0.9383886255924171
>> Accuracy of Model (using minibatches): 0.9415204678362573
>> f1-Score of Model (using minibatches): 0.9532710280373833
It might be surprising to some that the weights obtained from mini-batch training achieve a slightly higher accuracy, considering that a mini-batch gradient does not point in the direction of steepest descent of the full-batch loss. However, by purposely not using the full-batch gradient, we are, in a sense, injecting noise into each update, which can act as a form of implicit regularization (in non-convex problems this noise can also help escape saddle points or local minima, though our L2-regularized logistic loss is in fact convex, so that particular benefit does not apply here). In addition, each pass over the shuffled training set now performs several parameter updates instead of one, so the mini-batch procedure makes more progress for the same number of steps. The discussed noise can be visualized in the following plot, where we track the loss function every 50 iterations.
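The plot itself is not reproduced here; for reference, one way to record the loss curve is a small variant of the training loop (a hypothetical helper, not the author's code) that evaluates the regularized loss every 50 parameter updates:

def logistic_regression_with_history(X_train, y_train, num_steps, learning_rate,
                                     mini_batch_size, reg, log_every=50):
    # Same procedure as logistic_regression, but records the regularized loss
    # every `log_every` updates so the optimization curve can be plotted.
    n_train = X_train.shape[0]
    w = np.zeros(X_train.shape[1])
    losses, num_updates = [], 0
    for step in range(num_steps):
        permuted_idx = np.random.permutation(n_train)
        for idx in range(0, n_train, mini_batch_size):
            mini_batch_indices = permuted_idx[idx:idx + mini_batch_size]
            w = w - learning_rate * get_gradient(X_train, y_train, w,
                                                 mini_batch_indices, reg)
            if num_updates % log_every == 0:
                losses.append(compute_regularized_loss(X_train, y_train, w, reg))
            num_updates += 1
    return w, losses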
A disadvantage of the mini-batch approach is that each pass over the dataset requires looping over many small batches (as seen in our optimization procedure), performing a gradient computation and weight update for each one, in order to find a good solution. This overhead can be quantified in our case by measuring the runtime of both training procedures, offering insight into the runtime vs. convergence-speed tradeoff, where we find the following.
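The runtimes below could have been measured with something like the following sketch using Python's time module (the absolute numbers are hardware-dependent and will differ from machine to machine):

import time

start = time.perf_counter()
w_full = logistic_regression(X_train, y_train, num_steps, learning_rate,
                             mini_batch_size=X_train.shape[0], reg=reg)
print('Training runtime without minibatches:', time.perf_counter() - start, 'sec')

start = time.perf_counter()
w_minibatch = logistic_regression(X_train, y_train, num_steps, learning_rate,
                                  mini_batch_size=50, reg=reg)
print('Training runtime with minibatches:', time.perf_counter() - start, 'sec')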
>> Training runtime without minibatches: 1.1250622000079602 sec
>> Training runtime with minibatches: 1.9864051999757066 sec