I’ve always been aware of the power of XGBoost, especially with tabular datasets, but I’ve never explored its use in great detail. This article provides my comprehensive overview of XGBoost: both the algorithm and the Python package.

Ensemble Learning
Ensemble learning is a technique that combines multiple individual models, aggregating their predictions, to produce better predictions than a single model alone. XGBoost is a form of ensemble-based learning, so it feels natural to begin by describing this effective method of building models. Two common ensemble-based methods are bagging and boosting.
Bagging (Bootstrap Aggregating)
Bagging involves training the same model on multiple bootstrap samples (i.e., randomly sampled with replacement from the training data). This produces several individual models, each trained on a variation of the training set. The final prediction is the aggregate of the individual model predictions - usually their average (for regression) or a majority vote (for classification).
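As a rough sketch of the idea, here is one way bagging might look in scikit-learn, using BaggingRegressor over decision trees on an illustrative synthetic dataset (the data, hyperparameters, and variable names here are my own assumptions, not taken from a real use case):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic regression data
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree is trained on its own bootstrap sample (sampled with replacement);
# the ensemble prediction is the average of the individual tree predictions.
bagger = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # "estimator" is "base_estimator" in older scikit-learn
    n_estimators=100,                   # number of bootstrap samples / models
    bootstrap=True,                     # sample with replacement
    random_state=42,
)
bagger.fit(X_train, y_train)
print("R^2 on held-out data:", bagger.score(X_test, y_test))
```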

Boosting
Boosting is a sequential ensemble method where ‘weak’ models are trained one after another. Each new model focuses on the errors made by the previous models, placing greater weight on examples that were previously mispredicted. This ensures the ensemble incrementally improves its performance as more models are added. The final predictor is the weighted sum of all ‘weak’ models - not just the last one.

Bagging vs Boosting
| Property | Bagging | Boosting |
|---|---|---|
| Training | Parallel (independent models) | Sequential (each depends on previous) |
| Focus | Reduce variance w/ independent, diverse models | Reduce bias by iteratively improving upon weaknesses |
| How models differ | Resampled data | Error focus |
| Final prediction | Average / vote | Weighted sum |
| Typical base learner | Strong (e.g. full trees) | Weak (e.g. shallow trees/stumps) |
| Algorithms | Random Forest | AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost |
History of Boosting
AdaBoost
We begin the Boosting story with AdaBoost, where misclassified samples are assigned higher weights for the next model, so the next model pays more attention to them. This idea doesn’t generalise well to scenarios that are common today: regression, arbitrary loss functions, or sparse data.
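For intuition, here is a minimal sketch of AdaBoost with scikit-learn, using decision stumps as the weak learners on illustrative synthetic data (the dataset, settings, and variable names are my own assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic classification data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stumps are fitted sequentially; after each round, misclassified samples
# receive higher weights so the next stump focuses on them.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a decision stump
    n_estimators=200,
    random_state=0,
)
ada.fit(X_train, y_train)
print("Accuracy on held-out data:", ada.score(X_test, y_test))
```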
Gradient Boosting
Instead of simply reweighting samples to focus on mistakes, Gradient Boosting reframes boosting as an optimisation problem: gradient descent in function space. Rather than having each new model predict the target directly, each new tree predicts the residuals (mistakes) of the current ensemble’s predictions.
Example: House price prediction
As a starting point, suppose the model predicts the same value for every house, say $350k. The true prices might be:
y = [$200k, $300k, $400k, $500k]
Then the initial predictions are:
y_hat = [$350k, $350k, $350k, $350k]
and the initial residuals are:
r_i = y_i - y_hat_i = [$-150k, $-50k, $+50k, $+150k]
These residuals represent what the current model has failed to explain. Under squared error loss, they are also equal to the negative gradient of the loss with respect to the predictions.
For squared error, the loss for a single sample is:
L(y, y_hat) = 0.5 * (y - y_hat)^2
If you take the derivative of this with respect to y_hat, you get:
dL/dy_hat = (y_hat - y)
The negative gradient is therefore the residual:
-(dL/dy_hat) = y - y_hat = r_i
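To make this concrete, here is a tiny NumPy check of the house-price numbers above (purely illustrative):

```python
import numpy as np

y = np.array([200_000, 300_000, 400_000, 500_000], dtype=float)  # true prices
y_hat = np.full_like(y, 350_000.0)                               # initial constant prediction

residuals = y - y_hat          # what the current model fails to explain
neg_gradient = -(y_hat - y)    # -dL/dy_hat for L = 0.5 * (y - y_hat)^2

print(residuals)                             # [-150000.  -50000.   50000.  150000.]
print(np.allclose(residuals, neg_gradient))  # True: residual == negative gradient
```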
So when Gradient Boosting trains the next weak model (usually a shallow tree), it does **not** train it on the original house prices y. Instead:
- The features (e.g. bedrooms, size, suburb) stay the same.
- The targets become the residuals r_i.
We’re no longer predicting the full price. We’re predicting corrections that tell us how far our existing model is from a perfect one.
After training this tree h_1(x), we update the model (where F_0(x) is the initial constant prediction, $350k in our example):
F_1(x) = F_0(x) + η * h_1(x)
where η (eta) is the learning rate. Because h_1(x) approximates the negative gradient (the direction of steepest descent), adding it reduces the loss.
Gradient Boosting repeats this process:
- Compute residuals r_i = y_i - F_{m-1}(x_i) (for squared loss; more generally, use the negative gradients).
- Fit a shallow tree h_m(x) to (x_i, r_i).
- Update the model: F_m(x) = F_{m-1}(x) + η * h_m(x).
Each tree is a small correction step in the direction that most reduces the loss. This is mathematically the same idea as gradient descent, except instead of updating weights directly (like in a neural network), Gradient Boosting updates the function itself by adding trees.
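Here is a minimal from-scratch sketch of that loop for squared error, using scikit-learn’s DecisionTreeRegressor as the weak learner; the synthetic data, learning rate, and number of rounds are illustrative assumptions, and everything is evaluated on the training set only for brevity:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic regression data
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=1)

eta = 0.1        # learning rate (η)
n_rounds = 100   # number of boosting rounds (trees)

# F_0: a constant prediction; the mean minimises squared error
prediction = np.full(y.shape, y.mean())
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                 # negative gradient for squared loss
    tree = DecisionTreeRegressor(max_depth=2)  # shallow "weak" learner
    tree.fit(X, residuals)                     # fit the tree to the residuals
    prediction += eta * tree.predict(X)        # F_m(x) = F_{m-1}(x) + η * h_m(x)
    trees.append(tree)

print("Training MSE:", np.mean((y - prediction) ** 2))
```

At prediction time you would replay the same sum: start from the constant F_0 and add η times each stored tree’s prediction.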
Analogy
Think of writing an essay and getting feedback:
- You write a first draft (initial model).
- The teacher marks mistakes and comments (these are like residuals).
- You fix only those mistakes (fit a model to the residuals).
- You submit again and get a new set of corrections.
- You fix those and repeat.
You don’t rewrite the entire essay from scratch each time — you only correct what remains wrong. After enough rounds of focused corrections, the essay becomes strong.
Gradient Boosting works the same way: each tree is a small correction pass over the current model. Individually, the trees are weak, but together they form a powerful predictor.
eXtreme Gradient Boosting (XGBoost)
XGBoost is an industry favourite for its speed, scalability, and flexibility. It extends the classic Gradient Boosting idea of having each new tree predict the residuals (mistakes) of the current ensemble’s predictions with:
- Regularisation - L1/L2 penalties on leaf weights and a penalty on tree complexity, which help prevent overfitting.
- Optimisation - a second-order (gradient and Hessian) approximation of the loss, plus parallelised, cache-aware tree construction.
- Flexibility - custom objectives and evaluation metrics, and built-in handling of missing values.
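To see the package in action, here is a minimal usage sketch with the scikit-learn-style XGBRegressor wrapper from the xgboost package; the synthetic data and hyperparameter values are illustrative only, not recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Illustrative synthetic regression data
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

model = XGBRegressor(
    n_estimators=500,      # number of boosting rounds (trees)
    learning_rate=0.05,    # η, shrinks each tree's contribution
    max_depth=4,           # depth of each weak tree
    subsample=0.8,         # row subsampling per tree
    colsample_bytree=0.8,  # column subsampling per tree
    reg_lambda=1.0,        # L2 regularisation on leaf weights
)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```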
Here is an excellent article that describes XGBoost under the hood.
Here is another article that details practical applications of XGBoost and Optuna for hyperparameter tuning.