I am an early career data scientist and a former college athlete. Living in Arizona and needing a competitive fix in the absence of team sports, I have predictably taken up golf. With this new interest and my continuously developing data science skill set, I decided golf would be the perfect subject for my next project.

As I started my golf journey, a common refrain I heard on the course was “Drive for show; putt for dough”. The idea is that hitting the ball high and far off the tee is impressive, but being able to sink your putts is where the money is made.

I wanted to dig a little deeper into this old-timer wisdom and see if there was any truth to it.

I pulled PGA Tour data for the 2022 season from ESPN. This immediately introduces selection bias: if we’re investigating the truth behind “Drive for show; putt for dough” for all golfers, then the top 200 PGA earners are far from a representative sample. It is important to keep that distinction in mind for the rest of this investigation. Findings based on this data should only be generalized to top tour professionals.

## The Plan

My plan for this investigation is relatively straightforward. First, create a machine-learning model that projects earnings based on player statistics. Then, calculate feature importance to see which statistics are most important in accurately predicting a golfer’s earnings.

The goal of this investigation is to answer the question of whether driving or putting is more important for making money. The above process should accomplish this goal.

I would like to stress the importance of discipline in the interpretation of different model outputs. Too often we can get excited by the outputs of fancy tools and rush to conclusions before critically thinking about what the output means.

It is easy to lie with statistics. Sometimes people do this knowingly, by selecting methodologies that generate outputs that support their existing position. Other times people do this unknowingly when they don’t fully understand the mechanics behind their methodologies and make a mistake in their interpretation.

## Execution

First I defined my target variable based on the question I was looking to answer. This is an important step that is easy to overlook. My question is which skill makes the most money, not which skill wins the most tournaments or generates the most birdies. Even though those outcomes are correlated, I want to be very specific about ensuring my target aligns with the question I’m looking to answer. So my target variable was the average earnings per event the player participated in.
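As a sketch of that target definition (the column names `EARNINGS` and `EVENTS` are hypothetical here; the scraped ESPN table may label them differently):

```python
import pandas as pd

# Hypothetical schema standing in for the scraped ESPN table
df = pd.DataFrame({
    'PLAYER': ['Player A', 'Player B'],
    'EARNINGS': [1_000_000.0, 450_000.0],
    'EVENTS': [20, 18],
})

# Target: average earnings per event the player participated in
df['EarnPerEvent'] = df['EARNINGS'] / df['EVENTS']
print(df[['PLAYER', 'EarnPerEvent']])
```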

With my target defined, I started with a brief EDA (exploratory data analysis). To start, I looked at the summary statistics of the target variable (earnings per event).

```python
df['EarnPerEvent'].describe()
```

```
count       200.000000
mean      87799.921100
std       97377.659397
min        8431.480000
25%       29670.155000
50%       52589.805000
75%      102830.462500
max      561876.400000
```

I observe a right-skewed distribution: the arithmetic mean exceeds the median, indicating right skew, and the magnitude of the maximum hints at the existence of outliers.
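One way to quantify that impression is pandas’ `.skew()`. Here is a minimal sketch on synthetic data (a lognormal stand-in for the real earnings column, which I don’t reproduce here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed "earnings": a lognormal mimics a tiered prize structure
earnings = pd.Series(rng.lognormal(mean=11, sigma=1, size=200))

print(earnings.mean() > earnings.median())  # mean pulled up by the long tail
print(earnings.skew())                      # positive value => right skew
```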

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.boxplot(df['EarnPerEvent'])
plt.xlabel('PGA Top 200 Earners of 2022')
plt.ylabel('Earnings Per Event')
plt.title('Earnings Per Event Box and Whisker Plot')
plt.show()
```

The box and whisker plot confirms the existence of outliers. You can see Scottie Scheffler as the circle at the top averaging over half a million dollars per event he played in.

What should be done about the outliers? I’m leaving them in. The high earnings reflect the prize money structure: prizes are tiered in a nonlinear fashion. Since the outliers reflect an intentional design choice of the game, applied fairly to all participants, it makes sense to leave them in.

After investigating the target variable, I created a correlation matrix with the feature inputs. I found a strong correlation among all the statistical categories.

This makes sense. Professional golfers excel at putting, driving, hitting greens in regulation, and any other statistical category that measures competency. We wouldn’t expect to see a random distribution of these stats that we use as feature inputs for our model.

The strong correlation among the independent variables would create multicollinearity concerns in a linear model. To avoid that, I instead opted to use a tree-based model.
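A minimal sketch of that correlation check, using synthetic stats driven by a shared “overall skill” factor (the column names mirror the feature dictionary later in the article, but the numbers are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# A shared latent "skill" factor makes the stats correlate, as we might
# expect among tour professionals (lower PUTTS is better, hence the sign flip)
skill = rng.normal(size=200)
stats = pd.DataFrame({
    'PUTTS': -0.8 * skill + rng.normal(scale=0.5, size=200),
    'GIR':    0.8 * skill + rng.normal(scale=0.5, size=200),
    'DDIS':   0.7 * skill + rng.normal(scale=0.6, size=200),
})

corr = stats.corr()
print(corr.round(2))
```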

Before building the model I need to create the training and testing data sets. I included one extra consideration when splitting. Due to the presence of outliers and the small size of my data set (200 records), I felt it prudent to use a quartile cut. This means I grouped my data into four groups based on the target (earnings per event) and made sure each of these groups was proportionally represented on each side of the split.

```python
from sklearn.model_selection import train_test_split

# Use qcut with q=4 to bin the data into quartiles of the dependent
# variable for a stratified train/test split
df_reduced['qcut'] = pd.qcut(df_reduced['EarnPerEvent'], 4, labels=False)

training_data, testing_data = train_test_split(
    df_reduced, test_size=0.2, random_state=42, stratify=df_reduced['qcut'])

X_train = training_data[features]
y_train = training_data[target]
X_test = testing_data[features]
y_test = testing_data[target]
```

Now I’m ready to build the model. I start with the most basic XGBRegressor:

```python
import xgboost as xg
from sklearn.metrics import mean_squared_error, r2_score

xg_regressor = xg.XGBRegressor()
xg_model = xg_regressor.fit(X_train, y_train)
y_pred = xg_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("Mean squared error on test set is: ", mse)

r2 = r2_score(y_test, y_pred)
print("R squared score on test set is: ", r2)
```

```
Mean squared error on test set is:  3222627145.0248227
R squared score on test set is:  0.6887608434100394
```

I use r-squared and mean squared error as my model performance evaluation metrics. With the default settings, the model performs nicely: about 69% of the variance in the target can be explained by the variance in the feature set.

We can do better. Next, I use grid search to pick better hyperparameters for the model and then I run the tuned model.

```python
from sklearn.model_selection import GridSearchCV

params = {'eta': [0.2],
          'max_depth': [10, 20, 100],
          'reg_lambda': [1.3, 1.4, 1.5, 1.6, 1.7],
          'n_estimators': [100, 200, 500]}

reg = GridSearchCV(estimator=xg_regressor,
                   param_grid=params,
                   scoring='r2',
                   verbose=0)
reg.fit(X_train, y_train)
```

```python
# Manually input best params into the regressor
xg_regressor_tuned = xg.XGBRegressor(n_estimators=500, max_depth=20,
                                     eta=0.2, reg_lambda=1.4)
xg_model_tuned = xg_regressor_tuned.fit(X_train, y_train)
y_pred_tuned = xg_model_tuned.predict(X_test)

mse = mean_squared_error(y_test, y_pred_tuned)
print("Mean squared error on test set is: ", mse)

r2 = r2_score(y_test, y_pred_tuned)
print("R squared score on test set is: ", r2)
```

```
Mean squared error on test set is:  2321677595.292609
R squared score on test set is:  0.7757739433963858
```

Great! Our performance metrics have improved, and although not shown here, the performance on the test set closely matches the performance on the training set. A large discrepancy between the two would signal over- or underfitting, which we don’t want.
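A sketch of that train-versus-test comparison on synthetic data; scikit-learn’s GradientBoostingRegressor stands in for the XGBoost model here so the snippet is self-contained:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in data: 5 features, a linear signal, and some noise
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

# A large gap between the two scores would signal overfitting
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```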

Now that we can predict how much a golfer would make based on their stats, let’s see which of the stats are most important in the generation of that prediction.

For my feature importance rankings, I’m going to be looking at SHAP values. To learn more about SHAP values I highly recommend Christoph Molnar’s book on interpretable machine learning.

Molnar, C. (2022). *Interpretable Machine Learning: A Guide for Making Black Box Models Explainable* (2nd ed.). christophm.github.io/interpretable-ml-book/

SHAP values are based on cooperative game theory. The general idea behind a SHAP value is to assign each feature a value indicating how much it contributed to the prediction versus a baseline reference value. SHAP values take into account all possible feature combinations including the presence and absence of a feature.

This value is enough to compare relative importance among features. We can use it to see whether putting stats or driving stats matter more to the output of the model. We cannot interpret it like a regression coefficient and say that an X decrease in the number of putts leads to a Y increase in earnings per event.
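To make the game-theory idea concrete, here is a toy Shapley computation for a two-feature additive model, written in plain Python with no SHAP library needed. Each feature’s value is its average marginal contribution across all feature orderings:

```python
from itertools import permutations

# Toy model: additive in two features, so the exact Shapley values are easy to verify
def f(x1, x2):
    return 2 * x1 + 3 * x2

baseline = {'x1': 0.0, 'x2': 0.0}   # reference point (the "absent" value)
instance = {'x1': 1.0, 'x2': 2.0}   # the prediction we want to explain

def value(coalition):
    # Features in the coalition take the instance's value; others the baseline
    z = {k: (instance[k] if k in coalition else baseline[k]) for k in instance}
    return f(z['x1'], z['x2'])

# Shapley value: average marginal contribution over all feature orderings
shap_vals = {k: 0.0 for k in instance}
orderings = list(permutations(instance))
for order in orderings:
    coalition = set()
    for feat in order:
        before = value(coalition)
        coalition.add(feat)
        shap_vals[feat] += (value(coalition) - before) / len(orderings)

print(shap_vals)  # additive model: contributions are 2*1 = 2 and 3*2 = 6
```

For this additive model the contributions sum exactly to the gap between the prediction and the baseline, which is the efficiency property SHAP guarantees in general.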

Here are the SHAP values for our model:

```python
import shap

explainer = shap.Explainer(xg_model_tuned.predict, X_test)
shap_values = explainer(X_test)
shap.plots.bar(shap_values, max_display=10)
```

A quick feature dictionary as we begin to discuss results:

- TOP10: The number of top 10 finishes
- WINS: Number of wins
- PUTTS: Average putts per hole
- DACC: Driving accuracy percentage (fairways hit)
- BIRDS: Birdies per round
- DDIS: Average driving distance
- CUTS: Number of cuts made
- SAND: Sand save percentage
- GIR: Greens in regulation

From the plot above we can see that PUTTS is a relatively more important feature than both DACC and DDIS when predicting earnings per event. This supports the saying “Drive for show; putt for dough”.

For the pros: “Drive for show; putt for dough” can be supported by the data.

There is room for discussion about whether the putting statistics fairly represent a golfer’s ability to putt. For example, golfers who struggle to hit greens in regulation may then chip close and putt less. These golfers would post lower putts-per-hole numbers by virtue of their poor approaches, not their skill with the putter.

Another discussion could take us back to our sampling note at the beginning of the article. This data set is made up of only professionals. Sure, putting separates the pros from each other, but maybe it’s driving ability that gets them on tour in the first place.

These are great points. I firmly believe that findings from machine learning projects like this need to be appropriately interpreted and viewed in the proper context.

Thank you for taking the time to read this article! I start these projects to blend my passion for sports with my data skill set, and they satisfy my personal curiosity. More than that, I enjoy writing these articles and sharing my findings to have discussions and engage with a community of curious people. Please feel free to comment on this post with your thoughts or questions about the project.

I’ll also plug my LinkedIn. If you’d like to connect or continue the conversation, please send an invite!