Factorization Machines — Pictures + Code (PyTorch) | by Daniel Lam

TLDR:
Problem: Given a dataset of users, movies, and ratings. Use factorization machines to give movie recommendations.
Dataset: ml-1m.zip from https://grouplens.org/datasets/movielens/
Code: “03_factorization_machines.ipynb” https://github.com/Datadote/matrix-factorization-pytorch
Refs: 1) Factorization machine post
2) Original factorization paper by Rendle
3) Medium post on matrix factorization

Steps:
1) Describe the problem and explore dataset
2) Preprocess dataset for training and validation
3) Create factorization machine
4) Train model
5) Check results
6) Save files for Streamlit deployment

Problem: Given a dataset of users, movies, and ratings. Use factorization machines to give movie recommendations.
Dataset: ml-1m.zip from https://grouplens.org/datasets/movielens/
Data features: movies, ratings, timestamps, titles, and genres. User metata features: age, gender, occupation, zip code. Movies without ratings are removed.

# Data:     https://files.grouplens.org/datasets/movielens/ml-1m.zip
# Metadata: https://files.grouplens.org/datasets/movielens/ml-1m-README.txt
DATA_DIR = './data/ml-1m/'
df_movies = pd.read_csv(DATA_DIR+'movies.dat', sep='::',
names=['movieId', 'title','genres'],
encoding='latin-1',
engine='python')
user_cols = ['userId', 'gender' ,'age', 'occupation', 'zipcode']
df_users = pd.read_csv(DATA_DIR+'users.dat', sep='::',
header=None,
names=user_cols,
engine='python')
df = pd.read_csv(DATA_DIR+'ratings.dat', sep='::',
names=['userId','movieId','rating','time'],
engine='python')
# Left merge removes movies with no rating. # of unique movies: 3883 -> 3706
df = df.merge(df_movies, on='movieId', how='left')
df = df.merge(df_users, on='userId', how='left')
df = df.sort_values(['userId', 'time'], ascending=[True, True]).reset_index(drop=True)

i) Convert columns into categorical with defaultdict(LabelEncoder)
This remaps column range to [0, len(unique(column))]. For example, the raw max movieId is 3952, but there are only 3706 unique movieIds. This remapping (3952 -> 3706) is important for reducing memory usage in the model’s embeddings. Note the “_index” columns added. These are the labels after labelEncoding.

# Convert users and movies into categorical - use LabelEncoder
# Column categorical range remapped between [0, len(df.column.unique())-1]
# Remapping is important to reduce memory size of nn.Embeddings
d = defaultdict(LabelEncoder)
cols_cat = ['userId', 'movieId', 'gender', 'age', 'occupation']
for c in cols_cat:
d[c].fit(df[c].unique())
df[c+'_index'] = d[c].transform(df[c])
print(f'# unique {c}: {len(d[c].classes_)}')min_num_ratings = df.groupby(['userId'])['userId'].transform(len).min()
print(f'Min # of ratings per user: {min_num_ratings}')
print(f'Min/Max rating: {df.rating.min()} / {df.rating.max()}')
print(f'df.shape: {df.shape}')

ii) Add feature offsets to allow factorization machine to use 1 embedding matrix
Original paper uses 1-hot encoding, here we use ordinal encoding to reduce memory size.

For factorization machines, one additional step is needed after label encoding: adding feature offsets. By adding feature offsets, we’re able to use only 1 embedding matrix versus using multiple embedding matrix + a for loop. This is important to increase training efficiency.

# To use 1 embedding matrix, need to calculate & add offsets to each feature column
# Orig. paper uses 1-hot encoding, here we use ordinal encoding
# Ordinal encoding reduces memory size. Important for train speed
feature_cols = ['userId_index', 'movieId_index', 'gender_index', 'age_index',
'occupation_index']
# Get offsets
feature_sizes = {}
for feat in feature_cols:
feature_sizes[feat] = len(df[feat].unique())
feature_offsets = {}
NEXT_OFFSET = 0
for k,v in feature_sizes.items():
feature_offsets[k] = NEXT_OFFSET
NEXT_OFFSET += v# Add offsets to each feature column
for col in feature_cols:
df[col] = df[col].apply(lambda x: x + feature_offsets[col])
print('Offset - feature')
for k, os in feature_offsets.items():
print(f'{os:<6} - {k}')

iii) Split data into train/validation. Create MovieDatasets + dataloaders
Each user has minimum 20 ratings. Use most recent 5 ratings for validation. Use remaining data for train. Features used are 1) userId_index 2) movieId_index 3) gender_index 4) age_index 5) occupation_index

# Make train and val dataset. Use last 5 rated movies per user
# 6040 unique users. Each user has minimum 20 rated movies
THRES = 5
cols = ['rating', *feature_cols]
df_train = df[cols].groupby('userId_index').head(-THRES).reset_index(drop=True)
df_val = df[cols].groupby('userId_index').tail(THRES).reset_index(drop=True)
print(f'df_train shape: {df_train.shape}')
print(f'df_val shape: {df_val.shape}')
df_train.head(3)

Check dataloader shapes and output.

class MovieDataset(Dataset):
""" Movie DS uses x_feats and y_feat """
def __init__(self, df, x_feats, y_feat):
super().__init__()
self.df = df
self.x_feats = df[x_feats].values
self.y_rating = df[y_feat].values
def __len__(self):
return len(self.df)
def __getitem__(self, idx):
return self.x_feats[idx], self.y_rating[idx]BS = 1024
ds_train = MovieDataset(df_train, feature_cols, 'rating')
ds_val = MovieDataset(df_val, feature_cols, 'rating')
dl_train = DataLoader(ds_train, BS, shuffle=True, num_workers=2)
dl_val = DataLoader(ds_val, BS, shuffle=True, num_workers=2)
xb, yb = next(iter(dl_train))
print(xb.shape, yb.shape)
print(xb)
print(yb)

i) Why factorization machine?
Factorization machines are a general form of matrix factorization. Factorization machines also work for cold start problems by using metadata features. Matrix factorization only uses 2 features (user, movie), and takes a dot product to train a model. What if we had other features, such as age or gender? Factorization machines allow you to use any number of features to train a model. To model pairwise (feature-to-feature) interactions, we take a dot product of every feature with every other feature. Then we sum them all up.

In addition to the feature-to-feature dot products, the paper adds a global offset, and feature biases. We also include the offset and bias in our implementation. Below is the equation from the paper. “n” = number of features. “k” = feature dimension

Unoptimized equation. Time complexity: O(k*n²)

In the equation above, every feature is multiplied by every feature. If there are “n” features with “k” dimension, this leads to O(k * n²) time. Shown below, the paper derives a faster implementation with time complexity: O(k * n).

Optimized equation. Time complexity: O (k * n)

PyTorch factorization machine with optimized equation + offset + biases. Note, we use the nn.Embedding layer to handle the inputs (which are ordinally encoded). You can think of nn.Embedding layer as a combination of a one-hot vector + Linear/Dense layer. A sigmoid range function (idea taken from FastAi) is used to clamp the outputs for faster training.

class FM(nn.Module):
""" Factorization Machine + user/item bias, weight init., sigmoid_range 
Paper - https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
"""
def __init__(self, num_feats, emb_dim, init, bias, sigmoid):
super().__init__()
self.x_emb = nn.Embedding(num_feats, emb_dim)
self.bias = bias
self.sigmoid = sigmoid
if bias:
self.x_bias = nn.Parameter(torch.zeros(num_feats))
self.offset = nn.Parameter(torch.zeros(1))
if init:
self.x_emb.weight.data.uniform_(0., 0.05)def forward(self, X):
# Derived time complexity - O(nk)
x_emb = self.x_emb(X) # [bs, num_feats] -> [bs, num_feats, emb_dim]
pow_of_sum = x_emb.sum(dim=1).pow(2) # -> [bs, num_feats]
sum_of_pow = x_emb.pow(2).sum(dim=1) # -> [bs, num_feats]
fm_out = (pow_of_sum - sum_of_pow).sum(1)*0.5  # -> [bs]
if self.bias:
x_biases = self.x_bias[X].sum(1) # -> [bs]
fm_out +=  x_biases + self.offset # -> [bs]
if self.sigmoid:
return self.sigmoid_range(fm_out, low=0.5) # -> [bs]
return fm_out
def sigmoid_range(self, x, low=0, high=5.5):
""" Sigmoid function with range (low, high) """
return torch.sigmoid(x) * (high-low) + low

AdamW optimizer and mean-squared loss (MSE) are used to train the model. Hyperparameters are put into a config class (CFG) for ease of use.
*Important, to figure out the nn.Embedding size, we take the max feature number, and add 1 to it. The +1 accounts for 0-indexing

CFG = {
'lr': 0.001,
'num_epochs': 8,
'weight_decay': 0.01,
'sigmoid': True,
'bias': True,
'init': True,
}
n_feats = int(pd.concat([df_train, df_val]).max().max())
n_feats = n_feats + 1 # "+ 1" to account for 0 - indexing
mdl = FM(n_feats, emb_dim=100,
init=CFG['init'], bias=CFG['bias'], sigmoid=CFG['sigmoid'])
mdl.to(device)
opt = optim.AdamW(mdl.parameters(), lr=CFG['lr'], weight_decay=CFG['weight_decay'])
loss_fn = nn.MSELoss()
print(f'Model weights: {list(dict(mdl.named_parameters()).keys())}')

Train & validation script:

epoch_train_losses, epoch_val_losses = [], []for i in range(CFG['num_epochs']):
train_losses, val_losses = [], []
mdl.train()
for xb,yb in dl_train:
xb, yb = xb.to(device), yb.to(device, dtype=torch.float)
preds = mdl(xb)
loss = loss_fn(preds, yb)
train_losses.append(loss.item())
opt.zero_grad()
loss.backward()
opt.step()
mdl.eval()
for xb,yb in dl_val:
xb, yb = xb.to(device), yb.to(device, dtype=torch.float)
preds = mdl(xb)
loss = loss_fn(preds, yb)
val_losses.append(loss.item())
# Start logging
epoch_train_loss = np.mean(train_losses)
epoch_val_loss = np.mean(val_losses)
epoch_train_losses.append(epoch_train_loss)
epoch_val_losses.append(epoch_val_loss)
s = (f'Epoch: {i}, Train Loss: {epoch_train_loss:0.2f}, '
f'Val Loss: {epoch_val_loss:0.2f}'
)
print(s)

Let’s do some sanity checks. The model’s rating range is [0.65, 5.45], which is out of actual rating range [1, 5]. The prediction distribution looks good

# Check predictions distribution
lpreds, lratings = [], []
mdl.eval()
for xb,yb in dl_val:
xb, yb = xb.to(device), yb.to(device, dtype=torch.float)
preds = mdl(xb)
lpreds.extend(preds.detach().cpu().numpy().tolist())
lratings.extend(yb.detach().cpu().numpy().tolist())print(f'Preds min/max: {min(lpreds):0.2f} / {max(lpreds):0.2f}')
print(f'Rating min/max: {min(lratings):0.2f} / {max(lratings):0.2f}')
plt.figure(figsize=(4,2))
plt.hist(lratings, label='ratings', bins=(np.arange(1,7)-0.5),
rwidth=0.25, color='blue')
plt.hist(lpreds, label='preds', bins=20, rwidth=0.5, color='red')
plt.title('Ratings & Predictions Distribution')
plt.grid()
plt.legend();

Use TSNE to see if the trained nn.Embeddings separated for movie genres. We see Children’s, Horror, and Documentary groupings were learned

# Check TSNE for genres - Make dataframe of movie + embeddings + biases
movies = df.drop_duplicates('movieId_index').reset_index(drop=True)
movies['movieId'] = d['movieId'].transform(movies.movieId)
# Get movie embeddings and biases
idxs_movies = torch.tensor(movies['movieId_index'].values, device=device)
movie_embs = mdl.x_emb.weight[idxs_movies]
movie_biases = mdl.x_bias[idxs_movies]
movies['emb'] = movie_embs.tolist()
movies['bias'] = movie_biases.tolist()# Check TSNE, and scatter plot movie embeddings
# Movie embeddings do get separated after training
genre_cols = ['Children's', 'Horror', 'Documentary']
GENRES = '|'.join(genre_cols)
print(f'Genres: {GENRES}')
movies_subset = movies[movies['genres'].str.contains(GENRES)].copy()
X = np.stack(movies_subset['emb'].values)
ldr = TSNE(n_components=2, init='pca', learning_rate='auto', random_state=42)
Y = ldr.fit_transform(X)
movies_subset['x'] = Y[:, 0]
movies_subset['y'] = Y[:, 1]
def single_genre(genres):
""" Filter movies for genre in genre_cols"""
for genre in genre_cols:
if genre in genres: return genre
movies_subset['genres'] = movies_subset['genres'].apply(single_genre)
plt.figure(figsize=(5, 5))
ax = sns.scatterplot(x='x', y='y', hue='genres', data=movies_subset)

Get movie recommendations for “Toy Story 2 (1999)”. Cosine similarity is used to find similar movies. Note we have to do some feature offset transformations to go from nn.Embeddings to the actual labelEncoder movie index.

# Helper function/dictionaries to convert form name to labelEncoder index/label
d_name2le = dict(zip(df.title, df.movieId))
d_le2name = {v:k for k,v in d_name2le.items()}def name2itemId(names):
"""Give movie name, returns labelEncoder label. This is before adding any offset"""
if not isinstance(names, list):
names = [names]
return d['movieId'].transform([d_name2le[name] for name in names])
# Input: movie name. Output: movie recommendations using cosine similarity
IDX = name2itemId('Toy Story 2 (1999)')[0] # IDX = 2898, before offset
IDX = IDX + feature_offsets['movieId_index'] # IDX = 8938, after offset to get input movie emb
emb_toy2 = mdl.x_emb(torch.tensor(IDX, device=device))
cosine_sim = torch.tensor(
[F.cosine_similarity(emb_toy2, emb, dim=0) for emb in movie_embs]
)
top8 = cosine_sim.argsort(descending=True)[:8]
movie_sims = cosine_sim[top8]
movie_recs = movies.iloc[top8.detach().numpy()]['title'].values
for rec, sim in zip(movie_recs, movie_sims):
print(f'{sim.tolist():0.3f} - {rec}')

Helper dictionaries showing the labelEncoder user meta feature encodings.

d_age_meta = {'Under 18': 1, '18-24': 18, '25-34': 25, '35-44': 35,
'45-49': 45, '50-55': 50, '56+': 56
}
d_gender = dict(zip(d['gender'].classes_, range(len(d['gender'].classes_))))
d_age = dict(zip(d['age'].classes_, range(len(d['age'].classes_))))
print(f'Gender mapping: {d_gender}')
print(f'Age mapping: {d_age}')

Get cold start movie recommendations for a male aged 18–24. To do this, create a metadata embedding vector by adding the gender and age embedding vectors. Then we dot-product this metadata vector with each movie, add the movie biases, and sort/rank the results.

# Get cold start movie recs for a male (GENDER=1), ages 18-24 (AGE=1)
GENDER = 1
AGE = 1
gender_emb = mdl.x_emb(
torch.tensor(GENDER+feature_offsets['gender_index'], device=device)
)
age_emb = mdl.x_emb(
torch.tensor(AGE+feature_offsets['age_index'], device=device)
)
metadata_emb = gender_emb + age_emb
rankings = movie_biases + (metadata_emb*movie_embs).sum(1) # dot product
rankings = rankings.detach().cpu()
for i, movie in enumerate(movies.iloc[rankings.argsort(descending=True)]['title'].values[:10]):
print(i, movie)

Save the model, and helper utilities for deployment.

# Save files for Streamlit app deployment - https://github.com/Datadote/movieyelp
SAVE = False
if SAVE:
movie_embs_cpu = movie_embs.cpu()
d_utils = {'label_encoder': d,
'feature_offsets': feature_offsets,
'movie_embs': movie_embs_cpu,
'movies': movies,
'd_name2le': d_name2le,
}
pd.to_pickle(d_utils, 'data/d_utils.pkl', protocol=4)
mdl_scripted = torch.jit.script(mdl)
mdl_scripted.save('mdls/fm_pt.pkl')

This factorization model is deployed as an Streamlit app at movieyelp.com . Deployment code located at https://github.com/Datadote/movieyelp