K Nearest Neighbors – Deep Dive (Part 2)

The first note in this series gave an introduction to the K-Nearest Neighbour algorithm. In it, I pointed out some of the concerns around KNN and did a primer on concepts like feature scaling and Euclidean distance. In this note, we will implement a simple movie recommender system using the concepts outlined in the previous note.
Movie recommenders form part of a broader range of problems captured by K-NN Search, where the basic idea is to search for a list of items similar to a given sample.

Defining the Problem

Say we have a movie website on which we would like to recommend movies similar to the ones a user just watched. This is a great opportunity to apply K-NN Search. All we need is to come up with a vector representation of each movie, comprising features that capture its essential properties. With this representation in place, we can apply the Euclidean distance formula to gauge the similarity between movies, the general idea being that movies closer to each other in the vector space are more similar.

The first step will involve compiling our training data. For the purpose of this example, we have identified the following features we intend to capture in our movies:

  • ID
  • Year of Release
  • Comedy
  • Action
  • Drama
  • Romance
  • Horror
  • SciFi

The ID is a unique identifier for the movie; it won’t form part of our computation, since the ID is not an intrinsic property of the movie. The Year of Release captures the year the movie was released. The rest of the features are different movie genres. We assign each a value in a range (say 0-10) that specifies how well the movie fits that genre. For example, a movie like Titanic should score high in romance and drama, Commando should score high in action, and Soul Plane should score high in comedy. You get the idea.

I have concocted this training data for the sake of learning. Ideally, we should have many more training samples. Also, in a real-world problem, you’ll need to ensure the data points for the movies are accurate and free of bias.
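
For concreteness, here is a hypothetical slice of what movie-samples.csv might contain; the titles are real movies, but the scores are invented purely for illustration:

Name,ID,Year of Release,Comedy,Action,Drama,Romance,Horror,SciFi
Soul Plane,1,2004,9,2,5,1,0,0
Titanic,2,1997,1,2,9,10,0,0
Commando,3,1985,2,10,3,1,0,0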

Working through a solution

With values for the movie genres occurring on the same scale (0-10) and the Year of Release on a much larger scale, we should scale our features. But we won’t, not yet. We will first attempt a KNN Search without any kind of feature scaling.

First let’s define our distance function. Something like the one below should suffice: 

def distance(a, b):
    return math.sqrt((b['Year of Release'] - a['Year of Release']) ** 2 +
                     (b['Comedy'] - a['Comedy']) ** 2 +
                     (b['Action'] - a['Action']) ** 2 +
                     (b['Drama'] - a['Drama']) ** 2 +
                     (b['Romance'] - a['Romance']) ** 2 +
                     (b['Horror'] - a['Horror']) ** 2 +
                     (b['SciFi'] - a['SciFi']) ** 2)

The function above looks rather lengthy; we will update it shortly. But for the sake of understanding: the function accepts two parameters, each of which is a pandas Series. A Series is a one-dimensional array in which each element can be accessed by a label. Series are much more powerful than that, but all we need here is to access values in the Series, which serve as the elements of our feature vectors. The result of the function is the Euclidean distance between two movie samples.
We can greatly simplify the function above, since we know that each Series in our movie list looks something like this:

Name               Soul Plane
ID                          1
Year of Release          2004
Comedy                      9
Action                      2
Drama                       5
Romance                     1
Horror                      0
SciFi                       0
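
Assuming a variable movie holds the Series above, individual values can be read by label, and a positional slice gives us just the numeric features:

movie['Comedy']           # 9, accessed by label
movie['Year of Release']  # 2004
movie[2:]                 # positional slice: every value after Name and ID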

All we need is to extract every feature aside from the movie Name and ID, since neither the name nor the ID should influence our distance.

def distance(a, b):
    return math.sqrt(np.sum((a[2:] - b[2:]) ** 2))

Now, this is much easier on the eyes. We make use of NumPy’s support for vectorized math. The array slicing operation (a[2:]) extracts every feature in the Series except the first two (Name and ID). Next, we carry out a vectorized subtraction of the sliced a and b arrays, which subtracts their items element-wise. The exponentiation operator (**) is also vectorized, squaring each element of (a[2:] - b[2:]). We then sum the result and return its square root. This is equivalent to the earlier function, but a lot terser.
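
To see the vectorized operations at work, here is a toy example on plain NumPy arrays (not our real data):

import numpy as np

a = np.array([2004, 9, 2, 5, 1, 0, 0])
b = np.array([1997, 1, 2, 9, 10, 0, 0])

diff = a - b                   # element-wise subtraction: [7, 8, 0, -4, -9, 0, 0]
squared = diff ** 2            # element-wise squaring:    [49, 64, 0, 16, 81, 0, 0]
print(np.sqrt(squared.sum()))  # Euclidean distance, roughly 14.49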

We still need some code that will enable us to extract the movie samples from the CSV into a Pandas DataFrame. Here’s what it looks like:

frame = pd.read_csv('movie-samples.csv')

Bear in mind we’re trying to implement a search of our movie database for the movies closest to our search query. So the next step is to read the movie we intend to search with, in order to have a reference to all the features that make up our search point.

movie = frame[frame['ID'] == movieID].iloc[0]

The movieID will be supplied as a parameter. We use it to filter our data frame, selecting the rows matching the predicate, in this case ID equals movieID. The result of the expression frame[frame['ID'] == movieID] is itself a DataFrame. We use the DataFrame’s iloc indexer to access its first element, which should be the only movie with an ID equal to movieID.
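
Broken into steps, with a hypothetical movieID of 1, it looks roughly like this:

mask = frame['ID'] == 1    # a boolean Series, True where the ID matches
matches = frame[mask]      # a DataFrame containing (at most) one row
movie = matches.iloc[0]    # the first row of that DataFrame, as a Series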

After successfully extracting the movie vector, we need to compose our distance map: a mapping from each movie ID to its corresponding distance from our search movie.

def generate_distance_map(mID, frame):
    movie = frame[frame['ID'] == mID].iloc[0]
    
    distance_map = {}
    for index, row in frame.iterrows():
        if movie['ID'] != row['ID']:
            distance_map[row['ID']] = distance(movie, row)
        
    return distance_map

First, we initialize an empty dictionary, then iterate over the rows of our data frame. If a row’s ID is not equal to our search movie’s ID, we insert into the distance map a key equal to the row’s ID and a value capturing the distance between that row and the search movie, computed by the distance function we devised earlier. The loop skips the row matching mID, since we don’t want to recommend the search movie itself; we are looking for the closest matching movies. Besides, the Euclidean distance between a vector and itself is always 0, which is of little use in this context.

Next, we sort the distance map in increasing order of distance. The distances are stored as the values of the distance map dictionary. To sort it, we apply the code below.

sorted_distance_map = sorted(distance_map.items(), key=lambda item: item[1])
top_recommended = [item[0] for item in sorted_distance_map[:5]]

Each item in distance_map.items() is a tuple containing the ID as the first element and the Euclidean distance from the sample as the second. We’ve done more than just sort the distance map; we have also extracted the IDs of the first five elements in the sorted result. These are the movie IDs closest to our search sample. Next, we want to extract the frame rows with those IDs.

frame[np.isin(frame['ID'], top_recommended)]

And we’re done. Putting it all together, here’s what it looks like:

import math
import pandas as pd
import numpy as np


def distance(a, b):
    return math.sqrt(np.sum((a[2:] - b[2:]) ** 2))


def generate_distance_map(mID, frame):
    movie = frame[frame['ID'] == mID].iloc[0]
    
    distance_map = {}
    for index, row in frame.iterrows():
        if movie['ID'] != row['ID']:
            distance_map[row['ID']] = distance(movie, row)
        
    return distance_map


frame = pd.read_csv('movie-samples.csv')
movie_id = int(input('Enter Movie ID: '))

distance_map = generate_distance_map(movie_id, frame)

sorted_distance_map = sorted(distance_map.items(), key=lambda item: item[1])
top_recommended_ids = [item[0] for item in sorted_distance_map[:5]]


frame[np.isin(frame['ID'], top_recommended_ids)]


Running this code in a Jupyter notebook should output the closest samples to the inputted movie ID. But there are a few concepts we must consider:

  1. Curse of Dimensionality
  2. Feature Scaling

Curse of Dimensionality. It sounds like a rather eerie concept, but it’s fairly easy to understand. In order to correctly evaluate the distance between two samples, it is of utmost importance that we identify the essential features of a sample. Bear in mind, each feature introduces a new dimension to our problem space. A problem starts to present itself as we reach higher dimensions: the search space grows exponentially.
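
A rough way to get a feel for this growth: if we wanted samples covering each feature axis at, say, ten distinct levels, the number of regions to fill multiplies by ten with every feature we add. A back-of-the-envelope sketch (the choice of ten levels is arbitrary; our movie vector has seven numeric features):

levels = 10
for dims in [1, 2, 3, 7]:
    print(dims, levels ** dims)   # 10, 100, 1000, 10000000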

And so, in order to correctly apply KNN, we need a dense dataset, one that spreads across the entire vector space. That way, when we search for the nearest neighbors, we will find samples with very similar features. Then again, having a dense dataset with a high number of dimensions presents computational challenges: we have to compute the Euclidean distance between a search sample and every existing sample in the search space.
One approach to tackling the Curse of Dimensionality is a process called Dimensionality Reduction, where we map high-dimensional data to a smaller number of dimensions while preserving properties of the data distribution. In another note, I’ll dive into techniques for Dimensionality Reduction.
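
As a small teaser, one widely used technique is Principal Component Analysis (PCA). A minimal sketch with scikit-learn might look like the following, assuming frame is the DataFrame we loaded earlier; the choice of two components is arbitrary here:

from sklearn.decomposition import PCA

features = frame[frame.columns[2:]]   # only the numeric feature columns (skip Name and ID)
pca = PCA(n_components=2)             # project the seven features down to two dimensions
reduced = pca.fit_transform(features)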

Feature Scaling. Something we failed to do in our movie recommender example is scale the features. While the different movie genres are scored on a 0-10 scale, the movie’s Year of Release (YOR) is on a much bigger scale, meaning the YOR starts to dominate our Euclidean distance function. Imagine we had two romantic comedies, A and B. A was released in 1900, while B was released in 2000. Going by our current implementation, A will be much further from B than from an action sci-fi released in the same year as A.
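
To put rough numbers on this, here is a toy calculation with made-up genre scores:

import math

# A and B: romantic comedies with identical genre scores, released 100 years apart
a = [1900, 8, 1, 6, 9, 0, 0]   # [year, comedy, action, drama, romance, horror, scifi]
b = [2000, 8, 1, 6, 9, 0, 0]
print(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))   # 100.0, driven entirely by the year

# C: an action sci-fi, also released in 1900
c = [1900, 1, 9, 2, 1, 0, 8]
print(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, c))))   # about 16.0, so A looks "closer" to C than to B
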
This problem can be solved by scaling our features to the same range. Applying feature scaling is straightforward. In the previous post, we introduced the formula for standardization:


    \[z = \frac{x - u}{s}\]

where x is the feature value for an instance, u is the mean of that feature’s values across the sampled instances, and s is the standard deviation of those values.
We could evaluate the mean and standard deviation for each feature ourselves and apply the formula to every item. Or we could employ the StandardScaler class from the machine learning library scikit-learn, which standardizes the features in a dataset for us. The standardized features will have a mean (u) of 0 and a standard deviation (s) of 1. Here’s a code snippet that employs scikit-learn’s StandardScaler on our dataset.

import pandas as pd
from sklearn.preprocessing import StandardScaler 


raw_frame = pd.read_csv('movie-samples.csv')
frame = raw_frame[raw_frame.columns[1:]].copy()   # drop the Name column; copy so we don't modify a view of raw_frame
scaler = StandardScaler()
frame[frame.columns[1:]] = scaler.fit_transform(frame[frame.columns[1:]])

The first statement initializes a pandas frame from the CSV. That raw frame contains some data we don’t need for modeling, such as the movie name, so the next statement slices it out by selecting only the columns indexed from 1 onward, omitting the Name column.
Next, we initialize a StandardScaler, which is a component of scikit-learn, and call its fit_transform method on a subset of the sliced frame. We standardize only a subset because we don’t intend to standardize the ID column; the ID is only for referencing the instances of our dataset and does not form part of the Euclidean distance input.
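
For comparison, the same transformation can be written by hand with pandas. One detail to note: StandardScaler divides by the population standard deviation, so we pass ddof=0 to match it. This is just a sketch, not a replacement for the snippet above:

cols = frame.columns[1:]   # every column except ID
frame[cols] = (frame[cols] - frame[cols].mean()) / frame[cols].std(ddof=0)
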
Finally, since the Name column is no longer in the frame, we only need to adjust the distance function to exclude the ID value during the evaluation.

def distance(a, b):
    return math.sqrt(np.sum((a[1:] - b[1:]) ** 2))

Putting everything together, here’s what the code now looks like:

import math
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler 


def distance(a, b):
    return math.sqrt(np.sum((a[1:] - b[1:]) ** 2))


def generate_distance_map(mID, frame):
    movie = frame[frame['ID'] == mID].iloc[0]
    
    distance_map = {}
    for index, row in frame.iterrows():
        if movie['ID'] != row['ID']:
            distance_map[row['ID']] = distance(movie, row)
        
    return distance_map


raw_frame = pd.read_csv('movie-samples.csv')
frame = raw_frame[raw_frame.columns[1:]].copy()

scaler = StandardScaler()
frame[frame.columns[1:]] = scaler.fit_transform(frame[frame.columns[1:]])


movie_id = int(input('Enter Movie ID: '))

distance_map = generate_distance_map(movie_id, frame)

sorted_distance_map = sorted(distance_map.items(), key=lambda item: item[1])
top_recommended_ids = [item[0] for item in sorted_distance_map[:5]]

raw_frame[np.isin(frame['ID'], top_recommended_ids)]

Attempting a search now will give us better results than what we had before we applied standardization.

A Few Things to Note

While we were able to successfully apply the KNN algorithm to evaluate the distance between movie samples, I must point out that we used a very sparse dataset. Although this may suffice for our example implementation, in the real world we will need a much denser dataset when employing distance searches.

Also, as we add more data points, our KNN algorithm becomes less and less efficient, because a single query must compute the distance between the search item and every other item in the dataset, which may be infeasible for very large datasets. Some data structures address this shortfall of KNN, for example the Vantage Point Tree and the Ball Tree. Hopefully, I’ll get to explore these methods in later posts.
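
In the meantime, for a taste of what such index structures offer, scikit-learn’s NearestNeighbors class can build a Ball Tree over our feature matrix and answer queries without the explicit loop we wrote. A sketch, assuming X is a 2-D array holding the scaled numeric features, one row per movie:

from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=6, algorithm='ball_tree')   # 6 = the query movie itself plus 5 recommendations
nn.fit(X)                                                     # X: scaled feature matrix, one row per movie
distances, indices = nn.kneighbors(X[:1])                     # neighbours of the first movie, nearest first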

Wrapping Up

The last two posts on KNN involved diving into a simple but interesting algorithm. KNN has a number of applications in data science, but it also has its limitations. I explored an application of KNN in movie recommendation, while also looking at some of the limitations of the algorithm. In my next post on KNN, I will apply it to a classification problem, where we will search for the nearest neighbours and have them vote for the class to assign to our search data point. Till then.