In [1]:
import sys, os
import pandas as pd
from math import log
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
In [2]:
DFILE = "ml-latest-small"
CHARSET = 'utf8'
ratings = pd.read_csv(os.path.join(DFILE, 'ratings.csv'), encoding=CHARSET)
tags = pd.read_csv(os.path.join(DFILE, 'tags.csv'), encoding=CHARSET)
movies = pd.read_csv(os.path.join(DFILE, 'movies.csv'), encoding=CHARSET)

Content-Based (CB) Approach

The Content-Based Recommender relies on the similarity of the items being recommended. The basic idea is that if you like an item, then you will also like a “similar” item. It generally works well when it’s easy to determine the context/properties of each item.

Step 1. Movies to Vectors

One way to get a user profile: TF-IDF

A movie can be represented as a vector by utilizing TF-IDF

A Korean-language explanation of how to calculate TF-IDF with examples is available on wikidocs.
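
For reference, the TF-IDF weighting used in the toy example below (matching the idf definition in the code, which adds 1 to the document frequency) is

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{1 + \mathrm{df}(t)}$$

where tf(t, d) is the number of occurrences of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents.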

Let’s look at the following toy example.

In [3]:
docs = [
    'you know I want your love',
    'I like you',
    'what should I do '] 
vocab = list(set(w for doc in docs for w in doc.split()))
vocab.sort()
vocab
['I', 'do', 'know', 'like', 'love', 'should', 'want', 'what', 'you', 'your']
In [4]:
N = len(docs) 

def tf(t, d):
    # term frequency: number of (substring) occurrences of term t in document d
    return d.count(t)

def idf(t):
    # inverse document frequency; df counts the documents containing t,
    # with +1 in the denominator as in the wikidocs example
    df = 0
    for doc in docs:
        df += t in doc
    return log(N / (df + 1))

def tfidf(t, d):
    return tf(t, d) * idf(t)

result = []
for i in range(N):
    result.append([])
    d = docs[i]
    for j in range(len(vocab)):
        t = vocab[j]
        result[-1].append('{:.3f}'.format(tfidf(t,d)))

# print(result)
pd.DataFrame(result, columns = vocab)
I do know like love should want what you your
0 -0.288 0.000 0.405 0.000 0.405 0.000 0.405 0.000 0.000 0.405
1 -0.288 0.000 0.000 0.405 0.000 0.000 0.000 0.000 0.000 0.000
2 -0.288 0.405 0.000 0.000 0.000 0.405 0.000 0.405 0.000 0.000

Note that 'I' gets a negative weight because it appears in all three documents, so log(N/(df+1)) = log(3/4) < 0, and 'you' gets 0.000 even though it occurs in two documents, since log(3/3) = 0; this is a side effect of the +1 in the toy formula's denominator.

Use TfidfVectorizer from the sklearn library to easily calculate TF-IDF with [1, 2]-grams.

In [5]:
# tf = TfidfVectorizer(ngram_range=(1, 2), min_df=0).fit(docs)
swords = ['I', 'you', 'what', 'do', 'should', 'your']
tf = TfidfVectorizer(ngram_range=(1, 2), min_df=0, stop_words=swords)
# tf = TfidfVectorizer(ngram_range=(1, 2), min_df=0, stop_words='english').fit(docs)
tfidf_matrix = tf.fit_transform(docs)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tf.get_feature_names())
tfidf_df
       know  know want  like      love      want  want love
0  0.447214   0.447214   0.0  0.447214  0.447214   0.447214
1  0.000000   0.000000   1.0  0.000000  0.000000   0.000000
2  0.000000   0.000000   0.0  0.000000  0.000000   0.000000

Let’s preprocess the MovieLens dataset.
If you want more details, please see this document.

The MovieLens dataset is described as follows:
M=610 users, N=9742 movies,
100836 ratings; each movie has genre information.
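
As a quick sanity check, these numbers can be verified from the DataFrames loaded above (a sketch; counts may differ slightly with a different snapshot of ml-latest-small):

print(ratings.userId.nunique())    # number of users M (610)
print(movies.movieId.nunique())    # number of movies N (9742)
print(len(ratings))                # number of ratings (100836)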

In [6]:
tf = TfidfVectorizer(ngram_range=(1, 1), min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(movies['genres'])
tfidf_matrix.shape
(9742, 23)
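
The 23 dimensions correspond to the genre vocabulary after tokenization: the 19 MovieLens genres, with Film-Noir and Sci-Fi each split into two tokens (film/noir, sci/fi) and "(no genres listed)" contributing genres and listed (no is dropped as an English stop word). You can list them explicitly:

print(tf.get_feature_names())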
In [7]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tf.get_feature_names())
tfidf_df
action adventure animation children comedy crime documentary drama fantasy fi ... imax listed musical mystery noir romance sci thriller war western
0 0.000000 0.416846 0.516225 0.504845 0.267586 0.0 0.0 0.000000 0.482990 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
1 0.000000 0.512361 0.000000 0.620525 0.000000 0.0 0.0 0.000000 0.593662 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
2 0.000000 0.000000 0.000000 0.000000 0.570915 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.821009 0.0 0.0 0.0 0.0
3 0.000000 0.000000 0.000000 0.000000 0.505015 0.0 0.0 0.466405 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.726241 0.0 0.0 0.0 0.0
4 0.000000 0.000000 0.000000 0.000000 1.000000 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9737 0.436010 0.000000 0.614603 0.000000 0.318581 0.0 0.0 0.000000 0.575034 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
9738 0.000000 0.000000 0.682937 0.000000 0.354002 0.0 0.0 0.000000 0.638968 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
9739 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 1.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
9740 0.578606 0.000000 0.815607 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
9741 0.000000 0.000000 0.000000 0.000000 1.000000 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0

9742 rows × 23 columns

Step 2. Get Similarity Matrix

In [8]:
similarity = cosine_similarity(tfidf_matrix, tfidf_matrix) # similarity matrix
pd.DataFrame(similarity, columns=movies.movieId)
movieId 1 2 3 4 5 6 7 8 9 10 ... 193565 193567 193571 193573 193579 193581 193583 193585 193587 193609
0 1.000000 0.813578 0.152769 0.135135 0.267586 0.000000 0.152769 0.654698 0.000000 0.262413 ... 0.360397 0.465621 0.196578 0.516225 0.0 0.680258 0.755891 0.000000 0.421037 0.267586
1 0.813578 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.804715 0.000000 0.322542 ... 0.000000 0.000000 0.000000 0.000000 0.0 0.341376 0.379331 0.000000 0.000000 0.000000
2 0.152769 0.000000 1.000000 0.884571 0.570915 0.000000 1.000000 0.000000 0.000000 0.000000 ... 0.162848 0.000000 0.419413 0.000000 0.0 0.181883 0.202105 0.000000 0.000000 0.570915
3 0.135135 0.000000 0.884571 1.000000 0.505015 0.000000 0.884571 0.000000 0.000000 0.000000 ... 0.144051 0.201391 0.687440 0.000000 0.0 0.160888 0.178776 0.466405 0.000000 0.505015
4 0.267586 0.000000 0.570915 0.505015 1.000000 0.000000 0.570915 0.000000 0.000000 0.000000 ... 0.285240 0.000000 0.734632 0.000000 0.0 0.318581 0.354002 0.000000 0.000000 1.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9737 0.680258 0.341376 0.181883 0.160888 0.318581 0.239513 0.181883 0.000000 0.436010 0.241142 ... 0.599288 0.554355 0.234040 0.614603 0.0 1.000000 0.899942 0.000000 0.753553 0.318581
9738 0.755891 0.379331 0.202105 0.178776 0.354002 0.000000 0.202105 0.000000 0.000000 0.000000 ... 0.476784 0.615990 0.260061 0.682937 0.0 0.899942 1.000000 0.000000 0.557008 0.354002
9739 0.000000 0.000000 0.000000 0.466405 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.431794 0.678466 0.000000 0.0 0.000000 0.000000 1.000000 0.000000 0.000000
9740 0.421037 0.000000 0.000000 0.000000 0.000000 0.317844 0.000000 0.000000 0.578606 0.320007 ... 0.674692 0.735655 0.000000 0.815607 0.0 0.753553 0.557008 0.000000 1.000000 0.000000
9741 0.267586 0.000000 0.570915 0.505015 1.000000 0.000000 0.570915 0.000000 0.000000 0.000000 ... 0.285240 0.000000 0.734632 0.000000 0.0 0.318581 0.354002 0.000000 0.000000 1.000000

9742 rows × 9742 columns
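
cosine_similarity computes the normalized dot product between the TF-IDF rows. As a minimal check against the matrix above (a sketch; numpy is imported here since the cells above do not), the entry for Toy Story (1995) vs. Jumanji (1995) can be reproduced by hand:

import numpy as np

a = tfidf_matrix[0].toarray().ravel()   # Toy Story (1995)
b = tfidf_matrix[1].toarray().ravel()   # Jumanji (1995)
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(manual, similarity[0, 1])         # both should be ~0.813578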

Step 3. Recommend: Find Top K Similar Movies

In [9]:
idx2t = movies.title.to_dict()
t2idx = {v: k for k, v in idx2t.items()}
In [10]:
def content(title, K=10):
    """ Recommend Top K similar movies. """
    if title not in t2idx:
        result = movies[movies['title'].str.contains(title)] 
        if len(result) != 1:
            print("Given title are patially matched! re-find correct title among results!")
            return result
        else:
            print("From substring '{}', recommend movies as follows!".format(title))
            title = result.iloc[0].title
    
    scores = similarity[t2idx[title]]
    tuples = sorted(list(enumerate(scores)), key=lambda e: e[1], reverse=True)[:K]
    indices = [i for i, _ in tuples]
    return movies.iloc[indices][['title', 'genres']]
In [11]:
content('Toy Story', K=5)
Given title is partially matched! Re-find the correct title among the results!

movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2355 3114 Toy Story 2 (1999) Adventure|Animation|Children|Comedy|Fantasy
7355 78499 Toy Story 3 (2010) Adventure|Animation|Children|Comedy|Fantasy|IMAX
In [12]:
content('Toy Story 2', K=3)
From substring 'Toy Story 2', recommend movies as follows!

title genres
0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1706 Antz (1998) Adventure|Animation|Children|Comedy|Fantasy
2355 Toy Story 2 (1999) Adventure|Animation|Children|Comedy|Fantasy
In [13]:
content('Matrix, The', K=3)
From substring 'Matrix, The', recommend movies as follows!

title genres
59 Lawnmower Man 2: Beyond Cyberspace (1996) Action|Sci-Fi|Thriller
68 Screamers (1995) Action|Sci-Fi|Thriller
144 Johnny Mnemonic (1995) Action|Sci-Fi|Thriller
In [14]:
content('Saving Private Ryan', K=3)
From substring 'Saving Private Ryan', recommend movies as follows!

title genres
97 Braveheart (1995) Action|Drama|War
909 Apocalypse Now (1979) Action|Drama|War
933 Boot, Das (Boat, The) (1981) Action|Drama|War
In [15]:
content('Inception')
From substring 'Inception', recommend movies as follows!

title genres
7372 Inception (2010) Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX
6797 Watchmen (2009) Action|Drama|Mystery|Sci-Fi|Thriller|IMAX
7625 Super 8 (2011) Mystery|Sci-Fi|Thriller|IMAX
8358 RoboCop (2014) Action|Crime|Sci-Fi|IMAX
167 Strange Days (1995) Action|Crime|Drama|Mystery|Sci-Fi|Thriller
6151 V for Vendetta (2006) Action|Sci-Fi|Thriller|IMAX
6521 Transformers (2007) Action|Sci-Fi|Thriller|IMAX
7545 I Am Number Four (2011) Action|Sci-Fi|Thriller|IMAX
7866 Battleship (2012) Action|Sci-Fi|Thriller|IMAX
8151 Iron Man 3 (2013) Action|Sci-Fi|Thriller|IMAX
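
Note that the content function above sorts all 9,742 scores for every query. For larger catalogs, a cheaper alternative (a sketch, not part of the original pipeline; top_k_indices is a hypothetical helper) is to use np.argpartition to pull out the top K before sorting:

import numpy as np

def top_k_indices(scores, K=10):
    # Select the K largest entries without fully sorting the whole row.
    part = np.argpartition(scores, -K)[-K:]
    return part[np.argsort(scores[part])[::-1]]

top_k_indices(similarity[t2idx['Toy Story (1995)']], K=5)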

Step 4. Furthermore: Utilize Tags

We have to extract and preprocess tag information to implement a tag-based recommender engine.

Therefore, we will use MySQL to preprocess the records easily. If you want to know how to work with MySQL and pymysql, follow this guide on my blog.

Step 4 - 1. Generate View in MySQL

Prerequisite: In MySQL, please execute the following commands to preprocess the tag information.

create view tmp as 
    select movieId, group_concat(tag separator '|') as tag 
    from tags 
    group by movieId;

create view tmp2 as 
    select m.movieId, m.title, m.genres, t.tag 
    from tmp t, movies m 
    where t.movieId = m.movieId;
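
If you prefer to stay in Python, roughly the same preprocessing can be done directly on the tags and movies DataFrames loaded earlier (a sketch; tag_concat and df_alt are hypothetical names):

# Concatenate each movie's tags with '|' (like GROUP_CONCAT), then inner-join with movies.
tag_concat = tags.groupby('movieId')['tag'].apply('|'.join).reset_index()
df_alt = movies.merge(tag_concat, on='movieId', how='inner')[['movieId', 'title', 'genres', 'tag']]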

Step 4 - 2. Extract the View using Pymysql

In the previous step, we generated the view tmp2.
Now we will load this view into a pandas.DataFrame named df.

In [16]:
import argparse, pymysql, sys, os
import pandas as pd
from sqlalchemy import create_engine

parser = argparse.ArgumentParser()
parser.add_argument('-user', help="mysql database user", type=str, required=False, default='swyoo')
parser.add_argument('-pw', help="password", type=str, required=False, default='****')
parser.add_argument('-host', help="ip address", type=str, required=False, default='***.***.***.***')
parser.add_argument('-db', help="database name", type=str, required=False, default='movielen')
parser.add_argument('-charset', help="character set to use", type=str, required=False, default='utf8mb4')
parser.add_argument('-verbose', help="table name", type=bool, required=False, default=False)
sys.argv = ['-f']
args = parser.parse_args()

con = pymysql.connect(host=args.host, user=args.user, password=args.pw, use_unicode=True, charset=args.charset)
cursor = con.cursor()

## helper function
sql = lambda command: pd.read_sql(command, con)
def fetch(command):
    cursor.execute(command)
    return cursor.fetchall()

fetch("use {}".format(args.db))
if args.verbose: print(sql("show tables"))
In [17]:
df = sql("select * from tmp2")
df[:3]
/home/swyoo/anaconda3/envs/torch/lib/python3.7/site-packages/pymysql/cursors.py:170: Warning: (1260, 'Row 309 was cut by GROUP_CONCAT()')
  result = self._query(query)

movieId title genres tag
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy pixar|pixar|fun
1 2 Jumanji (1995) Adventure|Children|Fantasy fantasy|magic board game|Robin Williams|game
2 3 Grumpier Old Men (1995) Comedy|Romance moldy|old
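
The warning above means MySQL truncated at least one concatenated tag string at the default group_concat_max_len (1024 bytes). Since the view is evaluated at query time, raising the session limit before the select avoids the cut (the value below is an arbitrary upper bound; fetch and sql are the helpers defined above):

fetch("set session group_concat_max_len = 100000")
df = sql("select * from tmp2")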

Step 4 - 3. Calculate TF-IDF Based on Tags

In [18]:
idx2midTag = df.movieId.to_dict()
mid2idxTag = {v: k for k, v in idx2midTag.items()}
idx2mid = movies.movieId.to_dict()
mid2idx = {v: k for k, v in idx2mid.items()}
In [19]:
tf = TfidfVectorizer(ngram_range=(1, 4), min_df=0, stop_words='english')
tag_matrix = tf.fit_transform(df['tag'])
tag_df = pd.DataFrame(tag_matrix.toarray(), columns=tf.get_feature_names())
tag_df
06 06 oscar 06 oscar nominated 06 oscar nominated best 1900s 1920s 1920s gangsters 1950s 1950s adolescence 1960s ... zellweger zellweger retro zither zoe zoe kazan zombie zombies zombies zombies zooey zooey deschanel
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1567 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1568 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1569 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1570 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1571 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1572 rows × 8791 columns
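
Only 1,572 of the 9,742 movies have at least one tag, which is why this matrix is smaller than the genre-based one; the recommender below therefore falls back to genres when a movie has no tags. A quick check:

print(df.movieId.nunique(), len(movies))   # 1572 tagged movies out of 9742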

In [20]:
similarity_tag = cosine_similarity(tag_matrix, tag_matrix) # similarity matrix
pd.DataFrame(similarity_tag, columns=df.movieId)
movieId 1 2 3 5 7 11 14 16 17 21 ... 176371 176419 179401 180031 180985 183611 184471 187593 187595 193565
0 1.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000
1 0.0 1.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.091603 0.0 0.0 0.000000
2 0.0 0.000000 1.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000
3 0.0 0.000000 0.0 1.000000 0.506712 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000
4 0.0 0.000000 0.0 0.506712 1.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1567 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.077864 0.0 0.0 1.000000 0.000000 0.0 0.0 0.039869
1568 0.0 0.091603 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 1.000000 0.0 0.0 0.000000
1569 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 1.0 0.0 0.000000
1570 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 1.0 0.000000
1571 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.032774 0.0 0.0 0.039869 0.000000 0.0 0.0 1.000000

1572 rows × 1572 columns

Step 4 - 4. Recommend Top K Movies Based on Tags

In [21]:
def content(title, K=10, tag=False):
    """ Recommend Top K similar movies. """
    if title not in t2idx:
        result = movies[movies['title'].str.contains(title)] 
        if len(result) != 1:
            print("Given title are patially matched! re-find correct title among results!")
            return result
        else:
            print("From substring '{}', recommend movies as follows!".format(title))
            title = result.iloc[0].title
    idx_tfidf = t2idx[title]
    scores_tfidf = similarity[idx_tfidf]
    tuples_tfidf = sorted(list(enumerate(scores_tfidf)), key=lambda e: e[1], reverse=True)[:K]
    indices_tfidf = [i for i, _ in tuples_tfidf]
    if tag and idx2mid[idx_tfidf] in mid2idxTag:
        print("utilize tags ... ")
        idx_tag = mid2idxTag[idx2mid[idx_tfidf]]
        scores_tag = similarity_tag[idx_tag]
        tuples_tag = sorted(list(enumerate(scores_tag)), key=lambda e: e[1], reverse=True)[:K]
        indices_tag = [i for i, _ in tuples_tag]
        return df.iloc[indices_tag][['title', 'tag']]
    
    return movies.iloc[indices_tfidf][['title', 'genres']]
In [22]:
content("Toy Story 2", tag=True)
From substring 'Toy Story 2', recommend movies as follows!
utilize tags ... 

title tag
666 Toy Story 2 (1999) animation|Disney|funny|original|Pixar|sequel|T...
544 Bug's Life, A (1998) Pixar
0 Toy Story (1995) pixar|pixar|fun
909 Road to Perdition (2002) cinematography|Tom Hanks
1480 Invincible Iron Man, The (2007) animation
142 Aladdin (1992) Disney
147 Snow White and the Seven Dwarfs (1937) Disney
148 Beauty and the Beast (1991) Disney
149 Pinocchio (1940) Disney
152 Aristocats, The (1970) Disney

We can also use genres and tags at the same time.
However, I only use genres or tags separately to simplify this problem; a sketch of blending the two similarity matrices is shown below.
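
For example, one simple way to blend them (a sketch; blended_scores and alpha are hypothetical, and movies without tags keep their genre-based score):

def blended_scores(title, alpha=0.5):
    """Blend genre-based and tag-based similarity scores for a full movie title."""
    idx = t2idx[title]
    scores = similarity[idx].copy()                   # genre-based scores over all 9742 movies
    mid = idx2mid[idx]
    if mid in mid2idxTag:                             # the query movie has tags
        tag_row = similarity_tag[mid2idxTag[mid]]     # tag-based scores over the 1572 tagged movies
        for j, m in idx2midTag.items():               # map tag-matrix indices back to movie indices
            scores[mid2idx[m]] = alpha * scores[mid2idx[m]] + (1 - alpha) * tag_row[j]
    return scores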

Summary

Pros

  • No need for data on other users, thus no cold-start or sparsity problems.
  • Can recommend to users with unique tastes.
  • Can recommend new & unpopular items.
  • Can provide explanations for recommended items by listing the content features that caused an item to be recommended (in this case, movie genres).

Cons

  • Finding the appropriate features is hard.
  • Does not recommend items outside a user’s content profile.
  • Unable to exploit quality judgments of other users.
