In [1]:
import sys, os
import pandas as pd
from math import log
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
In [2]:
DFILE = "ml-latest-small"
CHARSET = 'utf8'
ratings = pd.read_csv(os.path.join(DFILE, 'ratings.csv'), encoding=CHARSET)
tags = pd.read_csv(os.path.join(DFILE, 'tags.csv'), encoding=CHARSET)
movies = pd.read_csv(os.path.join(DFILE, 'movies.csv'), encoding=CHARSET)

Content-Based (CB) Approach

The Content-Based Recommender relies on the similarity of the items being recommended. The basic idea is that if you like an item, then you will also like a “similar” item. It generally works well when it’s easy to determine the context/properties of each item.

Step 1. Movies to Vectors

One way to get a user profile: TF-IDF

A movie can be represented as a vector by utilizing TF-IDF

A Korean-language explanation of how to calculate TF-IDF with examples is available on wikidocs.
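
For reference, the TF-IDF weighting used in the toy example below (matching the idf definition in the code, which adds 1 to the document frequency) is

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{1 + \mathrm{df}(t)}$$

where tf(t, d) is the number of occurrences of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents.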

Let’s look at the following toy example.

In [3]:
docs = [
    'you know I want your love',
    'I like you',
    'what should I do '] 
vocab = list(set(w for doc in docs for w in doc.split()))
vocab.sort()
vocab
['I', 'do', 'know', 'like', 'love', 'should', 'want', 'what', 'you', 'your']
In [4]:
N = len(docs) 

def tf(t, d):
    # term frequency: number of (substring) occurrences of term t in document d
    return d.count(t)

def idf(t):
    # inverse document frequency; df counts the documents containing t,
    # with +1 in the denominator as in the wikidocs example
    df = 0
    for doc in docs:
        df += t in doc
    return log(N / (df + 1))

def tfidf(t, d):
    return tf(t, d) * idf(t)

result = []
for i in range(N):
    result.append([])
    d = docs[i]
    for j in range(len(vocab)):
        t = vocab[j]
        result[-1].append('{:.3f}'.format(tfidf(t,d)))

# print(result)
pd.DataFrame(result, columns = vocab)
I do know like love should want what you your
0 -0.288 0.000 0.405 0.000 0.405 0.000 0.405 0.000 0.000 0.405
1 -0.288 0.000 0.000 0.405 0.000 0.000 0.000 0.000 0.000 0.000
2 -0.288 0.405 0.000 0.000 0.000 0.405 0.000 0.405 0.000 0.000

Note that 'I' gets a negative weight because it appears in all three documents, so log(N/(df+1)) = log(3/4) < 0, and 'you' gets 0.000 even though it occurs in two documents, since log(3/3) = 0; this is a side effect of the +1 in the toy formula's denominator.

Use TfidfVectorizer from the sklearn library to easily calculate TF-IDF with [1, 2]-grams.

In [5]:
# tf = TfidfVectorizer(ngram_range=(1, 2), min_df=0).fit(docs)
swords = ['I', 'you', 'what', 'do', 'should', 'your']
tf = TfidfVectorizer(ngram_range=(1, 2), min_df=0, stop_words=swords)
# tf = TfidfVectorizer(ngram_range=(1, 2), min_df=0, stop_words='english').fit(docs)
tfidf_matrix = tf.fit_transform(docs)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tf.get_feature_names())
tfidf_df
       know  know want  like      love      want  want love
0  0.447214   0.447214   0.0  0.447214  0.447214   0.447214
1  0.000000   0.000000   1.0  0.000000  0.000000   0.000000
2  0.000000   0.000000   0.0  0.000000  0.000000   0.000000

Let’s preprocess the MovieLens dataset.
If you want more details, please see this document.

The MovieLens dataset is described as follows:
M=610 users, N=9742 movies,
100836 ratings; each movie has genre information.
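
As a quick sanity check, these numbers can be verified from the DataFrames loaded above (a sketch; counts may differ slightly with a different snapshot of ml-latest-small):

print(ratings.userId.nunique())    # number of users M (610)
print(movies.movieId.nunique())    # number of movies N (9742)
print(len(ratings))                # number of ratings (100836)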

In [6]:
tf = TfidfVectorizer(ngram_range=(1, 1), min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(movies['genres'])
tfidf_matrix.shape
(9742, 23)
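
The 23 dimensions correspond to the genre vocabulary after tokenization: the 19 MovieLens genres, with Film-Noir and Sci-Fi each split into two tokens (film/noir, sci/fi) and "(no genres listed)" contributing genres and listed (no is dropped as an English stop word). You can list them explicitly:

print(tf.get_feature_names())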
In [7]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tf.get_feature_names())
tfidf_df
action adventure animation children comedy crime documentary drama fantasy fi ... imax listed musical mystery noir romance sci thriller war western
0 0.000000 0.416846 0.516225 0.504845 0.267586 0.0 0.0 0.000000 0.482990 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
1 0.000000 0.512361 0.000000 0.620525 0.000000 0.0 0.0 0.000000 0.593662 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
2 0.000000 0.000000 0.000000 0.000000 0.570915 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.821009 0.0 0.0 0.0 0.0
3 0.000000 0.000000 0.000000 0.000000 0.505015 0.0 0.0 0.466405 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.726241 0.0 0.0 0.0 0.0
4 0.000000 0.000000 0.000000 0.000000 1.000000 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9737 0.436010 0.000000 0.614603 0.000000 0.318581 0.0 0.0 0.000000 0.575034 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
9738 0.000000 0.000000 0.682937 0.000000 0.354002 0.0 0.0 0.000000 0.638968 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
9739 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 1.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
9740 0.578606 0.000000 0.815607 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
9741 0.000000 0.000000 0.000000 0.000000 1.000000 0.0 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0

9742 rows × 23 columns

Step 2. Get Similarity Matrix

In [8]:
similarity = cosine_similarity(tfidf_matrix, tfidf_matrix) # similarity matrix
pd.DataFrame(similarity, columns=movies.movieId)
movieId 1 2 3 4 5 6 7 8 9 10 ... 193565 193567 193571 193573 193579 193581 193583 193585 193587 193609
0 1.000000 0.813578 0.152769 0.135135 0.267586 0.000000 0.152769 0.654698 0.000000 0.262413 ... 0.360397 0.465621 0.196578 0.516225 0.0 0.680258 0.755891 0.000000 0.421037 0.267586
1 0.813578 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.804715 0.000000 0.322542 ... 0.000000 0.000000 0.000000 0.000000 0.0 0.341376 0.379331 0.000000 0.000000 0.000000
2 0.152769 0.000000 1.000000 0.884571 0.570915 0.000000 1.000000 0.000000 0.000000 0.000000 ... 0.162848 0.000000 0.419413 0.000000 0.0 0.181883 0.202105 0.000000 0.000000 0.570915
3 0.135135 0.000000 0.884571 1.000000 0.505015 0.000000 0.884571 0.000000 0.000000 0.000000 ... 0.144051 0.201391 0.687440 0.000000 0.0 0.160888 0.178776 0.466405 0.000000 0.505015
4 0.267586 0.000000 0.570915 0.505015 1.000000 0.000000 0.570915 0.000000 0.000000 0.000000 ... 0.285240 0.000000 0.734632 0.000000 0.0 0.318581 0.354002 0.000000 0.000000 1.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9737 0.680258 0.341376 0.181883 0.160888 0.318581 0.239513 0.181883 0.000000 0.436010 0.241142 ... 0.599288 0.554355 0.234040 0.614603 0.0 1.000000 0.899942 0.000000 0.753553 0.318581
9738 0.755891 0.379331 0.202105 0.178776 0.354002 0.000000 0.202105 0.000000 0.000000 0.000000 ... 0.476784 0.615990 0.260061 0.682937 0.0 0.899942 1.000000 0.000000 0.557008 0.354002
9739 0.000000 0.000000 0.000000 0.466405 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.431794 0.678466 0.000000 0.0 0.000000 0.000000 1.000000 0.000000 0.000000
9740 0.421037 0.000000 0.000000 0.000000 0.000000 0.317844 0.000000 0.000000 0.578606 0.320007 ... 0.674692 0.735655 0.000000 0.815607 0.0 0.753553 0.557008 0.000000 1.000000 0.000000
9741 0.267586 0.000000 0.570915 0.505015 1.000000 0.000000 0.570915 0.000000 0.000000 0.000000 ... 0.285240 0.000000 0.734632 0.000000 0.0 0.318581 0.354002 0.000000 0.000000 1.000000

9742 rows × 9742 columns
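
cosine_similarity computes the normalized dot product between the TF-IDF rows. As a minimal check against the matrix above (a sketch; numpy is imported here since the cells above do not), the entry for Toy Story (1995) vs. Jumanji (1995) can be reproduced by hand:

import numpy as np

a = tfidf_matrix[0].toarray().ravel()   # Toy Story (1995)
b = tfidf_matrix[1].toarray().ravel()   # Jumanji (1995)
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(manual, similarity[0, 1])         # both should be ~0.813578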

Step 3. Recommend: Find Top K Similar Movies

In [9]:
idx2t = movies.title.to_dict()
t2idx = {v: k for k, v in idx2t.items()}
In [10]:
def content(title, K=10):
    """ Recommend Top K similar movies. """
    if title not in t2idx:
        result = movies[movies['title'].str.contains(title)] 
        if len(result) != 1:
            print("Given title are patially matched! re-find correct title among results!")
            return result
        else:
            print("From substring '{}', recommend movies as follows!".format(title))
            title = result.iloc[0].title
    
    scores = similarity[t2idx[title]]
    tuples = sorted(list(enumerate(scores)), key=lambda e: e[1], reverse=True)[:K]
    indices = [i for i, _ in tuples]
    return movies.iloc[indices][['title', 'genres']]
In [11]:
content('Toy Story', K=5)
Given title is partially matched! Re-find the correct title among the results!

movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2355 3114 Toy Story 2 (1999) Adventure|Animation|Children|Comedy|Fantasy
7355 78499 Toy Story 3 (2010) Adventure|Animation|Children|Comedy|Fantasy|IMAX
In [12]:
content('Toy Story 2', K=3)
From substring 'Toy Story 2', recommend movies as follows!

title genres
0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1706 Antz (1998) Adventure|Animation|Children|Comedy|Fantasy
2355 Toy Story 2 (1999) Adventure|Animation|Children|Comedy|Fantasy
In [13]:
content('Matrix, The', K=3)
From substring 'Matrix, The', recommend movies as follows!

title genres
59 Lawnmower Man 2: Beyond Cyberspace (1996) Action|Sci-Fi|Thriller
68 Screamers (1995) Action|Sci-Fi|Thriller
144 Johnny Mnemonic (1995) Action|Sci-Fi|Thriller
In [14]:
content('Saving Private Ryan', K=3)
From substring 'Saving Private Ryan', recommend movies as follows!

title genres
97 Braveheart (1995) Action|Drama|War
909 Apocalypse Now (1979) Action|Drama|War
933 Boot, Das (Boat, The) (1981) Action|Drama|War
In [15]:
content('Inception')
From substring 'Inception', recommend movies as follows!

title genres
7372 Inception (2010) Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX
6797 Watchmen (2009) Action|Drama|Mystery|Sci-Fi|Thriller|IMAX
7625 Super 8 (2011) Mystery|Sci-Fi|Thriller|IMAX
8358 RoboCop (2014) Action|Crime|Sci-Fi|IMAX
167 Strange Days (1995) Action|Crime|Drama|Mystery|Sci-Fi|Thriller
6151 V for Vendetta (2006) Action|Sci-Fi|Thriller|IMAX
6521 Transformers (2007) Action|Sci-Fi|Thriller|IMAX
7545 I Am Number Four (2011) Action|Sci-Fi|Thriller|IMAX
7866 Battleship (2012) Action|Sci-Fi|Thriller|IMAX
8151 Iron Man 3 (2013) Action|Sci-Fi|Thriller|IMAX
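
Note that the content function above sorts all 9,742 scores for every query. For larger catalogs, a cheaper alternative (a sketch, not part of the original pipeline; top_k_indices is a hypothetical helper) is to use np.argpartition to pull out the top K before sorting:

import numpy as np

def top_k_indices(scores, K=10):
    # Select the K largest entries without fully sorting the whole row.
    part = np.argpartition(scores, -K)[-K:]
    return part[np.argsort(scores[part])[::-1]]

top_k_indices(similarity[t2idx['Toy Story (1995)']], K=5)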

Step 4. Furthermore: Utilize Tags

We have to extract and preprocess tag information to implement a tag-based recommender engine.

Therefore, we will use MySQL to preprocess the records easily. If you want to know how to work with MySQL and pymysql, follow this guide on my blog.

Step 4 - 1. Generate View in MySQL

Prerequisite: In MySQL, please execute the following commands to preprocess the tag information.

create view tmp as 
    select movieId, group_concat(tag separator '|') as tag 
    from tags 
    group by movieId;

create view tmp2 as 
    select m.movieId, m.title, m.genres, t.tag 
    from tmp t, movies m 
    where t.movieId = m.movieId;
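
If you prefer to stay in Python, roughly the same preprocessing can be done directly on the tags and movies DataFrames loaded earlier (a sketch; tag_concat and df_alt are hypothetical names):

# Concatenate each movie's tags with '|' (like GROUP_CONCAT), then inner-join with movies.
tag_concat = tags.groupby('movieId')['tag'].apply('|'.join).reset_index()
df_alt = movies.merge(tag_concat, on='movieId', how='inner')[['movieId', 'title', 'genres', 'tag']]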

Step 4 - 2. Extract the View using Pymysql

In the previous step, we generated the view tmp2.
Now we will load this view into a pandas.DataFrame named df.

In [16]:
import argparse, pymysql, sys, os
import pandas as pd
from sqlalchemy import create_engine

parser = argparse.ArgumentParser()
parser.add_argument('-user', help="mysql database user", type=str, required=False, default='swyoo')
parser.add_argument('-pw', help="password", type=str, required=False, default='****')
parser.add_argument('-host', help="ip address", type=str, required=False, default='***.***.***.***')
parser.add_argument('-db', help="database name", type=str, required=False, default='movielen')
parser.add_argument('-charset', help="character set to use", type=str, required=False, default='utf8mb4')
parser.add_argument('-verbose', help="table name", type=bool, required=False, default=False)
sys.argv = ['-f']
args = parser.parse_args()

con = pymysql.connect(host=args.host, user=args.user, password=args.pw, use_unicode=True, charset=args.charset)
cursor = con.cursor()

## helper function
sql = lambda command: pd.read_sql(command, con)
def fetch(command):
    cursor.execute(command)
    return cursor.fetchall()

fetch("use {}".format(args.db))
if args.verbose: print(sql("show tables"))
In [17]:
df = sql("select * from tmp2")
df[:3]
/home/swyoo/anaconda3/envs/torch/lib/python3.7/site-packages/pymysql/cursors.py:170: Warning: (1260, 'Row 309 was cut by GROUP_CONCAT()')
  result = self._query(query)

movieId title genres tag
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy pixar|pixar|fun
1 2 Jumanji (1995) Adventure|Children|Fantasy fantasy|magic board game|Robin Williams|game
2 3 Grumpier Old Men (1995) Comedy|Romance moldy|old
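
The warning above means MySQL truncated at least one concatenated tag string at the default group_concat_max_len (1024 bytes). Since the view is evaluated at query time, raising the session limit before the select avoids the cut (the value below is an arbitrary upper bound; fetch and sql are the helpers defined above):

fetch("set session group_concat_max_len = 100000")
df = sql("select * from tmp2")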

Step 4 - 3. Calculate TF-IDF Based on Tags

In [18]:
idx2midTag = df.movieId.to_dict()
mid2idxTag = {v: k for k, v in idx2midTag.items()}
idx2mid = movies.movieId.to_dict()
mid2idx = {v: k for k, v in idx2mid.items()}
In [19]:
tf = TfidfVectorizer(ngram_range=(1, 4), min_df=0, stop_words='english')
tag_matrix = tf.fit_transform(df['tag'])
tag_df = pd.DataFrame(tag_matrix.toarray(), columns=tf.get_feature_names())
tag_df
06 06 oscar 06 oscar nominated 06 oscar nominated best 1900s 1920s 1920s gangsters 1950s 1950s adolescence 1960s ... zellweger zellweger retro zither zoe zoe kazan zombie zombies zombies zombies zooey zooey deschanel
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1567 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1568 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1569 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1570 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1571 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1572 rows × 8791 columns
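
Only 1,572 of the 9,742 movies have at least one tag, which is why this matrix is smaller than the genre-based one; the recommender below therefore falls back to genres when a movie has no tags. A quick check:

print(df.movieId.nunique(), len(movies))   # 1572 tagged movies out of 9742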

In [20]:
similarity_tag = cosine_similarity(tag_matrix, tag_matrix) # similarity matrix
pd.DataFrame(similarity_tag, columns=df.movieId)
movieId 1 2 3 5 7 11 14 16 17 21 ... 176371 176419 179401 180031 180985 183611 184471 187593 187595 193565
0 1.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000
1 0.0 1.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.091603 0.0 0.0 0.000000
2 0.0 0.000000 1.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000
3 0.0 0.000000 0.0 1.000000 0.506712 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000
4 0.0 0.000000 0.0 0.506712 1.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1567 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.077864 0.0 0.0 1.000000 0.000000 0.0 0.0 0.039869
1568 0.0 0.091603 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 1.000000 0.0 0.0 0.000000
1569 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 1.0 0.0 0.000000
1570 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 1.0 0.000000
1571 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.032774 0.0 0.0 0.039869 0.000000 0.0 0.0 1.000000

1572 rows × 1572 columns

Step 4 - 4. Recommend Top K Movies Based on Tags

In [21]:
def content(title, K=10, tag=False):
    """ Recommend Top K similar movies. """
    if title not in t2idx:
        result = movies[movies['title'].str.contains(title)] 
        if len(result) != 1:
            print("Given title are patially matched! re-find correct title among results!")
            return result
        else:
            print("From substring '{}', recommend movies as follows!".format(title))
            title = result.iloc[0].title
    idx_tfidf = t2idx[title]
    scores_tfidf = similarity[idx_tfidf]
    tuples_tfidf = sorted(list(enumerate(scores_tfidf)), key=lambda e: e[1], reverse=True)[:K]
    indices_tfidf = [i for i, _ in tuples_tfidf]
    if tag and idx2mid[idx_tfidf] in mid2idxTag:
        print("utilize tags ... ")
        idx_tag = mid2idxTag[idx2mid[idx_tfidf]]
        scores_tag = similarity_tag[idx_tag]
        tuples_tag = sorted(list(enumerate(scores_tag)), key=lambda e: e[1], reverse=True)[:K]
        indices_tag = [i for i, _ in tuples_tag]
        return df.iloc[indices_tag][['title', 'tag']]
    
    return movies.iloc[indices_tfidf][['title', 'genres']]
In [22]:
content("Toy Story 2", tag=True)
From substring 'Toy Story 2', recommend movies as follows!
utilize tags ... 

title tag
666 Toy Story 2 (1999) animation|Disney|funny|original|Pixar|sequel|T...
544 Bug's Life, A (1998) Pixar
0 Toy Story (1995) pixar|pixar|fun
909 Road to Perdition (2002) cinematography|Tom Hanks
1480 Invincible Iron Man, The (2007) animation
142 Aladdin (1992) Disney
147 Snow White and the Seven Dwarfs (1937) Disney
148 Beauty and the Beast (1991) Disney
149 Pinocchio (1940) Disney
152 Aristocats, The (1970) Disney

We can also use genres and tags at the same time.
However, I only use genres or tags separately to simplify this problem; a sketch of blending the two similarity matrices is shown below.
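
For example, one simple way to blend them (a sketch; blended_scores and alpha are hypothetical, and movies without tags keep their genre-based score):

def blended_scores(title, alpha=0.5):
    """Blend genre-based and tag-based similarity scores for a full movie title."""
    idx = t2idx[title]
    scores = similarity[idx].copy()                   # genre-based scores over all 9742 movies
    mid = idx2mid[idx]
    if mid in mid2idxTag:                             # the query movie has tags
        tag_row = similarity_tag[mid2idxTag[mid]]     # tag-based scores over the 1572 tagged movies
        for j, m in idx2midTag.items():               # map tag-matrix indices back to movie indices
            scores[mid2idx[m]] = alpha * scores[mid2idx[m]] + (1 - alpha) * tag_row[j]
    return scores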

Summary

Pros

  • No need for data on other users, thus no cold-start or sparsity problems.
  • Can recommend to users with unique tastes.
  • Can recommend new & unpopular items.
  • Can provide explanations for recommended items by listing the content features that caused an item to be recommended (in this case, movie genres).

Cons

  • Finding the appropriate features is hard.
  • Does not recommend items outside a user’s content profile.
  • Unable to exploit quality judgments of other users.
