Content-Based (CB) Approach
The Content-Based Recommender relies on the similarity of the items being recommended. The basic idea is that if you like an item, then you will also like a “similar” item. It generally works well when it’s easy to determine the context/properties of each item.
Step 1. Movies to Vectors
One way to get a user profile: TF-IDF
A movie can be represented as a vector using TF-IDF.
For a Korean-language explanation of how to calculate TF-IDF, with examples, see wikidocs.
Let’s look at a toy example.
In [3]:
['I', 'do', 'know', 'like', 'love', 'should', 'want', 'what', 'you', 'your']
In [4]:
| | I | do | know | like | love | should | want | what | you | your |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.288 | 0.000 | 0.405 | 0.000 | 0.405 | 0.000 | 0.405 | 0.000 | 0.000 | 0.405 |
| 1 | -0.288 | 0.000 | 0.000 | 0.405 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 2 | -0.288 | 0.405 | 0.000 | 0.000 | 0.000 | 0.405 | 0.000 | 0.405 | 0.000 | 0.000 |
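The scores in this toy table can be reproduced with a few lines of plain Python. This is a minimal sketch under two assumptions that the source does not state explicitly: the three sentences are inferred from the vocabulary and the nonzero entries, and the weighting is taken to be raw term count times `ln(N / (1 + df))`, which matches the -0.288, 0.000 and 0.405 values shown.

```python
import math

# Assumed toy corpus, inferred from the vocabulary and scores in the table.
docs = [
    "you know I want your love",
    "I like you",
    "what should I do",
]

# Vocabulary: every distinct word, sorted (uppercase 'I' sorts first).
vocab = sorted({word for doc in docs for word in doc.split()})
n_docs = len(docs)


def tf(term, doc):
    """Term frequency: raw count of `term` in `doc`."""
    return doc.split().count(term)


def idf(term):
    """Inverse document frequency with +1 added to the document frequency."""
    df = sum(term in doc.split() for doc in docs)
    return math.log(n_docs / (1 + df))


# One row per document, one column per vocabulary term.
tfidf = [[tf(t, d) * idf(t) for t in vocab] for d in docs]

print(vocab)
for row in tfidf:
    print([round(v, 3) for v in row])
```

Note that a word occurring in every document ("I") gets a negative weight under this formula, and a word occurring in two of the three documents ("you") gets exactly zero.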
Use TfidfVectorizer from the sklearn library to compute TF-IDF over [1, 2]-grams (unigrams and bigrams) easily.
In [5]:
| | know | know want | like | love | want | want love |
|---|---|---|---|---|---|---|
| 0 | 0.447214 | 0.447214 | 0.0 | 0.447214 | 0.447214 | 0.447214 |
| 1 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 |
Let’s preprocess the MovieLens dataset.
For more details, please see this document.
The MovieLens dataset is described as follows.
M = 610 users, N = 9742 movies.
100836 ratings; each movie has genre information.
In [7]:
| | action | adventure | animation | children | comedy | crime | documentary | drama | fantasy | fi | ... | imax | listed | musical | mystery | noir | romance | sci | thriller | war | western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.416846 | 0.516225 | 0.504845 | 0.267586 | 0.0 | 0.0 | 0.000000 | 0.482990 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.000000 | 0.512361 | 0.000000 | 0.620525 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.593662 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.570915 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.821009 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.505015 | 0.0 | 0.0 | 0.466405 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.726241 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9737 | 0.436010 | 0.000000 | 0.614603 | 0.000000 | 0.318581 | 0.0 | 0.0 | 0.000000 | 0.575034 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9738 | 0.000000 | 0.000000 | 0.682937 | 0.000000 | 0.354002 | 0.0 | 0.0 | 0.000000 | 0.638968 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9739 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 1.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9740 | 0.578606 | 0.000000 | 0.815607 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9741 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |

9742 rows × 23 columns
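Column names such as `fi`, `sci` and `listed` look odd for genres. They are what one would expect if the pipe-separated genre strings were tokenized with scikit-learn's default token pattern, which keeps runs of two or more word characters and so splits `Sci-Fi` at the hyphen. That the notebook used the default tokenizer is an assumption inferred from the column names; the sketch below only mimics that pattern with the standard `re` module.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize_genres(genres):
    """Lowercase a MovieLens genre string and split it into tokens."""
    return TOKEN_PATTERN.findall(genres.lower())


print(tokenize_genres("Adventure|Animation|Children|Comedy|Fantasy"))
print(tokenize_genres("Action|Sci-Fi|Thriller"))
print(tokenize_genres("(no genres listed)"))
```

If genre names should stay intact, one option is to split on `|` yourself instead of relying on the default pattern; the results in this post do not do that.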
Step 2. Get Similarity Matrix
In [8]:
| movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.000000 | 0.813578 | 0.152769 | 0.135135 | 0.267586 | 0.000000 | 0.152769 | 0.654698 | 0.000000 | 0.262413 | ... | 0.360397 | 0.465621 | 0.196578 | 0.516225 | 0.0 | 0.680258 | 0.755891 | 0.000000 | 0.421037 | 0.267586 |
| 1 | 0.813578 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.804715 | 0.000000 | 0.322542 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.341376 | 0.379331 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.152769 | 0.000000 | 1.000000 | 0.884571 | 0.570915 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.162848 | 0.000000 | 0.419413 | 0.000000 | 0.0 | 0.181883 | 0.202105 | 0.000000 | 0.000000 | 0.570915 |
| 3 | 0.135135 | 0.000000 | 0.884571 | 1.000000 | 0.505015 | 0.000000 | 0.884571 | 0.000000 | 0.000000 | 0.000000 | ... | 0.144051 | 0.201391 | 0.687440 | 0.000000 | 0.0 | 0.160888 | 0.178776 | 0.466405 | 0.000000 | 0.505015 |
| 4 | 0.267586 | 0.000000 | 0.570915 | 0.505015 | 1.000000 | 0.000000 | 0.570915 | 0.000000 | 0.000000 | 0.000000 | ... | 0.285240 | 0.000000 | 0.734632 | 0.000000 | 0.0 | 0.318581 | 0.354002 | 0.000000 | 0.000000 | 1.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9737 | 0.680258 | 0.341376 | 0.181883 | 0.160888 | 0.318581 | 0.239513 | 0.181883 | 0.000000 | 0.436010 | 0.241142 | ... | 0.599288 | 0.554355 | 0.234040 | 0.614603 | 0.0 | 1.000000 | 0.899942 | 0.000000 | 0.753553 | 0.318581 |
| 9738 | 0.755891 | 0.379331 | 0.202105 | 0.178776 | 0.354002 | 0.000000 | 0.202105 | 0.000000 | 0.000000 | 0.000000 | ... | 0.476784 | 0.615990 | 0.260061 | 0.682937 | 0.0 | 0.899942 | 1.000000 | 0.000000 | 0.557008 | 0.354002 |
| 9739 | 0.000000 | 0.000000 | 0.000000 | 0.466405 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.431794 | 0.678466 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| 9740 | 0.421037 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.317844 | 0.000000 | 0.000000 | 0.578606 | 0.320007 | ... | 0.674692 | 0.735655 | 0.000000 | 0.815607 | 0.0 | 0.753553 | 0.557008 | 0.000000 | 1.000000 | 0.000000 |
| 9741 | 0.267586 | 0.000000 | 0.570915 | 0.505015 | 1.000000 | 0.000000 | 0.570915 | 0.000000 | 0.000000 | 0.000000 | ... | 0.285240 | 0.000000 | 0.734632 | 0.000000 | 0.0 | 0.318581 | 0.354002 | 0.000000 | 0.000000 | 1.000000 |

9742 rows × 9742 columns
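The entries of this matrix appear to be cosine similarities between the genre TF-IDF rows; the source does not name the measure, so treat that as an assumption. As a check, the sketch below recomputes one entry in plain Python from the nonzero weights of rows 0 and 1 of the genre table (Toy Story and Jumanji), and the result should land close to the 0.813578 shown for that pair.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_u * norm_v)


# Weights copied from the genre TF-IDF table, in the order
# (adventure, animation, children, comedy, fantasy); all other columns are 0.
toy_story = [0.416846, 0.516225, 0.504845, 0.267586, 0.482990]
jumanji = [0.512361, 0.000000, 0.620525, 0.000000, 0.593662]

sim = cosine_similarity(toy_story, jumanji)
print(round(sim, 4))
```

Because the TF-IDF rows are already L2-normalized, the cosine reduces to a plain dot product here, so the full matrix can be obtained as the product of the TF-IDF matrix with its own transpose.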
Step 3. Recommend: Find Top K Similar Movies
In [11]:
Given title are patially matched! re-find correct title among results!

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 2355 | 3114 | Toy Story 2 (1999) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 7355 | 78499 | Toy Story 3 (2010) | Adventure\|Animation\|Children\|Comedy\|Fantasy\|IMAX |
In [12]:
From substring 'Toy Story 2', recommend movies as follows!

| | title | genres |
|---|---|---|
| 0 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1706 | Antz (1998) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 2355 | Toy Story 2 (1999) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
In [13]:
From substring 'Matrix, The', recommend movies as follows!

| | title | genres |
|---|---|---|
| 59 | Lawnmower Man 2: Beyond Cyberspace (1996) | Action\|Sci-Fi\|Thriller |
| 68 | Screamers (1995) | Action\|Sci-Fi\|Thriller |
| 144 | Johnny Mnemonic (1995) | Action\|Sci-Fi\|Thriller |
In [14]:
From substring 'Saving Private Ryan', recommend movies as follows!

| | title | genres |
|---|---|---|
| 97 | Braveheart (1995) | Action\|Drama\|War |
| 909 | Apocalypse Now (1979) | Action\|Drama\|War |
| 933 | Boot, Das (Boat, The) (1981) | Action\|Drama\|War |
In [15]:
From substring 'Inception', recommend movies as follows!

| | title | genres |
|---|---|---|
| 7372 | Inception (2010) | Action\|Crime\|Drama\|Mystery\|Sci-Fi\|Thriller\|IMAX |
| 6797 | Watchmen (2009) | Action\|Drama\|Mystery\|Sci-Fi\|Thriller\|IMAX |
| 7625 | Super 8 (2011) | Mystery\|Sci-Fi\|Thriller\|IMAX |
| 8358 | RoboCop (2014) | Action\|Crime\|Sci-Fi\|IMAX |
| 167 | Strange Days (1995) | Action\|Crime\|Drama\|Mystery\|Sci-Fi\|Thriller |
| 6151 | V for Vendetta (2006) | Action\|Sci-Fi\|Thriller\|IMAX |
| 6521 | Transformers (2007) | Action\|Sci-Fi\|Thriller\|IMAX |
| 7545 | I Am Number Four (2011) | Action\|Sci-Fi\|Thriller\|IMAX |
| 7866 | Battleship (2012) | Action\|Sci-Fi\|Thriller\|IMAX |
| 8151 | Iron Man 3 (2013) | Action\|Sci-Fi\|Thriller\|IMAX |
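The lookup-then-rank flow behind these outputs can be sketched as follows. The function names `find_titles` and `top_k_similar` are hypothetical, not the notebook's own, and the small similarity matrix is a toy: its entries reuse values from the genre similarity matrix, with Toy Story 2 given the same row as Toy Story because their genre strings are identical.

```python
def find_titles(titles, substring):
    """Return indices of titles containing `substring` (case-insensitive)."""
    needle = substring.lower()
    return [i for i, title in enumerate(titles) if needle in title.lower()]


def top_k_similar(sim_matrix, index, k=3):
    """Return the k row indices most similar to `index`, best first.

    Ties keep their original order because Python's sort is stable.
    The query item itself is not excluded.
    """
    scores = sim_matrix[index]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


titles = [
    "Toy Story (1995)",
    "Jumanji (1995)",
    "Grumpier Old Men (1995)",
    "Toy Story 2 (1999)",
]
sim = [
    [1.0, 0.813578, 0.152769, 1.0],
    [0.813578, 1.0, 0.0, 0.813578],
    [0.152769, 0.0, 1.0, 0.152769],
    [1.0, 0.813578, 0.152769, 1.0],
]

matches = find_titles(titles, "toy story 2")
for j in top_k_similar(sim, matches[0], k=3):
    print(titles[j])
```

As in the outputs above, the queried movie shows up in its own recommendation list, since its similarity to itself is 1. Dropping `index` from `ranked` before slicing would avoid that.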
To implement a tag-based recommender engine, we have to extract and preprocess the tag information.
We will use MySQL to make preprocessing the records easy.
If you want to know how to work with MySQL and pymysql, follow this guide on my blog.
Step 4 - 1. Generate View in MySQL
Prerequisite: in MySQL, execute the following commands to preprocess the tag information.
create view tmp as
select movieId, group_concat(tag separator '|') as tag
from tags
group by movieId;
create view tmp2 as
select m.movieId, m.title, m.genres, t.tag
from tmp t, movies m
where t.movieId = m.movieId;
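To try the same idea without a MySQL server, the sketch below rebuilds both views in an in-memory SQLite database using Python's built-in `sqlite3` module. This swaps in SQLite for MySQL, so the syntax differs slightly: SQLite takes the separator as a second argument, `group_concat(tag, '|')`, rather than MySQL's `separator '|'`. The handful of rows is taken from the sample output in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table movies (movieId integer, title text, genres text);
    create table tags (movieId integer, tag text);

    insert into movies values
        (1, 'Toy Story (1995)', 'Adventure|Animation|Children|Comedy|Fantasy'),
        (3, 'Grumpier Old Men (1995)', 'Comedy|Romance');
    insert into tags values
        (1, 'pixar'), (1, 'pixar'), (1, 'fun'), (3, 'moldy'), (3, 'old');

    -- One row per movie, with all of its tags joined by '|'.
    create view tmp as
    select movieId, group_concat(tag, '|') as tag
    from tags
    group by movieId;

    -- Attach title and genres; movies without tags are dropped by the join.
    create view tmp2 as
    select m.movieId, m.title, m.genres, t.tag
    from tmp t join movies m on t.movieId = m.movieId;
""")

rows = conn.execute("select * from tmp2 order by movieId").fetchall()
for row in rows:
    print(row)
conn.close()
```

The order in which tags are concatenated within a group is not guaranteed by SQLite, so code that depends on it should sort the tags after splitting.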
In the previous step, we generated the view tmp2.
We now load this view into a pandas.DataFrame named df.
In [17]:
/home/swyoo/anaconda3/envs/torch/lib/python3.7/site-packages/pymysql/cursors.py:170: Warning: (1260, 'Row 309 was cut by GROUP_CONCAT()')
  result = self._query(query)

| | movieId | title | genres | tag |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | pixar\|pixar\|fun |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | fantasy\|magic board game\|Robin Williams\|game |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | moldy\|old |
In [18]:
In [19]:
|
06 |
06 oscar |
06 oscar nominated |
06 oscar nominated best |
1900s |
1920s |
1920s gangsters |
1950s |
1950s adolescence |
1960s |
... |
zellweger |
zellweger retro |
zither |
zoe |
zoe kazan |
zombie |
zombies |
zombies zombies |
zooey |
zooey deschanel |
0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
... |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
1 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
... |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
2 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
... |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
3 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
... |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
4 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
... |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
1567 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
... |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
1568 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
... |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
1569 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
... |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
1570 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
... |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
1571 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
... |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
1572 rows × 8791 columns
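Column names such as `06 oscar nominated best` suggest that word n-grams up to length 4 were used for tags, and `zombies zombies` suggests that n-grams run across the `|` separator between adjacent tags. Both points are inferences from the column names, not settings stated in the post. The plain-Python sketch below shows how such n-grams arise from a single tag string.

```python
import re

# Same two-or-more word-character pattern that scikit-learn uses by default.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, min_n=1, max_n=4):
    """List all word n-grams of `text` for n in [min_n, max_n]."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Tag string of Toy Story (1995) as shown in the DataFrame above.
print(word_ngrams("pixar|pixar|fun"))
```

The bigram `pixar pixar` exists only because two users applied the same tag. Deduplicating tags per movie, or vectorizing each tag as a single token, would avoid such artifacts at the cost of changing the feature space.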
In [20]:
| movieId | 1 | 2 | 3 | 5 | 7 | 11 | 14 | 16 | 17 | 21 | ... | 176371 | 176419 | 179401 | 180031 | 180985 | 183611 | 184471 | 187593 | 187595 | 193565 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 |
| 1 | 0.0 | 1.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.091603 | 0.0 | 0.0 | 0.000000 |
| 2 | 0.0 | 0.000000 | 1.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 |
| 3 | 0.0 | 0.000000 | 0.0 | 1.000000 | 0.506712 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 |
| 4 | 0.0 | 0.000000 | 0.0 | 0.506712 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1567 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.077864 | 0.0 | 0.0 | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.039869 |
| 1568 | 0.0 | 0.091603 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 1.000000 | 0.0 | 0.0 | 0.000000 |
| 1569 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 1.0 | 0.0 | 0.000000 |
| 1570 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 1.0 | 0.000000 |
| 1571 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.032774 | 0.0 | 0.0 | 0.039869 | 0.000000 | 0.0 | 0.0 | 1.000000 |

1572 rows × 1572 columns
In [22]:
From substring 'Toy Story 2', recommend movies as follows!
utilize tags ...

| | title | tag |
|---|---|---|
| 666 | Toy Story 2 (1999) | animation\|Disney\|funny\|original\|Pixar\|sequel\|T... |
| 544 | Bug's Life, A (1998) | Pixar |
| 0 | Toy Story (1995) | pixar\|pixar\|fun |
| 909 | Road to Perdition (2002) | cinematography\|Tom Hanks |
| 1480 | Invincible Iron Man, The (2007) | animation |
| 142 | Aladdin (1992) | Disney |
| 147 | Snow White and the Seven Dwarfs (1937) | Disney |
| 148 | Beauty and the Beast (1991) | Disney |
| 149 | Pinocchio (1940) | Disney |
| 152 | Aristocats, The (1970) | Disney |
We can also use genres and tags at the same time.
However, to keep the problem simple, I use only genres or only tags here.
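One possible way to combine the two signals, which this post does not implement, is a weighted average of the genre-based and tag-based similarities. The sketch below is only an illustration: the function `blend_similarity`, the weight `alpha`, and the choice to treat a missing tag similarity as 0 are all assumptions. The two numbers are the similarities between movieId 5 and movieId 7 read off the genre and tag matrices above.

```python
def blend_similarity(genre_sim, tag_sim, alpha=0.5):
    """Weighted average of two similarity tables keyed by movieId.

    `alpha` weights the genre similarity and `1 - alpha` the tag similarity.
    Pairs absent from `tag_sim` (movies without tags) count as 0.
    """
    blended = {}
    for a, row in genre_sim.items():
        blended[a] = {}
        for b, g in row.items():
            t = tag_sim.get(a, {}).get(b, 0.0)
            blended[a][b] = alpha * g + (1 - alpha) * t
    return blended


# Similarities between movieId 5 and movieId 7, copied from the matrices above.
genre_sim = {5: {5: 1.0, 7: 0.570915}, 7: {5: 0.570915, 7: 1.0}}
tag_sim = {5: {5: 1.0, 7: 0.506712}, 7: {5: 0.506712, 7: 1.0}}

combined = blend_similarity(genre_sim, tag_sim, alpha=0.5)
print(round(combined[5][7], 6))
```

Keying by movieId rather than by row position matters here, because the tag matrix covers only the 1572 movies that have tags while the genre matrix covers all 9742.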
Summary
Pros
- No need for data on other users, thus no cold-start or sparsity problems.
- Can recommend to users with unique tastes.
- Can recommend new & unpopular items.
- Can provide explanations for recommended items by listing content-features that caused an item to be recommended (in this case, movie genres)
Cons
- Finding the appropriate features is hard.
- Does not recommend items outside a user’s content profile.
- Unable to exploit quality judgments of other users.