In [1]:
1
2
3
4
5
6
7
8
9
10
11
12
13
import os, sys, random
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from math import sqrt
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

sys.path.append("/home/swyoo/algorithm/")
from utils.verbose import logging_time
from ipypb import track, chain

np.set_printoptions(precision=3)
In [2]:
1
2
3
4
5
DFILE = "ml-latest-small"
CHARSET = 'utf8'
ratings = pd.read_csv(os.path.join(DFILE, 'ratings.csv'), encoding=CHARSET)
# tags = pd.read_csv(os.path.join(DFILE, 'tags.csv'), encoding=CHARSET)
movies = pd.read_csv(os.path.join(DFILE, 'movies.csv'), encoding=CHARSET)

Memory Based Collabortive Filtering

  1. User-User Collaborative Filtering: Here we find look alike users based on similarity and recommend movies which first user’s look-alike has chosen in past. This algorithm is very effective but takes a lot of time and resources. It requires to compute every user pair information which takes time. Therefore, for big base platforms, this algorithm is hard to implement without a very strong parallelizable system.
  2. Item-Item Collaborative Filtering: It is quite similar to previous algorithm, but instead of finding user’s look-alike, we try finding movie’s look-alike. Once we have movie’s look-alike matrix, we can easily recommend alike movies to user who have rated any movie from the dataset. This algorithm is far less resource consuming than user-user collaborative filtering. Hence, for a new user, the algorithm takes far lesser time than user-user collaborate as we don’t need all similarity scores between users.

Step 1. Preprocessing

Now I use the scikit-learn library to split the dataset into testing and training. Cross_validation.train_test_split shuffles and splits the data into two datasets according to the percentage of test examples, which in this case is 0.2.

In [3]:
1
2
3
4
5
6
7
8
df_train, df_test = train_test_split(ratings, test_size=0.2, random_state=0, shuffle=True)
df_train.shape, df_test.shape

R = ratings.pivot(index='userId', columns='movieId', values='rating')
M, N = R.shape
print("num_users: {}, num_movies: {}".format(M, N))
print("density rate: {:.2f}%".format((1 - (R.isna().sum(axis=0).sum() / (M * N))) * 100))
R
1
2
3
num_users: 610, num_movies: 9724
density rate: 1.70%

movieId 1 2 3 4 5 6 7 8 9 10 ... 193565 193567 193571 193573 193579 193581 193583 193585 193587 193609
userId
1 4.0 NaN 4.0 NaN NaN 4.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
606 2.5 NaN NaN NaN NaN NaN 2.5 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
607 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
608 2.5 2.0 2.0 NaN NaN NaN NaN NaN NaN 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
609 3.0 NaN NaN NaN NaN NaN NaN NaN NaN 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
610 5.0 NaN NaN NaN NaN 5.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

610 rows × 9724 columns

Data sparseness can be visualized as follows.

In [4]:
1
2
3
4
5
6
plt.imshow(R)
plt.grid(False)
plt.xlabel("item")
plt.ylabel("user")
plt.title("train Matrix")
plt.show()

png

Step 2. Calculate Similarity

At first, calculate similiarity scores for a toy example.

In [5]:
1
2
3
4
5
6
7
8
9
# matrix = [[5,4,None,4,2,None,4,1],[None,5,5,4,2,1,2,None],[1,None,1,5,None,5,3,4]]
# users = ['A','B','C']
# items = ['a','b','c','d','e','f','g','h']

matrix = [[4, None, 5, 5], [4, 2, 1, None], [3, None, 2, 4], [4, 4, None, None], [2, 1, 3, 5]]
users = ['u1', 'u2', 'u3', 'u4', 'u5']
items = ['i1', 'i2', 'i3', 'i4']
df_table = pd.DataFrame(matrix, index=users, columns=items, dtype=float)
df_table
i1 i2 i3 i4
u1 4.0 NaN 5.0 5.0
u2 4.0 2.0 1.0 NaN
u3 3.0 NaN 2.0 4.0
u4 4.0 4.0 NaN NaN
u5 2.0 1.0 3.0 5.0

Pearson Similarity

Pearson Similarity can be computed as follows.

Let the number of users be $M$ and the number of items be $N$.

The dimension of pearson correlation similarity matrices for users and items are $ M \times M $, $ N \times N $.

User-Based(UB)

$u, v$ are users
$I$: item set being co-rated by both user $u$ and user $v$

\[sim(u, v) = \frac{\sum_{i \in I} (r_{u, i} - \bar{r_u}) (r_{v, i} - \bar{r_v})}{ \sqrt{\sum_{i \in I} (r_{u, i} - \bar{r_u})^2} \sqrt{\sum_{i \in I} (r_{v, i} - \bar{r_v})^2}}\]

Item-Based(IB)

Note that item-based similarity is modified.
Adjusted cosine similarity is used.
Please see Item-Based Collaborative Filtering Recommendation Algorithms by Sarwar

Basic cosin similarity has one import drawback that difference in rating scale between different users are not takend into account. Therefore, the adjusted cosine similarity offset this drawback by sub-tracking the corresponding user average from each co-rated pair. The formula of adjusted cosine similarity as follows.
$U$: users set that rated both item $i$ and item $j$ (co-rated pair)

\[sim(i, j) = \frac{\sum_{u \in U} (r_{u,i} - \bar{r_u}) (r_{u,j} - \bar{r_u})}{ \sqrt{\sum_{u \in U} (r_{u,i} - \bar{r_u})^2} \sqrt{\sum_{u \in U} (r_{u,j} - \bar{r_u})^2}}\]
if neighbor does not exist, substiture the similarity value as 0.
In [6]:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
ratings = df_table.to_numpy()
M, N = ratings.shape

def ratedset(ratings, i, kind):
    """ if kind is True, return indices of item set else user set. """
    rates = ratings[i] if kind else ratings[:, i]
    where = np.argwhere(~np.isnan(rates)).flatten()
    return set(where)

def neighbors(ratings, i, j, kind):
    """ return neighbors list. """
    return list(ratedset(ratings, i, kind).intersection(ratedset(ratings, j, kind)))
# neighbors(0, 4, ub=True)
# neighbors(0, 2, ub=False)
In [7]:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
@logging_time
def pearson(ratings, ub, adjusted=False):
    M, N = ratings.shape
    epsilon = 1e-20
    if ub:
        sim = np.zeros(shape=(M, M), dtype=float)
        for u in track(range(M)):
            for v in range(u, M):
                # find sim(u, v)
                nei = neighbors(ratings, u, v, kind=True) # indices of common items
                if not nei:
                    sim[u][v] = sim[v][u] = np.nan
                    continue
                ru = ratings[u][nei] - np.mean(ratings[u][nei])
                rv = ratings[v][nei] - np.mean(ratings[v][nei])
                up = ru.dot(rv)
                down = sqrt(np.sum(ru**2)) * sqrt(np.sum(rv**2))
                sim[u][v] = sim[v][u] = up / down
        return sim
    else: 
        sim = np.zeros(shape=(N, N), dtype=float)
        if adjusted:
            umeans = np.nanmean(ratings, axis=1)
        for i in track(range(N)):
            for j in range(i, N):
                # find sim(i, j)
                nei = neighbors(ratings, i, j, kind=False) # indices of common users
                if not nei:
                    sim[i][j] = sim[j][i] = np.nan
                    continue
                if adjusted:
                    ri = ratings[nei, i] - umeans[nei]
                    rj = ratings[nei, j] - umeans[nei]
                else:
                    ri = ratings[nei, i] - np.mean(ratings[nei, i])
                    rj = ratings[nei, j] - np.mean(ratings[nei, j])
                up = ri.dot(rj)
                down = sqrt(np.sum(ri**2)) * sqrt(np.sum(rj**2))
                sim[i][j] = sim[j][i] = up / down
        return sim
In [8]:
1
df_table
i1 i2 i3 i4
u1 4.0 NaN 5.0 5.0
u2 4.0 2.0 1.0 NaN
u3 3.0 NaN 2.0 4.0
u4 4.0 4.0 NaN NaN
u5 2.0 1.0 3.0 5.0
In [9]:
1
2
usim_toy = pearson(ratings, ub=True, verbose=True)
pd.DataFrame(usim_toy, index=users, columns=users)
</progress> 100% 5/5 [00:00<00:00, 0.00s/it]
1
2
/home/swyoo/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:18: RuntimeWarning: invalid value encountered in double_scalars

1
2
WorkingTime[pearson]: 5.10669 ms

u1 u2 u3 u4 u5
u1 1.000000 -1.000000 0.000000 NaN 0.755929
u2 -1.000000 1.000000 1.000000 NaN -0.327327
u3 0.000000 1.000000 1.000000 NaN 0.654654
u4 NaN NaN NaN NaN NaN
u5 0.755929 -0.327327 0.654654 NaN 1.000000
In [10]:
1
2
# pandas library provides calcultation of pearson correlation.
df_table.T.corr(method='pearson', min_periods=1)
u1 u2 u3 u4 u5
u1 1.000000 -1.000000 0.000000 NaN 0.755929
u2 -1.000000 1.000000 1.000000 NaN -0.327327
u3 0.000000 1.000000 1.000000 NaN 0.654654
u4 NaN NaN NaN NaN NaN
u5 0.755929 -0.327327 0.654654 NaN 1.000000
In [11]:
1
2
3
4
# This calucated values are adjusted cosine similarity
# so, it is different with pearson correlation values.
isim_toy = pearson(ratings, ub=False, adjusted=True, verbose=True)
pd.DataFrame(isim_toy, index=items, columns=items)
</progress> 100% 4/4 [00:00<00:00, 0.00s/it]
1
2
WorkingTime[pearson]: 3.65472 ms

i1 i2 i3 i4
i1 1.000000 0.232485 -0.787493 -0.765945
i2 0.232485 1.000000 0.002874 -1.000000
i3 -0.787493 0.002874 1.000000 -0.121256
i4 -0.765945 -1.000000 -0.121256 1.000000
In [12]:
1
2
isim_toy_adjust = pearson(ratings, ub=False, adjusted=False, verbose=True)
pd.DataFrame(isim_toy_adjust, index=items, columns=items)
</progress> 100% 4/4 [00:00<00:00, 0.00s/it]
1
2
/home/swyoo/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:39: RuntimeWarning: invalid value encountered in double_scalars

1
2
WorkingTime[pearson]: 3.94559 ms

i1 i2 i3 i4
i1 1.000000 0.755929 0.050965 0.000000
i2 0.755929 1.000000 -1.000000 NaN
i3 0.050965 -1.000000 1.000000 0.755929
i4 0.000000 NaN 0.755929 1.000000
In [13]:
1
df_table.corr(method='pearson', min_periods=1)
i1 i2 i3 i4
i1 1.000000 0.755929 0.050965 0.000000
i2 0.755929 1.000000 -1.000000 NaN
i3 0.050965 -1.000000 1.000000 0.755929
i4 0.000000 NaN 0.755929 1.000000

Step 3. Predict unrated scores

User-Based

\(\hat{r}_{a, i} = \bar{r_a} + \frac{ \sum_{b \in nei(i)} sim(a, b) * (r_{b, i} - \bar{r_b})}{\sum_{b \in nei(i)} |sim(a, b)|}\)

where

  • $sim(a,b), r_{b, i} - \bar{r_b}$ are scalar.
  • $sim(a,b)$ means a pearson similarity score between user a and user b.
  • $r_{b, i} - \bar{r_b}$ means a row of normalized rating corresponding to user b.
  • $nei(i)$ means users who have rated the item $i$; $\vert nei(i) \vert \le M$

Item-Based

\(\hat{r}_{a, i} = \bar{r_i} + \frac{ \sum_{j \in nei(a)} sim(i, j) * (r_{a, j} - \bar{r_j})}{\sum_{j \in nei(a)} |sim(i, j)|}\) where

  • $nei(a)$ means items who “user a” have rated; $\vert nei(a) \vert \le N$
if neighbor does not exist, substiture the value as 0.
In [14]:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
@logging_time
def predict(ratings, sim, entries, ub):
    epsilon = 1e-20
    sim = np.nan_to_num(sim)
    
    def pred(a, i):
        if ub:
            nei = list(ratedset(ratings, i, kind=False))  # users set who have rated "movie i".
            if not nei: return np.nanmean(ratings[a])
            ru = np.ones(shape=len(nei))
            for k, u in enumerate(nei):
                overlap = neighbors(ratings, a, u, kind=True)
                ru_mean = np.mean(ratings[u, overlap]) if overlap else 0
                ru[k] = ratings[u][i] - ru_mean
            term = ru.dot(sim[a][nei]) / (epsilon + np.sum(np.abs(sim[a][nei])))
            return np.nanmean(ratings[a]) + term
        
        else:
            nei = list(ratedset(ratings, a, kind=True)) # items set that "user a" rated.
            if not nei: return np.nanmean(ratings[:, i])
            rj = np.ones(shape=len(nei))
            for k, j in enumerate(nei):
                overlap = neighbors(ratings, i, j, kind=False)
                rj_mean = np.mean(ratings[overlap, j]) if overlap else 0
                rj[k] = ratings[a][j] - rj_mean
            term = rj.dot(sim[i][nei]) / (epsilon + np.sum(np.abs(sim[i][nei])))
            return np.nanmean(ratings[:, i]) + term
        
    return np.nan_to_num(np.array([pred(u, i) for u, i in track(entries)]))
In [15]:
1
df_table
i1 i2 i3 i4
u1 4.0 NaN 5.0 5.0
u2 4.0 2.0 1.0 NaN
u3 3.0 NaN 2.0 4.0
u4 4.0 4.0 NaN NaN
u5 2.0 1.0 3.0 5.0
In [16]:
1
2
3
4
5
6
7
8
9
entries = np.argwhere(np.isnan(ratings))
usim = df_table.T.corr(method='pearson').to_numpy()
isim = df_table.corr(method='pearson').to_numpy()
ub = predict(ratings, usim, entries, ub=True, verbose=True)  # UB
ib = predict(ratings, isim, entries, ub=False, verbose=True)  # IB
print(entries)
print("\npredicted results as follows...")
print(ub)
print(ib)
</progress> 100% 5/5 [00:00<00:00, 0.00s/it]
1
2
WorkingTime[predict]: 12.54416 ms

</progress> 100% 5/5 [00:00<00:00, 0.00s/it]
1
2
3
4
5
6
7
8
9
10
11
WorkingTime[predict]: 7.43055 ms
[[0 1]
 [1 3]
 [2 1]
 [3 2]
 [3 3]]

predicted results as follows...
[3.947 2.341 1.775 4.    4.   ]
[0.912 2.333 2.19  0.408 4.667]

Step 4. Apply to movielen Dataset

In [17]:
1
2
3
4
5
6
7
8
9
10
ratings = pd.read_csv(os.path.join(DFILE, 'ratings.csv'), encoding=CHARSET)

samples = ratings.sample(frac=1)
df_train, df_test = train_test_split(samples, test_size=0.1, random_state=0, shuffle=True)
df_train.shape, df_test.shape

R = samples.pivot(index='userId', columns='movieId', values='rating')
M, N = R.shape
print("num_users: {}, num_movies: {}".format(M, N))
print("density rate: {:.2f}%".format((1 - (R.isna().sum(axis=0).sum() / (M * N))) * 100))
1
2
3
num_users: 610, num_movies: 9724
density rate: 1.70%

In [18]:
1
df_train.shape, df_test.shape
1
((90752, 4), (10084, 4))
In [19]:
1
2
3
4
5
6
mid2idx = {mid: i for i, mid in enumerate(R.columns)}
idx2mid = {v: k for k, v in mid2idx.items()}
uid2idx = {uid: i for i, uid in enumerate(R.index)}
idx2uid = {v: k for k, v in uid2idx.items()}
rmatrix = R.to_numpy().copy()
rmatrix
1
2
3
4
5
6
7
array([[4. , nan, 4. , ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [2.5, 2. , 2. , ..., nan, nan, nan],
       [3. , nan, nan, ..., nan, nan, nan],
       [5. , nan, nan, ..., nan, nan, nan]])

4 - 1. conceal test and validation dataset

In [20]:
1
2
3
4
5
6
for uid, mid in zip(df_test.userId, df_test.movieId):
    uidx, midx = uid2idx[uid], mid2idx[mid]
    rmatrix[uidx][midx] = np.nan
    
rtable = pd.DataFrame(rmatrix, index=uid2idx.keys(), columns=mid2idx.keys())
rtable
1 2 3 4 5 6 7 8 9 10 ... 193565 193567 193571 193573 193579 193581 193583 193585 193587 193609
1 4.0 NaN 4.0 NaN NaN 4.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
606 2.5 NaN NaN NaN NaN NaN 2.5 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
607 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
608 2.5 2.0 2.0 NaN NaN NaN NaN NaN NaN 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
609 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
610 5.0 NaN NaN NaN NaN 5.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

610 rows × 9724 columns

In [21]:
1
2
3
M, N = rtable.shape
print("num_users: {}, num_movies: {}".format(M, N))
print("density rate: {:.2f}%".format((1 - (rtable.isna().sum(axis=0).sum() / (M * N))) * 100))
1
2
3
num_users: 610, num_movies: 9724
density rate: 1.53%

4 - 2. calculate similarity and predict ratings.

In [22]:
1
entries = [(uid2idx[uid], mid2idx[mid]) for uid, mid in zip(df_test.userId, df_test.movieId)]
In [23]:
1
2
usim = rtable.T.corr(method='pearson')
upreds = predict(rmatrix, usim, entries, ub=True, verbose=True)
</progress> 100% 9700/10084 [01:31<00:00, 0.01s/it]
1
2
WorkingTime[predict]: 91141.55078 ms

In [24]:
1
2
isim = rtable.corr(method='pearson')
ipreds = predict(rmatrix, isim, entries, ub=False, verbose=True)
</progress> 100% 9600/10084 [05:38<00:00, 0.03s/it]
1
2
/home/swyoo/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:27: RuntimeWarning: Mean of empty slice

1
2
WorkingTime[predict]: 338750.83947 ms

This calucated values are adjusted cosine similarity
so, it is different with pearson correlation values.

In [25]:
1
isim_adjust = pearson(rmatrix, ub=False, adjusted=True, verbose=True)
</progress> 100% 9021/9724 [34:59<00:00, 0.22s/it]
1
2
/home/swyoo/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:39: RuntimeWarning: invalid value encountered in double_scalars

1
2
WorkingTime[pearson]: 2099248.79313 ms

In [28]:
1
ipreds_adjusted = predict(rmatrix, isim_adjust, entries, ub=False, verbose=True)
</progress> 100% 9600/10084 [05:34<00:00, 0.03s/it]
1
2
/home/swyoo/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:27: RuntimeWarning: Mean of empty slice

1
2
WorkingTime[predict]: 335237.43176 ms

Step 5. Evaluation

There are many evaluation metrics but one of the most popular metric used to evaluate accuracy of predicted ratings is Root Mean Squared Error (RMSE). I will use the mean_square_error (MSE) function from sklearn, where the RMSE is just the square root of MSE.

\[\mathit{RMSE} =\sqrt{\frac{1}{N} \sum (x_i -\hat{x_i})^2}\]
In [29]:
1
2
3
4
5
6
7
8
9
10
def rmse(ratings, preds, entries):
    """ calculate RMSE
    ratings: np.array
    preds: List[float]
    entries: List[List[int]] """
    N = len(preds)
    diff = np.zeros_like(preds)
    for k, (i, j) in enumerate(entries):
        diff[k] = ratings[i][j] - preds[k]
    return sqrt(np.sum(diff ** 2) / N)
In [30]:
1
2
3
4
5
6
7
entries = [(uid2idx[uid], mid2idx[mid]) for uid, mid in zip(df_test.userId, df_test.movieId)]
rmse1 = rmse(R.to_numpy(), upreds, entries)
rmse2 = rmse(R.to_numpy(), ipreds, entries)
rmse3 = rmse(R.to_numpy(), ipreds_adjusted, entries)
print('User-based CF RMSE: {:.3f}'.format(rmse1))
print('Item-based CF RMSE: {:.3f}'.format(rmse2))
print('Item-based CF RMSE with ajdusted: {:.3f}'.format(rmse3))
1
2
3
4
User-based CF RMSE: 0.892
Item-based CF RMSE: 1.127
Item-based CF RMSE with ajdusted: 1.151

Recommend Top K Movies

Assume that our recommender system recommends top K movies to a user.
The recommendation precedure is as follows.

  1. Predict scores for unseen movies by utilizing similarity matrix.
  2. Sort and determine Top K movies from predicted scores for unseen movies.
  3. Recommend the top K movies.

Before recommend movies, let’s look at the user’s profile.

In [31]:
1
2
3
4
5
M, N = R.shape
ratings = R.to_numpy()
uid = random.randint(1, M)
print(uid)
R[R.index==uid]
1
2
425

movieId 1 2 3 4 5 6 7 8 9 10 ... 193565 193567 193571 193573 193579 193581 193583 193585 193587 193609
userId
425 NaN 3.0 NaN NaN NaN 4.0 NaN NaN NaN 3.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1 rows × 9724 columns

In [104]:
1
2
3
4
5
seen_indices = np.argwhere(~np.isnan(np.squeeze(R[R.index==uid].to_numpy()))).flatten()
seen = [idx2mid[i] for i in seen_indices]
user_profile = pd.merge(left=movies[movies.movieId.isin(seen)], right=R[R.index==uid][seen].T, on='movieId')
user_profile.columns = ['movieId', 'title', 'genres', 'rating']
user_profile.sort_values(by=['rating'], ascending=False)
movieId title genres rating
72 778 Trainspotting (1996) Comedy|Crime|Drama 5.0
37 356 Forrest Gump (1994) Comedy|Drama|Romance|War 5.0
32 318 Shawshank Redemption, The (1994) Crime|Drama 5.0
29 293 Léon: The Professional (a.k.a. The Professiona... Action|Crime|Drama|Thriller 5.0
85 1061 Sleepers (1996) Thriller 4.5
... ... ... ... ...
259 6764 Rundown, The (2003) Action|Adventure|Comedy 2.5
68 736 Twister (1996) Action|Adventure|Romance|Thriller 2.5
69 741 Ghost in the Shell (Kôkaku kidôtai) (1995) Animation|Sci-Fi 2.5
71 762 Striptease (1996) Comedy|Crime 2.5
64 608 Fargo (1996) Comedy|Crime|Drama|Thriller 2.5

306 rows × 4 columns

In [32]:
1
2
3
4
5
6
7
8
9
def recommend(R, sim, uid, K):
    """ R: pandas.Dataframe, rating pivot. """
    rates = np.squeeze(R[R.index==uid].to_numpy())
    empties = np.argwhere(np.isnan(rates)).flatten()
    entries = [(uid, mid) for mid in empties]
    preds = predict(ratings, sim, entries, ub=True, verbose=True)
    topK_indices = np.argsort(preds)[-K:][::-1]
    topK = [idx2mid[idx] for idx in topK_indices]
    return topK
In [33]:
1
2
3
topK = recommend(R, usim, uid, K=10)
print(topK)
movies[movies.movieId.isin(topK)]
</progress> 100% 8554/9418 [00:13<00:00, 0.00s/it]
1
2
3
WorkingTime[predict]: 13215.18350 ms
[6863, 63768, 87867, 60941, 147250, 72603, 5959, 7264, 7394, 5328]

movieId title genres
3807 5328 Rain (2001) Drama|Romance
4143 5959 Narc (2002) Crime|Drama|Thriller
4608 6863 School of Rock (2003) Comedy|Musical
4861 7264 An Amazing Couple (2002) Comedy|Romance
4929 7394 Those Magnificent Men in Their Flying Machines... Action|Adventure|Comedy
6812 60941 Midnight Meat Train, The (2008) Horror|Mystery|Thriller
6899 63768 Tattooed Life (Irezumi ichidai) (1965) Crime|Drama
7195 72603 Merry Madagascar (2009) Animation
7637 87867 Zookeeper (2011) Comedy
9138 147250 The Adventures of Sherlock Holmes and Doctor W... (no genres listed)
In [34]:
1
2
3
topK = recommend(R, isim, uid, K=10)
print(topK)
movies[movies.movieId.isin(topK)]
</progress> 100% 0/9418 [00:14<00:00, 0.00s/it]
1
2
3
WorkingTime[predict]: 14716.72630 ms
[166558, 4215, 4235, 4234, 4233, 4232, 4231, 4229, 4228, 4226]

movieId title genres
3132 4215 Revenge of the Nerds II: Nerds in Paradise (1987) Comedy
3141 4226 Memento (2000) Mystery|Thriller
3142 4228 Heartbreakers (2001) Comedy|Crime|Romance
3143 4229 Say It Isn't So (2001) Comedy|Romance
3144 4231 Someone Like You (2001) Comedy|Romance
3145 4232 Spy Kids (2001) Action|Adventure|Children|Comedy
3146 4233 Tomcats (2001) Comedy
3147 4234 Tailor of Panama, The (2001) Drama|Thriller
3148 4235 Amores Perros (Love's a Bitch) (2000) Drama|Thriller
9435 166558 Underworld: Blood Wars (2016) Action|Horror
In [35]:
1
2
3
topK = recommend(R, isim_adjust, uid, K=10)
print(topK)
movies[movies.movieId.isin(topK)]
</progress> 100% 6204/9418 [00:14<00:00, 0.00s/it]
1
2
3
WorkingTime[predict]: 15062.08897 ms
[70301, 72696, 6577, 1024, 3689, 6301, 4883, 1551, 1490, 81156]

movieId title genres
782 1024 Three Caballeros, The (1945) Animation|Children|Musical
1139 1490 B*A*P*S (1997) Comedy
1170 1551 Buddy (1997) Adventure|Children|Drama
2751 3689 Porky's II: The Next Day (1983) Comedy
3566 4883 Town is Quiet, The (Ville est tranquille, La) ... Drama
4313 6301 Straw Dogs (1971) Drama|Thriller
4454 6577 Kickboxer 2: The Road Back (1991) Action|Drama
7092 70301 Obsessed (2009) Crime|Drama|Thriller
7201 72696 Old Dogs (2009) Comedy
7442 81156 Jackass 3D (2010) Action|Comedy|Documentary

Reference

[1] Item-Based Recommender System original paper written by sarwar 2001

[2] Survey paper

[3] Korean Blog

[4] English Blog

[5] Amazon Item-Based Recommeder System 2003

Leave a comment