New LightFM model with different numbers of check-ins (1000 / 50000 / all 227427) (updated 29 Aug)

I finally succeeded in optimizing the code of the LightFM model!

But the computational cost is very high, so I will start with only 1000 of the 227427 check-ins.

And the results turned out to be good!

Here is the original LightFM model, running on my laptop:

unique user&venue checkin combination in test 195
unique user&venue checkin combination in test 778
max num in matrix 2
max num in train 4
I am beginning to model
model has been fitted
this is the model that consider the checkin times
Time used: 0.3982211436102695
Train_auc is 0.932690
Test_aus is 0.159056
Collabrative Filtering testAUC is: 0.500707
Hybrid train auc is 0.958416
Hybrig test auc is 0.512641
logistic train auc is 0.822063
logistic test auc is 0.138891

We can see that, because so much data is discarded, the test AUC is extremely low (0.159, against a train AUC of 0.933), and the hybrid model greatly improves it (test AUC 0.513).
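As a reminder of how to read these numbers: AUC is the probability that a randomly chosen positive item is scored above a randomly chosen negative one, so 0.5 corresponds to a random ranking and anything below 0.5 is worse than random. A minimal sketch with made-up scores (the function name `pairwise_auc` and the toy data are mine, not part of the experiment code):

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy scores, invented for illustration only.
perfect = pairwise_auc([0.9, 0.8], [0.1, 0.2])   # every positive beats every negative
inverted = pairwise_auc([0.1, 0.2], [0.9, 0.8])  # every positive loses
tied = pairwise_auc([0.5], [0.5])                # no information in the scores

print(perfect, inverted, tied)
```

Read this way, a test AUC of 0.159 means the model ranks most held-out check-ins below the negatives, not just that it fails to rank them.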

Now let's see the results of the model that considers the domain-specific biases:

this is test for new lightfm, 1000 checkins
unique user&venue checkin combination in test 195
unique user&venue checkin combination in test 778
max num in matrix 4
max num in train 3
I am beginning to get negtive examples
object preprocess created
calculate neighbor for item 0
calculate neighbor for item 1
calculate neighbor for item 2
calculate neighbor for item 3
calculate neighbor for item 4
......
calculate neighbor for item 914
calculate neighbor for item 915
calculate neighbor for item 916
get neighbor time used: 31.218598
0
1
2
3
.....
774
775
776
777
Time used for negative examples: 31.323323000000002
I am beginning to model,this is the new model
model has been fitted
this is the model that consider the checkin times
Time used: 0.04152100000000303
Train_auc is 0.589729
Test_aus is 0.329315

Although the train AUC drops (0.933 to 0.590), the test AUC roughly doubles (0.159 to 0.329). That is a really good result, although it does not reach the test AUC of the hybrid model (0.513).

It still shows that the new model compensates for the information loss to some extent.

These are the results for 50000 check-ins, running on my laptop.

unique user&venue checkin combination in test 5010
unique user&venue checkin combination in test 20036
max num in matrix 35
max num in train 48
I am beginning to model
model has been fitted
this is the model that consider the checkin times
Time used: 5.658149130902446
Train_auc is 0.999952
Test_aus is 0.465492
Collabrative Filtering testAUC is: 0.554559
Hybrid train auc is 0.596089
Hybrig test auc is 0.529985
logistic train auc is 0.774696
logistic test auc is 0.42213

The new LightFM model is still running on the cluster; waiting for the results.

OK, here are the results:

this is test for new lightfm, 50000 checkins
unique user&venue checkin combination in test 5010
unique user&venue checkin combination in test 20036
max num in matrix 48
max num in train 47
I am beginning to get negtive examples
object preprocess created
get neighbor time used: 9331.736611
Time used for negative examples: 9375.006032
I am beginning to model,this is the new model
model has been fitted
this is the model that consider the checkin times
Time used: 0.9198419999993348
Train_auc is 0.553874
Test_aus is 0.485107
/home/s2013258/.local/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

A slight improvement in the test AUC (0.465 to 0.485).

As for the full data set, the test AUC of the new model declined (0.655 to 0.544).

Here is the result running on my laptop:

unique user&venue checkin combination in test 18205
unique user&venue checkin combination in test 72819
max num in matrix 155
max num in train 257
I am beginning to model
model has been fitted
this is the model that consider the checkin times
Time used: 28.566111388207524
Train_auc is 0.999501
Test_aus is 0.654774
Collabrative Filtering testAUC is: 0.686022
Hybrid train auc is 0.513596
Hybrig test auc is 0.507019

And here is the result running on the cluster with the new model:

this is test for new lightfm, all checkins
unique user&venue checkin combination in test 18205
unique user&venue checkin combination in test 72819
max num in matrix 219
max num in train 257
I am beginning to get negtive examples
object preprocess created
get neighbor time used: 51382.303583
Time used for negative examples: 51741.248447
I am beginning to model,this is the new model
model has been fitted
this is the model that consider the checkin times
Time used: 3.28872599999886
Train_auc is 0.562395
Test_aus is 0.543550

Clearly, on the full data set the test AUC of both the hybrid model (0.507) and the new model (0.544) is lower than in the original laptop run with WARP loss (0.655 for the check-in-times model, 0.686 for CF). I think it may have something to do with overfitting, or with redundant information.
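To put the runs side by side, here is a small sketch that collects the train/test AUC values from the logs above and computes the train-minus-test gap as a rough overfitting indicator. The dictionary keys are my own labels: "original" is the Train_auc/Test_aus pair of each laptop run, "new" is the same pair from the new-model run, and "hybrid" is the Hybrid pair.

```python
# (train AUC, test AUC) pairs copied from the logs above.
runs = {
    "1000": {
        "original": (0.932690, 0.159056),
        "new": (0.589729, 0.329315),
        "hybrid": (0.958416, 0.512641),
    },
    "50000": {
        "original": (0.999952, 0.465492),
        "new": (0.553874, 0.485107),
        "hybrid": (0.596089, 0.529985),
    },
    "all": {
        "original": (0.999501, 0.654774),
        "new": (0.562395, 0.543550),
        "hybrid": (0.513596, 0.507019),
    },
}
# The logs report only a test AUC for the collaborative filtering model.
cf_test_auc = {"1000": 0.500707, "50000": 0.554559, "all": 0.686022}

def train_test_gap(pair):
    """Train AUC minus test AUC; a large positive gap hints at overfitting."""
    train, test = pair
    return train - test

gaps = {
    size: {name: train_test_gap(pair) for name, pair in models.items()}
    for size, models in runs.items()
}

for size, models in runs.items():
    for name, (train, test) in models.items():
        gap = gaps[size][name]
        print(f"{size:>6} {name:<9} train={train:.3f} test={test:.3f} gap={gap:+.3f}")
    print(f"{size:>6} {'cf':<9} test={cf_test_auc[size]:.3f}")
```

On the full data the gap is about 0.34 for the original model but only about 0.02 for the new model, so it may be worth checking whether overfitting is really the main issue for the new model before tuning for it.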

For now I have two possible improvements in mind:

1. Change the radius of the neighbor area.

2. Reduce the overfitting.

Solution 1 is easy, but it takes some time to see the results. As for solution 2, I do not have any specific ideas yet.
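For solution 1, a minimal sketch of what changing the radius does to the neighbor sets. This is not the actual preprocessing code: I am assuming here that a venue's neighbors are the other venues within a fixed great-circle distance, and the names `haversine_km`, `neighbors_within` and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def neighbors_within(venues, radius_km):
    """Brute-force neighbor lists: compares every pair once, so O(n^2) in the number of venues."""
    result = {i: [] for i in range(len(venues))}
    for i in range(len(venues)):
        for j in range(i + 1, len(venues)):
            if haversine_km(venues[i], venues[j]) <= radius_km:
                result[i].append(j)
                result[j].append(i)
    return result

# Hypothetical venue coordinates (assumption: venues carry latitude/longitude).
venues = [(40.7580, -73.9855), (40.7590, -73.9845), (40.6892, -74.0445)]

small = neighbors_within(venues, 0.5)   # tight radius: only the two nearby venues pair up
large = neighbors_within(venues, 12.0)  # wide radius: every venue neighbors every other

print(small)
print(large)
```

A pairwise search like this grows quadratically with the number of venues. The logs show the neighbor step taking 31 s, 9332 s and 51382 s for the three data sizes, so each new radius means paying that cost again unless the pairwise distances are cached or a spatial index is used.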