Note from “CIKM Cup 2016 Track 2”

I created this post during the competition, two months before this blog’s inception.

Overview

This is my repo* for CIKM 2016 - Track 2: e-commerce personalization, which is my first data science competition. I registered under a pseudonym because I thought I’d be the worst participant, with a negative FinalNDCG**.

Looking back, I kinda underestimated a dataset that is “only” megabytes in size.

* My code is an unreadable throwaway, so hosting it here will probably hurt my hirability.

** Fortunately, I’m only the worst non-zero participant in phase 1.

Submissions

The scores are listed in this order: FinalNDCG, SearchNDCG, CategoryNDCG.

Submissions 1, 2: ~0.34, ~0.37, ~0.33

A trial run to see how submissions and the leaderboard behave, just baseline stuff.

Submissions 3, 4, 5: ~0.22, ~0.32, ~0.20

I tried going the personal route. Using a user-item matrix is impossible because the raw data would be around 200GB (500k users * 180k items). So I tried to group users based on their affinity (views, purchases) toward each category. The raw user-category data is only 2GB, which fits in my RAM.
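The arithmetic behind those sizes can be sketched as below. This is only a back-of-the-envelope estimate: `dense_size_gb` is a made-up helper, and the 4 bytes per cell is an assumption, since the post doesn't say what was stored in each cell. With that assumption the user-category matrix comes out at the 2GB mentioned above, and the user-item matrix lands in the hundreds of gigabytes whichever small cell size you pick.

```python
def dense_size_gb(rows, cols, bytes_per_cell):
    """Size of a dense rows x cols matrix in gigabytes (1 GB = 10**9 bytes)."""
    return rows * cols * bytes_per_cell / 10**9


# Rough counts quoted in the post.
USERS, ITEMS, CATEGORIES = 500_000, 180_000, 1_000

# bytes_per_cell=4 (e.g. a 32-bit count) is an assumption, not a figure from the post.
user_category_gb = dense_size_gb(USERS, CATEGORIES, 4)
user_item_gb = dense_size_gb(USERS, ITEMS, 4)

print(user_category_gb)  # fits in 16GB of RAM
print(user_item_gb)      # does not
```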

The result is sparse af. Even Canopy would be useless, because each user is affiliated with only a few of the one thousand categories. It would also take years to run.

It was a desperate move; my only goal was to deliver something personal-ish. So from the data on 500k users and 1k categories, I did one_group = reduce(lambda x, y: x or (y[0] and y[1]), zip(user1, user2), False), which means two users get put together even if they share only 1 interest. It resulted in a group of 70k people. I ignored the rest and just gave them the baseline stuff.
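Here is a minimal sketch of that pairwise check on its own, assuming each user is a row of 0/1 flags, one per category. The function names and the toy users are made up for illustration, and how the pairs were then merged into the 70k group isn't shown here. Note that in Python 3 `reduce` has to be imported from `functools`.

```python
from functools import reduce  # reduce is no longer a builtin in Python 3


def shares_interest_reduce(user1, user2):
    """The expression from the post: true if both users are active in any one category."""
    return bool(reduce(lambda x, y: x or (y[0] and y[1]), zip(user1, user2), False))


def shares_interest(user1, user2):
    """Same check with any(), which stops at the first shared category."""
    return any(a and b for a, b in zip(user1, user2))


# Hypothetical affinity rows: one 0/1 flag per category (toy data, not from the dataset).
alice = [1, 0, 0, 1]
bob = [0, 0, 1, 1]
carol = [0, 1, 0, 0]

print(shares_interest(alice, bob))    # they share the last category
print(shares_interest(alice, carol))  # nothing in common
```

The `any()` version gives the same answer but can bail out early, which matters when the rows are a thousand categories wide and mostly zeros.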

It was still impossible to give personal recommendations to 70k users on 180k items, so I just gave up and did a stupid thing like

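def score_sessions(data, special_group, group_score, normal_score):
    """Yield one score per (session, item) pair."""
    for session, item in data:
        # Sessions in the big group get the group-level score,
        # everyone else gets the baseline score.
        if session in special_group:
            yield group_score[item]
        else:
            yield normal_score[item]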

That’s all, CIKM2016 folks!

Q&A

Q: What machine did you use to develop this computer thingy?

A: My office’s ROG G550JX with 16GB RAM and a 1TB 5400rpm SATA drive. Windows 10.

Q: What stack did you choose to develop this science technometerpreter?

A: Originally (10 days before the deadline) I planned to use CUDA, but C++ is 2cryptic4me. So I tried Apache Spark 2.0 on a single machine. Alas, DFS doesn’t play well with Windows. I was stuck for hours on a spark-warehouse problem that should’ve been resolved already. Then I gave up. So I tried Apache Spark 1.6 on a single machine. It ran all right until I started feeding it the actual data. And then the machine gave up. In the end I essentially just tried to stab Cthulhu with a pypy toothpick.

So hot!

Q: What lesson did you learn after all this?

A: Well..

Use Linux x64 whenever possible.
Don’t underestimate small data. Matrix factorization can make it grow from MB to GB.
Personalization hurts. Really. Now I wonder how the giants do this.

Q: Anything you regret?

A: Yeah, I spent 5 days of paid leave on this. *Just partially kidding.* Actually I regret not uploading a fully randomized answer to see whether my attempts were any better than random answers.
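For what it’s worth, a random baseline like that can also be estimated offline. The sketch below uses the textbook NDCG formula, which is an assumption on my part: the competition’s exact definition (relevance grades, cutoffs) isn’t spelled out in this post, and the function names and toy relevance list are made up.

```python
import math
import random


def dcg(relevances):
    """Discounted cumulative gain for relevances listed in ranked order."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def random_baseline_ndcg(relevances, trials=1000, seed=0):
    """Average NDCG over random shuffles of the same items."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        shuffled = relevances[:]
        rng.shuffle(shuffled)
        total += ndcg(shuffled)
    return total / trials


# Toy relevance grades for one query (hypothetical, not competition data).
toy = [3, 2, 1, 0, 0, 0]
print(ndcg(toy))                  # already in ideal order
print(random_baseline_ndcg(toy))  # what shuffling gets you on average
```

Any submission scoring below the shuffled average on a held-out slice would be a hint that something is off.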