Download Estimation on KDD Cup 2003
Janez Brank
and Jure Leskovec
Jozef Stefan Institute, Ljubljana, Slovenia
One of the tasks at this year's KDD Cup concerned download estimation.
- The dataset is a collection of papers
from the high energy physics -- theory
area of arXiv.org.
- For some papers, we are told how many times they were downloaded
from the arXiv.org web servers. Only downloads in the first two months
after each paper was added to the archive are reported, and the exact
time of each download is also given.
- The task is to predict, for certain other papers, how many times they
were downloaded in the first two months after being added to the archive.
This page contains some information about the work done
on this task by Jure Leskovec and myself.
18 teams submitted their entries for this task, and our predictions
turned out to be the most accurate.
- Slides (in PowerPoint format) about our methodology and results:
shorter version, longer and newer version (presented as a Solomon Seminar,
28 October 2003).
- Our predictions. This is a tab-separated file, with one
(paper-ID, prediction) pair per line.
- A table showing the average prediction error of several models based on
different representations of papers.
- A detailed technical report about our work. This includes information
about the dataset and about all the features we experimented with
(even the unpromising ones). Postscript version, PDF version.
- [New!] A short three-page paper about our work, submitted to
SIGKDD Explorations. Postscript version, PDF version.
- [New!] The values of various attributes that we used in our experiments,
available for all the training and test papers in an easily parsable
plain-text format.
The KDD Cup 2003 organizers have also published the
actual
download counts of the test papers in question.
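
For readers who want to compare a predictions file against the actual download counts, here is a minimal sketch in Python. It relies on the one fact stated above, that the predictions file has one tab-separated (paper-ID, prediction) pair per line. Everything else is an assumption made for illustration: that the actual counts can be put into the same two-column layout, the helper names read_pairs and mean_absolute_error, the made-up paper IDs and numbers, and the use of mean absolute error, which is not necessarily the measure used by the KDD Cup organizers or in our error table.

```python
import csv
import io


def read_pairs(lines):
    """Parse tab-separated (paper-ID, value) lines into a dict.

    Paper IDs are kept as strings so that leading zeros survive.
    """
    pairs = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pairs[row[0].strip()] = float(row[1])
    return pairs


def mean_absolute_error(predicted, actual):
    """Average |prediction - actual| over the paper IDs present in both."""
    common = predicted.keys() & actual.keys()
    if not common:
        raise ValueError("no paper IDs in common")
    return sum(abs(predicted[p] - actual[p]) for p in common) / len(common)


# Made-up sample data standing in for the real files (hypothetical IDs and counts).
sample_predictions = io.StringIO("0301001\t120.5\n0301002\t80\n")
sample_actual = io.StringIO("0301001\t100\n0301002\t90\n")

predicted = read_pairs(sample_predictions)
actual = read_pairs(sample_actual)
error = mean_absolute_error(predicted, actual)
print(error)
```

To use this on real data, replace the io.StringIO objects with files opened via open(path, newline=""), after checking that the actual-counts file really follows the two-column layout assumed here.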
Janez Brank