Download Estimation on KDD Cup 2003

This table shows the average error of the predictions based on various representations of the papers. Support vector regression was used in all these experiments as the learning algorithm. Descriptions of the features are included below the table. More details can be found in our forthcoming technical report, which will be posted on this web site when it is complete (hopefully quite soon).

Average errors by representation:

| ID | Representation | Cross-validation: training set | Cross-validation: test set | Full training set | KDD Cup test set: all 150 papers | KDD Cup test set: without the outlier (#0103239)* |
|---|---|---|---|---|---|---|
| | Trivial model (predict median of training set) | 150.18 | 152.26 | 150.15 | 181.11 | 136.77 |
| (AA) | Author + Abstract | 37.62 | 135.86 | 37.77 | 155.38 | 111.31 |
| | AA + 0.004 in-degree | 35.99 | 127.69 | 35.90 | 146.77 | 103.62 |
| (R1) | AA + 0.005 in-degree + 0.5 in-links + 0.8 out-links | 29.90 | 123.74 | 29.96 | 143.06 | 100.25 |
| (R2) | R1 + 0.25 journal | 29.42 | 121.12 | 29.45 | 143.38 | 100.60 |
| (R3) | R2 + 0.004 title length | 29.14 | 119.58 | 29.22 | 140.30 | 97.27 |
| | R3 + 1.3 title word length | 29.21 | 118.94 | 29.33 | 139.75 | 96.79 |
| (R4) | R3 + 0.9 title word length + 0.1 (year - 2000) | 29.17 | 118.81 | 29.30 | 138.69 | 95.74 |
| (R5) | R4 + 0.4 ClusDlMed | 29.13 | 118.13 | 29.17 | 138.56 | 95.63 |
| | Our entry on KDD Cup 2003** | 31.80 | 118.89 | 31.77 | 141.60 | 98.72 |
| | Second best entry on KDD Cup 2003*** | | | | 146.34 | |
| | Third best entry on KDD Cup 2003*** | | | | 158.39 | |
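This page does not spell out how an expression such as "AA + 0.004 in-degree" becomes a single feature vector. One plausible reading, assumed here rather than confirmed by the text, is that each extra feature group is multiplied by its weight and appended to the base vector as additional components. The sketch below illustrates that reading with sparse vectors stored as dictionaries; the function names and the example paper are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector (dict) to Euclidean length 1; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm > 0 else dict(vec)


def combine(base, weighted_groups):
    """Append weighted feature groups to a base sparse vector.

    weighted_groups maps a group name to (weight, sparse_vector).
    Assumption: a weight simply multiplies every value of its group.
    """
    combined = {("base", k): v for k, v in base.items()}
    for name, (weight, group) in weighted_groups.items():
        for k, v in group.items():
            combined[(name, k)] = weight * v
    return combined


# Hypothetical paper: a small base vector, and 12 papers citing it.
aa = l2_normalize({"author:smith": 1.0, "word:string": 0.7, "word:brane": 0.3})
x = combine(aa, {"in-degree": (0.004, {"count": 12.0})})
```

Under this reading, a small weight such as 0.004 keeps a raw count like the in-degree on a scale comparable to the unit-length text and author components.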

Notes:

* This is a document that was downloaded 7160 times, which is more than ten times as much as the average download count. All our models make a huge error here, which increases the average error on the test set of 150 papers by about 40. Without this outlier, the test set is actually easier to predict than the test papers considered during cross-validation.

** This was AA + 0.006 in-degree + 0.7 in-links + 0.85 out-links + 0.35 journal + 0.006 title length + 0.3 ClusDlAvg. Additionally, the error cost parameter of the SVM algorithm was set to 0.7, whereas it was set to 1 in all other experiments reported here. (Later, when preparing the technical report about this work, we found a few slightly better representations, of which R5 in the above table is the best one.)

*** These two rows are based on the results published on the KDD Cup 2003 results page.
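The effect of the outlier described in the first note can be checked from the table itself. Assuming the "average error" is the arithmetic mean of the per-paper errors (this page does not state the exact definition), the error on the one removed paper follows from the two test-set averages. The function name below is hypothetical.

```python
def outlier_error(avg_all, avg_without, n_all=150):
    """Error on the single removed paper implied by the two averages.

    Assumes both averages are plain means over n_all and n_all - 1 papers.
    """
    return n_all * avg_all - (n_all - 1) * avg_without


# Averages taken from the table: trivial model and R5.
trivial = outlier_error(181.11, 136.77)
r5 = outlier_error(138.56, 95.63)
print(round(trivial, 2), round(r5, 2))
```

With the table's figures this gives an implied error of roughly 6,800 downloads for the trivial model and roughly 6,500 for R5 on that one paper, which is consistent with the note's remark that every model makes a huge error there.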

Explanation of individual features:

Abstract

The contents of the title and abstract of the paper. The bag-of-words model is used to represent text as a sparse vector. There is one component for each possible word, and its value is the TF-IDF value computed from: (1) the number of occurrences of this word in the title and abstract of the current paper; and (2) the number of documents that contain this word. The entire vector is normalized to Euclidean length 1.
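A minimal sketch of such a representation follows. The exact TF-IDF formula is not given on this page, so the common variant tf * log(N / df) is assumed; tokenization is also left out, and the function name and sample tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists (title plus abstract). Returns one sparse dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # occurrences of each word in this document
        # Assumed weighting: tf * log(N / df).
        vec = {w: c * math.log(n_docs / df[w]) for w, c in tf.items()}
        # Normalize to Euclidean length 1, as described above.
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {w: v / norm for w, v in vec.items()}
        vectors.append(vec)
    return vectors


docs = [["string", "theory", "string"], ["gauge", "theory"], ["black", "hole"]]
vectors = tfidf_vectors(docs)
```

Words that occur in every document receive weight 0 under this variant, so they contribute nothing to the vector.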

Author

There is one component for each possible author. Its value is set to 1 if this author is one of the authors of the current paper, and to 0 otherwise. Finally, the vector is normalized to have Euclidean length 1.
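Because all nonzero entries are 1 before normalization, each author of a paper with k authors ends up with the value 1/sqrt(k). A small sketch, with a hypothetical function name and author names:

```python
import math


def author_vector(paper_authors):
    """Sparse unit-length indicator vector over the authors of one paper."""
    authors = set(paper_authors)
    if not authors:
        return {}
    # k ones normalized to Euclidean length 1 become k entries of 1/sqrt(k).
    value = 1.0 / math.sqrt(len(authors))
    return {a: value for a in authors}


vec = author_vector(["A. Author", "B. Author", "C. Author", "D. Author"])
```

One consequence is that a single-author paper puts all of its weight on that author, while a paper with many authors spreads it thinly.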

In-degree

The number of papers that reference the current paper. Papers from the entire dataset are taken into account, not just those from the training period.

In-links

There is one feature for each paper in the dataset. Its value is set to 1 if that paper references the current paper, and to 0 otherwise. The resulting vector is then normalized to have Euclidean length 1.

Out-links

Like in-links, but with nonzero values corresponding to papers referenced by the current paper.
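The three citation-based features (in-degree, in-links, out-links) can all be derived from one list of citation pairs. The sketch below assumes the citation graph is available as (citing, cited) ID pairs; the function names and the paper IDs in the example are hypothetical.

```python
import math
from collections import defaultdict


def link_features(citations):
    """citations: iterable of (citing_id, cited_id) pairs.

    Returns (in_degree, in_links, out_links), each keyed by paper ID.
    Papers with no incoming (or outgoing) links are simply absent from the
    corresponding dict.
    """
    cited_by = defaultdict(set)  # paper -> papers that reference it
    cites = defaultdict(set)     # paper -> papers it references
    for citing, cited in citations:
        cited_by[cited].add(citing)
        cites[citing].add(cited)

    def unit_indicator(ids):
        # 0/1 indicator vector normalized to Euclidean length 1.
        value = 1.0 / math.sqrt(len(ids))
        return {i: value for i in ids}

    in_degree = {p: len(s) for p, s in cited_by.items()}
    in_links = {p: unit_indicator(s) for p, s in cited_by.items()}
    out_links = {p: unit_indicator(s) for p, s in cites.items()}
    return in_degree, in_links, out_links


edges = [("0101001", "0001002"), ("0102003", "0001002"), ("0102003", "0101001")]
in_degree, in_links, out_links = link_features(edges)
```

Note that the in-degree keeps the raw count, whereas the normalized in-links vector only records which papers cite the current one.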

Journal

There is one feature for each possible journal. Its value is set to 1 if this paper has been published in that journal, and to 0 otherwise. There is also one feature for papers that do not mention having been published in a journal.

Title length

The length of the title, in characters. Spaces and punctuation symbols count too (though it probably wouldn't matter much if they weren't included in these counts).

Title word length

The average length of words in the title. This is the title length in characters divided by the number of words in the title.
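Both title features can be computed in a few lines. The sketch assumes that words are separated by whitespace, which this page does not specify; the function name and the sample title are hypothetical.

```python
def title_features(title):
    """Return (title length, title word length) for one title."""
    length = len(title)            # every character counts, including spaces
    n_words = len(title.split())   # assumption: whitespace-separated words
    word_length = length / n_words if n_words else 0.0
    return length, word_length


length, word_length = title_features("A Note on Branes")
```

Because the numerator includes spaces and punctuation, the result is slightly larger than the true mean word length.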

Year

The year when this paper was added to the archive. For the papers we're interested in, this is either 2000, 2001, or 2002. This value can be trivially determined from the first two digits of the seven-digit paper ID (which has the form yymmnnn).
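Extracting the year from an ID of the form yymmnnn might look like the sketch below. The century rule (two-digit years from 90 upward are read as 19yy) is an assumption added for completeness and is not needed for the 2000 to 2002 papers discussed here; the function name is hypothetical.

```python
def year_from_id(paper_id):
    """Return the four-digit year encoded in a seven-digit paper ID (yymmnnn)."""
    if len(paper_id) != 7 or not paper_id.isdigit():
        raise ValueError(f"expected a seven-digit ID, got {paper_id!r}")
    yy = int(paper_id[:2])
    # Assumed pivot: 90..99 -> 1990s, everything else -> 2000s.
    return 1900 + yy if yy >= 90 else 2000 + yy


year = year_from_id("0103239")  # the outlier paper from the table
feature = year - 2000           # the form used in representation R4
```

The ID has to be handled as a string, since converting it to an integer would drop the leading zero of IDs from 2000 to 2009.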

ClusDlMed, ClusDlAvg

The median or average number of downloads over all training papers that belong to the same cluster as the current paper. (The papers have been partitioned into 26 clusters by recursively applying the 2-means algorithm.)
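Given a cluster assignment, the two statistics are straightforward to compute. The sketch below takes the clustering as given (the recursive 2-means step is not reproduced here) and uses hypothetical names and data throughout.

```python
from collections import defaultdict
from statistics import mean, median


def cluster_download_stats(train_clusters, train_downloads):
    """Map each cluster to (median, mean) of the downloads of its training papers.

    train_clusters:  paper ID -> cluster label (training papers only)
    train_downloads: paper ID -> download count
    """
    by_cluster = defaultdict(list)
    for paper, cluster in train_clusters.items():
        by_cluster[cluster].append(train_downloads[paper])
    return {c: (median(d), mean(d)) for c, d in by_cluster.items()}


clusters = {"p1": 0, "p2": 0, "p3": 0, "p4": 1}
downloads = {"p1": 100, "p2": 200, "p3": 900, "p4": 50}
stats = cluster_download_stats(clusters, downloads)

# Feature values for a new paper that falls into cluster 0.
clus_dl_med, clus_dl_avg = stats[0]
```

In this toy example the median (200) is much less affected by the heavily downloaded paper than the average (400), which is the usual reason to prefer a median when download counts are skewed.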

Janez Brank