This table shows the average error of the predictions based on various representations of the papers. Support vector regression was used in all these experiments as the learning algorithm. Descriptions of the features are included below the table. More details can be found in our forthcoming technical report, which will be posted on this web site when it is complete (hopefully quite soon).
| Label | Representation | Cross-validation: training set | Cross-validation: test set | Full training set | Evaluation test set: all 150 papers | Evaluation test set: without the outlier\* |
|---|---|---|---|---|---|---|
| | Trivial model | 150.18 | 152.26 | 150.15 | 181.11 | 136.77 |
| (AA) | Author + Abstract | 37.62 | 135.86 | 37.77 | 155.38 | 111.31 |
| | AA + 0.004 in-degree | 35.99 | 127.69 | 35.90 | 146.77 | 103.62 |
| (R1) | AA + 0.005 in-degree + 0.5 in-links + | 29.90 | 123.74 | 29.96 | 143.06 | 100.25 |
| (R2) | R1 + 0.25 journal | 29.42 | 121.12 | 29.45 | 143.38 | 100.60 |
| (R3) | R2 + 0.004 title length | 29.14 | 119.58 | 29.22 | 140.30 | 97.27 |
| | R3 + 1.3 title word length | 29.21 | 118.94 | 29.33 | 139.75 | 96.79 |
| (R4) | R3 + 0.9 title word length + | 29.17 | 118.81 | 29.30 | 138.69 | 95.74 |
| (R5) | R4 + 0.4 ClusDlMed | 29.13 | 118.13 | 29.17 | 138.56 | 95.63 |
| | Our entry on \*\* | 31.80 | 118.89 | 31.77 | 141.60 | 98.72 |
| | Second best entry on \*\*\* | | | | 146.34 | |
| | Third best entry on \*\*\* | | | | 158.39 | |
Notes:
* This is a document that was downloaded 7160 times, which is more than ten times as much as the average download count. All our models make a huge error here, which increases the average error on the test set of 150 papers by about 40. Without this outlier, the test set is actually easier to predict than the test papers considered during cross-validation.
** This was AA + 0.006 in-degree + 0.7 in-links + 0.85 out-links + 0.35 journal + 0.006 title length + 0.3 ClusDlAvg. Additionally, the error cost parameter of the SVM algorithm was set to 0.7, whereas it was set to 1 in all other experiments reported here. (Later, when preparing the technical report about this work, we found a few slightly better representations, of which R5 in the above table is the best one.)
*** These two rows are based on the results published on the KDD Cup 2003 results page.
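The representation names used above, such as `AA + 0.004 in-degree`, are not spelled out on this page. One plausible reading, and it is only an assumption, is that each feature group is multiplied by its weight and appended to the combined sparse vector. The sketch below illustrates that reading; the function name `combine` and the dictionary layout are hypothetical.

```python
def combine(groups, weights):
    """Merge named sparse feature groups into one sparse vector.

    groups:  {group_name: {feature_key: value}}
    weights: {group_name: scale factor}; groups without a weight use 1.0.
    ASSUMPTION: "X + w Y" is read as "append group Y scaled by w".
    """
    combined = {}
    for name, vector in groups.items():
        w = weights.get(name, 1.0)
        for key, value in vector.items():
            # Prefix each key with its group name so groups cannot collide.
            combined[(name, key)] = w * value
    return combined
```

Under this reading, a weight controls how much a feature group contributes to distances and inner products relative to the unit-length author and abstract vectors.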
Explanation of individual features:
Abstract: The contents of the title and abstract of the paper. The bag-of-words model is used to represent text as a sparse vector. There is one component for each possible word, and its value is the TF-IDF value computed from: (1) the number of occurrences of this word in the title and abstract of the current paper; and (2) the number of documents that contain this word. The entire vector is normalized to Euclidean length 1.
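A minimal sketch of such a normalized TF-IDF bag-of-words vector follows. The exact TF-IDF formula is not given here, so the common `tf * log(N / df)` weighting and the whitespace tokenizer are assumptions, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {word: weight} dict per document, scaled to unit length."""
    # ASSUMPTION: lowercase + whitespace tokenization; the real tokenizer is unknown.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # occurrences of each word in this document
        # ASSUMPTION: weight = tf * log(N / df).
        vec = {w: c * math.log(n_docs / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {w: v / norm for w, v in vec.items()}
        vectors.append(vec)
    return vectors
```

Each input string would be the title and abstract of one paper concatenated. With this weighting, a word that occurs in every document gets weight 0.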
Author: There is one component for each possible author. Its value is set to 1 if this author is one of the authors of the current paper, and to 0 otherwise. Finally, the vector is normalized to have Euclidean length 1.
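As a small illustration, an indicator vector with k ones has Euclidean length sqrt(k), so after normalization every nonzero component equals 1/sqrt(k). The function name `author_vector` is hypothetical.

```python
import math

def author_vector(paper_authors):
    """Sparse author indicator vector, scaled to unit Euclidean length."""
    authors = set(paper_authors)
    if not authors:
        return {}
    # k ones normalized to length 1 give 1/sqrt(k) per nonzero component.
    value = 1.0 / math.sqrt(len(authors))
    return {author: value for author in authors}
```

One consequence is that a single-author paper has one component of weight 1, while each author of a four-author paper gets weight 0.5.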
In-degree: The number of papers that reference the current paper. Papers from the entire dataset are taken into account, not just those from the training period.
In-links: There is one feature for each paper in the dataset. Its value is set to 1 if that paper references the current paper, and to 0 otherwise. The resulting vector is then normalized to have Euclidean length 1.
Out-links: Like in-links, but with nonzero values corresponding to papers referenced by the current paper.
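The three citation-based features can be derived from a single list of citation pairs. The sketch below assumes the citation graph is available as `(citing_id, cited_id)` tuples; that input format and the name `citation_features` are assumptions made for illustration.

```python
import math
from collections import defaultdict

def citation_features(edges):
    """Compute in-degree, in-links and out-links from (citing_id, cited_id) pairs."""
    cited_by = defaultdict(set)  # paper -> papers that reference it
    cites = defaultdict(set)     # paper -> papers it references
    for citing, cited in edges:
        cited_by[cited].add(citing)
        cites[citing].add(cited)

    def unit(ids):
        # Indicator vector over paper IDs, scaled to Euclidean length 1.
        value = 1.0 / math.sqrt(len(ids))
        return {i: value for i in ids}

    in_degree = {p: len(s) for p, s in cited_by.items()}
    in_links = {p: unit(s) for p, s in cited_by.items()}
    out_links = {p: unit(s) for p, s in cites.items()}
    return in_degree, in_links, out_links
```

Papers that never appear as a citation target are simply absent from `in_degree` and `in_links`, which corresponds to an in-degree of 0 and an all-zero vector.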
Journal: There is one feature for each possible journal. Its value is set to 1 if this paper has been published in that journal, and to 0 otherwise. There is also one feature for papers that do not mention having been published in a journal.
Title length: The length of the title, in characters. Spaces and punctuation symbols count too (though it probably wouldn't matter much if they weren't included in these counts).
Title word length: The average length of words in the title. This is the title length in characters divided by the number of words in the title.
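Both title features reduce to a few lines of code. In this sketch, splitting words on whitespace is an assumption, and `title_features` is a hypothetical name.

```python
def title_features(title):
    """Return (title length, average title word length)."""
    length = len(title)    # counts spaces and punctuation as well
    words = title.split()  # ASSUMPTION: words are whitespace-separated
    # Title length in characters divided by the number of words.
    word_length = length / len(words) if words else 0.0
    return length, word_length
```

Because the numerator includes spaces and punctuation, the second value is slightly larger than the mean number of letters per word.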
Year: The year when this paper was added to the archive. For the papers we're interested in, this is either 2000, 2001, or 2002. This value can be trivially determined from the first two digits of the seven-digit paper ID (which has the form yymmnnn).
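A sketch of extracting the year from an ID of the form yymmnnn is shown below. The century pivot (two-digit years from 90 upward map to the 1990s) is an assumption that only matters for papers outside the 2000 to 2002 range, and `archive_year` is a hypothetical name.

```python
def archive_year(paper_id):
    """Return the four-digit year encoded in a yymmnnn paper ID."""
    # Accept ints too: restore leading zeros that an int would have lost.
    pid = str(paper_id).zfill(7)
    yy = int(pid[:2])
    # ASSUMPTION: yy >= 90 means 19yy, anything else means 20yy.
    return 1900 + yy if yy >= 90 else 2000 + yy
```

Storing the ID as a string avoids the leading-zero problem entirely; the `zfill` call is only a safeguard for IDs that were parsed as integers.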
ClusDlMed, ClusDlAvg: The median or average, respectively, of the number of downloads over all training papers that belong to the same cluster as the current paper. (The papers have been partitioned into 26 clusters by recursively applying the 2-means algorithm.)
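Given a cluster assignment, the two cluster statistics can be computed as in the sketch below. The clustering step itself (recursive 2-means) is not reproduced; the cluster IDs are taken as given, and the function name and input dictionaries are assumptions.

```python
from collections import defaultdict
from statistics import mean, median

def cluster_download_stats(cluster_of, downloads):
    """Per-cluster median and mean download counts.

    cluster_of: paper ID -> cluster ID, for all papers (training and test)
    downloads:  paper ID -> download count, for training papers only
    """
    by_cluster = defaultdict(list)
    for paper, count in downloads.items():
        by_cluster[cluster_of[paper]].append(count)
    return {
        cluster: {"ClusDlMed": median(counts), "ClusDlAvg": mean(counts)}
        for cluster, counts in by_cluster.items()
    }
```

The feature value for any paper, including a test paper with no known download count, would then be looked up as `stats[cluster_of[paper]]["ClusDlMed"]`. A cluster that contains no training papers has no entry and would need a fallback value.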