Over the last decade, interest in learning ranking models has grown, probably because of the wide
range of potential applications. However, one drawback of this learning task in some real-world applications is
that its results are hard to interpret. Indeed, the most commonly used ranking methods (e.g. SVMs, boosting) give the user no explanation of the ranked list.
In that context, we have developed YRank, a new method for explaining the top part of a ranked list.
The underlying idea is to produce interpretable rules by learning the main properties that characterize the top part of the ranked list as opposed to the bottom part.
Such intelligible rules can help the domain expert direct actions and better understand the target system.
Note that YRank was formerly named YSVM, as it was initially developed to cope with the lack of intelligibility of SVMs.
We renamed it because it is actually independent of the ranking method used: you can provide any ranked list, and it does not have to
be produced by SVMs.
To run YRank, you need to provide the system with a ranked list of examples described by categorical
and numerical attributes. For numerical attributes, you can specify the discretization method you
want. YRank supports two discretization methods: equisize and expert. For equisize, you
give the bin size, while for expert you provide the cut points. Discretization must be
specified for each numerical attribute, in a separate file passed with the -a option.
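The two discretization methods can be illustrated with a minimal sketch. This is not YRank's own code: the function names equisize_bin and expert_bin are hypothetical, and the sketch assumes half-open bins [low,high) and equisize bins aligned to multiples of the bin size, which matches the labels shown in the toy example (such as Size=[500,1000) and Age=[25,+inf)) but is not confirmed by the scripts themselves.

```python
import bisect
import math


def equisize_bin(value, width):
    """Label the fixed-width bin that contains value.

    Assumption: bins are aligned to multiples of width.
    """
    low = math.floor(value / width) * width
    return f"[{low},{low + width})"


def expert_bin(value, cut_points):
    """Label the bin that contains value, given sorted expert cut points."""
    i = bisect.bisect_right(cut_points, value)
    low = "(-inf" if i == 0 else f"[{cut_points[i - 1]}"
    high = "+inf" if i == len(cut_points) else str(cut_points[i])
    return f"{low},{high})"


print(equisize_bin(500, 500))            # [500,1000)
print(equisize_bin(2000, 500))           # [2000,2500)
print(expert_bin(2, [3, 10, 15, 25]))    # (-inf,3)
print(expert_bin(25, [3, 10, 15, 25]))   # [25,+inf)
```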
This distribution contains two scripts: feature_rankings.py and iterate_feature_rankings.py.
feature_rankings.py
Usage: feature_rankings.py [options] [<]filename
Options:
  -h, --help            show this help message and exit
  -b BOTTOM_PRECENTAGE, --bottom-precentage=BOTTOM_PRECENTAGE
                        percentage of input to be considered "bottom"
  -t TOP_PRECENTAGE, --top-precentage=TOP_PRECENTAGE
                        percentage of input to be considered "top"
  -s SUPPORT_THRESHOLD, --support-threshold=SUPPORT_THRESHOLD
                        minimum percentage for support threshold
  -l LEVERAGE_THRESHOLD, --leverage-threshold=LEVERAGE_THRESHOLD
                        minimum leverage threshold between 0 and 0.25
  -p FIELD_SEPARATOR, --field-separator=FIELD_SEPARATOR
                        field separator
  -m MAX_FEATURES, --max-features=MAX_FEATURES
                        Maximum number of features to combine
  -a NUMERICAL_ATTRIBUTES, --numerical-attributes=NUMERICAL_ATTRIBUTES
                        Filename which contains specs for discretization
  -c COPY_MODIFIED_INPUT, --copy-modified-input=COPY_MODIFIED_INPUT
                        Filename to which to write a copy of discretized input
  -g, --graph           Output histograms
  -q, --quiet           suppress certain error messages
iterate_feature_rankings.py
Usage: iterate_feature_rankings.py infile from_value to_value step [--extra-args=...]
Options:
  -h, --help            show this help message and exit
  --extra-args=EXTRA_ARGS
                        extra arguments to pass to feature_rankings.py
Examples
Sample datasets are available for download. The files are ranked with a
linear SVM.
Here is a complete toy example:
toy.txt
--------
Rank Size Manufacturer Age
1 500 B 2
2 500 B 8
3 1000 C 12
4 500 A 1
5 500 B 17
6 500 A 27
7 2000 A 2
8 2000 A 8
9 2000 A 12
10 2500 B 25
toy.att
--------
Size equisize 500
Age expert 3,10,15,25
python feature_rankings.py -q -b 50 -t 50 -s 0.0001 -l 0.1 -m 2 toy -a toy.att
num bottom =5
num top =5
final num records =10
Total number of features = 7
Average leverage = 0.135714285714
Standard deviation of leverage = 0.0225876975726
Number of top features = 3
Average leverage for top features = 0.133333333333
Standard deviation of leverage for top features = 0.0235702260396
Number of bottom features = 4
Average leverage for bottom features = 0.1375
Standard deviation of leverage for bottom features = 0.0216506350946
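The summary figures above can be checked by hand. The sketch below is not part of YRank: it copies the leverage values from the toy PropertiesTop.txt and PropertiesBottom.txt listings and assumes the reported standard deviation is the population standard deviation (dividing by n), which is what reproduces the printed numbers. The helper name summarize is hypothetical.

```python
from statistics import mean, pstdev

# Leverage values copied from the toy output listings.
top = [0.15, 0.15, 0.10]            # PropertiesTop.txt
bottom = [0.15, 0.15, 0.10, 0.15]   # PropertiesBottom.txt


def summarize(values):
    """Return (average, population standard deviation) of the leverages."""
    return mean(values), pstdev(values)


for name, values in [("all", top + bottom), ("top", top), ("bottom", bottom)]:
    avg, std = summarize(values)
    print(name, len(values), round(avg, 4), round(std, 4))
# all 7 0.1357 0.0226
# top 3 0.1333 0.0236
# bottom 4 0.1375 0.0217
```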
PropertiesAll.txt
------------------
feature                              Freq_top  sup_top  lev_top  Freq_bottom  sup_bottom  lev_bottom
----------------------------------------------------------------------------------------------------
Size=[2000,2500)                     0         0        -0.15    3            0.30        0.15
Size=[500,1000)                      4         0.40     0.15     1            0.10        -0.15
Manufacturer=A                       1         0.10     -0.15    4            0.40        0.15
Age=[25,+inf)                        0         0        -0.10    2            0.20        0.10
Manufacturer=A AND Size=[2000,2500)  0         0        -0.15    3            0.30        0.15
Manufacturer=B AND Size=[500,1000)   3         0.30     0.15     0            0           -0.15
Age=(-inf,3) AND Size=[500,1000)     2         0.20     0.10     0            0           -0.10
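The sup and lev columns above are consistent with the usual definition of leverage: the support of the feature within one part of the list, minus the product of the feature's overall frequency and the relative size of that part. The sketch below recomputes two rows from the toy data under that assumed definition. It is an illustration, not the code of feature_rankings.py, and the names support_and_leverage, small_size and maker_a are hypothetical.

```python
# Toy data from toy.txt, in rank order: (Size, Manufacturer, Age).
records = [
    (500, "B", 2), (500, "B", 8), (1000, "C", 12), (500, "A", 1), (500, "B", 17),
    (500, "A", 27), (2000, "A", 2), (2000, "A", 8), (2000, "A", 12), (2500, "B", 25),
]
top, bottom = records[:5], records[5:]  # corresponds to -t 50 and -b 50


def support_and_leverage(feature, part, everything):
    """Return (freq, support, leverage) of a feature within one part of the list.

    Assumed definitions:
      support  = freq / N
      leverage = support - P(feature) * P(part)
    """
    n = len(everything)
    freq = sum(1 for r in part if feature(r))
    overall = sum(1 for r in everything if feature(r)) / n
    support = freq / n
    return freq, support, support - overall * len(part) / n


def small_size(record):
    """Feature Size=[500,1000)."""
    return 500 <= record[0] < 1000


def maker_a(record):
    """Feature Manufacturer=A."""
    return record[1] == "A"


for feature, part in [(small_size, top), (maker_a, bottom), (maker_a, top)]:
    freq, sup, lev = support_and_leverage(feature, part, records)
    print(freq, round(sup, 2), round(lev, 2))
# 4 0.4 0.15
# 4 0.4 0.15
# 1 0.1 -0.15
```

With two equal halves, a feature that appears only in one half and covers that whole half reaches the maximum leverage of 0.5 - 0.5 * 0.5 = 0.25, which matches the upper bound given for the -l option.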
PropertiesBottom.txt
---------------------
feature                              freq  support  leverage
--------------------------------------------------------------------------------
Size=[2000,2500)                     3     0.30     0.15
Manufacturer=A                       4     0.40     0.15
Age=[25,+inf)                        2     0.20     0.10
Manufacturer=A AND Size=[2000,2500)  3     0.30     0.15
PropertiesTop.txt
------------------
feature                              freq  support  leverage
--------------------------------------------------------------------------------
Size=[500,1000)                      4     0.40     0.15
Manufacturer=B AND Size=[500,1000)   3     0.30     0.15
Age=(-inf,3) AND Size=[500,1000)     2     0.20     0.10
Histograms
----------
size.png
Manufacturer.png
Age.png
Copyright CCLS -- Columbia University. YRank is a research prototype developed at CCLS and is
distributed without any warranty.
Questions or suggestions? Please send an email to: ansaf AT ccls DOT columbia DOT edu.