Tuesday, June 12, 2012

Notes on RuleFit

Edit (Dec 24, 2017): I have developed an R package for deriving prediction rule ensembles: pre (available from CRAN). It provides most of the functionality of the RuleFit package, with some improvements and adjustments. The main differences with RuleFit are: 1) pre derives prediction rules using the unbiased tree-growing algorithms from package partykit, instead of CART trees, which have biased variable selection; 2) pre is completely R-based, which makes it easier to use the rules, coefficients, importances, etc. for further computations (but also makes it a bit slower); 3) whereas RuleFit allows for a bagging and/or boosting approach to generating the initial ensemble of prediction rules, pre allows for bagging, boosting and/or random forest approaches.
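A minimal example of fitting a rule ensemble with pre, using the airquality data as in the package documentation (defaults may differ between package versions):

library(pre)
airq <- airquality[complete.cases(airquality), ]  # remove rows with missing values
airq.ens <- pre(Ozone ~ ., data = airq)           # derive and select prediction rules
airq.ens                                          # print the rules and their coefficients
importance(airq.ens)                              # variable importances
predict(airq.ens, newdata = airq[1:5, ])          # predictions for (new) observations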

 

Original post:

RuleFit provides an ensemble of linear regression functions and prediction rules derived from CART trees. It is one of the very few prediction rule ensemble methods that can be used for both classification and regression (cf. C4.5, which handles classification only). The paper describing the methodology and program is: Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3), 916-954 (an earlier version circulated as a 2005 technical report, which is the version referenced below). The TMVA (Toolkit for Multivariate Data Analysis, part of the ROOT framework) provides an implementation of RuleFit as well, but only for classification.

  

Arguments of the rulefit function

sparse corresponds to \lambda in Friedman & Popescu (2005)
memory.par corresponds to \nu in Friedman & Popescu (2005)
samp.fract corresponds to \eta in Friedman & Popescu (2005)
tree.size corresponds to \overline{L} in Friedman & Popescu (2005)
inter.supp corresponds to \kappa in Friedman & Popescu (2005)

The penalty function in the Friedman & Popescu (2005) paper shows only the lasso penalty (equation 4), but the elastic net penalty is implemented in RuleFit as well. If you set the sparse argument of the rulefit function to a very low value (e.g., sparse=.0000001), the model will be built using the ridge penalty, resulting in a much larger number of terms in the model.
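For illustration, two hypothetical calls (x is the predictor matrix and y the outcome, as in the rulefit(x, y) call used later in this post; check the RuleFit help file for the full argument list):

rulefit(x, y)                   # default: lasso penalty (equation 4 of the paper)
rulefit(x, y, sparse=.0000001)  # near-zero sparse: ridge penalty, many more terms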

samp.fract=min(1,(11*sqrt(neff)+1)/neff) by default. When all observation weights are equal, neff = n for regression, so samp.fract=min(1,(11*sqrt(n)+1)/n). Thus, by default, samp.fract becomes smaller as n increases; it drops below 1 once neff > 122 (i.e., from neff = 123 onwards).
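In R, this default can be computed directly from the formula:

samp.fract.default <- function(neff) min(1, (11*sqrt(neff) + 1)/neff)
samp.fract.default(122)    # 1: the full sample is still used
samp.fract.default(123)    # 0.99997: subsampling just kicks in
samp.fract.default(10000)  # 0.1101: only about 11% of the observations per base learner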
 
test.frac should be a value between 0.1 and 0.5, and is 0.2 by default. The test sample is used for what some would refer to as a validation sample: determining the optimal value of \lambda, the penalty parameter.

If test.reps is set to a value > 0, then [value]-fold cross-validation is used for determining the optimal value of \lambda. By default, test.reps=round(min(20,max(0.0,5200/neff-2))): it is at most 20 and at least 0, and it decreases as neff (n) increases. Ignoring the rounding, the default test.reps drops below 20 once neff is 237 or larger, and below 4 once neff is 867 or larger.
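Likewise, this default can be computed directly:

test.reps.default <- function(neff) round(min(20, max(0, 5200/neff - 2)))
test.reps.default(200)   # 20
test.reps.default(500)   # 8
test.reps.default(1000)  # 3
test.reps.default(2600)  # 0: no cross-validation by default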

The max.trms argument is very useful for controlling the size of the final ensemble. However, the number specified is only an approximate maximum number of terms: in most cases, the ensemble will be somewhat bigger, especially when test.reps > 1.

The sparse argument selects the sparse regression method used for determining the weights of the prediction functions in the final ensemble (thus, it also selects the prediction functions: prediction functions with zero weight are not included in the final ensemble). For ensembles of human-interpretable size, use sparse=3: it uses forward stepwise (regression) or stagewise (classification) regression (see Hastie, Taylor, Tibshirani, & Walther, 2007).

Hastie, T., Taylor, J., Tibshirani, R., & Walther, G. (2007). Forward stagewise regression and the monotone lasso. Electronic Journal of Statistics, 1, 1-29.
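Combining this with the max.trms argument from the previous paragraph, a hypothetical call for a small, interpretable ensemble (argument names as discussed in this post; check the RuleFit help file for your version):

rulefit(x, y, sparse=3, max.trms=10)  # stepwise selection, roughly 10 terms at most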


Other functions

The rfxeval function uses all data to estimate the expected extra-sample error Err: the average generalization error when the fitted prediction function \hat{f}(X) is applied to an independent test sample (see section 7.10 of Hastie, Tibshirani, & Friedman, 2009).
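In the notation of that chapter, Err = E[L(Y, \hat{f}(X))]: the expected loss L for a new observation (X, Y) drawn independently of the data used for fitting, averaged over training samples as well.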



Troubleshooting

Do not use rfxeval() before rfpred() or rules(): it seems to change the current model in the RuleFit home directory. If rfxeval() has been used, rebuild the RuleFit model using rulefit(x,y) before using rules() and rfpred(). I'm not sure whether this only affects the rules and rfpred functions.

You can obtain a list (instead of a plot) of the variable importances by using the following code:
imp <- varimp(plot=F); imp

The rules() function provides a list of prediction functions, printed in a command prompt window. This is not very convenient for editing, copying and pasting. However, when the rules() function is used, a file named "rulesout.hlp" is created in the working directory, which can be opened using any text editor. Be aware that R and RuleFit will hang when the final ensemble consists of < 10 prediction functions. I think this is due to the defaults of the rules function (rules(begin=1, end=begin+9)). There are three ways to solve this (the last two are sketched in code below):
1) Open the Windows Task Manager and end the process called 'rf_go.exe'; then find the file 'rulesout.hlp' in the working directory and open it in a text editor.
2) Open the Windows Task Manager and end the process called 'rf_go.exe'; make sure the R working directory is the directory from which RuleFit is run; then type 'readLines("rulesout.hlp")' in R.
3) Type 'rules(1, [some number])', where [some number] is the number of terms in the current RuleFit model.


Lambda value

According to Hastie, Tibshirani and Friedman (2009), cross-validation should be used to estimate shrinkage, or smoothing, parameters. To use cross-validation for determining the lambda parameter (in other words, the size of the final model) in the rulefit function, use the test.reps argument. Cross-validation then provides the shrinkage parameter used for model selection, but this is done implicitly, and currently there is no way to obtain the actual lambda value used.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). New York: Springer.


Range of prediction rules: missings

According to the RuleFit help file, missing predictor variable values (NAs) are coded 9.0e30 (observations with missing values on the output variable are of no use to the algorithm, and are excluded from the analysis). In the description of the prediction rules, you may come across something like:

                                                                        
"Rule   1:     2  variables"                                               
"     support =  0.3496      coeff =   1.538      importance =   100.00          "     V1:  range = -0.9900E+36   1.500"                             
"     V2:  range =   1.500      0.9000E+31"                        
"     V3:  range = 0.9000E+31   0.9900+36" 
 

This rule would not apply to observations with missing values for V2 (the range does not include values equal to or higher than 0.9e31), but it would apply to observations with missing values for V3 (the range includes values equal to or higher than 0.9e31).
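A small sketch of this reading of the printed ranges (treating lower bounds as inclusive and upper bounds as exclusive, consistent with the interpretation above; 9.0e30 is the missing-value code):

na.code <- 9.0e30
applies.to.missing <- function(lower, upper) lower <= na.code & na.code < upper
applies.to.missing(1.5, 0.9e31)      # V2: FALSE -- the rule excludes missings
applies.to.missing(0.9e31, 0.99e36)  # V3: TRUE -- the rule covers missings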
