Title: | Hypothesis Testing Tree |
---|---|
Description: | A novel decision tree algorithm in the hypothesis testing framework. The algorithm examines the distribution difference between two child nodes over all possible binary partitions. The test statistic is equivalent to the generalized energy distance, which makes the algorithm more powerful at detecting complex structure, not only mean differences. It is applicable to numeric, nominal, and ordinal explanatory variables and to responses in a general metric space of strong negative type. The algorithm has superior performance compared to other tree models in type I error, power, prediction accuracy, and complexity. |
Authors: | Jiaqi Hu [cre, aut], Zhe Gao [aut], Bo Zhang [aut], Xueqin Wang [aut] |
Maintainer: | Jiaqi Hu <[email protected]> |
License: | GPL-3 |
Version: | 0.1.2 |
Built: | 2024-10-09 03:14:11 UTC |
Source: | https://github.com/cran/HTT |
The data concern the energy performance of buildings and contain eight input variables (relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area, glazing area distribution) and two output variables: heating load (HL) and cooling load (CL) of residential buildings. The goal is to predict the two real-valued responses from the eight input variables. It can also be treated as a multi-class classification problem if the response is rounded to the nearest integer.
data("ENB")
A data frame with 768 observations on the following 10 variables.
X1 | Relative Compactness |
X2 | Surface Area |
X3 | Wall Area |
X4 | Roof Area |
X5 | Overall Height |
X6 | Orientation |
X7 | Glazing Area |
X8 | Glazing Area Distribution |
Y1 | Heating Load |
Y2 | Cooling Load |
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Energy+efficiency.
A. Tsanas, A. Xifara: 'Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools', Energy and Buildings, Vol. 49, pp. 560-567, 2012
data(ENB)
set.seed(1)
idx = sample(1:nrow(ENB), floor(nrow(ENB)*0.8))
train = ENB[idx, ]
test = ENB[-idx, ]
htt_enb = HTT(cbind(Y1, Y2) ~ . , data = train, controls = htt_control(pt = 0.05, R = 99))
# prediction
pred = predict(htt_enb, newdata = test)
test_y = test[, 9:10]
# MAE
colMeans(abs(pred - test_y))
# MSE
colMeans(abs(pred - test_y)^2)
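The ENB description mentions treating the task as multi-class classification by rounding the response. A minimal sketch of that preprocessing step follows. The helper name `to_classes` is hypothetical (it is not part of HTT), and the ENB part only runs when the HTT package is installed.

```r
# Hypothetical helper (not part of HTT): round a numeric response to the
# nearest integer and treat the rounded values as class labels.
to_classes <- function(y) factor(round(y))

to_classes(c(6.4, 6.6, 15.2))

# Assumed usage with the ENB data; skipped when HTT is not installed.
if (requireNamespace("HTT", quietly = TRUE)) {
  data("ENB", package = "HTT")
  enb_cls <- ENB
  enb_cls$Y1 <- to_classes(enb_cls$Y1)  # heating load as a class label
  enb_cls$Y2 <- NULL                     # keep a single response
  table(enb_cls$Y1)                      # class frequencies after rounding
}
```

Rounding produces many small classes, so inspecting the class frequencies before fitting a classification tree is a reasonable first check.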
Various parameters that control aspects of the HTT function.
htt_control(teststat = c("energy0", "energy1"),
            testtype = c("permutation", "fastpermutation"),
            alpha = 1, pt = 0.05, minsplit = 30,
            minbucket = round(minsplit/3), R = 199, nmin = 1000)
teststat |
a character specifying the type of the test statistic to be applied.
It can be "energy0" or "energy1". |
testtype |
a character specifying how to compute the distribution of the test statistic.
It can be "permutation" or "fastpermutation". |
alpha |
the exponent on Euclidean distance in (0,2] (for regression tree).
Default is 1. |
pt |
the threshold that the p-value of the permutation test must be less than in order to implement a split.
If |
minsplit |
the minimum number of observations in a node
in order to be considered for splitting.
Default is 30. |
minbucket |
the minimum number of observations in a terminal node.
Default is round(minsplit/3). |
R |
the number of permutation replications used to simulate
the distribution of the test statistic.
Default is 199. |
nmin |
the minimum number of observations in a node that does not require
the permutation test (for |
The arguments teststat, testtype and pt determine the hypothesis test applied at each split. The argument R is the number of permutations to be used. For datasets with more than 2000 observations, testtype = "fastpermutation" is useful for saving time.
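The 2000-observation guideline above can be turned into a small rule of thumb. The sketch below is illustrative only: `pick_testtype` and its `threshold` argument are hypothetical names, not part of HTT, and the call to `htt_control` is guarded so the snippet runs even without the package installed.

```r
# Hypothetical helper (not part of HTT): choose the testtype from the
# number of observations, following the 2000-observation guideline.
pick_testtype <- function(n, threshold = 2000) {
  if (n > threshold) "fastpermutation" else "permutation"
}

pick_testtype(768)
pick_testtype(5000)

# Build a control list only when the HTT package is available.
if (requireNamespace("HTT", quietly = TRUE)) {
  ctrl <- HTT::htt_control(testtype = pick_testtype(5000), R = 99)
}
```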
A list containing the options.
## choose the teststat as "energy1"
htt_control(teststat = "energy1")
## choose the p-value 0.01
htt_control(pt = 0.01)
## choose the alpha to 0.5
htt_control(alpha = 0.5)
## change the minimum number of observations in a terminal node
htt_control(minbucket = 7)
## reduce the number of permutation replications to save time
htt_control(R = 99)
A class for representing a hypothesis testing tree.
frame |
a data frame of split information. It contains the following information:
|
where |
an integer vector of the same length as the number of observations in the
root node, containing the row number of |
method |
the method used to grow the hypothesis testing tree, |
control |
a list of options that control the |
X |
a copy of the input |
var.type |
a vector recording the type of each variable: 0 represents continuous, 1 represents ordinal and 2 represents nominal variables. |
HTT, plot.htt, print.htt, predict.htt
Fit a hypothesis testing tree.
HTT(formula, data, method, distance, controls = htt_control(...), ...)
formula |
a symbolic description of the model to be fit. |
data |
a data frame containing the variables in the model. |
method |
|
distance |
If |
controls |
a list of options that control details of the |
... |
arguments passed to |
Hypothesis testing trees examine the distribution difference between the two child nodes produced by a binary partition, within a hypothesis testing framework. At each split, the algorithm finds the maximum distribution difference over all possible binary partitions; the test statistic is based on the generalized energy distance. A permutation test is used to estimate the p-value of the hypothesis test.
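To make the idea concrete, the following base R sketch computes an energy distance between two groups of responses and a permutation p-value for one fixed binary partition. It is a simplified illustration, not the package's implementation: the function names `energy_distance` and `perm_pvalue` are hypothetical, the statistic is the plain V-statistic form with Euclidean distance raised to `alpha`, and the search over all candidate partitions that HTT performs is omitted.

```r
# Energy distance between two samples (rows are observations), using
# Euclidean distance raised to the power alpha, with alpha in (0, 2].
energy_distance <- function(x, y, alpha = 1) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n <- nrow(x)
  m <- nrow(y)
  d <- as.matrix(dist(rbind(x, y)))^alpha  # all pairwise distances
  ix <- seq_len(n)
  iy <- n + seq_len(m)
  2 * mean(d[ix, iy]) - mean(d[ix, ix]) - mean(d[iy, iy])
}

# Permutation p-value for one fixed partition, given as a logical vector
# `left` marking the observations sent to the left child node.
perm_pvalue <- function(y, left, R = 199, alpha = 1) {
  y <- as.matrix(y)
  stat <- function(l) {
    energy_distance(y[l, , drop = FALSE], y[!l, , drop = FALSE], alpha)
  }
  observed <- stat(left)
  permuted <- replicate(R, stat(sample(left)))  # shuffle group labels
  (1 + sum(permuted >= observed)) / (R + 1)
}

set.seed(1)
y <- c(rnorm(40, mean = 0), rnorm(40, mean = 3))  # two shifted groups
left <- rep(c(TRUE, FALSE), each = 40)
p_val <- perm_pvalue(y, left, R = 99)
p_val
```

Because the labels are shuffled while group sizes stay fixed, the permuted statistics approximate the null distribution of "no difference between child nodes" for that partition. A split would then be accepted only when the p-value falls below the chosen threshold.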
An object of class htt. See htt.object.
Jiaqi Hu
htt_control, print.htt, plot.htt, predict.htt
## regression
data("Boston", package = "MASS")
Bostonhtt <- HTT(medv ~ . , data = Boston, controls = htt_control(R = 99))
plot(Bostonhtt)
mean((Boston$medv - predict(Bostonhtt))^2)
## classification
irishtt <- HTT(Species ~., data = iris)
plot(irishtt)
mean(iris$Species == predict(irishtt))
Visualize an htt object. Several arguments can be passed to control the color and shape.
## S3 method for class 'htt'
plot(x, digits = 3, line.color = "blue", node.color = "black",
     line.type = c("straight", "curved"),
     layout = c("tree", "dendrogram"), ...)
x |
fitted model object of class |
digits |
the number of significant digits in displayed numbers.
Default is 3. |
line.color |
a character specifying the edge color.
Default is "blue". |
node.color |
a character specifying the node color.
Default is "black". |
line.type |
a character specifying the type of edge,
|
layout |
a character specifying the layout,
|
... |
additional print arguments. |
This function is a method for the generic function plot, for objects of class htt.
Visualize the hypothesis testing tree.
print.htt, printsplit, predict.htt
irishtt = HTT(Species ~., data = iris)
plot(irishtt)
# change the line color and node color
plot(irishtt, line.color = "black", node.color = "blue")
# change the line type
plot(irishtt, line.type = "curved")
# change the layout
plot(irishtt, layout = "dendrogram")
Compute predictions from an htt object.
## S3 method for class 'htt'
predict(object, newdata, type = c("response", "prob", "node"), ...)
object |
fitted model object of class |
newdata |
an optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used. |
type |
a character string denoting the type of predicted value returned.
For |
... |
additional print arguments. |
This function is a method for the generic function predict for class htt. It can be invoked by calling predict for an object of the appropriate class, or directly by calling predict.htt regardless of the class of the object.
A list of predictions, possibly simplified to a numeric vector, numeric matrix or factor.
If type = "response": the mean of a numeric response or the predicted class for a categorical response is returned.
If type = "prob": the matrix of class probabilities is returned for a categorical response.
If type = "node": an integer vector of terminal node identifiers is returned.
irishtt <- HTT(Species ~., data = iris)
## the predicted class
predict(irishtt, type = "response")
## class probabilities
predict(irishtt, type = "prob")
## terminal node identifiers
predict(irishtt, type = "node")
This function prints an htt.object. It is a method for the generic function print for class htt. It can be invoked by calling print for an object of the appropriate class, or directly by calling print.htt regardless of the class of the object.
## S3 method for class 'htt'
print(x, ...)
x |
fitted model object of class |
... |
additional print arguments. |
A semi-graphical layout of the contents of x$frame is printed. Indentation is used to convey the tree topology. Information for each node includes the node number, split rule, size and p-value. For the "class" method, the class probabilities are also printed.
Visualize the hypothesis testing tree in a semi-graphical layout.
irishtt = HTT(Species ~., data = iris)
print(irishtt)
Display the split table for a fitted htt object.
printsplit(object)
object |
fitted model object of class |
Display the split table.
irishtt = HTT(Species ~., data = iris)
printsplit(irishtt)