--- title: "Focus parameters in SEM forests" author: "Andreas M. Brandmaier" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Focus parameters in SEM forests} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", dpi=300, out.width="50%" ) library(ggplot2) ``` We first generate a mixture of bivariate normal distributions. The distributions differ only by their x- and y-displacement, that is, by their mean values. There are two predictors `grp1` and `grp2` which predict the differences in means. `grp1` predicts differences in the first dimension and `grp2` predicts differences in the second dimension. Without focus parameter, both predictors are needed to distinguish all four groups. If one of the two means is chosen as a focus parameter, only one of the two predictors is important. ```{r} library(semtree) set.seed(123) N <- 1000 grp1 <- factor(sample(x = c(0,1), size=N, replace=TRUE)) grp2 <- factor(sample(x = c(0,1), size=N, replace=TRUE)) noise <- factor(sample(x = c(0,1),size=N, replace=TRUE)) Sigma <- matrix(byrow=TRUE, nrow=2,c(2,0.2, 0.2,1)) obs <- MASS::mvrnorm(N,mu=c(0,0), Sigma=Sigma) obs[,1] <- obs[,1] + ifelse(grp1==1,3,0) obs[,2] <- obs[,2] + ifelse(grp2==1,3,0) df.biv <- data.frame(obs, grp1, grp2, noise) names(df.biv)[1:2] <- paste0("x",1:2) manifests<-c("x1","x2") ``` The following code specifies a bivariate Gaussian model with five parameters: ```{r} model.biv <- mxModel("Bivariate_Model", type="RAM", manifestVars = manifests, latentVars = c(), mxPath(from="x1",to=c("x1","x2"), free=c(TRUE,TRUE), value=c(1.0,.2) , arrows=2, label=c("VAR_x1","COV_x1_x2") ), mxPath(from="x2",to=c("x2"), free=c(TRUE), value=c(1.0) , arrows=2, label=c("VAR_x2") ), mxPath(from="one",to=c("x1","x2"), label=c("mu1","mu2"), free=TRUE, value=0, arrows=1), mxData(df.biv, type = "raw") ); result <- mxRun(model.biv) summary(result) ``` This is how the data look in a 2D space: ```{r} df.biv.pred <- data.frame(df.biv, leaf=factor(as.numeric(df.biv$grp2)*2+as.numeric(df.biv$grp1))) ggplot(data = df.biv.pred, aes(x=x1, y=x2, group=leaf))+ geom_density_2d(aes(colour=leaf))+ viridis::scale_color_viridis(discrete=TRUE)+ theme_classic() ``` Now, we choose the mean of the second dimension `mu2` as focus parameter. We expect that only predictor `grp2`. This is what we see in a single tree. ```{r message=FALSE,eval=TRUE, results="hide"} fp <- "mu2" # predicted by grp2 #fp <- "mu1" # predicted by grp1 tree.biv <- semtree(model.biv, data=df.biv, constraints = list(focus.parameters=fp)) ``` ```{r} plot(tree.biv) ``` Now, we are repeating the same analysis in a forest. ```{r message=FALSE, warning=FALSE,results="hide"} forest <- semforest(model.biv, data=df.biv, constraints = list(focus.parameters=fp), control=semforest.control(num.trees=10, control=semtree.control(method="score",alpha=1))) ``` By default, we see that individual trees are fully grown (without a p-value threshold). The first split is according to `grp2` because it best explains the group differences. Subsequent splits are according to `grp1` even though the chi2 values are close to zero. They only appear because there is no p-value-based stopping criterion. ```{r} plot(forest$forest[[1]]) ``` Now, let us investigate the permutation-based variable importance: ```{r} vim <- varimp(forest, method="permutationFocus") plot(vim, main="Variable Importance") ```