## Step 1: Explore your dataset to generate hypotheses

For example, we may have a dataset of 100 observations on 10 variables that are actually uncorrelated. To illustrate, let’s generate such a dataset:```
set.seed(1)
data <- matrix(rnorm(1000), nrow = 100, ncol = 10)
colnames(data) <- paste("x", 1:10, sep = "")
```

Let’s explore the data by inspecting the correlations:```
library(corrplot)
corrplot(cor(data), method = "number", type = "upper")
```

We got at least one meaningful correlation there: x4 and x9 show a correlation \(>.2\). Now let’s perform some statistical testing to see if some are correlations significant:

```
library(Hmisc)
pvalues <- rcorr(data)$P
diag(pvalues) <- 1
corrplot(1-pvalues, type = "upper", method = "number", main = "1 minus p-value")
```

Two out of 45 correlation coefficients are significant. Surely there must be a publication in these data! Also, we could decide to test one- instead of two-sided, then we have 3 significant associations. But what would be much better than 3 significant correlations? Yes, indeed: a mediation effect!

## Step 2: Perform confirmatory analyses on the same dataset

### Example 1: Mediation Analysis

Obviously, we hypothesized before gathering our data that x10 is a causal variable, with a partially indirect effect on x9, via x4.```
lm1 <- lm(x9 ~ x10, data = data.frame(data))
summary(lm1)
```

```
##
## Call:
## lm(formula = x9 ~ x10, data = data.frame(data))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.46497 -0.74698 0.03043 0.64968 2.86824
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.009241 0.107194 0.086 0.9315
## x10 0.205740 0.102135 2.014 0.0467 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.03976, Adjusted R-squared: 0.02996
## F-statistic: 4.058 on 1 and 98 DF, p-value: 0.04671
```

Great! The causal and dependent variable are signficantly associated!```
lm2 <- lm(x9 ~ x4 + x10, data = data.frame(data))
summary(lm2)
```

```
##
## Call:
## lm(formula = x9 ~ x4 + x10, data = data.frame(data))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.36641 -0.62453 0.04468 0.62737 2.84705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.001822 0.105719 -0.017 0.9863
## x4 0.216306 0.107765 2.007 0.0475 *
## x10 0.181501 0.101315 1.791 0.0763 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.07805, Adjusted R-squared: 0.05904
## F-statistic: 4.106 on 2 and 97 DF, p-value: 0.01942
```

Awesome! We have partial mediation: the effect of x10 is smaller when x4 is included in the equation and even becomes insignificant.### Example 2: Factor Analysis

Confirmatory factor analysis (CFA) offers a great way to generate a lot of outcomes at once: There will always be a parameter estimate, significance test or fit index that you can use to back up any interpretation you prefer. As we have no clue what we are looking for in our data, let’s explore the data first using PCA, which will provide us with the hypotheses we need to subsequently fit a CFA on the same data:```
library(psych)
VSS.scree(data)
```

The scree plot indicates that five components have an eigenvalue $ > 1 $, providing justification for a 5-factor solution. However, such a solution is not very parsimonious for 10 items, so let’s use the elbow criterion and hypothesize that there are 2 factors in our data.

Of course, a better way to go about exploring the factorial structure of the data would be to perform parallel analysis, which would have shown that we would have obtained a very similar scree plot using simulated or resampled data:

`fa.parallel(data)`

`## Parallel analysis suggests that the number of factors = 0 and the number of components = 0`

But that spoils all the fun, so let’s ignore these results. Instead, we explore and interpret the 2-component solution:`principal(data, nfactors = 2, rotate = "varimax")`

```
## Principal Components Analysis
## Call: principal(r = data, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## x1 0.07 0.48 0.235 0.76 1.0
## x2 -0.07 0.28 0.082 0.92 1.1
## x3 -0.08 -0.27 0.080 0.92 1.2
## x4 0.64 -0.05 0.408 0.59 1.0
## x5 -0.05 0.52 0.272 0.73 1.0
## x6 0.02 -0.49 0.244 0.76 1.0
## x7 -0.31 0.28 0.176 0.82 2.0
## x8 0.00 0.62 0.382 0.62 1.0
## x9 0.68 0.00 0.456 0.54 1.0
## x10 0.65 0.20 0.461 0.54 1.2
##
## RC1 RC2
## SS loadings 1.40 1.40
## Proportion Var 0.14 0.14
## Cumulative Var 0.14 0.28
## Proportion Explained 0.50 0.50
## Cumulative Proportion 0.50 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.13
## with the empirical chi square 153.01 with prob < 5.9e-20
##
## Fit based upon off diagonal values = -0.88
```

A clear structure emerges: x4, x9 and x10 form a factor; x1, x5, x6 and x8 form a factor; x7 loads on both factors. We may need to discard x2 and x3 because they have low correlations with the components.This is exactly the structure we expected! So let’s use this structure to perform the CFA:

```
library(lavaan)
model <- '
F1 =~ x4 + x9 + x10 + x7
F2 =~ x1 + x5 + x6 + x8 + x7
'
fit <- cfa(model, data)
fitmeasures(fit, c("chisq", "df", "pvalue", "cfi", "srmr", "rmsea"))
```

```
## chisq df pvalue cfi srmr rmsea
## 13.777 18.000 0.744 1.000 0.053 0.000
```

Great! Our model has a non-significant chi-square value, CFI and RMSEA indicate perfect model fit and SRMR indicates a well-fitting model! Let’s look at the parameter estimates:`summary(fit, standardized = TRUE)`

```
## lavaan (0.5-23.1097) converged normally after 61 iterations
##
## Number of observations 100
##
## Estimator ML
## Minimum Function Test Statistic 13.777
## Degrees of freedom 18
## P-value (Chi-square) 0.744
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## F1 =~
## x4 1.000 0.427 0.433
## x9 1.226 0.861 1.423 0.155 0.524 0.484
## x10 0.898 0.589 1.523 0.128 0.384 0.366
## x7 -0.418 0.450 -0.929 0.353 -0.179 -0.166
## F2 =~
## x1 1.000 0.294 0.329
## x5 1.426 1.229 1.160 0.246 0.419 0.360
## x6 -0.997 0.882 -1.130 0.258 -0.293 -0.305
## x8 1.146 1.011 1.134 0.257 0.337 0.309
## x7 0.742 0.805 0.922 0.356 0.218 0.203
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## F1 ~~
## F2 0.012 0.035 0.334 0.738 0.094 0.094
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .x4 0.791 0.176 4.483 0.000 0.791 0.812
## .x9 0.898 0.240 3.746 0.000 0.898 0.766
## .x10 0.954 0.176 5.425 0.000 0.954 0.866
## .x7 1.084 0.174 6.223 0.000 1.084 0.938
## .x1 0.712 0.133 5.362 0.000 0.712 0.892
## .x5 1.179 0.237 4.969 0.000 1.179 0.870
## .x6 0.838 0.149 5.634 0.000 0.838 0.907
## .x8 1.078 0.193 5.594 0.000 1.078 0.905
## F1 0.183 0.158 1.154 0.249 1.000 1.000
## F2 0.086 0.101 0.860 0.390 1.000 1.000
```

Well, none of the factor loadings or factor (co)variances may be significant, but who expects significance with 100 observations? Well, the item’s residual variances were significant, but those are always signficant, because psychological test data is always very noisy (non-significant residual variances, that would be something to worry about!) The completely standardized loadings indicate that this is a very good model, because the indicators and corresponding factors all have correlations \(>.30\). Except for x7, but this item has cross loadings, so this is to be expected.