# Matching example data structure for mvgls in mvMORPH

Hi,

I'm using the mvMORPH package to fit a model with a phylogenetic generalized least squares. The package authors give code to simulate data but I'm having trouble making my data match the simulation data structure. Note that a phylogenetic tree is also required here but I'm omitting to make this post less messy.

``````#Making the simulation mvMORPH data
set.seed(1)
n <- 32 # number of species
p <- 50 # number of traits (p>n)
tree <- pbtree(n=n, scale=1) # phylogenetic tree
R <- crossprod(matrix(runif(p*p), ncol=p)) # a random symmetric matrix (covariance)
# simulate a BM dataset
Y <- mvSIM(tree, model="BM1", nsim=1, param=list(sigma=R, theta=rep(0,p)))
data=list(Y=Y)

#Example of my data

df1 <- data.frame(
var1 = c(1,16,0,0,0),
var2 = c(17,24,0,0,0),
var3 = c(2,1,0,39,3),
var4 = c(0,0,36,0,5),
var5 = c(3,2,0,0,16))

#Fits the desired model
data=list(df1=df1)
fit1 <- mvgls(df1~1, data=data, tre, model="BM", penalty="RidgeArch")

``````

When I attempt to run the code to fit the model I get the error message "Error in svd(B, nu = 0) : infinite or missing values in 'x'".

A quick google suggested that this is because my data should not be a data frame, but rather a matrix. The issue is that after transforming my data to a matrix I get different outputs when using str() to see if the simulation data and my data match.

``````#Simulation data
num [1:32, 1:50] 2.93 3.2 5.77 2.65 2.5 ...
- attr(*, "dimnames")=List of 2
..\$ : chr [1:32] "t17" "t18" "t5" "t11" ...
..\$ : NULL

#My data
num [1:5, 1:5] 1 16 0 0 0 17 24 0 0 0 ...
- attr(*, "dimnames")=List of 2
..\$ : NULL
..\$ : chr [1:5] "var1" "var2" "var3" "var4" ...
``````

I'm assuming that my code won't run because the NULL and character lines should be switched, but not really sure how to do this.

Any suggestions? Thanks for the help!

What is `df6`? It doesn't seem to be defined in your code.

You seem to have several code snippets and it's unclear how they are related. It would be easier to help you if you gave a single, straightforward, reproducible example.

Apologies, this should be df1! I edited the post and tried to cut down significantly and consolidate code. Please let me know if this is still confusing.

I'm still confused, here are reproducible examples based on your code in the first post, the first one runs without problem, the second one complains about the lack of rownames but runs anyway:

``````#Making the simulation mvMORPH data
set.seed(1)
n <- 32 # number of species
p <- 50 # number of traits (p>n)
tree <- phytools::pbtree(n=n, scale=1) # phylogenetic tree
R <- crossprod(matrix(runif(p*p), ncol=p)) # a random symmetric matrix (covariance)
# simulate a BM dataset
Y <- mvMORPH::mvSIM(tree, model="BM1", nsim=1, param=list(sigma=R, theta=rep(0,p)))
data=list(Y=Y)
fit_y <- mvMORPH::mvgls(Y~1, data=data, tree, model="BM", penalty="RidgeArch")
fit_y
#>
#> Call:
#> mvMORPH::mvgls(formula = Y ~ 1, data = data, tree = tree, model = "BM",
#>     penalty = "RidgeArch")
#>
#>
#> Generalized least squares fit by penalized REML
#> LOOCV of the log-restricted-likelihood: -2305.528
#>
#>
#> Parameter estimate(s):
#> Regularization parameter (gamma): 0.193
#>
#>
#> Covariance matrix of size: 50 by 50
#> for 32 observations
#>
#> Coefficients (truncated):
#>              [,1]  [,2]  [,3]  [,4]  [,5]  [,6]   [,7]  [,8]    [,9] [,10]
#> (Intercept) 1.706 1.093 3.537 1.059 2.169 1.645 -1.059 1.876 0.09703 5.584
#> Use "coef" to display all the coefficients

#Example of my data

df1 <- as.matrix(data.frame(
var1 = c(1,16,0,0,0),
var2 = c(17,24,0,0,0),
var3 = c(2,1,0,39,3),
var4 = c(0,0,36,0,5),
var5 = c(3,2,0,0,16)))

#Fits the desired model
tre <- phytools::pbtree(n=nrow(df1), scale=1) # phylogenetic tree

data=list(df1=df1)
fit1 <- mvMORPH::mvgls(df1~1, data=data, tre, model="BM", penalty="RidgeArch")
#> Warning in mvMORPH::mvgls(df1 ~ 1, data = data, tre, model = "BM", penalty = "RidgeArch"): the data has no names, order assumed to be the same as tip labels in the tree.
fit1
#>
#> Call:
#> mvMORPH::mvgls(formula = df1 ~ 1, data = data, tree = tre, model = "BM",
#>     penalty = "RidgeArch")
#>
#>
#> Generalized least squares fit by penalized REML
#> LOOCV of the log-restricted-likelihood: -85.5726
#>
#>
#> Parameter estimate(s):
#> Regularization parameter (gamma): 1
#>
#>
#> Covariance matrix of size: 5 by 5
#> for 5 observations
#>
#> Coefficients:
#>              var1    var2    var3    var4    var5
#> (Intercept)   2.582   6.227  10.600   8.881   4.624
``````

Created on 2024-01-22 with reprex v2.0.2

So the error you are getting isn't due to the lack of rownames, it appears to be due to something in the structure of the data (the fact that there are infinite or missing values in some of the intermediate steps). Can you run your code to get the error, then run `traceback()`, so we can see where exactly the error happened?

Thanks for responding.

Running fit1 <- mvgls(df6~1, data=data, tree2, model="BM", penalty="RidgeArch") and then traceback() gives:

12: stop("infinite or missing values in 'x'")
11: svd(B, nu = 0)
10: psmall.svd(m, tol)
9: fast.svd(m, tol)
8: pseudoinverse(mod_par\$X)
7: FUN(newX[, i], ...)
6: apply(brute_force, 1, .loocvPhylo, cvmethod = cvmethod, targM = target,
corrStr = corrModel, penalty = penalty, error = mserr, nobs = corrModel\$nobs)
5: which.min(apply(brute_force, 1, .loocvPhylo, cvmethod = cvmethod,
targM = target, corrStr = corrModel, penalty = penalty, error = mserr,
nobs = corrModel\$nobs))
4: `[.data.frame`(brute_force, which.min(apply(brute_force, 1, .loocvPhylo,
cvmethod = cvmethod, targM = target, corrStr = corrModel,
penalty = penalty, error = mserr, nobs = corrModel\$nobs)),
)
3: brute_force[which.min(apply(brute_force, 1, .loocvPhylo, cvmethod = cvmethod,
targM = target, corrStr = corrModel, penalty = penalty, error = mserr,
nobs = corrModel\$nobs)), ]
2: .startGuess(corrModel, cvmethod = method, mserr = mserr, target = target,
penalty = penalty, echo = echo, penalized)
1: mvgls(df6 ~ 1, data = data, tree2, model = "BM", penalty = "RidgeArch")

Sorry, I don't have the answer, I imagine there is something in the structure of the data that's a problem, but understanding what would take an understanding of the function that I don't have.

You should first make sure you visualize your data with different approaches to see if you can find anything "weird" (a classic is constant rows or columns); you could try varying the penalty, target, and method arguments to see if it's better with a slightly different approach; you could try asking the authors directly.