I've been using irlba::irlba to do a partial SVD for very large sparse one-hot encoded datasets. The advantage of irlba is that it is efficient for sparse data and allows you to specify center and scale vectors without explicitly forming the intermediate matrix, thereby preserving sparsity. base::svd can't do this.
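To make the sparsity point concrete, here is a minimal sketch of how centering and scaling can be folded into a matrix-vector product without ever building the scaled matrix. It illustrates the general identity only and is not taken from irlba's source; the names A, ctr, scl, and x are assumptions made for the example, and a small dense matrix stands in for a sparse one.

```r
# Illustrative only: A, ctr, scl and x are made-up names, not irlba internals.
A   <- as.matrix(iris[-5])
ctr <- colMeans(A)          # per-column centers
scl <- apply(A, 2, sd)      # per-column scales
x   <- c(1, -2, 0.5, 3)     # arbitrary test vector

# Explicit route: forms the dense centered and scaled matrix first.
explicit <- scale(A, center = ctr, scale = scl) %*% x

# Implicit route: A itself is never modified (so a sparse A would stay sparse).
# Scaling is applied to x, and centering collapses to a single scalar
# that is subtracted from every element of the product.
xs       <- x / scl
implicit <- A %*% xs - sum(ctr * xs)

# Both routes should agree up to floating-point tolerance.
all.equal(as.vector(explicit), as.vector(implicit))
```

Under these assumptions the two products agree, which is why I would expect passing center and scale to give the same decomposition as calling scale() first.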
The reprex() below uses the simple iris data to show that explicitly scaling with irlba(scale(N), ...) produces the correct result, while using the scale and center arguments with irlba(N, center = colMeans(N), scale = apply(N, 2, sd), ...) produces an incorrect result.
Is this a bug, or am I doing something wrong? Any help appreciated.
library(irlba)
N <- iris[-5]
str(N)
#> 'data.frame': 150 obs. of 4 variables:
#> $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
a <- irlba(scale(N), nv = 2, nu = 2)
str(a)
#> List of 5
#> $ d : num [1:2] 20.9 11.7
#> $ u : num [1:150, 1:2] -0.1082 -0.0995 -0.113 -0.1099 -0.1142 ...
#> $ v : num [1:4, 1:2] 0.521 -0.269 0.58 0.565 -0.377 ...
#> $ iter : num 0
#> $ mprod: num 0
biplot(a$u, a$v)
b <- irlba(N, center = colMeans(N), scale = apply(N, 2, sd), nv = 2, nu = 2)
str(b)
#> List of 5
#> $ d : num [1:2] 67.9 27.7
#> $ u : num [1:150, 1:2] 0.01973 -0.05465 0.06383 0.00869 0.02097 ...
#> $ v : num [1:4, 1:2] -0.7531 -0.1218 -0.6345 -0.1242 0.0144 ...
#> $ iter : num 0
#> $ mprod: num 0
biplot(b$u, b$v)
Created on 2019-01-08 by the reprex package (v0.2.1)