The human microbiome has on the order of 100-fold more genes than the human genome[1]. We can get a window into this enormous functional potential using shotgun metagenomic sequencing, but working with this much data using standard tools can be challenging.
I tend to use the tools from the bioBakery, such as MetaPhlAn for taxonomic profiling and HUMAnN for functional profiling. Both of these tools can be run on the sequencing files of each sample in parallel, and they generate plain-text table files (`.csv` or `.tsv`) with a column for the taxa / function, and a column for the abundance.
So here's my situation: I have about 2000 CSV files, each of which has 2 columns - call them `feature` and `value`. Each CSV represents a single biospecimen - call them `sample`s. Each sample has several hundred thousand `feature`s, but most features are not found in most samples. Because storing so many zeros is a waste of memory, I ultimately want to end up with a sparse matrix filled with `value`s, where I know the (column) index of each `sample` and the (row) index of each `feature`.
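If you want to play along without real profiles, here's a minimal sketch that writes a handful of toy CSVs with the same shape (the directory name and UniRef-style feature names are made up for illustration):

```julia
using CSV, DataFrames, Random

# hypothetical output directory and made-up feature names, just to mimic the real files
outdir = mkpath("toy_profiles")
allfeatures = ["UniRef90_$(randstring(8))" for _ in 1:1000]

for i in 1:5
    feats = unique(rand(allfeatures, 200))  # each toy sample only has a subset of features
    df = DataFrame(feature = feats, value = rand(length(feats)))
    CSV.write(joinpath(outdir, "sample$(i)_S$(i)_genefamilies.csv"), df)
end
```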
Here's an extremely naive implementation that works but is Slooooow
using CSV
using DataFrames
using SparseArrays

function builddf(files)
    df = DataFrame(feature=String[], value=Float64[], sample=String[])
    for f in files
        sdf = CSV.read(f, DataFrame; header=["feature", "value"], skipto=2)
        # skip stratified features
        subset!(sdf, "feature" => ByRow(feat -> !contains(feat, '|')))
        # sample name is the file name, minus the HUMAnN suffix
        samplename = replace(splitext(basename(f))[1], r"_S\d+_genefamilies" => "")
        sdf.sample .= samplename
        append!(df, sdf)
    end
    return df
end

function matrixize(longdf)
    fs = Set(longdf.feature)
    ss = Set(longdf.sample)
    # assign each unique feature (row) and sample (column) an integer index
    features = Dict(f => i for (i, f) in enumerate(fs))
    samples = Dict(s => i for (i, s) in enumerate(ss))
    m = spzeros(length(fs), length(ss))
    # fill in the sparse matrix one entry at a time
    for row in eachrow(longdf)
        rowidx = features[row.feature]
        colidx = samples[row.sample]
        m[rowidx, colidx] = row.value
    end
    return features, samples, m
end
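For completeness, this is roughly how I end up calling it (the directory name here is just a stand-in for wherever the per-sample tables live):

```julia
# collect the per-sample CSVs (hypothetical directory name)
files = filter(f -> endswith(f, ".csv"), readdir("genefamilies/"; join=true))

longdf = builddf(files)
features, samples, m = matrixize(longdf)
```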
So basically, I'm:

1. reading all of the files in, one at a time, and making a long dataframe. This is actually reasonably quick
2. getting a unique index for each sample and feature
3. looping through the rows and filling the sparse matrix
To get a sense of the numbers: trying this on a subset of ~70 samples, my initial `longdf` generated by `builddf(files)` has ~11M rows, with ~1.1M unique features. Step (1) took ~2 sec, step (2) took ~2 sec, and step (3) took ~75 sec.
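(If you want rough numbers like these for yourself, wrapping each call in `@time` is enough; a minimal sketch, assuming `files` is the vector of paths from above:)

```julia
@time longdf = builddf(files)                    # step (1)
@time features, samples, m = matrixize(longdf)   # steps (2) + (3) together
```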
I thought maybe I could do the matrix filling in parallel (just using `Threads.@threads` on the loop), but that caused Julia to crash, so I guess maybe filling the sparse matrix isn't thread safe?
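(For what it's worth, one pattern that should sidestep the thread-safety question is to have each task build its own triplet vectors over a chunk of rows and only assemble the matrix at the end. A sketch is below - `matrixize_threaded` is a made-up name and I haven't benchmarked it:)

```julia
using DataFrames, SparseArrays
using Base.Threads: @spawn, nthreads

# Each task collects its own (I, J, V) triplets, so no shared state is mutated;
# sparse() then assembles the matrix once from the concatenated triplets.
function matrixize_threaded(longdf, features, samples)
    chunks = Iterators.partition(1:nrow(longdf), cld(nrow(longdf), nthreads()))
    tasks = map(chunks) do idxs
        @spawn begin
            I, J, V = Int[], Int[], Float64[]
            for i in idxs
                push!(I, features[longdf.feature[i]])
                push!(J, samples[longdf.sample[i]])
                push!(V, longdf.value[i])
            end
            (I, J, V)
        end
    end
    parts = fetch.(tasks)
    return sparse(reduce(vcat, getindex.(parts, 1)),
                  reduce(vcat, getindex.(parts, 2)),
                  reduce(vcat, getindex.(parts, 3)),
                  length(features), length(samples))
end
```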
My other idea was to add row / column index columns to the longdf, then use the `sparse(I, J, V)` syntax, where `I` is a vector of row indices, `J` is a vector of column indices, and `V` is a vector of the values.
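As a toy example of that constructor (the numbers are arbitrary; duplicate `(i, j)` pairs would be summed by default, which shouldn't matter here since each feature appears at most once per sample):

```julia
using SparseArrays

I = [1, 3, 2]         # row indices
J = [1, 1, 2]         # column indices
V = [0.5, 0.25, 1.0]  # values
sparse(I, J, V)       # 3×2 sparse matrix with exactly those three stored entries
```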
Since I already have the long dataframe stored essentially like this, it's pretty easy to make `matrixize2(longdf)`, which is just:
function matrixize2(longdf)
    fs = Set(longdf.feature)
    ss = Set(longdf.sample)
    features = Dict(f => i for (i, f) in enumerate(fs))
    samples = Dict(s => i for (i, s) in enumerate(ss))
    # comprehensions (rather than generators) so that `sparse` gets plain vectors
    m = sparse(
        [features[f] for f in longdf.feature], # this is `I`
        [samples[s] for s in longdf.sample],   # this is `J`
        longdf.value                           # this is `V`
    )
    return features, samples, m
end
This actually works a lot better - it essentially combines steps (2) and (3) from above, avoiding looping over each row of the long dataframe, which in the case of my full dataset gets very long. (I think the big win is that inserting into a `SparseMatrixCSC` one element at a time forces its internal buffers to be shifted on nearly every insertion, whereas `sparse(I, J, V)` builds the whole structure in one pass.) This version only took ~8 sec, for a combined total of ~12 sec.

Amazing!
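To close the loop, the two dictionaries returned alongside the matrix are what let me get back from names to indices; for example (with made-up feature and sample names):

```julia
features, samples, m = matrixize2(builddf(files))

# abundance of a (hypothetical) gene family in a (hypothetical) sample;
# feature/sample combinations that never appeared are just structural zeros
m[features["UniRef90_Q8A8U2"], samples["sample_001"]]
```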
- [1] Qin J. et al., Nature, 2010, doi: 10.1038/nature08821