The human microbiome has on the order of 100-fold more genes than the human genome. Shotgun metagenomic sequencing gives us a window into this enormous functional potential, but working with this much data using standard tools can be challenging.
I tend to use the tools from the bioBakery: MetaPhlAn for taxonomic profiling and HUMAnN for functional profiling. Both of these tools can be run on the sequencing files of each sample in parallel, and they generate plain-text table files with a column for the taxon / function and a column for the abundance.
So here's my situation: I have about 2000 CSV files, each of which has 2 columns - call them `feature` and `value`. Each CSV represents a single biospecimen - call them `sample`s. Each sample has several hundred thousand features, but most features are not found in most samples.
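To make the shape of the input concrete, here's a toy version of one of these files (the feature names and abundances are made up; only the two-column layout matches the real files), read the same way as in the code below:

```julia
using CSV
using DataFrames

# Hypothetical contents of one per-sample file: a header row,
# then one (feature, abundance) pair per line.
toy = """
feature_id,abundance
UniRef90_A0A000,12.5
UniRef90_A0A001,3.0
"""

# Skip the file's own header row and name the columns ourselves.
sdf = CSV.read(IOBuffer(toy), DataFrame; header=["feature", "value"], skipto=2)
```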
Because storing so many zeros is a waste of memory, I ultimately want to end up with a sparse matrix filled with `value`s, where I know the (column) index of each `sample` and the (row) index of each `feature`.
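As a quick illustration of why sparse storage pays off here (toy dimensions and values, not my real data):

```julia
using SparseArrays

# A 1_000 × 100 matrix with only three nonzero entries: a dense version
# would store 100_000 Float64s, while the sparse version stores just the
# nonzero values plus their index bookkeeping.
m = spzeros(1_000, 100)
m[1, 1] = 12.5
m[500, 42] = 3.0
m[1_000, 100] = 7.25

nnz(m)  # number of stored (nonzero) entries
```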
Here's an extremely naive implementation that works, but is Slooooow:
```julia
using CSV
using DataFrames
using SparseArrays

function builddf(files)
    df = DataFrame(feature=String[], value=Float64[], sample=String[])
    for f in files
        sdf = CSV.read(f, DataFrame; header=["feature", "value"], skipto=2)
        subset!(sdf, "feature" => ByRow(feat -> !contains(feat, '|'))) # skip stratified features
        samplename = replace(first(splitext(basename(f))), r"_S\d+_genefamilies" => "")
        sdf.sample .= samplename
        append!(df, sdf)
    end
    return df
end

function matrixize(longdf)
    fs = Set(longdf.feature)
    ss = Set(longdf.sample)
    features = Dict(f => i for (i, f) in enumerate(fs))
    samples = Dict(s => i for (i, s) in enumerate(ss))

    m = spzeros(length(fs), length(ss))
    for row in eachrow(longdf)
        rowidx = features[row.feature]
        colidx = samples[row.sample]
        m[rowidx, colidx] = row.value
    end
    return features, samples, m
end
```
The steps:

1. Reading all of the files one at a time and making a long dataframe. This is actually reasonably quick.
2. Getting a unique index for each sample and feature.
3. Looping through the rows and filling the sparse matrix.
To get a sense of the numbers: trying this on a subset of ~70 samples, the longdf generated by `builddf(files)` has ~11M rows, with ~1.1M unique features. Step (1) took ~2 sec, step (2) took ~2 sec, and step (3) took ~75 sec.
I thought maybe I could do the matrix filling in parallel (just using `Threads.@threads` on the loop), but that caused Julia to crash, so I guess maybe filling the sparse matrix isn't thread-safe?
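For what it's worth, inserting into a `SparseMatrixCSC` shifts its internal index arrays around, so concurrent writes from multiple threads can corrupt it. One pattern that should be safe (a sketch with made-up index data, not something from my actual pipeline) is to have each iteration write to its own slot in preallocated `I`/`J`/`V` vectors, and only build the matrix once at the end:

```julia
using SparseArrays

# Hypothetical precomputed row/column indices and values,
# as if looked up from the long dataframe.
rowidx = [1, 2, 3, 1]
colidx = [1, 1, 2, 2]
vals   = [1.0, 2.0, 3.0, 4.0]

n = length(vals)
I = Vector{Int}(undef, n)
J = Vector{Int}(undef, n)
V = Vector{Float64}(undef, n)

# Each iteration writes to a distinct index k, so threads never touch
# shared mutable state -- unlike in-place insertion into the sparse matrix.
Threads.@threads for k in 1:n
    I[k] = rowidx[k]
    J[k] = colidx[k]
    V[k] = vals[k]
end

m = sparse(I, J, V)
```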
My other idea was to add row / col index columns to the longdf, then use the `sparse(I, J, V)` syntax, where `I` is a vector of row indices, `J` is a vector of column indices, and `V` is a vector of the values. Since I already have the long dataframe stored essentially like this, it's pretty easy to make `matrixize2`, which is just:
```julia
function matrixize2(longdf)
    fs = Set(longdf.feature)
    ss = Set(longdf.sample)
    features = Dict(f => i for (i, f) in enumerate(fs))
    samples = Dict(s => i for (i, s) in enumerate(ss))

    m = sparse(
        [features[f] for f in longdf.feature], # this is `I`
        [samples[s] for s in longdf.sample],   # this is `J`
        longdf.value,                          # this is `V`
    )
    return features, samples, m
end
```
This actually works a lot better - it essentially combines steps (2) and (3) from above, avoiding the loop over each row of the long dataframe, which for my full dataset gets very long. This version only took ~8 sec, for a combined total of ~12 sec.
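As a sanity check on the `sparse(I, J, V)` approach, here's the same index-lookup trick on a tiny made-up long table:

```julia
using DataFrames
using SparseArrays

# Toy long-format data: three (feature, value, sample) observations.
longdf = DataFrame(
    feature = ["geneA", "geneB", "geneA"],
    value   = [1.5, 2.0, 4.0],
    sample  = ["s1", "s1", "s2"],
)

# Assign each unique feature a row index and each unique sample a column index.
features = Dict(f => i for (i, f) in enumerate(Set(longdf.feature)))
samples  = Dict(s => i for (i, s) in enumerate(Set(longdf.sample)))

# Build the sparse matrix directly from the index/value triplets.
m = sparse(
    [features[f] for f in longdf.feature],
    [samples[s] for s in longdf.sample],
    longdf.value,
)
```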