Speeding up loading of HUMAnN gene-families tables using Arrow.jl + Tables.jl

The human microbiome has on the order of 100-fold more genes than the human genome[1]. We can get a window on this enormous functional potential using shotgun metagenomic sequencing, but working with this much data using standard tools can be challenging.

I tend to use the tools from the bioBakery, such as MetaPhlAn for taxonomic profiling and HUMAnN for functional profiling. Both tools can be run on each sample's sequencing files in parallel, and both generate plain-text table files (.csv or .tsv) with one column for the taxon or function and one column for its abundance.
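For reference, each per-sample HUMAnN gene-families file is a two-column, tab-separated table that looks roughly like this (the IDs and abundances here are made up for illustration; rows containing a '|' are "stratified" per-taxon contributions, which I'll be skipping):

# Gene Family                                              sample1_Abundance-RPKs
UniRef90_A0A015L9U8                                        42.0
UniRef90_A0A015L9U8|g__Bacteroides.s__Bacteroides_dorei    30.5
UniRef90_A0A015L9U8|unclassified                           11.5
UniRef90_A0A017SNL3                                        7.3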

So here's my situation: I have about 2000 CSV files, each of which has 2 columns - call them feature and value. Each CSV represents a single biospecimen - call these samples.

Each sample has several hundred thousand features, but most features are not found in most samples. Since storing all of those zeros is a waste of memory, what I ultimately want is a sparse matrix of the values, where I know the (column) index of each sample and the (row) index of each feature.

Here's an extremely naive implementation that works, but is Slooooow:

using CSV
using DataFrames
using SparseArrays

function builddf(files)
    df = DataFrame(feature=String[], value=Float64[], sample=String[])
    for f in files
        sdf = CSV.read(f, DataFrame; header=["feature", "value"], skipto=2)
        subset!(sdf, "feature" => ByRow(feat -> !contains(feat, '|'))) # skip stratified features (rows containing '|')
        samplename = replace(splitext(basename(f))[1], r"_S\d+_genefamilies" => "")
        sdf.sample .= samplename
        append!(df, sdf)
    end
    return df
end

function matrixize(longdf)
    fs = Set(longdf.feature)
    ss = Set(longdf.sample)
    features = Dict(f => i for (i, f) in enumerate(fs)) 
    samples = Dict(s => i for (i, s) in enumerate(ss))

    m = spzeros(length(fs), length(ss))
    for row in eachrow(longdf)
        rowidx = features[row.feature]
        colidx = samples[row.sample]
        m[rowidx, colidx] = row.value
    end
    return features, samples, m
end

So basically,

  1. reading all of the files one at a time and making a long dataframe (this part is actually reasonably quick),
  2. getting a unique index for each sample and each feature, and
  3. looping through the rows and filling the sparse matrix.
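Putting it together, usage looks something like this (the directory and file suffix here are hypothetical - adjust to wherever the HUMAnN outputs live):

using CSV, DataFrames, SparseArrays

# hypothetical paths to the per-sample gene-families tables
files = filter(f -> endswith(f, "_genefamilies.tsv"),
               readdir("humann_output"; join=true))

longdf = builddf(files)                  # step (1): read everything into one long dataframe
features, samples, m = matrixize(longdf) # steps (2) and (3): index and fill the sparse matrix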

To get a sense of the numbers: on a subset of ~70 samples, the initial longdf generated by builddf(files) has ~11M rows and ~1.1M unique features.

Step (1) took ~2 sec, step (2) took ~2 sec, and step (3) took ~75 sec.

I thought maybe I could do the matrix filling in parallel (just using Threads.@threads on the loop), but that caused Julia to crash, so I guess filling the sparse matrix isn't thread-safe?
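(For what it's worth, setting individual elements of a SparseMatrixCSC shifts its internal index and value arrays around, so concurrent writes from multiple threads really aren't safe. A lock would make a threaded version of the filling loop correct - this is just a sketch, and it would mostly serialize the writes anyway, so it doesn't buy much:)

# sketch only: a thread-safe (but mostly serial) version of the filling loop in matrixize
lck = ReentrantLock()
Threads.@threads for i in 1:nrow(longdf)
    row = longdf[i, :]
    rowidx = features[row.feature]
    colidx = samples[row.sample]
    lock(lck) do
        m[rowidx, colidx] = row.value
    end
end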

My other idea was to add row / col index columns to the longdf, then use the sparse(I, J, V) syntax, where I is a vector of row indices, J is a vector of column indices, and V is a vector of the values. Since I already have the long dataframe stored essentially like this, it's pretty easy to make matrixize2(longdf), which is just

function matrixize2(longdf)
    fs = Set(longdf.feature)
    ss = Set(longdf.sample)
    features = Dict(f => i for (i, f) in enumerate(fs)) 
    samples = Dict(s => i for (i, s) in enumerate(ss))

    m = sparse(
        [features[f] for f in longdf.feature], # this is `I`
        [samples[s] for s in longdf.sample], # this is `J`
        longdf.value # this is `V`
    )

    return features, samples, m
end
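As a sanity check on the sparse(I, J, V) form, here's a toy example with made-up indices and values:

using SparseArrays

I = [1, 3, 2]          # row indices
J = [1, 1, 2]          # column indices
V = [10.0, 20.0, 30.0] # values
sparse(I, J, V)        # 3×2 sparse matrix with entries at (1,1), (3,1), and (2,2)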

matrixize2 actually works a lot better - it essentially combines steps (2) and (3) from above, avoiding the loop over each row of the long dataframe (which, for my full dataset, gets very long). This version only took ~8 sec, for a combined total of ~12 sec.

Amazing!

  1. Qin J. et al., Nature, 2010, doi: 10.1038/nature08821