Code Supplement for the Phylogenetic Placement

This document provides the code listings used for the analyses described in the supplement chapter "Phylogenetic Placement". See there for a higher level explanation of the analysis pipeline. We use terms and abbreviations from the supplement chapter throughout this document and thus assume the reader to be familiar with it. The outline of this document roughly follows the sections of the supplement.

The code provided here is intended for Linux systems using bash (e.g., Ubuntu). It needs to be adapted for other environments. Many steps were run on cluster nodes in parallel. As cluster environments differ, the code here is a simplified serial version. We however provide an outline of how to parallelize it.
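
Wherever a script below loops over the 154 samples, the loop body can instead be submitted as independent cluster jobs. A minimal sketch, assuming a SLURM-like scheduler (the sbatch call and the hypothetical some_command placeholder have to be adapted to the actual environment and pipeline step):

#!/bin/bash

# Serial version, as used in the scripts of this document:
for i in `seq 0 153`
do
    some_command sample_${i}
done

# Parallel version, submitting one cluster job per sample instead:
for i in `seq 0 153`
do
    sbatch --job-name=job_${i} --wrap "some_command sample_${i}"
done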

Author: Lucas Czech (lucas.czech@h-its.org)
Date: 2016-03-04
Revision: 2016-10-11

In case of trouble or bugs, please email to lucas.czech@h-its.org or alexandros.stamatakis@h-its.org

Disclaimer

The code in this document is provided by the authors and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the authors or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of the code, even if advised of the possibility of such damage.

Table of Contents

Input Data

The data input to the analysis pipeline consists of the following files (renamed for simplicity).

Reference data:

Sequence data:

Software

The pipeline for the data analysis uses the following programs:

To reproduce the figures from the resulting data, the following programs are needed:

genesis

For some of the downstream analyses and data handling of the placements, we used our own toolkit genesis, for which we are currently preparing a proper release. At the time of this writing, there is no comprehensive manual yet on how to use the toolkit. Thus, we give a short introduction here that should suffice to get it to work.

The source code of the version used here, genesis v0.2.0, is available on GitHub. It is written in C++11, but also offers a Python interface. For simplicity, we only use the C++ interface here. For building, a fairly modern compiler is necessary (g++ >= 4.9, clang++ >= 3.6). Furthermore, make and cmake >= 2.6 are needed as build tools. Then, calling

make

in the main directory builds the library.

In order to use custom code, particularly the programs provided later in this document, the source files have to be placed as *.cpp files in the apps directory. They are automatically compiled and turned into executables in the bin directory when the command

make update

is issued in the main directory of genesis. Those executables can then be called with the necessary command line parameters to run the programs.
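
For example, to build and run the max_weight_placement program from section 5.2 below (the paths are placeholders and need to be adapted):

cd path/to/genesis
cp path/to/max_weight_placement.cpp apps/
make update
./bin/max_weight_placement path/to/jplace_dir/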

1. Data Preparation

1.1. Reference Alignment

Although the reference alignments Euks.fasta and Apis.fasta are already aligned, some preprocessing is still necessary.

Clean the Alignments

The following script takes an alignment file in FASTA format as input and writes a new alignment file in which the sequence names are cleaned, i.e., stripped of any part that comes after the first whitespace. The output file is named like the input file, plus a suffix ".clean".

Some FASTA files contain additional information on their sequence name line, which confuses RAxML, for example

>AB000912_Tridacna_hemolymph_apicomplexan 2664 bp 2694 bp 2660 bp

This script gets rid of that and turns it into

>AB000912_Tridacna_hemolymph_apicomplexan

so that it can be used in RAxML.

#!/bin/bash 
 
ALN="path/to/alignment.fasta" # Euks.fasta or Apis.fasta 
 
cat ${ALN} | sed "s/>\([^ ]*\) \(.*\)/>\1/g" > ${ALN}.clean

This step needs to be done for both reference alignments (Euks.fasta, Apis.fasta; E and A).

Check and "Reduce" the Alignment

This script takes an alignment file in FASTA format as input and runs the RAxML check algorithm -f c on it. This produces a reduced alignment file, where sites without signal are removed.

A call to ./clean_alignment.sh might be necessary first in order to get a file that can be read by RAxML.

#!/bin/bash 
 
RAXML="path/to/raxml"
ALN="path/to/alignment.fasta.clean" # Euks.fasta or Apis.fasta (cleaned) 
 
${RAXML} -f c -m GTRGAMMA -s ${ALN} -n check_file

This step needs to be done for both reference alignments (Euks.fasta(.clean), Apis.fasta(.clean); E and A). The resulting reduced files (RAxML writes them with an additional suffix .reduced) are then fed into the tree search.

1.2. Query Sequences

The number of sequences (at least in the case of Amplicons.fasta) is too large to process the whole file at once. Thus, we split the data into smaller chunks, using the count tables Amplicons.table and OTUs.table to create one sequence file for each of the 154 sampling locations.

#!/usr/bin/python

from sets import Set
import os

# Input file names
seq_file  = "path/to/alignment.fasta" # Amplicons.fasta or OTUs.fasta
tab_file  = "path/to/alignment.table" # Amplicons.table or OTUs.table

# Output directory for the sequence files per sample
out_dir   = "out_dir"

if not os.path.exists(out_dir):
    os.makedirs(out_dir)

# Prepare a list of sequences per sample. We assume 154 samples here, as given in the table files.
sample = []
for i in range(0, 154):
    sample.append([])

# Process the table file
print "Reading table file "+tab_file
with open(tab_file, "r") as tab_in:
    for line in tab_in:
        cols = line.split("\t")

        if len(cols) != 156:
            print "Warning: Line does not contain 156 columns."
            exit()

        name = cols.pop(0)
        tot  = cols.pop()

        if len(cols) != 154:
            print "Warning: Line does not contain 154 columns."
            exit()

        s = 0
        for i in range(0, 154):
            s += int(cols[i])
            if int(cols[i]) > 0:
                sample[i].append(name)

        if s != int(tot):
            print "Warning: Sum "+str(s)+" != "+tot+"."

# Now we convert the list of sequences per sample into a set for faster lookup.
# (In our original script, we used the lists for some further checks. As this is not needed here,
# we could instead directly fill the set instead of first a list and then convert...)
sample_set = []
print "There are "+str(len(sample))+" samples:"
for i in range(0, len(sample)):
    sample_set.append(Set())
    print "    Sample "+str(i)+" has "+str(len(sample[i]))+" sequences."
    for s in sample[i]:
        sample_set[i].add(s)
    sample[i][:] = []

print "Extracted read names."

# Process the sequence file
print "Reading sequence file "+seq_file
with open(seq_file, "r") as seq_in:
    while True:
        line1    = seq_in.readline()
        sequence = seq_in.readline()
        if not sequence:
            break

        if line1[0] != ">":
            print "Sequence does not start with >, aborting."
            exit()

        name = line1[1:].split("_")[0]
        for i in range(0, 154):
            #~ print "name "+name+", i "+str(i)
            if name in sample_set[i]:
                #~ print "found "+name+" at "+str(i)
                out = open(out_dir+"/sample_"+str(i)+".fasta", "a")
                out.write(">"+name+"\n"+sequence)
                out.close()

print "Done."

This script (and small variations of it, which check different properties of the input data, like the correct number of sequences) was run for the Amplicons.fasta and OTUs.fasta files (M and O): first on all sequences, and later, for the second part of the pipeline (for the Apicomplexans, A), again on the Apicomplexan subset of the sequences. This results in a set of FASTA files (one FASTA file per sampling location) for each combination of either Eukaryotes (all sequences) or Apicomplexans (only this subset of the sequences), and either amplicons or OTUs. In total, four sets of FASTA files: E-M, E-O, A-M, A-O.

Remark: As there are amplicons and OTUs that appear in more than one sample, some of the sequences are duplicated across chunks. This means that there is an overhead for calculating those duplicated placements multiple times. For the 10,567,804 amplicons, there are 1,618,894 duplications, which means that we needed to do 15.3% more computations. For the 29,092 OTUs, there are 52,511 duplications, so the number of computations increased by a factor of 2.8 (29,092 + 52,511 = 81,603 sequences to place). As the total number of OTUs is however small enough (compared to the number of amplicons), runtimes are still short. A deduplication step is done later in the sequence extraction step.

There are other ways of splitting the data into smaller portions, for example by simply separating it into chunks of 100,000 sequences, which would avoid the duplications (see the sketch below). However, for some downstream analyses (not part of this paper), we were interested in splitting by sampling location. For other use cases, it might be helpful to split the data differently. See also the section Data Postprocessing for more information related to this.
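
Such a fixed-size chunking could look as follows; a minimal sketch, assuming two-line FASTA records (one header line followed by one sequence line), as the scripts above do:

#!/bin/bash

SEQ="path/to/sequences.fasta" # Amplicons.fasta or OTUs.fasta

# 100,000 sequences correspond to 200,000 lines.
# This creates files named chunk_00, chunk_01, etc.
split -l 200000 -d ${SEQ} chunk_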

2. Building the reference tree

2.1. Unconstrained

Given the reference alignments, we inferred reference trees. To improve the result, we ran 40 thorough searches (option -f o) with RAxML and selected the best scoring (maximum likelihood) tree.

The following script runs independent instances of RAxML with random seeds. It stores the outputs in separate directories called sXYZ, according to the seed. The instances can of course also be executed in parallel (not shown here; the call to RAxML has to be changed to an appropriate cluster job call, as outlined in the introduction).

#!/bin/bash 
 
# Settings and paths 
JOBS=40
ALN="path/to/alignment.fasta" # Euks.fasta or Apis.fasta (cleaned, reduced) 
RAXML="path/to/raxml"
 
# Run independent instances of RAxML 
for i in `seq ${JOBS}`
do
    SEED=$RANDOM
 
    while [ -d "s$SEED" ]; do
        echo "Directory s$SEED already exists, skipping..."
        SEED=$RANDOM
    done
 
    mkdir s${SEED}
 
    cd s${SEED}
    ${RAXML} -f o -p ${SEED} -m GTRGAMMA -s ${ALN} -n s${SEED} -T 12
    cd ..
done

This step is necessary for both reference alignments (Euks.fasta and Apis.fasta; E and A; respectively their cleaned and reduced versions).

Now we need to select the best tree. For this, the following script scans the previously created directories containing the RAxML results and parses their output files.

#!/bin/bash 
 
# Output file names 
LH_FILE="LH_bests"
SORTED_LH_FILE="LH_bests_sorted"
rm -f ${LH_FILE} ${SORTED_LH_FILE}
 
# Scan the directories 
for dir in `ls -d s*`
do
    # Skip plain files; we are only interested in the s* directories. 
    if [ -f "$dir" ]
    then
      continue
    fi
 
    echo $dir
    cd $dir
 
    INFO_FILE=`ls RAxML_info.*`
    TREE_FILE=`ls RAxML_bestTree.*`
 
    if [ -z "$INFO_FILE" ]
    then
      echo "Skipping $dir"
      cd ..
      continue
    fi
 
    LH_STRING="Final GAMMA-based Score of best tree "
    LH=`grep "${LH_STRING}" ${INFO_FILE} | sed s/"${LH_STRING}"//g`
    echo "$LH ${dir}/${TREE_FILE}" >> ../$LH_FILE
 
    cd ..
done
 
# Sort by likelihood, output the best tree 
sort -n -r $LH_FILE > ${SORTED_LH_FILE}
BEST_RESULT=`head -n 1 ${SORTED_LH_FILE} | tr -s " " "\n" | sed -n '2p'`
echo "Best tree: $BEST_RESULT"
cp ${BEST_RESULT} best_tree.newick

The best tree is now stored in best_tree.newick, one for the Euks and one for the Apis (EU and AU).

2.2. Constrained

The tree inference with taxonomic constraint was carried out using Sativa. In this step, Sativa infers a constrained tree. This is again the best tree from 40 maximum likelihood runs. Internally, Sativa uses RAxML for tree inference.

This step is again necessary for both the Euks and Apis reference alignment (E and A), and needs as an additional input a taxonomic constraint file. Sativa then yields a so-called refjson file, which is meant to be used for Sativa's original purpose of mislabel detection in taxonomies. However, we "misuse" Sativa and its output here, because our goal is not to check the consistency and correctness of the taxonomy. Instead, we only extract the constrained tree from the result file.

It is also possible to do this with RAxML directly. However, when designing the pipeline, we wanted to have easy access to the capabilities of Sativa. It was not needed in the end, but as the additional runtime overhead is rather small, we decided to keep it this way.

#!/bin/bash 
 
ALI="path/to/alignment.fasta" # Euks.fasta or Apis.fasta (cleaned, reduced) 
TAX="path/to/taxonomy.tax" # Euks.tax or Apis.tax 
SATIVA="path/to/sativa"
 
# Run sativa (adapt the run name given via -n accordingly for the Apis run) 
${SATIVA}/epa_trainer.py -s ${ALI} -t ${TAX} -n Euks_Constr -x ZOO -no-hmmer -N 40
 
# Extract the best tree from the result file. 
grep "\"raxmltree\":" Euks_Constr.refjson \
    | sed "s/[ ]\+\"raxmltree\"\:\ \"//g" \
    | sed "s/(r_/(/g"                     \
    | sed "s/,r_/,/g"                     \
    | sed "s/;\",[ ]*$/;/g"               \
    > best_tree.newick

The best tree is written to best_tree.newick, again one for the Euks and one for the Apis (EC and AC).

3. Aligning the query sequences to the reference

At this stage, we have created the following files:

The next step is to align those sequence files to their respective references. We use PaPaRa for this, which takes the sequences, the reference alignment, and also the reference tree as input. By taking the tree into account, PaPaRa can use the phylogenetic signal of the sequences to produce a better alignment.

As this alignment step takes all input data into account, it has to be run for all 8 analyses. Furthermore, for each of them, it is run for all of the 154 sequence files. The call to PaPaRa within the loop can be replaced by an according call to a cluster submission in order to parallelize the process (see the sketch in the introduction). This is highly recommended, as this step takes a while.

#!/bin/bash 
 
ALI="path/to/alignment.fasta" # Euks.fasta or Apis.fasta (cleaned, reduced) 
TREE="path/to/best_tree.newick" # Euks or Apis tree 
SAMPLES="path/to/sample_dir" # (Euks or Apis sequences) and (Amplicons or OTUs) 
PAPARA="path/to/papara"
 
for i in `seq 0 153`
do
    mkdir sample_${i}
    cd sample_${i}
    echo "Aligning sample ${i}..."
 
    ${PAPARA} -t ${TREE} -s ${ALI} -q ${SAMPLES}/sample_${i}.fasta -j 4 -r -n sample_${i}
 
    cd ..
done

This results in 154 alignment files (in PHYLIP format) for each of the 8 analyses. The output files are named papara_alignment.sample_i, where i is the sample number (0-153).

4. Phylogenetic Placement

The alignment files resulting from the previous step were then fed into RAxML-EPA in order to get a phylogenetic placement for all of the sequences.

For our pipeline, we further split the data into chunks of 20,000 query sequences in order to increase parallelism. This was done by creating FASTA files that each contain the reference (512 or 190 sequences; E or A) plus up to 20,000 sequences from the previously split 154 files. After running EPA, these chunks were combined again. This is possible because the query sequences are independent in EPA (there is no order dependency, and they do not change the tree). It however turned out that this was not necessary, as the runtime of EPA was short enough for our cluster (<48h) for each sample anyway. Thus, this splitting step is omitted here for simplicity; a rough sketch of it is nevertheless given below.
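
The omitted chunking step could look roughly as follows; a minimal sketch, again assuming two-line FASTA records (the paths are placeholders):

#!/bin/bash

REF="path/to/reference.fasta" # cleaned reference alignment (512 or 190 sequences)
QRY="path/to/sample_0.fasta" # one of the per-sample query files

# 20,000 sequences correspond to 40,000 lines.
split -l 40000 -d ${QRY} chunk_

# Prepend the reference sequences to each chunk.
for c in chunk_*
do
    cat ${REF} ${c} > ${c}.fasta
    rm ${c}
done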

4.1. Unconstrained

In the following script, we set the option --epa-accumulated-threshold to 0.999 (i.e., 99.9%). This option controls how many of the possible placement positions the program outputs: as many as needed for the sum of their likelihood weights to reach the threshold (the weights are sorted, so the most probable positions are output first). For example, if the weights of a query are 0.9, 0.09, 0.007, and 0.003, all four positions are output, because the first three only sum to 0.997. The default for this option is 95%. Most downstream analyses work with the most probable placement position only, hence this default would normally suffice. We however need the higher threshold for the clade label annotation.

#!/bin/bash 
 
RAXML=path/to/raxml
TREE=path/to/best_tree.newick # Euks or Apis 
 
for i in `seq 0 153`
do
    ALI=path/to/papara_alignment.sample_${i} # Amplicons or OTUs (and either Euks or Apis) 
 
    ${RAXML} -f v --epa-accumulated-threshold=0.999 -t ${TREE} -s ${ALI} -m GTRGAMMAX -n sample_${i} -T 16
done

This step is run for the Euks and Apis reference (with respective sequences; EU, AU) and for the amplicons and OTUs, for a total of 4 analyses (EU-M, EU-O, AU-M, AU-O). The output files (in jplace format) are named sample_i.jplace, where i is the sample number (0-153).

4.2. Constrained

For the constrained case, we again used Sativa. It is also possible to use RAxML directly, as shown in the previous section. However, for some downstream analyses (not part of this paper), we wanted to have the full Sativa result as well, so for completeness, we also show this script here. The result is the same, because internally, Sativa runs RAxML in the same way as shown before.

#!/bin/bash 
 
SATIVA=path/to/sativa
REF=path/to/ref.refjson # Euks or Apis 
 
for i in `seq 0 153`
do
    ALI=path/to/papara_alignment.sample_${i} # Amplicons or OTUs (and either Euks or Apis) 
 
    ${SATIVA}/epa_classifier.py -r ${REF} -q ${ALI} -n sample_${i} -x
done

This step is run for the Euks and Apis reference (with respective sequences) and for the amplicons and OTUs, for a total of 4 analyses (EC-M, EC-O, AC-M, AC-O). The output files (in jplace format) are named sample_i.jplace, where i is the sample number (0-153).

5. Data Postprocessing

5.1. Inconsistent Placements

As mentioned in the section on Data Preparation, some of the sequences occur multiple times in different samples. This also means that the placements for those sequences are calculated multiple times (once for each sample they appear in). As the EPA is deterministic, this is in principle not an issue: given identical input, all results are identical.

However, for estimating the mutation rates of the sequences (which are important for evaluating the phylogenetic likelihood function), the EPA uses the base frequencies of the nucleotides of the whole data, i.e., the reference AND the query sequences. These frequencies can thus differ slightly between samples, because different samples contain different query sequences, which might have a different distribution of nucleotides. In the current implementation, this also affects the branch lengths of the output tree, but only after the seventh significant digit, so we decided to ignore this for the trees themselves. However, this issue can lead to placing the same (duplicated) sequence on different branches in different samples.

In our data, this issue occurred in the Unconstrained Apicomplexan datasets (AU), for both the amplicons and the OTUs (AU-M and AU-O). In the former case, it affected 592 sequences (0.008% of 7,592,831 Apicomplexan amplicons); in the latter case, 7 sequences (0.05% of 13,953 Apicomplexan OTUs). It did not happen in the Constrained case (AC), nor for the Euks (EU and EC). The affected sequences "jumped" only between nearby branches, which further confirms that this is an issue related to the phylogenetic likelihood evaluation (as opposed to random or other systematic errors).

As the impact of this issue is rather small, both in terms of affected sequences and difference in placement position, we decided to ignore this and simply use the first of the resulting positions.

5.2. Restriction to Max Weight Placements

As mentioned in the section on Unconstrained Phylogenetic Placement, EPA outputs multiple possible placement positions with different likelihood weights (probabilities), which sum to 1.0 over all branches of the tree. For the purposes of abundance counting and visualization, we were only interested in the most probable position. Thus, we ran the following tool to remove all but this placement.

It is implemented in C++ using our genesis toolkit. See the introduction of this document for instructions on how to get this to work.

// file: max_weight_placement.cpp 
 
#include <string>
 
#include "placement/functions.hpp"
#include "placement/io/jplace_processor.hpp"
#include "placement/placement_map.hpp"
#include "utils/core/logging.hpp"
 
using namespace genesis;
 
int main (int argc, char** argv)
{
    // Activate Logging. 
    Logging::log_to_stdout();
 
    // Get the dir containing the jplace files. 
    if (argc != 2) {
        LOG_WARN << "Need to provide base dir containing jplace files.";
        return 1;
    }
    auto base_dir = text::trim_right( std::string( argv[1] ), "/") + "/";
    LOG_INFO << "Jplace dir: " << base_dir;
 
    // Process all files. 
    for (size_t i = 0; i < 154; i++) {
        LOG_INFO << "=====================================================";
        LOG_INFO << "Sample " << i;
 
        // Read placement file. 
        PlacementMap map;
        std::string jfile  = base_dir + "sample_" + std::to_string(i) + ".jplace";
        if (! JplaceProcessor().from_file(jfile, map) ) {
            LOG_WARN << "Could not load jplace file " << jfile;
            return 1;
        }
 
        // Output information before. 
        LOG_INFO << "Before reduction to max weight placement:";
        LOG_INFO << "Pquery count    " << map.pquery_size();
        LOG_INFO << "Placement count " << map.placement_count();
 
        // Delete all but the most probable placement and save. 
        map.restrain_to_max_weight_placements();
        JplaceProcessor().to_file(map, base_dir + "sample_" + std::to_string(i) + "_max.jplace");
 
        // Output information after. 
        LOG_INFO << "After reduction to max weight placement:";
        LOG_INFO << "Pquery count    " << map.pquery_size();
        LOG_INFO << "Placement count " << map.placement_count();
    }
}

The program takes the .jplace files (resulting from the EPA run) and outputs new .jplace files, which only contain the most probable placement position, as given by the like_weight_ratio. As a side effect, this also reduces the amount of data for downstream analyses.

To run the program, call ./bin/max_weight_placement base_dir/ from the genesis main directory. Here, base_dir is the directory in which the jplace files from the EPA run are stored. This results in files called sample_i_max.jplace, where i is the sample number (0-153).
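
For example, to process the samples of several analyses in one go (the per-analysis directory names are hypothetical and need to be adapted):

#!/bin/bash

for DIR in EU-M EU-O EC-M EC-O
do
    ./bin/max_weight_placement path/to/${DIR}/
done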

6. Extracting the Apicomplexan clade

After running the pipeline with the Eukaryotic alignment and data, we ended up having .jplace files per sample (0-153) for the Unconstrained and Constrained tree and for Amplicons and OTUs, for a total of 4 analyses (EU-M, EU-O, EC-M, EC-O).

In this section, we discuss the steps to extract those sequences (amplicons and OTUs) that were placed into the Apicomplexan clade on the Eukaryotes tree. Those sequences were then fed back into the pipeline for the remaining 4 analyses (AU-M, AU-O, AC-M, AC-O).

6.1. Preparing the Clade Annotation

The first step is to get a representation of the clades that can be read by our toolkit. For this, we used the input file Euks.tax, which was also used to create the constrained tree (EC). This file contains a taxonomic annotation for all 512 eukaryotic reference sequences. To get the clade annotation, we simply use the second level of that taxonomy.

By replacing all semicolons (;) in the tax file with spaces, the file can be loaded into a spreadsheet application (Microsoft Excel or OpenOffice Calc) as a CSV file. Then, by selecting the first column (taxon names) and the third column (which corresponds to the second taxonomy level) and copying those columns into a text file, we obtain the needed representation.
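
Alternatively, the same two columns can be extracted on the command line. A minimal sketch, assuming that the tax file contains one taxon per line and uses semicolons to separate the taxonomy levels:

#!/bin/bash

TAX="path/to/taxonomy.tax" # Euks.tax or Apis.tax

sed "s/;/ /g" ${TAX} | awk '{ print $1, $3 }' > clade_list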

The following script then creates individual files for each clade (in the directory clades), each of which contains the taxa of that clade. The script reads the text file from above (named clade_list here) as input.

#!/bin/bash 
 
rm -rf clades
mkdir clades
 
while read -r line; do
    # echo $line 
    taxon=`echo $line | cut -f 1 -d" "`
    clade=`echo $line | cut -f 2 -d" "`
 
    echo "${clade} >> ${taxon}"
    echo ${taxon} >> clades/${clade}
done < clade_list

The resulting clade files are the input for subsequent steps. They are also needed for visualizing the clade annotated trees later. Thus, this script has to be run for both taxonomic constraints (Euks.tax and Apis.tax).

6.2. Extracting Sequence Names with Threshold

In this step, we extract those sequences from the dataset (amplicons and OTUs) that were placed into the Apicomplexan clade in the previous placement step. As mentioned earlier, the EPA yields several possible placement positions for each sequence. This implies that a sequence can be placed in different clades, which can happen when the reference is sparse or when the sequences do not fit the reference well (new species, sequencing errors, chimeras, etc.). Thus, we have to filter.

We apply a 95% threshold for the clade annotation. That means, we only extract a sequence and assign a clade label to it if at least 95% of its placement probability (measured via the like_weight_ratio) falls into a single clade; all other sequences are discarded. For example, a sequence with 96% of its weight in one clade is assigned to that clade, while a sequence with 90% in one clade and 10% in another is discarded.

For the amplicons, 574,833 out of 10,567,804 sequences (5.4%) were discarded this way. For the OTUs, it was 2,211 out of 29,092 sequences (7.6%). In total, this means that by using Evolutionary Placement, we were able to keep >92% of all sequences. This is in contrast to other methods, which need the sequences to be closer to a known reference database and would thus discard much more of our data (we estimated about 75%, i.e., they would only keep 25%).

Extract Sequence Names

Extracting the sequence names using the clades is done by the following program. As we need the information of all placement weights, we use the original .jplace files here (instead of the ones restricted to the max weight placement). It is implemented in C++ using our genesis toolkit. See the introduction of this document for instructions on how to get this to work.

// file extract_sequence_names.cpp 
 
#include <algorithm>
#include <assert.h>
#include <cmath>
#include <numeric>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>
 
#include "placement/functions.hpp"
#include "placement/io/jplace_processor.hpp"
#include "placement/placement_map.hpp"
#include "tree/bipartition/bipartition_set.hpp"
#include "tree/default/functions.hpp"
#include "tree/tree.hpp"
#include "utils/core/fs.hpp"
#include "utils/text/string.hpp"
 
using namespace genesis;
 
int main (int argc, char** argv)
{
    // Activate logging. 
    Logging::log_to_stdout();
 
    // Get the dir containing the jplace files. 
    if (argc != 2) {
        LOG_WARN << "Need to provide base dir containing jplace files.";
        return 1;
    }
    auto base_dir = text::trim_right( std::string( argv[1] ), "/") + "/";
    LOG_INFO << "Jplace dir: " << base_dir;
 
    // Threshold. 
    const double precision = 0.95;
 
    // Sativa prepends a "r_" to all taxa names. We remove this. 
    const std::string taxon_prefix = "r_";
 
    // Create output directories. 
    utils::dir_create(base_dir + "clades_extr");
    utils::dir_create(base_dir + "names");
 
    // Get a list of the files containing the clades. 
    std::vector<std::string> clade_files;
    utils::dir_list_files(base_dir + "clades", clade_files);
    std::sort(clade_files.begin(), clade_files.end());
 
    // Create a list of all clades and fill each clade with its taxa. 
    std::vector<std::pair<std::string, std::vector<std::string>>> clades;
    for (auto cf : clade_files) {
        auto taxa = text::split( utils::file_read(base_dir + "clades/" + cf), "\n" );
        std::sort(taxa.begin(), taxa.end());
        clades.push_back(std::make_pair(cf, taxa));
    }
 
    // Process all samples. 
    for (size_t i = 0; i < 154; i++) {
        LOG_INFO << "============================================================";
        LOG_INFO << "Sample " << i;
 
        // Read placement file. 
        PlacementMap map;
        std::string jfile = base_dir + "samples/sample_" + std::to_string(i) + ".jplace";
        if( !JplaceProcessor().from_file(jfile, map) ) {
            LOG_WARN << "Could not read jplace file " << jfile;
            return 1;
        }
 
        // Output information. 
        LOG_INFO << "Pquery count    " << map.pquery_size();
        LOG_INFO << "Placement count " << map.placement_count();
 
        // For each clade, make a list of all edges. We use a vector to preserve order. 
        std::vector<std::pair<std::string, std::unordered_set<PlacementTree::EdgeType*>>> clade_edges;
 
        // Make a set of all edges that do not belong to any clade (basal branches). 
        // We first fill it with all edges, then remove the clade-edges later. 
        std::unordered_set<PlacementTree::EdgeType*> basal_branches;
        for (auto it = map.tree().begin_edges(); it != map.tree().end_edges(); ++it) {
            basal_branches.insert(it->get());
        }
 
        // Extract clade subtrees. 
        LOG_INFO << "Extract clade subtrees...";
        for (auto& clade : clades) {
            std::vector<PlacementTree::NodeType*> node_list;
 
            // Find the nodes that belong to the taxa of this clade. 
            for (auto taxon : clade.second) {
                PlacementTree::NodeType* node = find_node( map.tree(), taxon_prefix + taxon);
                if (node == nullptr) {
                    node = find_node( map.tree(), taxon);
                }
                if (node == nullptr) {
                    LOG_WARN << "Cannot find taxon " << taxon;
                    continue;
                }
                node_list.push_back(node);
            }
 
            // Find the edges that are part of the subtree of this clade. 
            auto bps = BipartitionSet<PlacementTree>(map.tree());
            auto smallest = bps.find_smallest_subtree (node_list);
            auto subedges = bps.get_subtree_edges(smallest->link());
 
            // Add them to the clade edges list. 
            clade_edges.push_back(std::make_pair(clade.first, subedges));
 
            // Remove edges from the non-clade edges list 
            std::vector<std::string> clades_extr;
            for (auto& e : subedges) {
                if (e->primary_node()->is_leaf()) {
                    clades_extr.push_back(e->primary_node()->data.name);
                }
                if (e->secondary_node()->is_leaf()) {
                    clades_extr.push_back(e->secondary_node()->data.name);
                }
 
                basal_branches.erase(e);
            }
 
            // Only write out inferred clades in first iteration. 
            // (as the reference tree is the same for all 154 samples, the other iterations 
            // will yield the same result, so we can skip this) 
            if (i == 0) {
                std::sort(clades_extr.begin(), clades_extr.end());
                for (auto ce : clades_extr) {
                    utils::file_append(base_dir + "clades_extr/" + clade.first, text::replace_all(ce, " ", "_") + "\n");
                }
            }
        }
        clade_edges.push_back(std::make_pair("basal_branches", basal_branches));
 
        // Normalize. 
        map.normalize_weight_ratios();
 
        // Collect the accumulated positions within the clades for each pquery. 
        for (auto& pqry : map.pqueries()) {
            std::vector<double> edge_clade_vec (clade_edges.size(), 0.0);
 
            // For each placement, find its edge and accumulate the edge's clade counter by 
            // the placements like weight ratio. 
            for (auto& place : pqry->placements) {
                bool found_edge = false;
 
                for (size_t i = 0; i < clade_edges.size(); ++i) {
                    if (clade_edges[i].second.count(place->edge) > 0) {
                        edge_clade_vec[i] += place->like_weight_ratio;
 
                        // Make sure that we do not count this placement twice. 
                        // (can only happen if clade_edges is wrong). 
                        assert(found_edge == false);
                        if (found_edge){
                            LOG_WARN << "Already found this edge!";
                        }
 
                        found_edge = true;
                    }
                }
 
                // If the placement was not found within the clades, clade_edges is wrong. 
                if (!found_edge) {
                    LOG_WARN << "Edge not found!";
                }
                assert(found_edge);
            }
 
            // Check total like weight ratio sum. If too different from 1.0, there is something 
            // wrong and we need to manually inspect this pquery. Could just be a weird result 
            // of EPA, so nothing too serious, but better make sure we check it. 
            double sum = std::accumulate (edge_clade_vec.begin(), edge_clade_vec.end(), 0.0);
            if (std::abs(sum - 1.0) > 0.01) {
                LOG_WARN << "Placement with sum " << sum;
            }
 
            // If there is a clade that has more than 95% of the placements weight ratio, 
            // this is the one we assign the pquery to. 
            assert(edge_clade_vec.size() == clade_edges.size());
            bool found_max = false;
            std::string all_line = pqry->names[0]->name;
            for (size_t i = 0; i < edge_clade_vec.size(); ++i) {
                if (edge_clade_vec[i] >= precision) {
                    assert(!found_max);
                    found_max = true;
 
                    utils::file_append(base_dir + "names/" + clade_edges[i].first, pqry->names[0]->name + "\n");
                }
 
                if (edge_clade_vec[i] > 0.0) {
                    all_line += " " + clade_edges[i].first + "(" + std::to_string(edge_clade_vec[i]) + ")";
                }
            }
 
            // If there is no sure assignment (<95%), we put the pquery in an extra list of 
            // uncertain sequences. 
            if (!found_max) {
                utils::file_append(base_dir + "names/uncertain", pqry->names[0]->name + "\n");
 
                std::string line = pqry->names[0]->name;
                for (size_t i = 0; i < edge_clade_vec.size(); ++i) {
                    if (edge_clade_vec[i] > 0.0) {
                        line += " " + clade_edges[i].first + "(" + std::to_string(edge_clade_vec[i]) + ")";
                    }
                }
            }
        }
    }
 
    LOG_INFO << "Finished.";
    return 0;
}

This program expects a base_dir directory path as input which contains the following subdirectories and files:

To run it, call

./bin/extract_sequence_names path/to/base_dir/

from the genesis main directory.

The program then creates two output directories with the following contents:

Remove Duplicates

As mentioned in the query sequence preparation, the sequences resulting from our splitting step contain duplicates. Those sequence names now occur multiple times in the extracted name lists, which we thus need to clean. This is done with the following script, which needs to be called in the base_dir.

#!/bin/bash 
 
rm -rf names_uniq
mkdir names_uniq
cd names
 
for f in `ls`
do
    cat $f | sort -u > ../names_uniq/$f
done
 
cd ..

It creates a new directory names_uniq, which contains the unique sequence names per clade.
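
At this point, the per-clade sequence counts (including the number of sequences that were discarded by the 95% threshold, which end up in the uncertain list) can be inspected, for example via

wc -l names_uniq/*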

Create New Sequence Files

In a last step, we create new FASTA files from the extracted sequence names. For this, we use the following script extract_sequences.py:

#!/usr/bin/python

from sets import Set
import os
import sys

# This script looks for files with query sequence names in names/
# and then extracts all sequences from the according fasta file
# into single fasta files for each list in names/.

# Input file names.
seq_file  = "path/to/sequences.fasta" # Amplicons.fasta or OTUs.fasta

# Output directory.
out_dir   = "seqs"
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

# Get clade name from command line.
if len(sys.argv) != 2:
    print "Expecting clade name as argument."
    exit()
clade_name = sys.argv[1]
print "Processing clade "+clade_name

# Set file names.
list_file  = "names/"+clade_name

# Prepare a set of read names from the list file.
count = 0
read_set = Set([])
print "Reading list file "+list_file
with open(list_file, "r") as list_in:
    for line in list_in:
        rn = line.rstrip()
        if rn.startswith("q_"):
            rn = rn[2:]
        read_set.add(rn)
        count += 1

print "There are", count, "sequences in", list_file

# Process the sequence file.
print "Reading sequence file "+seq_file
out = open(out_dir+"/"+clade_name+".fasta", "w")
with open(seq_file, "r") as seq_in:
    while True:
        line1    = seq_in.readline().strip()
        sequence = seq_in.readline()
        if not sequence:
            break

        if line1[0] != ">":
            print "Sequence does not start with >, aborting."
            exit()

        name = line1[1:].split("_")[0]
        if name in read_set:
            out.write(">"+name+"\n"+sequence)

out.close()
print "Done."

It needs to be run for every clade name in names (or names_uniq; it does not matter which, because the script itself deduplicates again via its set of read names). For this, call the following script from the base_dir.

#!/bin/bash 
 
mkdir seqs
for f in `ls names`
do
    python extract_sequences.py $f
done

The result is stored in a new directory called seqs. It contains FASTA files for each of the clades provided in the names directory. Each FASTA file contains those sequences from the original sequence files (Amplicons.fasta or OTUs.fasta) that were placed in the respective clade. This step needs to be run for the Euks amplicons and OTUs (E-M, E-O).

The resulting FASTA files are then the input for the second round of analyses, i.e., the 4 Apicomplexan analysis runs (AU-M, AU-O, AC-M, AC-O). The files are first used again in the Alignment step in those analyses.

7. Visualization

7.1. Clade Visualization

In a first step, we visualized the clades on the tree. This is mostly an error checking and preparation step for later visualizations. The result was also used for determining the basal branches, which are shaded gray in the clade annotated trees.

It is implemented in C++ using our genesis toolkit. See the introduction of this document for instructions on how to get this to work.

// file visualize_clades.cpp 
 
#include <string>
#include <unordered_set>
#include <vector>
 
#include "tree/bipartition/bipartition_set.hpp"
#include "tree/default/functions.hpp"
#include "tree/default/newick_processor.hpp"
#include "tree/default/tree.hpp"
#include "tree/io/newick/color_mixin.hpp"
#include "tree/io/newick/processor.hpp"
#include "tree/tree.hpp"
#include "utils/core/logging.hpp"
#include "utils/io/nexus/document.hpp"
#include "utils/io/nexus/taxa.hpp"
#include "utils/io/nexus/trees.hpp"
#include "utils/io/nexus/writer.hpp"
#include "utils/tools/color.hpp"
#include "utils/tools/color/names.hpp"
 
using namespace genesis;
 
//     write_color_tree_nexus 
 
void write_color_tree_nexus(
    DefaultTree const& tree,
    std::vector<color::Color> color_vec,
    std::string filename
) {
    typedef NewickColorMixin<DefaultTreeNewickProcessor> ColorProcessor;
 
    auto proc = ColorProcessor();
    proc.edge_colors(color_vec);
 
    std::string tree_out = proc.to_string(tree);
 
    auto doc = nexus::Document();
 
    auto taxa = make_unique<nexus::Taxa>();
    taxa->add_taxa(node_names(tree));
    doc.set_block( std::move(taxa) );
 
    auto trees = make_unique<nexus::Trees>();
    trees->add_tree( "tree1", tree_out );
    doc.set_block( std::move(trees) );
    std::ostringstream buffer;
 
    auto writer = nexus::Writer();
    writer.to_stream( doc, buffer );
    auto nexus_out = buffer.str();
 
    utils::file_write(filename, nexus_out);
}
 
//     clade_color_tree 
 
void clade_color_tree( std::string base_dir )
{
    // List of clade files. 
    std::vector<std::string> clade_files;
    utils::dir_list_files(base_dir + "clades", clade_files);
 
    // Sativa prepends a "r_" to all taxa names. We remove this. 
    std::string taxon_prefix = "r_";
 
    // Create a list of all clades and fill each clade with its taxa. 
    std::vector<std::pair<std::string, std::vector<std::string>>> clades;
    for (auto cf : clade_files) {
        auto taxa = text::split( utils::file_read(base_dir + "clades/" + cf), "\n" );
        std::sort(taxa.begin(), taxa.end());
        clades.push_back(std::make_pair(cf, taxa));
    }
 
    // Read tree file. 
    std::string tfile  = base_dir + "best_tree.newick";
    DefaultTree tree;
    if( utils::file_exists( tfile ) ) {
        DefaultTreeNewickProcessor().from_file( tfile, tree );
    } else {
        LOG_WARN << "Tree file " << tfile << " does not exists.";
        return;
    }
 
    // Remove taxon prefix from taxon names. This usually is "r_" from SATIVA runs. 
    for( auto nit = tree.begin_nodes(); nit != tree.end_nodes(); ++nit ) {
        auto& n = **nit;
 
        if( n.data.name.substr(0, taxon_prefix.size()) == taxon_prefix ) {
            n.data.name = n.data.name.substr(taxon_prefix.size());
        }
    }
 
    // Initialize color vector with pink to mark unprocessed edges 
    // (there should be none left after the next steps). 
    auto color_vec = std::vector<color::Color>( tree.edge_count(), color::Color(255,0,255) );
 
    // Make a set of all edges that do not belong to any clade. 
    // We first fill it with all edges, then remove the clade-edges later. 
    std::unordered_set<DefaultTree::EdgeType*> non_clade_edges;
    for (auto it = tree.begin_edges(); it != tree.end_edges(); ++it) {
        non_clade_edges.insert(it->get());
    }
 
    // Define a nice color scheme (based on web colors). 
    std::vector<std::string> scheme = {
        "Crimson",
        "DarkCyan",
        "DarkGoldenRod",
        "DarkGreen",
        "DarkOrchid",
        "DeepPink",
        "DodgerBlue",
        "DimGray",
        "GreenYellow",
        "Indigo",
        "MediumVioletRed",
        "MidnightBlue",
        "Olive",
        "Orange",
        "OrangeRed",
        "Peru",
        "Purple",
        "SeaGreen",
        "DeepSkyBlue",
        "RoyalBlue",
        "SlateBlue",
        "Tomato",
        "YellowGreen"
    };
 
    LOG_INFO << "Examining clades...";
    size_t clade_num = 0;
    for (auto& clade : clades) {
        std::vector<DefaultTree::NodeType*> node_list;
 
        // Find the nodes that belong to the taxa of this clade. 
        for (auto taxon : clade.second) {
            DefaultTree::NodeType* node = find_node( tree, taxon_prefix + taxon );
            if (node == nullptr) {
                node = find_node( tree, taxon);
            }
            if (node == nullptr) {
                LOG_WARN << "Couldn't find taxon " << taxon;
                continue;
            }
            node_list.push_back(node);
        }
 
        // Find the edges that are part of the subtree of this clade. 
        auto bps = BipartitionSet<DefaultTree>(tree);
        auto smallest = bps.find_smallest_subtree (node_list);
        auto subedges = bps.get_subtree_edges(smallest->link());
 
        // Color all edges that fall into this clade with one of the color scheme colors. 
        for (auto& e : subedges) {
            // Error check. 
            if( non_clade_edges.count(e) == 0 ) {
                LOG_WARN << "Edge at " << e->primary_node()->data.name
                         << e->secondary_node()->data.name << " already done...";
            }
 
            // Remove this edge from the non-clade edges list. Apply color. 
            non_clade_edges.erase(e);
            color_vec[e->index()] = color::get_named_color( scheme[clade_num] );
        }
 
        ++clade_num;
    }
 
    // Debug info. 
    LOG_INFO << "Out of clade edges: " << non_clade_edges.size();
 
    // Color all basal branches, then write the tree file to nexus format. 
    for (auto& e : non_clade_edges) {
        color_vec[e->index()] = color::Color(192,192,192);
    }
    write_color_tree_nexus(tree, color_vec,  base_dir + "clade_colors.nexus");
}
 
//     main 
 
int main( int argc, char** argv )
{
    // Activate logging. 
    Logging::log_to_stdout();
 
    if (argc != 2) {
        LOG_WARN << "Need to provide base dir.";
        return 1;
    }
 
    auto base_dir = text::trim_right( std::string( argv[1] ), "/") + "/";
    LOG_INFO << "base dir: " << base_dir;
 
    clade_color_tree( base_dir );
 
    LOG_INFO << "Finished.";
    return 0;
}

This program expects a base_dir directory path as input which contains the following subdirectories and files:

To run it, call

./bin/visualize_clades path/to/base_dir/

from the genesis main directory.

The program then creates a file clade_colors.nexus, which contains the tree in nexus format, where all branches that belong to a certain clade are colored the same (and different clades in different colors). This file can be viewed with, e.g., FigTree. The visualization serves as a check of whether the clades are correct. It is also used to determine the basal branches (which are colored gray); those are the branches that are shaded in the clade annotated tree in the main text.

This step is run for the Euks and Apis tree, and for the Constrained and Unconstrained case (EU, EC, AU, AC).

7.2. Placement Count Visualization

This is the main visualization, which results in the trees shown in the main text and supplement with branches colored in a light blue, purple and black gradient.

Create the Tree Files

In the first step, we use the EPA result (.jplace files) to create a tree with branches colored according to the placement count per branch.

// file visualize_placements.cpp 
 
#include <algorithm>
#include <assert.h>
#include <cmath>
#include <numeric>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>
 
#include "placement/functions.hpp"
#include "placement/io/edge_color.hpp"
#include "placement/io/jplace_processor.hpp"
#include "placement/io/newick_processor.hpp"
#include "placement/io/serializer.hpp"
#include "placement/placement_map.hpp"
#include "tree/bipartition/bipartition_set.hpp"
#include "tree/default/functions.hpp"
#include "tree/default/newick_processor.hpp"
#include "tree/io/newick/color_mixin.hpp"
#include "tree/io/newick/processor.hpp"
#include "tree/tree.hpp"
#include "utils/core/logging.hpp"
#include "utils/io/nexus/document.hpp"
#include "utils/io/nexus/taxa.hpp"
#include "utils/io/nexus/trees.hpp"
#include "utils/io/nexus/writer.hpp"
#include "utils/tools/color.hpp"
#include "utils/tools/color/gradient.hpp"
#include "utils/tools/color/names.hpp"
#include "utils/tools/color/operators.hpp"
 
using namespace genesis;
 
//     write_color_tree_nexus 
 
void write_color_tree_nexus(
    PlacementTree const& tree,
    std::vector<color::Color> color_vec,
    std::string filename
) {
    typedef NewickColorMixin<PlacementTreeNewickProcessor> ColorProcessor;
 
    auto proc = ColorProcessor();
    proc.enable_edge_nums(false);
    proc.edge_colors(color_vec);
 
    std::string tree_out = proc.to_string(tree);
 
    auto doc = nexus::Document();
 
    auto taxa = make_unique<nexus::Taxa>();
    taxa->add_taxa(node_names(tree));
    doc.set_block( std::move(taxa) );
 
    auto trees = make_unique<nexus::Trees>();
    trees->add_tree( "tree1", tree_out );
    doc.set_block( std::move(trees) );
    std::ostringstream buffer;
 
    auto writer = nexus::Writer();
    writer.to_stream( doc, buffer );
    auto nexus_out = buffer.str();
 
    utils::file_write(filename, nexus_out);
}
 
//     placement_count_color_tree 
 
void placement_count_color_tree( std::string base_dir )
{
    // ---------------------------------------------------- 
    //     Clade Init 
    // ---------------------------------------------------- 
 
    // List of clade files. 
    std::vector<std::string> clade_files;
    utils::dir_list_files(base_dir + "clades", clade_files);
 
    // Create a list of all clades and fill each clade with its taxa. 
    std::vector<std::pair<std::string, std::vector<std::string>>> clades;
    for (auto cf : clade_files) {
        auto taxa = text::split( utils::file_read(base_dir + "clades/" + cf), "\n" );
        std::sort(taxa.begin(), taxa.end());
        clades.push_back(std::make_pair(cf, taxa));
    }
 
    std::unordered_map<size_t, std::string> clade_num_map;
    std::unordered_map<size_t, std::string> edge_index_to_clade_map;
 
    std::unordered_map<std::string, size_t> clade_count;
    std::unordered_map<std::string, double> clade_mass;
 
    // ---------------------------------------------------- 
    //     Branch Init 
    // ---------------------------------------------------- 
 
    std::vector<int>    index_to_edgenum;
    std::vector<size_t> placement_count;
    std::vector<double> placement_mass;
 
    std::unordered_map<std::string, int> taxa_done;
    size_t taxa_inconsistent = 0;
 
    // Sativa prepends a "r_" to all taxa names. We remove this. 
    std::string taxon_prefix = "r_";
 
    PlacementTree tree0;
 
    // ---------------------------------------------------- 
    //     Iterate all Jplace files 
    // ---------------------------------------------------- 
 
    // Iterate all samples and collect placement counts. 
    size_t total_placement_count = 0;
    for (size_t i = 0; i < 154; i++) {
        // -------------------------------- 
        //     Read files 
        // -------------------------------- 
 
        // Read placement file. 
        PlacementMap map;
        std::string jfile  = base_dir + "samples/sample_" + std::to_string(i) + "_max.jplace";
        if( !JplaceProcessor().from_file(jfile, map) ) {
            LOG_ERR << "Couldn't read jplace file " << jfile;
            return;
        }
        total_placement_count += map.placement_count();
 
        auto& tree = map.tree();
 
        // Remove taxon prefix from taxon names. This usually is "r_" from SATIVA runs. 
        for( auto nit = tree.begin_nodes(); nit != tree.end_nodes(); ++nit ) {
            auto& n = **nit;
 
            if( n.data.name.substr(0, taxon_prefix.size()) == taxon_prefix ) {
                n.data.name = n.data.name.substr(taxon_prefix.size());
            }
        }
 
        // -------------------------------- 
        //     Check Properties 
        // -------------------------------- 
 
        // Init vectors in first iteration... 
        if( i == 0 ) {
            index_to_edgenum = std::vector<int>(tree.edge_count(), 0);
            placement_count  = std::vector<size_t>(tree.edge_count(), 0);
            placement_mass   = std::vector<double>(tree.edge_count(), 0.0);
 
            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                index_to_edgenum[e.index()] = e.data.edge_num;
            }
 
            tree0 = tree;
 
        // ... and check for correctness in later iterations. 
        } else {
            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                if( index_to_edgenum[e.index()] != e.data.edge_num ) {
                    LOG_ERR << "index_to_edgenum[e.index()] != e.data.edge_num : "
                            << index_to_edgenum[e.index()] << " != " << e.data.edge_num;
                    return;
                }
            }
        }
 
        // -------------------------------- 
        //     Clade Extraction 
        // -------------------------------- 
 
        // Make a set of all edges that do not belong to any clade. 
        // We first fill it with all edges, then remove the clade-edges later. 
        std::unordered_set<PlacementTree::EdgeType*> non_clade_edges;
        for (auto it = tree.begin_edges(); it != tree.end_edges(); ++it) {
            non_clade_edges.insert(it->get());
        }
 
        // Examining clades... 
        size_t clade_num = 0;
        for (auto& clade : clades) {
            std::vector<PlacementTree::NodeType*> node_list;
 
            // Find the nodes that belong to the taxa of this clade. 
            for (auto taxon : clade.second) {
                PlacementTree::NodeType* node = find_node( tree, taxon_prefix + taxon);
                if (node == nullptr) {
                    node = find_node( tree, taxon);
                }
                if (node == nullptr) {
                    LOG_DBG2 << "couldn't find taxon " << taxon;
                    continue;
                }
                node_list.push_back(node);
 
                // Check clade num consistency 
                if( clade_num_map.count(clade_num) == 0 ) {
                    if( i != 0 ) {
                        LOG_WARN << "clade " << clade.first << " not found in sample 0! (num)";
                        return;
                    }
                    clade_num_map[clade_num] = clade.first;
                } else if( clade_num_map[clade_num] != clade.first ) {
                    LOG_WARN << "clade num " << clade_num << " does not match " << clade.first;
                    return;
                }
            }
 
            // Find the edges that are part of the subtree of this clade. 
            auto bps = BipartitionSet<PlacementTree>(tree);
            auto smallest = bps.find_smallest_subtree (node_list);
            auto subedges = bps.get_subtree_edges(smallest->link());
 
            // Extract all sequences from those edges and write them to files. 
            for (auto& e : subedges) {
                // Remove this edge from the non-clade edges list 
                if( non_clade_edges.count(e) == 0 ) {
                    LOG_WARN << "edge at " << e->primary_node()->data.name
                             << e->secondary_node()->data.name << " already done...";
                }
                non_clade_edges.erase(e);
 
                // Check edge index consistency 
                if( edge_index_to_clade_map.count(e->index()) == 0 ) {
                    if( i != 0 ) {
                        LOG_WARN << "clade " << clade.first << " not found in sample 0! (edge)";
                        return;
                    }
                    edge_index_to_clade_map[e->index()] = clade.first;
                } else if( edge_index_to_clade_map[e->index()] != clade.first ) {
                    LOG_WARN << "edge with index " << e->index() << " does not match " << clade.first;
                    return;
                }
            }
 
            ++clade_num;
        }
 
        // Add remaining edges to "basal_branches" clade 
        for( auto& e : non_clade_edges ) {
            if( edge_index_to_clade_map.count(e->index()) == 0 ) {
                if( i != 0 ) {
                    LOG_WARN << "clade basal_branches not found in sample 0!";
                    return;
                }
                edge_index_to_clade_map[e->index()] = "basal_branches";
            } else if( edge_index_to_clade_map[e->index()] != "basal_branches" ) {
                LOG_WARN << "edge with index " << e->index() << " does not match basal_branches";
                return;
            }
        }
 
        // -------------------------------- 
        //     Count collection 
        // -------------------------------- 
 
        // Collect the placement counts and masses. 
        for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
            auto& e = **eit;
 
            // Add all new placement counts and masses to the counters. 
            for( auto& p : e.data.placements ) {
                if( p->pquery->name_size() != 1 ) {
                    LOG_WARN << "name size == " << p->pquery->name_size();
                    return;
                }
                auto name = p->pquery->name_at(0).name;
 
                // If the placement is new, add it. If not, check whether it is consistent. 
                if( taxa_done.count(name) == 0 ) {
                    placement_count[e.index()] += 1;
                    placement_mass[e.index()]  += p->like_weight_ratio;
 
                    taxa_done[name] = p->edge_num;
 
                    // Count clade placements. 
                    if( edge_index_to_clade_map.count(e.index()) == 0 ) {
                        LOG_WARN << "no clade for edge " << e.index();
                        return;
                    }
                    std::string clade_name = edge_index_to_clade_map[e.index()];
                    clade_count[clade_name] += 1;
                    clade_mass[clade_name]  += p->like_weight_ratio;
                } else {
                    if( taxa_done[name] != p->edge_num ) {
                        ++taxa_inconsistent;
                        LOG_WARN << "placement not consistent between samples: " << name;
                    }
                }
            }
        }
    }
 
    // ---------------------------------------------------- 
    //     Summarize Information 
    // ---------------------------------------------------- 
 
    LOG_INFO << "uniq taxa count: " << taxa_done.size();
    LOG_INFO << "inconsistent taxa: " << taxa_inconsistent;
    taxa_done.clear();
 
    // ---------------------------------------------------- 
    //     Branch counts 
    // ---------------------------------------------------- 
 
    LOG_INFO << "total_placement_count " << total_placement_count;
 
    // Write counts. 
    std::string placement_count_list;
    for( auto& pv : placement_count ) {
        placement_count_list += std::to_string(pv) + "\n";
    }
    utils::file_write(base_dir + "placement_count_list", placement_count_list);
 
    // Write masses. 
    std::string placement_mass_list;
    for( auto& pv : placement_mass ) {
        placement_mass_list += std::to_string(pv) + "\n";
    }
    utils::file_write(base_dir + "placement_mass_list", placement_mass_list);
 
    if( placement_count.size() != placement_mass.size() ) {
        LOG_ERR << "placement_count.size() != placement_mass.size() : "
                << placement_count.size() << " != " << placement_mass.size();
        return;
    }
 
    // Sum up everything 
    auto count_sum = std::accumulate(placement_count.begin(), placement_count.end(), 0);
    auto count_max = *std::max_element (placement_count.begin(), placement_count.end());
 
    auto mass_sum = std::accumulate(placement_mass.begin(), placement_mass.end(), 0.0);
    auto mass_max = *std::max_element (placement_mass.begin(), placement_mass.end());
 
    LOG_INFO << "sum count " << count_sum;
    LOG_INFO << "max count " << count_max;
 
    LOG_INFO << "sum mass " << mass_sum;
    LOG_INFO << "max mass " << mass_max;
 
    // ---------------------------------------------------- 
    //     Clade counts 
    // ---------------------------------------------------- 
 
    LOG_INFO;
    LOG_INFO << "Clade counts:";
    for( auto& cp : clade_count ) {
        LOG_INFO << cp.first << "\t" << cp.second << "\t" << ( (double)cp.second / count_sum );
    }
 
    LOG_INFO;
    LOG_INFO << "Clade masses:";
    for( auto& cp : clade_mass ) {
        LOG_INFO << cp.first << "\t" << cp.second << "\t" << ( cp.second / mass_sum );
    }
 
    // ---------------------------------------------------- 
    //     Colour branches 
    // ---------------------------------------------------- 
 
    // Create color gradient in "blue pink black". 
    auto gradient = std::map<double, color::Color>();
    gradient[ 0.0 ] = color::color_from_hex("#81bfff");
    gradient[ 0.5 ] = color::color_from_hex("#c040be");
    gradient[ 1.0 ] = color::color_from_hex("#000000");
    auto base_color = color::color_from_hex("#81bfff");
 
    // Make count color tree. 
    auto count_color_vec_lin = std::vector<color::Color>( placement_count.size(), base_color );
    auto count_color_vec_log = std::vector<color::Color>( placement_count.size(), base_color );
 
    for( size_t i = 0; i < placement_count.size(); ++i ) {
        if( placement_count[i] > 0 ) {
            double val;
            val = static_cast<double>(placement_count[i]) / count_max;
            count_color_vec_lin[i] = color::gradient(gradient, val);
 
            val = log(static_cast<double>(placement_count[i])) / log(count_max);
            count_color_vec_log[i] = color::gradient(gradient, val);
        }
    }
 
    write_color_tree_nexus(tree0, count_color_vec_lin, base_dir + "tree_count_lin.nexus");
    write_color_tree_nexus(tree0, count_color_vec_log, base_dir + "tree_count_log.nexus");
 
    // Make mass color tree. 
    auto mass_color_vec_lin = std::vector<color::Color>( placement_mass.size(), base_color );
    auto mass_color_vec_log = std::vector<color::Color>( placement_mass.size(), base_color );
 
    for( size_t i = 0; i < placement_mass.size(); ++i ) {
        if( placement_mass[i] > 0 ) {
            double val;
            val = static_cast<double>(placement_mass[i]) / mass_max;
            mass_color_vec_lin[i] = color::gradient(gradient, val);
 
            val = log(static_cast<double>(placement_mass[i])) / log(mass_max);
            mass_color_vec_log[i] = color::gradient(gradient, val);
        }
    }
 
    write_color_tree_nexus(tree0, mass_color_vec_lin, base_dir + "tree_mass_lin.nexus");
    write_color_tree_nexus(tree0, mass_color_vec_log, base_dir + "tree_mass_log.nexus");
 
    clade_count.clear();
    clade_mass.clear();
}
 
//     main 
 
int main( int argc, char** argv )
{
    // Activate logging. 
    Logging::log_to_stdout();
 
    // Get base dir. 
    if (argc != 2) {
        LOG_WARN << "Need to provide base dir.";
        return 1;
    }
    auto base_dir = text::trim_right( std::string( argv[1] ), "/") + "/";
    LOG_INFO << "base dir: " << base_dir;
 
    // Run. 
    placement_count_color_tree( base_dir );
 
    LOG_INFO << "Finished.";
    return 0;
}

This program expects a base_dir directory path as input that contains the following subdirectories and files:

To run it, call

./bin/visualize_placements path/to/base_dir/ > log.txt

from the genesis main directory.

The program output in the terminal is used for subsequent steps, so it is stored in a log file log.txt. See the next sections for more information on how this output is used.

The program also creates the following files:

  placement_count_list and placement_mass_list: the per-branch placement counts and masses, one value per line.
  tree_count_lin.nexus and tree_count_log.nexus: the reference tree with branches colored by placement count, using linear and logarithmic scaling, respectively.
  tree_mass_lin.nexus and tree_mass_log.nexus: the same trees, colored by placement mass.

The two count trees visualize the counts of the placements (their number per branch), while the mass trees show the masses of these placements (measured in like_weight_ratio). As we did not set those masses, they default to 1.0, which effectively results in identical trees for counts and masses. The distinction might however be useful for future approaches, where abundance or other per-read data could be visualized as well.

Furthermore, the two lin trees use linear scaling, while the two log trees use logarithmic scaling for determining the color per branch. The latter is more useful, as the number of placements per branch is highly unevenly distributed: many branches carry only a few placements, while a few branches accumulate most of them. With linear scaling, this would result in most branches being light blue (the color for 0.0 in our gradient), while only very few highly populated branches would turn black (the other end of the gradient). Logarithmic scaling prevents this and thus makes the in-between counts and colors visible.
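
To make the difference concrete, here is a small standalone Python sketch (not part of the original pipeline) that applies the same linear and logarithmic mappings as the count color loop of the program above to a few hypothetical branch counts:

#!/usr/bin/python

import math

# Hypothetical branch counts: many small values, one dominating branch.
counts = [1, 2, 5, 120, 2487]
count_max = max(counts)

for c in counts:
    # Same formulas as in the count color loop of the program above.
    lin = float(c) / count_max
    log_val = math.log(c) / math.log(count_max)
    print "count %4d: linear %.3f, logarithmic %.3f" % (c, lin, log_val)

With linear scaling, the four small counts are all mapped close to 0.0 and would thus receive nearly the same light blue, while logarithmic scaling spreads them over the gradient.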

For the main text, we used the tree_count_log.nexus files. The program can be run for each of the 8 analyses to obtain tree visualizations for all of them; see the sketch below.
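
As the program call is the same for each analysis, this can be scripted. The following is a minimal Python sketch; the base directory names are placeholders and need to be adapted to the actual directory layout:

#!/usr/bin/python

import subprocess

# Placeholder base directories for the 8 analyses; adapt to the actual layout.
base_dirs = [
    "path/to/E-M_Unconstr/", "path/to/E-M_Constr/",
    "path/to/E-O_Unconstr/", "path/to/E-O_Constr/",
    "path/to/A-M_Unconstr/", "path/to/A-M_Constr/",
    "path/to/A-O_Unconstr/", "path/to/A-O_Constr/",
]

for base_dir in base_dirs:
    # Keep the terminal output per analysis, as it is needed in later steps.
    name = base_dir.rstrip("/").split("/")[-1]
    with open("log_" + name + ".txt", "w") as log_file:
        subprocess.call(["./bin/visualize_placements", base_dir], stdout=log_file)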

Process the Tree Files

The previous step yields tree files in nexus format that can be read by FigTree. Furthermore, the program output itself contains information that is needed for properly visualizing the data.

One line of output is the count of "inconsistent taxa"; these are the numbers shown in section Inconsistent Placements.

Furthermore, summaries of the counts and masses of the placements on the tree are printed: sum count, max count, sum mass, and max mass show the sum and the maximum of the placement counts and masses, respectively. The max count is the value used for setting the axis values of the color scale shown next to the trees.

The workflow to create publication quality figures from the nexus files of the previous step is as follows:

  1. FigTree

    1. Load tree_count_log.nexus.
    2. Choose the Polar layout.
    3. Scale to 105 (Euks), 168 (Apis).
    4. Root tree according to outgroup.
    5. Ladderize.
    6. Save to svg file.
  2. Inkscape.

    1. Load svg file.
    2. Insert color scale.
    3. Apply minor corrections (color of the root branch, line widths, etc.).
    4. Save to svg and pdf files.

The color scale was created in Inkscape, using the same gradient used for the tree visualization. It is a linear gradient with the following stops (cf. the gradient definition in the program above):

  0.0: #81bfff (light blue)
  0.5: #c040be (pink)
  1.0: #000000 (black)

The axis markers were added by hand and had to be adjusted according to the maximum value of placement mass on the particular tree. In order to get the correct position for each axis marker, we used the following Python script:

#!/usr/bin/python 
 
import math
 
max_val = 2487
rounded = 2500
 
print "max val"max_val
 
val = math.log(rounded) / math.log(max_val)
print "%.0f" % rounded"at""%.3f" % val
 
for i in range(1, 7):
    val = math.log(math.pow(10.0, i)) / math.log(max_val)
    print (" " * (6-i)) + "%.0f" % math.pow(10.0, i), "at", "%.3f" % val

For each tree, the value max_val has to be set to the max count value that the previous step printed. The value of rounded is then set to a value slightly greater than that, which is used as the maximum value displayed on the scale. The output of this script gives the relative positions of the markers on the scale, which then have to be set by hand in Inkscape. For example, with max_val = 2487, the marker for 100 sits at log(100)/log(2487) ≈ 0.589 of the scale length. To achieve this easily, we recommend creating a gradient with a height of 100 units, so that the relative positions of the markers translate to Inkscape units in a straightforward manner. The resulting scale image can then be inserted into the tree figure (see steps above).

Furthermore, for the clade annotated trees (the ones with clade names instead of taxa names, e.g., the one in the main text), we shaded the inner (basal) branches. Those are the branches that do not belong to any clade; they were marked in gray in the clade tree (see section Clade Visualization).

8. Comparing Constrained and Unconstrained Placements

As the final step of the analysis, we compared how the placements differ when using taxonomically constrained and unconstrained reference trees. This step can be done for both the Euks and Apis trees, and for amplicons and OTUs, respectively (E-M, E-O, A-M, A-O).

As the trees for the constrained and unconstrained case differ, a straightforward comparison of the placement counts per branch is not possible. Instead, we counted how many placements fall into each clade of those trees. This yielded the table in the supplement.

To get this information, we used the following program, which is implemented in C++ using our genesis toolkit. See the introduction of this document for instructions on how to get this to work.

// file compare_constr_unconstr.cpp 
 
#include <algorithm>
#include <assert.h>
#include <cmath>
#include <numeric>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>
 
#include "placement/functions.hpp"
#include "placement/io/jplace_processor.hpp"
#include "placement/io/newick_processor.hpp"
#include "placement/io/serializer.hpp"
#include "placement/placement_map.hpp"
#include "tree/bipartition/bipartition_set.hpp"
#include "tree/default/functions.hpp"
#include "tree/tree.hpp"
#include "utils/core/fs.hpp"
#include "utils/core/logging.hpp"
#include "utils/math/matrix.hpp"
 
using namespace genesis;
 
//     compare_constr_unconstr 
 
void compare_constr_unconstr( std::string base_dir )
{
    auto samples_a = base_dir + "samples_Unconstr/";
    auto samples_b = base_dir + "samples_Constr/";
    LOG_INFO << "samples_a : " << samples_a;
    LOG_INFO << "samples_b : " << samples_b;
 
    // -------------------------------- 
    //     Clade Init 
    // -------------------------------- 
 
    // List of clade files. 
    std::vector<std::string> clade_files;
    utils::dir_list_files(base_dir + "clades", clade_files);
    std::sort(clade_files.begin(), clade_files.end());
 
    // Create a list of all clades and fill each clade with its taxa. 
    std::vector<std::pair<std::string, std::vector<std::string>>> clades;
    for (auto cf : clade_files) {
        auto taxa = text::split( utils::file_read(base_dir + "clades/" + cf), "\n" );
        std::sort(taxa.begin(), taxa.end());
        clades.push_back(std::make_pair(cf, taxa));
    }
 
    // -------------------------------- 
    //     Placement Init 
    // -------------------------------- 
 
    std::vector<int>    index_to_edgenum;
    std::unordered_map<size_t, size_t> edge_index_to_clade_num;
    std::unordered_map<std::string, int> taxa_done;
    std::unordered_map<std::string, size_t> read_to_clade_num_map;
 
    std::string taxon_prefix = "r_";
    std::string query_prefix = "q_";
    size_t total_placement_count = 0;
    size_t taxa_inconsistent = 0;
 
    // -------------------------------- 
    //     Result Matrix Init 
    // -------------------------------- 
 
    auto result_matrix = Matrix<size_t>( clades.size() + 1, clades.size() + 1, 0 );
 
    // -------------------------------------------------------- 
    //     Iterate all Jplace files in base dir A 
    // -------------------------------------------------------- 
 
    LOG_INFO << "Reading 154 samples from " << samples_a;
    for (size_t i = 0; i < 154; i++) {
 
        // -------------------------------- 
        //     Read files 
        // -------------------------------- 
 
        // Read placement file. 
        PlacementMap map;
        std::string jfile  = samples_a + "sample_" + std::to_string(i) + "_max.jplace";
        if( !JplaceProcessor().from_file(jfile, map) ) {
            LOG_ERR << "Couldn't read jplace file " << jfile;
            return;
        }
        total_placement_count += map.placement_count();
 
        auto& tree = map.tree();
 
        // Remove taxon prefix from taxon names. This usually is "r_" from SATIVA runs. 
        for( auto nit = tree.begin_nodes(); nit != tree.end_nodes(); ++nit ) {
            auto& n = **nit;
 
            if( n.data.name.substr(0, taxon_prefix.size()) == taxon_prefix ) {
                n.data.name = n.data.name.substr(taxon_prefix.size());
            }
        }
 
        // -------------------------------- 
        //     Check Tree Consistency 
        // -------------------------------- 
 
        // Init vectors in first iteration... 
        if( i == 0 ) {
            index_to_edgenum = std::vector<int>(tree.edge_count(), 0);
 
            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                index_to_edgenum[e.index()] = e.data.edge_num;
            }
 
        // ... and check for correctness in later iterations. 
        } else {
            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                if( index_to_edgenum[e.index()] != e.data.edge_num ) {
                    LOG_ERR << "index_to_edgenum[e.index()] != e.data.edge_num : "
                            << index_to_edgenum[e.index()] << " != " << e.data.edge_num;
                    return;
                }
            }
        }
 
        // -------------------------------- 
        //     Clade Extraction 
        // -------------------------------- 
 
        // Make a set of all edges that do not belong to any clade. 
        // We first fill it with all edges, then remove the clade-edges later. 
        std::unordered_set<PlacementTree::EdgeType*> non_clade_edges;
        for (auto it = tree.begin_edges(); it != tree.end_edges(); ++it) {
            non_clade_edges.insert(it->get());
        }
 
        // Examining clades... 
        for( size_t ci = 0; ci < clades.size(); ++ci ) {
            auto& clade = clades[ci];
 
            std::vector<PlacementTree::NodeType*> node_list;
 
            // Find the nodes that belong to the taxa of this clade. 
            for (auto taxon : clade.second) {
                PlacementTree::NodeType* node = find_node( tree, taxon_prefix + taxon);
                if (node == nullptr) {
                    node = find_node( tree, taxon);
                }
                if (node == nullptr) {
                    LOG_WARN << "couldn't find taxon " << taxon;
                    continue;
                }
                node_list.push_back(node);
            }
 
            // Find the edges that are part of the subtree of this clade. 
            auto bps = BipartitionSet<PlacementTree>(tree);
            auto smallest = bps.find_smallest_subtree (node_list);
            auto subedges = bps.get_subtree_edges(smallest->link());
 
            // Extract all sequences from those edges and write them to files. 
            for (auto& e : subedges) {
                // Remove this edge from the non-clade edges list 
                if( non_clade_edges.count(e) == 0 ) {
                    LOG_WARN << "edge at " << e->primary_node()->data.name
                             << e->secondary_node()->data.name << " already done...";
                }
                non_clade_edges.erase(e);
 
                // Check edge index consistency 
                if( edge_index_to_clade_num.count(e->index()) == 0 ) {
                    if( i != 0 ) {
                        LOG_WARN << "clade " << clade.first << " not found in sample 0! (edge)";
                        return;
                    }
                    edge_index_to_clade_num[e->index()] = ci;
                } else if( edge_index_to_clade_num[e->index()] != ci ) {
                    LOG_WARN << "edge with index " << e->index() << " does not match " << clade.first;
                    return;
                }
            }
        }
 
        // Add remaining edges to "basal_branches" clade 
        for( auto& e : non_clade_edges ) {
            if( edge_index_to_clade_num.count(e->index()) == 0 ) {
                if( i != 0 ) {
                    LOG_WARN << "clade basal_branches not found in sample 0!";
                    return;
                }
                edge_index_to_clade_num[e->index()] = clades.size();
            } else if( edge_index_to_clade_num[e->index()] != clades.size() ) {
                LOG_WARN << "edge with index " << e->index() << " does not match basal_branches";
                return;
            }
        }
 
        // -------------------------------- 
        //     Iterate all Placements 
        // -------------------------------- 
 
        // Collect the placement counts and masses. 
        for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
            auto& e = **eit;
 
            // Add all new placement counts and masses to the counters. 
            for( auto& p : e.data.placements ) {
                if( p->pquery->name_size() != 1 ) {
                    LOG_WARN << "name size == " << p->pquery->name_size();
                    return;
                }
                auto name = p->pquery->name_at(0).name;
                if( name.substr(0, query_prefix.size()) == query_prefix ) {
                    name = name.substr(query_prefix.size());
                }
 
                // If the placement is new, add it. If not, check whether it is consistent. 
                if( taxa_done.count(name) == 0 ) {
                    taxa_done[name] = p->edge_num;
 
                    // Find the clade num for this read and store it. 
                    if( edge_index_to_clade_num.count(e.index()) == 0 ) { 
                        LOG_WARN << "no clade for edge " << e.index();
                        return;
                    }
 
                    if( read_to_clade_num_map.count(name) != 0 ) {
                        LOG_WARN << "read " << name << " was somehow already processed...";
                    }
 
                    size_t clade_num = edge_index_to_clade_num[e.index()];
                    read_to_clade_num_map[name] = clade_num;
                } else {
                    if( taxa_done[name] != p->edge_num ) {
                        ++taxa_inconsistent;
                        LOG_WARN << "placement not consistent between samples: " << name;
                    }
                }
            }
        }
    }
 
    LOG_INFO << "total_placement_count " << total_placement_count;
    LOG_INFO << "uniq taxa count: " << taxa_done.size();
    LOG_INFO << "inconsistent taxa: " << taxa_inconsistent;
    LOG_INFO;
 
    // --------------------------------------------------------- 
    //     Iterate all Jplace files in base dir B 
    // --------------------------------------------------------- 
 
    edge_index_to_clade_num.clear();
    index_to_edgenum.clear();
    taxa_done.clear();
 
    total_placement_count = 0;
    taxa_inconsistent = 0;
 
    LOG_INFO << "Reading 154 samples from " << samples_b;
    for (size_t i = 0; i < 154; i++) {
 
        // -------------------------------- 
        //     Read files 
        // -------------------------------- 
 
        // Read placement file. 
        PlacementMap map;
        std::string jfile  = samples_b + "sample_" + std::to_string(i) + "_max.jplace";
        std::string bfile  = samples_b + "sample_" + std::to_string(i) + "_max.bplace";
 
        if( utils::file_exists( bfile ) ) {
            PlacementMapSerializer::load(bfile, map);
        } else {
            if( !JplaceProcessor().from_file(jfile, map) ) {
                LOG_ERR << "Couldn't read jplace file " << jfile;
                return;
            }
            PlacementMapSerializer::save(map, bfile);
        }
        total_placement_count += map.placement_count();
 
        auto& tree = map.tree();
 
        // Remove taxon prefix from taxon names. This usually is "r_" from SATIVA runs. 
        for( auto nit = tree.begin_nodes(); nit != tree.end_nodes(); ++nit ) {
            auto& n = **nit;
 
            if( n.data.name.substr(0, taxon_prefix.size()) == taxon_prefix ) {
                n.data.name = n.data.name.substr(taxon_prefix.size());
            }
        }
 
        // -------------------------------- 
        //     Check Tree Consistency 
        // -------------------------------- 
 
        // Init vectors in first iteration... 
        if( i == 0 ) {
            index_to_edgenum = std::vector<int>(tree.edge_count(), 0);
 
            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                index_to_edgenum[e.index()] = e.data.edge_num;
            }
 
        // ... and check for correctness in later iterations. 
        } else {
            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                if( index_to_edgenum[e.index()] != e.data.edge_num ) {
                    LOG_ERR << "index_to_edgenum[e.index()] != e.data.edge_num : "
                            << index_to_edgenum[e.index()] << " != " << e.data.edge_num;
                    return;
                }
            }
        }
 
        // -------------------------------- 
        //     Clade Extraction 
        // -------------------------------- 
 
        // Make a set of all edges that do not belong to any clade. 
        // We first fill it with all edges, then remove the clade-edges later. 
        std::unordered_set<PlacementTree::EdgeType*> non_clade_edges;
        for (auto it = tree.begin_edges(); it != tree.end_edges(); ++it) {
            non_clade_edges.insert(it->get());
        }
 
        // Examining clades... 
        for( size_t ci = 0; ci < clades.size(); ++ci ) {
            auto& clade = clades[ci];
 
            std::vector<PlacementTree::NodeType*> node_list;
 
            // Find the nodes that belong to the taxa of this clade. 
            for (auto taxon : clade.second) {
                PlacementTree::NodeType* node = find_node( tree, taxon_prefix + taxon);
                if (node == nullptr) {
                    node = find_node( tree, taxon);
                }
                if (node == nullptr) {
                    LOG_WARN << "couldn't find taxon " << taxon;
                    continue;
                }
                node_list.push_back(node);
            }
 
            // Find the edges that are part of the subtree of this clade. 
            auto bps = BipartitionSet<PlacementTree>(tree);
            auto smallest = bps.find_smallest_subtree (node_list);
            auto subedges = bps.get_subtree_edges(smallest->link());
 
            // Extract all sequences from those edges and write them to files. 
            for (auto& e : subedges) {
                // Remove this edge from the non-clade edges list 
                if( non_clade_edges.count(e) == 0 ) {
                    LOG_WARN << "edge at " << e->primary_node()->data.name
                             << e->secondary_node()->data.name << " already done...";
                }
                non_clade_edges.erase(e);
 
                // Check edge index consistency 
                if( edge_index_to_clade_num.count(e->index()) == 0 ) {
                    if( i != 0 ) {
                        LOG_WARN << "clade " << clade.first << " not found in sample 0! (edge)";
                        return;
                    }
                    edge_index_to_clade_num[e->index()] = ci;
                } else if( edge_index_to_clade_num[e->index()] != ci ) {
                    LOG_WARN << "edge with index " << e->index() << " does not match " << clade.first;
                    return;
                }
            }
        }
 
        // Add remaining edges to "basal_branches" clade 
        for( auto& e : non_clade_edges ) {
            if( edge_index_to_clade_num.count(e->index()) == 0 ) {
                if( i != 0 ) {
                    LOG_WARN << "clade basal_branches not found in sample 0!";
                    return;
                }
                edge_index_to_clade_num[e->index()] = clades.size();
            } else if( edge_index_to_clade_num[e->index()] != clades.size() ) {
                LOG_WARN << "edge with index " << e->index() << " does not match basal_branches";
                return;
            }
        }
 
        // -------------------------------- 
        //     Iterate all Placements 
        // -------------------------------- 
 
        // Collect the placement counts and masses. 
        for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
            auto& e = **eit;
 
            // Add all new placement counts and masses to the counters. 
            for( auto& p : e.data.placements ) {
                if( p->pquery->name_size() != 1 ) {
                    LOG_WARN << "name size == " << p->pquery->name_size();
                    return;
                }
                auto name = p->pquery->name_at(0).name;
                if( name.substr(0, query_prefix.size()) == query_prefix ) {
                    name = name.substr(query_prefix.size());
                }
 
                // If the placement is new, add it. If not, check whether it is consistent. 
                if( taxa_done.count(name) == 0 ) {
                    taxa_done[name] = p->edge_num;
 
                    // Find the clade num for this read and store it. 
 
                    if( edge_index_to_clade_num.count(e.index()) == 0 ) {
                        LOG_WARN << "no clade for edge " << e.index();
                        return;
                    }
 
                    if( read_to_clade_num_map.count(name) == 0 ) {
                        LOG_WARN << "read " << name << " is in B but not in A!!!";
                    }
 
                    size_t clade_num_a = read_to_clade_num_map[name];
                    size_t clade_num_b = edge_index_to_clade_num[e.index()];
 
                    ++result_matrix.at(clade_num_a, clade_num_b);
                } else {
                    if( taxa_done[name] != p->edge_num ) {
                        ++taxa_inconsistent;
                        LOG_WARN << "placement not consistent between samples: " << name;
                    }
                }
            }
        }
    }
 
    LOG_INFO << "total_placement_count " << total_placement_count;
    LOG_INFO << "uniq taxa count: " << taxa_done.size();
    LOG_INFO << "inconsistent taxa: " << taxa_inconsistent;
    LOG_INFO;
 
    std::string csv;
    for( size_t i = 0; i < clades.size(); ++i ) {
        csv += ", " + clades[i].first;
    }
    csv += ", basal_branches\n";
 
    for( size_t i = 0; i < result_matrix.rows(); ++i ) {
        if( i < clades.size() ) {
            csv += clades[i].first;
        } else {
            csv += "basal_branches";
        }
 
        for( size_t j = 0; j < result_matrix.cols(); ++j ) {
            csv += ", " + std::to_string(result_matrix(i, j));
        }
 
        csv += "\n";
    }
 
    utils::file_write(base_dir + "compare_constr_unconstr.csv", csv);
    LOG_INFO << "finished";
}
 
//     main 
 
int main( int argc, char** argv )
{
    // Activate Logging 
    Logging::log_to_stdout();
 
    // Get base dir. 
    if (argc != 2) {
        LOG_WARN << "Need to provide base dir.";
        return 1;
    }
    auto base_dir = text::trim_right( std::string( argv[1] ), "/") + "/";
    LOG_INFO << "base dir  : " << base_dir;
 
    // Run. 
    compare_constr_unconstr( base_dir );
 
    LOG_INFO << "Finished.";
    return 0;
}

This program expects a base_dir directory path as input that contains the following subdirectories and files:

  samples_Unconstr/: the jplace files of the unconstrained analysis, named sample_0_max.jplace to sample_153_max.jplace.
  samples_Constr/: the jplace files of the constrained analysis, named in the same way.
  clades/: one file per clade, listing the taxa of that clade, one name per line.

To run it, call

./bin/compare_constr_unconstr path/to/base_dir/

from the genesis main directory.

The program then creates a file compare_constr_unconstr.csv in the base_dir, which is the raw resulting table for comparison as shown in the supplement. This file can be opened with spreadsheet applications like Microsoft Excel or OpenOffice Calc.

The table answers the question: how many sequences in total were placed in clade A in the unconstrained analysis and in clade B in the constrained analysis? The table shows the unconstrained clades on the left and the constrained ones at the top. Thus, each cell shows the number of sequences that were placed in the unconstrained clade corresponding to its row and the constrained clade corresponding to its column. Example: the value x in the first column and third row means that there are x sequences that were placed in the clade of the third row in the unconstrained analysis, but in the clade of the first column in the constrained analysis.

In order to turn those absolute numbers into relative ones, all values have to be divided by the total number of sequences (which is simply the sum of all cells); a scripted version of this is sketched below. Furthermore, for better visualization, the cells can be colored according to their values using the conditional formatting mechanism of the spreadsheet application.
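
As a scripted alternative to the spreadsheet route, the following minimal Python sketch performs this normalization. It assumes the file name and the comma separated layout produced by the program above:

#!/usr/bin/python

# Read the raw comparison table written by compare_constr_unconstr.
lines = open("compare_constr_unconstr.csv").read().splitlines()
header = lines[0]
rows = [line.split(", ") for line in lines[1:]]

# The total number of sequences is the sum of all cells.
total = sum(int(v) for cells in rows for v in cells[1:])

# Print the table again, with each cell divided by the total.
print header
for cells in rows:
    rel = ["%.4f" % (float(v) / total) for v in cells[1:]]
    print ", ".join([cells[0]] + rel)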

The tables can be obtained for all analyses (E-M, E-O, A-M, A-O).

9. Confidence Analysis of the Placements Positions

As described in the Supplement Chapter "Confidence Analysis of the Placements Positions", we assessed the quality of the placement positions by calculating histograms of their likelihood weights and their expected distances. See the chapter for details.

The following code was run for E-M only, but it can easily be adapted to the other analyses as well.

Caveat: As we ran this analysis during the reviewing phase of the article, we used a later version of genesis. Please use genesis v0.12.0 to run this code.

// file placement_uncertainty.cpp 
 
#include "genesis.hpp"
 
#include <algorithm>
#include <fstream>
#include <string>
 
using namespace genesis;
using namespace genesis::placement;
 
int main( int argc, char** argv )
{
    // -------------------------------- 
    //     Init. 
    // -------------------------------- 
 
    (void) argc;
    (void) argv;
 
    utils::Logging::log_to_stdout();
    LOG_INFO << "Started " << utils::current_time();
 
    // -------------------------------- 
    //     Input and output directories. 
    // -------------------------------- 
 
    std::string indir = "path/to/samples";
    std::string outdir = "path/to/output";
 
    // -------------------------------- 
    //     Prepare Histogram Accus. 
    // -------------------------------- 
 
    utils::HistogramAccumulator accu_all;
    utils::HistogramAccumulator accu_edpl;
 
    std::vector<utils::HistogramAccumulator> accu_set;
    accu_set.resize(7);
 
    // -------------------------------- 
    //     Read all jplace files and accumulate data for the Histograms. 
    // -------------------------------- 
 
    auto files = utils::dir_list_files( indir, "sample_[0-9]+_all\\.jplace" );
    LOG_INFO << "Found " << files.size() << " files";
    std::sort( files.begin(), files.end() );
 
    for( auto file : files ) {
        LOG_DBG1 << "Sample " << file;
 
        Sample smp;
        JplaceReader().from_file( indir + "/" + file, smp );
        sort_placements_by_weight( smp );
 
        for( auto const& pquery : smp.pqueries() ) {
            for( auto const& place : pquery.placements() ) {
                accu_all.increment( place.like_weight_ratio );
            }
 
            size_t set_size = std::min( pquery.placement_size(), accu_set.size() );
            for( size_t i = 0; i < set_size; ++i ) {
                accu_set[i].increment( pquery.placement_at(i).like_weight_ratio );
            }
        }
 
        auto edpl_vec = edpl(smp);
        for( auto v : edpl_vec ) {
            accu_edpl.increment(v);
        }
    }
 
    // -------------------------------- 
    //     Basic output for consistency checking. 
    // -------------------------------- 
 
    LOG_INFO << "accu all  values " << accu_all.added_values() << " min " << accu_all.min() << " max " << accu_all.max();
    LOG_INFO << "accu edpl values " << accu_edpl.added_values() << " min " << accu_edpl.min() << " max " << accu_edpl.max();
 
    for( size_t i = 0; i < accu_set.size(); ++i ) {
        LOG_INFO << "accu " << i << " values " << accu_set[i].added_values() << " min " << accu_set[i].min() << " max " << accu_set[i].max();
    }
 
    // -------------------------------- 
    //     Build Histograms from the Accus. 
    // -------------------------------- 
 
    auto hist_all  = accu_all.build_uniform_ranges_histogram( 50, 0.0, 1.0 );
    auto hist_edpl = accu_edpl.build_uniform_ranges_histogram( 50, true );
 
    LOG_INFO << "hist all  mean " << utils::mean( hist_all );
    LOG_INFO << "hist edpl mean " << utils::mean( hist_edpl );
 
    // -------------------------------- 
    //     Write Histogram Data to files, as tab separated data. 
    // -------------------------------- 
 
    std::ofstream file_all( outdir + "/all_hist" );
    std::ofstream file_edpl( outdir + "/edpl_hist" );
 
    auto print_hist = [] ( utils::Histogram const& h, std::ostream& os ) {
        for (size_t i = 0; i < h.bins(); ++i) {
            auto range = h.bin_range(i);
            os << "[" << range.first << "" << range.second << "): \t";
            os << range.first << " to " << range.second << "\t";
            os << range.first << "\t" << range.second << "\t";
            os << h[i] << "\n";
        }
    };
 
    print_hist( hist_all,  file_all );
    print_hist( hist_edpl, file_edpl );
 
    for( size_t i = 0; i < accu_set.size(); ++i ) {
        auto hist_i  = accu_set[i].build_uniform_ranges_histogram( 50, 0.0, 1.0 );
        LOG_INFO << "hist " << i << " mean " << utils::mean( hist_i );
        std::ofstream file_i( outdir + "/hist_" + std::to_string(i) );
        print_hist( hist_i,  file_i );
    }
 
    LOG_INFO << "Finished " << utils::current_time();
    return 0;
}

For reasons of simplicity, the paths to the input and output files are hardcoded in this program. Adjust them so that path/to/samples points to the directory where the result from 4.1. Unconstrained is stored, and path/to/output to a directory for the resulting histogram files.

The program prints some useful information about the histograms, and writes files with all histogram data to the output directory, as tab separated values. A minimal sketch for plotting these files follows below.
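
To plot those files, a short script like the following sketch can be used. It assumes that matplotlib is available; the relevant columns are the third and fourth tab separated fields (bin start and end) and the fifth field (count), as written by print_hist above:

#!/usr/bin/python

import matplotlib.pyplot as plt

# Read one of the histogram files, e.g., the like_weight_ratio histogram.
starts, widths, counts = [], [], []
for line in open("path/to/output/all_hist"):
    fields = line.rstrip("\n").split("\t")
    starts.append(float(fields[2]))
    widths.append(float(fields[3]) - float(fields[2]))
    counts.append(float(fields[4]))

# Draw the bins as bars and write the plot to a file.
plt.bar(starts, counts, width=widths, align="edge")
plt.xlabel("like_weight_ratio")
plt.ylabel("count")
plt.savefig("all_hist.png")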