Installing and running HybPiper to process sequence capture data

Sequence capture experiments have become very common among evolutionary biologists because of their high throughput, affordable prices, and the large amounts of data they produce. Because of their popularity, there has been high demand for approaches to process the data efficiently. One of these methods is HybPiper, a recently published pipeline designed and developed by Matthew Johnson and Norman Wickett from the Chicago Botanic Garden.

The pipeline is very thorough and comprehensive, and Matt and Norm thought of everything you would need to process your data. Starting with demultiplexed clean reads (no barcodes or adapters), the pipeline 1) sorts the reads by mapping them to target sequences, using BLASTx (protein targets) or BWA (nucleotide targets), 2) assembles contigs for each gene separately, and 3) aligns contigs to target sequences and extracts exon (coding) sequence. It also produces very informative statistics for every step, identifies non-coding flanking regions (introns), and it even warns you about putative paralogous sequences and gives you methods to help distinguish ancient from recent paralogs! What else can you ask for?

I recently used the pipeline with the latest data I received from 152 samples of Burmeistera, Centropogon, and Siphocampylus, and I could not be happier or more impressed with HybPiper. Plus the customer service is great, thanks Matt! :)

Here is my experience running HybPiper on a parallel computing cluster. You might need to modify some commands and definitely the paths to files and/or scripts!

Cleaning your data

Before running HybPiper, make sure your data are demultiplexed and clean of barcodes and adaptors. You can check if the data are clean with FastQC.

fastqc file
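To check many files at once, you can point FastQC at all of them and send the reports to their own directory (a minimal sketch; it assumes gzipped fastq files in the current directory and uses FastQC's --outdir option):

mkdir FastQC_Reports
fastqc *.fastq.gz --outdir FastQC_Reports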

If the data need to be cleaned, do it with SeqyClean or Trimmomatic.
It's a good idea to clean the data regardless, to make sure they don't contain adapters and to discard reads with low quality.

In SeqyClean

seqyclean -1 forward_read -2 reverse_read -o basename -qual

For many files in a directory (code from here, and information on how this works here):

#Directory for clean reads
mkdir Cleaned_Data_SeqyClean_May_2017

#A for loop to run the program on many read files, changing a variable mid loop to pick up the matching reverse read.
for f1 in *_R1.fastq.gz
do
# The line below changes R1 for R2. The way it is doing it is by stripping part of the name
# stored in the variable (i.e., R1.fastq.gz) and replacing it by something else (i.e., R2.fastq.gz)
f2=${f1%%R1.fastq.gz}"R2.fastq.gz"
# This can also be done with
# f2=${f1/R1.fastq.gz/}"R2.fastq.gz"

seqyclean -1 $f1 -2 $f2 -o ./Cleaned_Data_SeqyClean_May_2017/$f1"_Cleaned" -qual
done


#Dealing with the additional files SeqyClean produces

mkdir Reports
cd Cleaned_Data_SeqyClean_May_2017
mv *.txt *.tsv ../Reports
cd ../Reports
#appends all summary files into one
cat *.tsv >> All_Stats_temp.tsv
#removes duplicated headers
awk '!seen[$0]++' All_Stats_temp.tsv > All_Stats.tsv
rm *temp*
sed -i '/^$/d' All_Stats.tsv

#appends all report files into one
cat *.txt >> All_Reports.txt

In Trimmomatic

java -jar path-to-program/trimmomatic-0.36.jar PE -phred33 Forward_reads Reverse_reads output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:path-to-adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:10:20 MINLEN:40 [other options]

For many files in a directory

for f1 in RAPiD*R1_001.fastq.gz
do
f2=${f1%%R1_001.fastq.gz}"R2_001.fastq.gz"
java -jar /Applications/Trimmomatic-0.36/trimmomatic-0.36.jar PE -phred33 $f1 $f2 $f1"_output"_forward_paired.fq.gz $f1"_output"_forward_unpaired.fq.gz $f1"_output"_reverse_paired.fq.gz $f1"_output"_reverse_unpaired.fq.gz ILLUMINACLIP:/Applications/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:10:20 MINLEN:40
done

I went with SeqyClean because it produces only three files, PE1, PE2, and SE, which are accepted by HybPiper.

Renaming the reads

My raw reads had some very long and uninformative names. I chose to rename the files to match the sample name (species), plus some text telling me that they have been cleaned and whether it is the forward (PE1), reverse (PE2), or single (SE) read.

We made a "sample sheet" that we sent to RapidGenomics with the names of each sample and they appended the corresponding filename for each sample. The spreadsheet looked like this:

Filename Species
filename1 species 1
filename2 species 2

Because I know that SeqyClean will produce three files (PE1, PE2, and SE), I needed to triplicate each row. This was easily done in TextWrangler, searching for each row (^(.+)) and replacing it with itself three times (\1\r\1\r\1). Once I had this file with three identical lines per sample, I added PE1, PE2, or SE to each of the three lines, plus the file extension.

Filename Species
filename1 species1_PE1.fastq
filename1 species1_PE2.fastq
filename1 species1_SE.fastq
filename2 species2_PE1.fastq
filename2 species2_PE2.fastq
filename2 species2_SE.fastq

Finally, I added the command to move (mv) to the beginning of each line and ran it on the terminal:

mv filename1 species1_PE1.fastq
mv filename1 species1_PE2.fastq
mv filename1 species1_SE.fastq
mv filename2 species2_PE1.fastq
mv filename2 species2_PE2.fastq
mv filename2 species2_SE.fastq

Most of this was done with search and replace in TextWrangler.
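If you'd rather not build the mv commands by hand, something like the awk sketch below can generate them from the sample sheet. This is just an illustration: it assumes a tab-separated sheet called Sample_Sheet.tsv (hypothetical name) with the cleaned-file prefix in column 1 and the species name in column 2, and that your cleaned files follow the prefix_PE1.fastq / prefix_PE2.fastq / prefix_SE.fastq naming.

#Generate three mv commands (PE1, PE2, SE) per sample, skipping the header line
awk -F'\t' 'NR > 1 {
    split("PE1 PE2 SE", sfx, " ")
    for (s = 1; s <= 3; s++)
        printf "mv %s_%s.fastq %s_%s.fastq\n", $1, sfx[s], $2, sfx[s]
}' Sample_Sheet.tsv > rename_commands.sh

#Inspect rename_commands.sh, then run it
bash rename_commands.sh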

Running HybPiper

HybPiper requires a few programs to run. The easiest way to get everything installed, at least for Macs, is with Homebrew.

Install Homebrew if it's not on your computer yet; it's the easiest way to deal with convoluted installations.

On the terminal:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Install programs and dependencies

brew tap homebrew/science
brew install exonerate
brew install bwa
brew install samtools
brew install spades
brew install blast
brew install parallel
brew install gcc

The pipeline runs on Python 2.7 or later (already available on Macs), but it also requires Biopython. The easiest way to get Biopython installed on your machine, both PCs and Macs, is with Anaconda.
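With Anaconda in place, installing Biopython is usually a one-liner (either of these should work; pip also does the job if you prefer it):

conda install biopython
#or
pip install biopython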

Once you have all the programs installed via Homebrew and Biopython installed via Anaconda, download and install HybPiper either by downloading the zip file from GitHub or from the command line with

git clone https://github.com/mossmatters/HybPiper.git

If you have any problems, there are very detailed instructions on how to install HybPiper and all its dependencies here.

Once everything is downloaded and installed, check that everything is working properly!

python path_to_HybPiper-master/reads_first.py --check

Additional suggested installs

There are two programs that I cannot live without while working with DNA sequences. Both of them deal with format conversion (e.g., fasta to nexus) or concatenation very smoothly. These programs are:

Download and install Phyutility to do file format conversion.

Download and install NCLconverter for further file format conversion.

Get your data ready for HybPiper

The Target File

You need a "Target File" containing the genes used to designed the capture probes. If there are multiple exons per gene, the exons need to be concatenated.
Our genes come from three different analyses and have different names. The first thing I did was to standardize the names within each probe analysis. For example, a gene containing multiple exons called Hydra\_AT1G07970\_exon1, Hydra\_AT1G07970\_exon2, Hydra\_AT1G07970\_exon3 was renamed to just Hydra\_AT1G07970. This way I was able to concatenate the multiple exons easier. Because my sequences for the genes were all in a single fasta file, I renamed them in TextWrangler using search and replace. Also, make sure that there are no empty lines between the multiples genes.
For my example:

Search: _exon.+
Replace: "nothing"
Search: ^\s+
Replace: "nothing"

With those changes, I was able to split the large file with all the fasta sequences into multiple files containing a single fasta sequence each. This was necessary to concatenate the multiple exons per gene into a single continuous sequence. I did this using split, with the option -l 2 to include two lines per file and -a 15 to increase the suffix length for the new files created, i.e., this appends a unique letter combination to each file name.

mkdir Single_Fasta
split -l 2 -a 15 Final_Probes_Muchhala_et_al_Named_Sets.fasta ./Single_Fasta/gene
cd Single_Fasta

The next step is to concatenate the different exons for each gene. I'm sure there are more elegant ways to do this, but I used Phyutility to treat each new fasta file as a gene and to make a single large NEXUS file. However, after over four hours this process hadn't finished, so I canceled it.

**This didn't finish:**
phyutility -concat -in gene* -out All_Genes_Concatenated.nex

I proceeded to break the concatenation step into smaller batches of files. In my case there are 2872 files, so to work with one fourth of them (718 files) at a time instead of all of them I did:

#How many batches (directories) do you want to work with?
DIRECTORIES=4

#Counts number of files in a directory and stores it in a variable
NUMFILES=$(ls | wc -l)
#Divides the file count by the number of batches in `DIRECTORIES`. It stores this number with its decimals (files/$DIRECTORIES) in a variable
NUMFILESDIR=$(echo "$NUMFILES/$DIRECTORIES"|bc -l)
#Same division, but stored with NO decimals (integer division), to know how many files to move per batch
NUMFILESDIR_NO_FLOAT=$(($NUMFILES / $DIRECTORIES))
#Informs how many files need to go to each directory
echo "<< The number of files per directory is ***$NUMFILESDIR*** >>"

eval mkdir {01..$DIRECTORIES} Used_Gene_Files

# In a case working with only 4 batches!
#Moves the right number of files into each directory (see below for a caveat)
for file in $(ls -p | grep -v / | tail -$NUMFILESDIR_NO_FLOAT); do mv $file ./01; done
for file in $(ls -p | grep -v / | tail -$NUMFILESDIR_NO_FLOAT); do mv $file ./02; done
for file in $(ls -p | grep -v / | tail -$NUMFILESDIR_NO_FLOAT); do mv $file ./03; done
for file in $(ls -p | grep -v / | tail -$NUMFILESDIR_NO_FLOAT); do mv $file ./04; done

IMPORTANT
If the total number of files divides evenly by the number of directories, giving you a whole number (see the prompt on the terminal "The number of files per directory is X"), you don't have to do anything else. If NOT, only a portion of the files has been moved to the directories you just created, because you can only move files in whole numbers! You'll need to move the remaining files into directory 01 before continuing with the concatenation commands below, which should take 10 to 15 minutes to run.
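If you do end up with leftovers, a one-liner like this can sweep them into the first batch (a sketch; it assumes the single-gene files all still start with "gene", as created by split above):

find . -maxdepth 1 -type f -name 'gene*' -exec mv {} ./01/ \;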

cd 01; phyutility -concat -in gene* -out temp1.nex; mv gene* ../Used_Gene_Files; mv temp* ..; cd ../02
phyutility -concat -in gene* -out temp2.nex; mv gene* ../Used_Gene_Files; mv temp* ..; cd ../03
phyutility -concat -in gene* -out temp3.nex; mv gene* ../Used_Gene_Files; mv temp* ..; cd ../04
phyutility -concat -in gene* -out temp4.nex; mv gene* ../Used_Gene_Files; mv temp* ..; cd ..
rm -r 01 02 03 04

Now that you have your sequences in multiple (4 in this case) NEXUS files, it's time to concatenate them into a single one to make sure that you don't have alleles from the same locus as different sequences.

These files are big (~130-200 sequences and ~50Mb) and concatenating them using Phyutility takes way too long! To solve this, I tried two approaches. The first one was to import the NEXUS files into Geneious and concatenate them there. This was easy and took only a minute. I then exported the concatenated file in NEXUS and fasta format.

An alternative approach for non-Geneious users can be done in Biopython with the code below, which I found here. I wrote the code in a file and executed it from the command line: python Name_of_file.py

from Bio.Nexus import Nexus

file_list = ['temp1.nex', 'temp2.nex', 'temp3.nex', 'temp4.nex']
nexi =  [(fname, Nexus.Nexus(fname)) for fname in file_list]

combined = Nexus.combine(nexi)
combined.write_nexus_data(filename=open('All_Sequences_Concatenated.nex', 'w'))

After you have a single (or multiple) Nexus file with all the exons concatenated

These two methods create a matrix as a NEXUS file (~1.1Gb!) where every sequence has the same length and, thus, a dash "-" is added to sites/regions that are missing (i.e., they are treated as missing data in a phylogenetic analysis). The file has to be stripped of those dashes and modified to keep only the fasta information. First, convert the NEXUS file to a fasta file using NCLconverter; you can skip this step if you exported directly to fasta from Geneious. The code below will also compress the NEXUS file (from 1Gb to 1Mb) and delete the original file. Don't leave huge files lying around on your computer, they eat up disk space quickly!

for i in *.nex; do NCLconverter $i -efasta -o$i; done
for i in *nex; do tar -czf $i.tar.gz $i; rm $i; done

Now that you have a fasta, get rid of the dashes (or question marks, depending on the conversion method used) in the sequences, replace any spaces (" ") that NCLconverter added to the species names with underscores ("_"), and delete any empty lines. This will reduce the file size from ~1Gb to ~1Mb! The code below will overwrite the original file. If you don't want this, remove the "-i" option and specify an output file.

for i in *.fasta; do sed -i "" -e 's/-//g' -e 's/?//g' -e 's/ /_/g' -e '/^$/d' $i ; done

Totally Optional

These two methods produce interleaved fasta files, which work in every program downstream but which I don't like. If you, like me, want to change these files to non-interleaved (one line per sequence) files, use the code below:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file_in.fasta > file_out.fasta

OK, we are done. Now rename your concatenated, processed fasta file to Targets_Final.fasta. This file now contains all of your genes, with the exons of each gene concatenated into a single sequence. This is the file that you'll use as the "Target File" (targets.fasta in the tutorial) in HybPiper. The last step is to make sure it matches the format that HybPiper requires, which is: >SpeciesName-GeneName. It shouldn't have any spaces and, apparently, no underscores. This works >Text-GeneSomeNumber but this doesn't >Text_Second_Text_Numbers. Do this with a few search and replace steps in TextWrangler or another text editor.
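A quick sanity check before moving on (a minimal sketch; it simply lists any headers that still contain a space or an underscore, so no output means the names are probably fine):

grep '^>' Targets_Final.fasta | grep -E '[ _]' || echo "All headers look OK"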

Name List

In order to run HybPiper on many samples with a single command, you need a file with the name of every sample (one per line). This was easy because this information is already in the Sample Sheet file we sent to RapidGenomics. Simply copy the column with the sample names (just one per sample) and paste it into a text file. Call the file Name_List.txt. In my case, because I have three clean read files per sample, the sample names include a _Cleaned_ at the end. This allows me to call either the paired or the unpaired reads with the correct command in HybPiper (more on this later).
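If your sample sheet is a simple two-column table, you can also pull the list out on the command line (a sketch; it assumes a tab-separated file called Sample_Sheet.tsv, a hypothetical name, with the sample names in the second column and a header line to skip):

tail -n +2 Sample_Sheet.tsv | cut -f 2 | sort -u > Name_List.txt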

Running HybPiper

The tutorial is very thorough and all the details can be found there. The only thing worth mentioning is that if you are using three files for your reads (PE1, PE2, and SE [single reads without a sister]), you have to use the option --unpaired File_with_SE.

So the command for running HybPiper would look something like this:

/path_to_HybPiper_directory/reads_first.py -b /path_to_Targets_Final.fasta -r /path_to_Paired_reads*.fastq --prefix SampleName --bwa --unpaired /path_to_Unpaired_reads.fastq

(--cpu N limits the number of processors; the default is all of them.)

Running multiple samples with one command

To run multiple samples with a single command, you'll need the Name_List.txt and the following command:

while read name;
do /path_to_HybPiper_directory/reads_first.py -b /path_to_Targets_Final.fasta -r ${name}P*.fastq --prefix $name --bwa --unpaired ${name}S*.fastq
done < Name_List.txt

The code above reads the Name_List.txt one line at a time and uses each line to start the run for that sample; each line is read and stored in the variable name. That's why it was important for me to add the _Cleaned_ to the names: the variable name becomes Species_1_Cleaned_ and I can append either P for the paired reads or S for the unpaired reads. The only problem/limitation with this code is that it runs sequentially and not in parallel.
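One way around that limitation, if you are on a single machine with several cores, is GNU parallel (installed earlier via Homebrew). This is a sketch; adjust -j (number of simultaneous samples) and --cpu so you don't oversubscribe your processors, and fix the paths as above:

parallel -j 4 "/path_to_HybPiper_directory/reads_first.py -b /path_to_Targets_Final.fasta -r {}P*.fastq --prefix {} --bwa --unpaired {}S*.fastq --cpu 2" :::: Name_List.txt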

To run multiple samples in parallel in a cluster using SLURM

This code will run an array of 152 jobs on the cluster. The indices of the array are used to read one line of the Name_List.txt at a time and submit a job for that sample/line. It will submit jobs in batches of 30, using 8 processors for each job. Invaluable information on reading the file names with the array indices came from here.

IMPORTANT

If you run the code below on a cluster with parallel computing, you might experience some problems if the cluster can't deal with the large number of files that are read and written. This will result in the recovery of very, very few loci. For example, I was recovering only 12 loci out of 745 possible genes for a specific sample after running it on the cluster. I ran the same sample on my laptop and recovered 675! It's worth testing one or a few samples on a standalone computer and then moving to the cluster.

#!/bin/bash

#This will run an array of 152 jobs. The indices of the array are used to read one line of the Name_List.txt at a time and submit a job for that sample.

#SBATCH -J HybPiper
#SBATCH -o info.output.o%j
#SBATCH -e info.error.e%j
#SBATCH --array=1-152%30


#Set the long queue
#SBATCH -A long

#Set amount of time to run
#SBATCH -t 7-00:00:00

#Set the number of nodes
#SBATCH -N 1

# Set the number of tasks/cpus
#SBATCH -n 8

module load hybpiper
module load openmpi

# Run the MPI program

#The option below doesn't work
#name=$(tail -n $PBS_ARRAYID Name_File.txt | head -1)

#Use the array's index to read the file with names into the $name variable
name=$(sed -n "$SLURM_ARRAY_TASK_ID"p Name_List.txt)

#Run HybPiper
reads_first.py -b Targets_Final.fasta -r ${name}P*.fastq --prefix $name --bwa --unpaired ${name}S*.fastq --cpu 8

Get the length of your sequences with this code on the cluster

#!/bin/bash

#SBATCH -J HybPiper
#SBATCH -o info.output.Visualize
#SBATCH -e info.error.Visualize

#Set the long queue
#SBATCH -A long

#Set amount of time to run
#SBATCH -t 7-00:00:00

#Set the number of nodes
#SBATCH -N 1

# Set the number of tasks/cpus
#SBATCH -n 8

module load hybpiper
module load openmpi
module load python

# Run the MPI program

#Visualize results
#Get sequence length to visualize HybPiper's success at recovering genes at each locus, across all the samples.
python /opt/modules/biology/hybpiper/1.2/bin/get_seq_lengths.py Targets_Final.fasta Name_List.txt dna > Seq_Length.txt

Compute statistics about your experiment

#!/bin/bash

#SBATCH -J HybPiper
#SBATCH -o info.output.Stats
#SBATCH -e info.error.Stats

#Set the long queue
#SBATCH -A long

#Set amount of time to run
#SBATCH -t 7-00:00:00

#Set the number of nodes
#SBATCH -N 1

# Set the number of tasks/cpus
#SBATCH -n 8

module load hybpiper
module load openmpi
module load python

# Run the MPI program

#Produce Stats
#This script will summarize target enrichment and gene recovery efficiency for a set of samples.
python /opt/modules/biology/hybpiper/1.2/bin/hybpiper_stats.py Seq_Length.txt Name_List_Taxa_without_Genes_Deleted.txt > Stats.txt

Teasing introns and exons apart and getting their sequences

#!/bin/bash

#This will run an array of 152 jobs. The indices of the array are used to read one line of the Name_List.txt at a time and submit a job for that sample.

#SBATCH -J HybPiper
#SBATCH -o info.output.o%j
#SBATCH -e info.error.e%j
#SBATCH --array=1-152%30


#Set the long queue
#SBATCH -A long

#Set amount of time to run
#SBATCH -t 7-00:00:00

#Set the number of nodes
#SBATCH -N 1

# Set the number of tasks/cpus
#SBATCH -n 8

module load hybpiper
module load openmpi

# Run the MPI program

#Use the array's index to read the file with names into the $name variable
name=$(sed -n "$SLURM_ARRAY_TASK_ID"p Name_List.txt)

#Run HybPiper
python /opt/modules/biology/hybpiper/1.2/bin/intronerate.py --prefix $name

#Or run this line in a sequential way:
while read name;
do python /opt/modules/biology/hybpiper/1.2/bin/intronerate.py --prefix $name
done < Name_List.txt

Retrieve all the genes that you recovered, either nucleotides (dna), proteins (aa), introns and exons (supercontig), or just introns (intron)

#!/bin/bash

#SBATCH -J HybPiper
#SBATCH -o info.output.Retrieving
#SBATCH -e info.error.Retrieving

#Set the long queue
#SBATCH -A long

#Set amount of time to run
#SBATCH -t 7-00:00:00

#Set the number of nodes
#SBATCH -N 1

# Set the number of tasks/cpus
#SBATCH -n 8

module load hybpiper
module load openmpi
module load python/2.7.3

# Run the MPI program

#Retrieving Sequences
#This script fetches the sequences recovered from the same gene for many samples and generates an unaligned multi-FASTA file for each gene.
#Change dna below to supercontig, intron, or aa to retrieve different types of data.

python /opt/modules/biology/hybpiper/1.2/bin/retrieve_sequences.py Targets_Final.fasta . dna > exons.log
python /opt/modules/biology/hybpiper/1.2/bin/retrieve_sequences.py Targets_Final.fasta . aa > proteins.log
python /opt/modules/biology/hybpiper/1.2/bin/retrieve_sequences.py Targets_Final.fasta . intron > introns.log
python /opt/modules/biology/hybpiper/1.2/bin/retrieve_sequences.py Targets_Final.fasta . supercontig > supercontigs.log

mkdir Final_Genes
mkdir ./Final_Genes/Exons ./Final_Genes/Proteins ./Final_Genes/Supercontig ./Final_Genes/Introns
mv *FNA exons.log ./Final_Genes/Exons
mv *introns* ./Final_Genes/Introns
mv *FAA proteins.log ./Final_Genes/Proteins
mv *supercontig* ./Final_Genes/Supercontig

Analyzing potential paralog loci

Matt has a really good explanation of what you can expect and do with the possible paralogous genes. Go read it here.

To get a list of the number of copies per gene in each sample to stdout, type the following:

while read i
do
echo $i
python /opt/modules/biology/hybpiper/1.2/bin/paralog_investigator.py $i
done < Name_List.txt

You'll need to select all and copy/paste the list to a text file.

Now you have to create a file with the genes that have potential paralogs. Open the file you created in the previous step (number of copies per gene per sample) and remove everything from it except the gene names. Do this in a text editor with search and replace.

Example: this line of text 2 paralogs written for Gene000002284419 changes to Gene000002284419. After you've done this, delete any duplicated lines. In BBEdit go to "Text->Process Duplicated Lines". Finally, delete all returns \r and replace them with spaces so that all the genes are on a single line.

Now you have a list of genes that have multiple copies.
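If you saved the screen output from the loop above to a file (say paralog_report.txt, a hypothetical name), the list can also be built automatically; this sketch assumes each report line ends with the gene name, as in the example above:

grep "paralogs written for" paralog_report.txt | awk '{print $NF}' | sort -u > Paralog_Genes.txt
#If you want the gene names on a single space-separated line instead:
tr '\n' ' ' < Paralog_Genes.txt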

Finally run the following command:

# Using a file with one gene per line:
mkdir Final_Genes/Paralogs
parallel "python /opt/modules/biology/hybpiper/1.2/bin/paralog_retriever.py Name_List.txt {} > {}.paralogs.fasta" :::: ./Final_Genes/Paralog_Genes.txt 2> ./Final_Genes/Paralog_table.txt
mv *paralogs* Final_Genes/Paralogs
# Using the gene names
mkdir Final_Genes/Paralogs
parallel "python /opt/modules/biology/hybpiper/1.2/bin/paralog_retriever.py Name_List.txt {} > {}.paralogs.fasta" ::: Gene000000000064 Gene000000000099 Gene000000000106 Gene000000000107 Gene000000000108 Gene000000000111 Gene000000000119 Gene000000013958 Gene000000026916 Gene000000034941 Gene000000072925 Gene000000088452 Gene000000093415 Gene000000095243 Gene000000105734 Gene000000109936 Gene000000117000 Gene000000139714 Gene000000140381 Gene000000148794 Gene000000151236 Gene000000155182 Gene000000155339 Gene000000166541 Gene000000168219 Gene000000179355 Gene000000185393 Gene000000185460 Gene000000198698 Gene000000210298 Gene000000213937 Gene000000216209 Gene000000226910 Gene000000252250 Gene000000261154 Gene000000272966 Gene000000284062 Gene000000314285 Gene000000370726 Gene000000372933 Gene000000408989 Gene000000456987 Gene000000467115 Gene000000480359 Gene000000510272 Gene000000520052 Gene000000560456 Gene000000630862 Gene000000665080 Gene000001291309 Gene000001298132 Gene000002089674 Gene000002090980 Gene000002093127 Gene000002162473 Gene000002165004 Gene000002165930 Gene000002168236 Gene000002182899 Gene000002213682 Gene000002219132 Gene000002219408 Gene000002221889 Gene000002240027 Gene000002281776 Gene000002284419 Gene000002305250 Gene000002341791 Gene000002346357 Gene000002348025 Gene000002349895 Gene000002410184 Gene000002412774 Gene000002413865 Gene000002420086 Gene000002420898 Gene000002437988 GeneAT1G01180 GeneAT1G04110 GeneAT1G04640 GeneAT1G07740 GeneAT1G10460 GeneAT1G15510 GeneAT1G21840 GeneAT1G26900 GeneAT1G32520 GeneAT1G53600 GeneAT1G65380 GeneAT2G16880 GeneAT2G28790 GeneAT2G29900 GeneAT2G42700 GeneAT2G43760 GeneAT3G05340 GeneAT3G06920 GeneAT3G21470 GeneAT3G48150 GeneAT3G58520 GeneAT4G21300 GeneAT4G23890 GeneAT4G31850 GeneAT4G33990 GeneAT5G06370 GeneAT5G18390 GeneAT5G18475 GeneAT5G21970 GeneAT5G43280 GeneAT5G50290 2> ./Final_Genes/Paralog_table.txt
mv *paralogs* Final_Genes/Paralogs

Separate by Clade

I'm going to separate the final sequences (the supercontigs in this case) in each fasta file by taxonomic group.

Start by making non-interleaved versions of the fasta files that HybPiper created:

cd Supercontig
mkdir Non-Interleaved
for i in *fasta; do awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < $i > $i"_non_interleaved.fasta"; done
mv *non_interleaved.fasta Non-Interleaved/

For the Burmeistera species that have good names and enough loci, plus some outgroups (create a Species_with_Good_Data directory first, since the loop writes into it):

for i in *.aln; do grep -A 1 -e \>Burmeistera_almedae_2016_001_Combined -e \>Burmeistera_anderssonii_2016_113_Combined -e \>Burmeistera_aspera_2016_089_Combined -e \>Burmeistera_auriculata_2016_199_Combined -e \>Burmeistera_borjensis_2016_219_Combined -e \>Burmeistera_brachyandra_2016_114_Combined -e \>Burmeistera_brighamioides_2016_121_Combined -e \>Burmeistera_bullatifolia_2016_129_Combined -e \>Burmeistera_ceratocarpa_2016_233_150bp -e \>Burmeistera_cf_aeribacca_2016_087_Combined -e \>Burmeistera_chiriquiensis_2016_161_Combined -e \>Burmeistera_crassifolia_2016_200_Combined -e \>Burmeistera_crispiloba_2016_201_Combined -e \>Burmeistera_cyclostigmata_2016_104_Combined -e \>Burmeistera_cylindrocarpa_2016_235_Combined -e \>Burmeistera_darienensis_2016_149_Combined -e \>Burmeistera_domingensis_2016_192_Combined -e \>Burmeistera_draconis_2016_133_Combined -e \>Burmeistera_dukei_2016_153_Combined -e \>Burmeistera_fuscoapicata_2016_128_Combined -e \>Burmeistera_glabrata_2016_227_Combined -e \>Burmeistera_holm-nielsenii_2016_210_Combined -e \>Burmeistera_huacamayensis_2016_115_Combined -e \>Burmeistera_litensis_2016_214_Combined -e \>Burmeistera_loejtnantii_2016_006_Combined -e \>Burmeistera_lutosa_2016_209_Combined -e \>Burmeistera_mcvaughii_2016_155_Combined -e \>Burmeistera_multiflora_2016_196_Combined -e \>Burmeistera_obtusifolia_2016_108_Combined -e \>Burmeistera_oyacachensis_2016_230_Combined -e \>Burmeistera_panamensis_2016_151_Combined -e \>Burmeistera_parviflora_LL_20_Combined -e \>Burmeistera_pirrensis_2016_148_Combined -e \>Burmeistera_quercifolia_2016_009_Combined -e \>Burmeistera_ramosa_2016_007_Combined -e \>Burmeistera_refracta_2016_195_Combined -e \>Burmeistera_resupinata_2016_240_Combined -e \>Burmeistera_resupinata_var_heilbornii_2016_116_Combined -e \>Burmeistera_rubrosepala_2016_239_Combined -e \>Burmeistera_smaragdi_2016_236_Combined -e \>Burmeistera_smooth_2016_135_Combined -e \>Burmeistera_sodiroana_2016_197_Combined -e \>Burmeistera_succulenta_2016_143_Combined -e \>Burmeistera_succulenta_2016_188_Combined -e \>Burmeistera_tenuiflora_2016_158_Combined -e \>Burmeistera_truncata_2016_144_Combined -e \>Burmeistera_utleyi_2016_156_Combined -e \>Burmeistera_vulgaris_2016_100_Combined -e \>Burmeistera_xerampelina_2016_086_Combined -e \>Burmeistera_zurquiensis_2016_098_Combined -e \>LL6_C_smithii_Combined -e \>LL49_C_incanus_Combined -e \>LL61_C_asclepideus_Combined -e \>LL69_C_nigricans_Combined -e \>LL83_C_brittonianus_Combined -e \>LL86_C_mandonis_Combined -e \>LL88_S_ayersiae_Combined -e \>LL159_S_jelskii_Combined -e \>LL334_S_aureus_Combined -e \>LL363_S_krauseanus_Combined  $i > Species_with_Good_Data/$i"_Good_Data.aln"; done

#Remove the "--" separator lines that grep -A adds between matches

cd Species_with_Good_Data
for i in *.aln; do sed -i "" '/^--$/d' $i; done

For all Burmeistera, including some outgroup species:

cd Non-Interleaved
mkdir Burmeistera_and_Outgroups

for i in *non_interleaved.fasta; do grep -A 1 -e \>Burm -e \>LL69_C_nigricans -e \>LL363_S_krauseanus -e \>LL6_C_smithii -e \>LL159_S_jelskii -e \>LL61_C_asclepideus -e \>LL334_S_aureus -e \>LL49_C_incanus -e \>LL83_C_brittonianus -e \>LL86_C_mandonis -e \>LL88_S_ayersiae $i > Burmeistera_and_Outgroups/$i"_Burmeistera_and_Outgroups.fasta"; done

#Remove the "--" separator lines that grep -A adds between matches

cd Burmeistera_and_Outgroups
for i in *.fasta; do sed -i "" '/^--$/d' $i; done

Alignment, Clean up, and Phylogenetics

Before running this code, read how to "parallelize" the workload below!

If you don't have access to a cluster with many nodes and cores, but do have access to a few machines with many cores, you can still work in parallel by assigning some of the work to each computer.

You need to figure out how many files you have and how many "batches" you are going to divide the work into, then move the right number of files into each batch and run the analyses in each batch.

Using the same code I used during the concatenation step of the target genes:

#Delete any empty files
find . -size 0 -delete

#How many batches (directories) do you want to work with?
DIRECTORIES=30

#Counts number of files in a directory and stores it in a variable
NUMFILES=$(ls | wc -l)
#Divides the file count by the number of batches in `DIRECTORIES`. It stores this number with its decimals (files/$DIRECTORIES) in a variable
NUMFILESDIR=$(echo "$NUMFILES/$DIRECTORIES"|bc -l)
#Same division, but stored with NO decimals (integer division), to know how many files to move per batch
NUMFILESDIR_NO_FLOAT=$(($NUMFILES / $DIRECTORIES))
#Informs how many files need to go to each directory
echo "<< The number of files per directory is ***$NUMFILESDIR*** >>"

eval mkdir {01..$DIRECTORIES}

#Move the corresponding number of files into each directory (one pass per batch; the loop below runs $DIRECTORIES times)

for d in $(seq -w 1 $DIRECTORIES)
do
for file in $(ls -p | grep -v / | tail -$NUMFILESDIR_NO_FLOAT); do mv $file ./$d; done
done

## Check that you moved all the files!!!!!
## If not, move the remainder manually to the last (largest number) directory.

Within the directory with the cleaned fasta files, either execute the code below or, even better, copy-paste it into a file and run it as a script. I called my script Align_Clean_Phylo.sh, and I copied the script into each directory/batch of files with the code below:

echo 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | xargs -n 1 cp Align_Clean_Phylo.sh

Then run the script with ./Align_Clean_Phylo.sh

Code for phylogenetics

#!/bin/bash

# Load modules if running on a cluster
module load mafft
module load phyutility
module load fasttree

# Delete any empty files
find . -size 0 -delete

# Create directories

mkdir Alignments Fasta Phylo

# Align with Mafft, clean with Phyutility, and analyze with FastTree

for i in *.fasta
do
# Align
echo ""
echo "~~~Aligning with Mafft~~~"
echo ""
time mafft --auto --thread 6 --preservecase $i > $i.aln

# Cleaning Alignment
echo ""
echo "~~~Cleaning alignment with Phyutility at 0.5~~~"
echo ""
time phyutility -clean 0.5 -in $i".aln" -out $i"_cleaned_05.aln"

# Phylogenetics
echo ""
echo "~~~Building phylogeny with FastTre~~~"
echo ""
time FastTree -nt -gtr < $i"_cleaned_05.aln" > $i".tre"

# House keeping
mv $i".aln" $i"_cleaned_05.aln" Alignments
mv $i".tre" Phylo
mv $i Fasta
done

If you need to standardize names within the trees

Maybe I'll need to make the tips match for species tree analyses or concatenation. The code below will get rid of the appended gene information:

perl -pe 's/_Gene\w+//g' InFile > OutFile

Species Tree Inference

After all the gene trees are built, it's time to make a species tree under the multispecies coalescent. I'm going to use ASTRAL II and SVDquartets for this purpose.

ASTRAL II

Starting in a directory with all the gene trees (the best tree from RAxML for example), make a file with all the gene trees in it:

cat genetree* > All_gene_trees.tre

Make sure that the names of the species in each gene tree are the same!! Otherwise the species tree program won't know they are the same. This can be skipped if you standardized the names with the perl step above.

For example, "species1_gene1" for gene tree 1 needs to be changed to "species1". This is easily done in a text editor.

Now that they all match, run ASTRAL II

java -jar /Applications/ASTRAL-master/Astral/astral.4.11.1.jar -i All_gene_trees.tre -o Species_tree_Astral.tre

SVDquartets

Start with a file with all loci concatenated into a NEXUS file. Get the latest command-line version of Paup and type the following:

paup

#Within Paup

exe file.nex

#No Bootstrap
SVDQuartets nthreads=25 nquartets=10000000

#Save the tree
SaveTrees file = Burmeistera_SVDquartet.tre format = Newick brLens = yes supportValues = Both trees = all

#With Bootstrap
SVDQuartets nthreads=25 nquartets=10000000 bootstrap=standard nreps=50 treeFile= Burmeistera_SVDquartet_Bootstrap_Good_Data.trees

Cleaning up

Run the code below to delete unnecessary files, mostly the results from SPAdes. This will reduce the size of the directories by 75%.

python /opt/modules/biology/hybpiper/1.2/bin/cleanup.py

Final Commands and Notes

  • Remember to compress your reads after you are done!
gzip ./Cleaned_Data_SeqyClean_May_2017/*fastq
  • Be in the directory that contains the Name_List.txt and Targets_Final.fasta. The reads/genes will be processed here.
  • Make sure that the paths in your commands point to the HybPiper directory and the appropriate reads. Important: if the cluster/computer you are using gives you an error that the read files don't exist, even though you triple-checked that the path is correct and that there are no typos in the names, you will need to move the reads to the directory you are processing the genes in.
  • Use the flag --cpu N to specify the number of processors available to HybPiper. This is especially important if you are using a cluster or if you want to limit the available resources of your computer.
  • Have fun!

Analyzing Genes with Statistical Binning

[Image from Bayzid et al., 2015 (journal.pone.0129183.g001)]

So, you have your multiple independent nuclear genes from your genomic experiment (maybe you did Microfluidics, Hyb-Seq, RADSeq, or some other cool high-throughput sequencing method) and are now ready to run your coalescent-based species tree estimation analyses. But what happens if you are unable to run the *BEAST analysis you have always dreamed of because you have too much data? Or if your gene trees are not informative enough for a summary method like ASTRAL II or MP-EST? This is where Statistical Binning comes to the rescue!

In short, Statistical Binning concatenates genes that have congruent gene trees into 'super-genes'. The reasoning is that by reducing the number of loci you need to analyze, you reduce the number of independent parameters that a program like *BEAST needs to estimate, thus making it run faster and, hopefully, converge on a meaningful result. Another important reason for doing a Statistical Binning analysis is to get gene trees that are better resolved. Summary methods (ASTRAL II, MP-EST...) provide a robust species tree estimate as long as the gene trees that go into the program are resolved and robust. Many times this is not possible, especially if the gene trees are estimated from exon-based capture probes and are not very informative (I have loci with one substitution in >600bp!) or if you are working on a recently diverged group of taxa. By concatenating these uninformative loci into a longer, and thus more informative, 'super-gene', your chances of estimating a more resolved gene tree are higher. I'm currently using this approach as part of the "Preliminary Data" section of an NSF proposal due at the beginning of August; fingers crossed :).

First, download the Statistical Binning code from their GitHub repository and follow their instructions on how to install and run it; it's pretty simple.

The only things I struggled with were 1) how to define "BINNING_HOME" as an environmental variable and 2) that the program was telling me that I didn't need to do the binning at all.

The first issue was resolved with a quick Google search and it's done like this:

export BINNING_HOME="path to the directory where the program lives"

The above code will set BINNING_HOME as a variable that will be available until you close the current shell you are working in.
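If you want the variable to survive new terminal sessions, append the export line to your shell profile (a minimal sketch, assuming bash and the path used below):

echo 'export BINNING_HOME=/Applications/Statistical_Binning' >> ~/.bash_profile
source ~/.bash_profile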

The second issue was fixed by using trees from a proper ML analysis. I was first experimenting with the method and was using trees that were generated with only a few (i.e., 2) bootstrap replicates, oops! After I used trees with an appropriate number of replicates (in my case 1000), this problem went away.

Here is a list of the commands I typed, using a low bootstrap threshold of 35 (modify as needed). I was already in the program's directory, but since I set the path to it in the variable, I could have run it from a different location.

export BINNING_HOME=/Applications/Statistical_Binning
./makecommands.compatibility.sh genes_dir 35 Output RAxML_bipartitions.Final.tre
sh commands.compat.35.genes_dir 
cd Output
ls| grep -v ge|sed -e "s/.35$//g" > genes
python $BINNING_HOME/cluster_genetrees.py genes 35

The program might tell you that a "bin" contains only one gene and that you might want to link the gene tree you already estimated instead of running the analysis again for that gene. Using the program's default parameters, it will create a symlink but, at least for me, this link didn't point to the right location of the gene tree. The manual says that you can modify one of the scripts to create symlinks matching your file structure, but I haven't done this yet. If I do, I'll update this post to show you how it's done.

After you have your congruent loci concatenated into 'super-genes', it is time to estimate 'super-gene trees' with your program of choice. A cool feature of Statistical Binning is that it produces a partition file for each "super-gene" with information on how many loci comprise that "super-gene" and the positions where they start and finish. This is very useful if you want to (and you should) estimate the "super-gene tree" in a partitioned analysis, for example with the "-q" flag in RAxML. After the "super-gene trees" are ready, use them as input for your summary method or *BEAST analysis and voilà, you now have a, hopefully, better estimate of the species tree.
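For reference, a partitioned RAxML run for one "super-gene" might look roughly like this (a sketch; the alignment and partition file names are placeholders for whatever the binning pipeline wrote out, and the seeds, bootstrap replicates, and thread count are arbitrary):

raxmlHPC-PTHREADS-SSE3 -T 8 -f a -m GTRGAMMA -p 12345 -x 12345 -N 100 -s supergene_1.phy -q supergene_1.partitions -n supergene_1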

Generating Unique Locus Tags for Plastomes

You need (should) to associate a unique identifier, known as a locus tag, with each gene when you are annotating a genome. This should be done for every feature you annotate (tRNA, rRNA, CDS, genes) with the exception of miscellaneous features (e.g., Large Single Copy [LSC]) and repeated regions (e.g., Inverted Repeat [IR]).

Locus tags can be anything you want as long as you are consistent and systematic about how you name them. A convention I've heard of before is: two letters for the genus, followed by an underscore, a three-letter abbreviation of the species, another underscore, two letters for the genome you are working with (e.g., chloroplast), an underscore, and the position where the gene starts (sometimes rounded to the next 100th base). An example for a chloroplast gene in Neobartsia ramosa starting at position 4657 would be: Ne_ram_cp_4657.

I haven't found a simple, fast, and efficient way to create these locus tags in any annotation program, and I don't want to do it by hand! Geneious does have the option of creating locus tags in batch mode, but they are all identical and you have to type the unique part (the starting position) one at a time to make them, well, unique. So here is what I did to create these tags in 19 annotated plastomes. It is cumbersome and not at all fancy, but it saved me a whole bunch of time.
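For the record, the same idea can be sketched in Biopython instead of a text editor; the snippet below is untested and the file names, prefix, and feature types are assumptions (it tags every gene/CDS/tRNA/rRNA/exon/intron feature with the prefix plus its 1-based start position):

#rough_locus_tags.py
from Bio import SeqIO

PREFIX = "Ne_ram_cp_"  #generic per-species prefix
TYPES = {"gene", "cds", "trna", "rrna", "exon", "intron"}

records = []
for record in SeqIO.parse("Neobartsia_ramosa.gb", "genbank"):  #hypothetical input file
    for feature in record.features:
        if feature.type.lower() in TYPES:
            start = int(feature.location.start) + 1  #Biopython locations are 0-based
            feature.qualifiers["locus_tag"] = [PREFIX + str(start)]
    records.append(record)

SeqIO.write(records, "Neobartsia_ramosa_tagged.gb", "genbank")  #hypothetical output file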

I annotated my plastomes in Geneious, creating a generic locus tag for each species (i.e., Ne_ram_cp_), and exported the annotated sequences as GenBank flat files. I kept the Geneious labeling, selecting the "non-strict" GenBank file.

Then, I combined all the files into a single file using "cat" in the terminal and opened it in TextWrangler. There, I did a few search and replaces using regular expressions to move some values to a certain position. You might need to modify these if you are not using Geneious.

***WARNING: There is a space after "Search:" and "Replace:" that should not be included in the patterns!***

Search: (complement\()([0-9]+)(.+) -- Replace: \t\t\t\1\t\2\t\3


Search: (.+locus_tag.+)" -- Replace: \t\t\t\1


Search: ^( +gene) +([0-9]+)(.+) -- Replace: \1\t\t\t\t\2\t\3


Search: ^( +CDS) +([0-9]+)(.+) -- Replace: \1\t\t\t\t\2\t\3


Search: (join\()([0-9]+)(.+) -- Replace: \t\t\t\1\t\2\t\3


Search: ^( +exon) +([0-9]+)(.+) -- Replace: \t\t\t\1\t\2\t\3


Search: ^( +intron) +([0-9]+)(.+) -- Replace: \t\t\t\1\t\2\t\3

Search: ^( +trna) +([0-9]+)(.+) -- Replace: \1\t\t\t\t\2\t\3 

Search: ^( +rrna) +([0-9]+)(.+) -- Replace: \1\t\t\t\t\2\t\3

These searches moved the start position of the genes and the locus_tags to a place separated by a set number of tabs. I then opened this file in Excel and concatenated the cell with the locus tag with the one with the starting position of the gene. Something like: =CONCATENATE(cell-with-locus-tag,cell-with-starting-position,""""). The last quotes are there to insert the closing quote that goes in the GenBank file; I deleted it during the search and replace step. Then copy this formula down the whole column.

You should end up with a column that now has unique tags for every gene; it should look something like this: /locus_tag="Ne_ram_cp_413". Double check that all, or most, of the loci have their tags filled. I noticed there were a few (2-3) that had their start position somewhere else, and I had to do those by hand.

Now, you need to replace the generic locus tag with the one we created, doing some search and replace steps, and reverse all the moving around of cells from the previous steps.

Search: /locus_tag="\w+\t\t\t\t + -- Replace: nothing

Search: \t"\t -- Replace: nothing

Search: \t\t\t -- Replace: nothing

Search: complement\("$ -- Replace: nothing

The last thing is to separate the file into single species again (or you could do this on each flat file separately). Check that everything looks identical to an unaltered file! I noticed that I had deleted some tabs I shouldn't have, and now the start position of the gene is right next to the type of annotation, e.g., CDS4780..6276. This happened with other annotation types too (grrr). I had to add the corresponding spaces back!

Search: (CDS)(\w) -- Replace: \1             \2

Search: (gene)(\w) -- Replace: \1            \2

Search: (exon)(\w) -- Replace: \1            \2

Search: (trna)(\w) -- Replace: \1            \2

Search: (rrna)(\w) -- Replace: \1            \2

Search: (intron)(\w) -- Replace: \1          \2

Search: ^(complement.+) -- Replace:                      \1

Search: join\("$ -- Replace: nothing

Reimport them to Geneious, check that you have the same number of annotations, and if everything looks good, use their plug-in to submit to GenBank.

Pretty silly way to do it but it works.


Choosing from Multiple Sets of Capture Probes

So I managed to create various sets of capture probes, or baits, from four transcriptomes and 19 genome skim libraries of the plant genus Burmeistera. As I mentioned in a previous post, I did this using the pipeline Sondovac with SPAdes as the assembler. The problem then was to compare the four different probe sets (one for each transcriptome plus all the libraries combined) to each other and pick the best one, or a combination of a few. But how?

The first idea is to pick probes that are shared among all my sets. This would give me "safe" genes that would likely be captured across my ~500 samples, but also genes that could be very conserved. The opposite idea, of course, is to pick the ones that do not overlap, i.e., the non-redundant ones, hoping that these genes are still shared across samples but are not as conserved as the ones found in every transcriptome. The latter set would also include a larger number of probes. Wait, what? Let's see it in a figure to clear up my mind:

[Figure: Capture_Probes_Design]

In addition to these four probe sets, I have a set (one transcriptome vs. all reads) done with Sondovac but using Geneious as the assembler instead of SPAdes. This is actually the way Sondovac is supposed to be run, but because my data are so large, Geneious would crash trying to assemble the contigs. After three days of continuous running, and finally crashing my computer, I managed to assemble 40% of the contigs for this set. Even though the assembler didn't finish, this probe set contains ~1,800 genes comprised of ~5,300 exons! Another advantage of this set is that it has exon-intron boundary information, which is crucial to avoid designing baits across the junction of two transcribed exons; those probes would likely fail during capture because in genomic DNA the intervening intron has not been spliced out. This is key, and it's something that the sets using SPAdes as the assembler lack.

Finally, I have another set of probes designed by my collaborator Monica Carlsen (in a pipeline I'm calling Hydra) that contains 158 genes with intron-exon boundary information and 471 genes without this information.

The first thing I did was to compare these sets using BLAT to see how many genes they had in common. Short answer, not many.

blat database query -tileSize=11 -minIdentity=90 -out=pslx output.pslx

So it was obvious that I would need to combine these sets somehow to have enough target genes for the 10,000 probes (1.2 million bp) that we are ordering from Rapid Genomics. After long conversations with people near and far (story on Twitter) about what type of genes to target, mostly whether targeting genes with multiple exons is better than genes with a single exon, we decided to opt for a pluralistic set including genes with and without multiple exons, but leaning a little bit more toward the ones with multiple exons.
Our final probe set contains:

-158 genes (517 exons) with intron-exon information from Hydra.
-500 genes (2,198 exons) with intron-exon information from Sondovac (one transcriptome vs. all genome skim libraries, assembled in Geneious until 40%).
-300 genes without intron-exon information from the non-redundant set from Sondovac (each transcriptome vs. all genome skim libraries assembled in SPAdes, with the four sets combined and run through a non-redundancy program [CD-HIT-EST]).

cd-hit-est -i input.fasta -o output.fasta -c 0.9 -n 10 -g 1 -p 1 -d 0 -T 0

After a final step in CD-HIT-EST (90% similarity) to get rid of redundancy among the last three sets combined, we have:

Total exons: 2,872
Total length: 1,412,629 bp
Max length of exon: 3,764 bp
Min length of exon: 120 bp
Mean length of exon: 491.86 bp

Note that the lengths here are for exons and not for genes, and that most of the smallest (120 bp) exons are part of genes comprised of a few exons. For example, the longest gene from the Sondovac-Geneious set is 4,175 bp and has 4 exons, and the second longest is 3,947 bp and has 14 exons.

So the plan is to submit a bunch of genes and hope that most of them work. In a worst case scenario, if only 200 or so genes work it would still be a pretty good multilocus data set to do coalescent-based species tree estimation on Burmeistera!

We'll see what happens, fingers crossed!

Capture probe design with Sondovac

My postdoc project involves designing nuclear capture probes to sequence hundreds of loci. Recently, a publication (Schmickl et al., 2016) on the phylogenomics of the plant genus Oxalis showed the utility of a new pipeline (Sondovač) for designing such orthologous low-copy nuclear probes for target enrichment. The script is well written and straightforward and requires minimal bioinformatic experience; it even installs the programs it needs! The probes are designed from transcriptome and genome skim data (shotgun libraries), which you can get from a pilot study or from publicly available databases like the 1KP. This is a huge advantage, especially for non-model organisms that don't have large genomic resources. Although very promising, it has a few minor caveats that could be easily addressed: i) it requires paired-end sequence data for a mapping step with Bowtie2, and ii) it uses Geneious to do de novo assemblies. The problem with Bowtie2 is easily fixed in the script by changing the command for the mapper; the problem with Geneious has proven to be a bit trickier...

I have been using the pipeline to design probes for Burmeistera (Campanulaceae), with the goal of sequencing ~500-1000 nuclear genes in ~120 species. I had 19 genome skim libraries (100bp single-end) and six transcriptomes to work with, and here is what I have done so far.

To address the problem with Bowtie2 and my single-end reads, I modified the script so that Bowtie2 would only use single-end reads, and I commented out a step that uses FLASH to merge the paired-end reads into longer sequences. Both of these steps make the pipeline much more effective and I wish I had paired-end reads, but alas! A copy of the modified script can be found on my GitHub (HERE).

Because the pipeline is designed to work with one transcriptome and one genome skim library, I started my analyses using one transcriptome and one genome skim library (those with the most reads). This worked well, recovering 574 genes (>600bp) comprised of ~2300 exons. You can modify some options in Sondovac, e.g., minimum gene length, size of the bait, etc., but I went with the default values.

However, I was not using all the available data I had, and this felt silly. To overcome this, I generated a few files by combining similar types of data (all transcriptomes, all genome skim libraries, or a combination of those) and ran various analyses on these data combinations. To make it clearer: I ran, for example, an analysis with all combined genome skim libraries and one transcriptome, or one genome skim library against all combined transcriptomes, or all combined libraries against all combined transcriptomes. I wanted to really explore my data and see what kind of results I would get with different amounts of data.

Sondovac worked well with these large files (I think it took a day or two to run on the larger data combinations), but the problem came after part "a" of the script, when I tried to assemble the reads in Geneious. Geneious is a great program, but it's very slow (read: useless) with large files. I was actually never able to assemble the larger data combinations (e.g., one transcriptome and all libraries = ~4 million sequences and a 228Mb file) and thus hit a wall. Nevertheless, thanks to Roswitha Schmickl and Vojtěch Zeisek (Sondovac's developers), I was able to find a workaround.

Instead of dealing with Geneious, I used Newbler and SPAdes to assemble the sequences identified by part "a" of Sondovac, and with a bit of script modification I was able to make part "b" of the pipeline accept the resulting assemblies from these programs. One main thing is to get the resulting fasta files from Newbler and SPAdes into the right naming format; I did this with sed as follows:

Warning: this will modify the file you are reading in! Make a copy of the file, or remove the -i "" part of the command and write (">") to a new file.

For Newbler

sed -i "" -e 's/contig//g' -e 's/\>\([0-9]*\) *length=[0-9]* *numreads=\([0-9]*\)/\>\1_Contig_\1 Assembly of \2 reads/g' 454AllContigs.fna

For SPAdes

sed -i "" -e 's/NODE_/00000/g' -e 's/\>\(.*\)_len.*cov_[0-9]*\.\(.*\)/\>\1_Contig_\1 Assembly of \2 reads/g' contigs.fasta

Now, part b of Sondovac requires a .tsv file that will not be generated by SPAdes or Newbler. To get around this requirement, I modified part b of Sondovac, deleting (or commenting out) every line that calls this file. I placed the modified version of the part b script on my GitHub page. In case you need it, it's HERE. Be sure to call this modified script without the "-x" option.

./sondovac_part_b_no_tsv.sh -n -c Plastome_Reference.fasta -z Consensus_sequences.fasta -k 720

The difference was immense! Both SPAdes and Newbler finished assembling the large files in half an hour or less. The only downside of using these assemblers is that there is no way to get the intron-exon boundary information that you get with Geneious. This isn't a huge deal because in target capture experiments you aim to sequence exons and only get to sequence very little of the introns anyway.

The best part of being able to use the larger data combinations is that I'm getting a ton of loci! For example, by using one transcriptome and all combined genome skim libraries, Sondovac recovered over 1700 genes longer than 600bp, which is very promising given that many of these won't work during the actual experiment.

After all the different experiments I did, I can say a few concluding things: 1) SPAdes consistently results in more genes than Newbler. I used the "--careful" option in SPAdes, but other than that, both programs were run with default parameters (runAssembly and --only-assembler in Newbler and SPAdes, respectively). 2) Oddly enough, the combination of all transcriptomes and all genome skim data didn't result in the most genes. Maybe the transcriptomes are too similar to each other and Sondovac is removing loci based on this. 3) Combining genome skim libraries made a huge impact on the final number of genes. I think this makes a lot of sense if you think of how these shotgun libraries are constructed: the DNA is randomly sheared into smaller fragments that are likely not the same among all samples, so by combining shotgun libraries you increase the number of genomic regions you sequence for the group. You do have to understand that you end up with a chimeric genome for the genus, with sequenced regions belonging to different species, but since these are all closely related species, and their genomes are probably not too divergent, I think that's OK. 4) Finally, the combination of one transcriptome and all genome skim data was the best, with different results depending on the transcriptome I used.

I'm now left to compare the multiple results from the different data combinations, and pick the best one or figure out a way to combine some. I'll do the comparison in BLAT.

Designing Capture Probes

I'll be using RNA hybridization probes (Hyb-Seq or capture probes) to capture hundreds of loci for all species of Burmeistera for my postdoc project. Of course, I have to design these probes first, using the HTS data that have already been generated in my lab.

The ingredients are:

-Six transcriptomes: four for Burmeistera and two for Siphocampylus.

-20 libraries of genome skim data (100bp single-end reads): 19 Burmeistera and one Centropogon.

I tried the MarkerMiner pipeline the other day using the six transcriptomes and Arabidopsis thaliana as the reference. This resulted in 1355 possible loci from which I could design the probes. However, none of the genes found were shared among my transcriptomes, i.e., only one transcriptome had data for each locus, and that's not good...

My next approach, the one I'm working on right now, is to use Sondovac, a new pipeline developed by Roswitha Schmickl et al., which was just published in Molecular Ecology Resources [http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12487/abstract]. Their pipeline is very intuitive and has the great advantage that it uses only your own data. This is very important for us because there aren't any closely related genomes available!

I'm still working out some hiccups, but I think this will be the way to go!

Learning a new cluster – SLURM!

I recently started a postdoctoral position at the University of Missouri РSt. Louis working on genomics and phylogenomics in Burmeistera (Campanulaceae), a plant genus of ~140 species distributed in the cloud forests of northern South America.

I'll soon be assembling transcriptomes using Trinity, but the very first thing I had to learn for this project was how to use UM's new cluster (Lewis3), which runs SLURM! Here are a few things I figured out, mostly so I don't forget!

  1. UM's website with some instructions: https://umbc.rnet.missouri.edu/resources/Submitting_Jobs_via_SLURM.html
  2. A detailed website with all of the options available: http://slurm.schedmd.com/sbatch.html
  3. To run a job allocating nodes and time, use this command: sbatch -N nodes -n processors -t time_for_run. Time is given as: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", or "days-hours:minutes:seconds". For example, sbatch -N 1 -n 8 -t 2-01 script.sh will allocate one node with eight tasks for two days and one hour.
  4. To figure out when your job will run, add the --test-only option to your submission command.
  5. A generic submission script looks like this:
    #!/bin/bash
    #SBATCH -J Name
    #SBATCH -o info.output.o%j
    #SBATCH -e info.error.e%j
    # Set the number of nodes
    #SBATCH -N 1

    # Set the number of tasks/cpus

    #SBATCH -n 8

    ## Let's add the modules in with these lines:
    module load module_you_want_to_use_1
    module load module_you_want_to_use_2

    # Run the program
    Program and option to run

    Now call the submission script at the prompt with: sbatch submission_script.sh

That’s it for now!