
Read Filtering Walkthrough

Important Note

The following walkthrough is independent of the Snakemake workflows, and all commands given below are bash commands. It also uses Docker, while the Snakemake workflows use Singularity.

The following walkthrough covers the steps in the read filtering and quality assessment workflow, using Docker to run the workflow interactively on a fresh Ubuntu 16.04 (Xenial) image. Some of the commands below require sudo.

This walkthrough presumes that the steps covered on the Installing page have been completed, and that Python, Conda, Snakemake, and Docker are available; see that page for installation instructions.

Walkthrough Steps

Starting with a fresh image, go through the installation instructions on the Installing page.

Install the Open Science Framework command-line client:

$ pip install osfclient

Install docker with the following shell commands:

$ wget -qO- https://get.docker.com/ | sudo sh
$ sudo usermod -aG docker ubuntu

Make a directory called data and retrieve the data using osfclient. The loop below reads file names from files.txt; specify the path to files.txt or move it into your working directory.

mkdir data
cd data

for i in $(cat files.txt)
do 
    osf -p dm938 fetch osfstorage/data/${i}
done
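files.txt is a plain list of file names, one per line, naming the files under osfstorage/data/ to fetch. For example, to download the read pair used in the rest of this walkthrough, it could contain:

```text
SRR606249_subset10_1.fq.gz
SRR606249_subset10_2.fq.gz
```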

Make an output directory and run FastQC on the untrimmed reads:

mkdir -p ~/data/qc/before_trim

docker run -v ${PWD}:/data -it biocontainers/fastqc fastqc /data/SRR606249_subset10_1.fq.gz -o /data/qc/before_trim

docker run -v ${PWD}:/data -it biocontainers/fastqc fastqc /data/SRR606249_subset10_2.fq.gz -o /data/qc/before_trim

Grab the adapter sequences:

cd ~/data
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
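Once downloaded, the adapter file can be sanity-checked by counting its FASTA records (each sequence header starts with `>`). The snippet below demonstrates the check on a toy two-record FASTA, since the real adapter names and sequences are not reproduced here; run `grep -c '^>' TruSeq2-PE.fa` against the downloaded file itself.

```shell
# count FASTA records by counting header lines
# (toy FASTA with made-up adapter names, for illustration only)
fasta=$'>adapter_a\nACGTACGT\n>adapter_b\nTGCATGCA'
n=$(printf '%s\n' "$fasta" | grep -c '^>')
echo $n
```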

Run Trimmomatic on the paired-end reads:

for filename in *_1*.fq.gz
do
    # first, make the base by removing .fq.gz using the unix program basename
    base=$(basename $filename .fq.gz)
    echo $base

    # now, construct the base2 filename by replacing _1 with _2
    base2=${base/_1/_2}
    echo $base2

    docker run -v ${PWD}:/data -it quay.io/biocontainers/trimmomatic:0.36--4 trimmomatic PE /data/${base}.fq.gz \
                /data/${base2}.fq.gz \
        /data/${base}.trim.fq.gz /data/${base}_se.fq.gz \
        /data/${base2}.trim.fq.gz /data/${base2}_se.fq.gz \
        ILLUMINACLIP:/data/TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25
done
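The filename manipulation inside the loop above can be exercised on its own, using the read pair from this walkthrough:

```shell
# illustrate the basename / parameter-expansion pattern used in the loop
filename=SRR606249_subset10_1.fq.gz
base=$(basename $filename .fq.gz)   # strips the .fq.gz suffix
base2=${base/_1/_2}                 # swaps _1 for _2 to name the mate file
echo $base
echo $base2
```

This prints SRR606249_subset10_1 and SRR606249_subset10_2, the stems used to build the Trimmomatic input and output paths.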

Now run FastQC on the trimmed reads:

mkdir -p ~/data/qc/after_trim


docker run -v ${PWD}:/data -it biocontainers/fastqc fastqc /data/SRR606249_subset10_1.trim.fq.gz -o /data/qc/after_trim

docker run -v ${PWD}:/data -it biocontainers/fastqc fastqc /data/SRR606249_subset10_2.trim.fq.gz -o /data/qc/after_trim

Interleave the paired-end reads using khmer. The output file name includes 'trim2', indicating that the reads were trimmed at a quality score of 2; if other values were used, change the output name accordingly.

cd ~/data
for filename in *_1.trim.fq.gz
do
    # first, make the base by removing _1.trim.fq.gz with basename
    base=$(basename $filename _1.trim.fq.gz)
    echo $base

    # construct the output filename
    output=${base}.pe.trim2.fq.gz

    docker run -v ${PWD}:/data -it quay.io/biocontainers/khmer:2.1--py35_0 interleave-reads.py \
        /data/${base}_1.trim.fq.gz /data/${base}_2.trim.fq.gz --no-reformat -o /data/$output --gzip

done
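What interleave-reads.py produces can be illustrated with a toy example built from core utilities (this is not khmer itself, which additionally validates read pairing): fold each 4-line FASTQ record onto one line, pair the records side by side, then expand them back.

```shell
# toy record-level interleaving of one read pair (illustration only, not khmer):
# paste - - - - folds each 4-line FASTQ record onto one tab-separated line,
# the outer paste pairs the two records, and tr re-expands them to 8 lines
interleaved=$(paste <(printf '@r1/1\nACGT\n+\nIIII\n' | paste - - - -) \
                    <(printf '@r1/2\nTGCA\n+\nIIII\n' | paste - - - -) \
              | tr '\t' '\n')
echo "$interleaved"
```

The result alternates records: the /1 read's four lines, then its /2 mate's four lines, which is the layout the interleaved .pe.trim2.fq.gz files follow.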