Example analysis using the GSE35626 public dataset

The GSE35626 dataset is a TCRα repertoire analysis of CD8+ cells, performed in duplicate.

Preparation

This example was run in an Ubuntu Precise LTS virtual machine in the Amazon Elastic Compute Cloud.

Software installation

Since this is run in a virtual machine that can be erased after use, some steps of the installation are not recommended on local systems. See the README for more precise installation instructions.

Update the package repository to Quantal.

sudo sed -i 's/precise/quantal/g' /etc/apt/sources.list
sudo apt-get update
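
The effect of the substitution can be previewed on a scratch file before editing the real sources list. This is a minimal sketch: the two sample lines and the temporary file are illustrative and not taken from an actual /etc/apt/sources.list.

```shell
# Work on a scratch file instead of /etc/apt/sources.list.
tmp=$(mktemp)
printf '%s\n' \
  'deb http://archive.ubuntu.com/ubuntu precise main' \
  'deb http://archive.ubuntu.com/ubuntu precise-updates main' > "$tmp"

# Same expression as above, but without -i, so the file itself is not modified.
preview=$(sed 's/precise/quantal/g' "$tmp")
printf '%s\n' "$preview"
rm -f "$tmp"
```

Every occurrence of "precise" is replaced, including suffixed names such as precise-updates.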

Install the programs needed by clonotypeR and for other analysis steps.

sudo apt-get install bioperl bwa emboss r-base

Install git, download clonotypeR, build the R package and install it. Note that on this cloud instance, the temporary storage area for large files is in /mnt/.

sudo apt-get install git
sudo install -d /mnt/clonotyper -o ubuntu -g ubuntu
ln -s /mnt/clonotyper .
git clone git://clonotyper.branchable.com/ clonotyper/
R CMD build clonotyper/
sudo R CMD INSTALL clonotypeR_0.1.tar.gz

Install the package for the NCBI SRA toolkit, to convert the sequences downloaded from the Gene Expression Omnibus.

sudo apt-get install sra-toolkit

Data download

The files are gigabytes in size, so the download and conversion to FASTQ format will take roughly one hour.

mkdir clonotyper/GSE35626
cd clonotyper/GSE35626
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP010%2FSRP010815/SRR407172/SRR407172.sra
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP010%2FSRP010815/SRR407173/SRR407173.sra
fastq-dump SRR407172.sra 
fastq-dump SRR407173.sra
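
The two runs differ only by their accession number, so the same steps can be written as a loop. The sketch below only prints the commands instead of running them; sra_url is a hypothetical helper introduced here, and the URL pattern is copied from the wget commands above.

```shell
# Hypothetical helper: build the download URL for one run of study SRP010815,
# following the pattern of the two wget commands above.
sra_url() {
  printf 'ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%%2FSRP010%%2FSRP010815/%s/%s.sra\n' "$1" "$1"
}

# Collect the commands for each run as text (a dry run).
plan=$(for run in SRR407172 SRR407173; do
  printf 'wget %s\n' "$(sra_url "$run")"
  printf 'fastq-dump %s.sra\n' "$run"
done)
printf '%s\n' "$plan"
```

Once the printed commands look right, the output can be piped to sh to execute them.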

Data analysis

In shell

../scripts/clonotypeR detect SRR407172.fastq
../scripts/clonotypeR detect SRR407173.fastq
../scripts/clonotypeR extract SRR407172
../scripts/clonotypeR extract SRR407173

cat clonotypes/*tsv > clonotypes.tsv

R

This prepares a large (3.2 GB) table of clonotypes, and starts R.
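
Because the per-library files are simply concatenated, the merged table should contain as many lines as the inputs combined. The following sketch checks this on toy files; the file names and contents are placeholders, not real clonotypeR output.

```shell
# Toy stand-ins for the per-library files in clonotypes/.
work=$(mktemp -d)
mkdir "$work/clonotypes"
printf 'A\t1\nA\t2\n' > "$work/clonotypes/A.tsv"
printf 'B\t1\n'       > "$work/clonotypes/B.tsv"

# Same concatenation as above, inside the scratch directory.
cat "$work"/clonotypes/*tsv > "$work/clonotypes.tsv"

# Compare the number of lines in the inputs and in the merged table.
parts=$(cat "$work"/clonotypes/*tsv | grep -c '')
merged=$(grep -c '' "$work/clonotypes.tsv")
if [ "$parts" -eq "$merged" ]; then
  echo "line counts match: $merged"
fi
rm -rf "$work"
```

Run against the real directory, a mismatch would suggest that a file was still being written or that the disk filled up during the concatenation.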

In R

library(clonotypeR)
clonotypes <- read_clonotypes('clonotypes.tsv')
a <- clonotype_table(from=clonotypes, feat=c("V","J"))
colSums(a > 0)
# SRR407172 SRR407173 
#     3653      3677 

3,653 V–J pairs were detected in SRR407172 and 3,677 in SRR407173.
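
The idea behind colSums(a > 0), counting the distinct V–J combinations seen at least once in each library, can be illustrated outside R on a toy table. This is only a sketch: the three-column layout (library, V segment, J segment) is an assumption made for the example, and clonotype_table may apply additional filtering that is not reproduced here.

```shell
# Toy table: library, V segment, J segment (tab-separated; layout assumed).
toy=$(printf 'lib1\tV1\tJ1\nlib1\tV1\tJ1\nlib1\tV2\tJ1\nlib2\tV1\tJ1\n')

# Count each library/V/J combination once, then tally per library.
pairs=$(printf '%s\n' "$toy" |
  awk -F'\t' '!seen[$1 FS $2 FS $3]++ { n[$1]++ }
              END { for (l in n) print l, n[l] }' |
  sort)
printf '%s\n' "$pairs"
# lib1 2
# lib2 1
```

The duplicated lib1 row is counted once, which mirrors testing a > 0 rather than summing the counts.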