Example analysis
This example analysis assumes a Unix system (Linux, Mac OS, …)
Example data
The data provided on-line in example_data is a sub-sample of three sequence librairies (2,000 reads each) made on the 454 Titanium or the 454 junior platforms. The original libraries will be deposited in public databanks after publication in a peer-reviewed journal.
These example libraries are called A
, B
and C
, and are in FASTQ format,
with entries like the following (the sequence was truncated for the convenience
of the display).
@HKTLYLP01B0MTM
gactGTCCATCTTCCTTTTATCGGACACTGAAGTATGGATATCAGAAGTGCAgggccttcccacgggaacg
+
IIIIIIIIIIIHHFF::::G>IIIGGGIIIIIIIIIGGIIIIIIFEBDCDC<//-5522------
Detection of V segments
Run the command clonotyper detect A.fastq
in the same directory as a copy of
the file A.fastq
.
The result is stored in a temporary directory called extraction_files
, that
will be created if it does not already exist.
clonotyper detect
compares the sequences to the reference V segments using
BWA, and produces output like the following.
[bsw2_aln] read 2000 sequences/pairs (843395 bp)...
[samopen] SAM header is present: 167 sequences.
[main] Version: 0.6.2-r126
[main] CMD: bwa bwasw -t8 /usr/share/clonotypeR/references/V-C/index A.fastq
[main] Real time: 1.099 sec; CPU: 8.225 sec
This indicates that 2,000 reads have been processed, representing 843,395 base
pairs in total. There were 167 reference V segments, and the version number of
BWA was 0.6.2-r126
. The whole process took less than 10 seconds.
Process the example libraries B
and C
similarly with the commands
clonotyper detect B.fastq
and clonotyper detect C.fastq
.
Extraction of CDR3 regions
Run the command clonotyper extract A
in the same directory as where you ran
clonotyper detect A.fastq
. The result is a table stored in a directory
called clonotypes
, that will be created if it does not already exist.
The output is quite voluminous, and indicates which V / J combinations are being found, like on the following.
TRAV14-3 233
TRAJ61 0
TRAJ60 0
TRAJ59 0
TRAJ58 1
TRAJ57 39
TRAJ56 2
TRAJ55 0
The format of the table is explained in the manual
page of the function read_clonotypes()
of the
R package.
For each library (A
, B
and C
), one file is available in the clonotypes
directory. With BWA 0.6.2-r126
, the following numbers of clonotypes are found.
1072 clonotypes/A.tsv
924 clonotypes/B.tsv
689 clonotypes/C.tsv
The files need to be concatenated before analysis in R
, with the following command.
find clonotypes/ -name '*tsv' | xargs cat > clonotypes.tsv
Data analysis in R
Start the R
statistical analysis software with the command “R
”. The commands entered after this are executed by R
.
Load the clonotypeR library: library(clonotypeR)
Load the data in a R object called clonotypes: clonotypes <- read_clonotypes('clonotypes.tsv')
The command summary(clonotypes)
already provides useful information.
> summary(clonotypes)
lib V J read
A:1072 TRAV14-1 :944 TRAJ31 : 380 Length:2684
B: 924 TRAV14-2 :237 TRAJ23 : 270 Class :character
C: 688 TRAV14-3 :251 TRAJ22 : 257 Mode :character
TRAV14D-3/DV8 :242 TRAJ37 : 156
TRAV14N-1_14D-1:604 TRAJ34 : 141
TRAV14N-2_14D-2:235 TRAJ40 : 104
TRAV14N-3 :171 (Other):1376
dna qual pep unproductive
Length:2684 Length:2684 Length:2684 Mode :logical
Class :character Class :character Class :character FALSE:2130
Mode :character Mode :character Mode :character TRUE :554
NA's :0
Identify unique clonotypes, count their sequences in the libraries A
, B
and C
, and store the result as a table arbitrarly named abc
.
> abc <- clonotype_table(c('A','B','C'))
> head(abc)
A B C
TRAV14-1 AAASSGSWQLI TRAJ22 1 0 0
TRAV14-1 AACNNRIF TRAJ31 1 0 0
TRAV14-1 AAGAKLT TRAJ39 3 0 0
TRAV14-1 AAGGSWQLI TRAJ22 1 0 0
TRAV14-1 AAGTNTGKLT TRAJ27 1 0 0
TRAV14-1 AAHDTNAYKVI TRAJ30 1 0 0
> summary(abc)
A B C
Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000
Median : 0.0000 Median : 0.0000 Median : 0.0000
Mean : 0.7599 Mean : 0.6606 Mean : 0.5018
3rd Qu.: 1.0000 3rd Qu.: 1.0000 3rd Qu.: 0.0000
Max. :18.0000 Max. :22.0000 Max. :121.0000
The summary shows that the most frequent clonotype is in C
. Using R
index vectors, we can see that its CDR3 sequence is AASDSNNRIF and that it was not found in the other libraries.
> abc[C == 121,]
A B C
TRAV14N-1_14D-1 AASDSNNRIF TRAJ31 0 0 121
The clonotype_table
function can also produce a count table for and combination of V, CDR3 or J segments.
> clonotype_table(c('A','B','C'), "V")
A B C
TRAV14-1 239 493 0
TRAV14-2 131 61 0
TRAV14-3 79 9 113
TRAV14D-3/DV8 140 50 4
TRAV14N-1_14D-1 78 24 388
TRAV14N-2_14D-2 81 61 49
TRAV14N-3 94 34 2
> head(clonotype_table(c('A','B','C'), c("V","J")))
A B C
TRAV14-1 TRAJ11 1 1 0
TRAV14-1 TRAJ12 2 2 0
TRAV14-1 TRAJ13 2 1 0
TRAV14-1 TRAJ15 10 11 0
TRAV14-1 TRAJ16 4 3 0
TRAV14-1 TRAJ18 1 21 0