Tips
Count the number of sequences in each FASTQ file.
for LIB in fastq/*.fastq ; do printf '%s\t' "$(basename "$LIB" .fastq)" ; infoseq --filter --noheading "$LIB" | wc -l ; done
Keeping only the FASTQ sequence files or the BAM alignment files.
The BAM files (see below) contain the full sequence information, so in theory the FASTQ files can be discarded. The command ‘samtools bam2fq’ can regenerate FASTQ files from the BAM files, but beware that some sequences will be duplicated: a read that contains more than one V segment appears more than once in the BAM file.
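If the FASTQ files are regenerated from the BAM files, the repeated records can be dropped by keeping only the first record seen for each read name. The function below is a minimal sketch of such a filter and is not part of the original workflow. The name dedup_fastq is hypothetical, and the code assumes strict four-line FASTQ records and that repeated copies of a read carry the same name; it keeps whichever copy appears first.

```shell
# Hypothetical helper: keep the first FASTQ record seen for each read name.
# Assumes four-line records (header, sequence, "+", quality) on standard input.
dedup_fastq() {
    awk '
        NR % 4 == 1 {
            name = $1              # header up to the first whitespace, e.g. "@read1"
            keep = !(name in seen) # true only the first time this name appears
            seen[name] = 1
        }
        keep                       # print all four lines of a kept record
    '
}

# Possible usage, with the A.bam file name used elsewhere on this page:
#   samtools bam2fq extraction_files/A.bam | dedup_fastq > A.fastq
```

Comparing the record count before and after filtering gives a quick idea of how many reads matched more than one V segment.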
Count the number of sequences per library using the alignment files.
for BAM in extraction_files/*.bam ; do printf '%s\t' "$(basename "$BAM" .bam)" ; samtools view "$BAM" | cut -f1 | sort -u | wc -l ; done
Before CDR3 extraction, count the number of V segments detected.
If the sequence file name was A.fastq:
samtools idxstats extraction_files/A.bam | awk '{OFS="\t"} {if ($3 > 0) print $1,$3}'
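The command above prints one line per detected V segment, together with its read count. When only the total number of detected V segments is needed, the same idxstats output can be reduced to a single figure. The sketch below assumes the usual four-column idxstats output (reference name, length, mapped reads, unmapped reads); the function name count_detected is hypothetical.

```shell
# Hypothetical helper: count the references (here, V segments) that have at
# least one mapped read, reading "samtools idxstats" output on standard input.
count_detected() {
    awk -F '\t' '$1 != "*" && $3 > 0 { n++ } END { print n + 0 }'
}

# Possible usage:
#   samtools idxstats extraction_files/A.bam | count_detected
```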
Count the number of clonotypes (one per line) in each clonotype file.
for LIB in clonotypes/*.tsv ; do printf '%s\t' "$(basename "$LIB" .tsv)" ; wc -l < "$LIB" ; done
Reverse-complement the mate-pair sequence and append it to the first read.
If the first read is RCms10001_R1.fastq and the mate pair is RCms10001_R2.fastq:
revseq fastq-sanger::RCms10001_R2.fastq fastq-sanger::stdout | sed -e '1~4s/.*//' -e '3~4s/.*//' | paste -d '' RCms10001_R1.fastq - > RCms10001.fastq
revseq is an EMBOSS command.
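Where EMBOSS is not installed, a similar merge can be approximated with awk alone. The function below is a sketch, not the method used above: it reverse-complements each mate sequence, reverses its quality string, and appends both to the matching first-read record. The name append_mate is hypothetical, and the code assumes four-line FASTQ records, the same read order in both files, and sequences made of A, C, G, T and N (any other character is passed through unchanged).

```shell
# Hypothetical helper: append the reverse-complemented mate to each first read.
# $1 = first-read FASTQ, $2 = mate FASTQ (same order, four-line records).
append_mate() {
    awk -v mate="$2" '
        # Reverse a string (used unchanged for the quality line).
        function reverse(s,    i, out) {
            out = ""
            for (i = length(s); i > 0; i--) out = out substr(s, i, 1)
            return out
        }
        # Reverse-complement a sequence; unknown characters are kept unchanged.
        function revcomp(s,    i, c, p, out) {
            out = ""
            for (i = length(s); i > 0; i--) {
                c = substr(s, i, 1)
                p = index("ACGTNacgtn", c)
                out = out (p ? substr("TGCANtgcan", p, 1) : c)
            }
            return out
        }
        {
            # Read the corresponding line of the mate file.
            if ((getline m < mate) <= 0) {
                print "mate file is shorter than first-read file" > "/dev/stderr"
                exit 1
            }
            if (NR % 4 == 2)      print $0 revcomp(m)   # sequence line
            else if (NR % 4 == 0) print $0 reverse(m)   # quality line
            else                  print $0              # header and "+" lines
        }
    ' "$1"
}

# Possible usage with the file names above:
#   append_mate RCms10001_R1.fastq RCms10001_R2.fastq > RCms10001.fastq
```

Reading both files in lockstep avoids the intermediate sed step, at the cost of being slower than a compiled tool on large libraries.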