Tips

Collecting information on the FASTQ files.

for LIB in $(ls fastq/*.fastq) ; do echo -ne "$(basename $LIB .fastq)\t" ; infoseq --filter --noheading $LIB | wc -l ; done

Keeping one of the FASTQ sequence file or BAM alignment files.

The BAM files (see below) contain the full sequence information, so in theory, the FASTQ files can be discarded. The command ‘samtools bam2fq’ can produce FASTQ files from the BAM files, but beware that some sequences will be duplicated, as they contained more than one V segment, and therefore appear more than once in the BAM file.

Count the number of sequences per library using the alignment files.

for BAM in extraction_files/*.bam ; do echo -ne "$(basename $BAM .bam)\t" ; samtools view $BAM | cut -f1 | sort -u | wc -l ; done

Before CDR3 extraction, count the number of V segments detected

If the sequence file name was A.fastq.

samtools idxstats extraction_files/A.bam | awk '{OFS="\t"} {if ($3 > 0) print $1,$3}'

Count the number of clonotypes (one per line) in each clonotype file.

for LIB in $(ls clonotypes/*.tsv) ; do echo -ne "$(basename $LIB .tsv)\t" ; wc -l $LIB | cut -f1 -d' '; done

Reverse-complement and append mate-pair sequence to the first read

If the first read is RCms10001_R1.fastq and the mate pair is RCms10001_R2.fastq

revseq fastq-sanger::RCms10001_R2.fastq fastq-sanger::stdout | sed -e '1~4s/.*//' -e '3~4s/.*//' | paste -d '' RCms10001_R1.fastq - > RCms10001.fastq

revseq is an EMBOSS command.