ADAPTERREMOVAL(1) - Linux man page online | User commands

Remove adapters from sequences in either single end or paired end experiments.

2017-07-17
ADAPTERREMOVAL(1) User Contributed Perl Documentation ADAPTERREMOVAL(1)

NAME

AdapterRemoval - Remove adapters from sequences in either single end or paired end experiments

SYNOPSIS

AdapterRemoval --file1 filenames [--file2 filenames] [--interleaved] [--interleaved-input] [--interleaved-output] [--combined-output] [--basename filename] [--identify-adapters] [--trimns] [--maxns max] [--trimqualities] [--trimwindows length] [--minquality minimum] [--collapse] [--mm mismatchrate] [--minlength len] [--minalignmentlength len] [--qualitybase base] [--qualitybase-output base] [--shift num] [--adapter1 sequence] [--adapter2 sequence] [--adapter-list filename] [--barcode-list filename] [--barcode-mm num] [--barcode-mm-r1 num] [--barcode-mm-r2 num] [--demultiplex-only] [--output1 filename] [--output2 filename] [--singleton filename] [--outputcollapsed filename] [--outputcollapsedtruncated filename] [--discarded filename] [--settings filename] [--seed seed] [--gzip] [--gzip-level level] [--threads num] [--version] [--help]

DESCRIPTION

AdapterRemoval reads either one FASTQ file (single ended mode) or two FASTQ files (paired ended mode). It removes the residual adapter sequence from the reads, and can optionally trim Ns and low-quality bases (using the quality string) and collapse overlapping paired ended mates into one read. Reads are discarded if the remaining genomic part is too short, or if the read contains more than a user-specified number of ambiguous nucleotides ('N'). These operations may be combined with simultaneous demultiplexing.

Alternatively, AdapterRemoval may attempt to reconstruct consensus adapter sequences from paired-ended data, in order to allow the identification of the adapter sequences originally used, and thereby ensure proper trimming of these reads.

The reads and adapters are transformed to upper case for comparison. It is assumed that the letter 'N' is used for an unknown nucleotide; if the program encounters a '.' in a sequence, it is treated as (and translated into) an N.

The program tries to check for invalid input and/or nonsensical combinations of parameters, but please report strange behaviour, bugs and such to @gmail.com

If you use this program, please cite the paper:

Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 12;9(1):88 http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2

OPTIONS

--file1 filename [...]
    Read FASTQ reads from one or more files. These contain either the single ended (SE) reads or, if paired ended, the mate 1 reads. If running in paired end mode, both --file1 and --file2 must be set. The files may optionally be gzip or bzip2 compressed.

--file2 filename [...]
    Read one or more FASTQ files containing mate 2 reads for a paired end run. If specified, --file1 must also be set. The files may optionally be gzip or bzip2 compressed.

--interleaved
    Enables --interleaved-input and --interleaved-output.

--interleaved-input
    If set, input is expected to be a single FASTQ file specified using --file1, in which pairs of paired-end reads are listed one after the other (read1/1, read1/2, read2/1, read2/2, etc.).

--interleaved-output
    If set, and AdapterRemoval is processing paired-end reads, retained pairs of reads are written to a single FASTQ file, one pair after the other (read1/1, read1/2, read2/1, read2/2, etc.). By default, this file is named basename.paired.truncated, but this may be changed using the --output1 option.

--combined-output
    If set, all reads are written to the same file(s), specified by --output1 and --output2. Each read is further marked by either a "PASSED" or a "FAILED" flag, and any read that has been FAILED (including the mate for collapsed reads) is replaced with a single 'N' with Phred score 0. This option can be combined with --interleaved / --interleaved-output to write all reads to a single output file specified with --output1.

--basename filename
    Determines the default filename for output files, unless overridden using the specific output file settings. For single-ended mode, the following filenames are used: basename.truncated, basename.discarded, and basename.settings. In paired end mode, the following filenames are used: basename.pair1.truncated, basename.pair2.truncated, basename.singleton.truncated, basename.discarded, and basename.settings.

    If collapsing of reads is enabled for paired ended mode, the following filenames are also used: basename.collapsed, and basename.collapsed.truncated. The default basename is your_output. If gzip compression is enabled, the extension ".gz" is added to all files but the basename.settings file, while the extension ".bz2" is used if bzip2 compression is enabled.

--identify-adapters
    For paired ended reads only. In this mode, AdapterRemoval will attempt to reconstruct the adapter sequences used for a set of paired ended reads, by locating fully overlapping read-pairs, and generating a consensus sequence from the bases identified as adapter sequence. The minimum overlap is controlled by --minalignmentlength. The values passed to the --adapter1 and --adapter2 command-line options are used for visual comparison with the consensus sequence, but are otherwise not used in the consensus building.

--trimns
    Remove stretches of Ns from the output reads in both the 5' and 3' end. If quality trimming is also enabled, stretches of mixed low-quality bases and/or Ns are trimmed.

--maxns max
    If a read has more than max Ns after trimming, it is discarded (by default, this filter is not used).

--trimqualities
    Remove consecutive stretches of low quality bases (threshold set by --minquality) from both the 5' and 3' end of the reads. All bases with minquality or lower are trimmed. If trimming of Ns is also enabled, stretches of mixed low-quality bases and/or Ns are trimmed.

--trimwindows length
    Remove low quality bases using a sliding window approach inspired by sickle:

    1. The new 5' end is determined by locating the first window where both the average quality and the quality of the first base in the window are greater than minquality.

    2. The new 3' end is located by sliding the first window right, until the average quality becomes less than or equal to minquality. The new 3' end is placed at the last base in that window where the quality is greater than or equal to minquality.

    3. If no 5' position could be determined, the read is discarded.

    The value of length may be a number greater than or equal to 1, in which case that number (rounded down to the nearest whole number) is used as the window length, or it may be a value greater than or equal to zero but less than 1. In the latter case, that number is multiplied by the length of each read to determine the window length. For example, a --trimwindows value of 0.1 and a read length of 100 would result in 10 bp windows. If the resulting window length is zero or is greater than the current read length, then the read length is used instead.

--minquality minimum
    Set the threshold for trimming low quality bases. Default is 2. The minimum can be set with or without the Phred quality base.

--collapse
    In paired-end mode, if the two mates overlap, collapse the two reads into one read by merging the two and recalculating the quality scores. In single-end mode, this instead attempts to identify templates for which the entire sequence is available. In both cases, complete "collapsed" reads are written with a 'M_' name prefix, and "collapsed" reads which are trimmed due to quality settings are written with a 'MT_' name prefix. The overlap needs to be at least --minalignmentlength nucleotides, with a maximum number of mismatches determined by --mm.

--mm mismatchrate
    The fraction of mismatches allowed in the aligned region. If 0 < mismatchrate < 1, the rate is used directly. If mismatchrate > 1, the rate is set to 1/mismatchrate. The default setting is 3, corresponding to a maximum mismatch rate of 1/3.

--minlength len
    The minimum length required after trimming and adapter removal. Reads shorter than len are discarded. Default is 15 nucleotides.

--minalignmentlength len
    The minimum overlap between mate 1 and mate 2 before the reads are collapsed into one, when collapsing paired end reads, or when attempting to identify complete template sequences in single-end mode. Default is 11 nucleotides.
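
The numeric conventions described above for --trimwindows and --mm can be illustrated with a small Python sketch. This is not AdapterRemoval's code: the function names are hypothetical, truncating the fractional window length is an assumption (the text does not say how the product is rounded), and values the text does not cover are simply rejected.

```python
def window_length(value: float, read_length: int) -> int:
    """Window length implied by a --trimwindows value for one read."""
    if value < 0:
        raise ValueError("--trimwindows value must be >= 0")
    if value >= 1:
        # Values >= 1 are rounded down to the nearest whole number.
        length = int(value)
    else:
        # Values in [0, 1) are a fraction of the read length.
        # Assumption: the product is truncated to a whole number.
        length = int(value * read_length)
    # A zero or over-long window falls back to the read length.
    if length == 0 or length > read_length:
        length = read_length
    return length


def mismatch_rate(mm: float = 3) -> float:
    """Maximum mismatch rate implied by a --mm value (default 3)."""
    if 0 < mm < 1:
        return mm          # used directly
    if mm > 1:
        return 1.0 / mm    # e.g. 3 corresponds to 1/3
    # The text does not describe values <= 0 or exactly 1.
    raise ValueError("--mm value not covered by this sketch")


if __name__ == "__main__":
    print(window_length(0.1, 100))   # 10 bp windows, as in the text
    print(mismatch_rate())           # default of 3, i.e. 1/3
```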

--qualitybase base
    The base of the quality score - either '64' for Phred+64 (i.e., Illumina 1.3+ and 1.5+) or '33' for Phred+33 (Illumina 1.8+). In addition, the value 'solexa' may be used to specify reads with Solexa encoded scores. Default is 33.

--qualitybase-output base
    The base of the quality score for reads written by AdapterRemoval - either '64' for Phred+64 (i.e., Illumina 1.3+ and 1.5+) or '33' for Phred+33 (Illumina 1.8+). In addition, the value 'solexa' may be used to specify reads with Solexa encoded scores. However, note that quality scores are represented using PHRED scores internally, and conversion to and from Solexa scores therefore results in a loss of information. The default corresponds to the value given for --qualitybase.

--shift num
    To allow for missing bases in the 5' end of the read, the program can let the alignment slip num bases in the 5' end. This corresponds to starting the alignment at most num nucleotides into read2 (for paired end) or the adapter (for single end). The default shift value is 2.

--adapter1 sequence
--adapter2 sequence
    Specify the adapter sequences that you wish to trim. The Adapter #2 sequence is only used when trimming paired-ended data. The Adapter #1 and Adapter #2 sequences are expected to be found in the mate 1 and the mate 2 reads respectively, while ignoring any difference in case and treating Ns as wildcards.

    The default sequences are

        Adapter #1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
        Adapter #2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

    Assuming these were the adapters used to generate our data, we should therefore see these in the FASTQ files:

        $ grep -i "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC......ATCTCGTATGCCGTCTTCTGCTTG" file1.fq
        AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAACAAGAAT
        CTGGAGTTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAA
        GGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGAATCTCGTATGCCGTCTTCTGCTTGCAAATTGAAAACAC
        ...

        $ grep -i "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT" file2.fq
        CAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTCAAAAAAAGAAAAACATCTTG
        GAACTCCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTCAAAAAAAATAGA
        GAACTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTCAAAAACATAAGACCTA
        ...

    Note that --adapter1 and --adapter2 replace the --pcr[12] options of AdapterRemoval v1.x, for which the --pcr2 sequence was expected to be reverse complemented compared to --adapter2. Using the --pcr[12] options is not recommended!

--adapter-list filename
    Read one or more adapter sequences from a table. The first two columns (separated by whitespace) of each line in the file are expected to correspond to values passed to --adapter1 and --adapter2. In single ended mode, only column one is required. Lines starting with '#' are ignored. When multiple adapter sequences or sequence pairs are specified, AdapterRemoval will try each adapter (pair) listed in the table, and select the best aligning adapters for each read processed.

--barcode-list filename
    Read a table of one or two fixed-length barcodes and perform demultiplexing of single or double indexed reads.

    The table is expected to contain 2 or 3 columns, the first of which represents the name of a given sample, and the second and third of which represent the mate 1 and (optionally) the mate 2 barcode sequence:

        $ cat barcodes.txt
        sample_1 ATGCGGA TGAATCT
        sample_2 ATGGATT ATAGTGA
        sample_7 CAAAACT TCGCTGC

    Results are written to ${basename}.${sample_name}.*, using the default names for other output files. A settings file with statistics is written for each sample at ${basename}.${sample_name}.settings, as is a settings file containing the demultiplexing statistics, at ${basename}.settings.

    When demultiplexing is used, the barcode identified for a given read is automatically added to the adapter sequence, in order to ensure that overlapping reads are correctly trimmed. The .settings file represents this by showing the (reverse complemented) barcode sequence added to the start of the --adapter1 and --adapter2 sequences, followed by an underscore (shown here for barcode pair ATGCGGA / TGAATCT):

        [Adapter sequences]
        Adapter1[0]: AGATTCA_AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
        Adapter2[0]: TCCGCAT_AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

    Note that the sequence added to each adapter is the reverse complement of the barcode sequence of the other mate, as this sequence is expected to be found immediately before the adapter sequence.

--barcode-mm num
    The maximum number of mismatches allowed for barcodes, when counting mismatches in both the mate 1 and mate 2 barcodes. In conjunction with --barcode-mm-r1 and --barcode-mm-r2, this allows fine-grained control over the barcode comparisons. If not set, this value is set to the sum of --barcode-mm-r1 and --barcode-mm-r2. For example, to allow one mismatch in either the mate 1 or the mate 2 barcode, one might specify --barcode-mm 1; to allow a mismatch in the mate 1 and / or the mate 2 barcode, one might specify --barcode-mm 2 --barcode-mm-r1 1 --barcode-mm-r2 1, and so on.
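
The barcode-prefixed adapter strings shown in the [Adapter sequences] excerpt under --barcode-list can be reproduced with a short Python sketch. The helper names are hypothetical and this is only an illustration of the rule stated in the text (each adapter is prefixed with the reverse complement of the other mate's barcode), not the program's own code.

```python
# Default adapter sequences as listed under --adapter1 / --adapter2.
ADAPTER1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG"
ADAPTER2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    """Reverse complement of a DNA sequence (upper-cased first)."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def barcoded_adapters(barcode1: str, barcode2: str,
                      adapter1: str = ADAPTER1,
                      adapter2: str = ADAPTER2) -> tuple:
    """Adapter strings in the style of the .settings file.

    Adapter 1 (found in mate 1) is preceded by the reverse complement
    of the mate 2 barcode, and vice versa.
    """
    return (reverse_complement(barcode2) + "_" + adapter1,
            reverse_complement(barcode1) + "_" + adapter2)


if __name__ == "__main__":
    first, second = barcoded_adapters("ATGCGGA", "TGAATCT")
    print("Adapter1[0]:", first)
    print("Adapter2[0]:", second)
```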

--barcode-mm-r1 num
    The maximum number of mismatches allowed in the mate 1 barcode; if not set, this number is equal to the value of --barcode-mm. This number cannot exceed the value specified for --barcode-mm.

--barcode-mm-r2 num
    The maximum number of mismatches allowed in the mate 2 barcode; if not set, this number is equal to the value of --barcode-mm. This number cannot exceed the value specified for --barcode-mm.

--demultiplex-only
    Only carry out demultiplexing, using the list of barcodes supplied using --barcode-list. Note that trimming and filtering options do not apply to this mode of operation.

--output1 file
--output2 file
--singleton file
--outputcollapsed file
--outputcollapsedtruncated file
--discarded file
--settings file
    Instead of using the default behaviour where the program automatically generates the files needed, you can specify where each type of output is directed. These can be files, pipes etc., thus making it possible to easily zip the output on the fly. Default files are still generated if nothing else is specified.

    The types of output in single end mode are:

    output1
        Contains the trimmed reads.

    The types of output in paired end mode are:

    output1
        Contains trimmed mate1 reads.

    output2
        Contains trimmed mate2 reads.

    singleton
        Contains all reads where the other mate in a pair is discarded.

    outputcollapsed
        Contains pairs that overlap and are collapsed into a single read (if --collapse is used). The reads are renamed with an @M_ prefix.

    outputcollapsedtruncated
        Contains pairs that overlap and are collapsed into a single read (if --collapse is used) and have further been trimmed due to Ns and/or low quality nucleotides in the 5' or 3' end. The reads are renamed with an @MT_ prefix.

    The types of output in both single end and paired end mode are:

    discarded
        Contains all reads that are discarded by the program.

    settings
        Contains information on the parameters used in the run as well as overall statistics on the reads after trimming, such as average length.

--seed seed
    When collapsing reads, at positions where the two reads differ and the qualities of the bases are identical, AdapterRemoval will select a random base. This option specifies the seed used for the random number generator used by AdapterRemoval. This value is also written to the settings file. Note that setting the seed is not reliable in multithreaded mode, since the order of operations is non-deterministic.

--gzip
    If set, all FASTQ files written by AdapterRemoval will be gzip compressed using the compression level specified using --gzip-level. The extension ".gz" is added to files for which no filename was given on the command-line.

--gzip-level level
    Determines the compression level used when gzip'ing FASTQ files. Must be a value in the range 0 to 9, with 0 disabling compression and 9 being the best compression. Defaults to 6.

--bzip2
    If set, all FASTQ files written by AdapterRemoval will be bzip2 compressed using the compression level specified using --bzip2-level. The extension ".bz2" is added to files for which no filename was given on the command-line.

--bzip2-level level
    Determines the compression level used when bzip2'ing FASTQ files. Must be a value in the range 1 to 9, with 9 being the best compression. Defaults to 9.

--threads num
    Maximum number of threads to use for the current run; note that file IO is single-threaded, regardless of the number of threads specified.

--version
    Output the version of the program.

--help
    Output the summary of available command-line options, including default values and/or values specified on the command-line.

EXAMPLE: Single end experiment

The following command removes adapters from the file reads_1.fq, trims both Ns and low quality bases from the reads, and gzip compresses the resulting files. The --basename option is used to specify the prefix for output files.

    $ AdapterRemoval --file1 reads_1.fq --basename output_single --trimns --trimqualities --gzip

Since --gzip and --basename are specified, the trimmed FASTQ reads are written to output_single.truncated.gz, the discarded FASTQ reads are written to output_single.discarded.gz, and settings and summary statistics are written to output_single.settings.

Note that by default, AdapterRemoval does not require a minimum number of bases overlapping with the adapter sequence before reads are trimmed. This may result in an excess of very short (1 - 3 bp) 3' fragments being falsely identified as adapter sequences, and trimmed. This behavior may be changed using the --minadapteroverlap option, which allows the specification of a minimum number of bases (excluding Ns) that must be aligned to carry out trimming. For example, use --minadapteroverlap 3 to require an overlap of at least 3 bp.

EXAMPLE: Paired end experiment

The following command removes adapters from paired-end reads, where the mate 1 and mate 2 reads are kept in files reads_1.fq and reads_2.fq, respectively. The reads are trimmed for both Ns and low quality bases, and overlapping reads (at least 11 nucleotides, per default) are merged (collapsed):

    $ AdapterRemoval --file1 reads_1.fq --file2 reads_2.fq --basename output_paired --trimns --trimqualities --collapse

This command generates the files output_paired.pair1.truncated and output_paired.pair2.truncated, which contain trimmed pairs of reads which were not collapsed, output_paired.singleton.truncated containing reads where one mate was discarded, output_paired.collapsed containing merged reads, and output_paired.collapsed.truncated containing merged reads that have been trimmed due to the --trimns or --trimqualities options. Finally, the output_paired.discarded and output_paired.settings files correspond to those of the single-end run.

EXAMPLE: Interleaved FASTQ reads

AdapterRemoval is able to read and write paired-end reads stored in a single, so-called interleaved FASTQ file (one pair at a time, first mate 1, then mate 2). This is accomplished by specifying the location of the file using --file1 and *also* setting the --interleaved command-line option:

    $ AdapterRemoval --interleaved --file1 interleaved.fq --basename output_interleaved

Other than taking just a single input file, this mode operates almost exactly like paired end trimming (as described above); the mode differs only in that paired reads are not written to a 'pair1' and a 'pair2' file, but are instead written to a single, interleaved file, named 'paired'. The location of this file is controlled using the --output1 option. Enabling either reading or writing of interleaved FASTQ files, but not both, can be accomplished by specifying either of the --interleaved-input and --interleaved-output options, both of which are enabled by the --interleaved option.

EXAMPLE: Different quality score encodings

By default, AdapterRemoval expects the quality scores in FASTQ reads to be Phred+33 encoded, meaning that the error probabilities are encoded as (char)('!' - 10 * log10(p)). Most data will be encoded using Phred+33, but Phred+64 and 'Solexa' encoded quality scores are also supported. These are selected by specifying the --qualitybase command-line option (specifying either '33', '64', or 'solexa'):

    $ AdapterRemoval --qualitybase 64 --file1 reads_q64.fq --basename phred_64_encoded

By default, reads are written using the *same* encoding as the input. If a different encoding is desired, this may be accomplished using the --qualitybase-output option:

    $ AdapterRemoval --qualitybase 64 --qualitybase-output 33 --file1 reads_q64.fq --basename phred_33_encoded

Note furthermore that AdapterRemoval by default only expects quality scores in the range 0 - 41 (or -5 to 41 in the case of Solexa encoded scores).
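
The Phred+33 formula quoted above, and the effect of writing Phred+64 input as Phred+33, can be sketched in Python as follows. The function names are hypothetical, Solexa scores are deliberately left out, and rounding the score to the nearest integer is an assumption of this sketch rather than something the text specifies.

```python
import math


def encode_error_probability(p: float, offset: int = 33) -> str:
    """Character encoding an error probability p, as chr(offset - 10*log10(p)).

    With offset 33 this matches (char)('!' - 10 * log10(p)), since '!' is 33.
    Assumption: the score is rounded to the nearest integer.
    """
    score = round(-10 * math.log10(p))
    return chr(offset + score)


def reencode(qualities: str, offset_in: int, offset_out: int) -> str:
    """Shift a quality string from one Phred offset to another."""
    return "".join(chr(ord(c) - offset_in + offset_out) for c in qualities)


if __name__ == "__main__":
    # An error probability of 1 in 1000 is Phred score 30.
    print(encode_error_probability(0.001))        # Phred+33 character
    print(encode_error_probability(0.001, 64))    # Phred+64 character
    # Phred+64 input rewritten as Phred+33.
    print(reencode("hh^", 64, 33))
```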

If input data using a different maximum quality score is to be processed, or if the desired maximum quality score of collapsed reads is greater than 41, then this limit may be increased using the --qualitymax option:

    $ AdapterRemoval --qualitymax 50 --file1 reads_1.fq --file2 reads_2.fq --collapse --basename collapsed_q50

For a detailed overview of Phred encoding schemes currently and previously in use, see e.g. the Wikipedia article on the subject: https://en.wikipedia.org/wiki/FASTQ_format#Encoding

EXAMPLE: Paired end reads containing multiple, distinct adapter pairs

It is possible to trim data that contains multiple adapter pairs, by providing a one or two-column table containing possible adapter combinations (for single-end and paired-end trimming, respectively; see e.g. examples/adapters.txt):

    $ cat adapters.txt
    AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCACCTAATCTCGTATGCCGTCTTCTGCTTG AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
    AAACTTGCTCTGTGCCCGCTCCGTATGTCACAACAGTGCGTGTATCACCTCAATGCAGGACTCA GATCGGGAGTAATTTGGAGGCAGTAGTTCGTCGAAACTCGGAGCGTCTTTAGCAGGAG
    CTAATTTGCCGTAGCGACGTACTTCAGCCTCCAGGAATTGGACCCTTACGCACACGCATTCATG TACCGTGAAAGGTGCGCTTAGTGGCATATGCGTTAAGAGCTAGGTAACGGTCTGGAGG
    GTTCATACGACGACGACCAATGGCACACTTATCCGGTACTTGCGTTTCAATGCGCATGCCCCAT TAAGAAACTCGGAGTTTGGCCTGCGAGGTAGCTTGGGTGTTATGAAGAACGGCATGCG
    CCATGCCCCGAAGATTCCTATACCCTTAAGGTCGCAATTGTTCGAGTAAGCTGTACGCGCCCAT GTTGCATTGACCCGAAGGGCTCGATGTTTAGGGAGGTCAGAAGTTGAGCGGGTTCAAA

This table is then specified using the --adapter-list option:

    $ AdapterRemoval --file1 reads_1.fq --file2 reads_2.fq --basename output_multi --trimns --trimqualities --collapse --adapter-list adapters.txt

The resulting .settings file contains an overview of how frequently each adapter (pair) was used. Note that in the case of paired-end adapters, AdapterRemoval considers only the combinations of adapters specified in the table, one combination per row.
For single-end trimming, only the first column of the table file is required, and the list may therefore take the form of a file containing one sequence per line.

EXAMPLE: Identifying adapter sequences from paired-ended reads

If we did not know the adapter sequences for paired-end reads, AdapterRemoval may be used to generate a consensus adapter sequence based on fragments identified as belonging to the adapters through pairwise alignments of the reads, provided that the data set contains only a single adapter sequence (not counting differences in index sequences).

In the following example, the identified adapters correspond to the default adapter sequences with a poly-A tail resulting from sequencing past the end of the insert + templates. It is not necessary to specify this tail when using the --adapter1 or --adapter2 command-line options. The characters shown under each of the consensus sequences represent the phred-encoded fraction of bases identical to the consensus base, with adapter 1 containing the index CACCTA:

    $ AdapterRemoval --identify-adapters --file1 reads_1.fq --file2 reads_2.fq

    Attemping to identify adapter sequences ...
    Processed a total of 1,000 reads in 0.0s; 129,000 reads per second on average ...
       Found 394 overlapping pairs ...
       Of which 119 contained adapter sequence(s) ...

    Printing adapter sequences, including poly-A tails:
      --adapter1:  AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
                   ||||||||||||||||||||||||||||||||||******||||||||||||||||||||||||
       Consensus:  AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCACCTAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAAAAAA
         Quality:  55200522544444/4411330333330222222/1.1.1.1111100-00000///..+....--*-)),,+++++++**(('%%%$

        Top 5 most common 9-bp 5'-kmers:
                1: AGATCGGAA = 96.00% (96)
                2: AGATGGGAA =  1.00% (1)
                3: AGCTCGGAA =  1.00% (1)
                4: AGAGCGAAA =  1.00% (1)
                5: AGATCGGGA =  1.00% (1)

      --adapter2:  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
                   ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
       Consensus:  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
         Quality:  525555555144141441430333303.2/22-2/-1..11111110--00000///..+....--*-),,,+++++++**(%'%%%$

        Top 5 most common 9-bp 5'-kmers:
                1: AGATCGGAA = 100.00% (100)

No files are generated from running the adapter identification step.

The consensus sequences inferred are compared to those specified using the --adapter1 and --adapter2 command-line options, or with the default values for these if no values have been given (as in this case). Pipes (|) indicate matches between the provided sequences and the consensus sequence, and "*" indicates the presence of unspecified bases (Ns).

EXAMPLE: Demultiplexing of paired end reads

As of version 2.1, AdapterRemoval supports simultaneous demultiplexing and adapter trimming; demultiplexing is carried out using a simple comparison between the specified barcode sequences and the first N bases of the reads, corresponding to the length of the barcodes. Reads identified as containing a specific barcode or pair of barcodes are then trimmed using adapter sequences including these barcodes.
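
The "simple comparison" between a barcode and the first N bases of a read described above might look like the following Python sketch for single-indexed reads. The function names are hypothetical; counting missing bases in short reads as mismatches and rejecting ties between samples are assumptions of this sketch, not documented behaviour of AdapterRemoval.

```python
from typing import Dict, Optional


def count_mismatches(read: str, barcode: str) -> int:
    """Mismatches between a barcode and the first len(barcode) bases of a read."""
    prefix = read[:len(barcode)].upper()
    mismatches = sum(a != b for a, b in zip(prefix, barcode.upper()))
    # Assumption: bases missing from a too-short read count as mismatches.
    return mismatches + (len(barcode) - len(prefix))


def assign_sample(read: str, barcodes: Dict[str, str],
                  max_mismatches: int = 0) -> Optional[str]:
    """Name of the sample whose barcode best matches the read, if any."""
    scored = sorted((count_mismatches(read, barcode), name)
                    for name, barcode in barcodes.items())
    best_score, best_name = scored[0]
    if best_score > max_mismatches:
        return None
    # Assumption: an ambiguous best match is left unassigned.
    if len(scored) > 1 and scored[1][0] == best_score:
        return None
    return best_name


if __name__ == "__main__":
    # Mate 1 barcodes taken from the barcodes.txt example.
    mate1_barcodes = {"sample_1": "ATGCGGA",
                      "sample_2": "ATGGATT",
                      "sample_7": "CAAAACT"}
    print(assign_sample("ATGGATTACGTACGT", mate1_barcodes))
    print(assign_sample("ATGCGGTACGTACGT", mate1_barcodes, max_mismatches=1))
```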

Demultiplexing is enabled by creating a table of barcodes, the first column of which specifies the sample name (using characters [a-zA-Z0-9_]) and the second and (optional) third columns specify the mate 1 and mate 2 barcode sequences. For example, a table of barcodes from a double-indexed run might be as follows (see examples/barcodes.txt):

    $ cat barcodes.txt
    sample_1 ATGCGGA TGAATCT
    sample_2 ATGGATT ATAGTGA
    sample_7 CAAAACT TCGCTGC

In the case of single-end reads, only the first two columns are required. AdapterRemoval is invoked with the --barcode-list option, specifying the path to this table:

    $ AdapterRemoval --file1 demux_1.fq --file2 demux_2.fq --basename output_demux --barcode-list barcodes.txt

This generates a set of output files for each sample specified in the barcode table, using the basename (--basename) as the prefix, followed by a dot and the sample name, followed by a dot and the default name for a given file type. For example, the output files for sample_2 would be

    output_demux.sample_2.discarded
    output_demux.sample_2.pair1.truncated
    output_demux.sample_2.pair2.truncated
    output_demux.sample_2.settings
    output_demux.sample_2.singleton.truncated

The settings file generated for each sample summarizes the reads for that sample only; in addition, a basename.settings file is generated which summarizes the number and proportion of reads identified as belonging to each sample.

The maximum number of mismatches allowed when comparing barcodes is controlled using the options --barcode-mm, --barcode-mm-r1, and --barcode-mm-r2, which specify the maximum number of mismatches in total, and the maximum number of mismatches for the mate 1 and mate 2 barcodes respectively. Thus, if mm_1(i) and mm_2(i) represent the number of mismatches observed for barcode-pair i for a given pair of reads, these options require that

    1. mm_1(i) <= --barcode-mm-r1
    2. mm_2(i) <= --barcode-mm-r2
    3. mm_1(i) + mm_2(i) <= --barcode-mm

As of version 2.2, AdapterRemoval can furthermore be used to demultiplex reads without carrying out other forms of read trimming. This is accomplished by specifying the --demultiplex-only option:

    $ AdapterRemoval --file1 demux_1.fq --file2 demux_2.fq --basename output_only_demux --barcode-list barcodes.txt --demultiplex-only

Trimming and filtering related options do not apply to this mode ("TRIMMING SETTINGS" when viewing 'AdapterRemoval --help'), but compression (--gzip, --bzip2), multi-threading (--threads), interleaving (--interleaved, etc.) and other such options may be used in conjunction with --demultiplex-only.
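
The three mismatch conditions listed above, together with the defaults described for --barcode-mm, --barcode-mm-r1 and --barcode-mm-r2, can be written out as a small Python sketch. The function names are hypothetical, and treating an unset per-mate limit as 0 when --barcode-mm itself is unset is an assumption made here to keep the defaults well defined.

```python
from typing import Optional, Tuple


def resolve_limits(mm: Optional[int] = None,
                   mm_r1: Optional[int] = None,
                   mm_r2: Optional[int] = None) -> Tuple[int, int, int]:
    """Return (total, mate 1, mate 2) mismatch limits after applying defaults."""
    if mm is None:
        # --barcode-mm defaults to the sum of the per-mate limits.
        # Assumption: an unset per-mate limit counts as 0 in that sum.
        mm_r1 = mm_r1 or 0
        mm_r2 = mm_r2 or 0
        mm = mm_r1 + mm_r2
    else:
        # Unset per-mate limits default to the value of --barcode-mm.
        mm_r1 = mm if mm_r1 is None else mm_r1
        mm_r2 = mm if mm_r2 is None else mm_r2
    if mm_r1 > mm or mm_r2 > mm:
        raise ValueError("per-mate limits cannot exceed --barcode-mm")
    return mm, mm_r1, mm_r2


def barcodes_accepted(mm_1: int, mm_2: int,
                      limits: Tuple[int, int, int]) -> bool:
    """Apply conditions 1 to 3 to the observed mismatch counts."""
    total, limit_r1, limit_r2 = limits
    return mm_1 <= limit_r1 and mm_2 <= limit_r2 and mm_1 + mm_2 <= total


if __name__ == "__main__":
    either = resolve_limits(mm=1)                   # --barcode-mm 1
    both = resolve_limits(mm=2, mm_r1=1, mm_r2=1)   # one mismatch per mate
    print(barcodes_accepted(1, 1, either))
    print(barcodes_accepted(1, 1, both))
```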

EXIT STATUS

0 if everything worked as planned, a non-zero value otherwise.

REPORTING BUGS

Report bugs to Mikkel Schubert <@gmail.com>. Your bug report should always include:

· The output of AdapterRemoval --version. If you are not running the latest released version you should specify why you believe the problem is not fixed in that version.

· A complete example that others can run that shows the problem.

AUTHOR

Copyright (C) 2011 Stinus Lindgreen <@binf.ku.dk>. Parts of the manual were written by Ole Tange <@binf.ku.dk>. Parts of the manual were written by Mikkel Schubert <@gmail.com>.

LICENSE

Copyright (C) 2011 Stinus Lindgreen <@binf.ku.dk>. Copyright (C) 2014 Mikkel Schubert <@gmail.com>. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or at your option any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

SEE ALSO

perl v5.24.1 2017-07-17 ADAPTERREMOVAL(1)