Specify paired-end data - User Guide to SeqMan NGen

Depending on the workflow and the read technology selected, the Input Sequences screen may allow you to specify paired reads.

To specify paired reads, check the Paired-end data box. This opens the Pair Distance dialog. Enter the pair distance and click OK. The default pair distance of 500 bp is suitable for most projects.

Preparing paired-end reads:

Paired-end reads are typically in two files, with the forward reads in one file and the reverse reads in the other. SeqMan NGen assumes that the two reads of a pair come from opposite ends of the same DNA fragment and were sequenced from the ends of the fragment inward.

To enable SeqMan NGen to identify pairs, a sequence naming convention must systematically distinguish the two reads of a pair from each other while identifying which reads belong together. Forward and reverse sequences must have identical names except for the unique portion that determines the direction of the clone. Expressions for these naming conventions are created using a subset of regular expressions, which use elements of the Grep language. The following rules apply:

Two parallel files must use a standard naming convention (e.g. s_7_1_sequence and s_7_2_sequence).

“Forward” and “reverse” reads must be in exactly the same order in the two files.

Both forward and reverse reads must be present for every pair, including pairs where one of the reads failed or is of very low quality.
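The ordering rules above can be checked before loading data. The following is a minimal Python sketch, not part of SeqMan NGen; it assumes read names that end in /1 and /2, and the function names base_name and first_order_mismatch are hypothetical.

```python
import re

def base_name(read_name):
    """Strip a trailing /1 or /2 mate suffix (assumed Illumina-style naming)."""
    return re.sub(r"/[12]$", "", read_name)

def first_order_mismatch(forward_names, reverse_names):
    """Return the index of the first position where the two files disagree,
    or None if both lists have the same length and the same order."""
    for i, (fwd, rev) in enumerate(zip(forward_names, reverse_names)):
        if base_name(fwd) != base_name(rev):
            return i
    if len(forward_names) != len(reverse_names):
        # One file has extra reads, so at least one mate is missing.
        return min(len(forward_names), len(reverse_names))
    return None

forward = ["read1/1", "read2/1", "read3/1"]
reverse = ["read1/2", "read2/2", "read3/2"]
print(first_order_mismatch(forward, reverse))        # None: files are parallel
print(first_order_mismatch(forward, reverse[::-1]))  # 0: order differs at the first read
```

A result of None suggests the two files are parallel; any integer points at the first read where the order or the pairing breaks.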

As an example, forward and reverse Sanger pair files are named as follows: 01f.abi and 01r.abi, where “01” indicates that they are members of the same pair. The “f” and “r” at the end of each sequence name distinguish the orientation.

In Grep, the naming convention would be written as follows:

Forward convention: (.*)f\..*$

Reverse convention: (.*)r\..*$

For more information on Grep name patterns, see Example regular expressions.
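To see how the forward and reverse conventions above group files into pairs, the two expressions can be tried in Python. This is only an illustrative sketch: Python's re module is not the same engine as the Grep subset used by SeqMan NGen, and the helper names pair_key and pair_names are hypothetical.

```python
import re

# The two conventions from the text, written as Python raw strings.
FORWARD = re.compile(r"(.*)f\..*$")
REVERSE = re.compile(r"(.*)r\..*$")

def pair_key(name):
    """Return (shared_part, direction) for a file name, or None if neither pattern matches."""
    m = FORWARD.match(name)
    if m:
        return m.group(1), "forward"
    m = REVERSE.match(name)
    if m:
        return m.group(1), "reverse"
    return None

def pair_names(names):
    """Group names by the captured shared part and keep complete pairs only."""
    groups = {}
    for name in names:
        result = pair_key(name)
        if result:
            key, direction = result
            groups.setdefault(key, {})[direction] = name
    return {k: (g["forward"], g["reverse"]) for k, g in groups.items() if len(g) == 2}

print(pair_names(["01f.abi", "01r.abi", "02f.abi"]))
# {'01': ('01f.abi', '01r.abi')}
```

The captured group, here “01”, is the part of the name that both members of a pair must share; 02f.abi is left out because it has no reverse partner.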

SeqMan NGen considers paired-end reads to be clonal when the forward and reverse reads of two pairs start at the same positions. In these cases, the reads with the highest scores are retained, while the other reads are ignored.
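One possible reading of the clonal-read rule is sketched below in Python. It is not SeqMan NGen's implementation: the ReadPair record, its fields, and the single score per pair are assumptions made for illustration, and the text does not say how scores are computed.

```python
from collections import namedtuple

# Hypothetical record: one aligned read pair with its start positions and a score.
ReadPair = namedtuple("ReadPair", "name forward_start reverse_start score")

def drop_clonal_pairs(pairs):
    """Keep only the highest-scoring pair for each (forward_start, reverse_start)."""
    best = {}
    for pair in pairs:
        key = (pair.forward_start, pair.reverse_start)
        if key not in best or pair.score > best[key].score:
            best[key] = pair
    return list(best.values())

pairs = [
    ReadPair("a", 100, 600, 30),
    ReadPair("b", 100, 600, 38),  # same start positions as "a", higher score
    ReadPair("c", 150, 640, 35),
]
kept = drop_clonal_pairs(pairs)
print([p.name for p in kept])  # ['b', 'c']
```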

Conventions for Sanger pairs:

Paired-end Sanger reads are typically spread across multiple files, with the forward reads having an “f” or “forward” in the name and the reverse reads having an “r” or “reverse” in the name.

Conventions for Illumina pairs:

Paired-end Illumina reads are typically in two files, or in a small number of files if they come from multiple runs or lanes. These pairs are specified by a naming convention used in the .fasta file comment line.

For de novo assemblies with paired-end reads, SeqMan NGen automatically adds the following information to the script:

setPairSpecifier pairs:
  { {
    forward: "(.*)/1"
    reverse: "(.*)/2"
    min: 0
    max: 750
    key: Illumina
  } }

If reads do not match one of the pair specifiers, or if the forward and reverse specifiers are represented by empty strings (""), the assembler will attempt to match using the whole name of the sequence. If exactly two reads have the same name, they will be considered a match.
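The matching behavior described above, including the whole-name fallback, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the assembler's code: the function match_pairs is hypothetical, and requiring each specifier to match the entire read name is an assumption.

```python
import re
from collections import defaultdict

def match_pairs(names, forward=r"(.*)/1", reverse=r"(.*)/2"):
    """Pair read names using the specifiers, then fall back to whole-name matching."""
    fwd, rev = {}, {}
    unmatched = defaultdict(list)
    for name in names:
        f = re.fullmatch(forward, name) if forward else None
        r = re.fullmatch(reverse, name) if reverse else None
        if f:
            fwd[f.group(1)] = name
        elif r:
            rev[r.group(1)] = name
        else:
            unmatched[name].append(name)
    # Specifier-based pairs: the captured part must be identical.
    pairs = [(fwd[key], rev[key]) for key in fwd if key in rev]
    # Fallback: exactly two reads sharing the same whole name form a pair.
    pairs += [tuple(group) for group in unmatched.values() if len(group) == 2]
    return pairs

names = ["readA/1", "readA/2", "readB", "readB", "readC"]
print(match_pairs(names))
# [('readA/1', 'readA/2'), ('readB', 'readB')]
```

Passing empty strings for both specifiers sends every read through the whole-name fallback, which mirrors the empty-specifier case in the text.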

For reference-guided assemblies, SeqMan NGen adds the following information:

  {
    isPair: true
    file: "****"
    SeqTech: "Illumina"
    minDist: 0
    maxDist: 750
  }

For reference-guided assemblies with paired-end reads, SeqMan NGen recognizes the pairs by their file names. The following examples demonstrate some of the filename formats that SeqMan NGen supports for reference-guided pairs. In each pair, the two filenames differ only in the region that specifies the forward and reverse reads:

“R_2011_11_21_11_06_08_user_C29-100_PE_DH10B_11_Auto_C29-100_PE_DH10B_11_4120_reverse_pe2.fastq”,
“R_2011_11_21_11_06_08_user_C29-100_PE_DH10B_11_Auto_C29-100_PE_DH10B_11_4120_forward_pe1.fastq”,

“Strain1234_L7_R1_ATCACG_Index1.fastq”,
“Strain1234_L7_R2_ATCACG_Index1.fastq”,

“K12-1-B_TGACCA_L006_R1.fastq”,
“K12-1-B_TGACCA_L006_R2.fastq”,

“GBBC920_GGCTAC_L008_R1.filt.50bp.fastq”,
“GBBC920_GGCTAC_L008_R2.filt.50bp.fastq”

“tiny_1.txt”,
“tiny_2.txt”,

“tiny_1_sequence.txt”,
“tiny_2_sequence.txt”,

“tiny1._qseq”,
“tiny2._qseq”,

“s_1_1_sequence.txt”
“s_1_2_sequence.txt”

“C29-129_forward_pe1.fastq”
“C29-129_forward_pe2.fastq”

The Grep patterns used to match the pairFileNames are shown below:

"(?'name'.*?)_R1_(?'ext'.*)\\.fastq",
"(?'name'.*?)_R2_(?'ext'.*)\\.fastq",

"(?'name'.*?)_R1\\.(?'ext'.*)\\.fastq",
"(?'name'.*?)_R2\\.(?'ext'.*)\\.fastq",

"(?'name'.*?)_forward_pe1(?'ext_p'\\.fastq)",
"(?'name'.*?)_reverse_pe2(?'ext_p'\\.fastq)",

"(?'name'.*?)_{0,1}1\\.fastq",
"(?'name'.*?)_{0,1}2\\.fastq",

"(?'name'.*?)1\\.fastq",
"(?'name'.*?)2\\.fastq",

"(?'name'.*?)1_sequence\\.txt",
"(?'name'.*?)2_sequence\\.txt",

"(?'name'.*?)1\\.txt",
"(?'name'.*?)2\\.txt",

"(?'name'.*?)1\\._qseq",
"(?'name'.*?)2\\._qseq",

"(?'name'.*?)1\\.fq",
"(?'name'.*?)2\\.fq",
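To illustrate how patterns of this kind could pair files, the sketch below translates the first four pattern pairs into Python syntax and applies them to some of the example filenames. Several things here are assumptions rather than documented behavior: that the doubled backslashes are string escaping, that a pattern must match the whole filename, that patterns are tried in the listed order, and that two files pair when all named groups agree. The names PAIR_FILE_PATTERNS, classify, and pair_files are hypothetical.

```python
import re

# Named groups rewritten from (?'name'...) to Python's (?P<name>...) form.
PAIR_FILE_PATTERNS = [
    (r"(?P<name>.*?)_R1_(?P<ext>.*)\.fastq", r"(?P<name>.*?)_R2_(?P<ext>.*)\.fastq"),
    (r"(?P<name>.*?)_R1\.(?P<ext>.*)\.fastq", r"(?P<name>.*?)_R2\.(?P<ext>.*)\.fastq"),
    (r"(?P<name>.*?)_forward_pe1(?P<ext_p>\.fastq)", r"(?P<name>.*?)_reverse_pe2(?P<ext_p>\.fastq)"),
    (r"(?P<name>.*?)_{0,1}1\.fastq", r"(?P<name>.*?)_{0,1}2\.fastq"),
]

def classify(filename):
    """Return (key, direction) from the first pattern pair that matches, else None."""
    for index, (fwd, rev) in enumerate(PAIR_FILE_PATTERNS):
        for direction, pattern in (("forward", fwd), ("reverse", rev)):
            m = re.fullmatch(pattern, filename)
            if m:
                # The key combines the pattern index with every captured group.
                key = (index,) + tuple(sorted(m.groupdict().items()))
                return key, direction
    return None

def pair_files(filenames):
    """Return (forward, reverse) tuples for files whose keys agree."""
    groups = {}
    for filename in filenames:
        result = classify(filename)
        if result:
            key, direction = result
            groups.setdefault(key, {})[direction] = filename
    return [(g["forward"], g["reverse"]) for g in groups.values() if len(g) == 2]

files = [
    "Strain1234_L7_R1_ATCACG_Index1.fastq",
    "Strain1234_L7_R2_ATCACG_Index1.fastq",
    "K12-1-B_TGACCA_L006_R1.fastq",
    "K12-1-B_TGACCA_L006_R2.fastq",
    "GBBC920_GGCTAC_L008_R1.filt.50bp.fastq",
    "GBBC920_GGCTAC_L008_R2.filt.50bp.fastq",
    "C29-129_forward_pe1.fastq",  # no reverse partner in this list
]
for forward_file, reverse_file in pair_files(files):
    print(forward_file, "<->", reverse_file)
```

In this sketch the last file is classified as a forward file but is not reported, because no file with a matching reverse name is present.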

The following script command can be used to add support for a new filename format. The command must be executed before assembly. The pattern will be used for all subsequent assembleTemplate commands for that run of the reference-guided assembler.

pairFilePattern forward: "(?'name'.*?)_R1_(?'ext'.*)\.fastq" reverse: "(?'name'.*?)_R2_(?'ext'.*)\.fastq"
