Assembly Options (All Others)

The Assembly Options dialog allows you to specify the parameters to use for your assembly. If you are following the transcript annotation workflow, or any workflow other than de novo, special reference-guided, templated miRNA or ChIP-Seq, or Combined Analysis of Workflows, you will see the following version of the dialog. Only a subset of the options described below will be available, depending on the workflow.

Make any desired changes within the Assembly Options section:

• Mer size – The minimum length of a mer (overlapping region of a fragment read), in bases, required to be considered a match when arranging reads into contigs. Mer size information is used to identify matches during the assembly layout phase. The default mer size is determined by the selected read technology and is shown in the window. For more information, see the Mer Tags section.

o Automatic – Select this button to automatically set the size based on assembly type and sequencing technology.

o Custom – Select this button to choose the size yourself. You must enter the desired number of bases in the field at right. Lowering the mer size increases the sensitivity of finding matches, but also increases the likelihood of finding spurious matches in addition to the correct match. In large projects, lowering the mer size can also greatly increase the disk space needed for intermediate and temporary files.

• Limit the number of reads – Check the box and enter a value if you wish to limit the read depth. Utilizing this option can make the assembly proceed faster. This option appears in this dialog only for the transcript annotation workflow. For other workflows, the option is called maximum total reads and is located in the Alignment tab of the Advanced Assembly Options screen.

• Minimum match percentage – Specifies the minimum percentage of matches in an overlap that are required to join two sequences in the same contig. (For more information, see the Match Percentage section.)

o Automatic – Select this button to automatically set the percentage based on assembly type and sequencing technology.

o Custom – Select this button to designate the percentage yourself. You must enter a number in the field at right.

• Layout stringency – Specifies two key settings for placing a read in the layout. When building an assembly, SeqMan NGen uses a three-stage strategy: overlap, layout, and alignment. In the overlap stage of a templated assembly, for example, each read and the template are broken up into an overlapping set of substrings or “mers” of a specified length (“mer length” or “mer size”). An identical mer match indicates that the read matches the template at that position. The more mers two sequences share, the stronger the indication that the match is real. The layout stage uses that overlap information and attempts to place each read in its true position on the template. The final layout of all the reads is then sent to an aligner that produces the final fully gapped alignment. Not all the reads in a layout will necessarily be in the final alignment, since the aligner rejects more reads as the stringency of its parameters is increased (e.g., by increasing the Minimum match percentage, above). Layout stringency settings can be used to adjust the extent of overlap data required to include a putative match in the final layout. Choose one of these three options:

o Maximum – Lower false discovery rate (FDR) for SNPs.

o Minimum – Higher true positive rate (TPR) for SNPs.

o Other – If you choose this option, there are two additional settings available:

§ Minimum layout length – The minimum number of identical matching bases (from the mer analysis only) for a read to be included in the layout. It is specified as an integer, with a default of 50 nucleotides. For reads shorter than 100 bases, the setting is automatically adjusted to the mer size.

§ Layout Align – In cases where a read has an identical, or nearly identical, overlap score to more than one location on the template, indicative of a repeated sequence, the read can be evaluated by attempting a fully gapped alignment to each potential mapping position and selecting the position with the best score. In case of ties, the read is placed in one of the locations at random. The default is for this box to be unchecked.
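The overlap stage described under Layout stringency can be illustrated with a minimal Python sketch. This is not SeqMan NGen's implementation: the function names (mers, index_template, candidate_positions) and the example sequences are hypothetical, and the sketch only counts identical mer hits per candidate position on a template.

```python
from collections import defaultdict

def mers(seq, k):
    """Return (offset, mer) pairs for every overlapping k-length substring."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

def index_template(template, k):
    """Map each mer to the template offsets where it occurs."""
    index = defaultdict(list)
    for pos, mer in mers(template, k):
        index[mer].append(pos)
    return index

def candidate_positions(read, index, k):
    """Count identical mer hits for each implied read start on the template."""
    votes = defaultdict(int)
    for offset, mer in mers(read, k):
        for pos in index.get(mer, []):
            votes[pos - offset] += 1
    return dict(votes)

# Hypothetical example data.
template = "ACGTACGTTAGCCGATAGGCTTAACG"
read = "TAGCCGATAGG"

hits_k5 = candidate_positions(read, index_template(template, 5), 5)
hits_k3 = candidate_positions(read, index_template(template, 3), 3)

best = max(hits_k5, key=hits_k5.get)
print("mer size 5:", hits_k5, "best start:", best)
print("mer size 3:", hits_k3)
```

In this example, a mer size of 5 yields a single candidate position, while a mer size of 3 yields additional candidates with fewer supporting mers, which mirrors the trade-off described for the Custom mer size setting.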

• Adapter scan – Adapter sequences are added to the ends of fragments during sequencing library preparation and can interfere with downstream processing if not removed. Check this box to add either one single- or multiple-sequence .fasta file or one folder of .fasta files containing known or suspected adapter sequences. No specific header formatting is required. During assembly, sequencing reads are scanned for each of the specified adapters; when an adapter is detected, it is trimmed from that read, and the trimmed read is used in all downstream processing. A match requires an exact match of at least 11 bases and an overall match of at least 15 bases, which allows for some mismatching. Both ends are searched within a specified range (default = 130), and all bases from an identified match to that end of the read are trimmed off.
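The adapter scan rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm SeqMan NGen uses: the function names are hypothetical, the alignment is ungapped, the tolerated mismatch count (MAX_MISMATCH) is an assumed value that the documentation does not give, and only the 3' end is handled (the 5' end would be treated analogously).

```python
MIN_EXACT = 11     # minimum exact match length (from the text)
MIN_OVERALL = 15   # minimum overall match length (from the text)
SCAN_RANGE = 130   # default search range at each end (from the text)
MAX_MISMATCH = 2   # ASSUMPTION: the tolerated mismatch count is not documented

def find_adapter(read, adapter, lo, hi):
    """Return the read offset where the adapter aligns, or None.

    An exact seed of MIN_EXACT bases must lie within read[lo:hi]; the
    ungapped alignment implied by the seed must then cover at least
    MIN_OVERALL bases with no more than MAX_MISMATCH mismatches.
    """
    for seed_start in range(len(adapter) - MIN_EXACT + 1):
        seed = adapter[seed_start:seed_start + MIN_EXACT]
        hit = read.find(seed, lo, hi)
        if hit == -1:
            continue
        start = hit - seed_start  # read offset aligned to adapter[0]
        pairs = [(read[start + i], adapter[i])
                 for i in range(len(adapter))
                 if 0 <= start + i < len(read)]
        mismatches = sum(a != b for a, b in pairs)
        if len(pairs) >= MIN_OVERALL and mismatches <= MAX_MISMATCH:
            return max(start, 0)
    return None

def trim_3prime(read, adapter):
    """Trim everything from an adapter match to the 3' end of the read."""
    lo = max(0, len(read) - SCAN_RANGE)
    pos = find_adapter(read, adapter, lo, len(read))
    return read if pos is None else read[:pos]

# Hypothetical example data.
adapter = "AGATCGGAAGAGCACACGTC"
insert = "ACGTTGCAAGGCTTACCGATGGATCCTTAG"
print(trim_3prime(insert + adapter, adapter))
```

A read that ends with only 12 adapter bases is left untrimmed in this sketch, because the overall match is shorter than 15 bases.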

• SNP filter stringency – The three radio buttons specify stringency levels for “soft” filtering of SNPs. Soft filtering means that SNPs of the least interest to you will be automatically hidden when SNP reports/tables are viewed in SeqMan Pro or ArrayStar. Your selection in this screen controls the three assembly parameters shown in the table below. For more information on PnotRef, see Filter Based on P not Ref.

Stringency	Min SNP (%)	PnotRef (%)	Depth
High	15	99.9	20
Medium	15	99	20
Low	15	90	20

If a BED, manifest and/or VCF file was specified during project setup and a SNP table is opened in SeqMan Pro or ArrayStar, then only the variants in the targeted regions and at the positions specified in the VCF within those targeted regions will be shown by default.

Note that these “soft filtered” SNPs are not removed from the assembly, and can be made visible again by changing the SNP filtering parameters in either SeqMan Pro or ArrayStar. This is in contrast to “hard filtering” of SNPs, which is done through the Variant tab of the Advanced Assembly Options dialog.
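The difference between hiding and removing SNPs can be illustrated with a short Python sketch that uses the values from the stringency table. The Snp class, the soft_filter function and the example SNPs are hypothetical, and treating each table value as an inclusive minimum is an assumption; the documentation does not state how SeqMan NGen compares values against the thresholds.

```python
from dataclasses import dataclass

@dataclass
class Snp:
    position: int
    snp_percent: float  # percentage of reads showing the variant
    p_not_ref: float    # PnotRef, as a percentage
    depth: int          # read depth at the position

# (Min SNP %, PnotRef %, Depth) taken from the stringency table.
THRESHOLDS = {
    "High": (15, 99.9, 20),
    "Medium": (15, 99.0, 20),
    "Low": (15, 90.0, 20),
}

def soft_filter(snps, stringency):
    """Return (snp, visible) pairs; hidden SNPs are kept, not deleted.

    ASSUMPTION: each threshold is an inclusive minimum.
    """
    min_pct, min_p, min_depth = THRESHOLDS[stringency]
    return [
        (s, s.snp_percent >= min_pct
            and s.p_not_ref >= min_p
            and s.depth >= min_depth)
        for s in snps
    ]

# Hypothetical example data.
snps = [
    Snp(101, 48.0, 99.95, 60),
    Snp(250, 22.0, 95.0, 35),
    Snp(777, 10.0, 99.99, 80),
]

for level in ("Low", "Medium", "High"):
    shown = [s.position for s, visible in soft_filter(snps, level) if visible]
    print(level, shown)
```

Every SNP remains in the returned list regardless of the stringency level; only its visibility flag changes, which is why the selection can be revised later without re-running the assembly.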

Make any desired changes within the Analysis Options section:

• Delete Intermediates – (transcript annotation workflow only) Check this box to delete all temporary files created during assembly.

• Variant detection mode – Specifies genome ploidy for SNP detection purposes. Choosing Haploid or Diploid establishes the statistical model SeqMan NGen will use in estimating the probability that a given called variant is real (i.e., that the sequence really differs from the reference). Selecting Somatic/cancer/heterogeneous (e.g., for a polyploid genome, cancer panel, etc.) prevents SeqMan NGen from calculating probabilities.

• Gender – If the checkbox is present, specify the gender of the subject (Male/Female), if known. Otherwise, select Unknown. This checkbox appears only if you are using a DNASTAR genome template package and have chosen a genome ploidy other than Haploid.

• Detect Novel miRNA Genes – (miRNA workflow only) Keep this box checked if you wish to use the Peak Detection Method “ERANGE2” or “ERANGE3” (below) to identify regions of sequence read coverage meeting the criteria thresholds (see Advanced Assembly Options). Those peaks are then listed in the peak and fragment tables and can be associated with nearby and overlapping genes.

• Import Variant Annotation Database – Check this box if you are working with human samples and would like to import variant annotations from a specific portion of the NCBI RefSeq database maintained on the DNASTAR website.

• Calculate Copy Number Variation – (reference-based Whole Genome or Exome and Gene Panel workflows only) Keep this box checked if you wish to calculate copy number variants (CNV) as part of the assembly. If the box is checked, you may choose between two Normalization method options: RPK-CN and None (i.e., no data normalization). If you also select a Variant detection mode other than Do not calculate variants, then CNVs, SNPs and small indels will be calculated from the assembly. After assembly, you can then use ArrayStar to view all three types, or SeqMan Pro to view only the SNPs and small indels.

• Normalization method – Choose the desired normalization method to be applied to the data on a per isoform basis. See Normalization Methods for detailed information about each of the methods offered in SeqMan NGen, including the workflows with which they can be used. In order to enable the selection of a normalization method in some workflows, you must check Calculate Copy Number Variation, above.

• Peak Detection Method – Choose from three peak detection methods: MACS, ERANGE2 and ERANGE3. See Peak Detection Methods for detailed information about each of the methods offered in SeqMan NGen, including the workflows to which they pertain.

• Advanced Assembly Options – Click this button to open the Advanced Assembly Options dialog, which allows you to select additional assembly parameters. See Alignment and Variant for information about options in each tab of the dialog.

Once you are finished, click Next > to continue to the next wizard screen.