SeqMan NGen Scripting Manual

This document pertains to DNASTAR's SeqMan NGen version 12 and was last updated on December 4, 2014.

I. SeqMan NGen Assemblers

SeqMan NGen contains two assemblers, XNG and SNG (called SMNG in Linux), with different capabilities and scripting languages. Therefore, it is essential to match the correct assembler with the type of assembly project to be done. 

***************

Reference-guided ("templated") assemblies:

The XNG assembler is used for nearly all reference-guided assemblies. This assembler is capable of assembling data sets of any size, given sufficient disk space and a modest amount of RAM (see http://www.dnastar.com/t-sub-support-technical-reqs-seqman-ngen.aspx for details). The primary output is a BAM-formatted alignment file for each reference sequence. Note that BAM files cannot be edited.

For small genome (less than 30MB) reconstruction projects with fewer than 10 million reads, where editing is required, templated assemblies can also be performed using SNG/SMNG. The SNG/SMNG assembler generates finished assemblies in any of four formats: SQD, ACE, SAM or BAM. SeqMan (SQD) and ACE files are editable in SeqMan Pro, but the number of data reads is limited to 10 million or fewer. BAM files of any size can be created, but may not be edited.

***************

De novo assemblies:

The SNG/SMNG assembler is used for all de novo assemblies. SNG/SMNG generates finished assemblies in SeqMan (SQD) or ACE format. Both are editable in SeqMan Pro, but the number of data reads is limited to 10 million or fewer.

Specifying XNG or SNG/SMNG When Running a Script

To specify which assembler to use to run your script, type xng or sng (smng in Linux) followed by the path and script file name after the command prompt. Alternatively, add either the #!/usr/bin/xng or #!/usr/bin/sng (#!/usr/bin/smng in Linux) command as the first line of the script and execute it through the command line.

****************************************************************

II. Scripting Manual Conventions

Due to the constraints of TXT format, the following formatting conventions apply to Parts III and IV of this scripting manual:

"Commands" are listed alphabetically, and are denoted in the list by alphabetical characters (e.g., "A" or "CC"). One command is separated from the next using a long line of asterisks.

"Parameters" for a command are listed here in alphabetical order, not the order in which they are written in a script. Parameters are denoted in the list by the same letter(s) as the command, plus a number (e.g., "A35"). For the optimal organization and usage of parameters in a script, please refer to the Example sections.

"Properties" for a parameter are listed in alphabetical order, and are denoted in the list by the same letter(s) and number as the parameter, plus a lower-case letter (e.g., "A35c"). Properties are indented slightly, and are also bracketed between short lines of asterisks.

"Examples" appear below the Commands/Parameters/Properties that they are intended to illustrate.

****************************************************************

III. XNG Commands

A) assembleTemplate
(required) Initiates the assembly of the loaded sequences using the specified template as a reference.

Parameters for 'assembleTemplate':

A1) assemble: [matchContam|noMatchContam|all]
(optional) Specifies whether to use the part of the query that matches the contaminant sequence(s), the part that doesn't match, or both. Default is 'noMatchContam.'

A2) autoTrim: [true|false]
(optional) Specifies whether mismatching ends of reads should automatically be trimmed. Default is 'true.' 

A3) boneyardAssembly: [true|false]
(optional) Specifies whether sequences not used in the original or incremental XNG assemblies should be added to the assembly project by the SNG assembler. This command pertains only to reference-guided assemblies with gap closure. By default, during this type of assembly, the XNG assembler first finds structural variations (SVs) then splits the contig after each SV. Elements of this process can be modified using this command. The default is 'true.'

A4) contaminant: [directory/filename enclosed in quotes]
Use of this parameter partitions the query data by running an additional mer-match (layout) against the specified contaminant sequence(s). A full assembly is then run using the part of the query that either matches or does not match the contaminant sequence(s). This parameter can be used for removing reads originating from an organism(s) that may have also been present in the query data set (e.g., reads from human DNA present in a metagenomic sample from the human gut). 

A5) dbSNPTable: [directory/filename enclosed in quotes]
(Intended for internal use only). 

A6) delayAlignInserts: [true|false]
This flag turns the delaying of reads that cause inserts on or off. The default is 'true,' meaning that gap-causing reads will be delayed. Reads are added so that those causing the fewest inserts (the length of the inserts is not considered) are added before those causing more inserts.

A7) deleteIntermediates: [true|false]
(optional) Specifies whether intermediate files are saved or deleted. These files can be large with large scale projects. Default is 'false.' 

A8) directoryMer: [directory/filename enclosed in quotes]
(optional) Specifies the path and directory where both the template and query data mer files will be stored. Alternatively, separate directories for the template and query mer files can be specified using the parameters below. If no directory is specified, the mer file will be created in the directory containing the sequence data. 

A9) directoryQueryMer: [directory/filename enclosed in quotes]
(required) Specifies the path and directory where the query mer file will be stored. 

A10) directoryTemplateMer: [directory/filename enclosed in quotes]
(required) Specifies the path and directory where the template mer file will be stored. 

A11) filterDeepLayout: [true|false]
(optional) Specifies that XNG remove superfluous sequences in areas of deep coverage. Default is 'true.'

A12) forceMake: [true|false]
(optional) Specifies whether new intermediate mer files will be created. A value of false means that existing valid intermediate files will be used. Default is 'true.'

A13) format: [BAM|SQD|none]
(optional) Specifies the format of the alignment output file. If 'none' is entered, the assembly is run to include the alignment phase, but no alignment output is generated. This parameter can be used to remove reads from a contaminant source. Default is 'BAM.'

A14) gapPenalty: [number] 
The penalty for opening or extending a gap during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. A high gap penalty suppresses gapping, while a low value promotes gapping. Default is 30.

A15) geneticCode: [filepath/standard Lasergene genetic code file name]
This parameter specifies the genetic code to use with a template file.

A16) hits: [directory/filename enclosed in quotes]
(required) Specifies the path and name of the hit file. Incomplete paths will be appended to the default directory. 

A17) increaseRunGapPen: [true|false]
This parameter is a flag to increase the gap open penalty in HP runs. The default is 'false.'

A18) layout: [directory/filename enclosed in quotes]
(required) Specifies the path and name of the layout file. Incomplete paths will be appended to the default directory. 

A19) layout type: [unique|once|multiple]
Specifies how reads are to be laid out. Default is 'once.'

A20) layoutAlign: [true|false] 
Specifies that a pairwise alignment should be performed at the layout phase in order to pick the best position for a given read.

A21) matchScore: [number]
The score for a base match during an alignment. This score contributes to the pairwise score used to calculate match percentage. Increasing the matchScore value allows for longer or more frequent gaps, thus forcing bases that match to be assembled together. Default is 10.

A22) MaxGap: [number from 0-1000]
The maximum number of gaps allowed per 1000 bases in the alignment. Default is 6.
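
As a rough illustration of how a per-1000-bases limit behaves, the sketch below checks whether an alignment's gap count stays within 'MaxGap.' The function name, its arguments, and the inclusive comparison are assumptions made for illustration; they are not XNG's actual implementation.

```python
def within_max_gap(gap_count: int, aligned_length: int, max_gap: int = 6) -> bool:
    """Return True if the gaps per 1000 aligned bases do not exceed max_gap.

    Hypothetical helper: the inclusive comparison (<=) is an assumption.
    """
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    gaps_per_1000 = gap_count * 1000 / aligned_length
    return gaps_per_1000 <= max_gap
```

Under these assumptions, one gap in a 250-base alignment corresponds to 4 gaps per 1000 bases and passes the default limit of 6, while two gaps in the same alignment (8 per 1000) do not.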

A23) maxNCnt: [integer]
(optional) This parameter removes sequential reads of the IUPAC ambiguity code 'N' that are greater than or equal to the number specified. Use of this parameter may help in assemblies whose reads contain large clusters of spurious N's.

A24) maxSeqs: [number]
(optional) Specifies the maximum number of query sequences to add to an assembly. Use of this command can speed up assembly. This parameter does not have a default value.

A25) maxTempInserts: [integer]
Specifies the maximum number of insert positions a read is allowed to open on the template (a max. of 3 is recommended). The default is 0.

A26) merLayoutMin: [number from 11-1000] 
(optional) Specifies the minimum length (in bases) of at least one stretch of matching mers used to identify matches between the reference and query data. The minimum value is equal to the mer size ('merSize'). The maximum value is the read length, which would require the entire read to be an exact match. For example, with a merSize of 19 and a merLayoutMin of 21, at least one stretch of three consecutive mers in a read would have to match in order for the read to be included in the layout. Default is 25.
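
The relationship between 'merSize' and 'merLayoutMin' in the example above is simple arithmetic: a stretch of L matching bases contains L - merSize + 1 overlapping mers. The sketch below reproduces that calculation; the function name is hypothetical, and the sketch assumes no mers are skipped.

```python
def consecutive_mers_required(mer_size: int, mer_layout_min: int) -> int:
    """Number of consecutive overlapping mers spanning mer_layout_min bases."""
    if mer_layout_min < mer_size:
        raise ValueError("merLayoutMin cannot be smaller than merSize")
    return mer_layout_min - mer_size + 1
```

With a merSize of 19 and a merLayoutMin of 21 this gives 3, matching the example in the parameter description.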

A27) merMinimizer: [number]
(Intended for internal use only)

A28) merSize: [number]
(required) Specifies the length (in bases) of mers used to identify matches between the reference and query data. This parameter does not have a default value.

A29) merSkip: [number]
(Intended for internal use only) Specifies the number of positions to ignore or "skip" when creating the template mer file. Normally, mers are only skipped in the query (see 'merSkipQuery,' below). The first and last mer of every read are always included. Increasing the value reduces the size of the intermediate files as well as the overall assembly time. However, larger values can also reduce the number of reads included in the assembly, especially with short read data. Default is 0.

   0 = do not skip
   2 = skip every second base
   3 = skip every third base
   etc. 

A30) merSkipQuery: [number] 
(optional) Specifies the number of positions to ignore or "skip" when creating the query mer file. The first and last mer of every read are always included. Increasing the value reduces the size of the intermediate files as well as the overall assembly time. However, larger values can also reduce the number of reads included in the assembly, especially with short read data. Default is 0.

   0 = do not skip
   2 = skip every second base
   3 = skip every third base
   etc. 

A31) minAlignedLength: [number from 11-1000]
(optional) Specifies the minimum number of bases that must align after trimming for a read to be included in the assembly. Default is 25.

A32) minMatchPercent: [number]
The minimum percentage of matches in an overlap required to join two sequences in the same contig. Default is 93.

A33) mismatchPenalty: [number]
The penalty for a base mismatch during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. Default is 20.
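
To show how 'matchScore,' 'mismatchPenalty' and 'gapPenalty' pull against each other, the sketch below totals a raw pairwise score from the documented defaults (10, 20 and 30). It assumes the score is a plain sum of per-position contributions and that every gap position carries the same penalty; this manual does not specify how XNG combines these values or converts them into a match percentage, so treat the function as illustrative only.

```python
def raw_pairwise_score(matches: int, mismatches: int, gap_positions: int,
                       match_score: int = 10,
                       mismatch_penalty: int = 20,
                       gap_penalty: int = 30) -> int:
    """Illustrative additive score; the additive form is an assumption."""
    return (matches * match_score
            - mismatches * mismatch_penalty
            - gap_positions * gap_penalty)
```

Under this assumption, a single gap position costs as much as three matching bases earn, which is why raising 'gapPenalty' suppresses gapping and raising 'matchScore' makes gaps relatively cheaper.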

A34) noSVPairSort: [true|false]
Specifies whether to turn off the calculation of pairs for structural variations. This may potentially reduce XNG assembly time. Default is 'false.'

A35) onePackage: [true|false]
(optional) Specifies whether an assembly containing multiple reference sequences should be bundled into a single .assembly package. If 'false' is entered, one .assembly package is created per contig. Default is 'true.'

A36) openInSeqman: [true|false]
(optional; not available for Linux users) Specifies whether the completed assembly should immediately be launched in SeqMan. Default is 'false.'

A37) output: [directory/filename enclosed in quotes]
(required) Specifies the path and directory of the output files. Incomplete paths are appended to the default directory. 

A38) pairDist: [true|false]
(Intended for internal use only)

A39) placeHit: [true|false]
(Intended for internal use only)

A40) probe: [number]
(Intended for internal use only)

A41) query: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name(s) of the query data to be assembled. A folder with one or more data files can also be used in place of individual file names.

***************

Properties for 'query':

  A41a) file: [directory/filename enclosed in quotes]
Specifies the directory and file/folder. 

  A41b) isPair: [true|false]
(optional) Specifies whether the query files contain paired end data. Default is 'false.'

  A41c) minDist: [number]
(required if 'isPair' is 'true') Specifies the minimum expected distance in bases between paired end reads. Default is 0.

  A41d) maxDist: [number]
(required if 'isPair' is 'true') Specifies the maximum expected distance in bases between paired end reads. Defaults are 500 for Illumina < 50nt; 3000 for Illumina > 50 nt; and 5000 for all others.

  A41e) seqTech: [normalScore|IonTorrent|SOLiD|Illumina|454|unknown]
(optional) Specifies the offset to be used when converting compressed quality scores into numerical values. These are the offsets used for the technology specified:

normalScore   33
IonTorrent    33
SOLiD         33
Illumina      64
454           33; quality scores for homopolymeric runs of >= 2 are oriented from 5' to 3' on the top strand.
unknown       determined automatically based on the first data file.
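
The offsets listed above are subtracted from the ASCII code of each quality character to recover the numerical score. A minimal sketch of that conversion follows; the dictionary and function names are hypothetical, and 'unknown' is deliberately left out because the manual says it is determined from the data.

```python
from typing import List

# Offsets taken from the table above; 'unknown' is omitted because it is
# determined automatically from the first data file.
QUALITY_OFFSETS = {
    "normalScore": 33,
    "IonTorrent": 33,
    "SOLiD": 33,
    "Illumina": 64,
    "454": 33,
}


def decode_quality(quality_string: str, seq_tech: str) -> List[int]:
    """Convert a compressed quality string into numerical scores."""
    offset = QUALITY_OFFSETS[seq_tech]  # raises KeyError for unlisted values
    return [ord(ch) - offset for ch in quality_string]
```

For example, the character 'I' (ASCII 73) decodes to 40 with an offset of 33, while 'h' (ASCII 104) decodes to 40 with an offset of 64. Choosing the wrong 'seqTech' therefore shifts every score by 31.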

Example for 'query': 

  query: {{file: "/data/home/proj/Illumina_s_5_1.txt"}
        {file: "/data/home/proj/Illumina_s_5_2.txt"}
   isPair: true
   minDist: 400
   maxDist: 700
   seqTech: Illumina}

***************

A42) recordSplitsOnly: [true|false]
(optional) Functional only when used in the same program as 'splitTemplateContigs' or 'recordStructVariations.' Specifies whether or not to turn off contig splitting while still recording SVs for later inclusion in the Structural Variation Report. The default is 'false.'

A43) recordStructVariations: [integer between 0-3|true|false]
(optional) Specifies under which circumstances structural variations (SVs) should be calculated and recorded. Default is 2.

0|false   Don't calculate SVs
1|true    Calculate SVs at zero coverage
2         Calculate SVs at insertions and deletions
3         Calculate SVs at zero coverage and at insertions

A44) removeUniqueInserts: [true|false]
This parameter is a flag to remove reads that cause an insert which no other read would create. This parameter is only enabled when delayAlignInserts is 'true.' The default for removeUniqueInserts is 'false.'

A45) repeatCnt: [number from 1-10000]
(optional) Specifies the minimum number of occurrences of a mer in the reference sequence(s) for it to be considered repeated. Mers exceeding this number will not be used for identifying matches. The default is 100.
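
The effect of 'repeatCnt' can be pictured as counting how often each mer occurs in the reference and discarding the over-represented ones. The sketch below is a simplified illustration with hypothetical names; it holds all mers in memory and treats "exceeding" as strictly greater than the threshold, neither of which is confirmed as XNG's behavior.

```python
from collections import Counter
from typing import Set


def usable_mers(reference: str, mer_size: int, repeat_cnt: int = 100) -> Set[str]:
    """Return the mers whose occurrence count does not exceed repeat_cnt."""
    counts = Counter(
        reference[i:i + mer_size]
        for i in range(len(reference) - mer_size + 1)
    )
    # Assumption: a mer is dropped only when its count is above repeat_cnt.
    return {mer for mer, n in counts.items() if n <= repeat_cnt}
```

Lowering the threshold removes more repetitive mers from matching, which reduces ambiguous placements at the cost of fewer seeds in repeat-rich regions.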

A46) results: [directory/filename enclosed in quotes]
(optional) Specifies the path and name of the result summary file. This file contains a compilation of assembly statistics and uses the extension fileSize.txt. Incomplete paths will be appended to the default directory. 

A47) saveUnSplitAssembly: [true|false]
(optional) Specifies whether XNG should save both the normal assembly output, [filename].assembly, and the unsplit intermediate assembly, [filename]-noSplit.assembly. The latter file contains SVs but no SNPs, and can be used to validate splits in the final assembly. The default is 'false.'

A48) showCDSVariant: [true|false]
(optional) Specifies whether or not XNG should show all variants of a CDS feature contacted by a SNP. The version number for the CDS variant will then appear in brackets when viewed in the SNP report in SeqMan Pro. Default is 'true.'

A49) sngConvertOptions: [text string]
(Intended for internal use only)

A50) snp: [true|false]
(optional) Specifies whether or not a SNP detection pass of the gapped alignment should be made during the assembly. Default is 'true.'

A51) snp_checkStrandedness: [true|false]
(optional) Specifies whether or not the strand that each read comes from is considered in the SNP calculation. This is ignored by the Simple method. Default is 'false.'

A52) snp_combineSubs: [true|false]
This parameter is used to coalesce adjacent substitutions.

A53) snp_excludeBases3p: [integer]
This parameter causes the specified number of bases from the 3' end of each read to not be considered during variant calling.

A54) snp_excludeBases5p: [integer]
This parameter causes the specified number of bases from the 5' end of each read to not be considered during variant calling.

A55) snp_excludeBasesEdge: [integer]
This parameter causes the specified number of bases from both the 5' and 3' ends of each read to not be considered during variant calling.
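
The three 'snp_excludeBases' parameters can be thought of as shrinking the window of each read that contributes to variant calling. The sketch below returns the 0-based read positions that remain; the function name is hypothetical, and 'snp_excludeBasesEdge' is modeled simply as passing the same value for both ends, which is an assumption about how the parameters relate.

```python
def callable_read_positions(read_length: int,
                            exclude_5p: int = 0,
                            exclude_3p: int = 0) -> range:
    """0-based positions of a read (5' to 3') still considered for variants."""
    start = exclude_5p
    stop = read_length - exclude_3p
    # An empty range results when the exclusions cover the whole read.
    return range(start, max(start, stop))
```

For a 10-base read with 2 bases excluded at the 5' end and 3 at the 3' end, positions 2 through 6 remain.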

A56) snp_limitEndPos: [number]
(optional) Specifies the 3' most coordinate of the specified template from which to stop calculating SNPs. A value between 1 and the length of the template must be entered.

A57) snp_limitStartPos: [number]
(optional) Specifies the 5' most coordinate of the specified template from which to begin calculating SNPs. A value between 1 and the length of the template must be entered. Default is 1.

A58) snp_limitTemplateID: [number]
(optional) Specifies a single template ID for which to calculate SNPs. By default, counting begins from 0.

A59) snp_logEndPos: [number]
(optional) Specifies the 3' most coordinate of the specified template from which to stop storing a detailed log of SNP information. A value between 1 and the length of the template must be entered. Default is 1.

A60) snp_logLevel: [whole number from 0-3]
(optional) Specifies the level of detailed logging to store in the "shared" project directory as "SNP.log." Level 0 specifies that no log will be stored. Level 1 stores detailed info on the SNPs which were called, level 2 also logs columns where the preliminary filter passed but the final filtering failed, and level 3 logs all columns. This is ignored by the simple SNP calling method. Default is 0.

A61) snp_logStartPos: [number]
(optional) Specifies the 5' most coordinate of the specified template from which to begin storing a detailed log of SNP information. A value between 1 and the length of the template must be entered. Default is 1.

A62) snp_logTemplateID: [number]
(optional) Specifies a single template from which to store a detailed log of SNP information. By default, counting begins from 0.

A63) snp_maxRun: [integer]
Specifies the maximum length of a homopolymeric run to be considered during variant calling.

A64) snp_maxStrandBias: [integer]
This parameter is given by the formula: |SNP% forward - SNP% reverse| / Overall SNP%

where SNP% forward and SNP% reverse are the percentage of reads on the forward (top) and reverse (bottom) strand, respectively, containing the variant and SNP% is the total percentage of reads containing the variant. 

Values will typically range from zero (perfectly strand balanced) to 1 (all the variant containing reads are on one strand). 

Example for 'snp_maxStrandBias':

In a homozygous case (SNP% = 100) with a depth of 100, where 75 variant containing reads are on the top strand (75%) and 25 variant containing reads are on the bottom strand (25%), the strand bias would equal: (75 - 25)/100 = 0.5.

Note: In cases where all the reads covering a base are on one strand only, the SNP% of the other strand cannot be calculated (0 divided by 0). These positions will not be removed by this filter. To remove these variants, set snp_minStrandCov to >= 1.
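
The strand bias calculation can be sketched as the absolute difference between the forward and reverse SNP percentages divided by the overall SNP percentage. The sketch follows the arithmetic of the worked example, taking each strand's percentage relative to the total read depth; that reading, the function name, and the use of None for the undefined case are assumptions rather than a description of XNG's internals.

```python
from typing import Optional


def strand_bias(fwd_variant: int, fwd_depth: int,
                rev_variant: int, rev_depth: int) -> Optional[float]:
    """|SNP% forward - SNP% reverse| / overall SNP%, or None if undefined."""
    # Per the note above, the value is treated as undefined when one strand
    # has no coverage at this position.
    if fwd_depth == 0 or rev_depth == 0:
        return None
    total_depth = fwd_depth + rev_depth
    snp_pct_fwd = 100.0 * fwd_variant / total_depth
    snp_pct_rev = 100.0 * rev_variant / total_depth
    overall_pct = snp_pct_fwd + snp_pct_rev
    if overall_pct == 0:
        return None  # no variant reads, nothing to measure
    return abs(snp_pct_fwd - snp_pct_rev) / overall_pct
```

Calling strand_bias(75, 75, 25, 25) reproduces the 0.5 from the worked example. A position whose result is None would pass a maximum strand bias filter untouched, which is why the note points to snp_minStrandCov for those cases.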

A65) snp_minHomopolDelFrac: [integer]
Specifies the minimum fraction of reads required to call a deletion in a homopolymeric run.

A66) snp_minHomopolDelDepth: [integer]
Specifies the minimum read depth required to call a deletion in a homopolymeric run.

A67) snp_minPctToScore: [number from 0-1]
(optional) Specifies the minimum percentage of reads in a column which must differ from the reference in order to score the column. For the Simple method, this is the only criterion used to call a SNP. For the Diploid and Haploid methods, this is a filter applied before the other parameters. Default is 0.05.

A68) snp_minProbNonrefToCall: [number from 0-1]
(optional) Specifies the minimum probability of a SNP column which is required to call a SNP, expressed as a number between 0 and 1. The probabilities of all genotypes other than Homozygous Reference are totaled and checked against this number. This is the final filter applied during the Diploid and Haploid SNP calling methods, and is ignored by the Simple method. Default is 0.1, requiring a minimum 10% change.

A69) snp_minStrandCov: [integer]
Specifies the minimum number of reads from each strand required to call a variant at a given position.

A70) snp_minVariantDepthToScore: [number from 0-100] 
(required if "snp" is true) Specifies the minimum depth required for a specific base (or deletion) in a column before it is considered usable for SNP calling. This is the second filter applied during the Diploid and Haploid SNP calling methods, and is ignored by the Simple method. Default is 2.
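
For the Diploid and Haploid methods, the descriptions of 'snp_minPctToScore,' 'snp_minVariantDepthToScore' and 'snp_minProbNonrefToCall' imply a three-stage filter applied in that order. The sketch below strings the stages together using the documented defaults; the function name, the input arguments, and the inclusive comparisons are assumptions for illustration.

```python
def passes_snp_filters(nonref_fraction: float,
                       variant_depth: int,
                       prob_nonref: float,
                       min_pct_to_score: float = 0.05,
                       min_variant_depth_to_score: int = 2,
                       min_prob_nonref_to_call: float = 0.1) -> bool:
    """Apply the three documented thresholds in their documented order."""
    # First filter: fraction of reads in the column differing from the reference.
    if nonref_fraction < min_pct_to_score:
        return False
    # Second filter: depth of the specific variant base (or deletion).
    if variant_depth < min_variant_depth_to_score:
        return False
    # Final filter: total probability of all non-reference genotypes.
    return prob_nonref >= min_prob_nonref_to_call
```

A column with half its reads differing, a variant depth of 10 and a non-reference probability of 0.99 passes all three stages; dropping any one input below its threshold rejects the column.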

A71) snp_minWeight: [number]
(optional) Specifies the minimum quality score for a base to be considered in the SNP calculation. Default is 5.

A72) snp_reportUserMissing: [kParamTypeStrFixedVocab] 
Specifies what kind of positions to put in the 'missingUser' file, including one or more of the following:

  'dbSNP' = dbSNP Pos    
  'user'= in user VCF SNP file
  'zeroCoverage' = include zero coverage regions
  'cosmic' = in COSMIC database 
  'allcaptured' = include all positions in capture regions
  'captured' = include only positions in capture regions

Example for 'snp_reportUserMissing':

  snp_reportUserMissing: [user allcaptured captured]

A73) snp_runVar: [true|false]
Uses a Bayesian probabilistic model to exclude heterozygous insertions and deletions in homopolymeric runs. Intended for use with Ion Torrent data.

A74) snp_showAllFeatures: [true|false]
(optional) Specifies whether XNG should count SNPs multiple times if the SNP contacts different versions (variants) of a CDS feature. Default is 'true.'

A75) snp_writeExtended: [true|false]
(optional) Specifies whether the additional values produced by the Haploid or Diploid SNP calculation methods are included in the SNP table. Default is 'true.'

A76) snpMethod: [simple|haploid|diploid|population]
(optional) Specifies the SNP detection method to use. Simple produces a count of each type of base in the column and calculates the percent of non-reference bases. Haploid uses a Bayesian statistical model to calculate a probability score that the position contains a polymorphism and give a quality score for the base called at that position. Diploid uses a Bayesian statistical model to calculate a probability score that the position contains a polymorphism and give a quality score for the base(s) called at that position. Based on the scores, it also calls the genotype at each position. Default is 'diploid.'

A77) splitTemplateContigs: [integer between 0-3|true|false]
(optional) Specifies under which circumstances contigs should be cut after a templated assembly. Any split contigs will be grouped into scaffolds with a defined position to allow for easy sorting when the project is viewed in SeqMan Pro. This command pertains only to reference-guided assemblies with gap closure. By default, during this type of assembly, the XNG assembler first finds structural variations (SVs) then splits the contig after each SV. Elements of this process can be modified using this command. Default is 2.

0|false   Don't split
1|true    Split at locations with zero coverage
2         Split at insertions and deletions
3         Split at zero coverage and at insertions

***************

A78) template: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name of the template file. A folder with one or more template files can also be used in place of individual file names. Each entry must also be enclosed by brackets. If more than one template entry is used, the list must also be enclosed by an additional set of brackets.

***************

Properties for 'template':

  A78a) file: [directory/filename enclosed in quotes]   
Specifies the directory and file/folder. 

  A78b) feature: [directory/filename enclosed in quotes]
(optional) Specifies the directory and file name for annotated features when the reference sequence and feature annotations are in separate files.

Examples for 'template':

Sequence and annotation in one file:

  AssembleTemplate
   template: {{file: "/data/home/proj/MG1655.gbk"} 
              {file: "/data/home/proj/W3110.gbk"}}

Sequence and annotation in separate files:

  AssembleTemplate
   template: {file: "/Library/ABC_proj/references/MG1655.fas" 
             feature: "/Library/ABC_proj/references/MG1655.gff"}

***************

A79) templateHitCntThresh: [number]
(Intended for internal use only)

A80) trimToTargetRegions: [true|false] 

Controls whether reads are trimmed, by default, to the boundaries of the targeted regions, as defined by the .bed or manifest file. Default is 'true,' meaning that the reads are trimmed to the stated boundaries. Selecting 'true' is equivalent to checking the "Trim to targeted regions" box in the Alignment tab of the SeqMan NGen wizard's Advanced Options dialog. If conditions are not met, the SeqMan NGen wizard does not change this parameter to 'false,' but instead omits it from the script. The parameter status is only shown in the script for control workflows.

A81) unassembled: [directory/filename enclosed in quotes]

A82) verify: [true|false]

****************************************************************

B) computeSNP
(optional) Sets parameters for the SNP computation phase of the assembly. The command is designed for use with existing BAM files that have not been analyzed for SNPs, or to re-analyze an existing file with different parameters. The associated parameters are also available for full assemblies under the 'assembleTemplate' command.

Parameters for 'computeSNP':

B1) calcJunctionSeqs: [true|false]
(optional) In the structural variation workflow, specifying 'false' prevents junction sequences from being calculated. Default is 'true.'

B2) concurrentAligns: [number]
(Intended for internal use only)

B3) file: [directory/filename enclosed in quotes]
(required) Specifies the path and name of one or more .assembly projects from which to compute SNPs. 

B4) showCDSVariant: [true|false]
(optional) Specifies whether XNG should show all variants of a CDS feature contacted by a SNP. The version number for the CDS variant will then appear in brackets when viewed in the SNP report in SeqMan Pro. Default is 'true.'

B5) snp_checkStrandedness: [true|false]
(optional) Specifies whether the strand that each read comes from is considered in the SNP calculation. This is ignored by the Simple method. Default is 'false.'

B6) snp_combineSubs: [true|false]
This parameter is used to coalesce adjacent substitutions.

B7) snp_excludeBases3p: [integer]
This parameter causes the specified number of bases from the 3' end of each read to not be considered during variant calling.

B8) snp_excludeBases5p: [integer]
This parameter causes the specified number of bases from the 5' end of each read to not be considered during variant calling.

B9) snp_excludeBasesEdge: [integer]
This parameter causes the specified number of bases from both the 5' and 3' ends of each read to not be considered during variant calling.

B10) snp_limitEndPos: [number]
(optional) Specifies the 3' most coordinate of the specified template from which to stop calculating SNPs. A value between 1 and the length of the template must be entered. Default is 1.

B11) snp_limitStartPos: [number]
(optional) Specifies the 5' most coordinate of the specified template from which to begin calculating SNPs. A value between 1 and the length of the template must be entered. Default is 1.

B12) snp_limitTemplateID: [number]
(optional) Specifies a single template ID for which to calculate SNPs. By default, counting begins from 0.

B13) snp_logEndPos: [number]
(optional) Specifies the 3' most coordinate of the specified template from which to stop storing a detailed log of SNP information. A value between 1 and the length of the template must be entered. Default is 1.

B14) snp_logLevel: [number]
(optional) Specifies the level of detailed logging to store in the "shared" project directory as "SNP.log." Level 0 specifies that no log will be stored. Level 1 stores detailed info on the SNPs which were called, level 2 also logs columns where the preliminary filter passed but the final filtering failed, and level 3 logs all columns. This is ignored by the simple SNP calling method. Default is 0.

B15) snp_logStartPos: [number]
(optional) Specifies the 5' most coordinate of the specified template from which to begin storing a detailed log of SNP information. A value between 1 and the length of the template must be entered. Default is 1.

B16) snp_logTemplateID: [number]
(optional) Specifies a single template from which to store a detailed log of SNP information. By default, counting begins from 0.

B17) snp_maxRun: [integer]
Specifies the maximum length of a homopolymeric run to be considered during variant calling.

B18) snp_maxStrandBias: [integer]
This parameter is given by the formula: |SNP% forward - SNP% reverse| / Overall SNP%

where SNP% forward and SNP% reverse are the percentage of reads on the forward (top) and reverse (bottom) strand, respectively, containing the variant and SNP% is the total percentage of reads containing the variant. 

Values will typically range from zero (perfectly strand balanced) to 1 (all the variant containing reads are on one strand). 

Example for 'snp_maxStrandBias':

In a homozygous case (SNP% = 100) with a depth of 100, where 75 variant containing reads are on the top strand (75%) and 25 variant containing reads are on the bottom strand (25%), the strand bias would equal: (75 - 25)/100 = 0.5.

Note: In cases where all the reads covering a base are on one strand only, the SNP% of the other strand cannot be calculated (0 divided by 0). These positions will not be removed by this filter. To remove these variants, set snp_minStrandCov to >= 1.

B19) snp_minHomopolDelFrac: [integer]
Specifies the minimum fraction of reads required to call a deletion in a homopolymeric run.

B20) snp_minHomopolDelDepth: [integer]
Specifies the minimum read depth required to call a deletion in a homopolymeric run.

B21) snp_minPctToScore: [number from 0-1]
(optional) Specifies the minimum percentage of reads in a column which must differ from the reference in order to score the column. For the Simple method, this is the only criterion used to call a SNP. For the Diploid and Haploid methods, this is a filter applied before the other parameters. Default is 0.05.

B22) snp_minProbNonrefToCall: [number from 0-1]
Specifies the minimum probability of a SNP column which is required to call a SNP. The probabilities of all genotypes other than Homozygous Reference are totaled and checked against this number. This is the final filter applied during the Diploid and Haploid SNP calling methods, and is ignored by the Simple method. Default is 0.1, requiring a minimum 10% change.

B23) snp_minStrandCov: [integer]
Specifies the minimum number of reads from each strand required to call a variant at a given position.

B24) snp_minVariantDepthToScore: [number from 0-100]
Specifies the minimum depth required for a specific base (or deletion) in a column before it is considered usable for SNP calling. This is the second filter applied during the Diploid and Haploid SNP calling methods, and is ignored by the Simple method. Default is 2.

B25) snp_minWeight: [number]
(optional) Specifies the minimum quality score for a base to be considered in the SNP calculation. Default is 5.

B26) snp_reportUserMissing: [kParamTypeStrFixedVocab] 
Specifies what kind of positions to put in the 'missingUser' file, including one or more of the following:

  'dbSNP' = dbSNP Pos    
  'user'= in user VCF SNP file
  'zeroCoverage' = include zero coverage regions
  'cosmic' = in COSMIC database 
  'allcaptured' = include all positions in capture regions
  'captured' = include only positions in capture regions

Example for 'snp_reportUserMissing':

  snp_reportUserMissing: [user allcaptured captured]

B27) snp_runVar: [true|false]
Uses a Bayesian probabilistic model to exclude heterozygous insertions and deletions in homopolymeric runs. Intended for use with Ion Torrent data.

B28) snp_showAllFeatures: [true|false]
(optional) Specifies whether XNG should count SNPs multiple times if the SNP contacts different versions (variants) of a CDS feature. Default is 'true.'

B29) snp_writeExtended: [true|false]
(optional) Specifies whether the additional values produced by the Haploid or Diploid SNP calculation methods are included in the SNP table. Default is 'true.'

B30) snp_writeMissingDBSnps: [true|false]
(optional) In a SNP assembly, specifying 'false' causes missing SNPs not to be recorded, saving time and file space. Default is 'true.'

B31) snpFilter: [true|false]

Specifies whether SNP filtering is turned on or off. 

***************

Properties for 'snpFilter':

  B31a) capture: [true|false]
(optional) Specifies whether there is an exome capture file. In the absence of an exome capture file, the SeqMan NGen wizard automatically sets this property to 'false.'

  B31b) pNotRefMinVal: [number]

  B31c) userOnly: [true|false]
(optional) Specifies whether there is a VCF SNP file. The SeqMan NGen wizard always sets this property to 'true,' but ignores it if no VCF SNP file has been loaded.

  B31d) pNotRef: [number]
(optional) Specifies a PnotRef threshold.

  B31e) minSnpFilter: [number]

  B31f) minDepth: [number]
(optional) Specifies a minimum sequence depth threshold.

B32) snpMethod: [simple|haploid|diploid|population]
(optional) Specifies the SNP detection method to use. Simple produces a count of each type of base in the column and calculates the percent of non-reference bases. Haploid uses a Bayesian statistical model to calculate a probability score that the position contains a polymorphism and give a quality score for the base called at that position. Diploid uses a Bayesian statistical model to calculate a probability score that the position contains a polymorphism and give a quality score for the base(s) called at that position. Based on the scores, it also calls the genotype at each position. Default is 'diploid.'

B33) userSNP: [directory/filename enclosed in quotes]
(optional) Specifies a location for storing the VCF SNP table. 

****************************************************************

C) createGenomeTemplate
(Intended for internal use only)

Parameters for 'createGenomeTemplate'

C1) file: [directory/filename enclosed in quotes]
Specifies the directory and file/folder of the input file.

C2) output: [directory/filename enclosed in quotes]
The path and name of the output file. 

****************************************************************

D) diskPath
(required) Defines the default directory where temporary intermediate files from the assembly will be stored. These files can be large for large-scale projects. Visit http://www.dnastar.com/t-sub-support-technical-reqs-seqman-ngen.aspx to view space requirements for a range of representative projects.

Parameters for 'diskPath':

D1) clean: [true|false]
Specifies whether or not to clean the merge disk. When automated scripts are being run simultaneously or sequentially, this command can be useful for emptying the merge disk between assemblies. Default is 'false.'

D2) pathMac: [directory/filename enclosed in quotes]
Specifies the default path and file name for Macintosh. 

D3) pathWin: [directory/filename enclosed in quotes]
Specifies the default path and file name for Windows. 

D4) path: [directory/filename enclosed in quotes]
(required) Specifies the default path and file name. 

Example for 'diskPath':

  diskPath
   path: "/data/proj/"
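
A second example for 'diskPath' (an illustrative sketch only; it reuses the placeholder path above and adds the 'clean' parameter so that the merge disk is emptied between assemblies):

  ; placeholder path; clean: true empties the merge disk
  diskPath
   path: "/data/proj/"
   clean: true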

****************************************************************

E) dumpConsensus
(Intended for internal use only). Converts the binary consensus file created during assembly into a text file.

Parameters for 'dumpConsensus':

E1) file: [directory/filename enclosed in quotes]
Specifies the directory and file/folder. 

****************************************************************

F) dumpSNP
(Intended for internal use only). Creates a tab-delimited text file from one or more SNP-containing binary files generated during assembly. SNP binary files include those with the .snpExt suffix contained in an .assembly package as well as those with either the .coverage.missingSNP or .nocoverage.missingSNP suffix contained in the _shared folder. To convert all the .snpExt files in a package, simply use the .assembly name.

Parameters for 'dumpSNP':

F1) file: [directory/filename enclosed in quotes]
(required) Specifies the path and name of .assembly package (all SNP files will be included), one or more individual .snpExt files or either/both of the missingSNP files. 

F2) output: [directory/filename enclosed in quotes]
(required) Specifies the path and name of the output file. 

F3) refPos_end: [number]

F4) refPos_start: [number]

F5) snp_maxProbNonrefToCall: [number]

F6) snp_minProbNonrefToCall: [number]

F7) snp_type: [simple|SNP|missing|user]
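
Example for 'dumpSNP' (an illustrative sketch only; both paths are hypothetical placeholders, and only the two required parameters are shown):

   ; hypothetical paths; converts every .snpExt file in the package
   dumpSNP
      file: "/data/proj/sample1.assembly"
      output: "/data/proj/sample1_snps.txt"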

****************************************************************

G) dumpSplits
(Intended for internal use only). Converts the binary splits file created during assembly into a text file.

Parameters for 'dumpSplits':

G1) file: [directory/filename enclosed in quotes]
Specifies the directory and file/folder. 

G2) output: [directory/filename enclosed in quotes]
The path and name of the output file. 

****************************************************************

H) execute
(optional) Executes any shell script command.

Parameters for 'execute':

H1) command: [text string]
Text for any shell script command.

****************************************************************

I) extractPairs
(optional) Creates a tab-delimited table of paired-end information.

Parameters for 'extractPairs':

I1) file: [directory/filename enclosed in quotes]
The path and name of any pair distance file (.pairdist file) from within a project's shared folder. 

I2) output: [directory/filename enclosed in quotes]
The path and name of the output file. 
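
Example for 'extractPairs' (an illustrative sketch only; the folder and file names are hypothetical placeholders for a .pairdist file and the table to be written):

   ; hypothetical paths
   extractPairs
      file: "/data/proj/_shared/sample1.pairdist"
      output: "/data/proj/sample1_pairs.txt"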

****************************************************************

J) include
(optional) When building a script, this command can be used to call up additional lines of script previously stored in a text file. In this way, a group of commands can be shared between two or more scripts.

Parameters for 'include':

J1) file: [directory/filename enclosed in quotes]
Specifies the directory and file/folder. 
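
Example for 'include' (an illustrative sketch only; the file name is a hypothetical placeholder for a text file holding script lines shared between scripts):

   ; hypothetical file of shared script lines
   include
      file: "/data/proj/shared_settings.script"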

****************************************************************

K) loadAssembly
(Intended for internal use only)

Parameters for 'loadAssembly':

K1) file: [directory/filename enclosed in quotes]
Specifies the directory and file/folder. 

****************************************************************

L) loadBAM
(optional) Sets parameters for analyzing existing BAM files. It allows ungapped BAM files to be converted into a fully gapped assembly file or to re-gap an existing file with different parameters. The command also permits SNPs to be calculated or re-calculated with different parameters starting with an existing BAM file. The associated parameters are also available for full assemblies under the 'assembleTemplate' command.

Parameters for 'loadBAM':

L1) align: [true|false]
(optional) Specifies whether a gapped alignment will be done. Default is 'false.'

L2) format: [BAM|SQD|none]
(optional) Specifies the format of the alignment output file. If 'none' is entered, the assembly will be run including the alignment phase, but no alignment output is generated. This can be used to remove reads from a contaminant source. Default is 'BAM.'

L3) gapPenalty: [number]
The penalty for opening or extending a gap during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. A high gap penalty suppresses gapping, while a low value promotes gapping. Default is 30.

L4) layout: [directory/filename enclosed in quotes]
(required) Specifies the path and name of the BAM file. 

L5) matchScore: [number]
The score for a base match during an alignment. This score contributes to the pairwise score used to calculate match percentage. Increasing the matchScore value will allow for longer or more frequent gaps, thus forcing bases that match to be assembled together. Default is 10.

L6) minAlignedLength: [number]
The minimum length of aligned sequence that must be attained between the read and reference for the read to be included in the assembly. Default is 25.

L7) minMatchPercent: [number]
The minimum percentage of matches in an overlap required to join two sequences in the same contig. Default is 93.

L8) mismatchPenalty: [number]
The penalty for a base mismatch during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. Default is 20.

L9) output: [directory/filename enclosed in quotes]
(required) Specifies the path and directory of the output files. 

L10) snp: [true|false]
(optional) Specifies whether a SNP detection pass of the gapped alignment is made during the assembly. Default is 'false.'

L11) snp_checkStrandedness: [true|false]
(optional) Specifies whether the strand that each read comes from is considered in the SNP calculation. This is ignored by the Simple method. Default is 'false.'

L12) snp_limitEndPos: [number]
(optional) Specifies the 3' most coordinate of the specified template from which to stop calculating SNPs. A value between 1 and the length of the template must be entered. Default is 1.

L13) snp_limitStartPos: [number]
(optional) Specifies the 5' most coordinate of the specified template from which to begin calculating SNPs. A value between 1 and the length of the template must be entered. Default is 1.

L14) snp_limitTemplateID: [number]
(optional) Specifies a single template ID for which to calculate SNPs. By default, counting begins from 0.

L15) snp_logEndPos: [number]
(optional) Specifies the 3' most coordinate of the specified template from which to stop storing a detailed log of SNP information. A value between 1 and the length of the template must be entered. Default is 1.

L16) snp_logLevel: [number]
(optional) Specifies the level of detailed logging to store in the "shared" project directory as "SNP.log". Level 0 specifies that no log will be stored. Level 1 stores detailed info on the SNPs which were called, level 2 also logs columns where the preliminary filter passed but the final filtering failed, and level 3 logs all columns. This is ignored by the simple SNP calling method. Default is 0.

L17) snp_logStartPos: [number]
(optional) Specifies the 5' most coordinate of the specified template from which to begin storing a detailed log of SNP information. A value between 1 and the length of the template must be entered. Default is 1.

L18) snp_logTemplateID: [number]
(optional) Specifies a single template from which to store a detailed log of SNP information. By default, counting begins at 0.

L19) snp_minPctToScore: [number from 0-1]
(optional) Specifies the minimum percentage of reads in a column which must differ from the reference in order to score the column. For the Simple method, this is the only criterion used to call a SNP. For the Diploid and Haploid methods, this is a filter applied before the other parameters. Default is 0.05.

L20) snp_minProbNonrefToCall: [number from 0-1]
(optional) Specifies the minimum probability of a SNP column which is required to call a SNP. The probabilities of all genotypes other than Homozygous Reference are totaled and checked against this number. This is the final filter applied during the Diploid and Haploid SNP calling methods, and is ignored by the Simple method. Default is 0.1, requiring a minimum 10% change.

L21) snp_minVariantDepthToScore: [number]
Specifies the minimum depth required for a specific base (or deletion) in a column before it is considered usable for SNP calling. This is the second filter applied during the Diploid and Haploid SNP calling methods, and is ignored by the Simple method. Default is 2.

L22) snp_minWeight: [number]
(optional) Specifies the minimum quality score for a base to be considered in the SNP calculation. Default is 5.

L23) snp_writeExtended: [true|false]
(optional) Specifies whether the additional values produced by the Haploid or Diploid SNP calculation methods are included in the SNP table. Default is 'true.'

L24) snpMethod: [simple|haploid|diploid|population]
(optional) Specifies the SNP detection method to use. Simple produces a count of each type of base in the column and calculates the percent of non-reference bases. Haploid uses a Bayesian statistical model to calculate a probability score that the position contains a polymorphism and give a quality score for the base called at that position. Diploid uses a Bayesian statistical model to calculate a probability score that the position contains a polymorphism and give a quality score for the base(s) called at that position. Based on the scores, it also calls the genotype at each position. Default is 'diploid.'

L25) template: [directory/filename enclosed in quotes]
(required) Specifies the path and name of the reference sequence file(s). 
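
Example for 'loadBAM' (an illustrative sketch only; all paths are hypothetical placeholders, and the parameter values are arbitrary choices rather than recommendations). It re-gaps an existing BAM file and recalculates SNPs with the diploid method:

   ; hypothetical paths; align and snp both default to 'false'
   loadBAM
      layout: "/data/proj/sample1.bam"
      template: "/data/proj/reference.gbk"
      output: "/data/proj/sample1_regapped"
      align: true
      snp: true
      snpMethod: diploid
      snp_minVariantDepthToScore: 3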

****************************************************************

M) mergeIonTorrentShortReads
(optional) When using Ion Torrent data, use of this command merges overlapping short reads into mini-contigs.

Parameters for 'mergeIonTorrentShortReads':

M1) output: [directory/filename enclosed in quotes]
(required) Specifies the path and directory of the output files. 

M2) query: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name(s) of the query data to be assembled. A folder with one or more data files can also be used in place of individual file names. 
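
Example for 'mergeIonTorrentShortReads' (an illustrative sketch only; both paths are hypothetical placeholders, with a folder of read files used as the query):

   ; hypothetical paths
   mergeIonTorrentShortReads
      query: "/data/proj/iontorrent_reads/"
      output: "/data/proj/merged/"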

****************************************************************

N) pairFilePattern
Allows you to specify the pattern for pair files using the GREP language.

Parameters for 'pairFilePattern':

N1) forward: [text string enclosed in quotes]
A naming pattern to match forward clones. 

N2) reverse: [text string enclosed in quotes]
A naming pattern to match reverse clones. 

Example for 'pairFilePattern':

     pairFilePattern
          forward: "(?'name'.*)_R1_(?'ext'.*)\fastq"
          reverse: "(?'name'.*)_R2_(?'ext'.*)\fastq"

****************************************************************

O) pause
(optional) Creates a pause and can be used when running table scripts to stop at any point.

Parameters for 'pause':

O1) prompt : [text string enclosed in quotes]
Text to appear in the console. The pause is terminated by hitting the Enter key. 

Example for 'pause':

   pause 
      prompt: "Table script paused. Press enter to continue."

****************************************************************

P) quit
(optional) Terminates a script.

****************************************************************

Q) RemoveDuplicateSeqs [true|false]
Coalesces multiple identical reads at the same position into a single read, provided the reads match the template exactly. If this feature is active, at the end of assembly, XNG will print the message: "Coalesced $lld identical reads that matched the template exactly." Default is 'false.'

****************************************************************

R) runScript
(optional) Allows batching of multiple projects of the same type (e.g. assembly, computeSNPs). Three files are required: 1) a runScript file with variables, 2) a file with a table of values for the variables, and 3) a script file specifying the action to be carried out.

Parameters for 'runScript':

R1) script: [directory/filename enclosed in quotes]
The filename and location of the script. 

R2) table: [directory/filename enclosed in quotes]
The filename and location of the file containing the text string and numeric values for each variable. 

Example for 'runScript' (runScript file):

   setDefaultDirectory directory: "."
      set $force: false
      set $DataDisk: "/Volumes/Raid/DataDisk"
      set $ResultDisk: "/Volumes/ResultDisk"
      set $MergeDisk: "/Volumes/MergeDisk0"
      set $snp:true
      set $snpMethod:"Diploid"
      set $repCnt:100
      set $merLayoutMin:19
      diskPath path: {"${MergeDisk}/mergeSort Data"}
      runScript table: "testAssembly.txt" script: "testAssembly.template.script"

Example for 'runScript' (table file):

defaultDir	template	query	isPair	seqTech	project	merSize	snp	snpMethod
"${ResultDisk}/rice"	${DataDisk}/rice.genome	${DataDisk}/rice 	FALSE	Illumina	rice	21	TRUE	Diploid
"${ResultDisk}/ecoli"	${DataDisk}/Ecoli.gbk	${DataDisk}/ecoli	TRUE	Illumina	Ecoli	21	TRUE	Diploid
"${ResultDisk}/Exome"	${DataDisk}/GRCh37.gbk	${DataDisk}/Sample1	FALSE	454	HuEx	19	TRUE	Diploid

Example for 'runScript' (script file):

   ; "assembly.template.script"
      setMachineMemory memory:32
      setDefaultDirectory directory:	$defaultDir
      compareSeqs template:	$template
      query:	{file: $query 
      isPair: $isPair 
      seqTech: $seqTech}
      directoryMer:	"intermediateFiles"
   ; directoryQueryMer: "intermediateFiles"
      hits: "intermediateFiles/${project}.hits"
      layout:	"intermediateFiles/${project}.layout"
      output:	"results_${mersize}_${merSkipQuery}/${project}"
   ; results per project 
   ; results: "${project}.results.txt"
   ; aggregate all results 
      results: "${ResultDisk}/assembly.results.txt"
      merSize: $mersize
      merSkipQuery: $merSkipQuery
      repeatCnt: $repCnt
      merLayoutMin: $merLayoutMin
      layoutType: once
      maxGap: 6
      format: BAM
      onePackage: true
      snp: $snp
      snpMethod: $snpMethod
   ; snp_writeExtended: true
      forceMake: $force

****************************************************************

S) set
(optional) Used to set variables. See the example below and those under the 'runScript' command.

Example for 'set':

   set $snp:true
   set $snpMethod:"Diploid"

****************************************************************

T) setDefaultDirectory
(required) Defines the default directory for the project. When a default directory is specified, files located in that directory only need to be identified by their subfolder and/or file name in subsequent commands.

Parameters for 'setDefaultDirectory':

T1) directory: [directory/filename enclosed in quotes]
(required) Specifies the default directory. Previously called 'defaultDirectory.'

T2) directoryMac: [directory/filename enclosed in quotes]
Specifies the default directory for Macintosh. Previously called 'defaultMacDirectory.'

T3) directoryWin: [directory/filename enclosed in quotes]
Specifies the default directory for Windows. Previously called 'defaultWinDirectory.'

Example for 'setDefaultDirectory':

   setDefaultDirectory
      directory: "/data/home/proj/"

****************************************************************

U) setMachineMemory
(optional) Defines the amount of random access memory (RAM) that the program will use. Limiting the amount of RAM available to the assembler allows you to use the computer for other purposes while an assembly is running. However, this will likely slow down the assemblies and is not recommended for large projects. 

Parameters for 'setMachineMemory':

U1) memory: [number that is a multiple of 4]
(required) Amount of RAM (in GB) to be used, entered in multiples of four. Entering a value greater than the available RAM causes all RAM to be used. There is no default value.

Example for 'setMachineMemory':

   setMachineMemory
      memory: 32

****************************************************************

V) setParam
Allows you to adjust the stringency of one or more of the assembling parameters for the project. SeqMan NGen will use the default values for any parameter that is not specified within the script.

Parameters for 'setParam':

V1) gapPenalty: [number]
The penalty for opening or extending a gap during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. A high gap penalty suppresses gapping, while a low value promotes gapping. Default is 30.

V2) matchScore: [number]
The score for a base match during an alignment. This score contributes to the pairwise score used to calculate match percentage. Increasing the matchScore value will allow for longer or more frequent gaps, thus forcing bases that match to be assembled together. Default is 10.

V3) minAlignedLength: [number]
The minimum length of aligned sequence that must be attained between the read and reference for the read to be included in the assembly. Default is 25.

V4) minMatchPercent: [number]
The minimum percentage of matches in an overlap required to join two sequences in the same contig. Default is 93.

V5) mismatchPenalty: [number]
The penalty for a base mismatch during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. Default is 20.
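
Example for 'setParam' (an illustrative sketch only; the values are arbitrary and are not recommendations). Parameters that are not listed keep their defaults:

   ; arbitrary values; stricter matching and less gapping than the defaults
   setParam
      minMatchPercent: 95
      gapPenalty: 40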

****************************************************************

IV. SNG Commands

Note: To see how SNG commands and parameters map to equivalent SeqMan NGen wizard settings, open the appendix of the SeqMan NGen help (http://www.dnastar.com/t-help-seqman-ngen.aspx) and select the topic "Equivalence Between Wizard Settings and SNG Scripting Commands."

Part I. Project Management Commands

A) closeProject
(optional) Closes the current project and frees the memory in use so that the system is ready for additional assemblies. This can be useful if you want to run multiple assemblies in one script.

****************************************************************

B) runScript
(optional) Allows you to run a table script within the current script. A table script references variable values for specified parameters and other elements in a script. This enables you to run multiple projects from the same script, substituting new parameter values and other variables each time. SeqMan NGen will run the table script repeatedly, using the variable values from one row of the table for each iteration of the script until all of the rows have been used. For more information, see the Using Table Scripts in SeqMan NGen section.

Parameters for 'runScript':

B1) file: [directory/filename enclosed in quotes]
Specifies the directory and file/folder. 

B2) script: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name of the table script you wish to run. 

B3) table: [directory/filename enclosed in quotes]
(required) Specifies the delimited text file containing the variable values. 

Example for 'runScript':

   runScript 
      script: "/Library/abc_Project/abc_script.script"
      table: "/Library/abc_Project/table.txt"

****************************************************************

C) saveProject
This command saves the assembly to a project file. By default, the SeqMan Pro project file format (*.sqd) is used. Phrap (*.ace) and FASTA (*.fas) formats may also be specified by using the format parameter, and specifying the desired file extension using the file parameter.

Note: As a command-line tool, SeqMan NGen will not prompt you if you try to save a new project file with the same name as an existing file in the same location. When you run a script multiple times, be sure to change the file name of the project to be saved each time to prevent existing project files from being overwritten.

Parameters for 'saveProject':

C1) file: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name of the project file to be saved. 

C2) format: [SeqMan|SeqMan8|SeqMan7|Phrap|Fasta|BAM|SAM]
(optional) Specifies the output file format. Default is 'SeqMan.'

SeqMan       Saves a 64-bit SeqMan Pro project file (*.sqd) that is compatible with SeqMan Pro version 8.1 and higher (default).

SeqMan8     Saves a 32-bit SeqMan Pro project file (*.sqd) that is compatible with SeqMan Pro version 8.0 and higher.

SeqMan7     Saves a 32-bit SeqMan Pro project file (*.sqd) that is compatible with SeqMan Pro version 7.2 and higher. Note that this project file will be much bigger than the same project created in either of the SeqMan formats listed above.

Phrap           Saves an .ace file.
Fasta            Saves .fas and .qual files of the consensus sequence for each contig. 
BAM             Saves a BAM file (SNG/SMNG templated assemblies only).
SAM             Saves a SAM file (SNG/SMNG templated assemblies only).

C3) onePackage: [true|false]
(optional) Specifies whether an assembly containing multiple reference sequences should be bundled into a single .assembly package. If 'false' is entered, one .assembly package is created per contig. Default is 'true.'

C4) openInSeqMan: [true|false]
(not available for Linux users) Specifies whether to automatically launch SeqMan Pro and open the completed assembly once the script has completed. Default is 'true.'	

Example for 'SaveProject':

   SaveProject 
      file: "/Library/My projects/ABC_project.sqd" 
      format:seqman
      openInSeqMan:true

****************************************************************

D) saveReport
(optional) Exports a report as a text file that summarizes assembly statistics, including the parameters used, the number of assembled/unassembled sequences and contigs, average quality scores, and the number of sequences excluded from the assembly due to exceeding the maxAssemblyCoverage parameter. The same information contained within this report is also saved within the SeqMan Pro project file (*.sqd) regardless of whether you choose to export the report by setting this parameter. The report can be viewed in SeqMan Pro using the Project>Report command.

Parameters for 'saveReport':

D1) file: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name of the report to be saved. 

Example for 'saveReport':

   saveReport 
      file: "/Library/abc_Project/abc_report.txt"

****************************************************************

E) WriteUnassembledSeqs
(optional) Saves all sequences that were not assembled in the project as *.fas and *.qual files.

Parameters for 'WriteUnassembledSeqs':

E1) file: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name of the unassembled sequences to be saved. 

E2) saveTrimmed: [true|false]
Specifies whether to save only the trimmed portion of the unassembled sequences. Default is 'false.'
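
Example for 'WriteUnassembledSeqs' (an illustrative sketch only; the path is a hypothetical placeholder modeled on the other examples in this section):

   WriteUnassembledSeqs 
      file: "/Library/abc_Project/hypothetical_unassembled.fas"
      saveTrimmed: true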

Part II. File Loading Commands and Parameters

F) load454PairedEnd
Loads a file of Roche 454 sequences and checks for the presence of a linker defining the paired end sequences. If the linker is found, the linker is removed and the remaining portion is split into two sequences linked with a paired end constraint.

Parameters for 'load454PairedEnd':

F1) DiscardLinkerless: [true|false]
Specifies whether to discard any read where no portion of the mate pair linker was found. In this way, reads that do not have a linker sequence will be discarded from the assembly. Default is 'false.'

F2) file: [directory/filename enclosed in quotes]
The directory and file name of the .fas, .fna, or .sff file containing the 454 sequences. 

F3) linker: [directory/filename enclosed in quotes]
The directory and file name of the .fas, .fna, or .sff file containing the 454 linker sequences. If not specified, SeqMan NGen will use its default 454 linker sequence: GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC

F4) max: [number]
The maximum distance for the paired end constraint. Default is 10000. (Also called 'maxDistance').

F5) min: [number]
The minimum distance for the paired end constraint. Default is 0. (Also called 'minDistance').

Example for 'load454PairedEnd':

   load454PairedEnd 
      file: "/Library/454 data/123_Pairedend.fas"
      linker: "/Library/454 data/123_linkerseqs.fas"
      min: 0
      max: 10000
      DiscardLinkerless: false

****************************************************************

G) LoadConstraint
Loads a constraint file. The file can be in the NCBI ancillary file format, or in the CAP3 constraint file format. SeqMan NGen uses constraint files to identify paired end reads, similar to using the 'setPairSpecifier' command. Constraint files in the NCBI ancillary file format also contain trimming information, which SeqMan NGen will load and use. SeqMan NGen will create a CAP3 file when saving a Phrap project (*.ace) that used paired end constraints.

Parameters for 'LoadConstraint':

G1) file: [directory/filename enclosed in quotes]
The directory and file name of the constraint sequence file. 

Example for 'LoadConstraint':

   loadConstraint 
      file: "/Library/constraints/123_xyz.con"

****************************************************************

H) LoadContaminant
Loads a contaminant sequence file to be used to identify known contaminants, such as primers, in the assembly. Sequences that contain at least 12 matching 17-mers are flagged as contaminant sequences and will be removed from the assembly. See our website (http://www.dnastar.com/t-smgafileformats.aspx) for a list of supported file types.

Parameters for 'LoadContaminant':

H1) file: [directory/filename enclosed in quotes]
The directory and file name of the contaminant sequence file. A folder may also be specified, in which case all of the sequence files within that folder will be loaded and used for contaminant screening. 

Example for 'loadContaminant':

   loadContaminant 
      file: "/Library/contaminants/123_abc.seq"

****************************************************************

I) loadLayout
Loads a layout file to be used for an assembly. The format may be either a SOLiD General Feature Format file (*.gff) or a File of Filenames file (*.fof). When this command is used, SeqMan NGen still aligns each read from the file to the template, but uses the information contained within the specified file to determine the overall layout of reads.

Parameters for 'loadLayout':

I1) layoutFile: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name of the layout file. Both *.gff and *.fof formats are accepted. 

I2) templateFile: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name of the template file. 

Example for 'loadLayout':

   loadLayout
      templateFile: "/Library/123_project/template.seq"
      layoutFile: "/Library/123_project/layoutfile.gff"

****************************************************************

J) LoadRepeat
Loads a sequence file to be used to identify repeat sequences in the assembly. All sequences identified as repeats will be added to the assembly last, after all non-repeats have been assembled. See our website (http://www.dnastar.com/t-smgafileformats.aspx) for a list of supported file types.

Parameters for 'LoadRepeat':

J1) file: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name of the repeat sequence file. A folder may also be specified, in which case all of the sequence files within that folder will be loaded and used as repetitive sequences. 

Example for 'loadRepeat':

   loadRepeat 
      file: "/Library/repetitive_seqs/123_repeat.seq"

****************************************************************

K) loadSeq
Loads a sequence file or files for assembly. See our website (http://www.dnastar.com/t-smgafileformats.aspx) for a list of supported file types.

Parameters for 'loadSeq':

K1) blockContig: [text string]
(optional) Used in the reference-guided workflow.

K2) blockName: [text string]
(optional) Used in the reference-guided workflow.

K3) blockPos: [number]
(optional) Used in the reference-guided workflow.

K4) DiscardLinkerless: [true|false]
Specifies whether reads that do not have a linker sequence should be discarded from the assembly. Default is 'false.'

K5) file: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name of the sequence file(s) to be loaded. A folder may also be specified, in which case all of the sequence files within that folder will be loaded. 

K6) groupName: [text string]
Used to identify the multi-sample group name for a read file.

K7) isPair: [true|false]
(optional) Specifies whether the query files contain paired end data. Default is 'false.'

K8) linker: [directory/filename enclosed in quotes]
The directory and file name of the .fas, .fna, or .sff file containing the 454 linker sequences. If not specified, SeqMan NGen will use its default 454 linker sequence: GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC

K9) max: [number]
The maximum distance for the paired end constraint. Default is 10000.

K10) maxSeqs: [number]
Specifies the maximum number of reads to load from a file. There is no default value.

K11) mergePairs: [true|false]
Specifies whether the reads are paired end data that overlap and should therefore be merged. Default is 'false.'

K12) min: [number]
The minimum distance for the paired end constraint. Default is 0.

K13) multi-sample: [true|false]
Specifies whether reads are from a multi-sample run. Default is 'false.'

K14) seqTech: [normalScore|IonTorrent|SOLiD|Illumina|454|unknown]
(optional) Specifies the offset to be used when converting compressed quality scores into numerical values. The following offsets are used for each technology:

Technology    Offset
normalScore   33
IonTorrent    33
SOLiD         33
Illumina      64
454           33 (quality scores for homopolymeric runs of >= 2 are oriented from 5' to 3' on the top strand)
unknown       determined automatically based on the first data file

K15) templateFragment: [number]
(optional) Used in reference-guided assemblies with gap closure.

Example for 'loadSeq':

   loadSeq 
      file: "/Library/ABC_project/ABC_sequences.fas"
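
A second example for 'loadSeq', sketching how a file of paired end reads might be loaded using the 'isPair,' 'min' and 'max' parameters described above. The file path and the distance values are hypothetical and should be replaced with values appropriate to your library:

   loadSeq
      file: "/Library/ABC_project/ABC_paired_reads.fastq"
      isPair:true
      min:200
      max:600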

****************************************************************

L) LoadTemplate
Loads a sequence file to be used as the template against which all other sequences are assembled. The template sequence will be displayed as a "reference" sequence in SeqMan Pro for SNP analysis. See our website (http://www.dnastar.com/t-smgafileformats.aspx) for a list of supported file types.

Parameters for 'LoadTemplate':

L1) file: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name of the template sequence file to be loaded. A folder may also be specified, in which case all of the sequence files within that folder will be loaded and treated as template sequences. 

Example for 'loadTemplate':

   loadTemplate
      file: "/Library/abc_Project/abc_template.seq"

****************************************************************

M) LoadVector
Loads a vector sequence file to be used for vector trimming. See our website (http://www.dnastar.com/t-smgafileformats.aspx) for a list of supported file types.

Parameters for 'LoadVector':

M1) cloneSite: [number]
This parameter specifies the position of the cloning site on the vector where insertion occurs. There is no default value.

M2) file: [directory/filename enclosed in quotes]
(required) Specifies the directory and file name of the vector sequence file to be used for vector trimming. 

Example for 'loadVector':

   loadVector 
      file: "/Library/vectors/123_vector.seq"
      cloneSite:826

****************************************************************

N) setDefaultDirectory
(required) Defines the default directory for the project. When a default directory is specified, files located in that directory only need to be identified by their subfolder and/or file name in subsequent commands.

Parameters for 'setDefaultDirectory':

N1) directory: [directory/filename enclosed in quotes]
(required) Specifies the default directory. Previously called 'defaultDirectory.'

N2) directoryMac: [directory/filename enclosed in quotes]
Specifies the default directory for Macintosh. Previously called 'defaultMacDirectory.'

N3) directoryWin: [directory/filename enclosed in quotes]
Specifies the default directory for Windows. Previously called 'defaultWinDirectory.'

Examples for 'setDefaultDirectory':

   setDefaultDirectory: "/Library/ABC_proj/"

Once you have set a default directory, you may begin a file path with two periods (..) to specify that the file you wish to use is located in the parent folder of the default directory you specified.

   loadVector file: "../123Vector.fas"

This specifies that the vector file, 123Vector.fas, is located in the parent folder of the default directory.

Part III. Parameter Settings Commands

O) setContaminantParam
Allows you to adjust the parameters used for scanning for contaminant sequences. In order to be applied, this command must appear in the script before the 'loadContaminant' command, and the 'contamScan' parameter for the 'assemble' command must be set to 'true.'

Parameters for 'setContaminantParam':

O1) MerLength: [number from 5-50]
The minimum length of a mer required to be considered an exact match when scanning for contaminants. Default is 17.

O2) MinMerMatch: [number from 1-50]
The minimum number of matching mers required to mark the sequence as a contaminant. Default is 12.

Example for 'setContaminantParam':

   setContaminantParam MerLength:17
   setContaminantParam MinMerMatch:12

****************************************************************

P) setParam
Allows you to adjust the stringency of one or more of the assembling parameters for the project. SeqMan NGen will use the default values for any parameter that is not specified within the script.

Parameters for 'setParam':

P1) AllowConstraintBased: [true|false]
Specifies whether the assembler should use constraints during assembly. Default is 'true.'

P2) AssembleBoneyard: [true|false]
Specifies whether, after a templated assembly has been completed, the unassembled sequences remaining should be assembled into contigs. If the template has been split, SeqMan NGen will attempt to join the split contigs together in new arrangements. Default is 'false.'

P3) CoverageType: [genome|fixed]  
Specifies the type of coverage to be used for repeat handling. 'Genome' uses the length of the genome being assembled to calculate the expected coverage. 'Fixed' uses a fixed value as the expected coverage. If you know the length of the genome/fragment being assembled, we recommend using 'genome' for this parameter and then specifying the length using the 'genomeLength' parameter. If you do not know the genome/fragment length, use 'fixed' and provide the most accurate estimate of expected coverage for the 'FixedCoverage' value. Default is 'genome.' (Note: this parameter was called "Coverage" prior to SeqMan NGen 2.0.)

P4) DefaultQuality: [number from 5-100]
The value used for the base quality of sequences without quality scores. Default is 15.

P5) FixedCoverage: [number from 1-65535]
The estimated depth of the sequencing, which can be used instead of the genome length for repeat handling. Use caution when estimating the value for fixedCoverage. If the value you use is significantly lower than the actual depth, the assembly may take a much longer time to complete and may have too many mers flagged as repeats. Default is 20.

P6) GapPenalty: [number from 0-1000]
The penalty for opening or extending a gap during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. A high gap penalty suppresses gapping, while a low value promotes gapping. Default is 30.

P7) GenomeLength: [number from 0-10^15]
Specifies the length of the genome or fragment being assembled. This is used to calculate expected coverage in determining repeat handling. Default is 0. (Note: this parameter was called "setGenomeParam" prior to SeqMan NGen 2.0.)

P8) HaploidSNP: [true|false]
Specifies whether to use the second most common base at a position when performing SNP passes. (See the 'snpPasses' parameter). Using this parameter will increase the SNP percentage for SNPs occurring on one allele of a diploid genome in a templated assembly. When haploidSNP is set to 'true', the lowCoverageThreshold parameter value should be greater than zero. Default is 'false.'

P9) HaploidThreshold: [number from 0-100]
The minimum number of times that the second most common base must occur at a position in order for it to be used to find SNPs during haploid SNP passes. (See the haploidSNP parameter above). Default is 0.

P10) LowCoverageThreshold: [number from 0-10000]
The minimum coverage required in an assembly to be excluded from SNP passes. SeqMan NGen will include regions in an assembly that have coverage less than the value specified as well as regions with zero coverage when it performs SNP passes. (See the snpPasses parameter). Default is 0.

P11) MatchRepeatPercent: [number from 100-1000]
The percent frequency a mer occurs compared to its expected frequency. Mers exceeding this value are flagged as repeated and not used as mer tags in determining overlaps. Default is 150. (Note: this parameter was called "maxCoverageRatio" prior to SeqMan NGen 2.0.)

P12) MatchScore: [number from 1-1000]
The score for a base match during an alignment. This score contributes to the pairwise score used to calculate match percentage. Increasing the matchScore value will allow for longer or more frequent gaps, thus forcing bases that match to be assembled together. Default is 10.
	
P13) MatchSize: [odd whole number]
The minimum number of matching consecutive bases required to determine the overlap of sequence reads. If an even number is entered, SeqMan NGen will automatically increase the value to the next odd number. Default is 21. (Note: this parameter was called "setParam MerLength" prior to SeqMan NGen 2.0.)

P14) MatchSpacing: [number from 1-1000000]
The length of the window of a sequence read where at least one mer tag will be chosen. Default is 50. (Note: this parameter was called "merTagWindow" prior to SeqMan NGen 2.0.)

P15) MatchWindowLength: [number from 10-1000]
The size of the window used to calculate the match percentage. Default is 50.

P16) MaxAssemblyCoverage: [number from 0-65535]
The maximum depth of coverage allowed in the templated assembly. SeqMan NGen will not exceed the coverage specified by this threshold. This parameter is only available for templated assemblies, and should be used with caution as it will limit the number of sequences included in the assembly. The default value of 0 indicates unlimited coverage.

P17) MaxContigs: [number]
The maximum number of contigs to write to an .assembly project. This command is not generally needed due to SeqMan's capacity to handle a very large number of contigs. There is no default value.

P18) MaxGap: [number from 0-1000]
The maximum number of gaps allowed per 1000 bases in the alignment. Default is 6.

P19) MaxUsableCount: [number from 1-65535]
Any mers occurring more frequently than FixedCoverage multiplied by MaxUsableCount are disregarded as mer tags from the assembly. Default is 25.

P20) MinContigSeqs: [number from 0-10000]
The minimum number of sequences in a contig. After an assembly has been completed, any contigs without a template sequence will be disassembled if they contain fewer sequences than the number specified. The use of this parameter is recommended when performing de novo assemblies using data from Next Generation sequencing technologies, such as Illumina, as these types of assemblies can produce tens of thousands of very small contigs. Default is 0.

P21) Minimizer: [number]
(Intended for internal use only). An experimental way of choosing mer tags that may save time and memory. The accuracy of this parameter has not been verified by DNASTAR.

P22) MinMatchPercent: [number from 0-100]
The minimum percentage of matches in an overlap required to join two sequences in the same contig. Default is 93. (Note: this parameter was called "minMatchPercentage" prior to SeqMan NGen 2.0.)

P23) MismatchPenalty: [number from 0-1000]
The penalty for a base mismatch during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. Default is 20.

P24) SkipRealign: [true|false]
This parameter only affects de novo assemblies, and specifies whether to skip the realignment step of the assembly. When it is not skipped, the realignment step analyzes each sequence at the nucleotide level to determine the exact position of each sequence in the alignment. Default is 'false.'

P25) SNP: [true|false]
(optional) Specifies whether a SNP detection pass of the gapped alignment is made during the assembly. Default is 'true.'

P26) snp_checkStrandedness: [true|false]
(optional) Specifies whether the strand that each read comes from is considered in the SNP calculation. This is ignored by the Simple method. Default is 'false.'

P27) snp_minPctToScore: [number from 0-1]
(optional) Specifies the minimum percentage of reads in a column which must differ from the reference in order to score the column. For the Simple method, this is the only criterion used to call a SNP. For the Diploid and Haploid methods, this is a filter applied before the other parameters. Default is 0.05.

P28) snp_minProbNonrefToCall: [number from 0-1]
(optional) Specifies the minimum probability of a SNP column which is required to call a SNP, expressed as a number between 0 and 1. The probabilities of all genotypes other than Homozygous Reference are totaled and checked against this number. This is the final filter applied during the Diploid and Haploid SNP calling methods, and is ignored by the Simple method. Default is 0.1, requiring a minimum 10% chance.

P29) snp_minVariantDepthToScore: [number from 0-100]  
(required if "snp" is true) Specifies the minimum depth required for a specific base (or deletion) in a column before it is considered usable for SNP calling. This is the second filter applied during the Diploid and Haploid SNP calling methods, and is ignored by the Simple method. Default is 2.

P30) snp_minWeight: [number]
(optional) Specifies the minimum quality score for a base to be considered in the SNP calculation. Default is 5.

P31) SNPMatchPercentage: [number from 0-100]
The minimum match percentage required during passes to fill in SNP regions. See the snpPasses parameter. Default is 90.

P32) snpMethod: [simple|haploid|diploid|population]
(optional) Specifies the SNP detection method to use. Simple produces a count of each type of base in the column and calculates the percent of non-reference bases. Haploid uses a Bayesian statistical model to calculate a probability score that the position contains a polymorphism and gives a quality score for the base called at that position. Diploid uses a Bayesian statistical model to calculate a probability score that the position contains a polymorphism and gives a quality score for the base(s) called at that position. Based on the scores, it also calls the genotype at each position. Default is 'diploid.'

P33) SNPPasses: [number from 0-10]
The number of times SeqMan NGen will cycle through a templated assembly, attempting to fill in regions with low coverage or no coverage due to SNPs. Default is 2.

P34) SplitFalseJoins: [true|false]
Specifies whether the assembler should identify and split false joins based on the set of false join parameters indicated. Default is 'false.'

P35) SplitTemplateContigs: [true|false]
Specifies whether, after a templated assembly has been completed, the template should be split into contigs at areas where there is zero coverage. Split contigs will be grouped into scaffolds with a defined position to allow for easy sorting when the project is viewed in SeqMan Pro. Annotations on the template sequence will also be split, and any /codon_start qualifiers will be adjusted to stay in frame. Default is 'false.'

P36) TemplateDefaultQuality: [number from 5-50000]
The value used for the base quality of template sequences without quality scores. Default is 500.

P37) TrimToMer: [true|false]
Specifies whether to trim the reads to the matching mer tags within the read. For each read, SeqMan NGen looks for mers that exist in the template (for templated assemblies) or in any other read in the assembly (for de novo assemblies). It then sets the trimming for the read to the start of the first mer found and the end of the last mer found. Trimming to mer may be useful when assembling data without accurate quality scores, data with very short linkers, or when assembling SOLiD data. Default is 'false.'

P38) UseRepeatHandling: [true|false]
Specifies whether to use the repeat probabilities to determine if a mer occurs too frequently to use. This parameter should only be used for de novo assemblies, unless the assembleBoneyard parameter is set to 'true' for the templated assembly. Default is 'true.'

Example for 'setParam':

   setParam useRepeatHandling:true
   setParam coverageType:fixed
   setParam fixedCoverage:20
   setParam matchSize:15 
   setParam minMatchPercent:90
   setParam matchSpacing:10 
   setParam matchRepeatPercent:150
   setParam maxUsableCount:25
   setParam maxGap:15 
   setParam matchWindowLength:50
   setParam matchScore:10 
   setParam maxAssemblyCoverage:0
   setParam gapPenalty:30 
   setParam mismatchPenalty:20
   setParam defaultQuality:15
   setParam templateDefaultQuality:500
   setParam splitFalseJoins:true
   setParam allowConstraintBased:true
   setParam skipRealign:false
   setParam splitTemplateContigs:false
   setParam assembleBoneyard:false
   setParam minContigSeqs:0
   setParam snpPasses:2
   setParam snpMatchPercentage:90
   setParam lowCoverageThreshold:0
   setParam haploidSNP:false
   setParam haploidThreshold:0
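
Two shorter sketches follow, each using only parameters described above. All values are illustrative rather than recommended settings.

For a de novo assembly of a genome whose length is known, repeat handling might be based on genome length instead of a fixed coverage estimate. The genome length shown is hypothetical:

   setParam useRepeatHandling:true
   setParam coverageType:genome
   setParam genomeLength:4600000

For SNP detection with the Diploid method, the three filters might be stated explicitly. The values shown are the defaults listed above:

   setParam snp:true
   setParam snpMethod:diploid
   setParam snp_minPctToScore:0.05
   setParam snp_minVariantDepthToScore:2
   setParam snp_minProbNonrefToCall:0.1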

****************************************************************

Q) setQualityParam
Allows you to adjust the parameters used for quality trimming. In order to be applied, the 'trimEnds' parameter for the 'assemble' command must be set to 'true.'

Parameters for 'setQualityParam':

Q1) EndRegion: [number from 1-100]
The number of bases at the end of a sequence considered to be the "end region" which is used by other quality parameters. Default is 5.

Q2) MaxN: [number from 1-100]
The maximum number of "N" bases permitted in the window used for N-based quality trimming. Default is 2.

Q3) MaxNHiQual: [number from 0-100]
The maximum number of "N" bases permitted in the window used for N-based quality trimming to meet the high-quality threshold. Default is 1.

Q4) MinAveHiQual: [number from 10-40]
The minimum averaged quality score of the evaluated window required to be considered high-quality. Default is 22.

Q5) MinAveLowQual: [number from 5-40]
The minimum averaged quality score of the evaluated window required to be considered low-quality. Default is 20.

Q6) MinEndBaseQual: [number from 5-40]
The minimum base quality score required in the specified end region. Default is 15.

Q7) NTrimWinLength: [number from 5-100]
The length of the window used for "N-based" quality trimming. N-based quality trimming trims bases that are called "N" and is used only when quality scores are not available. Default is 7.

Q8) WinLength: [number from 2-100]
The length of the window used for averaging quality scores. Default is 5.

Example for 'setQualityParam':

   setQualityParam winLength:30
   setQualityParam minAveLowQual:14
   setQualityParam minAveHiQual:18
   setQualityParam minEndBaseQual:15
   setQualityParam endRegion:15
   setQualityParam nTrimWinLength:50
   setQualityParam maxN:2
   setQualityParam maxNHiQual:1
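
Because these settings are applied only when quality trimming is enabled, a script that adjusts them would also set 'trimEnds' to 'true' in the 'assemble' command. The following is a minimal sketch; the file path is hypothetical and the parameter values are the defaults listed above:

   setQualityParam winLength:5
   setQualityParam minAveHiQual:22
   loadSeq
      file: "/Library/ABC_project/ABC_sequences.fas"
   assemble
      trimEnds:true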

****************************************************************

R) setRepeatParam
Allows you to adjust the parameters used for scanning for repetitive sequences. In order to be applied, this command must appear in the script before the 'loadRepeat' command, and the 'repeatScan' parameter for the 'assemble' command must be set to 'true.'

Parameters for 'setRepeatParam':

R1) AlignCutoff: [number from 10-1000000]
The minimum acceptable alignment score. When the alignment score drops below the specified value, this indicates that the end of the alignment between the read and the repeat has been reached, and the alignment will stop. Default is 100.

R2) MaxMerGap: [number from 0-50]
The maximum distance between two mers required to be considered a matching pair. Default is 10.

R3) MerLength: [number from 5-50]
The minimum length of a mer required to be considered an exact match when scanning for repeats. Default is 17.

R4) MinEndFlagLen: [number from 5-1000000]
The minimum length required for a mer to be flagged as a repeat if the segment is bound by the end of the read. Default is 25.

R5) MinFlagLength: [number from 5-1000000]
The minimum length required for a mer to be flagged as a repeat. Default is 50.

R6) MinMerMatch: [number from 2-25]
The minimum number of matching mers required to start an alignment. Default is 2.

Example for 'setRepeatParam':

   setRepeatParam merLength:17
   setRepeatParam minMerMatch:2
   setRepeatParam maxMerGap:10
   setRepeatParam minFlagLength:50
   setRepeatParam alignCutoff:100
   setRepeatParam minEndFlagLength:25

****************************************************************

S) setVectorParam
Allows you to adjust the parameters used for vector trimming. In order to be applied, this command must appear in the script before the 'loadVector' command, and the 'vectScan' parameter for the 'assemble' command must be set to 'true.'

Parameters for 'setVectorParam':

S1) AlignCutoff: [number from 10-1000000]
The minimum acceptable alignment score. When the alignment score drops below the specified value, this indicates that the end of the alignment between the read and the vector has been reached, and the alignment will stop. Default is 100.

S2) EndCutOff: [number from 0-1000000]
The distance to the endpoint where trimming will go all the way to the end of the sequence. Default is 25.

S3) EndMerMatch: [number from 1-25]
The minimum number of mer matches required to start an alignment in the specified end region. Default is 1.

S4) EndRegion: [number from 0-1000000]
The number of bases at the end of a sequence where a lower stringency for matching and trimming is used. Default is 15.

S5) MaxMerGap: [number from 0-50]
The maximum distance between two mers required to be considered a matching pair. Default is 5.

S6) MergeTrimGap: [number from 0-1000000]
The maximum distance between two trim segments that will cause the segments to be merged. MergeTrimGap limits trimming to the ends of sequence reads, while EndCutOff does not. This parameter controls how sensitive trimming should be in areas where some portions of the sequence match a vector and other portions do not. The higher the number, the more likely the vector trimmer will find all the vector sequence in a region of poor quality. The smaller the number, the more confidence there is that the bases trimmed are actually vector and not a spurious match. Default is 7, which is suitable for trimming linkers from the ends of sequences.

S7) MerLength: [number from 5-25]
The minimum length of a mer required to be considered an exact match when searching for vector. Default is 9.

S8) MinEndTrimLength: [number from 5-1000000]
The minimum length to be trimmed when a vector matches the end of a read. This parameter can be useful in preventing small spurious matches from being trimmed, which may be significant with short read technologies. Default is 5.

S9) MinMerMatch: [number from 1-25]
The minimum number of matching mers required to start an alignment. Default is 3.

S10) MinTrimLength: [number from 5-1000000]
The minimum length required for a mer to be considered as a match for vector trimming. Default is 30.

Example for 'setVectorParam':

   setVectorParam merLength:9
   setVectorParam minMerMatch:3
   setVectorParam maxMerGap:5
   setVectorParam minTrimLength:30
   setVectorParam minEndTrimLength:5
   setVectorParam alignCutoff:100
   setVectorParam endRegion:15 
   setVectorParam endCutoff:25
   setVectorParam endMerMatch:1
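
The ordering requirement described above can be sketched as follows: 'setVectorParam' appears before 'loadVector,' and 'vectScan' is set to 'true' in the 'assemble' command. The file paths are reused from other examples in this manual for illustration only:

   setVectorParam minTrimLength:30
   loadVector
      file: "/Library/vectors/123_vector.seq"
      cloneSite:826
   loadSeq
      file: "/Library/ABC_project/ABC_sequences.fas"
   assemble
      vectScan:true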

Part IV. Preprocessing and Assembling Commands and Parameters

T) assemble
(required) Preprocesses and assembles the sequences that have been loaded. Preprocessing may include quality trimming, and scanning for vector, repetitive, and contaminant sequences.

Parameters for 'assemble':

T1) assembleBlocks: [true|false]
Specifies whether the assembly is a reference-guided assembly. Default is 'false.'

T2) contamScan: [true|false]
If true, sequences will be scanned for the specified contaminant sequences before assembling. Also see loadContaminant. Default is 'false.'

T3) doAssemble: [true|false]
If false, only the preprocessing will be done, and the sequences will not be assembled. Default is 'true.'

T4) repeatScan: [true|false]
If true, sequences will be scanned for the specified known repetitive sequences before assembling. Also see loadRepeat. Default is 'false.'

T5) trimEnds: [true|false]
If true, the sequences will be trimmed based on quality scores before assembling. Default is 'false.'

T6) vectScan: [true|false]
If true, the sequences will be scanned and trimmed for vector before assembling. Also see loadVector. Default is 'false.'

Example for 'assemble':

   assemble 
      trimEnds:false 
      vectScan:false 
      repeatScan:false
      contamScan:false
      doAssemble:true

****************************************************************

U) FixedTrim
Trims reads prior to assembly using fixed values. Based on the parameter settings for this command, SeqMan NGen will trim reads either by a specified number of bases from each end, or to a specified range.

Parameters for 'FixedTrim':

U1) end3: [number from 0-1000000]
If trimRelative (see below) is set to 'true,' then this value indicates the number of bases for SeqMan NGen to trim from the 3' end of each read. If trimRelative is set to 'false,' then this value indicates the specific 3' coordinate to which reads should be trimmed. Default is 0.

U2) end5: [number from 0-1000000]
If trimRelative (see below) is set to 'true,' then this value indicates the number of bases for SeqMan NGen to trim from the 5' end of each read. If trimRelative is set to 'false,' then this value indicates the specific 5' coordinate to which reads should be trimmed. Default is 0.

U3) trimRelative: [true|false]
Specifies whether the value for the end3 and end5 parameters should indicate the number of bases for SeqMan NGen to trim from the 3' or 5' end of each read. When 'false,' the value specified for the end3 or end5 parameter indicates the specific coordinate to which reads should be trimmed. Default is 'true.'

Example for 'fixedTrim':

   fixedTrim
      end5:10
      end3:20 
      trimRelative:true
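
A second example for 'fixedTrim', sketching the use of absolute coordinates. With trimRelative set to 'false,' the values below are read as the 5' and 3' coordinates to which each read is trimmed, rather than as the number of bases to remove. The coordinates shown are hypothetical:

   fixedTrim
      end5:11
      end3:90
      trimRelative:false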

****************************************************************

V) RealignContigs
(optional) Makes another pass through a templated assembly once the initial assembly is complete, and realigns contigs as needed. (This step occurs automatically for de novo assemblies.) Using this command may improve the accuracy of the final assembly by correcting occasional misalignments that can occur in gapped regions. Note, however, that this step may significantly increase the time to assemble. This command must appear in the script after the 'assemble' command.
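
A minimal sketch of the required ordering in a templated assembly is shown below. The command takes no parameters; the file paths are reused from other examples in this manual for illustration only:

   loadTemplate
      file: "/Library/abc_Project/abc_template.seq"
   loadSeq
      file: "/Library/ABC_project/ABC_sequences.fas"
   assemble
      doAssemble:true
   realignContigs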

****************************************************************

W) RemoveSmallContigs 
This command disassembles any contigs without template sequences that have fewer than the specified number of sequences.

Parameters for 'removeSmallContigs':

W1) minLength: [number]
Specifies the minimum length of a contig to prevent it from being disassembled. Default is 0.

W2) minSeqs: [number] 
(required) Specifies the minimum number of sequences necessary in a contig to prevent it from being disassembled. Default is 100.
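
A sketch of how 'removeSmallContigs' might be written, following the layout used for other commands in this manual. The threshold values are hypothetical:

   removeSmallContigs
      minSeqs:10
      minLength:500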

****************************************************************

X) SetPairSpecifier
Defines the paired end pair specifier for the paired Sanger and Illumina sequences in the assembly. This command must appear in the script before the assemble command, but after sequences have been loaded (loadSeq). For more information on assembling 454 paired end data, see the 'load454PairedEnd' command. Pair specifiers define the naming convention for sequence pairs, as well as requirements for a minimum and maximum distance between the opposite ends of the inserts. Expressions for forward and reverse naming conventions should be created using the paired end specification language. Forward and reverse sequences must have identical names except for the unique portion that determines the direction of the clone. 

Parameters for 'SetPairSpecifier':

X1) pairs: [forward|reverse|min|max]
This parameter lists the paired end constraints, each specified by the following four values. The values should be separated by spaces, and each list of values enclosed in curly brackets {}. An additional set of brackets is required around all of the paired end constraints, regardless of whether one or multiple pair constraints are specified.

X2) forward: [text string enclosed in quotes] 
A naming pattern to match forward clones. 

X3) max: [number]
The maximum distance for the paired end sequences to be separated. There is no default value.

X4) min: [number]
The minimum distance for the paired end sequences to be separated. There is no default value.

X5) reverse: [text string enclosed in quotes]
A naming pattern to match reverse clones. 

Example for 'setPairSpecifier':

(defines 2 pair specifiers each with different size ranges)

   setPairSpecifier 
      pairs:{{forward:"(.*)(2kb)(.*)-FP.*$" reverse:"(.*)(2kb)(.*)-RP.*$" min: 1500 max: 2500}
                  {forward:"(.*)(8kb)(.*)-FP.*$" reverse:"(.*)(8kb)(.*)-RP.*$" min: 7000 max: 9000}}
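
When only one pair specifier is needed, the outer set of brackets is still required. The following sketch defines a single pair specifier; the naming patterns and size range are hypothetical:

   setPairSpecifier
      pairs:{{forward:"(.*)-FP.*$" reverse:"(.*)-RP.*$" min: 1500 max: 2500}}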

****************************************************************

Y) SplitLinkerReads
Splits specified reads based on their match to given linker sequences. Reads that align to the linker and include the linker site (as specified by the linkerSite parameter or by the cloneSite option in an *.fof file) will be split into two reads. The two newly split reads will be designated by _A and _B appended to the name.

Parameters for 'SplitLinkerReads':

Y1) linkerFile: [directory/filename enclosed in quotes]
The directory and file name of the linker file. 

Y2) linkerSite: [number]
The position indicating where reads should be split. There is no default value.

Y3) seqFile: [directory/filename enclosed in quotes]
The directory and file name of the sequence reads. 

Example for 'splitLinkerReads':

   splitLinkerReads
      seqFile: "/Library/123_project/reads.fas"
      linkerFile: "/Library/123_project/linker.fas"
      linkerSite:30

****************************************************************

Z) SplitTemplates
Splits template contigs into multiple contigs in areas where there is zero coverage. Split contigs will be grouped into scaffolds with a defined position to allow for easy sorting when the project is viewed in SeqMan Pro. Annotations on the template sequence will also be split, and any /codon_start qualifiers will be adjusted to stay in frame. 

****************************************************************

AA) appendToAssembly
(This command is for the reference-guided workflow and is intended for internal use only).

****************************************************************

BB) convertReads
(optional) Converts a sequence from one file format to another. This command is particularly useful for converting SOLiD .csfasta files into .fastq files that can be used by the XNG assembler.

Parameters for 'convertReads':

BB1) destination: [directory/filename enclosed in quotes]
The location and filename for the output.

BB2) file: [directory/filename enclosed in quotes]
The input file containing the reads. (Synonym for 'reads').

BB3) format: [genbank|fastq]
(optional) Specifies the format of the output file. If 'genbank' is entered, the output will be in .gbk format. If 'fastq' is entered, the output will be in .fastq format. Default is 'fastq'.

BB4) reads: [directory/filename enclosed in quotes]   
The input file containing the reads. (Synonym for 'file').
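
Example for 'convertReads':

The following is an illustrative sketch only. It assumes the same command layout used by the other examples in this manual, and the file paths are hypothetical placeholders.

   convertReads
      reads: "/Library/123_project/reads.csfasta"
      destination: "/Library/123_project/reads.fastq"
      format: fastq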

****************************************************************

CC) extendContigs
(Intended for internal use only).

Parameters for 'extendContigs':

CC1) extendPasses: [number]

CC2) mergeContigsInScaffold: [true|false]

****************************************************************

DD) include
(optional) When building a script, this command can be used to call up additional lines of script previously stored in a text file. In this way, a group of commands can be shared between two or more scripts.

Parameters for 'include':

DD1) file: [directory/filename enclosed in quotes]
Specifies the directory and name of the text file containing the script lines to be included.
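
Example for 'include':

The following is an illustrative sketch only. It assumes the same command layout used by the other examples in this manual, and the file path and name are hypothetical placeholders.

   include
      file: "/Library/123_project/shared_commands.txt"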

****************************************************************

EE) MakeSeqNamesUnique
(Intended for internal use only).

****************************************************************

FF) set
(optional) Used to set variables. See the example below and those under the 'runScript' command.

Example for 'set':

  set $snp:true
  set $snpMethod:"Diploid"

****************************************************************

GG) setAssemblyReport
(Intended for internal use only). Used to designate a file for a tab-delimited report, similar to the report that XNG generates. This is useful during development to test how code changes impact results.

Parameters for 'setAssemblyReport':

GG1) file: [directory/filename enclosed in quotes]
Specifies the folder and file name. (Synonym for 'name').

GG2) name: [directory/filename enclosed in quotes]
Specifies the folder and file name. (Synonym for 'file').

****************************************************************

HH) SplitMIDSeqs
(optional) Used to split 454 MID reads into individual files with one file per MID tag.

Parameters for 'SplitMIDSeqs':

HH1) destination: [directory/filename enclosed in quotes]
The location and filename for the output. 

HH2) file: [directory/filename enclosed in quotes]
The location and filename for the input. (Synonym for 'reads').

HH3) reads: [directory/filename enclosed in quotes] 
The location and filename for the input. (Synonym for 'file').
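
Example for 'SplitMIDSeqs':

The following is an illustrative sketch only. It assumes the same command layout used by the other examples in this manual, and that the destination may be given as a folder, as in the 'SplitPairs' example. The file paths are hypothetical placeholders.

   SplitMIDSeqs
      reads: "/Library/123_project/mid_reads.fas"
      destination: "/Library/123_project/splitMID/"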

****************************************************************

II) SplitPairs
(optional) Used to split 454 or Ion Torrent mate pair files into forward and reverse (and singleton) files.

Parameters for 'SplitPairs':

II1) destination: [directory/filename enclosed in quotes]
The location and filename for the output.

II2) DiscardLinkerless: [true|false]
Specifies whether reads without a linker sequence should be discarded from the assembly. Default is 'false'.

II3) file: [directory/filename enclosed in quotes]
The location and filename for the input. (Synonym for 'reads').

II4) reads: [directory/filename enclosed in quotes]
The location and filename for the input. (Synonym for 'file').

II5) seqTech: [text string]
(optional) Specifies the offset to be used when converting compressed quality scores into numerical values. Values of normalScore, IonTorrent (for Ion Torrent data), or SOLiD (for Applied Biosystems SOLiD data) will use an offset of 33. A value of Illumina (for Illumina data) will use an offset of 64. A value of 454 (for Roche 454 data) will use an offset of 33 and orient quality scores for all homopolymeric runs of two or more to be descending from 5' to 3' on the top strand. If a value of unknown is entered, the assembler will determine the offset from the first data file.

Example for 'SplitPairs':

   SplitPairs 
     destination: "C:\data\splitReads\"
     reads: { 
        { file: "C:\data\reads\file1.fas"   format: IonTorrent }
        { file: "C:\data\reads\file2.fas"   format: 454  discardLinkerless: true }
        }

