XNG Commands

Note: All commands and parameters are assumed to be optional unless the description is prefaced by “required.”

Command

Parameter

Description

Allowed values (defaults in bold/underline)

Wizard equivalent

assembleTemplate

(required) Initiates the assembly of the loaded sequences using the specified template as a reference.

Example:

XNG script used in the “clustering” step of the transcript annotation workflow:

merSize: 25

minNewClusterSize: 5

minSingleMergeClusterSize: 7

minMultiMergeClusterSize: 7

minMultiMergeIgnoreFactor: currently not used by default

minClusterSizeToOutput: 100

alignmentCutoff

Used in the “clustering” step of the transcript annotation workflow.

[number]

Default = 200

assemble

Specifies whether to use the part of the query that matches the contaminant sequence(s), the part that doesn’t match, or both.

[matchContam|noMatchContam|all]

assemblyInfo

Contains information about the assembly.

[text string]

assemblyInfoAlt

Contains pairs of keys and values which will be written to the -0.assemblyInfo file.

autoTrim

Specifies whether mismatching ends of reads should automatically be trimmed.

[true|false]

boneyardAssembly

Specifies whether sequences not used in the original or incremental XNG assemblies should be added to the assembly project by the SNG assembler. This command pertains only to reference-guided assemblies with gap closure. By default, during this type of assembly, the XNG assembler first finds structural variations (SVs) then splits the contig after each SV. Elements of this process can be modified using this command. (Note: “Boneyard” is a term for sequences that were not assigned to any contig).

[true|false]

combineDuplicateSeqs

Specifies whether the duplicate reads will be clustered.

[true|false]

contaminant

Use of this parameter partitions the query data by running an additional mer-match (layout) against the specified contaminant sequence(s). A full assembly is then run using the part of the query that either matches or does not match the contaminant sequence(s). This parameter can be used to remove reads originating from one or more organisms that may also have been present in the query data set (e.g., reads from human DNA present in a metagenomic sample from the human gut).

file: [directory/filename enclosed in quotes] the file with contaminant sequences.

assembleContam: [matchContam|noMatchContam|all]

merLayoutMin: [number]

unassembled: [directory/filename enclosed in quotes] the file containing no contaminant reads.

[directory/filename enclosed in quotes]

dbSNPTable

(Intended for internal use only).

[directory/filename enclosed in quotes]

delayAlignInserts

This flag turns the delaying of reads that cause inserts on or off. ‘True’ means that gap-causing reads will be delayed: reads are added in order of the number of inserts they cause (insert length is not considered), so that reads causing the fewest inserts are added first.

[true|false]

Defaults: true for named read technologies; false for ‘Other’ read technologies

deleteIntermediates

Specifies whether intermediate files are saved or deleted. These files can be large with large-scale projects.

[true|false|none|all|notTemplateMer]

directoryMer

Specifies the path and directory where both the template and query data mer files will be stored. Alternatively, separate directories for the template and query mer files can be specified using the parameters below. If no directory is specified, the mer file will be created in the directory containing the sequence data.

[directory/filename enclosed in quotes]

directoryQueryMer

(required) Specifies the path and directory where the query mer file will be stored.

[directory/filename enclosed in quotes]

directoryTemplateMer

(required) Specifies the path and directory where the template mer file will be stored.

[directory/filename enclosed in quotes]

filterDeepLayout

(optional) Specifies that XNG remove superfluous sequences in areas of deep coverage. Set to ‘false,’ by default, except for projects involving miRNA or microbial genomes, where it is set to ‘true.’

[true|false]

‘true’ = the Limit all deep coverage regions radio button is selected in the Advanced Assembly Options > Alignment tab dialog

filterDeepLayoutOrganelle

(optional) Specifies that XNG remove superfluous sequences in areas of deep coverage. Set to ‘false’ by default, except for projects involving a mitochondrial or chloroplast template (i.e., those with a short name of 'MT', 'M', 'CHL', or 'chloro'), where it is set to ‘true.’

[true|false]

‘true’ = the Only limit deep coverage regions for Mitochondria and Chloroplasts radio button is selected in the Advanced Assembly Options > Alignment tab dialog

forceFullForwardAlign

Start the alignment at the 5’ end of the sequence.

[true|false]

forceMake

Specifies whether new intermediate mer files will be created. A value of false means that existing valid intermediate files will be used.

[true|false|query|hit|layout]

format

Specifies the format of the alignment output file. If ‘none’ is entered, the assembly is run to include the alignment phase, but no alignment output is generated. This parameter can be used to remove reads from a contaminant source.

[BAM|SQD|NONE|NONE_align|Aux_align]

gap5Prime

Put the gap on the 5’ side of the sequence.

[true|false]

gapPenalty

The penalty for opening or extending a gap during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. A high gap penalty suppresses gapping, while a low value promotes gapping.

[number]

Default = 30 for most workflows, 50 for the transcript annotation workflow

gapExtensionPenalty

Used in the “clustering” step of the transcript annotation workflow.

[number]

Default = 5

geneticCode

This parameter specifies the genetic code to use with a reference sequence.

[filepath/standard Lasergene genetic code file name]

hits

(required) Specifies the path and name of the hit file. Incomplete paths will be appended to the default directory.

[directory/filename enclosed in quotes]

increaseRunGapPen

This parameter is a flag to increase the gap open penalty in homopolymer (HP) runs.

[true|false]

layout

(required) Specifies the path and name of the layout file. Incomplete paths will be appended to the default directory.

[directory/filename enclosed in quotes]

layoutAlign

Specifies that a pairwise alignment should be performed at the layout phase in order to pick the best position for a given read.

[true|false]

layoutMaxTemplateGap

The maximal number of gaps introduced into the alignment used during layout.

[number]

layoutRSRange

The maximal Register Shift difference used while building the layout.

[number]

layoutType

Specifies how reads are to be laid out.

[unique|once|multiple|multipleAll]

matchScore

The score for a base match during an alignment. This score contributes to the pairwise score used to calculate match percentage. Increasing the matchScore value allows for longer or more frequent gaps, thus forcing bases that match to be assembled together.

[number]

Default = 10

MaxGap

The maximum number of gaps allowed per 1000 bases in the alignment.

[number from 0-1000]

Default = 6 for most workflows, 30 for the transcript annotation workflow
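
As a rough illustration of the MaxGap limit, the Python sketch below converts the per-1000-base limit into a check for a single alignment. The function name is hypothetical, and the idea that the limit scales linearly with the aligned length is an assumption of this sketch, not a description of how XNG applies the limit internally.

```python
def within_max_gap(gap_count: int, aligned_length: int, max_gap: float = 6) -> bool:
    """Return True if the gap density does not exceed max_gap gaps per 1000 bases.

    Hypothetical helper; the default of 6 is the documented default for most workflows.
    """
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    # Cross-multiply instead of dividing to avoid floating-point rounding.
    return gap_count * 1000 <= max_gap * aligned_length


# One gap in 250 aligned bases is 4 gaps per 1000 bases (under the limit of 6).
print(within_max_gap(1, 250))
# Two gaps in 250 aligned bases is 8 gaps per 1000 bases (over the limit of 6).
print(within_max_gap(2, 250))
# The same alignment under the transcript annotation default of 30.
print(within_max_gap(2, 250, max_gap=30))
```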

maxMergeSize

When linking clusters into a scaffold, only link them together if the overall number of reads in the scaffold would not exceed this threshold. Used in the “clustering” step of the transcript annotation workflow.

maxNCnt

(optional) This parameter removes sequential reads of the IUPAC ambiguity code ‘N’ that are greater than or equal to the number specified. Use of this parameter may help in assemblies whose reads contain large clusters of spurious N’s.

[integer]

maxSecondaryTrimLength

During alignment, a read can be trimmed from both ends. This parameter defines the longest allowable length for the smaller of the two trimmed ends.

[number]

maxSeqs

Specifies the maximum number of query sequences to add to an assembly. Use of this command can speed up assembly.

[number]

merCntThresh

Minimum number of mers needed in order to be recorded in the mer file.

[number]

merLayoutMin

Specifies the minimum length (in bases) of at least one stretch of matching mers used to identify matches between the reference and query data. The minimum value is equal to the mer size. The maximum value is the read length, which would require the entire read to be an exact match. For example, with a merSize of 19 and a merLayoutMin of 21, at least one stretch of three consecutive mers in a read would have to match for the read to be included in the layout.

[number from 11-1000]

Default = 25
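
The arithmetic behind the merSize 19 / merLayoutMin 21 example can be sketched in Python. The helper name is hypothetical and not part of XNG. It assumes that consecutive mers are offset by one base, which is what the example in the description implies.

```python
def consecutive_mers_needed(mer_size: int, mer_layout_min: int) -> int:
    """Number of consecutive, overlapping mers needed to span mer_layout_min bases.

    Assumes each successive mer starts one base after the previous one.
    """
    if mer_layout_min < mer_size:
        raise ValueError("merLayoutMin cannot be smaller than merSize")
    return mer_layout_min - mer_size + 1


# Three 19-base mers offset by one base cover 19 + 2 = 21 bases.
print(consecutive_mers_needed(19, 21))
# When merLayoutMin equals merSize, a single matching mer is enough.
print(consecutive_mers_needed(25, 25))
```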

merMinimizer

(Intended for internal use only)

[number]

merSize, merLength or matchSize

(required) Specifies the length (in bases) of mers used to identify matches between the reference and query data.

[number]

merSkip

(Intended for internal use only) Specifies the number of positions to ignore or “skip” when creating the template mer file. Normally, mers are only skipped in the query (see merSkipQuery, below). The first and last mer of every read are always included. Increasing the value reduces the size of the intermediate files as well as the overall assembly time. However, larger values can also reduce the number of reads included in the assembly, especially with short read data.

0 = do not skip

2 = skip every second base

3 = skip every third base

etc.

[number]

Default = 0

merSkipQuery

Specifies the number of positions to ignore or “skip” when creating the query mer file. The first and last mer of every read are always included. Increasing the value reduces the size of the intermediate files as well as the overall assembly time. However, larger values can also reduce the number of reads included in the assembly, especially with short read data.

0 = do not skip

2 = skip every second base

3 = skip every third base

etc.

[number]

Default = 0

method

Defines how to handle splits in the assembly:

normal – normal assembly method

splitOnly – only reads which have been split will be included in the assembly

noSplit – no reads will be split

[normal|splitOnly|noSplit]

minAlignedLength

Specifies the minimum number of bases that must align after trimming for a read to be included in the assembly.

[number from 11-1000]

Default = 25 for most workflows, 50 for the transcript annotation workflow

minClusterSizeToOutput

Threshold for the number of reads that a cluster must contain in order for the cluster to be passed along to SNG for assembly in the next step of the program. Used in the “clustering” step of the transcript annotation workflow.

Note that this command is present only for the clusterParam block of the rnaAssemble command.

[number]

minMatchPercent

The minimum percentage of matches in an overlap required to join two sequences in the same contig.

[number]

Default = 93 for most workflows, 60 for the transcript annotation workflow

minMultiMergeClusterSize

When two or more clusters overlap the same k-mer, the minimum number of reads (depth) required at that k-mer for a cluster to be considered significant.

If three or more clusters exceed this threshold, the k-mer is considered “noisy” and a potential false join, and will not be merged. This is reported as a “multi-cluster link that was not merged”.

If two significant clusters overlap and have similar enough depth, the clusters are considered linked and are scaffolded together. Otherwise, if only one cluster is significant, all reads at that k-mer which have no assigned cluster are merged directly into it as described for the minSingleMergeClusterSize option. This parameter is used in the “clustering” step of the transcript annotation workflow.

Note that this command is present only for the clusterParam block of the rnaAssemble command.

[number]
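
The decision rules described for minMultiMergeClusterSize can be laid out as a small Python sketch. Everything here is illustrative: the function name and return strings are hypothetical, "exceed this threshold" is read as "greater than or equal to", and "similar enough depth" is modeled as a simple depth ratio, which may not match the actual XNG test. The case of two significant clusters with dissimilar depth is not covered by the description and is reported as unspecified.

```python
from typing import Sequence


def classify_shared_kmer(
    depths: Sequence[int],
    min_multi_merge_cluster_size: int,
    max_depth_ratio: float,
) -> str:
    """Classify a k-mer shared by several clusters, given each cluster's depth there."""
    # A cluster is "significant" at this k-mer if its depth reaches the threshold.
    significant = [d for d in depths if d >= min_multi_merge_cluster_size]

    if len(significant) >= 3:
        # Treated as a potential false join.
        return "noisy: multi-cluster link that was not merged"
    if len(significant) == 2:
        low, high = sorted(significant)
        # Assumed similarity test: the deeper cluster is within the ratio of the shallower.
        if high <= max_depth_ratio * low:
            return "link clusters into a scaffold"
        return "unspecified in the description"
    if len(significant) == 1:
        return "merge unassigned reads into the significant cluster"
    return "no significant cluster"


print(classify_shared_kmer([10, 9, 8], 7, 2.0))
print(classify_shared_kmer([10, 8, 3], 7, 2.0))
print(classify_shared_kmer([10, 2], 7, 2.0))
```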

minMultiMergeIgnoreFactor

When two or more clusters overlap the same k-mer and may be linked, they must be within this ratio of one another. Used in the “clustering” step of the transcript annotation workflow.

Note that this command is present only for the clusterParam block of the rnaAssemble command.

[number]

minSeqsPerTemplate

Minimum number of sequences sufficient to build the layout or alignment.

[number]

minSingleMergeClusterSize

The minimum number of reads (depth) matching an existing cluster at a single k-mer required to extend that cluster by immediately adding all new reads for that k-mer to the cluster. Used in the “clustering” step of the transcript annotation workflow.

Note that this command is present only for the clusterParam block of the rnaAssemble command.

[number]

minNewClusterSize

Minimum number of matching reads at a single k-mer (i.e., “depth”) required to create a new cluster. Used in the “clustering” step of the transcript annotation workflow.

Note that this command is present only for the clusterParam block of the rnaAssemble command.

[number]

mismatchPenalty

The penalty for a base mismatch during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage.

[number]

Default = 20

noSexChromosomes

Disables special handling of sex chromosomes.

[true|false]

noSVPairSort

Specifies whether to turn off the calculation of pairs for structural variations. This may potentially reduce XNG assembly time.

[true|false]

onePackage

Specifies whether an assembly containing multiple reference sequences should be bundled into a single .assembly package. If ‘false’ is entered, one .assembly package is created per contig.

[true|false]

openInSeqman

(optional) Specifies whether the completed assembly should immediately be launched in SeqMan.

[true|false]

output

(required) Specifies the path and directory of the output files. Incomplete paths are appended to the default directory.

[directory/filename enclosed in quotes]

pairDist

(Intended for internal use only)

[true|false]

pickTemplate

Defines the number of templates from which to choose, and finds the template that is the best match for the input sequence.

[number]

placeHit

(Intended for internal use only)

[true|false]

probe

(Intended for internal use only)

[number]

query

(required) Specifies the directory and file name(s) of the query data to be assembled. A folder with one or more data files can also be used in place of individual file names.

Properties for query:

file: [directory/filename enclosed in quotes]

Specifies the directory and file/folder.

isPair: [true|false]

Specifies whether the query files contain paired end data.

minDist: [number]

(required if isPair is ‘true’) Specifies the minimum expected distance in bases between paired end reads. Default is 0.

maxDist: [number]

(required if isPair is ‘true’) Specifies the maximum expected distance in bases between paired end reads. Defaults are 750 for Illumina, 4500 for 454 and Sanger, 7500 for Other, and user-defined for Ion Torrent.

seqTech: [unknown|IonTorrent||IlluminaLongReads|454|PacBio|normalScore|Other]

Specifies the offset to be used when converting compressed quality scores into numerical values. These are the offsets used for the technology specified:

Data Type	Value	Offset
IonTorrent	IonTorrent	33
Illumina	IlluminaLongReads	33
Roche 454	454	33
Other types	normalScore	33

Note 1: For 454, quality scores for homopolymeric runs of ≥ 2 are oriented from 5’ to 3’ on the top strand.

Note 2: If possible, the data type of unknown data is determined automatically based on the first data file.

pairLinker: [string]

groupName: [string] The name of a group this file belongs to. Used for running multiple samples in one file.

sex: [unknown|female|male]

trim: [true|false] Specifies whether vector trimming needs to be applied to the reads.

sngTrim: Contains parameters for fast vector trimming (see the SNG TrimVector command).

scan: [true|false] Specifies whether reads need to be scanned for contaminants.

contaminantScan: Contains the assembleTemplate command with the contaminant file used as a template and the parameters directoryTemplateMer, hits, layout, output, unassembled, results, format, mersize, ignorePolyMers and deleteIntermediates. The format parameter has the value none_ALIGN.

Example:

query: {{file: “/data/home/proj/Illumina_s_5_1.txt”}

{file: “/data/home/proj/Illumina_s_5_2.txt”}

isPair: true

minDist: 400

maxDist: 700

seqTech: Illumina}

[directory/filename enclosed in quotes]

recordSplitsOnly

Functional only when used in the same program as splitTemplateContigs or recordStructVariations (both described below). Specifies whether or not to turn off contig splitting while still recording SVs for later inclusion in the Structural Variation Report.

[true|false]

recordStructVariations

Specifies under which circumstances structural variations (SVs) should be calculated and recorded.

0|false = Don’t calculate SVs

1|true = Calculate SVs at zero coverage

2 = Calculate SVs at insertions and deletions

3 = Calculate SVs at zero coverage and at insertions

[integer between 0-3|true|false]

Default = 2

removeDuplicateSeqs

Completely removes clonal reads after the alignment phase of assembly. Clonal reads, where the endpoints of both reads in a pair match those in another pair, are usually the result of PCR artifacts. If ‘true,’ the reads will not be scored, and will not be included in SNP calculations. Marking this parameter to ‘true’ may substantially increase the time needed for assembly.

[true|false]

removeUniqueInserts

Removes reads that cause an insert which no other read would create. This parameter is only enabled when delayAlignInserts (described under the assembleTemplate command) is true.

[true|false]

Defaults: true for Illumina and Ion Torrent read technologies; false for all other types.

repeatPenaltyScale

Indicates the quality penalty (using the Phred scale) to use for a read which places in two locations identically. Higher repeat counts are further penalized relative to this on a log₂ scale such that repeats placing in four locations have a double penalty, in eight locations have a triple penalty, and so on. This penalty is applied to a ceiling of Phred score 30 if the other methods are disabled or have a higher score.

[number]

Default = 8
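
The log₂ scaling described for repeatPenaltyScale can be sketched in Python as follows. The function name is hypothetical. Treating uniquely placed reads as unpenalized and interpolating between powers of two are assumptions of this sketch, and the Phred 30 ceiling mentioned in the description is deliberately not modeled because its exact behavior is not spelled out.

```python
import math


def repeat_penalty(locations: int, repeat_penalty_scale: float = 8.0) -> float:
    """Phred-scale penalty for a read that places identically in `locations` positions."""
    if locations < 1:
        raise ValueError("locations must be at least 1")
    if locations == 1:
        # Assumption: a uniquely placed read receives no repeat penalty.
        return 0.0
    # Two locations give the base penalty; each doubling adds one more multiple.
    return repeat_penalty_scale * math.log2(locations)


for n in (2, 4, 8):
    print(n, repeat_penalty(n))
```

With the default scale of 8, this gives a single penalty for two locations, a double penalty for four, and a triple penalty for eight, matching the progression in the description.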

repeatThreshMax

Specifies the maximum number of occurrences of a mer in the reference sequence(s) for it to be considered repeated. Mers exceeding this number will not be used for identifying matches.

[number from 1-10000]

Default = 100

repeatThreshMin

Specifies the minimum number of occurrences of a mer in the reference sequence(s) for it to be considered repeated. Mers occurring fewer times than this number will not be used for identifying matches.

[number]

reportFiles

Defines the kind of report file to be generated.

perProject: [true|false] Generate a per project report.

perTemplate: [true|false] Generate a per template report.

removeInteral: [true|false] Remove intermediate reports.

repeatmermax

Threshold number of occurrences in a data set for a mer to be considered “repeated.” Used in the “clustering” step of the transcript annotation workflow.

results

Specifies the path and name of the result summary file. This file contains a compilation of assembly statistics and uses the extension fileSize.txt. Incomplete paths will be appended to the default directory.

[directory/filename enclosed in quotes]

saveUnSplitAssembly

Specifies whether XNG should save both the normal assembly output, [filename].assembly, and the unsplit intermediate assembly, [filename]-noSplit.assembly. The latter file contains SVs but no SNPs, and can be used to validate splits in the final assembly.

[true|false]

sex

Specifies the sex of the subject, used for read placement and SNP calling. See Handling of Sex Chromosomes for details.

[male, female, unknown]

showCDSVariant

Specifies whether or not XNG should show all variants of a CDS feature contacted by a SNP. The version number for the CDS variant will then appear in brackets when viewed in the SNP report in SeqMan Pro.

[true|false]

sngConvertOptions

(Intended for internal use only)

[text string]

snp

Specifies whether or not a SNP detection pass of the gapped alignment should be made during the assembly.

[true|false]

snp_checkStrandedness

Specifies whether or not the strand that each read comes from is considered in the SNP calculation. This is ignored by the simple SNP calling method (used when genome ploidy is “Heterogeneous”).

[true|false]

snp_combineSubs

This parameter is used to coalesce adjacent substitutions.

[true|false]

snp_excludeBases3p

(internal use only) This parameter causes the specified number of bases from the 3' end of each read to not be considered during variant calling.

[integer]

snp_excludeBases5p

(internal use only) This parameter causes the specified number of bases from the 5' end of each read to not be considered during variant calling.

[integer]

snp_excludeBasesEdge

This parameter causes the specified number of bases from both the 5' and 3' ends of each read to not be considered during variant calling.

[integer]

For the simple SNP calling method (used when genome ploidy is “Heterogeneous”), the default is 5. For the Bayesian SNP calling methods (used when genome ploidy is “Diploid” or “Haploid”), the default is 0.

snp_limitEndPos

Specifies the 3' most coordinate of the specified template from which to stop calculating SNPs.

[number between 1 and the length of the template]

snp_limitStartPos

Specifies the 5' most coordinate of the specified template from which to begin calculating SNPs. A value between 1 and the length of the template must be entered.

[number]

Default = 1

snp_limitTemplateID

Specifies a single template ID for which to calculate SNPs.

[number]

Default = 0

snp_logEndPos

Specifies the 3' most coordinate of the specified template from which to stop storing a detailed log of SNP information. A value between 1 and the length of the template must be entered.

[number]

Default = 1

snp_logLevel

Specifies the level of detailed logging to store in the “shared” project directory as “SNP.log.” Level 0 specifies that no log will be stored. Level 1 stores detailed information on the SNPs that were called, level 2 also logs columns where the preliminary filter passed but the final filter failed, and level 3 logs all columns. This is ignored by the simple SNP calling method (used when genome ploidy is “Heterogeneous”).

[whole number from 0-3]

Default = 0

snp_logStartPos

Specifies the 5' most coordinate of the specified template from which to begin storing a detailed log of SNP information. A value between 1 and the length of the template must be entered.

[number]

Default = 1

snp_logTemplateID

Specifies a single template from which to store a detailed log of SNP information.

[number]

Default = 0

snp_maxRun

Specifies the maximum length of a homopolymeric run for an indel to be considered during variant calling. For example, a snp_maxRun of '5' will allow a portion of sequence up to 5 bases in length to be called as a SNP.

[integer]

Defaults are 3 for 454 and Ion Torrent read technologies; 5 for all others.

snp_maxStrandBias

Strand Bias (SB) for a SNP is the bias for the SNP appearing on one strand versus the other. It is measured relative to the strand bias in the assembly at the location of the SNP. For example, in a column with 60 forward reads and 40 reverse reads, 6 SNP bases on the forward strand and 4 on the reverse strand would be unbiased. SB is given by the formula:

SB = |SNP%_f – SNP%_r| / Total SNP%

…where SNP%_f and SNP%_r are the percentages of reads containing the variant on the forward (top) and reverse (bottom) strands, respectively, and Total SNP% is the total percentage of reads containing the variant. Because SB is calculated from an absolute value, it is never negative.

The following table describes different SB thresholds:

SB Threshold	Description
-1	A negative number cannot normally be generated by the equation above. However, you may use '-1' in the script to turn off the snp_maxStrandBias parameter. In the wizard, SeqMan NGen indicates the parameter is turned off by making Maximum strand bias (see SNP Options) either blank or absent.
0	Perfectly balanced (unbiased) strands. Reads with variants are present on both strands, and variants appear equally on both strands.
Between 0-1, not inclusive	As the threshold approaches 1, more variants are called even though their variant-containing reads are unbalanced between the strands at that position.
1	All variant-containing reads are on a single strand.

Note: In cases where all the reads covering a base are on one strand only, the SNP% of the other strand cannot be calculated (due to a “division by zero” error). These positions will not be removed by the snp_maxStrandBias filter. To remove these variants, instead set snp_minStrandCov to ≥ 1.

Example:

In a homozygous case (SNP% = 100) with a depth of 100, where 75 variant containing reads are on the top strand (75%) and 25 variant containing reads are on the bottom strand (25%), the strand bias would equal: (75–25)/100 = 0.5.

[number from 0-1, or -1 to turn the filter off]

Defaults for the Bayesian SNP calling methods (used when genome ploidy is “Diploid” or “Haploid”) are 0.8 for 454 and Ion Torrent read technologies; not shown (blank) for all others. Defaults for the simple SNP calling method (used when genome ploidy is “Heterogeneous”) are 0.25 for all read technologies.

snp_minHomopolDelDepth

Specifies the minimum read depth required to call a deletion in a homopolymeric run.

[integer]

Default = 0

snp_minHomopolDelFrac

Specifies the minimum fraction of reads required to call a deletion in a homopolymeric run.

[integer]

Default = 0

snp_minHomopolInsDepth

Specifies the minimum read depth required to call an insertion in a homopolymeric run.

[integer]

Default = 0

snp_minHomopolInsFrac

Specifies the minimum fraction of reads required to call an insertion in a homopolymeric run.

[integer]

Default = 0

snp_minPctToScore

Specifies the minimum percentage of reads in a column which must differ from the reference in order to score the column. For the simple SNP calling method (used when genome ploidy is “Heterogeneous”), this is the only criterion used to call a SNP. For the Bayesian SNP calling methods (used when genome ploidy is “Diploid” or “Haploid”), this is a filter applied before the other parameters.

[number from 0-1]

Default = 0.05
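
A minimal Python sketch of the snp_minPctToScore check for a single column is given below. The function name is hypothetical, the value is treated as a fraction between 0 and 1 (as the allowed range suggests), and the inclusive comparison is an assumption. It is not a reproduction of XNG's implementation.

```python
def passes_min_pct_to_score(
    ref_count: int,
    nonref_count: int,
    min_pct_to_score: float = 0.05,
) -> bool:
    """Return True if enough reads in the column differ from the reference."""
    depth = ref_count + nonref_count
    if depth == 0:
        # Assumption: a column with no coverage is never scored.
        return False
    return nonref_count / depth >= min_pct_to_score


# 10 of 100 reads differ from the reference (0.10), above the default of 0.05.
print(passes_min_pct_to_score(ref_count=90, nonref_count=10))
# 1 of 100 reads differs from the reference (0.01), below the default of 0.05.
print(passes_min_pct_to_score(ref_count=99, nonref_count=1))
```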

snp_minProbNonrefToCall

Specifies the minimum probability of a SNP column which is required to call a SNP, expressed as a number between 0 and 1. The probabilities of all genotypes other than Homozygous Reference are totaled and checked against this number. This is the final filter applied during the Bayesian SNP calling methods (used when genome ploidy is “Diploid” or “Haploid”) and is ignored by the simple SNP calling method (used when genome ploidy is “Heterogeneous”).

[number from 0-1]

Default = 0.1, requiring a minimum 10% change.
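
The totaling step described for snp_minProbNonrefToCall can be sketched in Python. The genotype labels, the dictionary layout, and the function name are hypothetical placeholders. Only the rule itself (sum every genotype probability except Homozygous Reference and compare it with the threshold) comes from the description, and the inclusive comparison is an assumption.

```python
from typing import Mapping


def passes_min_prob_nonref(
    genotype_probs: Mapping[str, float],
    min_prob_nonref_to_call: float = 0.1,
    hom_ref_label: str = "hom_ref",
) -> bool:
    """Return True if the summed non-reference genotype probability meets the threshold."""
    nonref_total = sum(
        prob for genotype, prob in genotype_probs.items() if genotype != hom_ref_label
    )
    return nonref_total >= min_prob_nonref_to_call


# Non-reference genotypes total 0.30, so the column passes the default of 0.1.
print(passes_min_prob_nonref({"hom_ref": 0.70, "het": 0.25, "hom_alt": 0.05}))
# Non-reference genotypes total about 0.05, so the column fails the default of 0.1.
print(passes_min_prob_nonref({"hom_ref": 0.95, "het": 0.04, "hom_alt": 0.01}))
```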

snp_minStrandCov

Specifies the minimum number of reads from each strand required to call a variant at a given position.

[integer]

In the Bayesian SNP calling methods (used when genome ploidy is “Diploid” or “Haploid”), the default is 0. In the simple SNP calling method (used when genome ploidy is “Heterogeneous”), the default is 5.

snp_minVariantDepthToScore

(required if “snp” is true) Specifies the minimum depth required for a specific base (or deletion) in a column before it is considered usable for SNP calling. This is the second filter applied during the Bayesian SNP calling methods (used when genome ploidy is “Diploid” or “Haploid”) and is ignored by the simple SNP calling method (used when genome ploidy is “Heterogeneous”).

[number from 0-100]

Default = 2

snp_minWeight

Called “Minimum base quality score” in the SeqMan NGen wizard, this parameter specifies the minimum quality score for a base to be considered in the SNP calculation.

[number]

In the simple SNP calling method (used when genome ploidy is “Heterogeneous”), the default is 20. In the Bayesian SNP calling methods (used when genome ploidy is “Diploid” or “Haploid”), the default is 5.

snp_reportUserMissing

Specifies what kind of positions to put in the missingUser file, including one or more of the following:

dbSNP = dbSNP Pos

user = in user VCF SNP file

zeroCoverage = include zero coverage regions

cosmic = in COSMIC database

allcaptured = include all positions in capture regions

captured = include only positions in capture regions

Example:

snp_reportUserMissing: [user allcaptured captured]

[kParamTypeStrFixedVocab]

snp_runVar

Uses a Bayesian probabilistic model to exclude heterozygous insertions and deletions in homopolymeric runs. Intended for use with Ion Torrent data.

[true|false]

Defaults: true for 454 and Ion Torrent read technologies; false for all others.

snp_showAllFeatures

Specifies whether XNG should count SNPs multiple times if the SNP contacts different versions (variants) of a CDS feature.

[true|false]

snp_writeExtended

Specifies whether the additional values produced by the Haploid or Diploid SNP calculation methods are included in the SNP table.

[true|false]

snpMethod

Specifies the SNP detection method to use. Simple produces a count of each type of base in the column and calculates the percent of non-reference bases. Haploid uses a Bayesian statistical model to calculate a probability score that the position contains a polymorphism and give a quality score for the base called at that position. Diploid uses a Bayesian statistical model to calculate a probability score that the position contains a polymorphism and give a quality score for the base(s) called at that position. Based on the scores, it also calls the genotype at each position.

[simple|haploid|diploid]

splitTemplateContigs

Specifies under which circumstances contigs should be cut after a templated assembly. Any split contigs will be grouped into scaffolds with a defined position to allow for easy sorting when the project is viewed in SeqMan Pro. This command pertains only to reference-guided assemblies with gap closure. By default, during this type of assembly, the XNG assembler first finds structural variations (SVs) then splits the contig after each SV. Elements of this process can be modified using this command.

0|false = Don't split

1|true = Split at locations with zero coverage

2 = Split at insertions and deletions

3 = Split at zero coverage and at insertions

[integer between 0-3|true|false]

Default = 2

template

(required) Specifies the directory and file name of the reference sequence file. A folder with one or more reference sequence files can also be used in place of individual file names. Each entry must also be enclosed by brackets. If more than one template entry is used, the list must also be enclosed by an additional set of brackets.

Properties for template:

file: [directory/filename enclosed in quotes]

Specifies the directory and file/folder.

feature: [directory/filename enclosed in quotes]

(optional) Specifies the directory and file name for annotated features when the reference sequence and feature annotations are in separate files.

transcriptKind: [both|identified|novel] if the .Transcriptome package is used as a template, defines which transcripts will be used as a template.

userSNP: [directory/filename enclosed in quotes]

exomeCapture:

file: [directory/filename enclosed in quotes] The BED file name.

track: [string] the region of interest (Optional)

merMask: [true|false] Specifies if mers from outside of the capture region should be excluded from assembly.

Examples for template:

Sequence and annotation in one file:

AssembleTemplate

{file: “/data/home/proj/W3110.gbk”}}

Sequence and annotation in separate files:

AssembleTemplate

feature: “/Library/ABC_proj/references/MG1655.gff”}

[directory/filename enclosed in quotes]

templateHitCntThresh

(Intended for internal use only)

[number]

trimToTargetRegions

Controls whether reads are trimmed, by default, to the boundaries of the targeted regions, as defined by the .bed or manifest file. The default of ‘true’ indicates that the reads are trimmed to the stated boundaries. If conditions are not met, the SeqMan NGen wizard does not change this parameter to 'false,' but instead omits it from the script. The parameter status is only shown in the script for control workflows.

[true|false]

Advanced Options, Alignment tab: Trim to targeted regions

unassembled

[directory/filename enclosed in quotes]

verify

[true|false]

computeSNP

Sets parameters for the SNP computation phase of the assembly. The command is designed for use with existing BAM files that have not been analyzed for SNPs, or to re-analyze an existing file with different parameters. Most of the parameters for computeSNP are identical to parameters for assembleTemplate, described above:

showCDSVariant	snp_logLevel	snp_minProbNonrefToCall
snp_checkStrandedness	snp_logStartPos	snp_minStrandCov
snp_combineSubs	snp_logTemplateID	snp_minVariantDepthToScore
snp_excludeBases3p	snp_maxRun	snp_minWeight

calcJunctionSeqs

In the structural variation workflow, specifying 'false' prevents junction sequences from being calculated.

[true|false]

concurrentAligns

(Intended for internal use only)

[number]

file

(required) Specifies the path and name of one or more .assembly projects from which to compute SNPs.

[directory/filename enclosed in quotes]

snp_writeMissingDBSnps

In a SNP assembly, specifying 'false' causes missing SNPs not to be recorded, saving time and file space.

[true|false]

snpFilter

Specifies whether SNP filtering is turned on or off.

Properties for snpFilter:

capture: [true|false]

Specifies whether there is an exome capture file. If an exome capture file is added in the SeqMan NGen wizard or through a script, this value is set to ‘true.’ In the absence of an exome capture file, the SeqMan NGen wizard automatically sets this property to 'false.'

pNotRefMinVal: [number]

In the unusual case that the hard filter is missing, this property is used to set the minimum value that can be displayed in the SeqMan SNP table. Otherwise, this property is ignored. Default is 10.

userOnly: [true|false|All]

Specifies whether there is a VCF SNP file. The SeqMan NGen wizard always calls this as ‘true’ (or ‘yes’) but ignores the property if no VCF SNP file has been loaded.

pNotRef: [number]

Called “SNP Filter Stringency” in the SeqMan NGen wizard, this specifies a PnotRef threshold. This is a “soft” filter. Data not matching the criterion are removed from the default display of the SeqMan Pro SNP table. This option is only available for the Bayesian SNP calling methods (used when genome ploidy is “Diploid” or “Haploid”). Wizard values include Low (90%), Medium (99%) and High (99.9%).

minSnpFilter: [number]

This parameter does not relate to any setting in the SeqMan NGen wizard, but corresponds to “SNP%” in SeqMan Pro and “minSNPFilter” in ArrayStar. In the simple SNP calling method (used when genome ploidy is “Heterogeneous”), the default is 5% for 454 and Ion Torrent read technologies; 1% for all others. In Bayesian SNP calling methods (used when genome ploidy is “Diploid” or “Haploid”), the default depends on stringency and ploidy rather than the read technology. The default for Diploid is 15% for all stringency levels. The default for Haploid is 25% for low stringency, 50% for medium and 75% for high.

minDepth: [number]

(optional) Specifies a minimum sequence depth threshold. This parameter does not relate to any setting in the SeqMan NGen wizard, but corresponds to “Depth” in SeqMan Pro and “minDepth” in ArrayStar. In the simple SNP calling method (used when genome ploidy is “Heterogeneous”), the default is 50. In Bayesian SNP calling methods (used when genome ploidy is “Diploid” or “Haploid”), the default is 20.

A set of SNP filters used by ArrayStar and SeqMan Pro.

codonOnly : [Coding|CodingChange|Nonsense|All]

maxDepth: [number]

maxCodingFeatureDistance: [number]

minSnpFilter: [number]

qCall: [number]

synonymousCodingChange: [true|false]

substitionCodingChange: [true|false]

noStartCodingChange: [true|false]

noStopCodingChange: [true|false]

nonsenseCodingChange: [true|false]

frameshiftCodingChange: [true|false]

notCodingCodingChange: [true|false]

inFrameIndelCodingChange: [true|false]

refOnly: [Reference|Unique|All]

cosmicOnly : [Yes|No|All]

minIndelSize: [number]

gerpScore: [number]

substitution: [true|false]

showIndels: [true|false]

[true|false]

(pertains to pNotRef only) Assembly Options:

SNP Filter Stringency
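
The defaults described above for pNotRef, minSnpFilter, and minDepth depend on ploidy, read technology, and stringency. The following Python sketch collects those stated defaults in one place. The function name and the technology labels "454" and "Ion Torrent" are assumptions made for illustration; the exact strings XNG expects may differ.

```python
# Wizard stringency levels and their PnotRef thresholds (percent).
PNOTREF_BY_STRINGENCY = {"Low": 90.0, "Medium": 99.0, "High": 99.9}
# Haploid minSnpFilter defaults (percent) by stringency.
HAPLOID_MIN_SNP = {"Low": 25.0, "Medium": 50.0, "High": 75.0}

def default_snp_filters(ploidy, seq_tech="Illumina", stringency="Medium"):
    """Return the documented default filter values for one configuration."""
    if ploidy == "Heterogeneous":
        # Simple SNP calling: default depends on read technology only.
        min_snp = 5.0 if seq_tech in ("454", "Ion Torrent") else 1.0
        # pNotRef is only available for the Bayesian methods.
        return {"minSnpFilter": min_snp, "minDepth": 50, "pNotRef": None}
    if ploidy == "Diploid":
        min_snp = 15.0  # same for all stringency levels
    elif ploidy == "Haploid":
        min_snp = HAPLOID_MIN_SNP[stringency]
    else:
        raise ValueError(f"unknown ploidy: {ploidy!r}")
    return {
        "minSnpFilter": min_snp,
        "minDepth": 20,
        "pNotRef": PNOTREF_BY_STRINGENCY[stringency],
    }
```

For example, default_snp_filters("Haploid", stringency="High") yields a minSnpFilter of 75, a minDepth of 20, and a pNotRef of 99.9.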

userSNP

Specifies a location for storing the VCF SNP table.

[directory/filename enclosed in quotes]

createGenomeTemplate

(Intended for internal use only)

file

Specifies the directory and file/folder of the input file.

[directory/filename enclosed in quotes]

output

The path and name of the output file.

[directory/filename enclosed in quotes]

diskPath

(required) Defines the default directory where temporary intermediate files from the assembly will be stored. These files can be very large for large-scale projects. Visit our website to view space requirements for a range of representative projects.

clean

Specifies whether or not to clean the merge disk. When automated scripts are being run simultaneously or sequentially, this command can be useful for emptying the merge disk between assemblies.

[true|false]

pathMac

Specifies the default path and file name for Macintosh.

[directory/filename enclosed in quotes]

pathWin

Specifies the default path and file name for Windows.

[directory/filename enclosed in quotes]

path

(required) Specifies the default path and file name.

Example:

diskPath

path: “/data/proj/”

[directory/filename enclosed in quotes]

dumpConsensus

(Intended for internal use only). Converts the binary consensus file created during assembly into a text file.

file

Specifies the directory and file/folder.

[directory/filename enclosed in quotes]

dumpSNP

(Intended for internal use only). Creates a tab-delimited text file from one or more SNP-containing binary files generated during assembly. SNP binary files include those with the .snpExt suffix contained in an .assembly package, as well as those with either the .coverage.missingSNP or .nocoverage.missingSNP suffix contained in the _shared folder. To convert all the .snpExt files in a package, simply use the .assembly name.

file

(required) Specifies the path and name of .assembly package (all SNP files will be included), one or more individual .snpExt files or either/both of the missingSNP files.

[directory/filename enclosed in quotes]

output

(required) Specifies the path and name of the output file.

[directory/filename enclosed in quotes]

refPos_end

Exports only SNPs with positions lower than this value.

[number]

refPos_start

Exports only SNPs with positions higher than this value.

[number]

snp_maxProbNonrefToCall

Upper limit for probability scores for exported SNPs.

[number]

snp_minProbNonrefToCall

Lower limit for probability scores for exported SNPs.

[number]

snp_type

Specifies which SNP file from the .assembly to use as an input.

templateID

Defines the template for which the SNP will be exported.

[number]

onefile

Defines whether all SNPs should be placed into one file.

[true|false]

exportSplits

(Intended for internal use only). Converts the binary splits file created during assembly into a text file.

file

Specifies the directory and file/folder.

[directory/filename enclosed in quotes]

output

The path and name of the output file.

[directory/filename enclosed in quotes]

execute

Executes any shell script command.

command

Text for any shell script command.

[text string]

exportVCF

Accepts the exome capture file and VCF file and builds another VCF file containing SNPs only in the capture regions.

userSNP

User SNP file.

[directory/filename enclosed in quotes]

exomeCapture

file: [directory/filename enclosed in quotes] Exome capture file.

track: [text string] The name of the region of interest.

output

The output VCF file.

[directory/filename enclosed in quotes]
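
To illustrate the kind of filtering exportVCF describes, the Python sketch below keeps only VCF records whose position falls inside a capture region read from a BED file. It is not the XNG implementation: the function names are hypothetical, the track property is not modeled, and only the POS column is tested, so a long deletion that starts outside a region is not considered.

```python
def parse_bed(lines):
    """Return {chrom: [(start, end), ...]} from BED lines."""
    regions = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and 'track'/'browser' header lines.
        if (len(fields) < 3 or fields[0].startswith("#")
                or fields[0] in ("track", "browser")):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        regions.setdefault(chrom, []).append((start, end))
    return regions

def filter_vcf(vcf_lines, regions):
    """Keep header lines and records located inside a capture region."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        chrom, pos = fields[0], int(fields[1])
        # VCF POS is 1-based; BED intervals are 0-based and half-open,
        # so position p lies in (start, end) when start < p <= end.
        if any(start < pos <= end for start, end in regions.get(chrom, [])):
            kept.append(line)
    return kept
```

The coordinate conversion is the detail most easily missed: a BED interval of 100 to 200 covers VCF positions 101 through 200.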

extractPairs

Creates a tab-delimited table of paired-end information.

file

The path and name of any pair distance file (.pairdist file) from within a project's shared folder.

[directory/filename enclosed in quotes]

output

The path and name of the output file.

[directory/filename enclosed in quotes]

include

When building a script, this command can be used to call up additional lines of script previously stored in a text file. In this way, a group of commands can be shared between two or more scripts.

file

Specifies the directory and file/folder.

[directory/filename enclosed in quotes]

loadAssembly

(Intended for internal use only)

file

Specifies the directory and file/folder.

[directory/filename enclosed in quotes]

loadBAM

Sets parameters for analyzing existing BAM files. It allows ungapped BAM files to be converted into fully gapped assembly files, or an existing file to be re-gapped with different parameters. The command also permits SNPs to be calculated, or re-calculated with different parameters, starting from an existing BAM file. The associated parameters are also available for full assemblies and are described under the assembleTemplate command, near the top of this table:

Parameter	Allowed values
align
delayAlignInserts	[true|false]
format
gapPenalty
increaseRunGapPen	[true|false]
layout
matchScore
minAlignedLength
minMatchPercent
mismatchPenalty
output
removeUniqueInserts	[true|false]
snp
snp_checkStrandedness
snp_clusteredPosFilterMinDev	[number]
snp_clusteredPosFilterMinFromEdge	[number]
snp_hetKnownThresh	[number]
snp_hetThresh	[number]
snp_limitEndPos
snp_limitStartPos
snp_limitTemplateID
snp_logEndPos
snp_logLevel
snp_logStartPos
snp_logTemplateID
snp_maxRun	[number]
snp_maxStrandBias	[number]
snp_minHomopolDelDepth	[number]
snp_minHomopolDelFrac	[number]
snp_minHomopolInsDepth	[number]
snp_minHomopolInsFrac	[number]
snp_minPctToScore
snp_minProbNonrefToCall
snp_minStrandCov	[number]
snp_minVariantDepthToScore
snp_minWeight
snp_nlMutationRate	[number] The chance that any single base is different from the reference. The default value of 0.0013 is equivalent to ~4 million variations in a Human sample against the reference or several thousand in a bacterial genome.
snp_observedInControlFilterMaxCount	[number]
snp_observedInControlFilterMaxFrac	[number]
snp_proximalGapFilterMaxDel	[number]
snp_proximalGapFilterMaxIns	[number]
snp_proximalGapFilterWindowSize	[number]
snp_reportUserMissing	[dbSNP|user|zeroCoverage|cosmic|allcaptured|captured]
snp_runVar	[true|false]
snp_showAllFeatures	[true|false]
snp_writeExtended
snp_writeMissingDBSnps	[true|false]
snpMethod
snpRefAsm	[quoted file name]
template
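
The figures quoted for snp_nlMutationRate can be checked with simple arithmetic: the expected number of variant positions is the genome length multiplied by the per-base rate. The Python sketch below assumes approximate genome sizes of 3.1 Gb for human and 4.6 Mb for a bacterium of roughly E. coli size; these sizes are illustrative values and do not come from this table.

```python
def expected_variants(genome_length_bp, mutation_rate=0.0013):
    """Expected number of positions that differ from the reference."""
    return genome_length_bp * mutation_rate

# Assumed, approximate genome sizes (not taken from this document).
human = expected_variants(3_100_000_000)   # about 4.03 million
bacterial = expected_variants(4_600_000)   # about 5,980
print(f"human: {human:,.0f}  bacterial: {bacterial:,.0f}")
```

Both results agree with the description: about 4 million variations for a human sample and several thousand for a bacterial genome.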

mergeIonTorrentShortReads

When using Ion Torrent data, use of this command merges overlapping short reads into mini-contigs.

output

(required) Specifies the path and directory of the output files.

[directory/filename enclosed in quotes]

query

(required) Specifies the directory and file name(s) of the query data to be assembled. A folder with one or more data files can also be used in place of individual file names.

[directory/filename enclosed in quotes]

message

Writes out the string to the standard output.

str

Specifies the string to be written to the standard output.

[text string]

pairFilePattern

Allows you to specify the naming pattern for pair files using GREP-style regular expressions.

Example:

pairFilePattern

forward: “(?'name'.*)_R1_(?'ext'.*)\fastq”

reverse: “(?'name'.*)_R2_(?'ext'.*)\fastq”

forward

A naming pattern to match forward clones.

[text string enclosed in quotes]

reverse

A naming pattern to match reverse clones.

[text string enclosed in quotes]
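
The pairFilePattern example uses named groups to capture the parts of a file name shared by both mates. The Python sketch below shows the same idea under several assumptions: Python's re module spells named groups as (?P<name>...) rather than (?'name'...), the pattern ending is written here as an escaped ".fastq", and two files are treated as a pair when all captured groups are equal. How XNG itself matches the groups is not stated in this table.

```python
import re

# Python equivalents of the forward and reverse patterns (assumed ending).
FORWARD = re.compile(r"(?P<name>.*)_R1_(?P<ext>.*)\.fastq")
REVERSE = re.compile(r"(?P<name>.*)_R2_(?P<ext>.*)\.fastq")

def pair_files(filenames):
    """Pair forward and reverse files whose captured groups are equal."""
    forward, reverse = {}, {}
    for filename in filenames:
        match = FORWARD.fullmatch(filename)
        if match:
            forward[(match["name"], match["ext"])] = filename
            continue
        match = REVERSE.fullmatch(filename)
        if match:
            reverse[(match["name"], match["ext"])] = filename
    # Keep only keys seen in both directions.
    return sorted((forward[k], reverse[k]) for k in forward if k in reverse)
```

With this sketch, sample1_R1_001.fastq pairs with sample1_R2_001.fastq, while a forward file with no matching reverse file is left unpaired.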

pause

Pauses script execution. This can be used to stop at any point when running table scripts.

Example:

pause

prompt: “Table script paused. Press enter to continue.”

prompt

Text to appear in the console. The pause is terminated by hitting the Enter key.

[text string enclosed in quotes]

quit

Terminates a script.

RemoveDuplicateSeqs

Coalesces multiple identical reads at the same position into a single read, provided the reads match the template exactly. If this feature is active, at the end of assembly, XNG will print the message: “Coalesced $lld identical reads that matched the template exactly.” Allowable values are [true|false]; default is false.
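
The coalescing rule can be pictured with a small Python sketch. It assumes reads are given as (position, sequence) pairs with 0-based positions on the template, and that the reported count is the number of reads removed; both are assumptions made for illustration and may not match XNG's internal bookkeeping.

```python
def coalesce_exact_duplicates(reads, template):
    """Collapse identical reads at one position that match the template.

    reads: list of (position, sequence) tuples, 0-based positions.
    Returns (kept_reads, number_of_reads_removed).
    """
    seen = set()
    kept = []
    removed = 0
    for pos, seq in reads:
        matches_template = template[pos:pos + len(seq)] == seq
        key = (pos, seq)
        if matches_template and key in seen:
            removed += 1  # an identical exact-match read is already kept
            continue
        if matches_template:
            seen.add(key)
        # Reads that differ from the template are never coalesced.
        kept.append((pos, seq))
    return kept, removed
```

Note that duplicate reads carrying a mismatch are all retained, since only reads that match the template exactly qualify.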

runScript

Allows batching of multiple projects of the same type (e.g. assembly, computeSNPs). Three files are required: 1) a runScript file with variables, 2) a file with a table of values for the variables, and 3) a script file specifying the action to be carried out.

Example (runScript file):

setDefaultDirectory directory: “.”

set $force: false

set $DataDisk: “/Volumes/Raid/DataDisk”

set $ResultDisk: “/Volumes/ResultDisk”

set $MergeDisk: “/Volumes/MergeDisk0”

set $snp:true

set $snpMethod:“Diploid”

set $repCnt:100

set $merLayoutMin:19

diskPath path: {“${MergeDisk}/mergeSort Data”}}

runScript table: “testAssembly.txt” script: “testAssembly.template.script”

Example (table file):

defaultDir template query isPair seqTech project merSize snp snpMethod

“${ResultDisk}/rice” ${DataDisk}/rice.genome ${DataDisk}/rice FALSE Illumina rice 21 TRUE Diploid

“${ResultDisk}/ecoli” ${DataDisk}/Ecoli.gbk ${DataDisk}/ecoli TRUE Illumina Ecoli 21 TRUE Diploid

“${ResultDisk}/Exome” ${DataDisk}/GRCh37.gbk ${DataDisk}/Sample1 FALSE 454 HuEx 19 TRUE Diploid

Example (script file):

; “assembly.template.script”

setMachineMemory memory:32

setDefaultDirectory directory: $defaultDir

compareSeqs template: $template

query: {file: $query

isPair: $isPair

seqTech: $seqTech}

directoryMer: “intermediateFiles”

; directoryQueryMer: “intermediateFiles”

hits: “intermediateFiles/${project}.hits”

output: “results_${mersize}_${merSkipQuery}/${project}”

; results per project

results: “${project}.results.txt”

; aggregate all results

results: “${ResultDisk}/assembly.results.txt”

merSize: $mersize

merSkipQuery: $merSkipQuery

repeatCnt: $repCnt

merLayoutMin: $merLayoutMin

layoutType: once

maxGap: 6

format: BAM

onePackage: true

snp: $snp

snpMethod: $snpMethod

; snp_writeExtended: true

forceMake: $force

script

The filename and location of the script.

[directory/filename enclosed in quotes]

table

The filename and location of the file containing the text-string and numeric values for each variable.

[directory/filename enclosed in quotes]

inline

Executes the list of commands and parameters.
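
The way the three runScript files fit together can be sketched in Python. In this illustration, each table row supplies one value per column header, and ${Name} references are filled in from variables defined with set. The function name is hypothetical, splitting rows on whitespace is an assumption (values containing spaces would need a different delimiter), and Python's string.Template stands in for XNG's own substitution rules, which may differ.

```python
from string import Template

def expand_rows(header, rows, variables):
    """Yield one dict of variable values per table row."""
    names = header.split()
    for row in rows:
        # Assumed format: whitespace-separated values, optional quotes.
        values = [value.strip('"') for value in row.split()]
        row_vars = dict(variables)  # start from the 'set' variables
        for name, value in zip(names, values):
            # Resolve ${DataDisk}-style references inside the table value.
            row_vars[name] = Template(value).substitute(variables)
        yield row_vars

variables = {"DataDisk": "/Volumes/Raid/DataDisk",
             "ResultDisk": "/Volumes/ResultDisk"}
header = "defaultDir template query isPair seqTech project merSize snp snpMethod"
rows = ['"${ResultDisk}/rice" ${DataDisk}/rice.genome ${DataDisk}/rice '
        'FALSE Illumina rice 21 TRUE Diploid']

for row_vars in expand_rows(header, rows, variables):
    # One line of the script file, filled in for this row.
    print(Template("hits: intermediateFiles/${project}.hits").substitute(row_vars))
```

Each row of the table therefore produces one complete run of the script file with its own set of values.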

set

Used to set variables. See the example below and those under the runScript command.

Example:

set $snp:true

set $snpMethod:“Diploid”

setDefaultDirectory

(required) Defines the default directory for the project. When a default directory is specified, files located in that directory only need to be identified by their subfolder and/or file name in subsequent commands.

Example:

setDefaultDirectory

directory: “/data/home/proj/”

directory or defaultDirectory

(required) Specifies the default directory. Previously called defaultDirectory.

[directory/filename enclosed in quotes]

directoryMac or defaultMacDirectory

Specifies the default directory for Macintosh. Previously called defaultMacDirectory.

[directory/filename enclosed in quotes]

directoryWin or defaultWinDirectory

Specifies the default directory for Windows. Previously called defaultWinDirectory.

[directory/filename enclosed in quotes]
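
The effect of a default directory on later file references can be illustrated with a short Python sketch. It assumes POSIX-style paths and assumes that an absolute path in a later command is used unchanged; the helper name resolve is hypothetical and the exact resolution rules are not specified in this table.

```python
from pathlib import PurePosixPath

def resolve(default_directory, path):
    """Resolve a path from a script against the default directory."""
    # Joining with an absolute path returns that absolute path unchanged.
    return str(PurePosixPath(default_directory) / path)

print(resolve("/data/home/proj/", "references/MG1655.gff"))
print(resolve("/data/home/proj/", "/Library/ABC_proj/W3110.gbk"))
```

Under these assumptions, only the subfolder and file name need to be written for files stored beneath the default directory.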

setMachineMemory

Defines the amount of random access memory (RAM) that the program will use. Limiting the amount of RAM available to the assembler allows you to use the computer for other purposes while an assembly is running. However, this will likely slow down the assemblies and is not recommended for large projects.

Example:

setMachineMemory

memory: 32

memory

(required) Amount of RAM (in GB) to be used, entered in multiples of four. Entering a value greater than the available RAM causes all RAM to be used.

[number that is a multiple of 4]
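
The rule for the memory value can be expressed as a small Python check. The sketch assumes that a value that is not a multiple of four should be rejected, which this table does not state explicitly; the function name is hypothetical.

```python
def effective_memory_gb(requested_gb, available_gb):
    """Return the amount of RAM (GB) that would actually be used."""
    if requested_gb <= 0 or requested_gb % 4 != 0:
        # Assumed behavior: reject values that are not multiples of four.
        raise ValueError("memory must be a positive multiple of 4 (GB)")
    # A request above the installed RAM means all available RAM is used.
    return min(requested_gb, available_gb)
```

For example, requesting 128 on a machine with 64 GB results in 64 GB being used, while requesting 32 on the same machine limits the assembler to 32 GB.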

setParam

Adjusts the stringency of one or more of the assembly parameters for the project. SeqMan NGen will use the default values for any parameter that is not specified within the script.

All of the parameters for setParam are identical to parameters for assembleTemplate, described near the top of this table:

delayAlignInserts

gapPenalty

increaseRunGapPen

matchScore

minAlignedLength

minMatchPercent

mismatchPenalty

removeUniqueInserts