If you are following the de novo transcriptome RNA-Seq workflow, output results are saved in a folder called [project name] De Novo Transcriptome Assembly. This folder contains the following subfolders and files:

Subfolder File/Folder Name Description
[project name]_rnaAssemble.script Input script used to create the assembly results. This file can be opened in SeqMan Pro in order to examine isoforms using the Feature Table.
Assemblies [project name]_novel_transcripts.sqd SQD assembly of all contigs that did not have a database match.
[project name]_unassembled.fastq Multi-sequence FASTQ file with all unclustered and unassembled sequences.
sub_0 (folder) Folder containing subfolders (sub_0, sub_1, etc.) with a separate .sqd document for each final assembly. If available, gene and organism names are used to create the file names.
Intermediate Assembly Results cluster (folder) Intermediate results are deleted by default at the end of the assembly. To retain them, set the deleteIntermediates parameter of the assembleTemplate command to false in the input script.
combine (folder)
intermediateFiles (folder)
Reports [project name].AllTranscripts.SearchResults.txt Excel file containing summary information for each of the final assembled contigs. The table automatically opens for viewing when you open a .Transcriptome package in SeqMan Pro. The table, known in SeqMan Pro as the “All Transcripts” table, contains the following columns:

* Assembly ID – Name assigned to the assembled sequence, using the criteria specified in the wizard.

* Gene name, Custom column #1* – Best matching gene meeting criteria defined in the wizard.

* Organism name, Custom column #2* – Organism from which the best matching gene came.

* Accession number, Custom column #3* – Accession number of the best match.

* Description, Custom column #4* – Description of the best match.

*Custom columns: The four “custom columns,” above, use default names (e.g., Gene name, Organism name) if one of the default RefSeq databases was used in the SeqMan NGen assembly. However, if you used a custom GREP expression or a custom database that did not include these fields, these columns may have different names or be absent from the table.

* Database – Database (e.g., RefSeq or Custom) from which the best matching gene came.

* Transcript length – Length of the assembled sequence, in bases.

* Transcript start – Position in the assembled sequence where the match begins.

* Transcript end – Position in the assembled sequence where the match ends.

* % Transcript match – Length of the matching segment in the transcript x 100, divided by the total length of the transcript.

* Gene length – Length of the database entry, in bases.

* % of Full length – Length of the assembled sequence x 100, divided by the length of the corresponding database entry. Values greater than 100% indicate that the assembled sequence is longer than the database entry.

* Gene start – Position in the database entry where the match begins.

* Gene end – Position in the database entry where the match ends.

* % Gene match – Length of the matching segment in the database entry x 100, divided by the total length of the database entry.

* % Identity – Total number of identical bases in the matching region x 100, divided by the total number of bases in the matching region.

* Bit score – Normalized value calculated from the raw score and expressed in units of “bits,” a common measure in information theory.

* eValue – “Expectation value,” an estimate of the probability of obtaining the observed alignment score with two random sequences. Expectation values are less sensitive to length than bit scores and are therefore generally a better measure of alignment quality.

* Assembled reads – Total number of assembled reads for that sequence.
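
The percentage columns above follow directly from the start, end, and length columns. The short Python sketch below works through the arithmetic for % Transcript match, % Gene match, and % of Full length. It is only an illustration: the helper names and the sample values are made up, and it assumes that start and end positions are 1-based and inclusive, which this page does not state.

```python
def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part * 100 / whole


def match_length(start: int, end: int) -> int:
    """Length of a matching segment.

    Assumption: positions are 1-based and inclusive.
    """
    return abs(end - start) + 1


# Illustrative values only, not taken from a real assembly.
transcript_length = 1200
transcript_start, transcript_end = 101, 1000
gene_length = 1000
gene_start, gene_end = 1, 900

# % Transcript match: matching segment of the transcript / transcript length
pct_transcript_match = percent(
    match_length(transcript_start, transcript_end), transcript_length
)

# % Gene match: matching segment of the database entry / database entry length
pct_gene_match = percent(match_length(gene_start, gene_end), gene_length)

# % of Full length: assembled sequence length / database entry length
pct_full_length = percent(transcript_length, gene_length)

print(pct_transcript_match, pct_gene_match, pct_full_length)
```

In this example the assembled sequence is longer than the database entry, so % of Full length is greater than 100%.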
[project name].AllTranscripts.Table.txt Excel file containing summary information for each of the final assembled contigs. The table contains the following columns:

* Assembly ID – Name assigned to the assembled sequence, using the criteria specified in the wizard.

* Type – Type of matching gene (e.g., mRNA, tRNA, or rRNA).

* Gene length – Length of the database entry, in bases.

* % of Full length – Length of the assembled sequence x 100, divided by the length of the corresponding database entry. Values greater than 100% indicate that the assembled sequence is longer than the database entry.

* Assembled reads – Total number of assembled reads for that sequence.

* Depth – Average depth of coverage.
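
If you want to filter this table outside of SeqMan Pro, a script along the following lines may help. It is a minimal sketch that assumes the report is tab-delimited text with a header row whose names match the columns listed above; check your own file before relying on that. The function name, the thresholds, and the sample rows are hypothetical.

```python
import csv
import io

# Made-up rows standing in for [project name].AllTranscripts.Table.txt.
# Assumption: the file is tab-delimited with these column headers.
SAMPLE = (
    "Assembly ID\tType\tGene length\t% of Full length\tAssembled reads\tDepth\n"
    "contig_1\tmRNA\t1000\t120.0\t450\t35.2\n"
    "contig_2\trRNA\t1500\t40.0\t80\t6.1\n"
)


def near_full_length(handle, min_pct=90.0, min_depth=10.0):
    """Return Assembly IDs that meet both thresholds (hypothetical cutoffs)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["Assembly ID"]
        for row in reader
        if float(row["% of Full length"]) >= min_pct
        and float(row["Depth"]) >= min_depth
    ]


hits = near_full_length(io.StringIO(SAMPLE))
print(hits)
```

To run this against a real report, pass an open file instead of the sample text, for example `open(path, newline="")`.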
Transcripts [project name]_identified_transcripts.fas Multi-sequence .fasta file containing the consensus sequences from all the assembled contigs that had a database match. Header lines for each entry contain the name and sequence length.
[project name]_novel_transcripts.fas Multi-sequence .fasta file containing the consensus sequences from all the assembled contigs that did not have a database match. Header lines for each entry contain the name and sequence length.
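
The two .fas files can be read with any FASTA parser. The sketch below is one minimal way to list the sequences and their lengths in Python. The function name and the sample records are hypothetical, and the code makes no assumption about the exact layout of the header line beyond treating the first whitespace-separated token as the name.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a multi-sequence FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records standing in for [project name]_novel_transcripts.fas.
SAMPLE = ">contig_1\nACGTAC\nGT\n>contig_2\nTTGCA\n"

# Map each name (first token of the header) to its sequence length.
lengths = {
    header.split()[0]: len(seq)
    for header, seq in read_fasta(io.StringIO(SAMPLE))
}
print(lengths)
```

The computed lengths can be compared with the length given in each header line as a quick sanity check on the file.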
