In this tutorial, you will de novo assemble a set of paired-end RNA-Seq reads from Saccharomyces cerevisiae (yeast) published by Nookaew I et al., 2012. The data set has been abbreviated to about 1 million reads per file.

With other applications, de novo assembly of RNA-Seq data can potentially result in thousands of unlabeled contigs representing the expressed transcripts. By contrast, SeqMan NGen automatically attempts to group contigs from the same gene, and then name and annotate them based on the best match to a collection of annotated reference sequences (the “Transcript Annotation Database”) extracted from data on NCBI’s RefSeq website. Results from this workflow are non-quantitative.

Running the transcriptome assembly in SeqMan NGen:

In this part of the tutorial, you will use SeqMan NGen to de novo assemble and annotate the RNA-Seq data.

  1. Download the tutorial data (147 MB) and extract it to any convenient location (e.g., your desktop). The tutorial data consist of the paired-end reads Yeast_RNASeq_1Mreads_1.fastq and Yeast_RNASeq_1Mreads_2.fastq.
  1. Launch SeqMan Ultra and choose New Assembly on the left. On the right, click on the Transcriptomics workflow named De novo transcriptome assembly and annotation. This causes SeqMan NGen to open at the Workflow screen.
  1. Choose the De Novo Assembly workflow named De novo transcriptome.
  1. In the Set Contaminant screen, take the opportunity to verify that you are logged in by looking at the key icon in the bottom left corner. If there is a green check mark, click Next. If there is a yellow triangle, click the icon and enter the same login credentials you use for the DNASTAR website. Once you return to the Set Contaminant screen, click Next.
  1. In the Input Sequences screen, press Add and add the Yeast_RNASeq_1Mreads_1.fastq and Yeast_RNASeq_1Mreads_2.fastq files. Click Next.
  1. In the Transcript Annotation Database screen, click the Download Database button. Choose RefSeq Fungi and press Select. Then click Next.

  1. In the Assembly Options screen, click Next.
  1. In the Assembly Output screen, type “Transcriptome” into the Project Name text box, then use the Browse button to specify a Project Folder for your assembly output files. Click Next.
  1. In the Run Assembly Project screen, note that:

    • The estimated disk requirement of 2.1 TB is based on the total length of the fungal Transcript Annotation Database, which is 4.2 GB, larger than a human genome. That estimate is based on reference-guided genome assemblies that have fixed 50X coverage, not reference-guided transcriptome assemblies, which have highly variable coverage. The assembly in this tutorial has extremely low coverage and uses far less disk space than what is estimated here.

    • Cloud Assembly is not offered for the de novo transcriptome workflow because most data sets exceed the 48-hour time limit.

      Click the link “Run assembly on this computer.” The assembly will take approximately one hour on a standard laptop.
  1. Wait until you are informed that the assembly has finished, then click Next.

  1. In the Assembly Summary screen, note the button View assembled transcripts. In the future, this button will allow you to open transcriptome results in SeqMan Ultra. As of version 17, however, .transcriptome results can only be analyzed in SeqMan Pro.
  1. Click Finish to close SeqMan NGen and press Yes when prompted.
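
Before adding the two read files in the Input Sequences screen, it can be useful to confirm that they really are a matched pair. The following is a minimal sketch and not part of SeqMan NGen. It assumes each FASTQ record occupies exactly four lines and that mates share the same read ID, optionally followed by a /1 or /2 suffix. Read-naming conventions vary, so the ID comparison may need adjusting for other data sets. The function names `read_ids` and `check_pairs` are hypothetical.

```python
from itertools import zip_longest
from pathlib import Path


def read_ids(fastq_path):
    """Yield the read ID from each record, assuming 4 lines per record."""
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # skip sequence, separator, and quality lines
            if not line.startswith("@"):
                raise ValueError(
                    f"{fastq_path}: line {line_number + 1} is not a FASTQ header"
                )
            fields = line[1:].split()
            read_id = fields[0] if fields else ""
            # Assumption: mates may be marked with a trailing /1 or /2.
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            yield read_id


def check_pairs(path_1, path_2):
    """Return the number of read pairs; raise ValueError on any mismatch."""
    pairs = 0
    for id_1, id_2 in zip_longest(read_ids(path_1), read_ids(path_2)):
        if id_1 is None or id_2 is None:
            raise ValueError("the two files contain different numbers of reads")
        if id_1 != id_2:
            raise ValueError(
                f"record {pairs + 1}: {id_1!r} does not pair with {id_2!r}"
            )
        pairs += 1
    return pairs


if __name__ == "__main__":
    # File names from this tutorial; the check runs only if both are present.
    file_1 = Path("Yeast_RNASeq_1Mreads_1.fastq")
    file_2 = Path("Yeast_RNASeq_1Mreads_2.fastq")
    if file_1.exists() and file_2.exists():
        print(f"{check_pairs(file_1, file_2)} read pairs found")
```

Run the script from the folder where the tutorial data were extracted. A ValueError indicates that the files differ in length or order and should be downloaded again before assembly.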

Viewing transcripts in SeqMan Pro:

During the assembly process, the de novo transcriptome assembly output was saved to a package called Transcript Project.Transcriptome. Any assembled transcripts with a database match exceeding the specified thresholds were termed “Identified Transcripts,” while assembled transcripts that did not have a database match were called “Novel Transcripts.” This part of the tutorial shows how to load the annotated transcripts into the SeqMan Pro application for downstream analysis.
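
The split between Identified and Novel Transcripts can be pictured as a simple threshold test. The sketch below is illustrative only and does not reproduce SeqMan NGen's actual scoring. The `Contig` fields, the `split_transcripts` function, and the threshold values are all hypothetical, and the `>=` comparison is an assumption about how "exceeding" is applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contig:
    name: str
    # Hypothetical scores for the best database hit; None means no hit at all.
    identity: Optional[float] = None
    transcript_match: Optional[float] = None


def split_transcripts(contigs, min_identity, min_transcript_match):
    """Partition contigs into (identified, novel) lists by threshold."""
    identified, novel = [], []
    for contig in contigs:
        has_match = (
            contig.identity is not None
            and contig.transcript_match is not None
            and contig.identity >= min_identity
            and contig.transcript_match >= min_transcript_match
        )
        (identified if has_match else novel).append(contig)
    return identified, novel


# Placeholder data and thresholds, chosen only to exercise both outcomes.
contigs = [
    Contig("contig_1", identity=99.5, transcript_match=99.2),
    Contig("contig_2", identity=80.0, transcript_match=40.0),
    Contig("contig_3"),  # no database hit
]
identified, novel = split_transcripts(
    contigs, min_identity=90.0, min_transcript_match=50.0
)
print("Identified:", [c.name for c in identified])
print("Novel:", [c.name for c in novel])
```

In this toy example, a contig with a weak hit and a contig with no hit both land in the novel list, which mirrors the description above of transcripts that lack a qualifying database match.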

  1. Launch SeqMan Pro and drag and drop the result file Transcriptome.Transcriptome from your file explorer onto the SeqMan Pro window.
  1. Observe that the ensuing All Transcripts window contains two tabs. Each tab’s heading shows the total number of transcripts in the table, and the number currently selected. The tables in the two tabs support a wide variety of sortable columns which can be displayed or hidden, as desired.

The Identified Transcripts tab is active, by default. You should see over 1300 Total Identified Transcripts. Since you haven’t yet made any selections, the number of Selected Identified Transcripts is zero.

  1. Click on the Novel Transcripts tab. You should see approximately 50-70 Total Novel Transcripts. This table lists the assembled contigs that had no match to the Transcript Annotation Database meeting the search criteria thresholds and were therefore not labeled with any match information. Note that this table contains only three columns.

  1. Return to the Identified Transcripts tab and experiment with the following:

    • To show or hide columns - Right-click and choose Show/Hide Column, then check or uncheck boxes. Each column is described in detail in the SeqMan Pro help.

    • To move a column - Use the mouse to drag and drop it in the desired location.

    • To sort data in alphabetical or numerical order - Click the header of the column you wish to sort by. Note that the resulting groups are also shown in different colors to help visually differentiate between them.

    • To open individual contigs in SeqMan Pro for visualization and editing - Double-click on a row of interest to navigate to the corresponding contig assembly. The appropriate .sqd file is loaded with the Alignment view of the selected contig displayed. All the usual visualization and editing tools in SeqMan Pro are available.

    • To set stringency thresholds - Return to the Identified Transcripts tab and choose View > Sort. Set up the dialog to match the image below. To add the second row, click on the plus sign in the original row. Press Sort Now.

    • To save the sub-set of transcripts that met the stringency thresholds:

      1. In the Identified Transcripts table, click the %Transcript Match header to sort the column in decreasing order.

      2. Select all rows where Transcript Match is ≥ 99.00, noting that the Identity is also ≥ 99 for those rows. In the header, you will see approximately 130-160 selected transcripts.

      3. Choose File > Save Selected Transcripts for. In the Save dialog, designate a name and location for the output file, and then click Save. Two files will be created, each with a different extension (.fas and .searchresults).

      4. (optional) To see what is contained in each of these file types, open the files in any suitable text editor.
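
As an alternative to opening the saved sequence file in a text editor, a short script can summarize it. This is a minimal sketch that assumes the .fas file is plain FASTA text (header lines starting with ">" followed by sequence lines). That is a common convention for this extension but is not confirmed here. The function names `read_fasta` and `summarize` are hypothetical.

```python
def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA-formatted text file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def summarize(path):
    """Return the sequence count, total length, and longest sequence length."""
    lengths = [len(sequence) for _, sequence in read_fasta(path)]
    return {
        "sequences": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
    }
```

Calling `summarize` with the path of the saved .fas file returns a small dictionary. If the assumption about the format holds, the `sequences` value should equal the number of transcripts that were selected when the file was saved.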

This marks the end of this tutorial.
