What Can We Learn from Gene Homology Analysis?

By Prajkta Chivte, Ph.D., DNASTAR Technical Sales Scientist

May 23, 2024 | Molecular Biology

Prajkta Chivte recently completed her doctoral studies in Biochemistry, where she established novel biomarkers for diagnosing COVID-19 using mass spectrometry. As a Technical Sales Scientist at DNASTAR, she works directly with a diverse range of domestic and international customers to understand their research needs and guide them to the appropriate Lasergene software workflows.

This blog post answers some common questions about gene homology that I receive in my role as a Technical Sales Scientist. Whether you are a curious undergraduate, a lab-based scientist, or simply fascinated by the process of evolution, I hope you’ll learn something new about the cornerstone topic of homology.

In the first part of this blog post, I’ll answer some basic questions related to gene homology. Next, I’ll show how easy it is to use Lasergene software to reveal evolutionary connections between genomes. Finally, I’ll share a real-world application of this workflow that was pioneered by DNASTAR scientists and one of our customers at the University of Groningen.

Can you give a brief history of homology analysis?

The concept of phenotypical homology was first introduced by Richard Owen (1804-1892) when describing homologous structures between species. Once Charles Darwin (1809-1882) published his works on evolutionary theory, these homologous structures were reinterpreted as showing derivation from a common ancestral structure. As we know now, similar structures may indeed point to a common ancestor, but they can also arise independently in unrelated species through the process of convergent evolution, in which case they are called analogous rather than homologous.

With the advent of nucleotide and protein sequencing technologies and the development of specialized bioinformatics software, it is now possible to go beyond phenotypes and instead determine whether organisms share a common ancestor by comparing their DNA, RNA, and protein sequences. This type of analysis is called “sequence homology” or “gene homology.”

What’s the difference between sequence similarity (or identity) and sequence homology?

The link between sequence homology and sequence similarity (or identity) is often misunderstood. Simply put, sequence similarity indicates the percentage of similar residues between two sequences. Sequence similarity is a quantitative parameter, so we can say that two sequences “share 55% similarity.”

By contrast, sequence homology is an inference drawn from the results of sequence similarity, and always involves a qualitative statement. Sequences are either homologous or nonhomologous. An analogy is that a person can either be pregnant or not pregnant; they can’t be 55% pregnant. Because sequence homology is qualitative, it is not possible to calculate a “percent homology” for a pair of sequences. They can have a “percent similarity,” but they either do, or do not, share a common ancestor. Genes that share a common evolutionary origin are referred to as homologs, which are further categorized into three classes: orthologs (separated by speciation), paralogs (separated by duplication), and xenologs (obtained by horizontal gene transfer).

One caveat that’s important to mention is that the determination of homology can be somewhat arbitrary, as it depends on the (human-determined) settings for what counts as a sequence match. That can make definitively stating that two sequences share a common ancestor problematic; the answer may change as thresholds are made more or less stringent.
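
To make the distinction concrete, here is a minimal Python sketch. It is not how Lasergene scores matches; it simply computes percent identity (a stricter cousin of percent similarity, since it ignores conservative substitutions) for two pre-aligned toy sequences, then shows how a yes/no match call flips as the cutoff changes. The function names, the sequences, and the 60% and 80% cutoffs are all illustrative assumptions.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared alignment columns that hold identical residues.

    Columns where both sequences have a gap ('-') are skipped; a gap
    opposite a residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def passes_threshold(identity: float, threshold: float) -> bool:
    """Qualitative call derived from a quantitative score (hypothetical rule)."""
    return identity >= threshold


# Toy, pre-aligned protein fragments (made up for illustration).
seq_a = "MKTAYIAKQR"
seq_b = "MKSAYLAKQ-"

identity = percent_identity(seq_a, seq_b)
lenient_call = passes_threshold(identity, 60.0)   # assumed lenient cutoff
strict_call = passes_threshold(identity, 80.0)    # assumed strict cutoff
print(f"{identity:.1f}% identity; lenient: {lenient_call}; strict: {strict_call}")
```

The score itself never changes, but the same pair is called a match under the lenient cutoff and a non-match under the strict one, which is exactly why threshold choices deserve attention.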

What knowledge can be gained from sequence homology analysis?

Sequence homology not only helps in performing phylogenetic analysis and understanding evolutionary relationships, but also assists in inferring or predicting the functions of genes and in unveiling insights into various genetic diseases. Recently, this knowledge has been applied with great success in drug discovery pipelines. Because the results are multi-dimensional, researchers from diverse fields can benefit from gene homology analysis:

– Biotechnologists and pharmacists can use homology analysis to further drug discovery and protein engineering, as well as to identify therapeutic targets.

– Evolutionary biologists can trace evolutionary relationships between species.

– Molecular biologists can discover structural similarities between the genes of closely related species.

– Microbiologists and virologists can examine pathogenicity and virulence by studying the homologs within a certain family or genus.

– Environmental scientists and ecologists can evaluate the genetic diversity and population structure of a given ecosystem.

– Anthropologists and archaeologists can determine human migration patterns.

As you can see, gene homology has become an essential workflow for researchers across a wide range of biological, environmental, medical, and social sciences fields.

What are the steps involved in doing this workflow in Lasergene?

The ability to do gene homology analysis was added in Lasergene 17.6, which was released in 2024.

This workflow supports phylogenetic analysis of species too distantly related to be compared using their nucleotide sequences. Instead, the workflow uses annotated genome sequences to extract and compare the gene sets from each organism at the amino acid level. The protein sequences of homologous genes that are present in all the genomes, as determined by a set of user-defined sequence matching criteria, are then used to construct a single concatenated sequence for each test genome. Those concatenated sequences are aligned by a multiple sequence alignment (MSA) algorithm, such as MAFFT. Finally, the MSA is used as input for either of two phylogenetic tree building algorithms, RAxML or Neighbor Joining (NJ).
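
The concatenation step can be pictured with a short Python sketch. This is only an illustration of the idea, not the MegAlign Pro implementation: it assumes homolog detection has already grouped proteins under shared labels, and the genome names, gene labels, and sequences are invented.

```python
from typing import Dict


def concatenate_shared_genes(genomes: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Build one concatenated protein sequence per genome.

    `genomes` maps a genome name to {homolog_label: protein_sequence}.
    Only labels found in every genome are used, and they are joined in
    the same (sorted) order so the concatenations stay comparable.
    """
    if not genomes:
        return {}
    shared = set.intersection(*(set(genes) for genes in genomes.values()))
    order = sorted(shared)
    return {
        name: "".join(genes[label] for label in order)
        for name, genes in genomes.items()
    }


# Hypothetical toy input: three genomes with partly overlapping gene sets.
genomes = {
    "genome_A": {"dnaA": "MSLE", "rpoB": "MVYS", "tetM": "MKII"},
    "genome_B": {"dnaA": "MSLD", "rpoB": "MVYT"},
    "genome_C": {"rpoB": "MAYS", "dnaA": "MTLE", "gyrA": "MSDN"},
}

concatenated = concatenate_shared_genes(genomes)
for name, seq in concatenated.items():
    print(name, seq)
```

Here only dnaA and rpoB occur in all three genomes, so tetM and gyrA are left out of the concatenations. In the real workflow, the resulting sequences would then go to an MSA algorithm and a tree builder.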

The fully automated workflow is accessed through a project setup wizard, and takes only a minute or two to set up:

Step 1: In MegAlign Pro, use Align > Align by Gene Homology to launch the wizard at the Reference Sequence screen (Figure 1).

Figure 1: The first step in setting up a gene homology alignment is to add a reference sequence.

Use the buttons on the right to add an annotated reference sequence from your computer or from the DNASTAR Cloud Data Drive; or to download a genome template from the DNASTAR website or from NCBI’s Entrez database. Then click Next.

Step 2: In the Input Sequences wizard screen (Figure 2), add the annotated sequences that you wish to compare to the reference.

Figure 2: Sample sequences are added in this wizard screen and can also be grouped into replicate sets.

When finished, click Next.

(Note that if your starting point is unannotated sequences, you will first need to run them through an annotation pipeline such as NCBI’s Prokaryotic Genome Annotation Pipeline (PGAP). If you only have raw unassembled data, you will need to assemble it first in SeqMan NGen and then annotate the final consensus sequence.)

Step 3: In the Analysis Options screen (Figure 3), customize options related to the homology defining criteria and which MSA and tree building algorithms to use, if desired.

Figure 3: This screen allows you to change thresholds for homolog identification and choose your preferred sequence alignment and tree building algorithms.

Step 4: Click Next to proceed to a screen where you can name the project. Then click Next again to choose whether to run the alignment on your local computer or on the cloud.

After successfully completing the alignment, MegAlign Pro generates the usual distance table and aligned sequences view, but also creates a phylogenetic tree and a homologs view (Figure 4). The homologs view contains two customizable tables with a summary of all the identified homologs (or “unique to reference”) along with their % coverage and % similarity. These last two statistics are valuable in assessing the level of homology.

Figure 4: The results of a gene homology alignment, including the Sequences view, Homolog table, and phylogenetic tree.
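
For readers curious about what a distance table represents, the following Python sketch computes a simple pairwise p-distance (the fraction of differing positions) from already-aligned sequences. The metric MegAlign Pro actually reports is not specified in this post, so p-distance is an assumption chosen for simplicity, and the sequences are toy data.

```python
from itertools import combinations
from typing import Dict, Tuple


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions, skipping columns with a gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    return differences / compared if compared else 0.0


def distance_table(alignment: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Pairwise p-distances for every pair of named, aligned sequences."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


# Toy aligned sequences (invented for illustration).
alignment = {
    "genome_A": "MSLEMVYS",
    "genome_B": "MSLDMVYT",
    "genome_C": "MTLEMAYS",
}

table = distance_table(alignment)
for pair, dist in table.items():
    print(pair, dist)
```

Smaller values indicate more similar sequences. Distance-based tree builders such as Neighbor Joining start from a table of this general kind.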

Can you describe a use case for this workflow?

Yes! Our new gene homology analysis workflow was used to solve a medical mystery even before the workflow was formally released in Lasergene 17.6. This innovative research was recently described in Frontiers in Cellular and Infection Microbiology. Click here to read the full article online.

Here’s a brief summary of the situation and the solution:

A lung transplant patient in Europe had an infection, but attempts to culture the causative agent from patient samples failed, although 16S rRNA PCR amplification and Sanger sequencing indicated the presence of an unknown Mycoplasma faucium bacterium. To characterize this unknown unculturable pathogen, Artur J. Sabat and his team of researchers in the Netherlands and Germany turned to shotgun metagenomics using Oxford Nanopore Technologies (ONT) to sequence the cellular component of a pus sample from the patient.

They teamed up with DNASTAR scientist Tim Durfee and DNASTAR software developer Schuyler Baldwin to first remove sequence reads from the patient’s genome by aligning the ONT data to the human genome reference sequence using DNASTAR’s SeqMan NGen. The unaligned non-human reads were then assembled using the long read de novo assembly workflow in SeqMan NGen, resulting in a single circularly permuted contig of the expected chromosome length (~800 kb). The consensus sequence was polished with Illumina paired-end data obtained from the same pus sample using a separate automated workflow in SeqMan NGen. Manual editing was carried out in SeqMan Ultra to determine the final sequence, which was annotated using PGAP.

With the complete annotated genome sequence of the bacterium in hand, the team sought to use gene homology-based phylogeny to confirm the 16S rRNA placement of this new M. faucium organism in the Mycoplasma clade and also to identify gene sets unique to this pathogen and its subgroup. The MegAlign Pro workflow described above was used to identify gene homologs among 42 Mycoplasma species. The phylogenetic trees generated confirmed the 16S rRNA results and allowed the identification of differentially distributed genes that could lead to a better understanding of how M. faucium and closely associated Mycoplasma species cause disease. For example, the team identified three new mobile genetic elements in M. faucium, a previously unknown resistance of the bacterium to tetracyclines acquired through horizontal gene transfer, and other virulence factors and defense mechanisms of these pathogens.

This collaboration is a great example of the quick and efficient use of bioinformatics applications to address challenges in the medical field. According to the authors, “this study represents the first-ever acquisition of a complete circularized bacterial genome directly from a patient sample obtained from invasive infection of a primary sterile site using culture-independent, PCR-free clinical metagenomics.” This approach could also open doors for analyzing other complex or unknown pathogens.

Where can I learn more about the gene homology alignment workflow?

I highly recommend watching our webinar, Gene Homology at Scale, where my colleague, Matt Keyser, does a live demonstration of this workflow and discusses how to interpret the results.
