
DNA to Protein Conversion Software Free Download

1 Introduction

We now have the complete genome sequences of many organisms, including humans, which act as reference datasets for other genome-wide studies. For example, ChIP-seq studies uncover genomic regions bound by particular proteins, whereas genome sequencing efforts are identifying DNA sequence variants associated with disease. Following alignment to the genome sequence, both of these approaches yield lists of genomic region coordinates. However, obtaining the underlying nucleotide sequences and their protein coding potential is not trivial. Similarly, given a list of protein regions, a biologist may need the corresponding genomic locations coding these protein regions in order to understand their genomic context. As the technology in genomics and proteomics advances rapidly, more and more molecular biological studies will need the interconversion of genomic loci and protein regions. Here we develop a new package, geno2proteo, to address this issue.

Currently, finding the protein sequence of a coding region can be done by using two web sites, the UCSC Genome Browser [1] and Ensembl [2]. However, their capabilities for finding protein sequences of coding genomic regions are restricted in several ways. A manual procedure has to be followed to find the protein sequence for a single genomic locus, and it must be repeated for every additional genomic locus, which becomes very time-consuming if the user has many genomic locations to work on. One can obtain the whole protein sequence encoded by a protein-coding transcript from the Ensembl web site. However, it is not straightforward to obtain the amino acid sequence of an arbitrary genomic coding region from the Ensembl web site or database. The recently developed R Bioconductor package ensembldb [3] has the functionality for performing mapping between genomic coordinates and protein coordinates.

Note that other software is available for two related bioinformatic tasks, namely obtaining the DNA sequences of any genomic regions and the amino acid sequences of any protein regions. The R package BSgenome [4], the toolkit BEDTools [5], and the online tool the UCSC Table Browser [6] can be used for obtaining the DNA sequences of any genomic regions. The web site UniProt [7] can be used for obtaining the protein sequences of any protein regions. The Python package Biopython [8] can perform both tasks. Another related piece of software is BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi), which finds regions of DNA or protein sequences that are significantly similar to the given sequences. It can also search protein sequences using a nucleotide sequence and vice versa. However, BLAST addresses a very different problem from the one solved by ensembldb and the package developed here, geno2proteo. The input of BLAST is a DNA or protein sequence itself, and BLAST tries to find all the sequences in a genome or proteome database which are significantly similar to the input sequence. In contrast, the input of both geno2proteo and ensembldb is a genomic or protein region specified by its coordinates, and the output is the DNA and protein sequences which are an exact match for the input region.

The R package geno2proteo presented in this paper implements the two-way mapping between genome and proteome; namely, given a genome and the gene annotations, it finds the amino acid sequences coded by any given genomic regions and finds the genomic regions coding for any given protein regions. Moreover, geno2proteo performs these tasks in batch mode: it generates an output file of the genomic coordinates or protein sequences of any number of genomic or protein regions from a single input file containing a list of genomic or protein regions. As a by-product, the R package geno2proteo also provides functions for two more tasks, namely obtaining the DNA sequences of any genomic regions and the amino acid sequences of any protein regions. An additional deliverable of our research was the creation of a web service based on the R package, which allows users who are not familiar with R programming to perform the four genomic and proteomic tasks by going to the website http://sharrocksresources.manchester.ac.uk/tofigaps and using the online tool.

A summary of the comparison of our geno2proteo package with other public software or web services on performing the four genomic and proteomic tasks is presented in Table 1. The table also compares the range of species and strains that those tools can process.

Table 1:

Comparing geno2proteo with other public software on the four tasks. Tasks are shown in the first four column headings (bold) whereas the table content indicates the capability of each of the indicated software packages on each task.

| Tool | DNA sequences of genomic regions | Protein sequences of genomic regions | Genomic loci of protein regions | Protein sequences of protein regions | Species | Is it a web service |
|---|---|---|---|---|---|---|
| BSgenome [4], BEDTools [5] | A list of any genomic regions | | | | Any user-defined | No |
| UCSC Table Browser [6] | A list of any genomic regions | | | | Only database species | Yes |
| UniProt [7] | | | | A list of protein IDs | Only database species | Yes |
| Biopython [8] | A list of any genomic regions | | | A list of any protein regions | Any user-defined | No |
| ensembldb [3] | A list of any genomic regions | A list of any genomic regions | A list of any protein regions | A list of any protein regions | Only database species | No |
| geno2proteo | A list of any genomic regions | A list of any genomic regions | A list of any protein regions | A list of any protein regions | Any user-defined | No |
| ToFiGAPS | A list of any genomic regions | A list of any genomic regions | A list of any protein regions | A list of any protein regions | Human and mouse | Yes |

2 Implementation

2.1 The R package geno2proteo

Given a specific genome and one version of its gene annotations, the R package geno2proteo exploits the exonic structure of the protein-coding transcripts contained in the gene annotations for the genome and proteome mapping tasks. It needs three external data files:

  1. DNA sequences: a text file in FASTA format containing the DNA sequence of the genome that the user wants to use for analysing the data. It can be compressed by GNU Zip.

  2. Gene annotations: a text file in GTF format containing the gene annotations of the same version of the genome of the same species as that of the DNA sequence file described above. It can also be compressed by GNU Zip.

  3. Genetic coding table: a text file containing a genetic coding table of the codons and the amino acids coded by those codons.

The first two data files correspond to the specific genome and its gene annotations which one wants to use. The third file is used for the translation from DNA sequences to protein sequences. A file for the standard genetic coding scheme is provided in the package. The DNA sequence and gene annotation files can be downloaded from the Ensembl web site [9] and can be used directly by the functions in the package. Alternatively, users can create their own data files, as long as they are in the file formats required by the package, which will be useful if the user needs data files that are not available from public databases like Ensembl. The exact file formats of those data files and how to download or create them are described in detail in the package's user guide (Supplementary file 1).
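The role of the genetic coding table can be illustrated with a short sketch. The Python below is not part of geno2proteo (which is an R package) and does not read the package's own table file; it simply builds the standard genetic code as a dictionary and translates a coding DNA string codon by codon. The names `STANDARD_CODE` and `translate` are hypothetical and chosen only for this illustration.

```python
from itertools import product

# Standard genetic code, with the three codon positions enumerated in TCAG
# order (first base varies slowest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, table=STANDARD_CODE):
    """Translate a coding DNA sequence, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    # Codons containing characters outside the table (e.g. N) become "X".
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATGAGACCCTAA"))
```

A different coding scheme could be supported in the same way by passing another codon-to-amino-acid dictionary as `table`.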

The geno2proteo package workflow starts by generating an internal data file, containing all the coding regions and their protein sequences, by using the three external reference data files and the R function generatingCDSaaFile provided in this package (Figure 1). Note that this internal file needs to be generated only once for one specific genome and one version of its gene annotations; it is then used by all the genome and proteome mapping tasks for the genome with this version of the gene annotations. The package has four main functions for the four tasks of obtaining the DNA and amino acid sequences of any list of genomic or protein regions, as depicted in the lower part of Figure 1, which are:

Figure 1: A workflow depiction of all the functions and their inputs, outputs and relations in the R package geno2proteo.

  1. genomicLocsToProteinSequence takes a list of genomic regions as the input and finds the amino acid sequences and the DNA sequences of the coding regions within those genomic regions.

  2. genomicLocsToWholeDNASequence takes a list of genomic regions and finds the whole DNA sequences of those genomic regions.

  3. proteinLocsToGenomic finds the genomic regions coding for a list of protein regions.

  4. proteinLocsToProteinSeq finds the amino acid sequences themselves for a list of regions in proteins specified by the coordinates of the regions along the proteins.

A more detailed explanation of all the functions in the package and how to use them is given in the package's user guide (Supplementary file 1). The geno2proteo package is available from the CRAN repository, https://cran.r-project.org/package=geno2proteo.
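The coordinate arithmetic behind mapping a protein region back to the genome can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: coordinates are taken to be 1-based and inclusive, the coding parts of the exons are supplied as a plain list, and the function name `protein_region_to_genomic` is hypothetical. Amino acid positions are converted to offsets along the spliced coding sequence, which are then walked across the exons in the direction of transcription.

```python
def protein_region_to_genomic(cds_exons, strand, aa_start, aa_end):
    """Map a 1-based inclusive amino acid range to genomic intervals.

    cds_exons: list of (start, end) genomic coordinates (1-based, inclusive)
               covering only the coding part of each exon.
    strand:    "+" or "-".
    Returns a list of (start, end) genomic intervals in transcription order.
    """
    if aa_start < 1 or aa_end < aa_start:
        raise ValueError("invalid protein region")
    # Order exons in the direction of transcription.
    exons = sorted(cds_exons, reverse=(strand == "-"))
    # Offsets along the spliced coding sequence (0-based, inclusive).
    first = 3 * (aa_start - 1)
    last = 3 * aa_end - 1
    if last >= sum(end - start + 1 for start, end in exons):
        raise ValueError("protein region extends beyond the coding sequence")

    pieces = []
    offset = 0  # coding-sequence offset of the first base of the current exon
    for start, end in exons:
        length = end - start + 1
        lo = max(first, offset)
        hi = min(last, offset + length - 1)
        if lo <= hi:  # the requested region overlaps this exon
            if strand == "+":
                pieces.append((start + lo - offset, start + hi - offset))
            else:
                pieces.append((end - (hi - offset), end - (lo - offset)))
        offset += length
    return pieces


# Two coding exons of 10 and 20 bases, i.e. a 10-residue toy protein.
toy_exons = [(101, 110), (201, 220)]
print(protein_region_to_genomic(toy_exons, "+", 3, 5))
```

In the toy example, residues 3 to 5 span the exon boundary, so the result consists of two genomic intervals rather than one.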

2.2 The online tool ToFiGAPS

Based on the R package geno2proteo, we created a web service to give users a more direct and simple way to use the functions provided in the R package. The user interface of the tool is a single web page, as shown in Figure 2. To use the web service, simply go to the web page:

Figure 2: The main webpage of the online tool ToFiGAPS. This web interface performs the four genomic and proteomic mapping tasks. The web page address is http://sharrocksresources.manchester.ac.uk/tofigaps. The results box shows the DNA sequences retrieved from the selected human genome GRCh37 for the four genomic regions in the input box after clicking the button "Submit", as the task "Find the DNA sequences of any genomic regions" was selected.

http://sharrocksresources.manchester.ac.uk/tofigaps

and follow the three simple steps:

  1. Step 1: Choose one of the four tasks which one wants to perform.

  2. Step 2: Choose a species genome and a gene annotation version.

  3. Step 3: Input a list of genomic regions or proteomic regions.

Then click the Submit button, and after a short wait the results will be shown in the results box in the bottom part of the web page. A detailed explanation of how to use the tool and of the formats of the input and output is given in the User's Guide of the online tool, which can be accessed from the tool's main web page. Currently two species, human and mouse, with two versions of the genome for each of them, are available on the web site. Users who want to use other versions of the human or mouse genome, or the genome of any other species, will have to use the R package geno2proteo, which provides the functions to deal with any species (for details see the section about the R package geno2proteo).

3 Application

Protein modification with small ubiquitin-like modifier (SUMO) plays an important regulatory role in the activities of hundreds of proteins associated with various biological functions [10]. For example, it can enhance the repressive activities of transcriptional regulators and does so by a myriad of mechanisms, including enhancing co-repressor recruitment [10]. To further study how SUMO might impact on gene regulation, we generated SUMO2/3 ChIP-seq data from the MCF10A cell line to determine the genome-wide SUMO2/3 binding sites in these cells. MCF10A cells were treated with 1.8 ng/ml epidermal growth factor (EGF) for 30 min and one sample was generated using an anti-SUMO2 antibody (Life Technologies) according to a previously described protocol [11]. The reads were aligned to the human genome hg19 using the software Bowtie 2 [12], and then 28,663 SUMO-associated regions (i.e. peaks) were identified from the aligned reads using the software MACS2 [13]. The ChIP-seq data are publicly available from ArrayExpress with the accession number E-MTAB-7759. Upon visual inspection of the data, we noticed that a large number of SUMO peaks were found close to transcriptional termination sites. We therefore selected all SUMO-associated regions located within +/−2 kb of a transcriptional termination site (n = 329) and performed motif enrichment analysis using the software Homer [14] to identify potential common binding motifs that might hint at a particular DNA binding protein. We found that the two most enriched novel DNA motifs are similar to the binding motifs of SOX18 and RBPJ1, respectively (Figure 3A). Note that the motifs shown in Figure 3A are not the binding motifs of SOX18 and RBPJ1 themselves, which are shown in Figure 3B. Instead they are the two de novo motifs that Homer uncovered from the 329 selected SUMO peaks, to which the SOX18 and RBPJ1 motifs are the most similar known motifs according to Homer. We found 44 regions where the matching sites of the two motifs are close to each other and have the "SOX18 motif" upstream of the "RBPJ1 motif" with a 2 bp gap between them (the genomic coordinates of those 44 regions in hg19 are in Supplementary file 2). We also found that all of these 44 regions are within the protein coding regions of genes encoding zinc finger proteins. We therefore asked whether there was an underlying DNA sequence motif or whether this was an indirect effect of a highly conserved amino acid sequence giving rise to nucleotide sequence conservation due to the underlying common codon usage. We therefore needed the protein sequences of these multiple genomic regions, which was the original motivation for us to create the software presented in this paper.
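The selection of peaks lying within 2 kb of a transcriptional termination site (TTS) can be sketched in a few lines of Python. This is only an illustration of the interval logic, not the procedure actually used for the analysis: it assumes that a peak qualifies when any part of it falls inside the window around a TTS, that coordinates are 1-based and inclusive, and that the TTS positions are already sorted per chromosome. The function name `peaks_near_tts` and the toy data are hypothetical.

```python
from bisect import bisect_left


def peaks_near_tts(peaks, tts_by_chrom, window=2000):
    """Return peaks overlapping [tts - window, tts + window] for any TTS.

    peaks:        iterable of (chrom, start, end), 1-based inclusive.
    tts_by_chrom: dict mapping chrom -> sorted list of TTS positions.
    """
    selected = []
    for chrom, start, end in peaks:
        sites = tts_by_chrom.get(chrom, [])
        # First TTS at or after (start - window). Sites before it lie more
        # than `window` bases to the left of the peak, so only this one
        # needs to be checked against the right-hand limit.
        i = bisect_left(sites, start - window)
        if i < len(sites) and sites[i] <= end + window:
            selected.append((chrom, start, end))
    return selected


# Toy data, for illustration only.
toy_tts = {"chr1": [10000, 50000]}
toy_peaks = [
    ("chr1", 11500, 11800),
    ("chr1", 20000, 20300),
    ("chr1", 7900, 8100),
    ("chr2", 100, 200),
]
print(peaks_near_tts(toy_peaks, toy_tts))
```

Using a binary search keeps the check fast even when tens of thousands of peaks are compared against every annotated TTS.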

Figure 3: Motif analysis of the SUMO binding regions. (A) Two de novo motifs uncovered by motif discovery using HOMER [14] in SUMO2/3 ChIP-seq data, visualised by using WebLogo [15]. The left motif is similar to SOX18's binding motif, and the right motif is similar to RBPJ1's binding motif. (B) SOX18 and RBPJ1 binding motifs which were identified by HOMER as the known motifs being similar to the two de novo motifs in (A), created by Seq2Logo [16]. (C) The logo graphs of the DNA (top) and protein (bottom) sequences associated with the 44 "SOX18-2bp-RBPJ1" motif matching sites found in the SUMO ChIP-seq data. Two more nucleotides upstream and downstream of the matching sites were also included, because they belong to the same DNA translation frame as the first and last nucleotides within the matching sites according to the genes in which they are located.

After applying the function genomicLocsToProteinSequence of the R package to these 44 SUMO2/3 binding regions with the "SOX18-2bp-RBPJ1" composite motif, we obtained the protein sequences as well as the DNA sequences of the coding regions within these genomic sites. Figure 3C shows the logo graphs of the DNA sequences and protein sequences of all of the "SOX18-2bp-RBPJ1 motif" matching regions, generated using the online tools WebLogo [15] and Seq2Logo [16]. First note that the DNA motifs in Figure 3C are quite similar to the respective motifs in Figure 3A, but are more specific at some positions, because the former were obtained from a subset of the genomic regions from which the latter motifs were obtained. Comparing the DNA and protein sequences of the 44 SUMO2/3 binding regions in Figure 3C, it looks like the protein sequence is overall more conserved than the DNA sequence. However, at several positions the DNA sequence is more conserved. For example, the second amino acid in the protein sequence, Arginine (R), is coded by six codons in total according to the standard genetic coding scheme; Table 2 lists the (re-scaled) expected frequency of those six codons in the human genome [17] and the (re-scaled) observed frequency of the six codons at the 44 genomic sites associated with SUMO binding. Table 3 compares the expected frequency and the observed frequency of the four codons coding the 9th amino acid, Proline (P). In both cases, there is a large difference between the expected and observed frequencies of the codons coding the amino acid at one particular position in the 44 genomic sites. One specific codon appears in more than 90% of the 44 SUMO-associated sites and several other codons do not appear at all, while the expected frequency of all the codons coding the same amino acid is between 8 and 32%, indicating that the DNA sequences at those positions are more conserved than the respective amino acids. As a further test of conservation, we took advantage of the fact that the amino acid motif underlying the SUMO binding regions is repeated throughout the zinc finger regions of these proteins. We therefore compared the protein and DNA sequences of the surrounding N- and C-terminal sequence motif repeats and their codon usage bias. These adjacent motifs showed similar amino acid conservation (Figure 4B) but lower DNA sequence conservation (Figure 4D). This lack of DNA sequence conservation in the surrounding motifs is further emphasised by looking at the codon usage frequencies at diagnostic amino acid residues (Table 2 and Table 3). While clearly non-random, the highest usage of a codon was 62% rather than the 90% found in the SUMO binding regions. Together these results therefore suggest that both the DNA sequences underlying the SUMO binding regions and the encoded protein sequences may have functional relevance.
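One plausible reading of the "re-scaled" frequencies is that the usage of each codon is divided by the total usage of all codons coding the same amino acid, so that the synonymous codons sum to 100%. The Python sketch below assumes that interpretation. The per-thousand arginine values are illustrative inputs assumed for this sketch and are not quoted in the text; the observed percentages are copied from Table 2. The function name `rescale` is hypothetical.

```python
def rescale(usage):
    """Rescale codon usage values so that synonymous codons sum to 100%."""
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("usage values must sum to a positive number")
    return {codon: 100.0 * value / total for codon, value in usage.items()}


# Illustrative per-thousand usage of the six arginine codons (assumed input).
arginine_per_thousand = {
    "CGT": 4.5, "CGC": 10.4, "CGA": 6.2,
    "CGG": 11.4, "AGA": 12.2, "AGG": 12.0,
}
expected = rescale(arginine_per_thousand)

# Observed percentages at the 44 sites, copied from Table 2.
observed = {"CGT": 0.0, "CGC": 0.0, "CGA": 4.7,
            "CGG": 0.0, "AGA": 93.0, "AGG": 2.3}

for codon in expected:
    ratio = observed[codon] / expected[codon]
    print(f"{codon}: expected {expected[codon]:.1f}%, "
          f"observed {observed[codon]:.1f}%, ratio {ratio:.2f}")
```

The observed-to-expected ratio makes the bias explicit: a value well above 1 for a single codon, with the remaining codons near 0, is the pattern described for the SUMO-associated sites.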

Figure 4: Comparison of the sequence motifs in adjacent repeated protein regions. The amino acid and DNA sequences of the 44 selected motifs associated with SUMO binding (A and C) are compared to the equivalent motifs on the same proteins/genes, in the immediate N-terminal or C-terminal regions (B and D). These motifs have similar amino acid sequences (chosen by using the template HT[GC]EK[AP] to retrieve sequences immediately before or after the selected sites). The four sequence logos were generated using WebLogo [15].
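The template HT[GC]EK[AP] can be read as a regular expression in which each bracketed pair lists the residues allowed at that position. The sketch below shows how such a template might be used to locate matching sites along a protein sequence; it is an illustration only, the example sequence is made up, and the function name `find_template_sites` is hypothetical.

```python
import re

# The template from the Figure 4 caption, read as a regular expression:
# H, T, then G or C, then E, K, then A or P.
TEMPLATE = re.compile(r"HT[GC]EK[AP]")


def find_template_sites(protein):
    """Return (start, end, match) tuples, 1-based inclusive, for each hit."""
    return [(m.start() + 1, m.end(), m.group())
            for m in TEMPLATE.finditer(protein)]


# Made-up sequence containing two hits, for illustration only.
example = "MKCHTGEKPYACSHTCEKAFS"
print(find_template_sites(example))
```

The resulting protein coordinates are the kind of input that a protein-to-genome mapping step would then convert into genomic regions.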

Table 2:

The expected and observed frequencies of the six codons for the 2nd amino acid Arginine (R2) in the protein sequence in Figure 3C, and the observed frequencies of the six codons for R2 at the sites immediately before or after the 44 sites and with the similar amino acids as in Figure 4D.

| | CGT | CGC | CGA | CGG | AGA | AGG |
|---|---|---|---|---|---|---|
| Expected frequency | 7.9% | 18.3% | 10.9% | 20.1% | 21.5% | 21.2% |
| Observed frequency at the 44 sites | 0% | 0% | 4.7% | 0% | 93.0% | 2.3% |
| Observed frequency at the sites immediately before or after the 44 sites and with the similar amino acids | 0% | 0% | 15.3% | 3.4% | 69.4% | 11.9% |

Table 3:

The expected and observed frequencies of the four codons for the 9th amino acid Proline (P9) in the protein sequence in Figure 3C, and the observed frequencies of the four codons for P9 at the sites immediately before or after the 44 sites and with the similar amino acids as in Figure 4D.

| | CCA | CCC | CCG | CCT |
|---|---|---|---|---|
| Expected frequency | 27.7% | 32.4% | 11.3% | 28.6% |
| Observed frequency at the 44 sites | 0% | 93.0% | 7.0% | 0% |
| Observed frequency at the sites immediately before or after the 44 sites and with the similar amino acids | 10.2% | 62.3% | 2.9% | 24.6% |

4 Discussion

We created an R package, geno2proteo, a piece of software dedicated to mapping sequences from any genomic and protein coordinates to reference DNA and protein sequences. We also created an online tool to allow users to use the software directly from a web interface. We illustrate how the package and online tool can be used to interrogate the protein and DNA sequences associated with genomic regions recovered by a ChIP-seq experiment. Here, it was initially ambiguous whether the DNA sequence conservation found under the SUMO binding peaks was a consequence of strong conservation of a protein coding sequence or rather was indicative of an underlying DNA motif that potentially acts as a protein binding site. Our analysis suggested that the latter is a possibility that warrants further testing in the future. We hope that the software will be useful in other studies involving genomic and proteomic data.

Acknowledgements

We would like to thank our colleague Dr. David Gerrard for reviewing the software and giving many useful suggestions which we adopted to improve the software. This work was supported by the Wellcome Trust (103857/Z/14/Z).

Conflict of interest statement: Authors state no conflict of interest. All authors have read the journal's Publication ethics and publication malpractice statement available at the journal's website and hereby confirm that they comply with all its parts applicable to the present scientific work.

References

[1] Tyner C, Barber GP, Casper J, Clawson H, Diekhans M, Eisenhart C, et al. The UCSC Genome Browser database: 2017 update. Nucleic Acids Res 2017;45:D626–34.

[2] Aken BL, Achuthan P, Akanni W, Amode MR, Bernsdorff F, Bhai J, et al. Ensembl 2017. Nucleic Acids Res 2017;45:D635–42. DOI: 10.1093/nar/gkw1104.

[3] Rainer J, Gatto L, Weichenberger CX. ensembldb: an R package to create and use Ensembl-based annotation resources. Bioinformatics 2019. DOI: 10.1093/bioinformatics/btz031.

[4] Pagès H. BSgenome: Software infrastructure for efficient representation of full genomes and their SNPs. R package version 1.48.0, 2018. http://bioconductor.org/packages/Bsgenome/. Accessed on 10 May 2018.

[5] Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010;26:841–2. DOI: 10.1093/bioinformatics/btq033.

[6] Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res 2004;32(Database issue):D493–6. DOI: 10.1093/nar/gkh103.

[7] The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 2019;47:D506–15. DOI: 10.1093/nar/gky1049.

[8] Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009;25:1422–3. DOI: 10.1093/bioinformatics/btp163.

[9] Aken BL, Ayling S, Barrell D, Clarke L, Curwen V, Fairley S, et al. The Ensembl gene annotation system. Database 2016;2016:1–19. DOI: 10.1093/database/baw093.

[10] Cubeñas-Potts C, Matunis MJ. SUMO: a multifaceted modifier of chromatin structure and function. Dev Cell 2013;24:1–12. DOI: 10.1016/j.devcel.2012.11.020.

[11] Aguilar-Martinez E, Chen X, Webber A, Mould AP, Seifert A, Hay RT, et al. Screen for multi-SUMO-binding proteins reveals a multi-SIM-binding mechanism for recruitment of the transcriptional regulator ZMYM2 to chromatin. Proc Natl Acad Sci USA 2015;112:E4854–63. DOI: 10.1073/pnas.1509716112.

[12] Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9:357–9. DOI: 10.1038/nmeth.1923.

[13] Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol 2008;9:R137. DOI: 10.1186/gb-2008-9-9-r137.

[14] Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell 2010;38:576–89. DOI: 10.1016/j.molcel.2010.05.004.

[15] Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res 2004;14:1188–90. DOI: 10.1101/gr.849004.

[16] Thomsen MCF, Nielsen M. Seq2Logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion. Nucleic Acids Res 2012;40:W281–7. DOI: 10.1093/nar/gks469.

[17] Nakamura Y, Gojobori T, Ikemura T. Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res 2000;28:292. DOI: 10.1093/nar/28.1.292.

Supplementary Material

The online version of this article offers supplementary material (DOI: https://doi.org/10.1515/jib-2018-0090).

Source: https://www.degruyter.com/document/doi/10.1515/jib-2018-0090/html

Posted by: durancomereces.blogspot.com