cartVersion cartVersion cartVersion cartVersion 0 0 0 0 0 0 0 0 0 0 0 cartVersion cartVersion cartVersion 0 cartVersion 0 cpgIslandExt CpG Islands bed 4 + CpG Islands (Islands < 300 Bases are Light Green) 3 1 0 100 0 128 228 128 0 0 0
CpG islands are associated with genes, particularly housekeeping\ genes, in vertebrates. CpG islands are typically common near\ transcription start sites and may be associated with promoter\ regions. Normally a C (cytosine) base followed immediately by a \ G (guanine) base (a CpG) is rare in\ vertebrate DNA because the Cs in such an arrangement tend to be\ methylated. This methylation helps distinguish the newly synthesized\ DNA strand from the parent strand, which aids in the final stages of\ DNA proofreading after duplication. However, over evolutionary time,\ methylated Cs tend to turn into Ts because of spontaneous\ deamination. The result is that CpGs are relatively rare unless\ there is selective pressure to keep them or a region is not methylated\ for some other reason, perhaps having to do with the regulation of gene\ expression. CpG islands are regions where CpGs are present at\ significantly higher levels than is typical for the genome as a whole.
\ \\ The unmasked version of the track displays potential CpG islands\ that exist in repeat regions and would otherwise not be visible\ in the repeat masked version.\
\ \\ By default, only the masked version of the track is displayed. To view the\ unmasked version, change the visibility settings in the track controls at\ the top of this page.\
\ \CpG islands were predicted by searching the sequence one base at a\ time, scoring each dinucleotide (+17 for CG and -1 for others) and\ identifying maximally scoring segments. Each segment was then\ evaluated for the following criteria:\ \
\ The entire genome sequence, masking areas included, was\ used for the construction of the track Unmasked CpG.\ The track CpG Islands is constructed on the sequence after\ all masked sequence is removed.\
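The dinucleotide scoring described above (+17 for CG, -1 for anything else) can be illustrated with a small sketch. This is not the cpg_lh source; it is a minimal maximum-scoring-run search written under the assumption that a candidate segment is dropped once its running score falls to zero or below. The function name `best_cpg_segment` and its parameters are hypothetical.

```python
def best_cpg_segment(seq, cg_score=17, other_score=-1):
    """Return (start, end, score) for the highest-scoring run of dinucleotides.

    start/end are 0-based, half-open base coordinates. Illustrative only:
    the real cpg_lh program may track and report segments differently.
    """
    seq = seq.upper()
    best = (0, 0, 0)
    run_start, run_score = 0, 0
    for i in range(len(seq) - 1):
        # Score the dinucleotide that starts at base i.
        run_score += cg_score if seq[i:i + 2] == "CG" else other_score
        if run_score <= 0:
            # Assumption: restart the candidate segment after this base.
            run_start, run_score = i + 1, 0
        elif run_score > best[2]:
            best = (run_start, i + 2, run_score)
    return best


segment = best_cpg_segment("ATATCGCGCGATAT")
```

For the toy sequence above, the best segment covers the `CGCGCG` stretch: three CG dinucleotides at +17 each, minus two intervening GC dinucleotides at -1 each.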
\ \The CpG count is the number of CG dinucleotides in the island. \ The Percentage CpG is the ratio of CpG nucleotide bases\ (twice the CpG count) to the length. The ratio of observed to expected \ CpG is calculated according to the formula (cited in \ Gardiner-Garden and Frommer (1987)):\ \
Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)\ \ where N = length of sequence.\
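As a worked example of the quantities defined above, the following sketch computes the CpG count, Percentage CpG, and Obs/Exp CpG for a sequence held in memory. The function name `cpg_stats` and the returned dictionary keys are hypothetical, and returning 0.0 when the sequence has no C or no G is an assumption made only to avoid division by zero.

```python
def cpg_stats(seq):
    """Compute CpG count, percentage CpG, and observed/expected CpG."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return {"cpg_num": 0, "per_cpg": 0.0, "obs_exp": 0.0}
    cpg_num = seq.count("CG")          # CG cannot overlap itself, so count() is exact
    num_c = seq.count("C")
    num_g = seq.count("G")
    per_cpg = 100.0 * 2 * cpg_num / n  # CpG bases are twice the CpG count
    # Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
    obs_exp = cpg_num * n / (num_c * num_g) if num_c and num_g else 0.0
    return {"cpg_num": cpg_num, "per_cpg": per_cpg, "obs_exp": obs_exp}


stats = cpg_stats("CGCGAT")
```

For `CGCGAT`, N = 6 with two CpGs, two Cs, and two Gs, so Obs/Exp CpG = 2 * 6 / (2 * 2) = 3.0.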
\ The calculation of the track data is performed by the following command sequence:\
\ twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \\\ | cpg_lh /dev/stdin 2> cpg_lh.err \\\ | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\\t%d\\t%s\\t%s %s\\t%s\\t%s\\t%0.0f\\t%0.1f\\t%s\\t%s\\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \\\ | sort -k1,1 -k2,2n > cpgIsland.bed\\ The unmasked track data is constructed the same way, except that\ twoBitToFa -noMask is run in place of the plain twoBitToFa command.\ \ \
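To make the awk step in the pipeline above easier to follow, here is a line-for-line Python rendering of the same transformation. The field positions are taken only from the awk field references ($1, $2, $3, $5, $6, $7, $9); the sample input line is invented for illustration, and its fourth and eighth fields are placeholders because the awk command never reads them. The function name `cpg_lh_to_bed` is hypothetical.

```python
def cpg_lh_to_bed(line):
    """Convert one whitespace-separated cpg_lh output line to a BED-style row."""
    f = line.split()
    chrom = f[0]
    start = int(f[1]) - 1            # awk: $2 = $2 - 1 (1-based to 0-based start)
    end = int(f[2])
    width = end - start              # awk: width = $3 - $2
    cpg_num = int(f[5])              # awk: $6
    per_gc = float(f[6])             # awk: $7
    columns = [
        chrom,
        str(start),
        str(end),
        f"{f[4]} {f[5]}",                        # name, built from $5 and $6
        str(width),                              # island length
        f[5],                                    # CpG count
        f"{width * per_gc * 0.01:.0f}",          # number of G+C bases
        f"{100.0 * 2 * cpg_num / width:.1f}",    # percentage CpG
        f[6],                                    # percentage GC, as reported
        f[8],                                    # obs/exp ratio, as reported ($9)
    ]
    return "\t".join(columns)


# Hypothetical input line; "x" and "y" stand in for the unused fields.
row = cpg_lh_to_bed("chr1 101 300 x CpG: 20 60.0 y 0.85")
```

The key detail is the coordinate shift: the start is decremented before the width is computed, so the output uses a 0-based, half-open interval.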
\ The CpG Islands track and its associated tables can be explored interactively using the\ REST API, the\ Table Browser or the\ Data Integrator.\ All the tables can also be queried directly from our public MySQL\ servers, with more information available on our\ help page as well as on\ our blog.
\\ The source for the cpg_lh program can be obtained from\ src/utils/cpgIslandExt/.\ The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file")\
\ \This track was generated using a modification of a program developed by G. Miklem and L. Hillier \ (unpublished).
\ \\ Gardiner-Garden M, Frommer M.\ \ CpG islands in vertebrate genomes.\ J Mol Biol. 1987 Jul 20;196(2):261-82.\ PMID: 3656447\
\ regulation 1 html cpgIslandSuper\ longLabel CpG Islands (Islands < 300 Bases are Light Green)\ parent cpgIslandSuper pack\ priority 1\ shortLabel CpG Islands\ track cpgIslandExt\ rmsk RepeatMasker rmsk Repeating Elements by RepeatMasker 1 1 0 0 0 127 127 127 1 0 0\ This track was created by using Arian Smit's\ RepeatMasker\ program, which screens DNA sequences\ for interspersed repeats and low complexity DNA sequences. The program\ outputs a detailed annotation of the repeats that are present in the\ query sequence (represented by this track), as well as a modified version\ of the query sequence in which all the annotated repeats have been masked\ (generally available on the\ Downloads page). RepeatMasker uses the\ Repbase Update library of repeats from the\ Genetic \ Information Research Institute (GIRI).\ Repbase Update is described in Jurka (2000) in the References section below.\ Some newer assemblies have been made with Dfam, not Repbase. You can\ find the details for how we make our database data here in our "makeDb/doc/"\ directory.
\ \\ In full display mode, this track displays up to ten different classes of repeats:\
\ The level of color shading in the graphical display reflects the amount of\ base mismatch, base deletion, and base insertion associated with a repeat\ element. The higher the combined number of these, the lighter the shading.\
\ \\ A "?" at the end of the "Family" or "Class" (for example, DNA?) signifies that\ the curator was unsure of the classification. At some point in the future,\ either the "?" will be removed or the classification will be changed.
\ \\ Data are generated using the RepeatMasker -s flag. Additional flags\ may be used for certain organisms. Repeats are soft-masked. Alignments may\ extend through repeats, but are not permitted to initiate in them.\ See the FAQ for more information.\
\ \\ Thanks to Arian Smit, Robert Hubley and GIRI for providing the tools and\ repeat libraries used to generate this track.\
\ \\ Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0.\ \ http://www.repeatmasker.org. 1996-2010.\
\ \\ Repbase Update is described in:\
\ \\ Jurka J.\ \ Repbase Update: a database and an electronic journal of repetitive elements.\ Trends Genet. 2000 Sep;16(9):418-420.\ PMID: 10973072\
\ \\ For a discussion of repeats in mammalian genomes, see:\
\ \\ Smit AF.\ \ Interspersed repeats and other mementos of transposable elements in mammalian genomes.\ Curr Opin Genet Dev. 1999 Dec;9(6):657-63.\ PMID: 10607616\
\ \\ Smit AF.\ \ The origin of interspersed repeats in the human genome.\ Curr Opin Genet Dev. 1996 Dec;6(6):743-8.\ PMID: 8994846\
\ varRep 0 canPack off\ group varRep\ longLabel Repeating Elements by RepeatMasker\ maxWindowToDraw 10000000\ priority 1\ shortLabel RepeatMasker\ spectrum on\ track rmsk\ type rmsk\ visibility dense\ unipAliSwissprot SwissProt Aln. bigPsl UCSC alignment of SwissProt proteins to genome (dark blue: main isoform, light blue: alternative isoforms) 3 1 0 0 0 127 127 127 0 0 0 genes 1 baseColorDefault genomicCodons\ baseColorTickColor contrastingColor\ baseColorUseCds given\ bigDataUrl /gbdb/aplCal1/uniprot/unipAliSwissprot.bb\ indelDoubleInsert on\ indelQueryInsert on\ itemRgb on\ labelFields name,acc,uniprotName,geneName,hgncSym,refSeq,refSeqProt,ensProt\ longLabel UCSC alignment of SwissProt proteins to genome (dark blue: main isoform, light blue: alternative isoforms)\ mouseOverField protFullNames\ parent uniprot\ priority 1\ searchIndex name,acc\ shortLabel SwissProt Aln.\ showDiffBasesAllScales on\ skipFields isMain\ track unipAliSwissprot\ type bigPsl\ urls acc="https://www.uniprot.org/uniprot/$$" hgncId="https://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$$" refSeq="https://www.ncbi.nlm.nih.gov/nuccore/$$" refSeqProt="https://www.ncbi.nlm.nih.gov/protein/$$" ncbiGene="https://www.ncbi.nlm.nih.gov/gene/$$" entrezGene="https://www.ncbi.nlm.nih.gov/gene/$$" ensGene="https://www.ensembl.org/Gene/Summary?g=$$"\ visibility pack\ unipAliTrembl TrEMBL Aln. 
bigPsl UCSC alignment of TrEMBL proteins to genome 0 2 0 0 0 127 127 127 0 0 0 genes 1 baseColorDefault genomicCodons\ baseColorTickColor contrastingColor\ baseColorUseCds given\ bigDataUrl /gbdb/aplCal1/uniprot/unipAliTrembl.bb\ indelDoubleInsert on\ indelQueryInsert on\ itemRgb on\ labelFields name,acc,uniprotName,geneName,hgncSym,refSeq,refSeqProt,ensProt\ longLabel UCSC alignment of TrEMBL proteins to genome\ mouseOverField protFullNames\ parent uniprot off\ priority 2\ searchIndex name,acc\ shortLabel TrEMBL Aln.\ showDiffBasesAllScales on\ skipFields isMain\ track unipAliTrembl\ type bigPsl\ urls acc="https://www.uniprot.org/uniprot/$$" hgncId="https://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$$" refseq="https://www.ncbi.nlm.nih.gov/nuccore/$$" refSeqProt="https://www.ncbi.nlm.nih.gov/protein/$$" ncbiGene="https://www.ncbi.nlm.nih.gov/gene/$$" entrezGene="https://www.ncbi.nlm.nih.gov/gene/$$" ensGene="https://www.ensembl.org/Gene/Summary?g=$$"\ visibility hide\ cpgIslandExtUnmasked Unmasked CpG bed 4 + CpG Islands on All Sequence (Islands < 300 Bases are Light Green) 0 2 0 100 0 128 228 128 0 0 0CpG islands are associated with genes, particularly housekeeping\ genes, in vertebrates. CpG islands are typically common near\ transcription start sites and may be associated with promoter\ regions. Normally a C (cytosine) base followed immediately by a \ G (guanine) base (a CpG) is rare in\ vertebrate DNA because the Cs in such an arrangement tend to be\ methylated. This methylation helps distinguish the newly synthesized\ DNA strand from the parent strand, which aids in the final stages of\ DNA proofreading after duplication. However, over evolutionary time,\ methylated Cs tend to turn into Ts because of spontaneous\ deamination. The result is that CpGs are relatively rare unless\ there is selective pressure to keep them or a region is not methylated\ for some other reason, perhaps having to do with the regulation of gene\ expression. 
CpG islands are regions where CpGs are present at\ significantly higher levels than is typical for the genome as a whole.
\ \\ The unmasked version of the track displays potential CpG islands\ that exist in repeat regions and would otherwise not be visible\ in the repeat masked version.\
\ \\ By default, only the masked version of the track is displayed. To view the\ unmasked version, change the visibility settings in the track controls at\ the top of this page.\
\ \CpG islands were predicted by searching the sequence one base at a\ time, scoring each dinucleotide (+17 for CG and -1 for others) and\ identifying maximally scoring segments. Each segment was then\ evaluated for the following criteria:\ \
\ The entire genome sequence, masking areas included, was\ used for the construction of the track Unmasked CpG.\ The track CpG Islands is constructed on the sequence after\ all masked sequence is removed.\
\ \The CpG count is the number of CG dinucleotides in the island. \ The Percentage CpG is the ratio of CpG nucleotide bases\ (twice the CpG count) to the length. The ratio of observed to expected \ CpG is calculated according to the formula (cited in \ Gardiner-Garden and Frommer (1987)):\ \
Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)\ \ where N = length of sequence.\
\ The calculation of the track data is performed by the following command sequence:\
\ twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \\\ | cpg_lh /dev/stdin 2> cpg_lh.err \\\ | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\\t%d\\t%s\\t%s %s\\t%s\\t%s\\t%0.0f\\t%0.1f\\t%s\\t%s\\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \\\ | sort -k1,1 -k2,2n > cpgIsland.bed\\ The unmasked track data is constructed the same way, except that\ twoBitToFa -noMask is run in place of the plain twoBitToFa command.\ \ \
\ The CpG Islands track and its associated tables can be explored interactively using the\ REST API, the\ Table Browser or the\ Data Integrator.\ All the tables can also be queried directly from our public MySQL\ servers, with more information available on our\ help page as well as on\ our blog.
\\ The source for the cpg_lh program can be obtained from\ src/utils/cpgIslandExt/.\ The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file")\
\ \This track was generated using a modification of a program developed by G. Miklem and L. Hillier \ (unpublished).
\ \\ Gardiner-Garden M, Frommer M.\ \ CpG islands in vertebrate genomes.\ J Mol Biol. 1987 Jul 20;196(2):261-82.\ PMID: 3656447\
\ regulation 1 html cpgIslandSuper\ longLabel CpG Islands on All Sequence (Islands < 300 Bases are Light Green)\ parent cpgIslandSuper hide\ priority 2\ shortLabel Unmasked CpG\ track cpgIslandExtUnmasked\ unipLocSignal Signal Peptide bigBed 12 + UniProt Signal Peptides 1 3 255 0 150 255 127 202 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipLocSignal.bb\ color 255,0,150\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ itemRgb off\ longLabel UniProt Signal Peptides\ parent uniprot\ priority 3\ shortLabel Signal Peptide\ track unipLocSignal\ type bigBed 12 +\ visibility dense\ unipLocExtra Extracellular bigBed 12 + UniProt Extracellular Domain 1 4 0 150 255 127 202 255 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipLocExtra.bb\ color 0,150,255\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ itemRgb off\ longLabel UniProt Extracellular Domain\ parent uniprot\ priority 4\ shortLabel Extracellular\ track unipLocExtra\ type bigBed 12 +\ visibility dense\ unipInterest Interest bigBed 12 + UniProt Regions of Interest 1 4 0 0 0 127 127 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipInterest.bb\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ itemRgb off\ longLabel UniProt Regions of Interest\ parent uniprot\ priority 4\ shortLabel Interest\ track unipInterest\ type bigBed 12 +\ visibility dense\ unipLocTransMemb Transmembrane bigBed 12 + UniProt Transmembrane Domains 1 5 0 150 0 127 202 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipLocTransMemb.bb\ color 0,150,0\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ itemRgb off\ longLabel UniProt Transmembrane Domains\ parent uniprot\ priority 5\ shortLabel Transmembrane\ track unipLocTransMemb\ type bigBed 12 +\ visibility dense\ unipLocCytopl Cytoplasmic bigBed 12 + UniProt Cytoplasmic Domains 1 6 255 150 0 255 202 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipLocCytopl.bb\ color 255,150,0\ 
filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ itemRgb off\ longLabel UniProt Cytoplasmic Domains\ parent uniprot\ priority 6\ shortLabel Cytoplasmic\ track unipLocCytopl\ type bigBed 12 +\ visibility dense\ unipChain Chains bigBed 12 + UniProt Mature Protein Products (Polypeptide Chains) 1 7 0 0 0 127 127 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipChain.bb\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ longLabel UniProt Mature Protein Products (Polypeptide Chains)\ parent uniprot\ priority 7\ shortLabel Chains\ track unipChain\ type bigBed 12 +\ urls uniProtId="http://www.uniprot.org/uniprot/$$#ptm_processing" pmids="https://www.ncbi.nlm.nih.gov/pubmed/$$"\ visibility dense\ unipDisulfBond Disulf. Bonds bigBed 12 + UniProt Disulfide Bonds 1 8 0 0 0 127 127 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipDisulfBond.bb\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ longLabel UniProt Disulfide Bonds\ parent uniprot\ priority 8\ shortLabel Disulf. 
Bonds\ track unipDisulfBond\ type bigBed 12 +\ visibility dense\ unipDomain Domains bigBed 12 + UniProt Domains 1 8 0 0 0 127 127 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipDomain.bb\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ longLabel UniProt Domains\ parent uniprot\ priority 8\ shortLabel Domains\ track unipDomain\ type bigBed 12 +\ urls uniProtId="http://www.uniprot.org/uniprot/$$#family_and_domains" pmids="https://www.ncbi.nlm.nih.gov/pubmed/$$"\ visibility dense\ unipModif AA Modifications bigBed 12 + UniProt Amino Acid Modifications 1 9 0 0 0 127 127 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipModif.bb\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ longLabel UniProt Amino Acid Modifications\ parent uniprot\ priority 9\ shortLabel AA Modifications\ track unipModif\ type bigBed 12 +\ urls uniProtId="http://www.uniprot.org/uniprot/$$#aaMod_section" pmids="https://www.ncbi.nlm.nih.gov/pubmed/$$"\ visibility dense\ unipMut Mutations bigBed 12 + UniProt Amino Acid Mutations 1 10 0 0 0 127 127 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipMut.bb\ longLabel UniProt Amino Acid Mutations\ parent uniprot\ priority 10\ shortLabel Mutations\ track unipMut\ type bigBed 12 +\ urls uniProtId="http://www.uniprot.org/uniprot/$$#pathology_and_biotech" pmids="https://www.ncbi.nlm.nih.gov/pubmed/$$" variationId="http://www.uniprot.org/uniprot/$$"\ visibility dense\ unipOther Other Annot. 
bigBed 12 + UniProt Other Annotations 1 11 0 0 0 127 127 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipOther.bb\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ longLabel UniProt Other Annotations\ parent uniprot\ priority 11\ shortLabel Other Annot.\ track unipOther\ type bigBed 12 +\ urls uniProtId="http://www.uniprot.org/uniprot/$$#family_and_domains" pmids="https://www.ncbi.nlm.nih.gov/pubmed/$$"\ visibility dense\ unipStruct Structure bigBed 12 + UniProt Protein Primary/Secondary Structure Annotations 0 11 0 0 0 127 127 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipStruct.bb\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ group genes\ longLabel UniProt Protein Primary/Secondary Structure Annotations\ parent uniprot\ priority 11\ shortLabel Structure\ track unipStruct\ type bigBed 12 +\ urls uniProtId="http://www.uniprot.org/uniprot/$$#structure" pmids="https://www.ncbi.nlm.nih.gov/pubmed/$$"\ visibility hide\ unipRepeat Repeats bigBed 12 + UniProt Repeats 1 12 0 0 0 127 127 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipRepeat.bb\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ longLabel UniProt Repeats\ parent uniprot\ priority 12\ shortLabel Repeats\ track unipRepeat\ type bigBed 12 +\ urls uniProtId="http://www.uniprot.org/uniprot/$$#family_and_domains" pmids="https://www.ncbi.nlm.nih.gov/pubmed/$$"\ visibility dense\ unipConflict Seq. Conflicts bigBed 12 + UniProt Sequence Conflicts 1 13 0 0 0 127 127 127 0 0 0 genes 1 bigDataUrl /gbdb/aplCal1/uniprot/unipConflict.bb\ filterValues.status Manually reviewed (Swiss-Prot),Unreviewed (TrEMBL)\ longLabel UniProt Sequence Conflicts\ parent uniprot off\ priority 13\ shortLabel Seq. 
Conflicts\ track unipConflict\ type bigBed 12 +\ urls uniProtId="http://www.uniprot.org/uniprot/$$#Sequence_conflict_section" pmids="https://www.ncbi.nlm.nih.gov/pubmed/$$"\ visibility dense\ est Sea hare ESTs psl est Sea hare ESTs Including Unspliced 0 100 0 0 0 127 127 127 1 0 0\ This track shows alignments between sea hare expressed sequence tags\ (ESTs) in \ GenBank and the genome. ESTs are single-read sequences,\ typically about 500 bases in length, that usually represent fragments of\ transcribed genes.\
\ \\ This track follows the display conventions for\ \ PSL alignment tracks. In dense display mode, the items that\ are more darkly shaded indicate matches of better quality.\
\ \\ The strand information (+/-) indicates the\ direction of the match between the EST and the matching\ genomic sequence. It bears no relationship to the direction\ of transcription of the RNA with which it might be associated.\
\ \\ The description page for this track has a filter that can be used to change\ the display mode, alter the color, and include/exclude a subset of items\ within the track. This may be helpful when many items are shown in the track\ display, especially when only some are relevant to the current task.\
\ \\ To use the filter:\
\ This track may also be configured to display base labeling, a feature that\ allows the user to display all bases in the aligning sequence or only those\ that differ from the genomic sequence. For more information about this option,\ go to the\ \ Base Coloring for Alignment Tracks page.\ Several types of alignment gap may also be colored;\ for more information, go to the\ \ Alignment Insertion/Deletion Display Options page.\
\ \\ To make an EST, RNA is isolated from cells and reverse\ transcribed into cDNA. Typically, the cDNA is cloned\ into a plasmid vector and a read is taken from the 5'\ and/or 3' primer. For most — but not all — ESTs, the\ reverse transcription is primed by an oligo-dT, which\ hybridizes with the poly-A tail of mature mRNA. The\ reverse transcriptase may or may not make it to the 5'\ end of the mRNA, which may or may not be degraded.\
\ \\ In general, the 3' ESTs mark the end of transcription\ reasonably well, but the 5' ESTs may end at any point\ within the transcript. Some of the newer cap-selected\ libraries cover transcription start reasonably well. Before the\ cap-selection techniques\ emerged, some projects used random rather than poly-A\ priming in an attempt to retrieve sequence distant from the\ 3' end. These projects were successful at this, but as\ a side effect also deposited sequences from unprocessed\ mRNA and perhaps even genomic sequences into the EST databases.\ Even outside of the random-primed projects, there is a\ degree of non-mRNA contamination. Because of this, a\ single unspliced EST should be viewed with considerable\ skepticism.\
\ \\ To generate this track, sea hare ESTs from GenBank were aligned\ against the genome using blat. Note that the maximum intron length\ allowed by blat is 750,000 bases, which may eliminate some ESTs with very\ long introns that might otherwise align. When a single\ EST aligned in multiple places, the alignment having the\ highest base identity was identified. Only alignments having\ a base identity level within 0.5% of the best and at least 96% base identity\ with the genomic sequence were kept.\
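The filtering rule above can be sketched as follows. This is an illustration, not the UCSC pipeline: it assumes each alignment is summarized by a single percent-identity value, and it reads "within 0.5% of the best" as 0.5 percentage points below the best identity for that EST. The function name `keep_near_best` and the tuple layout are hypothetical.

```python
def keep_near_best(alignments, min_identity=96.0, near_best=0.5):
    """Filter the alignments of one EST.

    alignments: list of (label, percent_identity) tuples for a single query.
    Keeps alignments with identity >= min_identity that are also within
    near_best percentage points of the best alignment (an assumption about
    how "within 0.5% of the best" is measured).
    """
    if not alignments:
        return []
    best = max(identity for _, identity in alignments)
    return [
        (label, identity)
        for label, identity in alignments
        if identity >= min_identity and identity >= best - near_best
    ]


kept = keep_near_best([("a", 99.0), ("b", 98.7), ("c", 97.0), ("d", 95.0)])
```

Here alignment `c` passes the 96% floor but is dropped because it is more than 0.5 points below the best alignment, while `d` fails the floor outright.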
\ \\ This track was produced at UCSC from EST sequence data\ submitted to the international public sequence databases by\ scientists worldwide.\
\ \\ Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW.\ \ GenBank.\ Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42.\ PMID: 23193287; PMC: PMC3531190\
\ \\ Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL.\ GenBank: update.\ Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6.\ PMID: 14681350; PMC: PMC308779\
\ \\ Kent WJ.\ BLAT - the BLAST-like alignment tool.\ Genome Res. 2002 Apr;12(4):656-64.\ PMID: 11932250; PMC: PMC187518\
\ rna 1 baseColorUseSequence genbank\ group rna\ indelDoubleInsert on\ indelQueryInsert on\ intronGap 30\ longLabel Sea hare ESTs Including Unspliced\ maxItems 300\ shortLabel Sea hare ESTs\ spectrum on\ table all_est\ track est\ type psl est\ visibility hide\ mrna Sea hare mRNAs psl . Sea hare mRNAs from GenBank 3 100 0 0 0 127 127 127 0 0 0\ The mRNA track shows alignments between sea hare mRNAs\ in \ GenBank and the genome.
\ \\ This track follows the display conventions for\ \ PSL alignment tracks. In dense display mode, the items that\ are more darkly shaded indicate matches of better quality.\
\ \\ The description page for this track has a filter that can be used to change\ the display mode, alter the color, and include/exclude a subset of items\ within the track. This may be helpful when many items are shown in the track\ display, especially when only some are relevant to the current task.\
\ \\ To use the filter:\
\ This track may also be configured to display codon coloring, a feature that\ allows the user to quickly compare mRNAs against the genomic sequence. For more\ information about this option, go to the\ \ Codon and Base Coloring for Alignment Tracks page.\ Several types of alignment gap may also be colored;\ for more information, go to the\ \ Alignment Insertion/Deletion Display Options page.\
\ \\ GenBank sea hare mRNAs were aligned against the genome using the\ blat program. When a single mRNA aligned in multiple places,\ the alignment having the highest base identity was found.\ Only alignments having a base identity level within 0.5% of\ the best and at least 96% base identity with the genomic sequence were kept.\
\ \\ The mRNA track was produced at UCSC from mRNA sequence data\ submitted to the international public sequence databases by\ scientists worldwide.\
\ \\ Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW.\ \ GenBank.\ Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42.\ PMID: 23193287; PMC: PMC3531190\
\ \\ Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL.\ GenBank: update.\ Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6.\ PMID: 14681350; PMC: PMC308779\
\ \\ Kent WJ.\ BLAT - the BLAST-like alignment tool.\ Genome Res. 2002 Apr;12(4):656-64.\ PMID: 11932250; PMC: PMC187518\
\ rna 1 baseColorDefault diffCodons\ baseColorUseCds genbank\ baseColorUseSequence genbank\ group rna\ indelDoubleInsert on\ indelPolyA on\ indelQueryInsert on\ longLabel Sea hare mRNAs from GenBank\ shortLabel Sea hare mRNAs\ showDiffBasesAllScales .\ table all_mrna\ track mrna\ type psl .\ visibility pack\ gold Assembly bed 3 + Assembly from Fragments 0 100 150 100 30 230 170 40 0 0 0\ This track shows the draft assembly of the sea hare genome. \ \ Whole-genome shotgun reads were assembled into contigs. When possible, \ contigs were grouped into scaffolds (also known as "supercontigs").\ The order, orientation and gap sizes between contigs within a scaffold are\ based on paired-end read evidence.
\\ In dense mode, this track depicts the contigs that make up the \ currently viewed scaffold. \ Contig boundaries are distinguished by the use of alternating gold and brown \ coloration. Where gaps\ exist between contigs, spaces are shown between the gold and brown\ blocks. The relative order and orientation of the contigs\ within a scaffold are always known; therefore, a line is drawn in the graphical\ display to bridge the blocks.
\\ All components within this track are of fragment type "W": \ Whole Genome Shotgun contig.
\ map 1 altColor 230,170,40\ color 150,100,30\ group map\ longLabel Assembly from Fragments\ shortLabel Assembly\ track gold\ type bed 3 +\ visibility hide\ cpgIslandSuper CpG Islands bed 4 + CpG Islands (Islands < 300 Bases are Light Green) 0 100 0 100 0 128 228 128 0 0 0CpG islands are associated with genes, particularly housekeeping\ genes, in vertebrates. CpG islands are typically common near\ transcription start sites and may be associated with promoter\ regions. Normally a C (cytosine) base followed immediately by a \ G (guanine) base (a CpG) is rare in\ vertebrate DNA because the Cs in such an arrangement tend to be\ methylated. This methylation helps distinguish the newly synthesized\ DNA strand from the parent strand, which aids in the final stages of\ DNA proofreading after duplication. However, over evolutionary time,\ methylated Cs tend to turn into Ts because of spontaneous\ deamination. The result is that CpGs are relatively rare unless\ there is selective pressure to keep them or a region is not methylated\ for some other reason, perhaps having to do with the regulation of gene\ expression. CpG islands are regions where CpGs are present at\ significantly higher levels than is typical for the genome as a whole.
\ \\ The unmasked version of the track displays potential CpG islands\ that exist in repeat regions and would otherwise not be visible\ in the repeat masked version.\
\ \\ By default, only the masked version of the track is displayed. To view the\ unmasked version, change the visibility settings in the track controls at\ the top of this page.\
\ \CpG islands were predicted by searching the sequence one base at a\ time, scoring each dinucleotide (+17 for CG and -1 for others) and\ identifying maximally scoring segments. Each segment was then\ evaluated for the following criteria:\ \
\ The entire genome sequence, masking areas included, was\ used for the construction of the track Unmasked CpG.\ The track CpG Islands is constructed on the sequence after\ all masked sequence is removed.\
\ \The CpG count is the number of CG dinucleotides in the island. \ The Percentage CpG is the ratio of CpG nucleotide bases\ (twice the CpG count) to the length. The ratio of observed to expected \ CpG is calculated according to the formula (cited in \ Gardiner-Garden and Frommer (1987)):\ \
Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)\ \ where N = length of sequence.\
\ The calculation of the track data is performed by the following command sequence:\
\ twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \\\ | cpg_lh /dev/stdin 2> cpg_lh.err \\\ | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\\t%d\\t%s\\t%s %s\\t%s\\t%s\\t%0.0f\\t%0.1f\\t%s\\t%s\\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \\\ | sort -k1,1 -k2,2n > cpgIsland.bed\\ The unmasked track data is constructed the same way, except that\ twoBitToFa -noMask is run in place of the plain twoBitToFa command.\ \ \
\ The CpG Islands track and its associated tables can be explored interactively using the\ REST API, the\ Table Browser or the\ Data Integrator.\ All the tables can also be queried directly from our public MySQL\ servers, with more information available on our\ help page as well as on\ our blog.
\\ The source for the cpg_lh program can be obtained from\ src/utils/cpgIslandExt/.\ The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file")\
\ \This track was generated using a modification of a program developed by G. Miklem and L. Hillier \ (unpublished).
\ \\ Gardiner-Garden M, Frommer M.\ \ CpG islands in vertebrate genomes.\ J Mol Biol. 1987 Jul 20;196(2):261-82.\ PMID: 3656447\
\ regulation 1 altColor 128,228,128\ color 0,100,0\ group regulation\ html cpgIslandSuper\ longLabel CpG Islands (Islands < 300 Bases are Light Green)\ shortLabel CpG Islands\ superTrack on\ track cpgIslandSuper\ type bed 4 +\ gap Gap bed 3 + Gap Locations 1 100 0 0 0 127 127 127 0 0 0\ Gaps are represented as black boxes in this track.\ If the relative order and orientation of the contigs on either side\ of the gap is supported by read pair data, \ it is a bridged gap and a white line is drawn \ through the black box representing the gap. \
\This assembly contains the following principal types of gaps:\
\ The GC percent track shows the percentage of G (guanine) and C (cytosine) bases\ in 5-base windows. High GC content is typically associated with\ gene-rich areas.\
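A minimal sketch of the windowed GC calculation described above, assuming non-overlapping 5-base windows and dropping any trailing partial window; neither assumption is confirmed by the source, and the function name `gc_percent_windows` is hypothetical.

```python
def gc_percent_windows(seq, span=5):
    """Return (window_start, gc_percent) for each full window of `span` bases."""
    seq = seq.upper()
    results = []
    # Step by `span` so that windows do not overlap (assumption).
    for start in range(0, len(seq) - span + 1, span):
        window = seq[start:start + span]
        gc_bases = window.count("G") + window.count("C")
        results.append((start, 100.0 * gc_bases / span))
    return results


windows = gc_percent_windows("GGCCAATATA")
```

The first window `GGCCA` contains four G or C bases out of five (80%), and the second window `ATATA` contains none.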
\\ This track may be configured in a variety of ways to highlight different\ aspects of the displayed information. Click the\ "Graph configuration help"\ link for an explanation of the configuration options.\ \
The data and presentation of this graph were prepared by\ Hiram Clawson.\
\ \ map 0 altColor 128,128,128\ autoScale Off\ color 0,0,0\ graphTypeDefault Bar\ gridDefault OFF\ group map\ longLabel GC Percent in 5-Base Windows\ maxHeightPixels 128:36:16\ shortLabel GC Percent\ spanList 5\ track gc5Base\ type wig 0 100\ viewLimits 30:70\ visibility hide\ windowingFunction Mean\ blastHg18KG Human Proteins psl protein Human Proteins Mapped by Chained tBLASTn 3 100 0 0 0 127 127 127 0 0 0\ This track contains tBLASTn alignments of the peptides from the predicted and \ known genes identified in the hg18 UCSC Genes track.
\ \\ tBLASTn is part of the NCBI BLAST tool set. For more information on BLAST, see\ Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. \ Basic local alignment search tool. \ J Mol Biol. 1990 Oct 5;215(3):403-410.
\\ Blat was written by Jim Kent. The remaining utilities \ used to produce this track were written by Jim Kent or Brian Raney.
\ genes 1 blastRef hg18.blastKGRef04\ colorChromDefault off\ group genes\ longLabel Human Proteins Mapped by Chained tBLASTn\ pred hg18.blastKGPep04\ shortLabel Human Proteins\ track blastHg18KG\ type psl protein\ visibility pack\ nestedRepeats Interrupted Rpts bed 12 + Fragments of Interrupted Repeats Joined by RepeatMasker ID 0 100 0 0 0 127 127 127 1 0 0\ This track shows joined fragments of interrupted repeats extracted\ from the output of the \ RepeatMasker program which screens DNA sequences\ for interspersed repeats and low complexity DNA sequences using the\ \ Repbase Update library of repeats from the\ Genetic\ Information Research Institute (GIRI). Repbase Update is described in\ Jurka (2000) in the References section below.\
\ \\ The detailed annotations from RepeatMasker are in the RepeatMasker track. This\ track shows fragments of original repeat insertions which have been interrupted\ by insertions of younger repeats or through local rearrangements. The fragments\ are joined using the ID column of RepeatMasker output.\
\ \\ In pack or full mode, each interrupted repeat is displayed as boxes\ (fragments) joined by horizontal lines, labeled with the repeat name.\ If all fragments are on the same strand, arrows are added to the\ horizontal line to indicate the strand. In dense or squish mode, labels\ and arrows are omitted and in dense mode, all items are collapsed to\ fit on a single row.\
\ \\ Items are shaded according to the average identity score of their\ fragments. Usually, the shade of an item is similar to the shades of\ its fragments unless some fragments are much more diverged than\ others. The score displayed above is the average identity score,\ clipped to a range of 50% - 100% and then mapped to the range\ 0 - 1000 for shading in the browser.\
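The clip-and-rescale step described above might look like the sketch below. The text states only the clipping range (50% - 100%) and the target range (0 - 1000); the linear mapping and the rounding to an integer are assumptions, and the function name is hypothetical.

```python
def identity_to_score(avg_identity_percent):
    """Clip an average identity to 50-100% and rescale it to 0-1000.

    Assumption: the mapping between the two ranges is linear.
    """
    clipped = min(100.0, max(50.0, avg_identity_percent))
    return round((clipped - 50.0) * 1000.0 / 50.0)


for identity in (40.0, 75.0, 90.0, 100.0):
    print(identity, identity_to_score(identity))
```

Under these assumptions an item with 75% average identity receives a score of 500, and anything at or below 50% receives 0.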
\ \\ UCSC has used the most current versions of the RepeatMasker software\ and repeat libraries available to generate these data. Note that these\ versions may be newer than those that are publicly available on the Internet.\
\ \\ Data are generated using the RepeatMasker -s flag. Additional flags\ may be used for certain organisms. See the\ FAQ for more information.\
\ \\ Thanks to Arian Smit, Robert Hubley and GIRI for providing the tools and\ repeat libraries used to generate this track.\
\ \\ Smit AFA, Hubley R, Green P.\ RepeatMasker Open-3.0.\ \ http://www.repeatmasker.org. 1996-2010.\
\ \\ Repbase Update is described in:\
\ \\ Jurka J.\ \ Repbase Update: a database and an electronic journal of repetitive elements.\ Trends Genet. 2000 Sep;16(9):418-420.\ PMID: 10973072\
\ \\ For a discussion of repeats in mammalian genomes, see:\
\ \\ Smit AF.\ \ Interspersed repeats and other mementos of transposable elements in mammalian genomes.\ Curr Opin Genet Dev. 1999 Dec;9(6):657-63.\ PMID: 10607616\
\ \\ Smit AF.\ \ The origin of interspersed repeats in the human genome.\ Curr Opin Genet Dev. 1996 Dec;6(6):743-8.\ PMID: 8994846\
\ varRep 1 exonNumbers off\ group varRep\ longLabel Fragments of Interrupted Repeats Joined by RepeatMasker ID\ shortLabel Interrupted Rpts\ track nestedRepeats\ type bed 12 +\ useScore 1\ visibility hide\ microsat Microsatellite bed 4 Microsatellites - Di-nucleotide and Tri-nucleotide Repeats 0 100 0 0 0 127 127 127 0 0 0\ This track displays regions that are likely to be useful as microsatellite\ markers. These are sequences of at least 15 perfect di-nucleotide and \ tri-nucleotide repeats and tend to be highly polymorphic in the\ population.\
\ \\ The data shown in this track are a subset of the Simple Repeats track, \ selecting only those \ repeats of period 2 and 3, with 100% identity and no indels and with\ at least 15 copies of the repeat. The Simple Repeats track is\ created using the \ Tandem Repeats Finder. For more information about this \ program, see Benson (1999).
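As an illustration of the selection rule above, the following sketch filters simple-repeat records held as plain dictionaries. The key names (period, copyNum, perMatch, perIndel) are assumed here for illustration and should be checked against the actual Simple Repeats table schema; the function name is hypothetical.

```python
def is_microsatellite(repeat):
    """Return True for a perfect di- or tri-nucleotide repeat with >= 15 copies."""
    return (
        repeat["period"] in (2, 3)      # di- or tri-nucleotide unit
        and repeat["perMatch"] == 100   # 100% identity between copies
        and repeat["perIndel"] == 0     # no insertions or deletions
        and repeat["copyNum"] >= 15     # at least 15 copies of the unit
    )


repeats = [
    {"name": "CA", "period": 2, "copyNum": 21.0, "perMatch": 100, "perIndel": 0},
    {"name": "GAA", "period": 3, "copyNum": 12.3, "perMatch": 100, "perIndel": 0},
    {"name": "TTTA", "period": 4, "copyNum": 30.0, "perMatch": 100, "perIndel": 0},
    {"name": "TG", "period": 2, "copyNum": 25.5, "perMatch": 96, "perIndel": 2},
]
print([r["name"] for r in repeats if is_microsatellite(r)])  # ['CA']
```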
\ \\ Tandem Repeats Finder was written by \ Gary Benson.
\ \\ Benson G.\ \ Tandem repeats finder: a program to analyze DNA sequences.\ Nucleic Acids Res. 1999 Jan 15;27(2):573-80.\ PMID: 9862982; PMC: PMC148217\
\ varRep 1 group varRep\ longLabel Microsatellites - Di-nucleotide and Tri-nucleotide Repeats\ shortLabel Microsatellite\ track microsat\ type bed 4\ visibility hide\ xenoRefGene Other RefSeq genePred xenoRefPep xenoRefMrna Non-Sea hare RefSeq Genes 1 100 12 12 120 133 133 187 0 0 0\ This track shows known protein-coding and non-protein-coding genes \ for organisms other than sea hare, taken from the NCBI RNA reference \ sequences collection (RefSeq). The data underlying this track are \ updated weekly.
\ \\ This track follows the display conventions for \ gene prediction \ tracks.\ The color shading indicates the level of review the RefSeq record has \ undergone: predicted (light), provisional (medium), reviewed (dark).
\\ The item labels and display colors of features within this track can be\ configured through the controls at the top of the track description page. \
\ The RNAs were aligned against the sea hare genome using blat; those\ with an alignment of less than 15% were discarded. When a single RNA aligned \ in multiple places, the alignment having the highest base identity was \ identified. Only alignments having a base identity level within 0.5% of \ the best and at least 25% base identity with the genomic sequence were kept.\
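The near-best filter described above can be sketched as below. This is an illustration only: the function name and the (name, identity) tuple layout are hypothetical, "within 0.5% of the best" is read as 0.5 percentage points of base identity, and the separate 15% alignment threshold is not modeled.

```python
def keep_near_best(alignments, min_identity, near_best=0.5):
    """Filter the alignments of ONE query sequence.

    alignments: list of (name, percent_identity) tuples.
    Keeps alignments whose identity is at least `min_identity` and within
    `near_best` percentage points of the best alignment for that query.
    """
    if not alignments:
        return []
    best = max(identity for _, identity in alignments)
    return [
        (name, identity)
        for name, identity in alignments
        if identity >= min_identity and best - identity <= near_best
    ]


hits = [("locusA", 90.0), ("locusB", 89.7), ("locusC", 85.0)]
print(keep_near_best(hits, min_identity=25.0))
```

With the 25% threshold used for this track, locusA and locusB are kept and locusC is dropped because it falls more than 0.5 points below the best hit.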
\ \\ This track was produced at UCSC from RNA sequence data\ generated by scientists worldwide and curated by the \ NCBI RefSeq project.
\ \\ Kent WJ.\ \ BLAT--the BLAST-like alignment tool.\ Genome Res. 2002 Apr;12(4):656-64.\ PMID: 11932250; PMC: PMC187518\
\ \\ Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J,\ Landrum MJ, McGarvey KM et al.\ \ RefSeq: an update on mammalian reference sequences.\ Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63.\ PMID: 24259432; PMC: PMC3965018\
\ \\ Pruitt KD, Tatusova T, Maglott DR.\ \ NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.\ Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4.\ PMID: 15608248; PMC: PMC539979\
\ genes 1 color 12,12,120\ group genes\ longLabel Non-Sea hare RefSeq Genes\ shortLabel Other RefSeq\ track xenoRefGene\ type genePred xenoRefPep xenoRefMrna\ visibility dense\ quality Quality Scores wig 0 63 Sea hare Sequencing Quality Scores 0 100 0 128 255 255 128 0 0 0 0\ The Quality Scores track shows the sequencing quality score\ of each base in the assembly. The height at each position of the track \ indicates the quality of the base. \ When zoomed out to a large range, the heights reflect the averaged scores. \
\\ This track may be configured in a variety of ways to highlight different aspects \ of the displayed information. Click the \ Graph \ configuration help link for an explanation of the configuration options.
\ \\ The quality scores were provided as part of the sea hare assembly. \ The database representation and graphical display code were written by\ Hiram Clawson.\ map 0 altColor 255,128,0\ autoScale Off\ color 0,128,255\ graphTypeDefault Bar\ gridDefault OFF\ group map\ longLabel Sea hare Sequencing Quality Scores\ maxHeightPixels 128:36:16\ shortLabel Quality Scores\ spanList 1,1024\ track quality\ type wig 0 63\ visibility hide\ windowingFunction Mean\ simpleRepeat Simple Repeats bed 4 + Simple Tandem Repeats by TRF 0 100 0 0 0 127 127 127 0 0 0
\ This track displays simple tandem repeats (possibly imperfect repeats) located\ by Tandem Repeats\ Finder (TRF) which is specialized for this purpose. These repeats can\ occur within coding regions of genes and may be quite\ polymorphic. Repeat expansions are sometimes associated with specific\ diseases.
\ \\ For more information about the TRF program, see Benson (1999).\
\ \\ TRF was written by \ Gary Benson.
\ \\ Benson G.\ \ Tandem repeats finder: a program to analyze DNA sequences.\ Nucleic Acids Res. 1999 Jan 15;27(2):573-80.\ PMID: 9862982; PMC: PMC148217\
\ varRep 1 group varRep\ longLabel Simple Tandem Repeats by TRF\ shortLabel Simple Repeats\ track simpleRepeat\ type bed 4 +\ visibility hide\ intronEst Spliced ESTs psl est Sea hare ESTs That Have Been Spliced 1 100 0 0 0 127 127 127 1 0 0\ This track shows alignments between sea hare expressed sequence tags\ (ESTs) in \ GenBank and the genome that show signs of splicing when\ aligned against the genome. ESTs are single-read sequences, typically about\ 500 bases in length, that usually represent fragments of transcribed genes.\
\ \\ To be considered spliced, an EST must show\ evidence of at least one canonical intron (i.e., the genomic\ sequence between EST alignment blocks must be at least 32 bases in\ length and have GT/AG ends). By requiring splicing, the level\ of contamination in the EST databases is drastically reduced\ at the expense of eliminating many genuine 3' ESTs.\ For a display of all ESTs (including unspliced), see the\ sea hare EST track.\
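The splicing test described above could be sketched as follows. This is a simplified illustration rather than the production logic: the function name is hypothetical, alignment blocks are assumed to be sorted 0-based half-open genomic intervals, and only GT/AG on the given strand is checked (an intron on the opposite strand, which would read CT/AC here, is not handled).

```python
MIN_INTRON = 32  # minimum genomic gap, in bases, to count as an intron


def has_canonical_intron(genome, blocks):
    """Return True if any gap between consecutive blocks looks like a GT/AG intron.

    genome: genomic sequence as a string.
    blocks: sorted list of (start, end) alignment blocks, 0-based half-open.
    """
    for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:]):
        gap = genome[prev_end:next_start].upper()
        if len(gap) >= MIN_INTRON and gap.startswith("GT") and gap.endswith("AG"):
            return True
    return False


# Two 10-base exons separated by a 34-base gap with GT...AG ends.
spliced_genome = "A" * 10 + "GT" + "C" * 30 + "AG" + "A" * 10
print(has_canonical_intron(spliced_genome, [(0, 10), (44, 54)]))  # True
```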
\ \\ This track follows the display conventions for\ \ PSL alignment tracks. In dense display mode, darker shading\ indicates a larger number of aligned ESTs.\
\ \\ The strand information (+/-) indicates the\ direction of the match between the EST and the matching\ genomic sequence. It bears no relationship to the direction\ of transcription of the RNA with which it might be associated.\
\ \\ The description page for this track has a filter that can be used to change\ the display mode, alter the color, and include/exclude a subset of items\ within the track. This may be helpful when many items are shown in the track\ display, especially when only some are relevant to the current task.\
\ \\ To use the filter:\
\ This track may also be configured to display base labeling, a feature that\ allows the user to display all bases in the aligning sequence or only those\ that differ from the genomic sequence. For more information about this option,\ go to the\ \ Base Coloring for Alignment Tracks page.\ Several types of alignment gap may also be colored;\ for more information, go to the\ \ Alignment Insertion/Deletion Display Options page.\
\ \\ To make an EST, RNA is isolated from cells and reverse\ transcribed into cDNA. Typically, the cDNA is cloned\ into a plasmid vector and a read is taken from the 5'\ and/or 3' primer. For most — but not all — ESTs, the\ reverse transcription is primed by an oligo-dT, which\ hybridizes with the poly-A tail of mature mRNA. The\ reverse transcriptase may or may not make it to the 5'\ end of the mRNA, which may or may not be degraded.\
\ \\ In general, the 3' ESTs mark the end of transcription\ reasonably well, but the 5' ESTs may end at any point\ within the transcript. Some of the newer cap-selected\ libraries cover transcription start reasonably well. Before the\ cap-selection techniques\ emerged, some projects used random rather than poly-A\ priming in an attempt to retrieve sequence distant from the\ 3' end. These projects were successful at this, but as\ a side effect also deposited sequences from unprocessed\ mRNA and perhaps even genomic sequences into the EST databases.\ Even outside of the random-primed projects, there is a\ degree of non-mRNA contamination. Because of this, a\ single unspliced EST should be viewed with considerable\ skepticism.\
\ \\ To generate this track, sea hare ESTs from GenBank were aligned\ against the genome using blat. Note that the maximum intron length\ allowed by blat is 750,000 bases, which may eliminate some ESTs with very\ long introns that might otherwise align. When a single\ EST aligned in multiple places, the alignment having the\ highest base identity was identified. Only alignments having\ a base identity level within 0.5% of the best and at least 96% base identity\ with the genomic sequence are displayed in this track.\
\ \\ This track was produced at UCSC from EST sequence data\ submitted to the international public sequence databases by\ scientists worldwide.\
\ \\ Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW.\ \ GenBank.\ Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42.\ PMID: 23193287; PMC: PMC3531190\
\ \\ Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL.\ GenBank: update.\ Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6.\ PMID: 14681350; PMC: PMC308779\
\ \\ Kent WJ.\ BLAT - the BLAST-like alignment tool.\ Genome Res. 2002 Apr;12(4):656-64.\ PMID: 11932250; PMC: PMC187518\
\ rna 1 baseColorUseSequence genbank\ group rna\ indelDoubleInsert on\ indelQueryInsert on\ intronGap 30\ longLabel Sea hare ESTs That Have Been Spliced\ maxItems 300\ shortLabel Spliced ESTs\ showDiffBasesAllScales .\ spectrum on\ track intronEst\ type psl est\ visibility dense\ uniprot UniProt bigBed 12 + UniProt SwissProt/TrEMBL Protein Annotations 0 100 0 0 0 127 127 127 0 0 0\ This track shows protein sequences and annotations on them from the UniProt/SwissProt database,\ mapped to genomic coordinates. \
\\ UniProt/SwissProt data has been curated from scientific publications by the UniProt staff;\ UniProt/TrEMBL data has been predicted by various computational algorithms.\ The annotations are divided into multiple subtracks, based on their "feature type" in UniProt.\ The first two subtracks below - one for SwissProt, one for TrEMBL - show the\ alignments of protein sequences to the genome; all other tracks below are the protein annotations\ mapped through these alignments to the genome.\
Track Name | Description |
---|---|
UCSC Alignment, SwissProt = curated protein sequences | Protein sequences from SwissProt mapped to the genome. All other tracks are (start,end) SwissProt annotations on these sequences mapped through this alignment. Even protein sequences without a single curated annotation (splice isoforms) are visible in this track. Each UniProt protein has one main isoform, which is colored in dark. Alternative isoforms are sequences that do not have annotations on them and are colored in light-blue. They can be hidden with the TrEMBL/Isoform filter (see below). |
UCSC Alignment, TrEMBL = predicted protein sequences | Protein sequences from TrEMBL mapped to the genome. All other tracks below are (start,end) TrEMBL annotations mapped to the genome using this track. This track is hidden by default. To show it, click its checkbox on the track configuration page. |
UniProt Signal Peptides | Regions found in proteins destined to be secreted, generally cleaved from mature protein. |
UniProt Extracellular Domains | Protein domains with the comment "Extracellular". |
UniProt Transmembrane Domains | Protein domains of the type "Transmembrane". |
UniProt Cytoplasmic Domains | Protein domains with the comment "Cytoplasmic". |
UniProt Polypeptide Chains | Polypeptide chain in mature protein after post-processing. |
UniProt Regions of Interest | Regions that have been experimentally defined, such as the role of a region in mediating protein-protein interactions or some other biological process. |
UniProt Domains | Protein domains, zinc finger regions and topological domains. |
UniProt Disulfide Bonds | Disulfide bonds. |
UniProt Amino Acid Modifications | Glycosylation sites, modified residues and lipid moiety-binding regions. |
UniProt Amino Acid Mutations | Mutagenesis sites and sequence variants. |
UniProt Protein Primary/Secondary Structure Annotations | Beta strands, helices, coiled-coil regions and turns. |
UniProt Sequence Conflicts | Differences between GenBank sequences and the UniProt sequence. |
UniProt Repeats | Regions of repeated sequence motifs or repeated domains. |
UniProt Other Annotations | All other annotations, e.g. compositional bias. |
\ For consistency and convenience for users of mutation-related tracks,\ the subtrack "UniProt/SwissProt Variants" is a copy of the track\ "UniProt Variants" in the track group "Phenotype and Literature", or \ "Variation and Repeats", depending on the assembly.\
\ \\ Genomic locations of UniProt/SwissProt annotations are labeled with a short name for\ the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide"\ etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt\ record for more details. TrEMBL annotations are always shown in \ light blue, except in the Signal Peptides,\ Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks.
\ \\ Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse over will\ show the full name of the UniProt disease acronym.\
\ \\ The subtracks for domains related to subcellular location are sorted from outside to inside of \ the cell: Signal peptide, \ extracellular, \ transmembrane, and cytoplasmic.\
\ \\ In the "UniProt Modifications" track, lipidation sites are highlighted in \ dark blue, glycosylation sites in \ dark green, and phosphorylation sites in \ light green.
\ \\ Duplicate annotations are removed as far as possible: if a TrEMBL annotation\ has the same genome position and same feature type, comment, disease and\ mutated amino acids as a SwissProt annotation, it is not shown again. Two\ annotations mapped through different protein sequence alignments but with the same genome\ coordinates are only shown once.
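The first rule above (a TrEMBL annotation is suppressed when an equivalent SwissProt annotation exists) can be sketched as a key comparison. The dictionary keys and the function name are hypothetical and chosen only to mirror the fields listed in the text; the second rule, about identical coordinates reached through different alignments, is not modeled.

```python
def drop_duplicate_trembl(swissprot, trembl):
    """Return the TrEMBL annotations that do not duplicate a SwissProt annotation."""

    def key(annotation):
        # Fields the text names as defining a duplicate.
        return (
            annotation["chrom"],
            annotation["start"],
            annotation["end"],
            annotation["feature_type"],
            annotation["comment"],
            annotation["disease"],
            annotation["mutation"],
        )

    seen = {key(a) for a in swissprot}
    return [a for a in trembl if key(a) not in seen]


base = {"chrom": "chr1", "start": 100, "end": 130, "feature_type": "signal peptide",
        "comment": "", "disease": "", "mutation": ""}
swissprot = [dict(base)]
trembl = [dict(base), dict(base, start=200, end=230)]
print(len(drop_duplicate_trembl(swissprot, trembl)))  # 1
```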
\ \On the configuration page of this track, you can choose to hide any TrEMBL annotations.\ This filter will also hide the UniProt alternative isoform protein sequences because\ both types of information are less relevant to most users. Please contact us if you\ want more detailed filtering features.
\ \Note that for the human hg38 assembly and SwissProt annotations, there\ also is a public\ track hub prepared by UniProt itself, with \ genome annotations maintained by UniProt using their own mapping\ method based on those Gencode/Ensembl gene models that are annotated in UniProt\ for a given protein. For proteins that differ from the genome, UniProt's mapping method\ will, in most cases, map a protein and its annotations to an unexpected location\ (see below for details on UCSC's mapping method).
\ \\ Briefly, UniProt protein sequences were aligned to the transcripts associated\ with the protein, the top-scoring alignments were retained, and the result was\ projected to the genome through a transcript-to-genome alignment.\ Depending on the genome, the transcript-genome alignment was either\ provided by the source database (NCBI RefSeq), created at UCSC (UCSC RefSeq) or\ derived from the transcripts (Ensembl/Augustus). The transcript set is NCBI\ RefSeq for hg38, UCSC RefSeq for hg19 (due to alt/fix haplotype misplacements \ in the NCBI RefSeq set on hg19). For other genomes, RefSeq, Ensembl and Augustus \ are tried, in this order. The resulting protein-genome alignments of this process \ are available in the file formats for liftOver or pslMap from our data archive\ (see "Data Access" section below).\
\ \An important step of the mapping process protein -> transcript ->\ genome is filtering the alignment from protein to transcript. Due to\ differences between the UniProt proteins and the transcripts (proteins were\ made many years before the transcripts were made, and human genomes have\ variants), the transcript with the highest BLAST score when aligning the\ protein to all transcripts is not always the correct transcript for a protein\ sequence. Therefore, the protein sequence is aligned to only a very short list\ of one or sometimes more transcripts, selected by a three-step procedure:\
\ For strategy 2 and 3, many of the transcripts found do not differ in coding\ sequence, so the resulting alignments on the genome will be identical.\ Therefore, any identical alignments are removed in a final filtering step. The\ details page of these alignments will contain a list of all transcripts that\ result in the same protein-genome alignment. On hg38, only a handful of edge\ cases (pseudogenes, very recently added proteins) remain in 2023 where strategy\ 3 has to be used.
\ \In other words, when an NCBI or UCSC RefSeq track is used for the mapping and to align a\ protein sequence to the correct transcript, we use a three stage process:\
This system was designed to resolve the problem of incorrect mappings of\ proteins, mostly on hg38, due to differences between the SwissProt\ sequences and the genome reference sequence, which has changed since the\ proteins were defined. The problem is most pronounced for gene families\ composed of either very repetitive or very similar proteins. To make sure that\ the alignments always go to the best chromosome location, all _alt and _fix\ reference patch sequences are ignored for the alignment, so the patches are\ entirely free of UniProt annotations. Please contact us if you have feedback on\ this process or example edge cases. We are not aware of a way to evaluate the\ results completely and in an automated manner.
\\ Proteins were aligned to transcripts with TBLASTN, converted to PSL, filtered\ with pslReps (93% query coverage, keep alignments within top 1% score), lifted to genome\ positions with pslMap and filtered again with pslReps. UniProt annotations were\ obtained from the UniProt XML file. The UniProt annotations were then mapped to the\ genome through the alignment described above using the pslMap program. This approach\ draws heavily on the LS-SNP pipeline by Mark Diekhans.\ Like all Genome Browser source code, the main script used to build this track\ can be found on GitHub.\
\ \\ This track is automatically updated on an ongoing basis, every 2-3 months.\ The current version name is always shown on the track details page; it includes the\ release of UniProt, the version of the transcript set and a unique MD5 that is\ based on the protein sequences, the transcript sequences, the mapping file\ between both and the transcript-genome alignment. The exact transcript\ that was used for the alignment is shown when clicking a protein alignment\ in one of the two alignment tracks.\
\ \\ For reproducibility of older analysis results and for manual inspection, previous versions of this track\ are available for browsing in the form of the UCSC UniProt Archive Track Hub (click this link to connect the hub now). The underlying data of\ all releases of this track (past and current) can be obtained from our downloads server, including the UniProt\ protein-to-genome alignment.
\ \\ The raw data of the current track can be explored interactively with the\ Table Browser, or the\ Data Integrator.\ For automated analysis, the genome annotation is stored in a bigBed file that \ can be downloaded from the\ download server.\ The exact filenames can be found in the \ track configuration file. \ Annotations can be converted to ASCII text by our tool bigBedToBed\ which can be compiled from the source code or downloaded as a precompiled\ binary for your system. Instructions for downloading source code and binaries can be found\ here.\ The tool can also be used to obtain only features within a given range, for example:\
\ bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/aplCal1/uniprot/unipStruct.bb -chrom=chr6 -start=0 -end=1000000 stdout \
\ Please refer to our\ mailing list archives\ for questions, or our\ Data Access FAQ\ for more information. \ \ \\ \
To facilitate mapping protein coordinates to the genome, we provide the\ alignment files in formats that are suitable for our command line tools. Our\ command line programs liftOver or pslMap can be used to map\ coordinates on protein sequences to genome coordinates. The filenames are\ unipToGenome.over.chain.gz (liftOver) and unipToGenomeLift.psl.gz (pslMap).
\ \Example commands:\
\ wget -q https://hgdownload.soe.ucsc.edu/goldenPath/archive/hg38/uniprot/2022_03/unipToGenome.over.chain.gz\ wget -q https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver\ chmod a+x liftOver\ echo 'Q99697 1 10 annotationOnProtein' > prot.bed\ liftOver prot.bed unipToGenome.over.chain.gz genome.bed\ cat genome.bed\\ \ \
\ This track was created by Maximilian Haeussler at UCSC, with a lot of input from Chris\ Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff, Alejo\ Mujica, Regeneron Pharmaceuticals and Pia Riestra, GeneDx. Thanks to UniProt for making all data\ available for download.\
\ \\ UniProt Consortium.\ \ Reorganizing the protein space at the Universal Protein Resource (UniProt).\ Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5.\ PMID: 22102590; PMC: PMC3245120\
\ \\ Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A.\ \ The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure\ information on human protein variants.\ Hum Mutat. 2004 May;23(5):464-70.\ PMID: 15108278\
\ genes 1 allButtonPair on\ compositeTrack on\ dataVersion /gbdb/$D/uniprot/version.txt\ exonNumbers off\ group genes\ hideEmptySubtracks on\ itemRgb on\ longLabel UniProt SwissProt/TrEMBL Protein Annotations\ mouseOverField comments\ shortLabel UniProt\ track uniprot\ type bigBed 12 +\ urls uniProtId="http://www.uniprot.org/uniprot/$$#section_features" pmids="https://www.ncbi.nlm.nih.gov/pubmed/$$"\ visibility hide\