Description

This container shows results from projects whose variant frequencies (also known as allele frequencies) are publicly available. The tracks were collected from the projects listed below. More detailed data for projects that provide haplotype-phased genotypes/variants can be found in other tracks: 1000 Genomes is a separate track, and HGDP, SGDP, HGDP+1000 Genomes and Mexico Biobank can be found in the "Phased Variants" track, which shows the linkage between variants.

If you want us to add other projects, please contact us. We were unable to obtain variant frequencies from the following projects: UK Biobank (request pending), Regeneron's Million Exomes and Mexico City Studies (request rejected).

The following projects were added:

Display Conventions

Most tracks show the variant and allele frequencies only on mouseover or click. When zoomed in, tracks display alleles with base-specific coloring. Homozygous genotypes are shown as one letter, while heterozygous genotypes are shown with both letters. All VCF files are normalized, with a single allele per annotation (no multi-allelic lines).
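To illustrate what "one allele per annotation" means, the sketch below splits a multi-allelic, sites-only VCF line into one line per ALT allele. It is not the code used to build these tracks: the function name split_multiallelic and the assumption that AC and AF are the only per-allele INFO fields are hypothetical, and a full normalization would also left-align and trim alleles, which this sketch does not attempt.

```python
def split_multiallelic(line, per_allele_keys=("AC", "AF")):
    """Split one sites-only VCF data line into one line per ALT allele.

    Assumes exactly the 8 fixed VCF columns and that the INFO keys in
    per_allele_keys hold one comma-separated value per ALT allele.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    alts = alt.split(",")

    # Parse INFO into ordered (key, value) pairs; flags have no '='.
    info_items = []
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_items.append((key, value if sep else None))

    out = []
    for i, allele in enumerate(alts):
        new_info = []
        for key, value in info_items:
            if value is None:
                new_info.append(key)
            elif key in per_allele_keys and value.count(",") == len(alts) - 1:
                # Keep only the value that belongs to this ALT allele.
                new_info.append(f"{key}={value.split(',')[i]}")
            else:
                new_info.append(f"{key}={value}")
        out.append("\t".join(
            [chrom, pos, vid, ref, allele, qual, flt, ";".join(new_info)]
        ))
    return out


if __name__ == "__main__":
    example = "chr1\t100\t.\tA\tC,T\t.\tPASS\tAC=3,1;AF=0.3,0.1;AN=10"
    for record in split_multiallelic(example):
        print(record)
```

Fields that describe the whole site, such as AN in the example, are copied unchanged to every output line.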

Data Access

Most of the data in these tracks are not available for download from UCSC and can only be browsed on our website. However, all variant data can be downloaded for free from the original project websites. Access usually requires accepting a click-through license or filling out an access request form on the respective website. Project-specific instructions follow:

MXB: Allele frequencies by geographical state and ancestry are available via the MexVar platform. Raw genotype data are available under controlled access at the EGA (Study: EGAS00001005797; Dataset: EGAD00010002361). For the VCFs, email andres.moreno@cinvestav.mx.

TOPMED: VCFs with summarized allele frequencies are available from the TOPMED BRAVO website. They require a login.

SFARI SPARK: Allele frequencies can be displayed on the SFARI Genome Browser. Full CRAMs and VCFs with genotypes are available from SFARI Base. They require a data access request, which is usually reviewed quickly. More information is available in the SPARK Welcome Packet.

Australia MGRB: VCF access can be requested via a form from Sydney Genomics.

GenomeAsia Pilot: VCFs are available from UCSC and also from the GenomeAsia 100K website. Neither a license nor a login is required.

KOVA: TSV data can be requested on the KOVA Downloads website. Our GitHub repo contains a script that converts this format to VCF.

Finngen: TSV data can be requested via the form at Finngen, which triggers an automated email containing the download link. A script in our GitHub repo converts this file to VCF (see methods below).

SweGen: VCF files can be requested at SweGen via a form. The request needs manual approval, which is usually quick. If there is no reply, email SweGen directly.

NPM: VCF download can be requested on the Chorus Browser website, which requires an account and a data access request.
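Several of the projects above (KOVA, Finngen) distribute frequencies as TSV rather than VCF. The following is a minimal sketch of such a conversion, not the script from our repository: the column names chrom, pos, ref, alt and af are hypothetical placeholders, and the real KOVA and Finngen files use their own column layouts.

```python
import csv
import io


def tsv_to_vcf(tsv_text):
    """Convert tab-separated frequency rows to a minimal sites-only VCF string.

    Assumes a header row with the hypothetical columns
    chrom, pos, ref, alt, af. Rows are written in input order.
    """
    lines = [
        "##fileformat=VCFv4.2",
        '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for rec in reader:
        pos = int(rec["pos"])  # fail early on non-numeric positions
        lines.append(
            f"{rec['chrom']}\t{pos}\t.\t{rec['ref']}\t{rec['alt']}"
            f"\t.\t.\tAF={rec['af']}"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = "chrom\tpos\tref\talt\taf\nchr1\t12345\tA\tG\t0.0123\n"
    print(tsv_to_vcf(demo), end="")
```

A VCF produced this way would typically still need to be sorted by chromosome and position before it can be compressed and indexed.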

Methods

The following are quotes from the respective papers and/or websites of the datasets:

MXB: Genotyping was performed with the Illumina Multi-Ethnic Global Array (MEGA, ~1.8M SNPs), optimized for admixed populations and enriched for ancestry-informative and medically relevant variants. Only autosomal, biallelic SNPs passing quality control are included. Samples were selected from 898 recruitment sites, with prioritization of indigenous language speakers. Data processing included GenomeStudio → PLINK conversion, strand alignment, removal of duplicates and low-quality variants/individuals, update of map positions using dbSNP Build 151, and relatedness filtering.

SGDP: The version used was https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/, merged with bcftools and lifted to hg38 with CrossMap.

KOVA: Raw reads were aligned to the GRCh38+decoy reference using BWA-MEM v0.7.17 with default parameters, followed by duplicate marking and coordinate sorting with MarkDuplicatesSpark, and base quality score recalibration using BQSRPipelineSpark in GATK v4.1.3.0; mapping quality control metrics were generated with Qualimap v2.2.1. Single-nucleotide variants and small insertions/deletions were called per sample using GATK HaplotypeCaller in GVCF mode (-ERC GVCF), and joint genotyping was performed by creating a GenomicsDB with GenomicsDBImport and following GATK Best Practices, including variant quality score recalibration (VQSR) retaining 99.7% of true SNVs and 99.0% of true indels based on training sets (workflow detailed in Supplementary Fig. 1). Downstream analyses followed a modified version of the gnomAD quality-control framework and were primarily conducted using Hail, an open-source Python library for large-scale genome analysis; after merging WES and WGS data in Hail, multiallelic variants and variants with genotype quality <20, read depth <10, allelic balance <0.2, or overlapping low-complexity regions were excluded (Supplementary Fig. 2).
At UCSC, V7 of the TSV.gz was obtained from the KOVA staff by email and converted to VCF. It is not available for download from our site but can be requested from the KOVA website.

ABraOM: For Academic use only. Licensing for commercial use might be available under request and agreement. By using this resource you agree to cite the flagship paper (Naslavsky et al. Nat Comm 2022). Whole-genome sequencing was performed at Human Longevity Inc. using TruSeq Nano DNA HT libraries sequenced on Illumina HiSeqX instruments with 150 bp paired-end reads targeting 30x coverage, and reads were mapped to GRCh38 using ISIS software. Sample sex was validated by comparing CPMs of X chromosome and male-specific Y (MSY) reads relative to autosomes, yielding the expected female (~55,000 X CPM, <200 MSY CPM) and male (~27,500 X CPM, >550 MSY CPM) patterns. Germline SNVs and indels were called following GATK Best Practices (GATK v3.7) via per-sample GVCFs (HaplotypeCaller), joint genotyping (CombineGVCFs, GenotypeGVCFs), and Variant Quality Score Recalibration (VQSR-AS); multiallelic variants were split with an in-house script, left-aligned with BCFtools, and annotated using Annovar and custom scripts against dbSNP, 1000 Genomes, and gnomAD, with putative loss-of-function variants identified using LOFTEE v0.3-beta irrespective of confidence labels. Variant and genotype quality was further assessed using the in-house CEGH-Filter two-step algorithm based on depth and allele balance, and analyses retained only GATK VQSR-AS PASS variants and higher-confidence CEGH-Filter calls. Relatedness was assessed using KING and PC-Relate (GENESIS), retaining a single proband per related pair and excluding one contaminated sample (>3% by verifyBAMID), resulting in a final dataset of 1,171 unrelated individuals. Final samples achieved mean coverages ranging from 31.3x to 64.8x, with an average of 38.65x and a median of 36.6x.

SFARI SPARK: The genome browser track project was approved by the Simons Foundation under request number 14584.1. WES and WGS data were downloaded from SFARI Base. pVCFs were downloaded, anonymized with a script that uses bcftools and its "fill-tags" plugin, and then normalized. There was no minimum allele frequency cutoff.
The methods are documented as follows by SFARI:

Finngen: R12 annotated variants were downloaded from the Google Cloud bucket link received through an email and converted to VCF with a custom Python script.

SweGen: Fragment size 350bp on a Covaris E220. Paired-end sequencing with 150bp read length was performed on Illumina HiSeq X (HiSeq Control Software 3.3.39/RTA 2.7.1) with v2.5 sequencing chemistry. Raw whole-genome reads were aligned to the GRCh37 reference using BWA-MEM v0.7.12, then sorted and indexed with samtools v0.1.19 and assessed with qualimap v2.2.20; per-sample alignments from multiple lanes and flow cells were merged using Picard MergeSamFiles v1.120. Processing followed GATK best practices with GATK v3.3, including indel realignment (RealignerTargetCreator, IndelRealigner), duplicate marking (Picard MarkDuplicates v1.120), and base quality score recalibration (BaseRecalibrator), producing one finalized BAM per sample. Per-sample gVCFs were generated with GATK HaplotypeCaller v3.3 using reference files from the GATK v2.8 resource bundle, with all steps coordinated via Piper v1.4.0. Joint genotyping of 1,000 samples was performed by merging gVCFs in five batches of 200 using GATK CombineGVCFs, followed by cohort genotyping with GATK GenotypeGVCFs and variant quality score recalibration for SNVs and indels using VariantRecalibrator and ApplyRecalibration.
At UCSC, the hg38 VCF was downloaded from SweFreq.

Australia MGRB: MGRB samples underwent whole-genome sequencing on Illumina HiSeq X instruments at KCCG under ISO 15189 accreditation, using paired-end TruSeq DNA Nano libraries sequenced one lane per sample. Reads were aligned to human reference genome Build 37 (GRCh37) and processed following GATK best practices, including indel realignment and base quality score recalibration, with variant calling performed using GATK HaplotypeCaller to generate g.vcf files. Data processing utilized the Genome One Discovery pipeline and analysis was conducted using the Hail framework.

NPM Singapore: Whole Genome Sequencing (WGS) data processing followed GATK4 best practices. The GATK4 germline variant analysis workflow written in WDL was adapted to use Nextflow and deployed at the National Supercomputing Centre, Singapore (NSCC). In short, WGS reads were aligned against GRCh38 using the BWA-MEM algorithm and used as input to GATK HaplotypeCaller to produce single-sample gVCFs. The gVCF files were joint-called and then loaded into Hail, an open-source Python-based data analysis library suited to work with population-scale genomic data collections. Low-quality WGS libraries and low-quality variants were removed. QC-ed variants were functionally annotated using Ensembl Variant Effect Predictor (VEP) (version 95). Functional annotations for variants impacting protein-coding regions were also complemented with information on the potential alteration to their cognate protein's 3D structure and drug binding ability.

Saudi Genome Program: Data were downloaded from Figshare, and converted to VCF.
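For projects that are distributed with individual genotypes, such as SFARI SPARK above, one plausible reading of "anonymized" is that per-sample genotype columns are replaced by summary counts (AC, AN, AF). The sketch below shows that idea in plain Python for biallelic sites. It is not the bcftools-based script used at UCSC, and the function names site_frequencies and strip_genotypes are hypothetical.

```python
import re


def site_frequencies(genotypes):
    """Return (AC, AN, AF) for the ALT allele from a list of GT strings.

    Accepts phased ("0|1") and unphased ("0/1") calls; missing alleles
    (".") are skipped. AF is None when no allele was called.
    """
    ac = an = 0
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing call, not counted in AN
            an += 1
            if allele == "1":
                ac += 1
    af = ac / an if an else None
    return ac, an, af


def strip_genotypes(line):
    """Drop FORMAT and sample columns, adding AC/AN/AF to INFO instead."""
    fields = line.rstrip("\n").split("\t")
    fixed, fmt, samples = fields[:8], fields[8], fields[9:]
    gt_index = fmt.split(":").index("GT")
    gts = [sample.split(":")[gt_index] for sample in samples]
    ac, an, af = site_frequencies(gts)
    tags = f"AC={ac};AN={an}"
    if af is not None:
        tags += f";AF={af:.6g}"
    fixed[7] = tags if fixed[7] in (".", "") else f"{fixed[7]};{tags}"
    return "\t".join(fixed)


if __name__ == "__main__":
    demo = "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT:DP\t0/1:20\t1|1:15\t./.:0"
    print(strip_genotypes(demo))
```

The output line keeps only the eight fixed VCF columns, so no individual genotype can be read back from it.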

Credits

MXB: We thank the Center for Research and Advanced Studies (Cinvestav) of Mexico for generating and providing the frequency data, the National Institute of Medical Sciences and Nutrition (INCMNSZ) for DNA extraction, and the Ministry of Health together with the National Institute of Public Health (INSP) for the design and implementation of the National Health Survey 2000 (ENSA 2000). We also thank the ENSA-Genomics Consortium for their contributions to sample collection and data processing that made possible the construction of the MXB genomic resource.

MCPS: Data produced by Regeneron RGC and collaborators, which are the University of Oxford, Universidad Nacional Autónoma de México (UNAM) and National Institute of Genomic Medicine in Mexico. The Regeneron Genetics Center, University of Oxford, Universidad Nacional Autónoma de México (UNAM), National Institute of Genomic Medicine in Mexico, Abbvie Inc. and AstraZeneca UK Limited (collectively, the "Collaborators") bear no responsibility for the analyses or interpretations of the data presented here. Any opinions, insights, or conclusions presented herein are those of the authors and not of the Collaborators.

Regeneron Million Exomes: The Regeneron Genetics Center, and its collaborators (collectively, the "Collaborators") bear no responsibility for the analyses or interpretations of the data presented here. Any opinions, insights, or conclusions presented herein are those of the authors and not of the Collaborators. This research has been conducted using the UK Biobank Resource under application number 26041.

SGDP: This project was funded by the Simons Foundation. Thanks to David Reich and Swapan Mallick for help with importing the data.

KOVA: Thanks to Insu Jang and the KOVA director for providing variant frequencies in TSV format.

Finngen: We want to acknowledge the participants and investigators of the FinnGen study.

SweGen: The SweGen allele frequency data was generated by Science for Life Laboratory. The data may be redistributed in original or modified form, but must always be distributed together with the file "terms_of_use.txt" that is stored together with the data on our download server, and any redistributed data derived from the SweGen data set must follow those terms and conditions. The data may not be used to attempt to identify any individual in this or other studies.

NPM Singapore: Thanks to the NPM Data Access Committee and Eleanor for granting our data request. By browsing the data, you agree to use the data only for academic, non-commercial research to improve human health (biology/disease). We request that all data users agree to protect the confidentiality of the data subjects in any research papers or publications that they may prepare, by taking all reasonable care to limit the possibility of identification. In particular, the data users shall not use, or attempt to use, the data to deliberately compromise or otherwise infringe the confidentiality of information on data subjects and their right to privacy. If you use any of the data obtained from the CHORUS variant browser, we request that you cite the NPM flagship paper (Wong et al, 2023). All users of the data must take note that the data provider and relevant SG10K_Health cohort owners bear no responsibility for the further analysis or interpretation of the data.

Thanks to Alex Ioannidis, UCSC, for the idea and motivation for this track. Thanks to Andreas Lahner, MGZ, for feedback and suggestions.

References

Barberena-Jonas, C. et al. (2025). MexVar database: Clinical genetic variation beyond the Hispanic label in the Mexican Biobank. Nature Medicine (in press).

Sohail M, Moreno-Estrada A. The Mexican Biobank Project promotes genetic discovery, inclusive science and local capacity building. Dis Model Mech. 2024 Jan 1;17(1). PMID: 38299665; PMC: PMC10855211

Sohail M, Palma-Martínez MJ, Chong AY, Quinto-Corés CD, Barberena-Jonas C, Medina-Muñoz SG, Ragsdale A, Delgado-Sánchez G, Cruz-Hervert LP, Ferreyra-Reyes L et al. Mexican Biobank advances population and medical genomics of diverse ancestries. Nature. 2023 Oct;622(7984):775-783. PMID: 37821706; PMC: PMC10600006

Ziyatdinov A, Torres J, Alegre-Díaz J, Backman J, Mbatchou J, Turner M, Gaynor SM, Joseph T, Zou Y, Liu D et al. Genotyping, sequencing and analysis of 140,000 adults from Mexico City. Nature. 2023 Oct;622(7984):784-793. PMID: 37821707; PMC: PMC10600010

GenomeAsia100K Consortium. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature. 2019 Dec;576(7785):106-111. PMID: 31802016; PMC: PMC7054211

Sun KY, Bai X, Chen S, Bao S, Zhang C, Kapoor M, Backman J, Joseph T, Maxwell E, Mitra G et al. A deep catalogue of protein-coding variation in 983,578 individuals. Nature. 2024 Jul;631(8021):583-592. PMID: 38768635; PMC: PMC11254753

Tadaka S, Kawashima J, Hishinuma E, Saito S, Okamura Y, Otsuki A, Kojima K, Komaki S, Aoki Y, Kanno T et al. jMorp: Japanese Multi-Omics Reference Panel update report 2023. Nucleic Acids Res. 2024 Jan 5;52(D1):D622-D632. PMID: 37930845; PMC: PMC10767895

Naslavsky MS, Scliar MO, Yamamoto GL, Wang JYT, Zverinova S, Karp T, Nunes K, Ceroni JRM, de Carvalho DL, da Silva Simões CE et al. Whole-genome sequencing of 1,171 elderly admixed individuals from São Paulo, Brazil. Nat Commun. 2022 Mar 4;13(1):1004. PMID: 35246524; PMC: PMC8897431

Jain A, Bhoyar RC, Pandhare K, Mishra A, Sharma D, Imran M, Senthivel V, Divakar MK, Rophina M, Jolly B et al. IndiGenomes: a comprehensive resource of genetic variants from over 1000 Indian genomes. Nucleic Acids Res. 2021 Jan 8;49(D1):D1225-D1232. PMID: 33095885; PMC: PMC7778947

Bergström A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, Chen Y, Felkel S, Hallast P, Kamm J et al. Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020 Mar 20;367(6484). PMID: 32193295; PMC: PMC7115999

Koenig Z, Yohannes MT, Nkambule LL, Zhao X, Goodrich JK, Kim HA, Wilson MW, Tiao G, Hao SP, Sahakian N et al. A harmonized public resource of deeply sequenced diverse human genomes. Genome Res. 2024 Jun 25;34(5):796-809. PMID: 38749656; PMC: PMC11216312

Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016 Oct 13;538(7624):201-206. PMID: 27654912; PMC: PMC5161557

Lee J, Lee J, Jeon S, Lee J, Jang I, Yang JO, Park S, Lee B, Choi J, Choi BO et al. A database of 5305 healthy Korean individuals reveals genetic and clinical implications for an East Asian population. Exp Mol Med. 2022 Nov;54(11):1862-1871. PMID: 36323850; PMC: PMC9628380

Kurki MI, Karjalainen J, Palta P, Sipilä TP, Kristiansson K, Donner KM, Reeve MP, Laivuori H, Aavikko M, Kaunisto MA et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature. 2023 Jan;613(7944):508-518. PMID: 36653562; PMC: PMC9849126

Wong E, Bertin N, Hebrard M, Tirado-Magallanes R, Bellis C, Lim WK, Chua CY, Tong PML, Chua R, Mak K et al. The Singapore National Precision Medicine Strategy. Nat Genet. 2023 Feb;55(2):178-186. PMID: 36658435

Malomane DK, Williams MP, Huber CD, Mangul S, Abedalthagafi M, Chiang CWK. Patterns of population structure and genetic variation within the Saudi Arabian population. bioRxiv. 2025 Jan 13. PMID: 39868174; PMC: PMC11761371

Ameur A, Dahlberg J, Olason P, Vezzi F, Karlsson R, Martin M, Viklund J, Kähäri AK, Lundin P, Che H et al. SweGen: a whole-genome data resource of genetic variability in a cross-section of the Swedish population. Eur J Hum Genet. 2017 Nov;25(11):1253-1260. PMID: 28832569; PMC: PMC5765326

SPARK Consortium. SPARK: A US Cohort of 50,000 Families to Accelerate Autism Research. Neuron. 2018 Feb 7;97(3):488-493. PMID: 29420931; PMC: PMC7444276

Lacaze P, Pinese M, Kaplan W, Stone A, Brion MJ, Woods RL, McNamara M, McNeil JJ, Dinger ME, Thomas DM. The Medical Genome Reference Bank: a whole-genome data resource of 4000 healthy elderly individuals. Rationale and cohort design. Eur J Hum Genet. 2019 Feb;27(2):308-316. PMID: 30353151; PMC: PMC6336775