This track shows allele frequencies from high-coverage whole-genome sequencing of 180 individuals (15 per population) from 12 indigenous African populations representing all four major African language phyla (Khoesan, Niger-Congo, Nilo-Saharan, Afroasiatic). The cohort, generated by the Tishkoff lab and collaborators (Fan et al., Cell 2023), spans the Amhara, Dizi, Chabu and Mursi from Ethiopia; the Hadza and Sandawe from Tanzania; the Central African rainforest hunter-gatherers (Baka and Bagyeli, merged), Fulani and Tikari from Cameroon; and the Herero, Ju|'hoansi and !Xoo (the latter two collectively the "San") from Botswana. The dataset was generated to capture demographic history and signatures of local adaptation in African populations that are poorly represented in other reference panels.
Only aggregate values are shown for each variant: the allele count (AC), total allele number (AN) and allele frequency (AF), computed across all 180 individuals. Per-population frequencies are not provided in the released sites VCF. The original variant calls were made against the GRCh37/hs37d5 reference and were lifted to hg38 at UCSC.
Variants display as standard VCF allele frequency tracks. On mouseover and click, the allele count (AC), total allele number (AN) and allele frequency (AF) are shown. When zoomed in, alleles are colored by base. Multi-allelic records were split into biallelic rows during normalization upstream.
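As a rough illustration of how the three displayed values relate, the sketch below parses a VCF INFO string and recomputes AF as AC divided by AN. The record values are made up for illustration and are not taken from the dataset; for 180 diploid individuals genotyped at an autosomal site, AN can be at most 360.

```python
def parse_info(info):
    """Parse a VCF INFO string such as 'AC=18;AF=0.05;AN=360' into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flag-style entries carry no '=value' part; store them as True.
        fields[key] = value if sep else True
    return fields


def allele_frequency(ac, an):
    """Return AC / AN, the fraction of called alleles that are the ALT allele."""
    if an <= 0:
        raise ValueError("AN must be positive")
    return ac / an


# Hypothetical INFO field, not a real record from this track.
info = parse_info("AC=18;AF=0.05;AN=360")
recomputed_af = allele_frequency(int(info["AC"]), int(info["AN"]))
print(recomputed_af, float(info["AF"]))
```

An AN below 360 at a given site would indicate that some genotypes were missing for that variant.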
Whole-genome sequencing of 180 individuals (15 unrelated samples per population) was performed at >30× average coverage on the Illumina HiSeq X Ten platform using PCR-free library preparation with paired-end 150 bp reads and a 350 bp insert size. Adapters were trimmed with trimadap, optical duplicates were marked with SAMBLASTER (v0.1.22), and reads were aligned to the hs37d5 decoy version of GRCh37 with BWA-MEM (v0.7.10). Reads with mapping quality < 20 were filtered out. Per-sample short variants were called with GATK HaplotypeCaller (nightly-2016-09-26-gfade77f) in gVCF mode using a custom genotype prior (0.4995, 0.001, 0.4995) to reduce reference bias, following the SGDP recommendation. Joint genotyping was performed with GATK GenotypeGVCFs. Variants were filtered with GATK VQSR using 1000 Genomes Phase 3, Illumina Omni 5M and HapMap as SNP truth sets and Mills indels as the indel truth set. Variants overlapping potential duplications detected by Delly (v0.7.6) and variants in low-complexity regions were excluded. After QC the cohort yielded 32.4 M SNPs and 2.8 M small indels. The publicly released SNP-only sites VCF used here contains 33.6 M biallelic SNPs with aggregate AC/AF/AN summaries. See Fan et al. (2023) for full methods.
The hg19 SNP sites VCF (180wgs.SNPs.sites.AF.vcf.gz) was provided directly by Matthew Hansen at the Tishkoff lab (University of Pennsylvania) via a Box link. Bare chromosome names (1-22) were converted to UCSC-style names (chr1-chr22) with bcftools annotate --rename-chrs. The VCF was then lifted from hg19 to hg38 with CrossMap.py vcf using the UCSC hg19ToHg38.over.chain.gz chain, sorted with bcftools sort, bgzip-compressed and tabix-indexed. Step-by-step processing instructions are in the makeDoc file; the supporting scripts live under kent/src/hg/makeDb/scripts/varFreqs.
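The processing steps described above could be sketched roughly as follows. This is not the actual makeDoc procedure: the intermediate file names, the output prefix and the hg38 FASTA path are hypothetical, and the exact options used at UCSC may differ. The code only builds and prints the command lines; it does not run them.

```python
import shlex


def rename_chrs_lines(chroms=range(1, 23)):
    """Lines for a bcftools --rename-chrs map: 'old_name new_name' per line."""
    return [f"{c} chr{c}" for c in chroms]


def build_commands(src_vcf, rename_map, chain, hg38_fasta, out_prefix):
    """Return the rename, liftover, sort and index commands as argument lists."""
    renamed = f"{out_prefix}.hg19.chr.vcf.gz"    # hypothetical intermediate
    lifted = f"{out_prefix}.hg38.unsorted.vcf"   # hypothetical intermediate
    final = f"{out_prefix}.hg38.vcf.gz"          # hypothetical output name
    return [
        ["bcftools", "annotate", "--rename-chrs", rename_map,
         "-O", "z", "-o", renamed, src_vcf],
        # CrossMap needs the target-genome FASTA to fill in reference alleles.
        ["CrossMap.py", "vcf", chain, renamed, hg38_fasta, lifted],
        ["bcftools", "sort", "-O", "z", "-o", final, lifted],
        ["tabix", "-p", "vcf", final],
    ]


map_lines = rename_chrs_lines()
commands = build_commands(
    src_vcf="180wgs.SNPs.sites.AF.vcf.gz",
    rename_map="rename_chrs.txt",        # hypothetical file holding map_lines
    chain="hg19ToHg38.over.chain.gz",
    hg38_fasta="hg38.fa",                # hypothetical path
    out_prefix="tishkoff180",            # hypothetical prefix
)
print("\n".join(map_lines[:3]))
for cmd in commands:
    print(shlex.join(cmd))
```

Writing the commands as argument lists keeps them easy to inspect and lets them be passed to subprocess.run unchanged once the paths are confirmed.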
The original (hg19) variant calls and supplementary data accompany the publication; see the "Data and code availability" section of Fan et al. (2023). The dataset is not available for redistribution from our website, so the Table Browser, Data Integrator and download server are disabled for this track. The hg19 sites VCF can be requested from the Tishkoff lab at the University of Pennsylvania.
Thanks to Matthew Hansen and Sarah Tishkoff (University of Pennsylvania) for sharing the sites-only allele-frequency VCF, and to all participating individuals and field collaborators in Ethiopia, Tanzania, Cameroon and Botswana whose contributions made this dataset possible.
Fan S, Spence JP, Feng Y, Hansen MEB, Terhorst J, Beltrame MH, Ranciaro A, Hirbo J, Beggs W, Thomas N et al. Whole-genome sequencing reveals a complex African population demographic history and signatures of local adaptation. Cell. 2023 Mar 2;186(5):923-939.e14. PMID: 36868214; PMC: PMC10568978