Description

This track is the only short-read dataset in the Long-Read Variants track collection; it is included for comparison with the long-read callsets.

This track shows structural variants (SVs) from the expanded 1000 Genomes Project cohort of 3,202 samples (including 602 trios), sequenced as high-coverage (~30x) Illumina short-read whole genomes on NovaSeq 6000 and described in Byrska-Bishop et al. 2022. SVs were discovered with GATK-SV, svtools and Absinthe and integrated into a single callset (see Methods); this release adds re-genotyped novel insertions and allele frequencies recomputed per continental group.

The track contains 173,366 SVs across seven classes: 90,259 deletions (DEL), 49,693 insertions (INS), 28,242 duplications (DUP), 3,568 complex events (CPX), 920 inversions (INV), 673 multi-allelic copy-number variants (CNV) and 11 reciprocal translocations (CTX). Allele counts, allele frequencies and per-superpopulation frequencies (AFR, AMR, EAS/ASN, EUR, SAS/SAN) are provided for each site.
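As a quick consistency check, the per-class counts quoted above can be summed and compared with the stated total. The sketch below uses only the numbers given in this description; the variable names are illustrative choices, not part of the track.

```python
# Per-class SV counts exactly as quoted in the track description.
sv_counts = {
    "DEL": 90_259,
    "INS": 49_693,
    "DUP": 28_242,
    "CPX": 3_568,
    "INV": 920,
    "CNV": 673,
    "CTX": 11,
}

# Total number of SVs across all seven classes.
total = sum(sv_counts.values())

# Share of the callset contributed by each class, largest class first.
fractions = {
    sv_type: count / total
    for sv_type, count in sorted(sv_counts.items(), key=lambda kv: -kv[1])
}

if __name__ == "__main__":
    print(f"total\t{total}")
    for sv_type, frac in fractions.items():
        print(f"{sv_type}\t{frac:.2%}")
```

Deletions and insertions together account for the large majority of sites, which is worth keeping in mind when comparing class proportions against the long-read tracks.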

Display Conventions and Configuration

Items are colored by SV type:

Insertions are placed at the insertion site; deletions, duplications, inversions, complex and copy-number variants span the affected reference interval. Translocations show only the chr1-side breakpoint; the partner chromosome is reported on the detail page.

Filters are available for SV type, SV length, overall allele frequency, population-max allele frequency and per-population AFs (African and European). The detail page also shows heterozygous / homozygous-alternate carrier counts, the set of upstream SV callers, the upstream pipeline source and the VCF FILTER status.
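The same kind of filtering can be reproduced offline on exported records. The following is a minimal sketch only: the SvRecord layout, its field names and the example values are hypothetical and do not reflect the track's actual autoSql schema, which should be checked before adapting this.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class SvRecord:
    """Hypothetical record layout; NOT the track's real schema."""
    sv_type: str   # e.g. "DEL", "INS", "DUP"
    sv_len: int    # SV length in bp
    af: float      # overall allele frequency


def filter_svs(
    records: Iterable[SvRecord],
    types: Optional[Set[str]] = None,
    min_len: int = 0,
    max_af: float = 1.0,
) -> List[SvRecord]:
    """Keep records matching every filter (type AND length AND frequency)."""
    kept = []
    for rec in records:
        if types is not None and rec.sv_type not in types:
            continue
        if rec.sv_len < min_len or rec.af > max_af:
            continue
        kept.append(rec)
    return kept


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    demo = [
        SvRecord("DEL", 1200, 0.002),
        SvRecord("DEL", 80, 0.300),
        SvRecord("INS", 5000, 0.001),
    ]
    for rec in filter_svs(demo, types={"DEL"}, min_len=100, max_af=0.01):
        print(rec)
```

The filters are combined with logical AND here, so a record must satisfy all of them to be kept.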

Methods

Byrska-Bishop et al. 2022 sequenced the 3,202-sample expanded 1000 Genomes Project cohort (2,504 original unrelated samples plus 698 samples that complete 602 parent-child trios) on Illumina NovaSeq 6000 at ~30x coverage with 2x150 bp reads. SNVs and indels were called with GATK HaplotypeCaller. SVs were discovered with three analytic pipelines (GATK-SV, svtools and Absinthe) and integrated through a machine-learning model; novel insertions were then re-genotyped to produce the freeze V3 callset, to which allele frequencies were added (*.wAF.vcf.gz). The final ensemble callset contains 173,366 SVs across seven classes: 90,259 DELs, 49,693 INSs, 28,242 DUPs, 920 INVs, 3,568 complex SVs (CPX), 673 multi-allelic CNVs and 11 inter-chromosomal translocations (CTX), with AC, AN, AF and per-superpopulation AFs (AFR, AMR, EAS/ASN, EUR, SAS/SAN).

Why a short-read track in a long-read collection? Short-read SV callsets such as this one generally have high precision for deletions and duplications but miss many insertions, repeat expansions and variants in complex/low-mappability regions that long-read technologies can resolve. Displaying this callset alongside the long-read tracks in this collection makes it easier to spot variants that are unique to long-read data or that have substantially different breakpoints when called from short reads.

The freeze V3 VCF 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.vcf.gz was downloaded from the IGSR 1000 Genomes Illumina SV integration folder.

The step-by-step build commands (download, format conversion, bigBed build) are recorded in the UCSC makeDoc for this track container: doc/hg38/lrSv.txt. The conversion scripts and autoSql schemas live in makeDb/scripts/lrSv.

Data Access

The data can be explored interactively in table format with the Table Browser or the Data Integrator, and accessed programmatically through our API using the parameter track=onekg3202Sr.
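For programmatic access, a request URL for the UCSC REST API's getData/track endpoint can be assembled as sketched below. The genome hg38 is taken from the download path of this track, and the chr21 window is an arbitrary example; the helper name build_track_url is an illustrative choice. The sketch only builds the URL and does not assume anything about the layout of the JSON response, which should be inspected before parsing.

```python
from urllib.parse import quote

# Public UCSC REST API endpoint for retrieving track data.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(genome, track, chrom=None, start=None, end=None):
    """Return a getData/track URL.

    start and end are only added when a chromosome is given and both
    coordinates are provided.
    """
    params = [("genome", genome), ("track", track)]
    if chrom is not None:
        params.append(("chrom", chrom))
        if start is not None and end is not None:
            params.append(("start", str(start)))
            params.append(("end", str(end)))
    # The UCSC API documentation separates arguments with semicolons.
    query = ";".join(f"{key}={quote(value, safe='')}" for key, value in params)
    return f"{API_BASE}?{query}"


if __name__ == "__main__":
    # Example window on chr21; the coordinates are arbitrary.
    print(build_track_url("hg38", "onekg3202Sr", "chr21", 20_000_000, 21_000_000))
```

The resulting URL can be fetched with any HTTP client, for example urllib.request.urlopen, and the response decoded with the json module.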

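The bigBed is available from our download server as onekg3202sr.bb. Regions can be extracted with the bigBedToBed utility; in the following example the final argument, stdout, writes the BED output to standard output: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/onekg3202sr.bb -chrom=chr21 -start=0 -end=100000000 stdout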
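The bigBedToBed example above can also be driven from a script. The sketch below assumes that bigBedToBed is installed and on the PATH; the helper names are illustrative. Only the three standard BED columns (chrom, start, end) are interpreted, and all further columns are passed through untouched because their meaning depends on the track's autoSql schema.

```python
import shutil
import subprocess

BB_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/onekg3202sr.bb"


def bigbed_cmd(url, chrom, start, end, out="stdout"):
    """Build the bigBedToBed argument list, mirroring the example command."""
    return ["bigBedToBed", url, f"-chrom={chrom}", f"-start={start}",
            f"-end={end}", out]


def parse_bed_core(line):
    """Split one tab-separated BED line into (chrom, start, end, other fields)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2]), fields[3:]


def fetch_region(url, chrom, start, end):
    """Run bigBedToBed and return the parsed records for one region."""
    if shutil.which("bigBedToBed") is None:
        raise RuntimeError("bigBedToBed not found on PATH")
    result = subprocess.run(
        bigbed_cmd(url, chrom, start, end),
        capture_output=True, text=True, check=True,
    )
    return [parse_bed_core(line) for line in result.stdout.splitlines() if line]


if __name__ == "__main__":
    records = fetch_region(BB_URL, "chr21", 0, 100000000)
    print(f"{len(records)} records on chr21")
```

Building the command as a list rather than a shell string avoids quoting problems with the URL.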

The original joint-genotyped VCF is available from the IGSR 1000 Genomes Illumina SV integration folder.

Credits

Thanks to Byrska-Bishop, Marth and the 1000 Genomes / NYGC team for releasing this dataset, and to the GATK-SV developers for the cohort calling pipeline.

References

Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, Corvelo A, Clarke WE, Musunuri R, Nagulapalli K et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell. 2022 Sep 1;185(18):3426-3440.e19. PMID: 36055201; PMC: PMC9439720