Description

This track shows structural variants (SVs) identified by PacBio HiFi long-read sequencing of probands and their families enrolled in the Genomic Answers for Kids (GA4K) program at Children's Mercy Research Institute. GA4K is a longitudinal pediatric genomics initiative that aims to enroll 30,000 children with suspected rare genetic disorders, together with their parents, to build a large-scale resource of clinical and genomic data.

The callset contains 115,554 SVs (52,564 deletions, 58,219 insertions, 4,408 duplications, 363 inversions) from 502 sequenced samples. Variants are site-level (no per-sample genotypes), and every SV has been replicated: it was either observed in two or more unrelated GA4K individuals, or it matched an SV in an external long-read reference set (deCODE or the Human Pangenome Reference Consortium, HPRC).

Display Conventions and Configuration

Items are colored by SV type:

Insertions are placed at the insertion site with a width of 1 bp; deletions, duplications and inversions span the affected interval. Filters are available for SV type, SV length, carrier-sample count and allele frequency. The detail page also shows the total number of samples genotyped at each site.
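The placement rule above can be sketched in Python as follows. This is only an illustration: the function name sv_to_interval, the type labels (INS, DEL, DUP, INV, as in the VCF SVTYPE field) and the use of 0-based half-open BED coordinates are assumptions, not taken from the track's conversion scripts.

```python
def sv_to_interval(sv_type, start, sv_len):
    """Return (chromStart, chromEnd) in 0-based half-open coordinates.

    Hypothetical helper: insertions are drawn as a 1 bp item at the
    insertion site; deletions, duplications and inversions span the
    affected interval.
    """
    if sv_len < 0:
        raise ValueError("sv_len must be non-negative")
    if sv_type == "INS":
        # The inserted sequence has no extent on the reference.
        return (start, start + 1)
    if sv_type in ("DEL", "DUP", "INV"):
        return (start, start + sv_len)
    raise ValueError(f"unexpected SV type: {sv_type}")


if __name__ == "__main__":
    print(sv_to_interval("INS", 1000, 350))
    print(sv_to_interval("DEL", 1000, 350))
```

Under these assumptions, a 350 bp insertion and a 350 bp deletion at the same position produce items of very different width, which is why filtering on SV length is more informative than item width for insertions.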

Methods

The GA4K program is described in Cohen et al. 2022. GA4K probands and their families are sequenced with PacBio HiFi long reads on Revio and Sequel II instruments. The 502-sample GA4K PacBio SV release (pb_joint_merged.sv.vcf.gz) was produced by running pbsv on each sample and merging the per-sample calls with Jasmine v1.1.4 (--output-genotypes). The merged site-level VCF was then filtered to SVs supported by at least two independent observations: either a second, unrelated GA4K individual in the same Jasmine cluster, or a matching SV in the deCODE Icelandic or HPRC callsets, as determined by svpack match. The released catalog contains 115,554 replicated SVs (52,564 deletions, 58,219 insertions, 4,408 duplications and 363 inversions) with recomputed carrier counts (SVC), total sample counts (SVN) and allele frequencies (SVF = SVC/SVN).
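As a rough illustration of the replication rule and the frequency definition above, the following Python sketch may help. The function names and arguments are hypothetical and do not come from the GA4K pipeline; only the formula SVF = SVC/SVN, the "two independent observations" rule and the per-type counts are taken from the text.

```python
# Per-type counts as stated in the release description.
SV_COUNTS = {"DEL": 52564, "INS": 58219, "DUP": 4408, "INV": 363}


def allele_frequency(svc, svn):
    """SVF = SVC / SVN (carrier count over total sample count)."""
    if svn <= 0:
        raise ValueError("SVN must be positive")
    if not 0 <= svc <= svn:
        raise ValueError("SVC must be between 0 and SVN")
    return svc / svn


def is_replicated(unrelated_carriers, matches_external):
    """Hypothetical form of the filter: keep an SV seen in two or more
    unrelated individuals, or matched in an external callset."""
    return unrelated_carriers >= 2 or matches_external


if __name__ == "__main__":
    # The per-type counts add up to the stated catalog size.
    print(sum(SV_COUNTS.values()))
    print(allele_frequency(5, 500))
    print(is_replicated(1, True), is_replicated(1, False))
```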

The source VCF was cloned from the Children's Mercy Research Institute GA4K GitHub repository, github.com/ChildrensMercyResearchInstitute/GA4K (pacbio_sv_vcf/pb_joint_merged.sv.vcf.gz).

The step-by-step build commands (download, format conversion, bigBed build) are recorded in the UCSC makeDoc for this track container: doc/hg38/lrSv.txt. The conversion scripts and autoSql schemas live in makeDb/scripts/lrSv.

Data Access

The data can be explored interactively in table format with the Table Browser or the Data Integrator and exported from there to spreadsheets or tab-separated tables. From scripts, the data can be accessed through our API using the track name ga4kSv (track=ga4kSv).
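A minimal Python sketch of a scripted API request is shown below. The endpoint (api.genome.ucsc.edu/getData/track) and the semicolon-separated genome, track, chrom, start and end parameters reflect our understanding of the UCSC REST API and should be checked against the API help page; the helper name track_url is hypothetical.

```python
from urllib.parse import quote

# Assumed endpoint of the UCSC REST API; verify against the API documentation.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def track_url(genome, track, chrom=None, start=None, end=None):
    """Build a getData/track URL, optionally restricted to a region."""
    params = [("genome", genome), ("track", track)]
    if chrom is not None:
        params.append(("chrom", chrom))
        # start/end are only meaningful together with a chromosome.
        if start is not None and end is not None:
            params.append(("start", start))
            params.append(("end", end))
    query = ";".join(f"{key}={quote(str(value))}" for key, value in params)
    return f"{API_BASE}?{query}"


if __name__ == "__main__":
    print(track_url("hg38", "ga4kSv", "chr21", 0, 100000))
```

The resulting URL can be fetched with urllib.request.urlopen and decoded with json.load; the sketch only builds the URL so that it runs without network access.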

For automated download and analysis, the annotation is stored in a bigBed file that can be downloaded from our download server. The file for this track is called ga4kSv.bb. Individual regions or the whole annotation can be obtained using the bigBedToBed utility, available as a precompiled binary or from source as described on our utilities page. Example:

bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/ga4kSv.bb -chrom=chr21 -start=0 -end=100000000 stdout
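For use from a script, the same bigBedToBed call can be wrapped in Python, for example as sketched here. The helper names are hypothetical, and the sketch assumes that bigBedToBed is installed and on the PATH; the URL and options are those of the example command.

```python
import subprocess

BB_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/ga4kSv.bb"


def bigbed_command(chrom, start, end, out="stdout", url=BB_URL):
    """Return the bigBedToBed argument list for one region."""
    return [
        "bigBedToBed",
        url,
        f"-chrom={chrom}",
        f"-start={start}",
        f"-end={end}",
        out,
    ]


def fetch_region(chrom, start, end):
    """Run bigBedToBed and return each BED line as a list of fields.

    Requires the bigBedToBed binary and network access.
    """
    result = subprocess.run(
        bigbed_command(chrom, start, end),
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.split("\t") for line in result.stdout.splitlines()]


if __name__ == "__main__":
    # Print the command without running it.
    print(" ".join(bigbed_command("chr21", 0, 100000000)))
```

Passing the arguments as a list avoids shell quoting issues, and check=True raises an exception if the utility exits with an error.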

The original VCF is available from the Children's Mercy Research Institute GA4K data release at github.com/ChildrensMercyResearchInstitute/GA4K.

Credits

Thanks to the Children's Mercy Research Institute and the Genomic Answers for Kids participants and their families for making this dataset publicly available.

References

Cohen ASA, Farrow EG, Abdelmoity AT, Alaimo JT, Amudhavalli SM, Anderson JT, Bansal L, Bartik L, Baybayan P, Belden B et al. Genomic answers for children: Dynamic analyses of >1000 pediatric rare disease genomes. Genet Med. 2022 Jun;24(6):1336-1348. PMID: 35305867