This track displays structural variants (SVs) — deletions, insertions, and complex substitutions of at least 50 bp — identified by the Chinese Pangenome Consortium (CPC) in 58 samples representing 36 Chinese minority ethnic groups.
The upstream release combined the 58 CPC samples with 47 samples from Phase 1 of the Human Pangenome Reference Consortium (HPRC) into a single pangenome graph built on the T2T-CHM13v2 assembly with Minigraph-Cactus. For this track we recomputed allele counts (AC), allele numbers (AN) and sample counts (NS) using only the 58 CPC sample columns (those with HIFI032* or RY* prefixes in the source VCF) and dropped all snarls that no CPC sample carries (HPRC-specific SVs). To see the HPRC data on its own, use the HPRC SV tracks elsewhere in this collection.
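As a rough illustration of the CPC-only recount described above, the Python sketch below tallies AC, AN and NS from the genotype strings of the CPC columns only. The function name, the example sample names, and the reading of NS as "samples with at least one called allele" are assumptions made for illustration; the actual build commands are the ones recorded in the makeDoc.

```python
import re

# Column prefixes named in the track description for the 58 CPC samples.
CPC_PREFIXES = ("HIFI032", "RY")


def recount_cpc(sample_names, genotypes, alt_index=1):
    """Return (AC, AN, NS) computed from CPC sample columns only.

    sample_names: VCF sample column names, in header order.
    genotypes:    per-sample FORMAT strings for one VCF row (GT first).
    alt_index:    allele index to count; 1 for a row split to one alt.
    NS is taken here as samples with at least one called allele (assumption).
    """
    ac = an = ns = 0
    for name, fmt in zip(sample_names, genotypes):
        if not name.startswith(CPC_PREFIXES):
            continue  # skip HPRC columns
        gt = fmt.split(":")[0]
        called = [a for a in re.split(r"[|/]", gt) if a != "."]
        if called:
            ns += 1
        an += len(called)
        ac += sum(1 for a in called if a == str(alt_index))
    return ac, an, ns


# Hypothetical sample names and genotypes, for illustration only.
names = ["HIFI032_A", "RY01", "HG00438", "NA19240"]
gts = ["0|1", "1|.", "1|1", "0|0"]
print(recount_cpc(names, gts))
```

A row whose recomputed AC is 0 corresponds to an allele that no CPC sample carries, which is the case the track drops.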
A pangenome is a graph that represents many genomes simultaneously, so variants that are absent from a single linear reference can be captured and genotyped directly. Variants are shown natively on the hs1 browser and lifted to hg38 using the UCSC hs1ToHg38.over.chain.gz chain. The track contains 46,092 snarl sites on hs1, of which 36,030 lifted to hg38 (10,062 did not lift, typically in T2T-added repetitive regions).
Items are colored by SV type:
Each BED item spans from the start of the REF allele to its end on the reference. Pure insertions (where REF is a single base) therefore appear as narrow single-base marks; DEL and CPX items span the affected reference interval.
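The coordinate rule above can be sketched in Python as follows. This assumes the standard conversion from a 1-based VCF POS to a 0-based, half-open BED interval; the function name is hypothetical and the track's own conversion script may differ in detail.

```python
def bed_interval(pos, ref):
    """Map a VCF POS (1-based) and REF allele to a BED (start, end) pair.

    BED intervals are 0-based and half-open, so the item covers exactly
    len(ref) reference bases starting at the first base of REF.
    """
    start = pos - 1
    return start, start + len(ref)


# A pure insertion (single-base REF) gives a 1 bp item.
print(bed_interval(1000, "A"))
# A REF allele of 5 bases gives a 5 bp item.
print(bed_interval(1000, "ACGTA"))
```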
The name field is the graph snarl ID (two node identifiers separated by strand arrows, e.g. >2541>2547). It is stable across the graph but has no meaning outside the CPC pangenome graph file.
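For scripts that need the two node identifiers, a snarl ID of the form shown above can be split with a small regular expression. This is a minimal sketch: the function name is hypothetical, and it assumes every ID consists of exactly two arrow-prefixed integer node IDs, with ">" and "<" as the only strand arrows.

```python
import re

# Two (arrow, node id) pairs, e.g. ">2541>2547".
SNARL_RE = re.compile(r"^([<>])(\d+)([<>])(\d+)$")


def parse_snarl_id(name):
    """Split a snarl ID into ((arrow, node), (arrow, node)).

    Raises ValueError if the name does not match the assumed pattern.
    """
    m = SNARL_RE.match(name)
    if m is None:
        raise ValueError(f"not a snarl ID: {name!r}")
    return (m.group(1), int(m.group(2))), (m.group(3), int(m.group(4)))


print(parse_snarl_id(">2541>2547"))
```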
The source VCF was decomposed with bcftools norm -m -any, so each graph snarl appears as one VCF row per alternative allele (a single bubble in the graph may have from 2 to more than 20 alt paths). For this track we first compute the CPC-only allele count per alt, drop any alt that no CPC sample carries, then collapse all remaining alts sharing the same snarl ID into one track item:
Available filters:
Gao et al. 2023 generated PacBio HiFi long reads (mean coverage ~30.65x, Sequel II/IIe platforms) for 58 QC-passed samples representing 36 minority Chinese ethnic groups, complemented with Illumina short reads and Oxford Nanopore ultralong reads. Haplotype-phased de novo assemblies were produced with hifiasm v0.16.1 (116 high-quality haplotype assemblies retained after QC) and combined with the assemblies of the 47 HPRC Phase 1 samples into a single variation graph built on T2T-CHM13v2 with the Minigraph-Cactus pipeline (Minigraph v0.19 for the SV skeleton, Cactus v2.1.1 for base alignment, hal2vg).
Graph bubbles were decomposed into variant records with vcfwave and normalized with bcftools norm -m -any, yielding the source VCF (CPC.HPRC.Phase1.processed.SVs.normed.vcf.gz). The upstream Gao et al. release identified 78,072 SVs across the combined 105-sample graph.
For this track we restrict to the 58 CPC samples (columns matching HIFI032* or RY*), recompute AC/AN/NS from those columns only, drop snarls with no CPC carrier (HPRC-specific sites), keep only alts whose REF and ALT lengths differ by at least 50 bp, and collapse the remaining alts by graph snarl ID. The final track contains 46,092 snarl sites on hs1; the hg38 version is lifted with the UCSC hs1ToHg38.over.chain.gz chain (36,030 sites lifted, 10,062 did not).
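The filtering and collapse steps can be pictured with the short Python sketch below. It is an illustration under stated assumptions, not the track's build script: the row dictionaries, their keys (snarl, ref, alt, cpc_ac) and the function name are hypothetical, and how the collapsed alts are summarized into a single item is left out.

```python
from collections import defaultdict

MIN_SV_LEN = 50  # minimum REF/ALT length difference, from the text above


def collapse_by_snarl(rows):
    """Group per-alt rows by snarl ID after the carrier and size filters.

    rows: iterable of dicts with keys 'snarl', 'ref', 'alt', 'cpc_ac',
          where 'cpc_ac' is the allele count recomputed from CPC columns.
    Returns {snarl_id: [kept rows]}; snarls with no kept alt are absent.
    """
    grouped = defaultdict(list)
    for row in rows:
        if row["cpc_ac"] == 0:
            continue  # no CPC sample carries this alt
        if abs(len(row["ref"]) - len(row["alt"])) < MIN_SV_LEN:
            continue  # below the SV size threshold
        grouped[row["snarl"]].append(row)
    return dict(grouped)


# Hypothetical rows: three alts on one snarl, one alt on another.
rows = [
    {"snarl": ">1>4", "ref": "A", "alt": "A" + "C" * 60, "cpc_ac": 3},
    {"snarl": ">1>4", "ref": "A", "alt": "A" + "G" * 70, "cpc_ac": 0},
    {"snarl": ">1>4", "ref": "A", "alt": "AT", "cpc_ac": 2},
    {"snarl": ">5>8", "ref": "A" * 80, "alt": "A", "cpc_ac": 0},
]
items = collapse_by_snarl(rows)
print({snarl: len(alts) for snarl, alts in items.items()})
```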
The source VCF is distributed by the Chinese-Pangenome-Consortium-Phase-I GitHub repository.
The step-by-step build commands (CPC-only recount, liftOver, snarl collapse, bigBed build) are recorded in the UCSC makeDoc for this track container: doc/hg38/lrSv.txt. The conversion scripts and autoSql schemas live in makeDb/scripts/lrSv.
The data can be explored interactively with the Table Browser or Data Integrator, and accessed from scripts via our API (track=cpc1Sv).
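For scripted access, a request to the API can be assembled as in the Python sketch below. It assumes the public UCSC REST endpoint api.genome.ucsc.edu/getData/track with its genome, track, chrom, start and end parameters; the helper names and the example region are arbitrary choices for illustration, and the shape of the returned JSON is not assumed here.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def track_url(genome, chrom, start, end, track="cpc1Sv"):
    """Build a getData/track URL for a region of the given track."""
    query = urllib.parse.urlencode(
        {"genome": genome, "track": track, "chrom": chrom,
         "start": start, "end": end}
    )
    return f"{API_BASE}?{query}"


def fetch_track(genome, chrom, start, end):
    """Fetch the region and return the decoded JSON response (network call)."""
    with urllib.request.urlopen(track_url(genome, chrom, start, end)) as resp:
        return json.load(resp)


# Building the URL needs no network access.
print(track_url("hs1", "chr21", 0, 100000))
```

Calling fetch_track("hs1", "chr21", 0, 100000) would perform the actual request.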
For automated download, the bigBed files are at http://hgdownload.soe.ucsc.edu/gbdb/hs1/lrSv/cpc1.bb (native) and http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/cpc1.bb (lifted). Use bigBedToBed to extract features: e.g. bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hs1/lrSv/cpc1.bb -chrom=chr21 -start=0 -end=100000000 stdout
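The same extraction can be driven from a Python script by calling bigBedToBed as a subprocess. This is a sketch that assumes the bigBedToBed binary is installed and on the PATH; the helper names are hypothetical, and the arguments mirror the example command above.

```python
import subprocess

HS1_BIGBED = "http://hgdownload.soe.ucsc.edu/gbdb/hs1/lrSv/cpc1.bb"


def bigbed_command(url, chrom, start, end):
    """Return the bigBedToBed argument list for one region, writing to stdout."""
    return ["bigBedToBed", url, f"-chrom={chrom}",
            f"-start={start}", f"-end={end}", "stdout"]


def extract_region(url, chrom, start, end):
    """Run bigBedToBed and return each output line split on tabs."""
    result = subprocess.run(
        bigbed_command(url, chrom, start, end),
        check=True, capture_output=True, text=True,
    )
    return [line.split("\t") for line in result.stdout.splitlines()]


# Printing the command does not require the binary to be installed.
print(" ".join(bigbed_command(HS1_BIGBED, "chr21", 0, 100000000)))
```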
The original pangenome VCF is distributed by the Chinese Pangenome Consortium; see the CPC Phase I repository.
Thanks to the Chinese Pangenome Consortium and the HPRC Phase 1 team for producing and releasing the combined pangenome and its decomposed variant calls.
Gao Y, Yang X, Chen H, Tan X, Yang Z, Deng L, Wang B, Kong S, Li S, Cui Y et al. A pangenome reference of 36 Chinese populations. Nature. 2023 Jul;619(7968):112-121. PMID: 37316654; PMC: PMC10322713