#######################################################################
# NMD escape regions from Gencode (2025-03-24 max/Claude)
# Two outputs: decorator bigBed (per-transcript) and collapsed bigBed (merged by coordinates)
# Collapsed version uses gene symbols from input, colors by rule, transcript lists
# Script accepts -f bigGenePred (gencode .bb) or -f genePredExt (ncbiRefSeq .txt.gz)
#
# 2026-04-20 lrnassar: Added Rule 4 (long-exon rule, Lindeboom 2016) - coding
# exons >400 nt excluding the last coding exon. Rebuilt Gencode + RefSeq.
#
# 2026-04-21 lrnassar: Fixed Rule 2 to test rec["exonCount"]==1 instead of
# len(cdsExons)==1. The old test misclassified multi-exon transcripts with a
# single CDS exon (UTR introns) as "intronless", AND silently suppressed their
# Rule 1/3/4 assignments via the if/else short-circuit. ~3,253 RefSeq curated
# transcripts and ~2,000 Gencode transcripts reassigned. Rebuilt both tracks.
#
# 2026-04-21 lrnassar: Refined Rule 2 gate to reflect the real NMD biology:
# "single coding exon AND no 3'UTR intron" instead of "exonCount==1".
# 5'UTR introns do not deposit EJCs downstream of the stop codon (their EJCs
# are cleared by the scanning 40S or sit upstream of the stop codon and are
# never encountered by the terminating ribosome), so transcripts with a single
# coding exon and only 5'UTR introns are NMD-immune and belong in Rule 2.
# Transcripts with a single coding exon but a 3'UTR intron remain in Rules 1/3
# because that intron deposits a downstream EJC. Reclassified 3,113 RefSeq
# curated transcripts (95.7% of the single-CDS-exon set with UTR introns) and
# 10,790 Gencode V49 transcripts into Rule 2.
# Post-fix rule counts (collapsed regions):
#   RefSeq Curated:  R1=54,015  R2=2,942  R3=49,443  R4=6,503  (total 112,903)
#   Gencode V49:     R1=134,464 R2=6,599  R3=85,319  R4=7,547  (total 233,929)

cd /hive/data/genomes/hg38/bed/nmd/gencode/

# run the script on gencode bigGenePred - produces decorator + collapsed BED files
~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred \
    /hive/data/genomes/hg38/bed/gencodeV49/build/hg38.gencodeV49.bb \
    knownGeneNmdDeco.bed nmdEscRegions.bed

# build decorator bigBed
bedSort knownGeneNmdDeco.bed knownGeneNmdDeco.bed
bedToBigBed knownGeneNmdDeco.bed ../../../chrom.sizes knownGeneNmdDeco.bb \
    -tab -type=bed12+5 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscDecoration.as

# build collapsed bigBed
bedSort nmdEscRegions.bed nmdEscRegions.bed
bedToBigBed nmdEscRegions.bed ../../../chrom.sizes nmdEscRegions.bb \
    -tab -type=bed9+3 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as

# symlinks to /gbdb
ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/nmdEscRegions.bb /gbdb/hg38/nmd/nmdEscRegions.bb
ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/knownGeneNmdDeco.bb /gbdb/hg38/nmd/knownGeneNmdDeco.bb


#######################################################################
# NMD escape regions from NCBI RefSeq (2025-03-24 max)
#
# 2026-04-21 lrnassar: Switched from RefSeq all to RefSeq curated (NM_/NR_ only,
# no XM_/XR_ predicted models) per Max's request on RM #33737. Prior RefSeq-all
# outputs moved to refseqAll.bak/ within the same build dir.

cd /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/

# run the script on ncbiRefSeq curated genePredExt
# Note: the script writes nmdNcbiRefSeqDeco.bed (per-transcript decorator format)
# alongside the collapsed output, but we intentionally do not convert it to bigBed
# for RefSeq. The decorator workflow currently only ships for Gencode/knownGene
# (via knownGeneNmdDeco.bb).
~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f genePredExt \
    /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeqCurated.txt.gz \
    nmdNcbiRefSeqDeco.bed nmdEscNcbiRefSeq.bed

# build collapsed bigBed
bedSort nmdEscNcbiRefSeq.bed nmdEscNcbiRefSeq.bed
bedToBigBed nmdEscNcbiRefSeq.bed ../../../chrom.sizes nmdEscNcbiRefSeq.bb \
    -tab -type=bed9+3 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as

# symlink to gbdb
ln -sf /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/nmdEscNcbiRefSeq.bb /gbdb/hg38/nmd/nmdEscNcbiRefSeq.bb

#######################################################################
# NMD escape regions from MANE Select Plus Clinical (2026-04-24 max)
# Same script, run on the MANE bigGenePred. MANE puts the HGNC symbol in
# bigGenePred field 18 (Gencode uses field 17), so pass --gene-sym-field 18.

mkdir -p /hive/data/genomes/hg38/bed/nmd/mane && cd /hive/data/genomes/hg38/bed/nmd/mane

# --ncbi-id-field 21 puts the NCBI RefSeq accession (NM_/NR_) into the
# collapsed bigBed's ncbiIds column so the trackDb stanza can offer NM_ as
# the default label via labelFields/defaultLabelFields.
# --no-collapse emits one row per (transcript, region). MANE Select gives
# one transcript per gene; MANE Plus Clinical adds a second transcript for
# 74 genes (e.g. LMNA). Keeping rows per-transcript means each label-field
# column holds a single value, which renders cleaner than a comma list.
~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred --gene-sym-field 18 \
    --ncbi-id-field 21 --no-collapse \
    /gbdb/hg38/mane/mane.bb \
    nmdManeDeco.bed nmdEscMane.bed

bedSort nmdEscMane.bed nmdEscMane.bed
# MANE uses its own .as so the labelFields dropdown shows clean column
# labels ("Gene Symbol" / "Gencode Accession (ENST)" / "RefSeq Accession
# (NM_/NR_)") that wouldn't fit the Gencode/RefSeq tracks (whose
# "transcripts" column holds different accession types).
bedToBigBed nmdEscMane.bed ../../../chrom.sizes nmdEscMane.bb \
    -tab -type=bed9+3 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscManeCollapsed.as

ln -sf /hive/data/genomes/hg38/bed/nmd/mane/nmdEscMane.bb /gbdb/hg38/nmd/nmdEscMane.bb

# Collapsed-region counts (current script, no Rule 1/4 algorithmic fix):
#   MANE 1.5:        68,345
#   Gencode V49:    233,929
#   RefSeq Curated: 112,903

# 2026-04-24 max: Fixed Rule 1 to measure 50 bp upstream of the last splice
# junction of the transcript (including 3'UTR introns), not the last CDS-CDS
# junction; output regions are clipped to CDS. The old logic stripped 3'UTR
# from the exon list before computing the "last coding junction", which
# over-painted the last CDS exon as NMD-escape whenever there was only one
# CDS exon, even when a 3'UTR intron sat far downstream (e.g. NBDY: the
# entire 207 bp CDS was painted Rule 1 despite the last junction being
# 2.6 kb past the stop). Rule 4 updated in the same pass: when a 3'UTR
# intron exists, the last CDS-containing exon has a downstream EJC and is
# now eligible for Rule 4.
#
# The script supports two ways of counting the 50 bp walk-back from the
# last junction (--rule1-mode):
#   cds  (default) - count only CDS nucleotides, skipping 3'UTR. A
#                    transcript like NBDY (last junction 2.6 kb past the
#                    stop, in 3'UTR) gets 50 bp of CDS painted, matching
#                    the literal "last 50 bp" reading of the rule label.
#   mrna           - count mRNA nucleotides including 3'UTR, then clip
#                    output to CDS. NBDY-like transcripts get nothing
#                    painted because the 50 mRNA-bp window stays inside
#                    3'UTR. Tracks the 55 bp-rule literature, where the
#                    ribosome-EJC distance is measured in mRNA bp.
# We ship the 'cds' mode; the 'mrna' mode is retained for comparison.
#
# Post-fix collapsed-region counts (--rule1-mode=cds):
#   MANE 1.5:        68,028  (--no-collapse: one row per transcript)
#   Gencode V49:    233,375
#   RefSeq Curated: 112,356

#######################################################################
# Lindeboom et al. NMDetective scores (2025-03-23 max/Claude)
# NMD efficiency predictions from Lindeboom et al. 2016, Nat Genet.
# Four bedGraph custom track files downloaded to:
#   /hive/data/genomes/hg38/bed/nmd/lindeboom/
# Data downloaded from https://figshare.com/articles/dataset/NMDetective/7803398
# Custom track data in the session links from that page
# - NMDetectiveA.ct  - Random forest prediction of NMD efficiency
# - NMDetectiveB.ct  - Decision tree prediction of NMD efficiency
# - nmdDectA-ptc.ct  - Random forest, first out-of-frame PTC
# - nmdDectB-ptc.ct  - Decision tree, first out-of-frame PTC

# Convert bedGraph custom tracks to bigWig and symlink from /gbdb:
cd /hive/data/genomes/hg38/bed/nmd/lindeboom/
bash ~/kent/src/hg/makeDb/scripts/nmd/lindeboomToBigWig.sh
