
References:

https://berthub.eu/articles/posts/reverse-engineering-source-code-of-the-biontech-pfizer-vaccine/

https://en.wikipedia.org/wiki/Pfizer%E2%80%93BioNTech_COVID-19_vaccine

The WHO document that first published the BioNTech sequence:

https://mednet-communities.net/inn/db/media/docs/11889.doc

The Stanford github source:

https://github.com/NAalytics/Assemblies-of-putative-SARS-CoV2-spike-encoding-mRNA-sequences-for-vaccines-BNT-162b2-and-mRNA-1273

The Stanford document:

https://github.com/NAalytics/Assemblies-of-putative-SARS-CoV2-spike-encoding-mRNA-sequences-for-vaccines-BNT-162b2-and-mRNA-1273/blob/main/Assemblies%20of%20putative%20SARS-CoV2-spike-encoding%20mRNA%20sequences%20for%20vaccines%20BNT-162b2%20and%20mRNA-1273.docx.pdf

It is a bit tricky to extract the sequence from these WORD documents.
They won't save to text, and even copy and paste is a bit troublesome
due to the coloring annotations in the paper.

To format a fasta file better if it doesn't look clean:

printf ">sequenceName\n" > betterFormat.fa
grep -v "^>" badFormat.fa | xargs echo | tr -d ' ' | fold -w 60 \
   >> betterFormat.fa

And then verify nothing lost after conversion:

faCount badFormat.fa betterFormat.fa

# the sequences obtained from the documents:

-rw-rw-r-- 1 4384 Mar 30 21:24 WHO_BNT162b2.fa
-rw-rw-r-- 1 4088 Apr  1 23:19 ModernaMrna1273.fa
-rw-rw-r-- 1 4263 Apr  1 23:40 ReconstructedBNT162b2.fa

The github supplied a fasta file:

  wget -O- "https://raw.githubusercontent.com/NAalytics/Assemblies-of-putative-SARS-CoV2-spike-encoding-mRNA-sequences-for-vaccines-BNT-162b2-and-mRNA-1273/main/Figure1Figure2_032321.fasta" | faCount stdin

Figure1_032321_Spike-encoding_contig_assembled_from_BioNTech/Pfizer_BNT-162b2_vaccine   4175    1004    1313    1060    798     0       223
Figure_2_32321_Spike-encoding_contig_assembled_from_Moderna_mRNA-1273_vaccine  4004     898     1364    1118    624     0       283
total   8179    1902    2677    2178    1422    0       506


faCount *.fa
#seq    len     A       C       G       T       N       cpg
ModernaMrna1273 4004    898     1364    1118    624     0       283
ReconstructedBNT162b2        4175    1004    1313    1060    798     0       223
WHO_BNT162b2    4284    1106    1315    1062    801     0       223

Using the description of the coding region within the sequence, extract
just the coding rna:

-rw-rw-r-- 1 3916 Apr  1 12:22 WHO_BNT162b2.mRNA
-rw-rw-r-- 1 3909 Apr  1 23:21 ModernaMrna1273.mRNA
-rw-rw-r-- 1 3908 Apr  1 23:42 ReconstructedBNT162b2.mRNA

#seq    len     A       C       G       T       N       cpg
ModernaMrna1273 3828    852     1308    1074    594     0       276
ReconstructedBNT162b2        3825    916     1186    993     730     0       211
WHO_BNT162b2    3825    916     1186    993     730     0       211

Note these all start with the ATG:

head -2 *.mRNA
==> ModernaMrna1273.mRNA <==
>ModernaMrna1273
ATGTTCGTGTTCCTGGTGCTGCTGCCCCTGGTGAGCAGCCAGTGCGTGAACCTGACCACCCGG

==> ReconstructedBNT162b2.mRNA <==
>ReconstructedBNT162b2
ATGTTCGTG

==> WHO_BNT162b2.mRNA <==
>WHO_BNT162b2
AUGUUCGUGUUCCUGGUGCUGCUGCCUCUGGUGUCCAGCCAGUGUG

The extra bases in the Moderna sequnce is an extra TAG stop codon.

When converted to protein, each of these three sequences produce the
exact same protein

for fa in WHO_BNT162b2 ModernaMrna1273 ReconstructedBNT162b2
do
  faTrans ${fa}.mRNA ${fa}.faa
done

-rw-rw-r-- 1 1315 Apr  7 11:11 WHO_BNT162b2.faa
-rw-rw-r-- 1 1319 Apr  7 11:11 ModernaMrna1273.faa
-rw-rw-r-- 1 1319 Apr  7 11:11 ReconstructedBNT162b2.faa

The Moderna has an extra Z at the end:

tail --lines 1 *.faa
==> ModernaMrna1273.faa <==
GSCCKFDEDDSEPVLKGVKLHYTZZZ

==> ReconstructedBNT162b2.faa <==
GSCCKFDEDDSEPVLKGVKLHYTZZ

==> WHO_BNT162b2.faa <==
GSCCKFDEDDSEPVLKGVKLHYTZZ

Otherwise, they are idential AA sequences.

############################################################################
# construct alignment to wuhCor1
############################################################################

# Place all three sequences in one file:
# The display works better when the U is a T, convert
# the WHO sequence to T

cat ModernaMrna1273.fa ReconstructedBNT162b2.fa > threeVaccines.fa
cat WHO_BNT162b2.fa | tr 'U' 'T' >> threeVaccines.fa

blat -tileSize=3 -stepSize=1 -minMatch=1 -t=dnax -q=rnax -noSimpRepMask \
   -maxIntron=1 ../../wuhCor1.2bit threeVaccines.fa wuhCor1.vaccines.psl
# Loaded 29903 letters in 1 sequences
# Blatx 1 sequences in database, 1 files in query

pslScore wuhCor1.vaccines.psl
# NC_045512v2     22435   25384   ModernaMrna1273:930-3879        1105    68.80
# NC_045512v2     21559   25384   ReconstructedBNT162b2:51-3876   1701    72.30
# NC_045512v2     21559   25384   WHO_BNT162b2:51-3876    1701    72.30

faCount threeVaccines.fa | tawk '{print $1,"1.."$2+1}' | head -4 | tail -3 > threeVaccines.cds
pslToBigPsl -cds=threeVaccines.cds -fa=threeVaccines.fa wuhCor1.vaccines.psl stdout \
   | sort -k1,1 -k2,2n > wuhCor1.vaccines.bigPsl

bedToBigBed -type=bed12+13 -tab -as=$HOME/kent/src/hg/lib/bigPsl.as \
  wuhCor1.vaccines.bigPsl ../../chrom.sizes wuhCor1.vaccines.bb

gfClient -t=dnax -q=rnax blat1b 17900 /gbdb/wuhCor1 threeVaccines.fa stdout \
   | pslFilter -minScore=1000 stdin wuhCor1.vaccines.psl

pslScore  wuhCor1.vaccines.psl

NC_045512v2     21559   25384   ModernaMrna1273:54-3879 1419    68.60
NC_045512v2     21559   25384   ReconstructedBNT162b2:51-3876  1701    72.30
NC_045512v2     21559   29903   WHO_BNT162b2:51-4246    1747    72.50

root      8751     1  0  2020 ?        00:00:00 SCREEN -S wuhCor1-t /scratch/gfServer start blat1b 17900 -trans -mask -debugLog -seqLog -ipLog -log=gfServer.trans.log -tileSize=3 -stepSize=1 -minMatch=1 -noSimpRepMask wuhCor1.2bit
root      8752  8751  0  2020 pts/44   00:01:22 /scratch/gfServer start blat1b 17900 -trans -mask -debugLog -seqLog -ipLog -log=gfServer.trans.log -tileSize=3 -stepSize=1 -minMatch=1 -noSimpRepMask wuhCor1.2bit
root      8755     1  0  2020 ?        00:00:00 SCREEN -S wuhCor1-u /scratch/gfServer start blat1b 17901 -stepSize=5 -debugLog -seqLog -ipLog -log=gfServer.log -tileSize=3 -stepSize=1 -minMatch=1 -noSimpRepMask -tileSize=7 -stepSize=1 -minMatch=1 wuhCor1.2bit
root      8756  8755  0  2020 pts/71   00:06:22 /scratch/gfServer start blat1b 17901 -stepSize=5 -debugLog -seqLog -ipLog -log=gfServer.log -tileSize=3 -stepSize=1 -minMatch=1 -noSimpRepMask -tileSize=7 -stepSize=1 -minMatch=1 wuhCor1.2bit

qateam   31904 31849  0 11:52 pts/0    00:00:00 grep --color=auto wuhCor1


gfClient -t=dnax -q=rnax blat-backup 4040 -genome=wuhCor1 -genomeDataDir=wuhCor1 /gbdb/wuhCor1 threeVaccines.ra a.wuhCor1.vaccines.psl

# rm /gbdb/wuhCor1/bbi/vaccines.bigPsl.bb
ln -s `pwd`/wuhCor1.vaccines.bb /gbdb/wuhCor1/bbi

############################################################################
# provide download file links
############################################################################

cd /hive/data/genomes/wuhCor1/bed/vaccines
mkdir /usr/local/apache/htdocs-hgdownload/goldenPath/wuhCor1/vaccines
ln -s `pwd`/ReconstructedBNT162b2.fa /usr/local/apache/htdocs-hgdownload/goldenPath/wuhCor1/vaccines
ln -s `pwd`/WHO_BNT162b2.fa /usr/local/apache/htdocs-hgdownload/goldenPath/wuhCor1/vaccines
ln -s `pwd`/ModernaMrna1273.fa /usr/local/apache/htdocs-hgdownload/goldenPath/wuhCor1/vaccines
ln -s `pwd`/wuhCor1.vaccines.psl /usr/local/apache/htdocs-hgdownload/goldenPath/wuhCor1/vaccines


gfServer status blat1b 17900

version 36x9
type translated
host blat1b
port 17900
tileSize 3
stepSize 1
minMatch 1
pcr requests 0
blat requests 4729
bases 247748
aa 1156368
misses 107
noSig 4
trimmed 6
warnings 3

gfServer status blat1b 17901

version 36x9
type nucleotide
host blat1b
port 17901
tileSize 7
stepSize 1
minMatch 1
pcr requests 8938
blat requests 55355
bases 74910106
misses 176
noSig 3
trimmed 154
warnings 153

