This file describes how we made the browser database on
NCBI build 28 (Dec 22, 2001 freeze).

GETTING FRESH mRNA AND EST SEQUENCE FROM GENBANK.

This will create a genbank.127 directory containing compressed
GenBank flat files and a mrna.127 directory containing unpacked sequence
info and auxiliary info in a relatively easy to parse (.ra)
format.

  o - Point your browser to ftp://ncbi.nlm.nih.gov/genbank and
      look at the README.  Figure out the current release number
      (which is 127).
  o - Consider deleting one of the older genbank releases.  It's
      good to at least keep one previous release though.
  o - Where there is space make a new genbank directory.  Create a
      symbolic link to it:
          mkdir /whereverYouPutIt/genbank.127
          ln -s /whereverYouPutIt/genbank.127 ~/genbank
  o - cd ~/genbank
  o - ftp ncbi.nlm.nih.gov  (do anonymous log-in).  Then do the
      following commands inside ftp:
          cd genbank
          prompt
          mget gbpri* gbrod* gbv* gbsts* gbest* gbmam* gbinv*
      This will take at least 2 hours.
  o - Log onto server and change to your genbank directory.
  o - cd /cluster/store1
  o - mkdir mrna.127
  o - cd genbank.127
  o - gunzip -c gbpri*.gz | gbToFaRa ~/hg/h/mrna.fil ../mrna.127/mrna.fa ../mrna.127/mrna.ra ../mrna.127/mrna.ta stdin
  o - gunzip -c gbest*.gz | gbToFaRa ~/hg/h/mrna.fil ../mrna.127/est.fa ../mrna.127/est.ra ../mrna.127/est.ta stdin
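
The directory and symbolic link setup above can be scripted so that the
release number is given only once.  This is a minimal sketch in POSIX sh
(not the tcsh used elsewhere in this file); the function name
new_genbank_dir and its three arguments are hypothetical, and it does not
download anything.

```shell
#!/bin/sh
# Hypothetical helper: create genbank.<release> under a storage root and
# repoint a symbolic link at it.  Older release directories are left alone.
new_genbank_dir() {
    root=$1      # e.g. /whereverYouPutIt
    release=$2   # e.g. 127
    link=$3      # e.g. $HOME/genbank
    mkdir -p "$root/genbank.$release" || return 1
    # Remove a stale link first so ln does not create a link inside the
    # directory that the old link points to.
    if [ -L "$link" ]; then
        rm "$link"
    fi
    ln -s "$root/genbank.$release" "$link"
}
```

Usage would be something like: new_genbank_dir /whereverYouPutIt 127 ~/genbank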


CREATING DATABASE AND STORING mRNA/EST SEQUENCE AND AUXILIARY INFO  (done)

o - ln -s /cluster/store1/gs.11/build28 ~/oo
o - Create the database.
     - ssh hgwdev
     - Enter mysql via:
           mysql -u hgcat -pbigsecret
     - At mysql prompt type:
	create database hg10;
	quit
     - make a semi-permanent read-only alias:
        alias hg10 mysql -u hguser -phguserstuff -A hg10
o - Make sure there is at least 5 gig free on hgwdev:/usr/local/mysql 
o - Store the mRNA (non-alignment) info in database.
     hgLoadRna new hg10
     hgLoadRna add hg10 /cluster/store1/mrna.127/mrna.fa /cluster/store1/mrna.127/mrna.ra
     hgLoadRna add hg10 /cluster/store1/mrna.127/est.fa /cluster/store1/mrna.127/est.ra
    The last line will take quite some time to complete.  It will count up to
    about 3,800,000 before it is done.

REPEAT MASKING  (done)

For the NCBI assembly we repeat mask on the sensitive mode setting.
Patrick Gavin did this, and perhaps will fill in the details later.
Overall the process was to break the NT contigs into pieces of
1000000 bases each with
    cd NTXXX
    mkdir split
    cd split
    faSplit size ../NTXXX.fa 1000000 NTXXX -maxN=1000001 -lift=../NTXXX.lft
and then arrange to do a cluster run that generates NTXXX_X.fa.out files
in the split directory.   After this lift up the .out files to whole contig
coordinates with
    liftUp NTXXX.fa.out NTXXX.lft warn split/*.fa.out
Then lift these up to chromosome coordinates with
    cd ~/oo
    tcsh jkStuff/liftOut2.sh
Then load the .out files into the database with:
    ssh hgwdev
    cd ~/oo
    hgLoadOut hg10 ?/*.fa.out ??/*.fa.out
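
What the lift steps above do to coordinates can be illustrated with a few
lines of awk.  This is only a sketch and not a replacement for liftUp: it
assumes a lift file whose columns are offset, piece name, piece size,
parent name, parent size, and a three-column BED-like input rather than
RepeatMasker .out files.  The function name lift_bed is hypothetical.

```shell
#!/bin/sh
# lift_bed LIFTFILE BEDFILE
# Illustration only: shift start and end of each line by the offset that
# the lift file records for its sequence name, and rename it to the parent.
lift_bed() {
    awk '
        # First file: remember offset and parent name for every piece.
        NR == FNR { offset[$2] = $1; parent[$2] = $4; next }
        # Second file: translate lines whose name is in the lift table.
        ($1 in offset) {
            printf "%s\t%d\t%d\n", parent[$1], $2 + offset[$1], $3 + offset[$1]
            next
        }
        # Anything else is reported and dropped.
        { print "warning: no lift entry for " $1 > "/dev/stderr" }
    ' "$1" "$2"
}
```

A feature at 10-20 on the second 1000000 base piece of a contig comes out
at 1000010-1000020 on the contig itself.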

STORING O+O SEQUENCE AND ASSEMBLY INFORMATION (done)

Create packed chromosome sequence files 
     ssh kkstore
     cd ~/oo
     tcsh jkStuff/makeNib.sh

Load chromosome sequence info into database
     ssh hgwdev
     mysql -A -u hgcat -pbigsecret hg10  < ~/src/hg/lib/chromInfo.sql
     cd ~/oo
     hgNibSeq -preMadeNib hg10 /cluster/store1/gs.11/build28/nib ?/chr*.fa ??/chr*.fa

Store o+o info in database.
     cd /cluster/store1/gs.11/build28
     jkStuff/liftGl.sh contig.gl
     hgGoldGapGl hg10 /cluster/store1/gs.11 build28 
     cd /cluster/store1/gs.11
     hgClonePos hg10 build28 ffa/sequence.inf /cluster/store1/gs.11
(Ignore warnings about missing clones - these are in chromosomes 21 and 22)
     hgCtgPos hg10 build28 

Make and load GC percent table
     ssh hgwdev
     cd /cluster/store1/gs.11/build28/bed
     mkdir gcPercent
     cd gcPercent
     mysql -A -u hgcat -pbigsecret hg10  < ~/src/hg/lib/gcPercent.sql
     hgGcPercent hg10 ../../nib



MAKING AND STORING mRNA AND EST ALIGNMENTS (done)

o - Load up the local disks of the cluster with refSeq.fa, mrna.fa and est.fa
    from /cluster/store1/mrna.127  into /var/tmp/hg/h/mrna

o - Use BLAT to generate refSeq, mRNA and EST alignments as so:
      Make sure that /scratch/hg/gs.11/build28/contigs is loaded
      with NT_*.fa and pushed to the cluster nodes.
	  cd ~/oo/bed
	  foreach i (refSeq mrna est)
	      mkdir $i
	      cd $i
	      echo /scratch/hg/gs.11/build28/contigs | wordLine stdin > genome.lst
	      ls -1 /scratch/hg/mrna.127/$i.fa > mrna.lst
	      mkdir psl
	      gensub2 genome.lst mrna.lst gsub spec
	      jabba make hut spec
	      jabba push hut
	  end 
    check on progress with jabba check hut in mrna, est, and refSeq
    directories.

      
o - Process refSeq, mRNA and EST alignments into near best in genome.
      cd ~/oo/bed

      cd refSeq
      pslSort dirs raw.psl /cluster/fast1/temp psl
      pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 raw.psl contig.psl /dev/null
      liftUp -nohead all_refSeq.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /cluster/fast1/temp all_refSeq.psl
      cd ..

      cd mrna
      pslSort dirs raw.psl /cluster/fast1/temp psl
      pslReps -minAli=0.96 -nearTop=0.01 raw.psl contig.psl /dev/null
      liftUp -nohead all_mrna.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /cluster/fast1/temp all_mrna.psl
      cd ..

      cd est
      pslSort dirs raw.psl /cluster/fast1/temp psl
      pslReps -minAli=0.93 -nearTop=0.01 raw.psl contig.psl /dev/null
      liftUp -nohead all_est.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /cluster/fast1/temp all_est.psl
      cd ..

o - Load refSeq alignments into database
      ssh hgwdev
      cd /cluster/store1/gs.11/build28/bed/refSeq
      pslCat -dir chrom > refSeqAli.psl
      hgLoadPsl hg10 -tNameIx refSeqAli.psl

o - Load mRNA alignments into database.
      ssh hgwdev
      cd /cluster/store1/gs.11/build28/bed/mrna/chrom
      foreach i (*.psl)
          mv $i $i:r_mrna.psl
      end
      hgLoadPsl hg10 *.psl
      cd ..
      hgLoadPsl hg10 all_mrna.psl -nobin

o - Load EST alignments into database.
      ssh hgwdev
      cd /cluster/store1/gs.11/build28/bed/est/chrom
      foreach i (*.psl)
            mv $i $i:r_est.psl
      end
      hgLoadPsl hg10 *.psl
      cd ..
      hgLoadPsl hg10 all_est.psl -nobin
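
The mRNA and EST rename loops above append the suffix unconditionally, so
running one of them a second time gives names like chr1_est_est.psl.  A
minimal sketch of a version that can be repeated safely, in POSIX sh
rather than tcsh; the function name add_suffix is hypothetical.

```shell
#!/bin/sh
# add_suffix DIR SUFFIX
# Rename DIR/<name>.psl to DIR/<name>_<SUFFIX>.psl, skipping files that
# already carry the suffix.
add_suffix() {
    for f in "$1"/*.psl; do
        [ -e "$f" ] || continue        # no matches: the glob stays literal
        case $f in
            *_"$2".psl) continue ;;    # already renamed on an earlier run
        esac
        mv "$f" "${f%.psl}_$2.psl"
    done
}
```

For example, add_suffix chrom est leaves chr1_est.psl alone and turns
chr2.psl into chr2_est.psl.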

o - Create subset of ESTs with introns and load into database.
      - ssh kkstore
      cd ~/oo
      tcsh jkStuff/makeIntronEst.sh
      - ssh hgwdev
      cd ~/oo/bed/est/intronEst
      hgLoadPsl hg10 *.psl

o - Put orientation info on ESTs into database:
     ssh kkstore
     cd ~/oo/bed/est
     pslSortAcc nohead contig /cluster/fast1/temp contig.psl
     mkdir /scratch/hg/gs.11/build28/bed
     cp -r contig /scratch/hg/gs.11/build28/bed/est
     sudo /cluster/install/utilities/updateLocal
     cd ~/oo/bed
     mkdir estOrientInfo
     cd estOrientInfo
     mkdir ei
     ls -1 /scratch/hg/gs.11/build28/bed/est > psl.lst
   Get onto kk and cd to ~/oo/bed/estOrientInfo.  Copy in
   gsub from the previous version and edit it to say where
   things are located in scratch on this version.  Then:
     gensub2 psl.lst single gsub spec
     para create spec
     para try 
     para push 
   check until done, or use 'para shove'

o - Create rnaCluster table
   ssh hgwdev
   cd ~/oo
   mkdir -p bed/rnaCluster/chrom
   foreach i (? ??)
       cd $i
       foreach j (chr*.fa)
	   set c = $j:r
	   echo clusterRna hg10 /dev/null ../bed/rnaCluster/chrom/$c.bed -chrom=$c
	   clusterRna hg10 /dev/null ../bed/rnaCluster/chrom/$c.bed -chrom=$c
       end
       cd ..
   end
   cd bed/rnaCluster
   hgLoadBed hg10 rnaCluster chrom/*.bed


PRODUCING KNOWN GENES (done)

o - Download everything from ftp://ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/
    into /cluster/store1/mrna.127/refSeq.
o - Unpack this into fa files and get extra info with:
       cd /cluster/store1/mrna.127/refSeq
       gunzip hs.gbff.gz
       gunzip hs.faa.gz
       gbToFaRa ~/hg/h/mrna.fil ../refSeq.fa ../refSeq.ra ../refSeq.ta hs.gbff
o - Align refSeq.fa to genome as described under mRNA/EST alignments above.
o - Get extra info from NCBI and produce refGene table as so:
       ssh hgwdev
       cd ~/oo/bed/refSeq
       wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/loc2ref 
       wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/mim2loc
o - Similarly download refSeq proteins in fasta format to refSeq.pep
o - Align these by processes described under mRNA/EST alignments above.
o - Produce refGene, refPep, refMrna, and refLink tables as so:
       ssh hgwdev
       cd ~/oo/bed/refSeq
       ln -s /cluster/store1/mrna.127 mrna
       hgRefSeqMrna hg10 mrna/refSeq.fa mrna/refSeq.ra all_refSeq.psl loc2ref mrna/refSeq/hs.faa mim2loc
o - Add Jackson labs info
     cd ~/oo/bed
     mkdir jaxOrtholog
     cd jaxOrtholog
     ftp ftp://ftp.informatics.jax.org/pub/informatics/reports/HMD_Human3.rpt
     awk -f filter.awk *.rpt > jaxOrtholog.tab
    Load this into mysql with something like:
     mysql -u hgcat -pBIGSECRET hg10 < ~/src/hg/lib/jaxOrtholog.sql
     mysql -u hgcat -pBIGSECRET -A hg10
    and at the mysql> prompt
     load data local infile 'jaxOrtholog.tab' into table jaxOrtholog;
o - Add RefSeq status info (done 6/19/02)
    hgRefSeqStatus hg10 loc2ref


SIMPLE REPEAT TRACK (done)

o - Create cluster jabba job like so:
        ssh kkstore
	cd ~/oo/bed
	mkdir simpleRepeat
	cd simpleRepeat
        cp ~/lastOo/bed/simpleRepeat/gsub .
	mkdir trf
	echo single > single.lst
        echo /scratch/hg/gs.11/build28/contigs | wordLine stdin > genome.lst
	gensub2 genome.lst single.lst gsub spec
	jabba make hut spec
        jabba push hut
     Repeat jabba push hut until it is done.   This one is getting
     erratic errors, currently requiring multiple runs on some jobs.
     I'm not sure if the problem lies in the cluster, in trfBig, or in trf.
     Jorge suggested it might have been because some of the local
     disks were full.  Once the run is done:
        liftUp simpleRepeat.bed ~/oo/jkStuff/liftAll.lft warn trf/*.bed

o - Load this into the database as so
        ssh hgwdev
	cd ~/oo/bed/simpleRepeat
	hgLoadBed hg10 simpleRepeat simpleRepeat.bed -sqlTable=$HOME/src/hg/lib/simpleRepeat.sql


PRODUCING GENSCAN PREDICTIONS (done)
    
o - Produce contig genscan.gtf genscan.pep and genscanExtra.bed files like so:
     ssh roar
     cd ~/build28
     source jkStuff/gsBig.sh &
    Wait about 3 1/2 days for these to finish.
     started 11:00 am Oct 5.
     finished by 9:00 pm Oct 8.

o - Convert these to chromosome level files as so:
     cd ~/build28
     mkdir bed/genscan
     liftUp bed/genscan/genscan.gtf jkStuff/liftAll.lft warn ?/ctg*/genscan.gtf ??/ctg*/genscan.gtf
     liftUp bed/genscan/genscanSubopt.bed jkStuff/liftAll.lft warn ?/ctg*/genscanSub.bed ??/ctg*/genscanSub.bed
     cat ?/ctg*/genscan.pep ??/ctg*/genscan.pep > bed/genscan/genscan.pep

o - Load into the database as so:
     ssh hgwdev
     cd ~/build28/bed/genscan
     ldHgGene hg10 genscan genscan.gtf
     hgPepPred hg10 generic genscanPep genscan.pep
     hgLoadBed hg10 genscanSubopt genscanSubopt.bed


PREPARING SEQUENCE FOR CROSS SPECIES ALIGNMENTS (done)

Make sure that the NT*.fa files are lower-case repeat masked, and that
the simpleRepeat track is made.  Then put doubly (simple & interspersed)
repeat mask files onto /cluster local drive as so.
    ssh kkstore
    tcsh
    mkdir /scratch/hg/gs.11/build28/trfFa
    cd ~/oo
    foreach i (? ??)
        cd $i
        foreach j (NT*)
            maskOutFa $j/$j.fa ../bed/simpleRepeat/trf/$j.bed -softAdd /scratch/hg/gs.11/build28/trfFa/$j.fa.trf
            echo done $i/$j
        end
        cd ..
    end
Then ask admins to do a binrsync.



TWINSCAN GENE PREDICTIONS (done 06/10/02)

    mkdir -p ~/oo/bed/twinscan
    cd ~/oo/bed/twinscan
    wget http://genes.cs.wustl.edu/human/Gtf.tgz
    wget http://genes.cs.wustl.edu/human/Ptx.tgz
    gunzip -c Gtf.tgz | tar xvf -
    gunzip -c Ptx.tgz | tar xvf -
    ldHgGene hg10 twinscan *.gtf -exon=CDS
    - pare down to id (and remove 0-length files) :
    foreach f (chr*.ptx)
      perl -wpe 's/^\>.*\s+source_id\s*\=\s*(\S+).*$/\>$1/;' < \
	$f > $f:r-fixed.fa
      if (-z $f:r-fixed.fa) then
        rm $f:r-fixed.fa
      endif
    end
    hgPepPred hg10 generic twinscanPep chr*-fixed.fa


CREATE GOLDEN TRIANGLE (todo)

Make sure that rnaCluster table is in place.   Then
extract Affy expression info into a form suitable
for Eisen's clustering program with:
      cd ~/oo/bed
      mkdir triangle
      cd triangle
      eisenInput hg10 affyHg10.txt
Transfer this to Windows and do k-means clustering
with k=200 with cluster.  Transfer results file back
to ~/oo/bed/triangle/affyCluster_K_G200.kgg.  Then
do
      promoSeqFromCluster hg10 1000 affyCluster_K_G200.kgg kg200.unmasked kg200.lift
Then RepeatMask the .fa files in kg200.unmasked, and copy masked versions
to kg200.   Then
      cat kg200/*.fa > all1000.fa
and set up cluster Improbizer run to do 100 controls for every real
run on each - putting the output in imp.200.1000.e.  When improbizer
run is done make a file summarizing the runs as so:
      cd imp.200.1000.e
      motifSig ../imp.200.1000.e.iri ../kg200 motif control*
get rid of insignificant motifs with:
      cd ..
      awk '{if ($2 > $3) print; }' imp.200.1000.e.iri > sig.200.1000.e.iri
turn rest into just dnaMotifs with
      iriToDnaMotif sig.200.1000.e.iri motif.200.1000.e.txt
Extract all promoters with
      featureBits hg10 rnaCluster:upstream:1000 -bed=upstream1000.bed -fa=upstream1000.fa
Locate motifs on all promoters with
      dnaMotifFind motif.200.1000.e.txt upstream1000.fa hits.200.1000.e.txt -rc -markov=2
      liftPromoHits upstream1000.bed hits.200.1000.e.txt triangle.bed
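
The significance filter above, awk '{if ($2 > $3) print; }', keeps a line
only when its second field is numerically larger than its third.  A small
sketch with made-up numbers shows the effect; what the two columns of the
.iri file actually hold is not stated here, and the function name
keep_significant is hypothetical.

```shell
#!/bin/sh
# Keep only lines whose second field is numerically greater than the
# third.  This is the same test as the awk command in the text, written
# as a bare pattern (the default action is to print the line).
keep_significant() {
    awk '$2 > $3'
}
```

Fed the three lines "m1 5.0 3.2", "m2 1.0 3.2" and "m3 10 9", it passes m1
and m3.  The comparison is numeric, so 10 counts as larger than 9.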







DOING HUMAN/MOUSE ALIGNMENTS (todo)

o - Start with the mouse assembly in 1 Mb chunks, lower-case
    repeat and tandem-repeat masked, in /scratch/hg/mm1/trfFa,
    and the human assembly in contigs similarly masked in
    /scratch/hg/gs.11/build28/trfFa
    Then
        ssh kkstore
	cd ~/oo/bed
	mkdir blatMus
	cd blatMus
	ls -1 /scratch/hg/mm1/trfFa/*.fa.trf > mouseAll
	mkdir mm
	cd mm
	splitFile ../mouseAll 50 mm
	cd ..
	ls -1 mm/* > mouse.lst
    Then bundle up the human into pieces of mostly less than 12
    meg.  Start with
        ls -lS /scratch/hg/gs.11/build28/trfFa/*.fa.trf > bigHuman
    Edit this file and move all of the lines for files smaller than
    3 meg into the file smallHuman.  Then do
        awk '{printf("/scratch/hg/gs.11/build28/trfFa/%s\n", $9);}' bigHuman > bigH
        awk '{printf("/scratch/hg/gs.11/build28/trfFa/%s\n", $9);}' smallHuman > smallH
	mkdir hs
	cd hs
	splitFile ../bigH 1 big
	rm big177
	splitFile ../smallH 4 small
	rm small705
        cd ..
	ls -1 hs/* > human.lst
    (The rm's probably indicate splitFile needs a fix - they are zero length).
    Finally generate the job list with
        gensub2 human.lst mouse.lst gsub spec
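
The hand edit that separates bigHuman from smallHuman can also be done
mechanically.  A minimal sketch in POSIX sh, assuming the 3 meg cutoff
means 3000000 bytes and that file names contain no whitespace.  The
function name split_by_size is hypothetical; its two output files play
the role of bigH and smallH.  Unlike ls -lS, the lists come out in name
order rather than size order.

```shell
#!/bin/sh
# split_by_size DIR CUTOFF BIGLIST SMALLLIST
# Write the path of every *.fa.trf file in DIR to BIGLIST when its size in
# bytes is at least CUTOFF, otherwise to SMALLLIST.
split_by_size() {
    : > "$3"                                # start with empty lists
    : > "$4"
    for f in "$1"/*.fa.trf; do
        [ -e "$f" ] || continue             # no matches: glob stays literal
        size=$(( $(wc -c < "$f") ))         # arithmetic strips any padding
        if [ "$size" -ge "$2" ]; then
            echo "$f" >> "$3"
        else
            echo "$f" >> "$4"
        fi
    done
}
```

Something like
    split_by_size /scratch/hg/gs.11/build28/trfFa 3000000 bigH smallH
would then stand in for the ls, edit and awk steps.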

o - Do the cluster run as so
       ssh kk
       cd ~/oo/bed/blatMus
       mkdir psl
       para create spec
       para try
    and then do para push/check/push/check/shove etc.

o - Sort alignments as so 
       ssh kkstore
       cd ~/oo/bed/blatMus
       pslCat -dir -check psl | liftUp -type=.psl stdout ../../jkStuff/liftAll.lft warn stdin | liftUp -type=.psl stdout ~/mm/jkStuff/liftAll.lft warn stdin -pslQ | pslSortAcc nohead chromPile /cluster/store2/temp stdin
o - Get rid of big pile-ups due to contamination as so:
       mkdir chrom
       cd chromPile
       foreach i (*.psl)
           echo $i
           pslUnpile -maxPile=250 $i ../chrom/$i
       end
o - Rename to correspond with tables as so and load into database:
       ssh hgwdev
       cd ~/oo/bed/blatMus/chrom
       foreach i (*.psl)
	   set r = $i:r
           mv $i ${r}_blatMus.psl
       end
       hgLoadPsl hg10 *.psl
o - load sequence into database as so:
	ssh kks00
	faSplit about /projects/hg3/mouse/arachne.3/whole/Unplaced.mfa 1200000000 /projects/hg3/mouse/arachne.3/whole/unplaced
	ssh hgwdev
	hgLoadRna addSeq '-abbr=gnl|' hg10 /projects/hg3/mouse/arachne.3/whole/unpla*.fa
	hgLoadRna addSeq '-abbr=con' hg10 /projects/hg3/mouse/arachne.3/whole/SET*.mfa
    This will take quite some time.  Perhaps an hour.

o - Produce 'best in genome' filtered version:
        ssh kks00
	cd ~/mouse/vsOo33
	pslSort dirs blatMouseAll.psl temp blatMouse
	pslReps blatMouseAll.psl bestMouseAll.psl /dev/null -singleHit -minCover=0.3 -minIdentity=0.1
	pslSortAcc nohead bestMouse temp bestMouseAll.psl
	cd bestMouse
        foreach i (*.psl)
	   set r = $i:r
           mv $i ${r}_bestMouse.psl
        end
o - Load best in genome into database as so:
	ssh hgwdev
	cd ~/mouse/vsOo33/bestMouse
        hgLoadPsl hg10 *.psl

PRODUCING CROSS_SPECIES mRNA ALIGNMENTS (done for mRNA, not for EST)

Here you align vertebrate mRNAs against the masked genome on the
cluster you set up during the previous step.

o - Make sure that gbpri, gbmam, gbrod, gbvrt, and gbinv are downloaded from GenBank into
    /cluster/store1/genbank.127

o - Process these out of genbank flat files as so:
       ssh kkstore
       cd /cluster/store1/genbank.127
       mkdir ../mrna.127
       gunzip -c gbpri*.gz gbmam*.gz gbrod*.gz gbvrt*.gz gbinv*.gz | gbToFaRa ~/hg/h/xenoRna.fil ../mrna.127/xenoRna.fa ../mrna.127/xenoRna.ra ../mrna.127/xenoRna.ta stdin
       cd ../mrna.127
       faSplit sequence xenoRna.fa 2 xenoRna
       ssh kks00
       cd /projects/hg2
       mkdir mrna.127
       cp /cluster/store1/mrna.127/xenoRna.* mrna.127

Set up cluster run.  First make sure genome is in /scratch/hg/gs.11/build28/trfFa
in RepeatMasked + trf form.  (This is probably done already in mouse alignment
stage).  Also make sure /scratch/hg/mrna.127 is loaded with xenoRna.fa.  Then do:
       ssh kkstore
       cd ~/oo/bed
       mkdir xenoMrna
       cd xenoMrna
       mkdir psl
       ls -1S /scratch/hg/gs.11/build28/trfFa > human.lst
       ls -1S /scratch/hg/mrna.127/xenoRna?*.fa > mrna.lst
       cp ~/lastOo/bed/xenoMrna/gsub .
       gensub2 human.lst mrna.lst gsub spec
       jabba make hut spec
       jabba push hut
Do jabba check hut until the run is done, doing jabba push hut if
necessary on occasion.

Sort xeno mRNA alignments as so:
       ssh kkstore
       cd ~/oo/bed/xenoMrna
       pslSort dirs raw.psl /cluster/store2/temp psl
       pslReps raw.psl cooked.psl /dev/null -minAli=0.25
       liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
       pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
       pslCat -dir chrom > xenoMrna.psl
       rm -r chrom raw.psl cooked.psl chrom.psl

Load into database as so:
       ssh hgwdev
       cd ~/oo/bed/xenoMrna
       hgLoadPsl hg10 xenoMrna.psl -tNameIx
       cd /cluster/store1/mrna.127
       hgLoadRna add hg10 /cluster/store1/mrna.127/xenoRna.fa xenoRna.ra

Similarly do xenoEst alignments:
       ssh kkstore
       cd ~/oo/bed
       mkdir xenoEst
       cd xenoEst
       mkdir psl
       ls -1S /scratch/hg/gs.11/build28/trfFa/*.fa > human.lst
       ls -1S /scratch/hg/mrna.127/xenoEst?*.fa > mrna.lst
       cp ~/lastOo/bed/xenoEst/gsub .
       gensub2 human.lst mrna.lst gsub spec
       jabba make hut spec
       jabba shove hut

Sort xenoEst alignments:
       ssh kkstore
       cd ~/oo/bed/xenoEst
       pslSort dirs raw.psl /cluster/store2/temp psl
       pslReps raw.psl cooked.psl /dev/null -minAli=0.10
       liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
       pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
       pslCat -dir chrom > xenoEst.psl
       rm -r chrom raw.psl cooked.psl chrom.psl

Load into database as so:
       ssh hgwdev
       cd ~/oo/bed/xenoEst
       hgLoadPsl hg10 xenoEst.psl -tNameIx
       cd /cluster/store1/mrna.127
       foreach i (xenoEst*.fa)
	   echo processing $i
	   hgLoadRna add hg10 /cluster/store1/mrna.127/$i $i:r.ra
       end


PRODUCING FISH ALIGNMENTS (todo)

o - Do fish/human alignments.
       ssh kk
       cd ~/mm/bed
       mkdir blatFish
       cd blatFish
       mkdir psl
       ls -1S /scratch/hg/fish > fish.lst
       ls -1S /scratch/hg/build28/trfFa > human.lst
     Copy over gsub from previous version and edit paths to
     point to current assembly.
       gensub2 human.lst fish.lst gsub spec
       para create spec
       para try
     Make sure jobs are going ok with para check.  Then
       para push
     wait about 2 hours and do another
       para push
     do para checks and if necessary para pushes until done
     or use para shove.
o - Sort alignments as so 
       pslCat -dir psl | liftUp -type=.psl stdout ~/mm/jkStuff/liftAll.lft warn stdin | pslSortAcc nohead chrom temp stdin
o - Copy to hgwdev:/scratch.  Rename to correspond with tables as so and 
    load into database:
       ssh hgwdev
       cd ~/oo/bed/blatFish/chrom
       foreach i (*.psl)
	   set r = $i:r
           mv $i ${r}_blatFish.psl
       end
       hgLoadPsl hg10 *.psl




TIGR GENE INDEX (todo)

  o - Download ftp://ftp.tigr.org/private/HGI_ren/*aug.tgz into
      ~/oo.29/bed/tgi and then execute the following commands:
          cd ~/oo.29/bed/tgi
	  mv cattleTCs_aug.tgz cowTCs_aug.tgz
	  foreach i (mouse cow human pig rat)
	      mkdir $i
	      cd $i
	      gtar -zxf ../${i}*.tgz
	      gawk -v animal=$i -f ../filter.awk * > ../$i.gff
	      cd ..
	  end
	  mv human.gff human.bak
	  sed s/THCs/TCs/ human.bak > human.gff
	  ldHgGene -exon=TCs hg7 tigrGeneIndex *.gff


      
LOAD STS MAP (done)
     - login to hgwdev
      cd ~/oo/bed
      hg10 < ~/src/hg/lib/stsMap.sql
      mkdir stsMap
      cd stsMap
      bedSort /projects/cc/hg/mapplots/data/tracks/build28/stsMap.bed stsMap.bed
      - Enter database with "hg10" command.
      - At mysql> prompt type in:
          load data local infile 'stsMap.bed' into table stsMap;
      - At mysql> prompt type
          quit


LOAD CHROMOSOME BANDS (done)
      - login to hgwdev
      cd /cluster/store1/gs.11/build28/bed
      mkdir cytoBands
      cp /projects/cc/hg/mapplots/data/tracks/build28/cytobands.bed cytoBands
      hg10 < ~/src/hg/lib/cytoBand.sql
      Enter database with "hg10" command.
      - At mysql> prompt type in:
          load data local infile 'cytobands.bed' into table cytoBand;
      - At mysql> prompt type
          quit

LOAD MOUSEREF TRACK (todo)
    First copy in data from kkstore to ~/oo/bed/mouseRef.  
    Then substitute 'genome' for the appropriate chromosome 
    in each of the alignment files.  Finally do:
       hgRefAlign webb hg10 mouseRef *.alignments

LOAD AVID MOUSE TRACK (todo)
      ssh cc98
      cd ~/oo/bed
      mkdir avidMouse
      cd avidMouse
      wget http://pipeline.lbl.gov/tableCS-LBNL.txt
      hgAvidShortBed *.txt avidRepeat.bed avidUnique.bed
      hgLoadBed hg10 avidRepeat avidRepeat.bed
      hgLoadBed hg10 avidUnique avidUnique.bed


LOAD SNPS (todo)
      - ssh hgwdev
      - cd ~/oo/bed
      - mkdir snp
      - cd snp
      - Download SNPs from ftp://ftp.ncbi.nlm.nih.gov/pub/sherry/gp.oo33.gz
      - Unpack.
        grep RANDOM gp.oo33 > snpTsc.txt
        grep MIXED  gp.oo33 >> snpTsc.txt
        grep BAC_OVERLAP  gp.oo33 > snpNih.txt
        grep OTHER  gp.oo33 >> snpNih.txt
        awk -f filter.awk snpTsc.txt > snpTsc.contig.bed
        awk -f filter.awk snpNih.txt > snpNih.contig.bed
        liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed
        liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed
	hgLoadBed hg10 snpTsc snpTsc.bed
	hgLoadBed hg10 snpNih snpNih.bed

LOAD CPGISLANDS (done)
     - login to hgwdev
     - cd /cluster/store1/gs.11/build28/bed
     - mkdir cpgIsland
     - cd cpgIsland
     - hg10 < ~kent/src/hg/lib/cpgIsland.sql
     - wget http://genome.wustl.edu:8021/pub/gsc1/achinwal/MapAccessions/Freezes/cpg_dec2001.tar.gz
     - tar -zxf cpg*.tar.gz
     - copy filter.awk from a previous release, e.g. ~kent/oo.33/bed/cpgIsland
     - awk -f filter.awk */*/NT_*/*.cpg > contig.bed
     - liftUp cpgIsland.bed ../../jkStuff/liftAll.lft warn contig.bed
     - Enter database with "hg10" command.
     - At mysql> prompt type in:
          load data local infile 'cpgIsland.bed' into table cpgIsland;

LOAD ENSEMBL GENES (done)
     cd ~/oo/bed
     mkdir ensembl
     cd ensembl
     Place the gtf dump file, xxx.gtf.gz (for ensembl gene
     predictions) here. (e.g. wget
     http://www.sanger.ac.uk/~birney/all_april_ctg.gtf.gz)
     gunzip xxx.gtf.gz
     Starting from Hg10, ensembl seems to have done their own lifting,
     so just collect all chromosome .gtf files.
     cat ?.gtf ??.gtf >all.gtf

     Add "chr" to front of each line to make it compatible with ldHgGene 
     addchr all.gtf > ensembl.gtf
     rm all.gtf
     ldHgGene hg10 ensGene ensembl.gtf
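
addchr appears to be a local utility.  If it is not at hand, the effect
described above (putting "chr" at the front of each line) can be had with
sed.  A sketch, assuming the GTF has no comment or blank lines that should
stay unprefixed; the function name add_chr is hypothetical.

```shell
#!/bin/sh
# Prefix every line read from stdin with "chr", so that a sequence name
# such as 22 in the first column becomes chr22.
add_chr() {
    sed 's/^/chr/'
}
```

It would be used as: add_chr < all.gtf > ensembl.gtf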
o - Load Ensembl peptides:
     (poke around ensembl to get their peptide files as ensembl.pep)
     
     wget ftp://ftp.ensembl.org/pub/current_human/data/fasta/pep/*s.pep.fa.gz

     gunzip Homo_sapiens.pep.fa.gz
     
     Substitute ENST for ENSP in ensembl.pep with subs:
     edit subs.in to read: ENSP|ENST

     subs -e Homo_sapiens.pep.fa >/dev/null

     hgPepPred hg10 generic ensPep ensembl.pep

LOAD SANGER22 GENES 
      cd ~/oo/bed
      mkdir sanger22
      cd sanger22
      (not sure where these files were downloaded from)
      grep -v Pseudogene Chr22*.genes.gff | hgSanger22 hg10 stdin Chr22*.cds.gff *.genes.dna *.cds.pep 0 \
          | ldHgGene hg10 sanger22pseudo stdin
  Note: this creates sanger22extras, but doesn't currently create
  a correct sanger22 table, which is replaced in the next steps
      sanger22-gff-doctor Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
          | ldHgGene hg10 sanger22 stdin
      sanger22-gff-doctor -pseudogenes Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
          | ldHgGene hg10 sanger22pseudo stdin

      hgPepPred hg10 generic sanger22pep *.pep

LOAD SANGER 20 GENES (todo)
     First download files from James Gilbert's email to ~/oo/bed/sanger20 and
     go to that directory while logged onto hgwdev.  Then:
        grep -v Pseudogene chr_20*.gtf | ldHgGene hg10 sanger20 stdin
	hgSanger20 hg10 *.gtf *.info


LOAD RNAGENES (todo)
      - login to hgwdev
      - cd ~kent/src/hg/lib
      - hg10 < rnaGene.sql
      - cd /cluster/store1/gs.11/build28/bed
      - mkdir rnaGene
      - cd rnaGene
      - download data from ftp.genetics.wustl.edu/pub/eddy/pickup/ncrna-oo27.gff.gz
      - gunzip *.gz
      - liftUp chrom.gff ../../jkStuff/liftAll.lft carry ncrna-oo27.gff
      - hgRnaGenes hg10 chrom.gff

LOAD EXOFISH (todo)
     - login to hgwdev
     - cd /cluster/store1/gs.11/build28/bed
     - mkdir exoFish
     - cd exoFish
     - hg10 < ~kent/src/hg/lib/exoFish.sql
     - Put email attachment from Olivier Jaillon (ojaaillon@genoscope.cns.fr)
       into /cluster/store1/gs.11/build28/bed/exoFish/all_maping_ecore
     - awk -f filter.awk all_maping_ecore > exoFish.bed
     - hgLoadBed hg10 exoFish exoFish.bed

LOAD MOUSE SYNTENY (done)
     - login to hgwdev.
     - cd ~/kent/src/hg/lib
     - hg10 < mouseSyn.sql
     - mkdir ~/oo/bed/mouseSyn
     - cd ~/oo/bed/mouseSyn
     - Put Dianna Church's (church@ncbi.nlm.nih.gov) email attachment as
       mouseSyn.txt
     - awk -f format.awk *.txt > mouseSyn.bed
     - delete first line of mouseSyn.bed
     - Enter database with "hg10" command.
     - At mysql> prompt type in:
          load data local infile 'mouseSyn.bed' into table mouseSyn;


LOAD GENIE (todo)
     - cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf
     - liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf
     - ldHgGene hg10 genieAlt predChrom.gtf

     - cat */ctg*/ctg*.affymetrix.aa > pred.aa
     - hgPepPred hg10 genie pred.aa 

     - hg10
         mysql> delete from genieAlt where name like 'RS.%';
         mysql> delete from genieAlt where name like 'C.%';

LOAD SOFTBERRY GENES (done)
     - ln -s /cluster/store1/gs.11/build28/ ~/oo
     - cd ~/oo/bed
     - mkdir softberry
     - cd softberry
     - wget ftp://www.softberry.com/pub/SC_HUM_DEC01/soft_fg_hum_dec01.tar.gz
     - mkdir output
     - Run the fixFormat.pl routine in
     	~/oo/bed/softberry like so:
        ./fixFormat.pl output
     - This will stick all the properly converted .gff and .pro files in
     the directory named "output".
     - cd output
     ldHgGene hg10 softberryGene chr*.gff
     hgPepPred hg10 softberry *.pro
     hgSoftberryHom hg10 *.pro
     - RT 283 fixed 7/19/02:
     cd ~/hg10/bed/softberry/output
     foreach f (chr*.gff)
       set f2 = $f:r-fixed.gff
       ../removeBogusExons.pl $f > $f2
     end
     ldHgGene hg10 softberryGene chr*-fixed.gff

LOAD GENEID GENES (done)
     mkdir ~/oo/bed/geneid
     cd ~/oo/bed/geneid
     mkdir download
     cd download
   Now download *.gtf and *.prot from 
   http://www1.imim.es/genepredictions/H.sapiens/golden_path_20011222/geneid_v1.1/
     cd ..
     ldHgGene hg10 geneid download/*.gtf -exon=CDS
     hgPepPred hg10 generic geneidPep download/*.prot

SGP GENE PREDICTIONS
     mkdir ~/oo/bed/sgp
     cd ~/oo/bed/sgp
     mkdir download
     cd download
   Now download *.gtf and *.prot from 
   http://www1.imim.es/genepredictions/H.sapiens/golden_path_20011222/SGP/
   Get rid of the extra .N in the transcripts with subs.  
     cd ..
     cp ~/oo/bed/geneid/subs.in .
     subs -e download/*.gtf > /dev/null
     ldHgGene hg10 sgpGene download/*.gtf -exon=CDS
     hgPepPred hg10 generic sgpPep download/*.prot


LOAD ACEMBLY
    - Get acembly*gene.gff from Jean and Michelle Thierry-Miegs and
      place in ~/oo/bed/acembly
    - cd ~/oo/bed/acembly
    - Replace c_chr with chr in acembly*.gff
    - Use /cluster/store1/gs.11/build28/mattStuff/filterGff.pl to do the above
      and to concatenate the results into the aceChrom.gff file.
      Read the script to see what it does. It's tiny and simple.
    - Use /cluster/store1/gs.11/build28/mattStuff/filterProteins.pl to do the
      steps below.
    - Remove G_t*_  and likewise G_t2_, etc., from acemblyproteins.*.fasta
       and concatenate all the protein.fasta files into a single acembly.pep file
    - Load into database as so:
        ldHgGene hg10 acembly aceChrom.gff
        hgPepPred hg10 generic acemblyPep acembly.pep

LOAD GENOMIC DUPES (todo)
o - Load genomic dupes
    ssh hgwdev
    cd ~/oo/bed
    mkdir genomicDups
    cd genomicDups
    wget http://codon/jab/web/takeoff/oo33_dups_for_kent.zip
    unzip *.zip
    awk -f filter.awk oo33_dups_for_kent > genomicDups.bed
    mysql -u hgcat -pbigSECRET hg10 < ~/src/hg/lib/genomicDups.sql
    hgLoadBed hg10 -oldTable genomicDups genomicDups.bed

FAKING DATA FROM PREVIOUS VERSION
(This is just for until proper track arrives.  Rescues about
97% of data.  Just an experiment, not really followed through on.)

o - Rescuing STS track:
     - log onto hgwdev
     - mkdir ~/oo/rescue
     - cd !$
     - mkdir sts
     - cd sts
     - bedDown hg3 mapGenethon sts.fa sts.tab
     - echo ~/oo/sts.fa > fa.lst
     - pslOoJobs ~/oo ~/oo/rescue/sts/fa.lst ~/oo/rescue/sts g2g
     - log onto cc01
     - cd ~/oo/rescue/sts
     - split all.con into 3 parts and condor_submit each part
     - wait for assembly to finish
     - cd psl
     - mkdir all
     - ln ?/*.psl ??/*.psl *.psl all
     - pslSort dirs raw.psl temp all
     - pslReps raw.psl contig.psl /dev/null
     - rm raw.psl
     - liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl
     - rm contig.psl
     - mv chrom.psl ../convert.psl


LOADING MOUSE MM2 BLASTZ ALIGNMENTS FROM PENN STATE: (markd)

    - loading both blastz alignments and reference (single coverage) alignments
    - in xAli format, which includes sequence
    - done in a tmp dir and intermediate files discarded

    - create psl files for each per-contig lav file

       set sc=""
       set tbl="blastzMm2"
       foreach chrdir (/cluster/store1/gs.11/build28/bed/blastz.mm2.2002-04-14/lav/chr*)
         set chr=$chrdir:t
         set outdir=${tbl}.psl/$chr
         mkdir -p $outdir
         foreach lav ($chrdir/*.lav${sc})
           lavToPsl -target-strand=+ $lav $outdir/$lav:t:r.psl
         end
       end

    - Make temporary per-contig mouse nib files for use by pslToXa
      (which doesn't currently support FASTA)
        mkdir -p mm.contig.nib
        foreach fa (/cluster/store2/mm.2002.02/mm2/*/chr*/chr*.fa)
            set nib=mm.contig.nib/$fa:r:t.nib
            faToNib $fa $nib
        end

    - Convert to per-chromosome files, sort, and add sequence
       mkdir -p ${tbl}.xa
       foreach chrdir (${tbl}.psl/*)
         set chr=$chrdir:t
         pslCat -check -nohead -ext=.psl -dir ${tbl}.psl/$chr \
          | sort -k 15n -k 16n \
          | pslToXa stdin ${tbl}.xa/${chr}_${tbl}.xa mm.contig.nib /cluster/store1/gs.11/build28/nib
       end

    - repeat both loops, this time doing the single-coverage alignment:
        set sc=".sc"
        set tbl="blastzMm2Sc"
        <above loops>

    - Load tables
        - alter table chr19_blastzMm2  max_rows=10000000 avg_row_length=1200;

        cd blastzMm2.xa
        hgLoadPsl -xa hg10 *.xa
        cd ../blastzMm2Sc.xa
        hgLoadPsl -xa hg10 *.xa

   - Load aligned ancient repeats, from
       /cluster/store1/gs.11/build28/bed/blastz.mm2.2002-04-14/aar
     Ryan created:
       /cluster/store1/gs.11/build28/bed/blastz.mm2.2002-04-14/aar/xali
     - Loaded into aarMm2, delete repeats with < 50 aligned positions
     - The original data contained many duplicated alignments, so did a 
           sort -u | sort -k 15n -k 16n
       on each PSL before loading.
     - Need to filter for #aligned < 50
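
The de-duplication step above can be wrapped up as a filter.  In the
standard PSL layout, fields 15 and 16 are the target size and target
start, so within one chromosome the second sort orders alignments by
where they start on the target.  A minimal sketch, assuming headerless
PSL on stdin; the function name dedup_psl is hypothetical, and the filter
for fewer than 50 aligned positions is not attempted here.

```shell
#!/bin/sh
# dedup_psl: read headerless PSL on stdin, drop exact duplicate lines,
# then order numerically by field 15 and, within ties, by field 16,
# using the same sort commands as in the text.
dedup_psl() {
    sort -u | sort -k 15n -k 16n
}
```

It would be used as: dedup_psl < in.psl > out.psl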
        
   - Append mm2 nib info to mouseChrom table with mm2. prefix
     - cd  /cluster/store2/mm.2002.02/mm2/
     - hgNibSeq -preMadeNib -table=mouseChrom -chromPrefix=mm2. -append hg10 /cluster/store2/mm.2002.02/mm2/nib ?/chr*.fa ??/chr*.fa

