This file describes how we made the browser database on 
NCBI build 29 (April, 2002 freeze)

(The numbered stuff was brought in from
 /cluster/store1/gs.12/build29/build.ncbi.doc)

HOW TO BUILD AN ASSEMBLY FROM NCBI FILES
----------------------------------------

NOTE: It is best to run most of this stuff on kkstore since it
can handle files larger than 2 GB.

1) Download seq_contig.md, ncbi_buildXX.agp,
	contig_overlaps.agp and contig fa file into directory.

2) Unpack contig fa file into ../ffa/ncbi_buildXX.fa

#2.1) Extract Hs to NT conversion from .fa files to convert .agp file (NOT NEEDED ANYMORE)
#
#	/cluster/bin/scripts/extractHs ncbi_buildXX.fa
#
#2.2) Create allcontig.agp.buildXX file (NOT NEEDED ANYMORE)
#
#	/cluster/bin/scripts/convertHsAgp hs.to.nt <agp file> > allcontig.agp.buildXX
	
2.3) Sanity check things with
        ~kent/bin/i386/checkYbr ncbi_buildXX.agp ../ffa/ncbi_buildXX.fa seq_contig.md
     report any errors back to Richa and Greg at NCBI.

3) Convert fa files into UCSC style fa files and place in "contigs" directory
	mkdir contigs
	/cluster/bin/i386/faNcbiToUcsc -split -ntLast ncbi_buildXX.fa contigs

4) Create lift files (this will create chromosome directory structure) and inserts file

	/cluster/bin/scripts/createNcbiLifts seq_contig.md .

5) Create contig agp files (will create contig directory structure)
	
	/cluster/bin/scripts/createNcbiCtgAgp seq_contig.md ncbi_buildXX.agp .

5.1) Create contig gl files

        ~kent/bin/i386/agpToGl contig_overlaps.agp . -md=seq_contig.md

6) Create chromosome agp files

	/cluster/bin/scripts/createNcbiChrAgp .

6.1) Copy over jkStuff
	mkdir jkStuff
        cp ../../gs.11/build28/jkStuff/*.sh jkStuff
        cp ../../gs.11/build28/jkStuff/*.csh jkStuff
        cp ../../gs.11/build28/jkStuff/*.gsub jkStuff        

6.2) Patch in size of chromosome Y into Y/lift/ordered.lft
     by grabbing it from the last line of Y/chrY.agp

6.3) Create chromosome gl files
  
        jkStuff/liftGl.sh contig.gl

7) Distribute contig .fa and .out files to appropriate directory (assumes all files
   are in "contigs" directory).

	/cluster/bin/scripts/distNcbiCtgFa contigs .

8) Reverse complement NT contig fa files that are flipped in the assembly
   (uses faRc program)

	/cluster/bin/scripts/revCompNcbiCtgFa seq_contig.md .
	
9) Generate RepeatMasked files for contigs (Patrick)
   For the NCBI assembly we repeat mask on the sensitive mode setting.

        cd ~/oo
        /cluster/bin/scripts/RMfa RMJobs */NT_*/*.fa
        log into kk
        cd ~/oo
        para create RMJobs
        para try
        make sure jobs don't die right away
        para push

10) Lift up RepeatMask .out files to chromosome coordinates via
       tcsh jkStuff/liftOut2.sh

11) Generate contig and chromosome level masked and unmasked files via:
       tcsh jkStuff/chrFa.sh
       tcsh jkStuff/makeFaMasked.sh

12) Copy all contig and chrom fa files to /scratch on kkstore to get ready for
    cluster jobs, and ask to propagate to nodes

	/cluster/bin/scripts/cpNcbiFaScratch . </scratch/hg/gs.X/>

13) Create jkStuff/ncbi.lft for lifting stuff built w/NCBI assembly.
    Note: this ncbi.lft will not lift floating contigs to chr_random coords,
    but it will show the strand orientation of the floating contigs 
    (grep for '|').
	mdToNcbiLift seq_contig_randoms.md jkStuff/ncbi.lft 
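
Sketch for step 6.2 (not part of the original procedure): in an AGP file
the third column is the end coordinate of each line on the chromosome, so
the chromosome size is column 3 of the last line.  The helper name
agpChromSize is made up here, it is written in plain sh rather than tcsh,
and it assumes the AGP lines are in coordinate order.  Patching the value
into Y/lift/ordered.lft is still done by hand.

```shell
# Print the end coordinate (column 3) of the last line of an AGP file.
# Assumes the lines are in coordinate order, so the last line ends the
# chromosome.
agpChromSize() {
    tail -n 1 "$1" | awk '{ print $3 }'
}
```

Usage:
	agpChromSize Y/chrY.agp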


CREATING DATABASE  (DONE)

o - ln -s /cluster/store1/gs.12/build29 ~/oo
o - Make sure there is at least 5 gig free on hgwdev:/usr/local/mysql 
o - Create the database.
     - ssh hgwdev
     - Enter mysql via:
           hgsql
     - At mysql prompt type:
	create database hg11;
	quit
     - make a semi-permanent read-only alias:
        alias hg11 mysql -u hguser -phguserstuff -A hg11
o - Tell the hgCentral database about it.  Log onto genome-centdb
    and enter mysql via
        mysql -u root -pbigSecret hgCentral
    At the mysql prompt type:
       insert into dbDb values("hg11", "Human April 2002", "/cluster/store1/gs.12/build29/nib", "Human", "USP18");
o - Create the trackDb table as so
       cd ~/src/hg/makeDb/hgTrackDb
    Edit makefile to add hg11 after hg10 and do
       make update
       cvs commit makefile
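
Sketch for the free-space check above (5 gig on hgwdev:/usr/local/mysql).
This is one way to script it, in plain sh rather than tcsh; the helper
name hasFreeGb is made up, and 1 gig is taken to be 1024*1024 kilobytes.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 gigabytes
# available.  df -Pk prints one header line and then one line per
# filesystem in 1K blocks; column 4 is the available space.
hasFreeGb() {
    availKb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$availKb" -ge $(( $2 * 1024 * 1024 )) ]
}
```

Usage (on hgwdev):
	hasFreeGb /usr/local/mysql 5 || echo "not enough room for hg11"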


LOAD REPEAT MASKS (DONE 7/10/02)
    Load the RepeatMasker .out files into the database with:
       hgLoadOut hg11 ?/*.fa.out ??/*.fa.out

STORING O+O SEQUENCE AND ASSEMBLY INFORMATION (DONE)

Create packed chromosome sequence files 
     ssh kkstore
     cd ~/oo
     tcsh jkStuff/makeNib.sh

Load chromosome sequence info into database
     ssh hgwdev
     hgsql hg11 < ~/src/hg/lib/chromInfo.sql
     cd ~/oo
     hgNibSeq -preMadeNib hg11 /cluster/store1/gs.12/build29/nib ?/chr*.fa ??/chr*.fa

Store o+o info in database.
     cd /cluster/store1/gs.12/build29
     jkStuff/liftGl.sh contig.gl
     hgGoldGapGl hg11 /cluster/store1/gs.12 build29 
     cd /cluster/store1/gs.12
     hgClonePos hg11 build29 ffa/sequence.inf /cluster/store1/gs.12 -maxErr=3
(Ignore warnings about missing clones - these are in chromosomes 21 and 22)
     hgCtgPos hg11 build29 

Make and load GC percent table
     ssh hgwdev
     cd ~/oo
     mkdir -p bed/gcPercent
     cd bed/gcPercent
     hgsql hg11  < ~/src/hg/lib/gcPercent.sql
     hgGcPercent hg11 ../../nib

SIMPLE REPEAT TRACK (DONE)

o - Create cluster parasol job like so:
        ssh kk
	cd ~/oo/bed
	mkdir simpleRepeat
	cd simpleRepeat
	cp /cluster/store1/gs.11/build28/bed/simpleRepeat/gsub ./gsub
	mkdir trf
	ls -1S /scratch/hg/gs.12/build29/contig/*.fa > genome.lst
	gensub2 genome.lst single gsub spec
	para create spec
	para try
	para check
	para push
        liftUp simpleRepeat.bed ~/oo/jkStuff/liftAll.lft warn trf/*.bed

o - Load this into the database as so
        ssh hgwdev
	cd ~/oo/bed/simpleRepeat
	hgLoadBed hg11 simpleRepeat simpleRepeat.bed -sqlTable=$HOME/src/hg/lib/simpleRepeat.sql


PREPARING SEQUENCE FOR CROSS SPECIES ALIGNMENTS (DONE)

Make sure that the NT*.fa files are lower-case repeat masked.
Do something much like the simpleRepeat track, but only
masking out stuff with a period of 12 or less as so:
    ssh kk
    cd ~/oo/bed
    mkdir trfMask
    cd trfMask

# I couldn't find a valid gsub according to these instructions so I used the one
#  from /cluster/store1/gs.12/build29.bad/bed/trfMask
#  instead of doing ->  cp ~/lastOo/bed/trfMask/gsub ./gsub
    mkdir trf
    ls -1S /scratch/hg/gs.12/build29/contig/*.fa > genome.lst
    gensub2 genome.lst single gsub spec
    para create spec
    para try
    para check
    para push
When that is done do:
    ssh kkstore
    mkdir /scratch/hg/gs.12/build29/trfFa
    cd ~/oo
NOTE: Below is a tcsh script
    foreach i (? ??)
        cd $i
        foreach j (NT*)
            maskOutFa $j/$j.fa ../bed/trfMask/trf/$j.bed -softAdd /scratch/hg/gs.12/build29/trfFa/$j.fa.trf
            echo done $i/$j
        end
        cd ..
    end
Then ask admins to do a binrsync. DONE



GETTING FRESH mRNA AND EST SEQUENCE FROM GENBANK. (DONE)

This will create a genbank.129 directory containing compressed
GenBank flat files and a mrna.129 containing unpacked sequence
info and auxiliary info in a relatively easy to parse (.ra) 
format.

  o - Point your browser to ftp://ncbi.nlm.nih.gov/genbank and
      look at the README.  Figure out the current release number
      (which is 129).
  o - Consider deleting one of the older genbank releases.  It's
      good to at least keep one previous release though.
  o - Where there is space make a new genbank directory.  Create a
      symbolic link to it:
	  mkdir /cluster/store1/genbank.129
          ln -s /cluster/store1/genbank.129 ~/genbank
  o - cd ~/genbank
  o - ftp ncbi.nlm.nih.gov  (do anonymous log-in).  Then do the
      following commands inside ftp:
      	   cd genbank
           prompt
	   mget gbpri* gbrod* gbv* gbsts* gbest* gbmam* gbinv*
      This will take at least 2 hours.
  o - Log onto server and change to your genbank directory.
  o - cd /cluster/store1
  o - mkdir mrna.129

  o - cd mrna.129
  o - gunzip -c /cluster/store1/genbank.129/gbpri*.gz | gbToFaRa ~kent/hg/h/mrna.fil mrna.fa mrna.ra mrna.ta stdin
  o - gunzip -c /cluster/store1/genbank.129/gbpri*.gz | gbToFaRa ~kent/hg/h/mrna.fil mrna.fa mrna.ra mrna.ta stdin -byOrganism=org

  o - gunzip -c /cluster/store1/genbank.129/gbest*.gz | gbToFaRa ~kent/hg/h/mrna.fil est.fa est.ra est.ta stdin
  o - gunzip -c /cluster/store1/genbank.129/gbest*.gz | gbToFaRa ~kent/hg/h/mrna.fil est.fa est.ra est.ta stdin -byOrganism=org

  o - gunzip -c /cluster/store1/genbank.129/gbest*.gz | gbToFaRa ~kent/hg/h/xenoRna.fil xenoEst.fa xenoEst.ra xenoEst.ta stdin
  o - gunzip -c /cluster/store1/genbank.129/gbest*.gz | gbToFaRa ~kent/hg/h/xenoRna.fil xenoEst.fa xenoEst.ra xenoEst.ta stdin -byOrganism=org

  o - gunzip -c /cluster/store1/genbank.129/gbpri*.gz /cluster/store1/genbank.129/gbmam*.gz /cluster/store1/genbank.129/gbrod*.gz /cluster/store1/genbank.129/gbvrt*.gz /cluster/store1/genbank.129/gbinv*.gz | gbToFaRa ~kent/hg/h/xenoRna.fil xenoRna.fa xenoRna.ra xenoRna.ta stdin -byOrganism=org

  o - cd /cluster/store1/genbank.129
  o - gunzip -c gbpri*.gz gbmam*.gz gbrod*.gz gbvrt*.gz gbinv*.gz | gbToFaRa ~kent/hg/h/xenoRna.fil ../mrna.129/xenoRna.fa ../mrna.129/xenoRna.ra ../mrna.129/xenoRna.ta stdin

STORING mRNA/EST SEQUENCE AND AUXILIARY INFO  (DONE)

o - Store the mRNA (non-alignment) info in database.
     hgLoadRna new hg11
     hgLoadRna add hg11 /cluster/store1/mrna.129/mrna.fa /cluster/store1/mrna.129/mrna.ra
     hgLoadRna add hg11 /cluster/store1/mrna.129/est.fa /cluster/store1/mrna.129/est.ra

    The last line will take quite some time to complete.  It will count up to
    about 3,800,000 before it is done.
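
If the counter printed by hgLoadRna tracks the number of sequences loaded
(an assumption; the note above only says that it counts up), the final
value can be estimated beforehand by counting the header lines in the
.fa file.  A sketch in plain sh; the helper name countFastaRecords is
made up.

```shell
# Print the number of records in a FASTA file, i.e. the number of lines
# that start with '>'.
countFastaRecords() {
    grep -c '^>' "$1"
}
```

Usage:
	countFastaRecords /cluster/store1/mrna.129/est.fa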


MAKING AND STORING mRNA AND EST ALIGNMENTS (DONE)

o - Load up the local disks of the cluster with refSeq.fa, mrna.fa and est.fa
    Copy the above 3 files from /cluster/store1/mrna.129 into /scratch/hg/h/mrna
    Request the admins to do a binrsync to the cluster.
DONE
o - Use BLAT to generate refSeq, mRNA and EST alignments as so:
      Make sure that /scratch/hg/gs.12/build29/contig is loaded
      with NT_*.fa and pushed to the cluster nodes.
          ssh kkstore

	  cd ~/oo/bed
	  foreach i (refSeq mrna est)
	      mkdir -p $i
	      cd $i
              cp ~kent/lastOo/bed/$i/gsub .
	      echo /scratch/hg/gs.12/build29/contig/*.fa | wordLine stdin > genome.lst
	      ls -1 /scratch/hg/h/mrna/$i.fa > mrna.lst
	      mkdir -p psl
	      gensub2 genome.lst mrna.lst gsub spec
	      para create spec
              cd ..
	  end 
DONE

    Now, by hand cd to the mrna, refSeq, and est directories respectively
     and run a para push and para check in each one. DONE
      
o - Process refSeq, mRNA and EST alignments into near best in genome.
      cd ~/oo/bed

      cd refSeq
      pslSort dirs raw.psl /cluster/fast1/temp psl
      pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 raw.psl contig.psl /dev/null
      liftUp -nohead all_refSeq.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /cluster/fast1/temp all_refSeq.psl
      cd ..
DONE

      cd mrna
      pslSort dirs raw.psl /cluster/fast1/temp psl
      pslReps -minAli=0.96 -nearTop=0.01 raw.psl contig.psl /dev/null
      liftUp -nohead all_mrna.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /cluster/fast1/temp all_mrna.psl
      cd ..
DONE

      cd est
      pslSort dirs raw.psl /cluster/fast1/temp psl
      pslReps -minAli=0.93 -nearTop=0.01 raw.psl contig.psl /dev/null
      liftUp -nohead all_est.psl ../../jkStuff/liftAll.lft carry contig.psl
      pslSortAcc nohead chrom /cluster/fast1/temp all_est.psl
      cd ..
DONE

o - Load refSeq alignments into database DONE
      ssh hgwdev
      cd /cluster/store1/gs.12/build29/bed/refSeq
      pslCat -dir chrom > refSeqAli.psl
      hgLoadPsl hg11 -tNameIx refSeqAli.psl

o - Load mRNA alignments into database. DONE
      ssh hgwdev
      cd /cluster/store1/gs.12/build29/bed/mrna/chrom
      foreach i (*.psl)
          mv $i $i:r_mrna.psl
      end
      hgLoadPsl hg11 *.psl
      cd ..
      hgLoadPsl hg11 all_mrna.psl -nobin

o - Load EST alignments into database. DONE
      ssh hgwdev
      cd /cluster/store1/gs.12/build29/bed/est/chrom
      foreach i (*.psl)
            mv $i $i:r_est.psl
      end
      hgLoadPsl hg11 *.psl
      cd ..
      hgLoadPsl hg11 all_est.psl -nobin

o - Create subset of ESTs with introns and load into database. DONE
      - ssh kkstore
      cd ~/oo
      tcsh jkStuff/makeIntronEst.sh
      - ssh hgwdev
      cd ~/oo/bed/est/intronEst
      hgLoadPsl hg11 *.psl

o - Put orientation info on ESTs into database:
     ssh kkstore
     cd ~/oo/bed/est
     pslSortAcc nohead contig /cluster/fast1/temp contig.psl
     mkdir /scratch/hg/gs.12/build29/bed
     cp -r contig /scratch/hg/gs.12/build29/bed/est
     sudo /cluster/install/utilities/updateLocal
     cd ~/oo/bed
     mkdir estOrientInfo
     cd estOrientInfo
     mkdir ei
     ls -1 /scratch/hg/gs.12/build29/bed/est/*.psl > psl.lst

   Now ssh to kk and cd to ~/oo/bed/estOrientInfo.  Copy in
   gsub from the previous version and edit it to say where
   things are located in scratch on this version.  Then:
     gensub2 psl.lst single gsub spec
     para create spec
     para try 
     para push
   check until done, or use 'para shove'

When the cluster run is done do:
liftUp estOrientInfo.bed ~/oo/jkStuff/liftAll.lft warn ei/*.tab
hgLoadBed hg11 estOrientInfo estOrientInfo.bed -sqlTable=$HOME/src/hg/lib/estOrientInfo.sql
DONE

o - Create rnaCluster table
   ssh hgwdev
   cd ~/oo
   mkdir -p bed/rnaCluster/chrom
   foreach i (? ??)
       cd $i
       foreach j (chr*.fa)
	   set c = $j:r
	   echo clusterRna hg11 /dev/null ../bed/rnaCluster/chrom/$c.bed -chrom=$c
	   clusterRna hg11 /dev/null ../bed/rnaCluster/chrom/$c.bed -chrom=$c
       end
       cd ..
   end
   cd bed/rnaCluster
   hgLoadBed hg11 rnaCluster chrom/*.bed
DONE
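
The mRNA and EST loading steps above rename the per-chromosome .psl files
so that each file name matches its table (chr1.psl becomes chr1_mrna.psl).
The tcsh loops will rename the files a second time if they are rerun.  A
sketch of a rerun-safe version in plain sh; the helper name addTableSuffix
is made up.

```shell
# Rename every *.psl file in directory $1 to <root>_$2.psl,
# e.g. chr1.psl -> chr1_mrna.psl.  Files that already carry the suffix
# are left alone, so running it twice is harmless.
addTableSuffix() {
    for f in "$1"/*.psl; do
        [ -e "$f" ] || continue       # glob matched nothing
        case "$f" in
            *_"$2".psl) continue ;;   # already renamed
        esac
        mv "$f" "${f%.psl}_$2.psl"
    done
}
```

Usage:
	addTableSuffix /cluster/store1/gs.12/build29/bed/mrna/chrom mrna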

PRODUCING KNOWN GENES (DONE)

o - Download everything from ftp://ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/
    into /cluster/store1/mrna.129/refSeq. DONE
o - Unpack this into fa files and get extra info with:
       cd /cluster/store1/mrna.129/refSeq
       gunzip hs.gbff.gz
       gunzip hs.faa.gz
       gbToFaRa ~/hg/h/mrna.fil ../refSeq.fa ../refSeq.ra ../refSeq.ta hs.gbff
DONE
o - Align refSeq.fa to genome as described under mRNA/EST alignments above. DONE
o - Get extra info from NCBI and produce refGene table as so:
       ssh hgwdev
       cd ~/oo/bed
       mkdir refSeq
       cd refSeq
       wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/loc2ref
       wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/mim2loc
DONE
o - Similarly download refSeq proteins in fasta format to refSeq.pep - I believe this is hs.faa
o - RefSeq should have already been aligned to the genome by processes 
        described under mRNA/EST alignments above.
o - Produce refGene, refPep, refMrna, and refLink tables as so:
       ssh hgwdev
       cd ~/oo/bed/refSeq
       ln -s /cluster/store1/mrna.129 mrna
       hgRefSeqMrna hg11 mrna/refSeq.fa mrna/refSeq.ra all_refSeq.psl loc2ref mrna/refSeq/hs.faa mim2loc
DONE
o - Add Jackson labs info DONE
     cd ~/oo/bed
     mkdir jaxOrtholog
     cd jaxOrtholog
     ftp ftp://ftp.informatics.jax.org/pub/informatics/reports/HMD_Human3.rpt
     awk -f filter.awk *.rpt > jaxOrtholog.tab
    Load this into mysql with something like:
     mysql -u hgcat -pBIGSECRET hg11 < ~/src/hg/lib/jaxOrtholog.sql
     mysql -u hgcat -pBIGSECRET -A hg11
    and at the mysql> prompt
     load data local infile 'jaxOrtholog.tab' into table jaxOrtholog;
o - Add RefSeq status info (DONE 6/19/02)
    hgRefSeqStatus hg11 loc2ref


PRODUCING GENSCAN PREDICTIONS (done)
   
o - Produce contig genscan.gtf genscan.pep and genscanExtra.bed files like so:

        Load up the cluster with hard-masked contigs in
		/cluster/store1/gs.12/build29/bed/genscan/mContigs
	(For hg11, the .masked files were not saved during repeat masking.  
	So the contig (.fa) files in /cluster/store1/gs.12/build29/? and ?? 
	were processed to convert all lower case bases into N and named 
	as *.fa.masked and placed under genscan/mContigs).

        Log into kkr1u00 (not kk!).  kkr1u00 is the driver node for the small
        cluster (kkr2u00 - kkr8u00).  Genscan has problems running on the
        big cluster because of the limited memory and swap space on each
        processing node.
                cd ~/oo
                cd bed/genscan
        Make 3 subdirectories for genscan to put its output files in
                mkdir gtf pep subopt
        Generate a list file, genome.list, of all the contigs
		ls -1S ./mContigs/*.masked > genome.list
        
        Edit genome.list to remove the "*.fa.masked" files that consist
        purely of Ns due to heterochromatin (unsequenceable stuff); these
        will cause genscan to run forever.
        
	Create template file, gsub, for gensub2.  For example (a 3-line file):
                #LOOP
                /cluster/home/fanhsu/bin/i386/gsBig {check in line+ $(path1)} {check out line gtf/$(root1).gtf} -trans={check out line pep/$(root1).pep} -subopt={check out line subopt/$(root1).bed} -exe=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/genscan -par=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/HumanIso.smat -tmp=/tmp -window=2400000
                #ENDLOOP
        Create a file containing a single line.
                echo single > single
        Generate job list file, jobList, for Parasol
                gensub2 genome.list single gsub jobList

        First issue the following Parasol command:
                para create jobList
        Run the following command, which will try the first 10 jobs from jobList
                para try
        Check if these 10 jobs run OK by
                para check
        If they have problems, debug and fix your program, template file,
        commands, etc. and try again.  If they are OK, then issue the following
        command, which will ask Parasol to start all the remaining jobs.  For
	hg11, there were 2043 jobs in total.
                para push
        Issue either one of the following two commands to check the
        status of the cluster and your jobs, until they are done.
                parasol status
                para check

o - Convert these to chromosome level files as so:     
     cd ~/oo
     cd bed/genscan
     liftUp genscan.gtf ../../jkStuff/liftAll.lft warn gtf/*.gtf
     liftUp genscanSubopt.bed ../../jkStuff/liftAll.lft warn subopt/*.bed
     cat pep/*.pep > genscan.pep

o - Load into the database as so:
     ssh hgwdev
     cd ~/oo/bed/genscan
     ldHgGene hg11 genscan genscan.gtf
     hgPepPred hg11 generic genscanPep genscan.pep
     hgLoadBed hg11 genscanSubopt genscanSubopt.bed
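
Two of the hand steps in the genscan setup above can be sketched in plain
sh: turning the soft-masked contigs into hard-masked ones (lower-case
bases converted to N, as was done for hg11) and dropping the all-N files
from genome.list.  The helper names are made up, and this is not
necessarily how the hg11 files were actually produced.

```shell
# Replace lower-case (soft-masked) bases with N; '>' header lines are
# left unchanged.  Writes the hard-masked sequence to stdout.
hardMaskFa() {
    sed '/^>/!s/[a-z]/N/g' "$1"
}

# Succeed if the FASTA file contains at least one base that is not N.
hasRealSequence() {
    grep -v '^>' "$1" | grep -q '[^Nn]'
}

# Read file names on stdin and print only those that are not pure N,
# i.e. the ones worth giving to genscan.
filterGenomeList() {
    while IFS= read -r f; do
        if hasRealSequence "$f"; then
            echo "$f"
        fi
    done
}
```

Usage:
	hardMaskFa NT_xxxxxx.fa > mContigs/NT_xxxxxx.fa.masked
	ls -1S ./mContigs/*.masked | filterGenomeList > genome.list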

CREATE GOLDEN TRIANGLE (todo)

Make sure that rnaCluster table is in place.   Then
extract Affy expression info into a form suitable
for Eisen's clustering program with:
      cd ~/oo/bed
      mkdir triangle
      cd triangle
      eisenInput hg11 affyHg10.txt
Transfer this to Windows and do k-means clustering
with k=200 with cluster.  Transfer results file back
to ~/oo/bed/triangle/affyCluster_K_G200.kgg.  Then
do
      promoSeqFromCluster hg11 1000 affyCluster_K_G200.kgg kg200.unmasked
Then RepeatMask the .fa files in kg200.unmasked, and copy masked versions
to kg200.   Then
      cat kg200/*.fa > all1000.fa
and set up cluster Improbizer run to do 100 controls for every real
run on each - putting the output in imp.200.1000.e.  When improbizer
run is done make a file summarizing the runs as so:
      cd imp.200.1000.e
      motifSig ../imp.200.1000.e.iri ../kg200 motif control*
get rid of insignificant motifs with:
      cd ..
      awk '{if ($2 > $3) print; }' imp.200.1000.e.iri > sig.200.1000.e.iri
turn rest into just dnaMotifs with
      iriToDnaMotif sig.200.1000.e.iri motif.200.1000.e.txt
Extract all promoters with
      featureBits hg11 rnaCluster:upstream:1000 -bed=upstream1000.bed -fa=upstream1000.fa
Locate motifs on all promoters with
      dnaMotifFind motif.200.1000.e.txt upstream1000.fa hits.200.1000.e.txt -rc -markov=2
      liftPromoHits upstream1000.bed hits.200.1000.e.txt triangle.bed

CREATE STS/FISH/BACENDS/CYTOBANDS DIRECTORY STRUCTURE AND SETUP (done)

o - Create directory structure to hold information for these tracks
	cd /projects/hg2/booch/psl/
	mkdir gs.12
	mkdir gs.12/build29
	mkdir gs.12/build29/sts
	mkdir gs.12/build29/primers
	mkdir gs.12/build29/bacends
	mkdir gs.12/build29/fish
	mkdir gs.12/build29/cytobands

o - Copy in Makefiles from previous assembly
	cp gs.11/build28/Makefile gs.12/build29
	cp gs.11/build28/sts/Makefile gs.12/build29/sts
	cp gs.11/build28/primers/Makefile gs.12/build29/primers
	cp gs.11/build28/bacends/Makefile gs.12/build29/bacends
	cp gs.11/build28/fish/Makefile gs.12/build29/fish
	cp gs.11/build28/cytobands/Makefile gs.12/build29/cytobands

o - Update all Makefiles with latest OOVERS and GSVERS

o - Create accession_info file
	make accession_info.rdb

UPDATE STS INFORMATION (done)

o - Download and unpack updated information from dbSTS:

	In a web browser, go to ftp://ftp.ncbi.nih.gov/repository/dbSTS/.  Download 
    	dbSTS.sts, dbSTS.aliases, and dbSTS.FASTA.dailydump.Z to 
    	/projects/hg2/booch/psl/update

	-Unpack dbSTS.FASTA.dailydump.Z
	gunzip dbSTS.FASTA.dailydump.Z

o - Create updated files (takes a while, about 1.5 days right now)
	cd /projects/hg2/booch/psl/update
	make update

o - Make new directory for this info and move files there
	ssh kks00
	mkdir /cluster/store1/sts.# (# = next number not used)
	cp all.STS.fa /cluster/store1/sts.#
	cp all.primers /cluster/store1/sts.#
	cp all.primers.fa /cluster/store1/sts.#
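
The "next number not used" for the sts.# directory can be found with a
small loop.  A sketch in plain sh; the helper name nextNumberedDir is made
up, and it returns the first free number counting up from 1, so it will
reuse a gap if an older sts.# directory has been deleted.

```shell
# Print the first path of the form <base>.<n> (n = 1, 2, 3, ...) that
# does not exist yet.
nextNumberedDir() {
    n=1
    while [ -e "$1.$n" ]; do
        n=$(( n + 1 ))
    done
    echo "$1.$n"
}
```

Usage:
	mkdir `nextNumberedDir /cluster/store1/sts`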

STS ALIGNMENTS (done)
(alignments done without RepeatMasking, so start ASAP!)

o - Create full sequence alignments
	ssh kk
	cd /cluster/home/booch/sts
	- update Makefile with latest OOVERS and GSVERS
	- update stsMarkers.lst with latest location of all.STS.fa (from above)
	make new.assembly
	make jobList.scratch (if contig files propagated to nodes)
		- or -
	make jobList.disk (if contig files not propagated)
	para create jobList
	para push (or para try/para check if want to make sure it runs)
	make stsMarkers.psl

o - Copy files to final destination and remove
	ssh kks00
	make copy.assembly
	make clean.assembly

o - Create primer alignments
	ssh kk
	cd /cluster/home/booch/primers
	- update Makefile with latest OOVERS and GSVERS
	- update primers.lst with latest location of all.primers.fa (from above)
	make new.assembly
	make jobList.scratch (if contig files propagated to nodes)
		- or -
	make jobList.disk (if contig files not propagated)
	para create jobList
	para push (or para try/para check if want to make sure it runs)
	make primers.psl

o - Copy files to final destination and remove
	ssh kks00
	make copy.assembly
	make clean.assembly
	
CREATE AND LOAD STS MARKERS TRACK (done)

o - Create final version of sts sequence placements
	ssh kks00
	cd /projects/hg2/booch/psl/gs.12/build29/sts
	make stsMarkers.final

o - Create final version of primers placements
	cd /projects/hg2/booch/psl/gs.12/build29/primers
	make primers.final

o - Create bed file
	cd /projects/hg2/booch/psl/gs.12/build29
	make stsMap.bed

o - Create database tables
	ssh hgwdev
	cd /projects/hg2/booch/psl/tables
	mysql -uhgcat -pXXXXXXX < all_sts_primer.sql
	mysql -uhgcat -pXXXXXXX < all_sts_seq.sql
	mysql -uhgcat -pXXXXXXX < stsAlias.sql
	mysql -uhgcat -pXXXXXXX < stsInfo.sql
	mysql -uhgcat -pXXXXXXX < stsMap.sql

o - Load the tables
	load /projects/hg2/booch/psl/gs.12/build29/sts/stsMarkers.psl.filter.lifted into all_sts_seq	
	load /projects/hg2/booch/psl/gs.12/build29/primers/primers.psl.filter.lifted into all_sts_primer	
	load /projects/hg2/booch/psl/gs.12/build29/stsAlias.bed into stsAlias
	load /projects/hg2/booch/psl/gs.12/build29/stsInfo.bed into stsInfo
	load /projects/hg2/booch/psl/gs.12/build29/stsMap.bed into stsMap

o - Load the sequences (change sts.# to match correct location)
	hgLoadRna addSeq hg11 /cluster/store1/sts.2/all.STS.fa
	hgLoadRna addSeq hg11 /cluster/store1/sts.2/all.primers.fa
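
The "load ... into ..." lines above appear to be shorthand rather than
literal commands.  One way to carry them out, assuming each file is
tab-separated with columns in the same order as its table, is the same
"load data local infile" statement used for jaxOrtholog earlier in this
file.  A sketch in plain sh; the helper name loadTableSql is made up and
it only prints the statement.

```shell
# Print a MySQL statement that bulk-loads file $1 into table $2.
# Assumes the file's columns match the table's columns in order.
loadTableSql() {
    printf "load data local infile '%s' into table %s;\n" "$1" "$2"
}
```

Usage:
	loadTableSql /projects/hg2/booch/psl/gs.12/build29/stsMap.bed stsMap | mysql -uhgcat -pXXXXXXX hg11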


BACEND SEQUENCE ALIGNMENTS (done)
(alignments done without RepeatMasking, so start ASAP!)

o - Create full sequence alignments
	ssh kk
	cd /cluster/home/booch/bacends
	- update Makefile with latest OOVERS and GSVERS
	- update bacEnds.lst with latest location of BACends.fa (doesn't usually change)
	make new
	make jobList.scratch (if contig files propagated to nodes)
		- or -
	make jobList.disk (if contig files not propagated)
	para create jobList
	para push (or para try/para check if want to make sure it runs)
	make stsMarkers.psl

o - Copy files to final destination and remove
	ssh kks00
	make copy.assembly
	make clean.assembly

BACEND PAIRS TRACK

o - Update Makefile with location of pairs files, if necessary
	cd /projects/hg2/booch/psl/gs.12/build29/bacends
	edit Makefile (PAIRS=....)

o - Create bed file
	ssh kks00
	cd /projects/hg2/booch/psl/gs.12/build29/bacends
	make bacEndPairs.bed

o - Create database tables
	ssh hgwdev
	cd /projects/hg2/booch/psl/tables
	mysql -uhgcat -pXXXXXXX < all_bacends.sql
	mysql -uhgcat -pXXXXXXX < bacEndPairs.sql

o - Load the tables
	load /projects/hg2/booch/psl/gs.12/build29/bacends/bacEnds.psl.filter.lifted into all_bacends	
	load /projects/hg2/booch/psl/gs.12/build29/bacends/bacEndPairs.bed into bacEndPairs

o - Load the sequences (change bacends.# to match correct location)
	hgLoadRna addSeq hg11 /cluster/store1/bacends.2/BACends.fa
		
UPDATE FISH CLONES INFORMATION

o - Download the latest info from NCBI
	point browser at http://www.ncbi.nlm.nih.gov/genome/cyto/cytobac.cgi?CHR=all&VERBOSE=ctg
	change "Show details on sequence-tag" to "yes"
	change "Download or Display" to "Download table for UNIX"
	press Submit - save as /projects/hg2/booch/psl/fish/hbrc/hbrc.YYYYMMDD.table

o - Format file just downloaded
	cd /projects/hg2/booch/psl/fish/
	make HBRC

o - Copy it to the new freeze location
	cp /projects/hg2/booch/psl/fish/all.fish.format /projects/hg2/booch/psl/gs.12/build29/fish/


CREATE AND LOAD FISH CLONES TRACK
(must be done after STS markers track and BAC end pairs track)

o - Extract the file with clone positions from database
	ssh hgwdev
	mysql -uhgcat -pXXXXXXXX hg11
	mysql>  select * into outfile "/tmp/booch/clonePos.txt" from clonePos;
	mysql> quit
	mv /tmp/booch/clonePos.txt /projects/hg2/booch/psl/gs.12/build29/fish

o - Create bed file
	cd /projects/hg2/booch/psl/gs.12/build29/fish
	make bed

o - Create database table
	ssh hgwdev
	cd /projects/hg2/booch/psl/tables
	mysql -uhgcat -pXXXXXXX < fishClones.sql

o - Load the table
	load /projects/hg2/booch/psl/gs.12/build29/fish/fishClones.bed into fishClones
	

CREATE AND LOAD CHROMOSOME BANDS TRACK
(must be done after FISH Clones track) 

o - Create bed file
	ssh hgwdev
	make setBands.txt
	make cytobands.pct.ranges
	make predict

o - Create database table
	ssh hgwdev
	cd /projects/hg2/booch/psl/tables
	mysql -uhgcat -pXXXXXXX < cytoBand.sql
	

o - Load the table
	load /projects/hg2/booch/psl/gs.12/build29/cytobands/cytobands.bed into cytoBand


CREATE CHROMOSOME REPORTS


CREATE STS MAP COMPARISON PLOTS


DOING HUMAN/MOUSE ALIGNMENTS (todo)

o - Start with the mouse assembly in 1 Mb chunks lower-case
    repeat and tandem-repeat masked on kkstore by copying files there
    in the following way.

Mouse contigs:
mkdir /scratch/hg/mm2/rmsk
cp bed/rmsk/out/* /scratch/hg/mm2/rmsk
cp -R /cluster/store2/mm.2002.02/mm2/trfFa/ /scratch/hg/mm2/

Human contigs:
mkdir /scratch/hg/gs.12/build29/rmsk
cp /cluster/store1/gs.12/build29/?/*/*.out /scratch/hg/gs.12/build29/rmsk
cp /cluster/store1/gs.12/build29/??/*/*.out /scratch/hg/gs.12/build29/rmsk
cp -R /cluster/store2/gs.12/build29/bed/trfFa /scratch/hg/gs.12/build29/trfFa

    Then
        ssh kkstore
	cd ~/oo/bed
	mkdir blatMus
	cd blatMus
	ls -1 /scratch/hg/mm2/trfFa/*.fa.trf > mouseAll
	mkdir mm
	cd mm
	splitFile ../mouseAll 50 mm
	cd ..
	ls -1 mm/* > mouse.lst
    Then bundle up the human into pieces of less than 12
    meg mostly by
	ls -lhS /scratch/hg/gs.12/build29/trfFa/*.fa.trf > bigHuman
    edit this file and move all of the lines less than 3 meg
    into the file smallHuman.  Then do
        awk '{printf("%s\n", $9);}' bigHuman > bigH
        awk '{printf("%s\n", $9);}' smallHuman > smallH
	mkdir hs
	cd hs
	splitFile ../bigH 1 big
	rm big32 # Note, this is just an empty file that the splitFile program erroneously created
	splitFile ../smallH 4 small
	rm small504
        cd ..
	ls -1 hs/* > human.lst
    (The rm commands above indicate that splitFile needs a fix - the removed files are zero length.)

Copy the old gsub here
cp /cluster/store1/gs.11/build28/bed/blatMouse/gsub .
    Finally generate the job list with
        gensub2 human.lst mouse.lst gsub spec

o - Do the cluster run as so
       ssh kk
       cd ~/oo/bed/blatMus
       mkdir psl
       para create spec
       para try
    and then do para push/check/push/check/shove etc.

o - Sort alignments as so 
       ssh kkstore
       cd ~/oo/bed/blatMus
       pslCat -dir -check psl | liftUp -type=.psl stdout ../../jkStuff/liftAll.lft warn stdin | liftUp -type=.psl stdout ~/mm/jkStuff/liftAll.lft warn stdin -pslQ | pslSortAcc nohead chromPile /cluster/store2/temp stdin
o - Get rid of big pile-ups due to contamination as so:
       mkdir chrom
       cd chromPile
       foreach i (*.psl)
           echo $i
           pslUnpile -maxPile=250 $i ../chrom/$i
       end
o - Rename to correspond with tables as so and load into database:
       ssh hgwdev
       cd ~/oo/bed/blatMus/chrom
       foreach i (*.psl)
	   set r = $i:r
           mv $i ${r}_blatMus.psl
       end
       hgLoadPsl hg11 *.psl
o - load sequence into database as so:
	ssh kks00
	faSplit about /projects/hg3/mouse/arachne.3/whole/Unplaced.mfa 1200000000 /projects/hg3/mouse/arachne.3/whole/unplaced
	ssh hgwdev
	hgLoadRna addSeq '-abbr=gnl|' hg11 /projects/hg3/mouse/arachne.3/whole/unpla*.fa
	hgLoadRna addSeq '-abbr=con' hg11 /projects/hg3/mouse/arachne.3/whole/SET*.mfa
    This will take quite some time, perhaps an hour.

o - Produce 'best in genome' filtered version:
        ssh kks00
	cd ~/mouse/vsOo33
	pslSort dirs blatMouseAll.psl temp blatMouse
	pslReps blatMouseAll.psl bestMouseAll.psl /dev/null -singleHit -minCover=0.3 -minIdentity=0.1
	pslSortAcc nohead bestMouse temp bestMouseAll.psl
	cd bestMouse
        foreach i (*.psl)
	   set r = $i:r
           mv $i ${r}_bestMouse.psl
        end
o - Load best in genome into database as so:
	ssh hgwdev
	cd ~/mouse/vsOo33/bestMouse
        hgLoadPsl hg11 *.psl
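
In the human setup earlier in this section, bigHuman is split by hand into
files of 3 meg or more and files under 3 meg.  A sketch of doing that split
by file size in plain sh, writing name lists that can stand in for bigH
and smallH directly; the helper name splitListBySize is made up, and 3 meg
is taken to be 3*1024*1024 bytes.

```shell
# Read file names on stdin.  Names of files of at least $1 bytes go to
# list file $2, the rest go to list file $3.  Both list files are
# emptied first, and the input order is kept.
splitListBySize() {
    : > "$2"
    : > "$3"
    while IFS= read -r f; do
        size=$(wc -c < "$f" | tr -d ' ')
        if [ "$size" -ge "$1" ]; then
            echo "$f" >> "$2"
        else
            echo "$f" >> "$3"
        fi
    done
}
```

Usage:
	ls -1S /scratch/hg/gs.12/build29/trfFa/*.fa.trf | splitListBySize 3145728 bigH smallH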

PRODUCING CROSS-SPECIES mRNA ALIGNMENTS (DONE)

Here you align vertebrate mRNAs against the masked genome on the
cluster you set up during the previous step.

o - Make sure that gbpri, gbmam, gbrod, and gbvrt are downloaded from GenBank into
    /cluster/store1/genbank.129 DONE

o - Process these out of genbank flat files as so:
       ssh kkstore
       cd /cluster/store1/genbank.129
       cd ../mrna.129
       faSplit sequence xenoRna.fa 2 xenoRna
       ssh kks00
       cd /scratch/hg
       mkdir mrna.129
       cp /cluster/store1/mrna.129/xenoRna*.* mrna.129
Request binrsync of /scratch/hg/mrna.129 from the admins

Set up cluster run.  First make sure genome is in kks00:/scratch/hg/gs.12/build29/contig/trf
in RepeatMasked + trf form.  (This is probably done already in mouse alignment
stage).  Also make sure /scratch/hg/mrna.129 is loaded with xenoRna.fa.  Then do:
       ssh kkstore
       cd /cluster/store1/gs.12/build29/bed
       mkdir xenoMrna
       cd xenoMrna
       mkdir psl
       ls -1S /scratch/hg/gs.12/build29/trfFa/*.fa.trf > human.lst
       ls -1S /scratch/hg/mrna.129/xenoRna?*.fa > mrna.lst
       cp ~kent/lastOo/bed/xenoMrna/gsub .
       gensub2 human.lst mrna.lst gsub spec
       para create spec
       para try
       para check
       para push 
Do para check until the run is done, doing para push if
necessary on occasion.

Sort xeno mRNA alignments as so:
       ssh kkstore
       cd ~/oo/bed/xenoMrna
       pslSort dirs raw.psl /cluster/store2/temp psl
       pslReps raw.psl cooked.psl /dev/null -minAli=0.25
       liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
       pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
       pslCat -dir chrom > xenoMrna.psl
       rm -r chrom raw.psl cooked.psl chrom.psl
DONE
Load into database as so:
       ssh hgwdev
       cd ~/oo/bed/xenoMrna
       hgLoadPsl hg11 xenoMrna.psl -tNameIx
       cd /cluster/store1/mrna.129
       hgLoadRna add hg11 /cluster/store1/mrna.129/xenoRna.fa /cluster/store1/mrna.129/xenoRna.ra

DONE
Similarly do xenoEst alignments:
   Prepare the est data:
        cd /cluster/store1/mrna.129
        faSplit sequence xenoEst.fa 16 xenoEst    

       ssh kkstore
       cd /cluster/store1/gs.12/build29/bed
       mkdir xenoEst
       cd xenoEst
       mkdir psl
       ls -1S /scratch/hg/gs.12/build29/trfFa/*.fa.trf > human.lst
       cp /cluster/store1/mrna.129/xenoEst?*.fa /scratch/hg/mrna.129
       ls -1S /scratch/hg/mrna.129/xenoEst?*.fa > mrna.lst
       cp ~kent/lastOo/bed/xenoEst/gsub .

Request a binrsync of kkstore's /scratch/hg/mrna.129 from the admins.
When done, do:
       gensub2 human.lst mrna.lst gsub spec
       para create spec
       para push DONE

Sort xenoEst alignments:
       ssh kkstore
       cd ~/oo/bed/xenoEst
       pslSort dirs raw.psl /cluster/store2/temp psl
       pslReps raw.psl cooked.psl /dev/null -minAli=0.10
       liftUp chrom.psl ../../jkStuff/liftAll.lft warn cooked.psl
       pslSortAcc nohead chrom /cluster/store2/temp chrom.psl
       pslCat -dir chrom > xenoEst.psl
       rm -r chrom raw.psl cooked.psl chrom.psl

Load into database as so:
       ssh hgwdev
       cd ~/oo/bed/xenoEst
       hgLoadPsl hg11 xenoEst.psl -tNameIx
       cd /cluster/store1/mrna.129
       hgLoadRna add hg11 /cluster/store1/mrna.129/xenoEst.fa /cluster/store1/mrna.129/xenoEst.ra
    
DONE

PRODUCING FISH ALIGNMENTS (DONE)

o - Do fish/human alignments.
       ssh kk
       cd ~/oo/bed
       mkdir blatFish
       cd blatFish
       mkdir psl
       ls -1S /scratch/hg/fish/*.fa > fish.lst
       ls -1S /scratch/hg/gs.12/build29/trfFa/*.fa.trf > human.lst
     Copy over gsub from previous version and edit paths to
     point to current assembly.
       gensub2 human.lst fish.lst gsub spec
       para create spec DONE
       para try
     Make sure jobs are going ok with para check.  Then
       para push
     wait about 2 hours and do another
       para push
     do para checks and if necessary para pushes until done
     or use para shove.
o - Sort alignments as so 
       pslCat -dir psl | liftUp -type=.psl stdout ~/oo/jkStuff/liftAll.lft warn stdin | pslSortAcc nohead chrom temp stdin
o - Copy to hgwdev:/scratch.  Rename to correspond with tables as so and 
    load into database:
       ssh hgwdev
       cd ~/oo/bed/blatFish/chrom
       foreach i (*.psl)
	   set r = $i:r
           mv $i ${r}_blatFish.psl
       end
       hgLoadPsl hg11 *.psl

Now load the fish sequence data
hgLoadRna addSeq hg11 /projects/hg3/fish/tet6/tet*.fa
DONE


TIGR GENE INDEX (done 7/1/02, re-load w/new data 7/30/02)
    mkdir -p ~/hg11/bed/tigr
    cd ~/hg11/bed/tigr
    # wget ftp://ftp.tigr.org/private/HGI_ren/TGI_track_HumanGenome_build29.tgz
    wget ftp://ftp.tigr.org/private/HGI_ren/TGI_track_HumanGenome_build29_corrected.tgz
    gunzip -c TGI*.tgz | tar xvf -
    foreach f (*cattle*)
      set f1 = `echo $f | sed -e 's/cattle/cow/g'`
      mv $f $f1
    end
    foreach o (mouse cow human pig rat)
      setenv O $o
      foreach f (chr*_$o*s)
        tail -n +2 $f | perl -wpe 's /THC/TC/; s/(TH?C\d+)/$ENV{O}_$1/;' > $f.gff
      end
    end
    ldHgGene -exon=TC hg11 tigrGeneIndex *.gff
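
    The per-file clean-up in the inner loop above does three things: it
    drops the header line, renames the first THC on each line to TC, and
    prefixes the first TC id on each line with the organism name.  A rough
    sed equivalent of the perl one-liner, as a sketch only.  The function
    name tigr_fix_ids is invented for illustration.

```shell
# Sketch of the TIGR id clean-up: skip the header line, turn THC into TC,
# then prefix the first TC<digits> id on each line with the organism name.
tigr_fix_ids() {
    org="$1"   # organism name, e.g. mouse, cow, human, pig, rat
    tail -n +2 | sed -E -e 's/THC/TC/' -e "s/(TH?C[0-9]+)/${org}_\\1/"
}
```

    It reads stdin and writes stdout, so one file would be handled as
    tigr_fix_ids cow < infile > infile.gff.  As in the perl version, only
    the first match on each line is rewritten.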


LOAD STS MAP (todo) DONE BY TERRY I BELIEVE - HE WILL UPDATE THIS
     - login to hgwdev
      cd ~/oo/bed
      hg11 < ~/src/hg/lib/stsMap.sql
      mkdir stsMap
      cd stsMap
      bedSort /projects/cc/hg/mapplots/data/tracks/build29/stsMap.bed stsMap.bed
      - Enter database with "hg11" command.
      - At mysql> prompt type in:
          load data local infile 'stsMap.bed' into table stsMap;
      - At mysql> prompt type
          quit


LOAD CHROMOSOME BANDS (todo) ALSO DONE BY TERRY I BELIEVE
      - login to hgwdev
      cd /cluster/store1/gs.12/build29/bed
      mkdir cytoBands
      cp /projects/cc/hg/mapplots/data/tracks/oo.29/cytobands.bed cytoBands
      cd cytoBands
      hg11 < ~/src/hg/lib/cytoBand.sql
      - Enter database with "hg11" command.
      - At mysql> prompt type in:
          load data local infile 'cytobands.bed' into table cytoBand;
      - At mysql> prompt type
          quit

LOAD MOUSEREF TRACK (todo)
    First copy in data from kkstore to ~/oo/bed/mouseRef.  
    Then substitute 'genome' for the appropriate chromosome 
    in each of the alignment files.  Finally do:
       hgRefAlign webb hg11 mouseRef *.alignments

LOAD AVID MOUSE TRACK (todo)
      ssh cc98
      cd ~/oo/bed
      mkdir avidMouse
      cd avidMouse
      wget http://pipeline.lbl.gov/tableCS-LBNL.txt
      hgAvidShortBed *.txt avidRepeat.bed avidUnique.bed
      hgLoadBed hg11 avidRepeat avidRepeat.bed
      hgLoadBed hg11 avidUnique avidUnique.bed

LOAD SNPS (Done; Daryl Thomas May 28, 2002)
      ssh hgwdev
      cd ~/oo/bed
      mkdir snp
      cd snp
     -Download SNPs from ftp://ftp.ncbi.nlm.nih.gov/pub/sherry/gp.ncbi.b29.gz
     -Unpack.
      ln -s ../../seq_contig.md .
      calcFlipSnpPos seq_contig.md gp.ncbi.b29 gp.ncbi.b29.flipped
      mv gp.ncbi.b29 gp.ncbi.b29.original
      gzip gp.ncbi.b29.original
      grep RANDOM       gp.ncbi.b29.flipped >  snpTsc.txt
      grep MIXED        gp.ncbi.b29.flipped >> snpTsc.txt
      grep BAC_OVERLAP  gp.ncbi.b29.flipped >  snpNih.txt
      grep OTHER        gp.ncbi.b29.flipped >> snpNih.txt
      awk -f filter.awk snpTsc.txt > snpTsc.contig.bed
      awk -f filter.awk snpNih.txt > snpNih.contig.bed
      liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed
      liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed
      hgLoadBed hg11 snpTsc snpTsc.bed
      hgLoadBed hg11 snpNih snpNih.bed
     -gzip all of the big files
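
      The four grep steps above sort the flipped SNP records into two
      track inputs by the class keyword on each line: RANDOM and MIXED go
      to snpTsc, BAC_OVERLAP and OTHER go to snpNih.  A compact sketch of
      the same split.  The function name split_snps is made up here.

```shell
# Sketch: split flipped SNP records into the two track input files using
# the same class keywords as the grep steps.
split_snps() {
    src="$1"      # e.g. gp.ncbi.b29.flipped
    outdir="$2"   # where snpTsc.txt and snpNih.txt are written
    # "|| true" keeps an empty result from counting as a failure.
    grep -E 'RANDOM|MIXED'      "$src" > "$outdir/snpTsc.txt" || true
    grep -E 'BAC_OVERLAP|OTHER' "$src" > "$outdir/snpNih.txt" || true
}
```

      Unlike the appended greps, this keeps records in input order rather
      than grouping them by keyword, and a line carrying both keywords of
      a pair is written once, not twice.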

LOAD CPGISLANDS (done 7/18/02)
     - login to hgwdev
     mkdir -p ~/hg11/cpgIsland
     cd ~/hg11/cpgIsland
     - Asif Chinwalla <achinwal@watson.wustl.edu> emailed the data in an 
       attachment; it was unpacked into ~/hg11/cpgIsland
     - copy filter.awk from a previous release, e.g. ~kent/oo.33/bed/cpgIsland 
       to cpg_apr2002.masked
     awk -f filter.awk */*.cpg > cpgIsland.bed
     hgLoadBed hg11 cpgIsland -tab -noBin \
       -sqlTable=$HOME/kent/src/hg/lib/cpgIsland.sql cpgIsland.bed

LOAD ENSEMBL GENES (done 7/9/02)
     mkdir -p ~/hg11/bed/ensembl
     cd ~/hg11/bed/ensembl
     # wget complains about a Redirection loop, but GET handles it (?):
     GET http://www.ebi.ac.uk/~stabenau/human_29_gtf.gz > human_29_gtf.gtf.gz
     # add "chr" to the chrom ids:
     gunzip -c human_29_gtf.gtf.gz | \
       perl -w -p -e 's/^(\w)/chr$1/' > human_29_gtf-fixed.gtf
     ldHgGene hg11 ensGene human_29_gtf-fixed.gtf
     # Load Ensembl peptides, replace ">ENSP" with ">ENST":
     wget ftp://ftp.ensembl.org/pub/current_human/data/fasta/pep/Homo_sapiens.pep.all.fa.gz
     gunzip -c Homo_sapiens.pep.all.fa.gz | sed -e 's/^>ENSP/>ENST/' \
	> ensembl.pep
     hgPepPred hg11 generic ensPep ensembl.pep

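     The two text fix-ups above are small enough to check by hand before
     loading.  A sketch of both as shell filters; the function names are
     invented for illustration and the sed patterns mirror the perl and
     sed commands in the steps.

```shell
# Prefix the chromosome id with "chr" on lines that start with a word
# character.  Lines starting with '#' are left alone.
add_chr_prefix() {
    sed -E 's/^([A-Za-z0-9_])/chr\1/'
}

# Rewrite FASTA header prefixes from ENSP to ENST.
ensp_to_enst() {
    sed -e 's/^>ENSP/>ENST/'
}
```

     For example: gunzip -c human_29_gtf.gtf.gz | add_chr_prefix | head
     Note that a line already starting with "chr" would come out as
     "chrchr...", so this assumes bare chromosome ids in the input.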
LOAD SANGER22 GENES 
      cd ~/oo/bed
      mkdir sanger22
      cd sanger22
      (Not sure where these files were downloaded from.)
      grep -v Pseudogene Chr22*.genes.gff | hgSanger22 hg11 stdin Chr22*.cds.gff *.genes.dna *.cds.pep 0 \
          | ldHgGene hg11 sanger22pseudo stdin
  Note: this creates sanger22extras, but doesn't currently create
  a correct sanger22 table, which is replaced in the next steps
      sanger22-gff-doctor Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
          | ldHgGene hg11 sanger22 stdin
      sanger22-gff-doctor -pseudogenes Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
          | ldHgGene hg11 sanger22pseudo stdin

      hgPepPred hg11 generic sanger22pep *.pep

LOAD SANGER 20 GENES (todo)
     First download files from James Gilbert's email to ~/oo/bed/sanger20 and
     go to that directory while logged onto hgwdev.  Then:
        grep -v Pseudogene chr_20*.gtf | ldHgGene hg11 sanger20 stdin
	hgSanger20 hg11 *.gtf *.info


LOAD RNAGENES (todo)
      - login to hgwdev
      - cd ~kent/src/hg/lib
      - hg11 < rnaGene.sql
      - cd /cluster/store1/gs.12/build29/bed
      - mkdir rnaGene
      - cd rnaGene
      - download data from ftp.genetics.wustl.edu/pub/eddy/pickup/ncrna-oo27.gff.gz
      - gunzip *.gz
      - liftUp chrom.gff ../../jkStuff/liftAll.lft carry ncrna-oo27.gff
      - hgRnaGenes hg11 chrom.gff

LOAD EXOFISH (todo)
     - login to hgwdev
     - cd /cluster/store1/gs.12/build29/bed
     - mkdir exoFish
     - cd exoFish
     - hg11 < ~kent/src/hg/lib/exoFish.sql
     - Put email attachment from Olivier Jaillon (ojaaillon@genoscope.cns.fr)
       into /cluster/store1/gs.12/build29/bed/exoFish/all_maping_ecore
     - awk -f filter.awk all_maping_ecore > exoFish.bed
     - hgLoadBed hg11 exoFish exoFish.bed

LOAD MOUSE SYNTENY (todo)
     - login to hgwdev.
     - cd ~/kent/src/hg/lib
     - hg11 < mouseSyn.sql
     - mkdir ~/oo/bed/mouseSyn
     - cd ~/oo/bed/mouseSyn
     - Put Dianna Church's (church@ncbi.nlm.nih.gov) email attachment as
       mouseSyn.txt
     - awk -f format.awk *.txt > mouseSyn.bed
     - delete first line of mouseSyn.bed
     - Enter database with "hg11" command.
     - At mysql> prompt type in:
          load data local infile 'mouseSyn.bed' into table mouseSyn;


LOAD GENIE (todo)
     - cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf
     - liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf
     - ldHgGene hg11 genieAlt predChrom.gtf

     - cat */ctg*/ctg*.affymetrix.aa > pred.aa
     - hgPepPred hg11 genie pred.aa 

     - hg11
         mysql> delete from genieAlt where name like 'RS.%';
         mysql> delete from genieAlt where name like 'C.%';

LOAD SOFTBERRY GENES (DONE 8/8/02)
     ln -s /cluster/store1/gs.12/build29/ ~/hg11
     mkdir -p ~/hg11/bed/softberry
     cd ~/hg11/bed/softberry
     GET ftp://www.softberry.com/pub/sc_fgenesh_ap02/sb_fgenesh_ap02.tar.gz \
       > sb_fgenesh_ap02.tar.gz
     gunzip -c sb_fgenesh_ap02.tar.gz | tar xvf -
     cd sb_fgenesh_ap02
     ssh hgwdev
     cd ~/hg11/bed/softberry/sb_fgenesh_ap02
     ldHgGene hg11 softberryGene chr*.gff
     hgPepPred hg11 softberry *.pro
     hgSoftberryHom hg11 *.pro

LOAD GENEID GENES (todo)
     mkdir ~/oo/bed/geneid
     cd ~/oo/bed/geneid
     mkdir download
     cd download
   Now download *.gtf and *.prot from 
   http://www1.imim.es/genepredictions/H.sapiens/golden_path_20011222/geneid_v1.1/
     cd ..
     ldHgGene hg11 geneid download/*.gtf -exon=CDS
     hgPepPred hg11 generic geneidPep download/*.prot

LOAD ACEMBLY (DONE 05/31/02)
    mkdir -p ~/oo/bed/acembly
    cd ~/oo/bed/acembly
    - Get acembly*gene.gff from Jean and Danielle Thierry-Mieg
    wget ftp://ftp.ncbi.nih.gov/repository/acedb/ncbi_29.human.genes/acembly.ncbi_29.genes.gff.tar.gz
    wget ftp://ftp.ncbi.nih.gov/repository/acedb/ncbi_29.human.genes/acembly.ncbi_29.genes.proteins.fasta.tar.gz
    gunzip -c acembly.ncbi_29.genes.gff.tar.gz | tar xvf -
    gunzip -c acembly.ncbi_29.genes.proteins.fasta.tar.gz | tar xvf -
    cd acembly.ncbi_29.genes.gff
    - Strip out floating-contig features (lines with *|NT_?????? as the chr ID),
      and add 'chr' prefix to all chr nums:
    foreach f (acemblygenes.*.gff)
      egrep -v '^[a-zA-Z0-9]+\|NT_[0-9][0-9][0-9][0-9][0-9][0-9]' $f | \
        perl -wpe 's/^(\w)/chr$1/' > $f:r-fixed.gff
    end
    - Save just the floating-contig features to different files for lifting 
    - and lift up the floating-contig features to chr*_random coords:
    foreach c ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y Un)
      egrep '^[a-zA-Z0-9]+\|NT_[0-9][0-9][0-9][0-9][0-9][0-9]' acemblygenes.$c.gff | \
        perl -wpe 's/^(\w+)\|(\w+)/$1\/$2/' > $c-random-ctg.gff
      liftUp $c-random-lifted.gff ../../../$c/lift/random.lft warn $c-random-ctg.gff
    end

    cd ../acembly.ncbi_29.genes.proteins.fasta
    - Remove G_t*_ prefixes from acemblyproteins.*.fasta:
    foreach f (acemblyproteins.*.fasta)
      perl -wpe 's/^\>G_t[\da-zA-Z]+_/\>/' $f > $f:r-fixed.fasta
    end
    - Load into database as so:
    cd ..
    ldHgGene hg11 acembly acembly.ncbi_29.genes.gff/*-fixed.gff acembly.ncbi_29.genes.gff/*-lifted.gff
    hgPepPred hg11 generic acemblyPep acembly.ncbi_29.genes.proteins.fasta/*-fixed.fasta
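
    The two acembly loops above partition each GFF file by its first
    field: lines of the form <chrom>|NT_<6 digits> are floating-contig
    features, everything else is placed on a chromosome.  A sketch of the
    two filters.  The function names are made up, and the "chrom/NT_x"
    form is presumably what the random.lft files expect.

```shell
# First field of a floating-contig feature: <chrom>|NT_<6 digits>
FLOAT_RE='^[a-zA-Z0-9]+\|NT_[0-9]{6}'

# Placed features: drop floating-contig lines, add the "chr" prefix.
placed_features() {
    grep -E -v "$FLOAT_RE" | sed -E 's/^([A-Za-z0-9_])/chr\1/'
}

# Floating features: keep only those lines, rewrite "chrom|NT_x" to
# "chrom/NT_x" ahead of the liftUp step.
floating_features() {
    grep -E "$FLOAT_RE" | sed -E 's/^([A-Za-z0-9_]+)\|([A-Za-z0-9_]+)/\1\/\2/'
}
```

    Both read stdin and write stdout, e.g.
    placed_features < acemblygenes.1.gff > acemblygenes.1-fixed.gff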

LOAD GENOMIC DUPES (todo)
o - Load genomic dupes
    ssh hgwdev
    cd ~/oo/bed
    mkdir genomicDups
    cd genomicDups
    wget http://codon/jab/web/takeoff/oo33_dups_for_kent.zip
    unzip *.zip
    awk -f filter.awk oo33_dups_for_kent > genomicDups.bed
    mysql -u hgcat -pbigSECRET hg11 < ~/src/hg/lib/genomicDups.sql
    hgLoadBed hg11 -oldTable genomicDups genomicDups.bed

FAKING DATA FROM PREVIOUS VERSION
(This is just a stopgap until the proper track arrives.  Rescues about
97% of data.  Just an experiment, not really followed through on.)

o - Rescuing STS track:
     - log onto hgwdev
     - mkdir ~/oo/rescue
     - cd !$
     - mkdir sts
     - cd sts
     - bedDown hg3 mapGenethon sts.fa sts.tab
     - echo ~/oo/sts.fa > fa.lst
     - pslOoJobs ~/oo ~/oo/rescue/sts/fa.lst ~/oo/rescue/sts g2g
     - log onto cc01
     - cd ~/oo/rescue/sts
     - split all.con into 3 parts and condor_submit each part
     - wait for assembly to finish
     - cd psl
     - mkdir all
     - ln ?/*.psl ??/*.psl *.psl all
     - pslSort dirs raw.psl temp all
     - pslReps raw.psl contig.psl /dev/null
     - rm raw.psl
     - liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl
     - rm contig.psl
     - mv chrom.psl ../convert.psl


LOADING MOUSE MM2 BLASTZ ALIGNMENTS FROM PENN STATE: (markd)

    - loading both blastz alignments and reference (single coverage) alignments
    - in xAli format, which includes sequence
    - done in a tmp dir and intermediate files discarded

    - create psl files for each per-contig lav file

       set sc=""
       set tbl="blastzMm2"
       foreach chrdir (/cluster/store1/gs.12/build29/bed/blastz.mm2.2002-04-14/lav/chr*)
         set chr=$chrdir:t
         set outdir=lav-psl${sc}/$chr
         mkdir -p $outdir
         foreach lav ($chrdir/*.lav${sc})
           lavToPsl -target-strand=+ $lav $outdir/$lav:t:r.psl
         end
       end

    - Convert to per-chromosome files, sort, and add sequence
       mkdir -p lav-xa${sc}
       foreach chrdir (lav-psl${sc}/*)
         set chr=$chrdir:t
         pslCat -check -nohead -ext=.psl -dir lav-psl${sc}/$chr \
          | liftUp -type=.psl -pslQ -nohead stdout /cluster/store2/mm.2002.02/mm2/jkStuff/liftAll.lft warn stdin \
          | sort -k 15n -k 16n \
          | pslToXa stdin lav-xa${sc}/${chr}_${tbl}.xa /cluster/store2/mm.2002.02/mm2/nib /cluster/store1/gs.12/build29/nib
       end

    - repeat both loops, this time doing the single-coverage alignment:
        set sc=".sc"
        set tbl="blastzMm2Sc"
        <above loops>

    - Load tables
        cd lav-xa
        hgLoadPsl -xa hg11 *.xa
        cd ../lav-xa.sc
        hgLoadPsl -xa hg11 *.xa
  

   - Load aligned ancient repeats, from
       /cluster/store1/gs.12/build29/bed/blastz.mm2.2002-04-14/aar
     Ryan created:
       /cluster/store1/gs.12/build29/bed/blastz.mm2.2002-04-14/aar/xali
     - Loaded into aarMm2

MITOCHONDRIAL DNA PSEUDO-CHROMOSOME - DONE

Download the fasta file from http://www.gen.emory.edu/MITOMAP/mitomapRCRS.fasta
Put it in /cluster/store1/mrna.129
ssh hgwdev
cd ~/oo
mkdir M
cp /cluster/store1/mrna.129/mitomapRCRS.fasta M/chrM.fa
Edit jkStuff/makeNib.sh to make sure it also has the "M" directory in its file list
tcsh jkStuff/makeNib.sh
hgNibSeq -preMadeNib hg11 /cluster/store1/gs.12/build29/nib ?/chr*.fa ??/chr*.fa


LOAD Ingo Ebersberger's chimp BLAT alignments DONE
cd ~/oo
mkdir bed/chimpBlat
cd bed/chimpBlat

#!/bin/sh
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y
do
  wget http://email.eva.mpg.de/~ebersber/custom_track_chimp/MPI-sg_apr02/chr${i}_gp_F01Apr02.psl
done

Remove the first line from each psl file. It is junk.
pslCat *.psl > chimpBlat.psl
hgLoadPsl hg11 chimpBlat.psl
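
The first-line removal mentioned above has to happen before the pslCat
step.  One way to do it, as a sketch; the function name strip_first_line
is invented here.

```shell
# Remove the first line of a file in place, going through a temporary file
# so the original is only replaced if tail succeeds.
strip_first_line() {
    f="$1"
    tail -n +2 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
}
```

It would be run once per downloaded file, for example:
  for f in chr*_gp_F01Apr02.psl; do strip_first_line "$f"; done
Run it only once, since each call drops one more line.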


MAKING THE DOWNLOADABLE DATABASE FILES - DONE

mkdir /usr/local/apache/htdocs/goldenPath/05apr2002
mkdir /usr/local/apache/htdocs/goldenPath/05apr2002/chromosomes
mkdir /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips
mkdir /usr/local/apache/htdocs/goldenPath/05apr2002/database

o zip up the chromosomes individually
ssh kkstore (we use kkstore because no NFS traffic via kkstore = faster data transfer)

cd ~/oo
In tcsh run this script
  foreach i (*/chr*.fa)
      echo zip $i:r.zip $i
      zip $i:r.zip $i
  end

Then do:  
ssh hgwdev
mv */chr*.zip /usr/local/apache/htdocs/goldenPath/05apr2002/chromosomes

Request that the admins push this to hgwbeta.

o Make the big zips

 - Make database.zip
ssh hgwbeta
cd /usr/local/apache/htdocs/goldenPath/05apr2002/database
zip ../bigZips/database.zip *

ssh hgwdev
cd ~/oo

 - Make chromAgp.zip
zip chromAgp.zip */chr*.agp
mv chromAgp.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips

 - Make chromFa.zip
zip chromFa.zip */chr*.fa
mv chromFa.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips

 - Make chromOut.zip
zip chromOut.zip */chr*.out
mv chromOut.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips

 - Make contigAgp.zip
zip contigAgp.zip */*/*.agp
mv contigAgp.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips

 - Make contigFa.zip
zip contigFa.zip */*/*.fa
mv contigFa.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips

 - Make contigOut.zip
zip contigOut.zip */*/*.out
mv contigOut.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips

 - Make liftAll.zip
zip liftAll.zip jkStuff/liftAll.lft
mv liftAll.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips

 - Make mrna.zip
zip mrna.zip /cluster/store1/mrna.129/mrna.fa
mv mrna.zip /usr/local/apache/htdocs/goldenPath/05apr2002/bigZips

o Dump the database
ssh hgwbeta
We dump the database on hgwbeta so that we dump only the most accurate database state.

There is one trick here: the dump is written as the mysql user,
so the directory you dump to must be writable by that user.

Here's what to do:

cd /var/tmp
mkdir hg11-dump
chmod 777 hg11-dump      (since you aren't root this is quickest)
cd hg11-dump
mysqldump --user=hguser --password=hguserstuff --all --tab=. hg11

That directory will then quickly fill with .sql and .txt files.
When it is done do:

cd /var/tmp/hg11-dump
gzip *.txt
mv * /usr/local/apache/htdocs/goldenPath/05apr2002/database

##############################################################################
# liftOver to hg19 requested by user (DONE - 2017-12-22 - Hiram)
    # picked up the sequence from hgdownload to make a 2bit file
    # fixed the chrM name in the sequence
    # created 2bit file:
    # -rw-rw-r-- 1 830361244 Dec 22 08:45 /hive/data/genomes/hg11/hg11.2bit
    # and then a .ooc file:
    mkdir -p /hive/data/genomes/hg11
    cd /hive/data/genomes/hg11

    time blat hg11.2bit \
	/dev/null /dev/null -tileSize=11 -makeOoc=jkStuff/hg11.11.ooc \
		-repMatch=1024

    mkdir /hive/data/genomes/hg11/bed/blat.hg19.2017-12-22
    cd /hive/data/genomes/hg11/bed/blat.hg19.2017-12-22
    time (doSameSpeciesLiftOver.pl -verbose=2 \
        -bigClusterHub=ku -dbHost=hgwdev -workhorse=hgwdev \
        -ooc=/hive/data/genomes/hg11/jkStuff/hg11.11.ooc \
         hg11 hg19) > do.log 2>&1
    # real    552m56.176s

##############################################################################
# liftOver to hg38 while we are here (DONE - 2017-12-22 - Hiram)

    mkdir /hive/data/genomes/hg11/bed/blat.hg38.2017-12-20
    cd /hive/data/genomes/hg11/bed/blat.hg38.2017-12-20
    time (doSameSpeciesLiftOver.pl -verbose=2 \
        -bigClusterHub=ku -dbHost=hgwdev -workhorse=hgwdev \
        -ooc=/hive/data/genomes/hg11/jkStuff/hg11.11.ooc \
         hg11 hg38) > do.log 2>&1
    # real    553m21.557s

##############################################################################
