 The sp092903 and proteins092903 databases must be built first
 from SWISS-PROT, TrEMBL, and TrEMBL-NEW, using spToDb
 and other programs
 (see /cluster/store4/fan/pb/buildProteins092903.doc for details).

o Create a working subdirectory mm3, go there, and make a symbolic link to it

	mkdir /cluster/store4/fan/pb/mm3
	cd /cluster/store4/fan/pb/mm3
	ln -s /cluster/store4/fan/pb/mm3 ~/mm3

o Build mm3Temp database by:

  	create database mm3Temp;

  Get mm3Temp.sql for table definitions

	dumpdbdef hg16Temp >mm3Temp.sql

  Create tables in mm3Temp:

	mysql -u hgcat -p$HGPSWD -A mm3Temp <mm3Temp.sql

  From mysql prompt:

	drop table mm3Temp.history;

  NOTE: the above step is needed because hgKgMrna, run later, fails with
        a segmentation fault if the history table is not dropped.

o Get mrna input data files

  - Get mrna.fa file:

	/cluster/data/genbank/bin/i386/gbGetSeqs -native -db=mm3 -gbRoot=/cluster/data/genbank genbank mrna mrna.fa 

    There may be some error messages like:

	warning: AH003062.1 does not appear to be a valid mRNA sequence, skipped: ...

    This is normal.

  - Get mrna.ra file:

    /cluster/data/genbank/bin/i386/gbGetSeqs -get=ra -native -db=mm3 -gbRoot=/cluster/data/genbank genbank mrna mrna.ra 

  - Get all_mrna.psl file:

    /cluster/data/genbank/bin/i386/gbGetSeqs -get=psl -native -db=mm3 -gbRoot=/cluster/data/genbank genbank mrna all_mrna.psl 
	
o generate a list of all mrna accession numbers

	fgrep ">" mrna.fa > mrna.lis
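
  Note that fgrep ">" keeps the whole FASTA header line, including the
  leading ">" and any text after the accession. If a bare accession list
  is ever wanted, a sketch along the following lines could be used. The
  sample mrna.fa contents and the output name mrna.acc are made up for
  illustration, and the result is written to a separate file because it
  has not been checked whether later programs expect mrna.lis unchanged.

```shell
#!/bin/sh
# Sketch only: derive bare accessions from FASTA header lines.
# The sample records and the file name mrna.acc are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"

# Tiny stand-in for the real mrna.fa.
cat > mrna.fa <<'EOF'
>AB000096.1 some description
ACGTACGT
>AB000097.2
TTGGCCAA
EOF

# Same selection as the fgrep step (grep -F is the equivalent spelling):
# whole header lines are kept.
grep -F ">" mrna.fa > mrna.lis

# Strip the leading ">" and keep only the first whitespace-delimited word.
awk '/^>/ { sub(/^>/, "", $1); print $1 }' mrna.fa > mrna.acc

cat mrna.acc
```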

o Process LocusLink data to generate mrnaRefseq table

        - create a subdirectory 100603 under ~fan/data/ll and cd there

        - get the latest LocusLink data:

                wget ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2ref
                wget ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc
                wget ftp://ftp.ncbi.nih.gov/refseq/LocusLink/mim2loc

	- copy the files over to ~/mm3
	
		cp -p *loc* ~/mm3 

        - load the LocusLink data into these two tables using mysql

                LOAD DATA local INFILE 'loc2acc' into table mm3Temp.locus2Acc0;
                LOAD DATA local INFILE 'loc2ref' into table mm3Temp.locus2Ref0;

        - run hgMrnaRefseq to generate mrnaRefseq.tab

                hgMrnaRefseq mm3

	- create table mm3.mrnaRefseq:

		CREATE TABLE mrnaRefseq (
		mrna varchar(40) NOT NULL default '',
  		refseq varchar(40) NOT NULL default '',
  		KEY mrna (mrna),
  		KEY refseq (refseq)
		) TYPE=MyISAM;

        - load data into all appropriate genome databases

                LOAD DATA local INFILE 'mrnaRefseq.tab' into table mm3.mrnaRefseq; 
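
  LOAD DATA does not stop on lines with too few or too many fields, so it
  can be worth checking a .tab file before loading it. The sketch below
  checks the field count per line (two for mrnaRefseq, matching the table
  definition above). check_tab_cols is a hypothetical helper, not one of
  the existing build programs, and the sample files are made up.

```shell
#!/bin/sh
# Sketch only: check that every line of a tab-separated file has the
# expected number of fields before loading it with LOAD DATA.
# check_tab_cols is a hypothetical helper, not part of the build tools.
set -eu

check_tab_cols() {
    # $1 = file, $2 = expected number of tab-separated fields
    awk -F '\t' -v n="$2" '
        NF != n { bad++; if (bad <= 5) printf "line %d: %d fields\n", NR, NF }
        END { if (bad > 0) exit 1 }
    ' "$1"
}

work=$(mktemp -d)
cd "$work"

# Made-up sample data: one well-formed file, one with a short line.
printf 'AB000096\tNM_000001\nAB000097\tNM_000002\n' > good.tab
printf 'AB000096\tNM_000001\nAB000097\n' > bad.tab

if check_tab_cols good.tab 2; then echo "good.tab: ok"; fi
if check_tab_cols bad.tab 2; then
    echo "bad.tab: ok"
else
    echo "bad.tab: malformed"
fi
```

  The same helper could be pointed at any of the other .tab files in this
  document, with the field count taken from the matching table definition.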

o generate FASTA format protein sequence file

	kgGetPep 092903 > mrnaPep.fa

o run pslReps to get tighter mRNAs

	pslReps -minCover=0.40 -sizeMatters -minAli=0.97 -nearTop=0.002 all_mrna.psl tight_mrna.psl /dev/null

o Run hgKgMrna to build "refGene" tables in mm3Temp database

  hgKgMrna mm3Temp mrna.fa mrna.ra tight_mrna.psl loc2ref mrnaPep.fa mim2loc proteins092903 >hgKgMrna.out 2>hgKgMrna.err
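
  Since the following steps load refGene.tab, refPep.tab and refMrna.tab,
  it may help to confirm those files exist and are non-empty before going
  on. That hgKgMrna is the program that writes all three is inferred from
  the LOAD DATA steps below, not confirmed here, and require_nonempty is
  a hypothetical helper. A minimal sketch with stand-in files:

```shell
#!/bin/sh
# Sketch only: stop early if an expected output file is missing or empty.
# require_nonempty is a hypothetical helper; the file names come from the
# LOAD DATA steps in this document.
set -eu

require_nonempty() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
}

work=$(mktemp -d)
cd "$work"

# Stand-in files so the sketch runs anywhere.
echo "dummy" > refGene.tab
echo "dummy" > refPep.tab
: > refMrna.tab          # deliberately empty

if require_nonempty refGene.tab refPep.tab; then
    echo "first two files look fine"
fi
if ! require_nonempty refMrna.tab 2>/dev/null; then
    echo "refMrna.tab would be flagged"
fi
```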

o create the mrnaGene table in mm3Temp DB, by running mrnaGene.sql 
  at the MySQL prompt

  Load mrnaGene data into the table

	LOAD DATA local INFILE 'refGene.tab' into table mm3Temp.mrnaGene;

  mm3Temp.mrnaGene is needed by spm6

  create KG related tables in mm3

	mysql -u hgcat -p$HGPSWD -A mm3 < kgRelated.sql

        LOAD DATA local INFILE 'refMrna.tab' into table mm3Temp.refMrna;
  
  Load pep and mrna data into the knownGenePep and knownGeneMrna tables

        LOAD DATA local INFILE 'refPep.tab'  into table mm3.knownGenePep;
        LOAD DATA local INFILE 'refMrna.tab' into table mm3.knownGeneMrna;

o run spm3 to generate the proteinMrna.tab and protein.lis files

	spm3 092903 mm3

  create table spMrna in mm3Temp and load proteinMrna.tab into mm3Temp.spMrna.

	load data local infile "proteinMrna.tab" into table mm3Temp.spMrna;

o run kgBestMrna

  create a subdirectory kgBestMrna
  cd kgBestMrna
  cp -p ../protein.lis .

	kgBestMrna 092903 mm3 2>kgBestMrna.err >kgBestMrna.out2

  The log file of best picks will be generated by kgBestMrna and stored at kgBestMrna.out.

  This may take a day and a half to finish!
  The output file is best.lis.  

  	cp -p best.lis ..

  This step could be broken into 2 or 3 pieces and run in parallel to leverage hgwdev's 4 CPUs.
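
  One way the split might be done is sketched below: protein.lis is cut
  into roughly equal pieces, each placed in its own subdirectory under the
  name protein.lis. This assumes kgBestMrna reads protein.lis from the
  current directory and writes best.lis there, which is suggested by the
  steps above but not verified; the part1..part3 directory names and the
  sample IDs are made up. The per-piece best.lis files would then have to
  be concatenated into one best.lis.

```shell
#!/bin/sh
# Sketch only: split protein.lis into N pieces, one per subdirectory.
# Directory names part1..partN and the sample IDs are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"

# Stand-in for the real protein.lis.
printf 'P1\nP2\nP3\nP4\nP5\nP6\nP7\n' > protein.lis

pieces=3
total=$(wc -l < protein.lis)
# Lines per piece, rounded up so nothing is left over.
per=$(( (total + pieces - 1) / pieces ))

split -l "$per" protein.lis chunk.

i=1
for f in chunk.*; do
    mkdir "part$i"
    mv "$f" "part$i/protein.lis"
    i=$((i + 1))
done

wc -l part*/protein.lis
```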

o Create spMrna table in mm3, by copying and pasting spMrna.sql
  at the mysql prompt.

    Load the data by:

      LOAD DATA local INFILE 'best.lis' into table mm3.spMrna;
      
o Run spm6 to generate sorted.lis and knownGene0.tab 
  for further duplicates processing

  	spm6 092903 mm3

  create table knownGene0 in mm3Temp

  load the knownGene0.tab into the knownGene0 table in mm3Temp

	LOAD DATA local INFILE 'knownGene0.tab' into table mm3Temp.knownGene0;

o Run spm7 to perform duplicates processing

  	spm7 092903 mm3 > spm7.out

o create knownGene and dupSpMrna tables in mm3 by using knownGene.sql and dupSpMrna.sql

        LOAD DATA local INFILE 'knownGene.tab' into table mm3.knownGene;
        LOAD DATA local INFILE 'duplicate.tab' into table mm3.dupSpMrna;

o collect DNA based RefSeq data to create dnaGene.tab and dnaLink.tab

        dnaGene mm3 proteins092903

o create table knownGeneLink in mm3

        LOAD DATA local INFILE 'dnaLink.tab' into table mm3.knownGeneLink;

o load dnaGene.tab into the knownGene table:

        LOAD DATA local INFILE 'dnaGene.tab' into table mm3.knownGene;

o Remove invalid KG entries in knownGenePep and knownGeneMrna tables:

	rmKGPepMrna mm3 092903

  Next, use mysql to delete old knownGenePep and knownGeneMrna table entries:

	use mm3
	delete from mm3.knownGenePep;
	delete from mm3.knownGeneMrna;

  Then load in new filtered data:

    LOAD DATA local INFILE 'knownGenePep.tab'  into table mm3.knownGenePep;
    LOAD DATA local INFILE 'knownGeneMrna.tab' into table mm3.knownGeneMrna;

o Use the Genome Browser to check if the "Known Gene" track is functioning
  correctly.


o Now create alias tables to facilitate hgFind.

  First create the kgXref, kgAlias and kgProtAlias tables in mm3, using

	kgXref.sql
	kgAlias.sql  
	kgProtAlias.sql  

o Build kgXref table
 
  Generate xref .tab file for KG

	kgXref mm3 proteins092903

  Load it into MySQL

	load data local infile "kgXref.tab" into table mm3.kgXref;
  
o Build gene aliases

  Generate aliases from HUGO, etc.

	kgAliasM mm3 proteins092903

  Generate gene aliases from SWISS-PROT data 

    kgAliasP mm3 /cluster/store5/swissprot/092903/build/sprot.dat      sp.lis
    kgAliasP mm3 /cluster/store5/swissprot/092903/build/trembl.dat     tr.lis
    kgAliasP mm3 /cluster/store5/swissprot/092903/build/trembl_new.dat new.lis
    cat sp.lis tr.lis new.lis |sort|uniq >kgAliasP.tab
    rm  sp.lis tr.lis new.lis 

  Generate gene aliases from RefSeq data

    	kgAliasRefseq mm3 

  Concatenate all 3 files

	cat kgAliasM.tab kgAliasRefseq.tab kgAliasP.tab|sort|uniq > kgAlias.tab

  Load it into the MySQL table

	load data local infile "kgAlias.tab" into table mm3.kgAlias;
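
  The "cat ... | sort | uniq" pipelines used in this step merge several
  .tab files and drop exact duplicate lines. The sketch below shows the
  idea on two made-up files (a.tab and b.tab are hypothetical names) and
  compares the result with the shorter "sort -u" form. LC_ALL=C is set so
  that both forms compare lines byte by byte.

```shell
#!/bin/sh
# Sketch only: merge tab files and remove exact duplicate lines.
# a.tab, b.tab and the merged*.tab names are hypothetical.
set -eu
LC_ALL=C
export LC_ALL

work=$(mktemp -d)
cd "$work"

printf 'kg1\taliasA\nkg2\taliasB\n' > a.tab
printf 'kg2\taliasB\nkg3\taliasC\n' > b.tab

# Form used in this document.
cat a.tab b.tab | sort | uniq > merged1.tab

# Shorter form; in the C locale it yields the same lines.
sort -u a.tab b.tab > merged2.tab

if cmp -s merged1.tab merged2.tab; then echo "identical"; fi
wc -l < merged2.tab
```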

o Build protein aliases

  Generate protein aliases

	kgProtAlias mm3 proteins092903

  Generate protein aliases from NCBI data

	kgProtAliasNCBI mm3

  Concatenate both files

	cat kgProtAliasNCBI.tab kgProtAlias.tab|sort|uniq > kgProtAliasBoth.tab

  Load it into the MySQL table

	load data local infile "kgProtAliasBoth.tab" into table mm3.kgProtAlias;

o Create KEGG pathway related tables

  Go to KEGG web site at:

	http://www.genome.ad.jp/dbget-bin/www_bfind?pathway

  Search "mmu".  Cut and paste the resulting list, e.g.:

	1. path:mmu00010        Glycolysis / Gluconeogenesis - Mus musculus
  	2. path:mmu00020        Citrate cycle (TCA cycle) - Mus musculus
  	3. path:mmu00030        Pentose phosphate pathway - Mus musculus
  	4. path:mmu00040        Pentose and glucuronate interconversions - Mus musculus
  	...

  Save it as mmu.lis
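
  Because mmu.lis is produced by cut and paste, a quick look at whether
  every entry has the expected "N. path:mmuNNNNN  description" shape can
  catch a bad paste early. The sketch below pulls out the map ID and the
  description from each such line and skips anything else. It is only a
  format check on sample lines; it is not a description of what
  getKeggList.pl does, and the output name mmu.tab is made up.

```shell
#!/bin/sh
# Sketch only: extract map ID and description from a pasted KEGG list.
# The output file name mmu.tab is hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"

# Sample lines in the format shown above, plus a stray "..." line.
cat > mmu.lis <<'EOF'
	1. path:mmu00010        Glycolysis / Gluconeogenesis - Mus musculus
  	2. path:mmu00020        Citrate cycle (TCA cycle) - Mus musculus
  	3. path:mmu00030        Pentose phosphate pathway - Mus musculus
  	4. path:mmu00040        Pentose and glucuronate interconversions - Mus musculus
  	...
EOF

awk '{
    id = ""
    # Find the field of the form path:mmuNNNNN.
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^path:mmu[0-9]+$/) { id = substr($i, 6); start = i + 1; break }
    }
    if (id == "") next                      # not a pathway line, skip it
    desc = ""
    for (j = start; j <= NF; j++) desc = desc (j > start ? " " : "") $j
    sub(/ - Mus musculus$/, "", desc)       # drop the organism suffix
    print id "\t" desc
}' mmu.lis > mmu.tab

cat mmu.tab
```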

  Create the keggList database table in mm3Temp:

	CREATE TABLE keggList (
  		locusID       varchar(40) NOT NULL default '',
  		mapID         varchar(40) NOT NULL default '',
  		description   varchar(255) NOT NULL default '',
  		KEY (locusID),
  		KEY (mapID)
	) TYPE=MyISAM;

  Run the Perl program getKeggList.pl under hg/hgKegg

	getKeggList.pl mmu > keggList.tab

  Load it into the keggList table:

	load data local infile "keggList.tab" into table mm3Temp.keggList;

  Run hgKegg to generate the .tab files:

	hgKegg mm3

  which will create two files, keggPathway.tab and keggMapDesc.tab.

  Create the following two tables in mm3:

	CREATE TABLE keggMapDesc (
  		mapID       varchar(40) NOT NULL default '',
  		description varchar(255) NOT NULL default '',
  		KEY (mapID)
	) TYPE=MyISAM;
	
	CREATE TABLE keggPathway (
  		kgID           varchar(40) NOT NULL default '',
  		locusID        varchar(40) NOT NULL default '',
  		mapID          varchar(40) NOT NULL default '',
  		KEY (kgID),
  		KEY (locusID),
  		KEY (mapID)
	) TYPE=MyISAM;

  Load the two tables:

	load data local infile "keggPathway.tab" into table mm3.keggPathway;
	load data local infile "keggMapDesc.tab" into table mm3.keggMapDesc;
 
 
o Create CGAP related tables

  Download Mm_GeneData.dat by FTP from ftp://ftp1.nci.nih.gov/pub/CGAP

  Run hgCGAP to generate parsed .tab files.

	hgCGAP Mm_GeneData.dat

	cat *SEQ*.tab *SYM*.tab *ALI*.tab |sort|uniq >cgapAlias.tab

  Load data into tables:

	load data local infile "cgapBIOCARTA.tab" into table mm3.cgapBiocPathway;
	load data local infile "cgapBIOCARTAdesc.tab" into table mm3.cgapBiocDesc;
	load data local infile "cgapAlias.tab" into table mm3.cgapAlias;


 
  
