#
# This file can be viewed at the following URL:
# http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/product/scripts/README
#

Included here are some scripts to help with the job of constructing
and maintaining a mirror site for the UCSC Genome Browser.
Copy this directory contents to a work directory of your choice
outside of the kent source tree so your scripts will remain
stable despite source tree updates.

These scripts are helpful and useful, but they are not the entire
story.  You need to understand what they are doing to utilize
them appropriately.

They expect that commands for rsync, git, and mysql, for example,
are installed on your system.  Instructions for the installation
of MySQL and the Apache WEB server are not included here.
Those products have excellent How-To resources on the internet.
MySQL: http://dev.mysql.com/doc/refman/5.1/en/index.html
Apache: http://httpd.apache.org/

You must copy and edit the specification file:

	browserEnvironment.txt

to determine where the various objects are going to exist on
your system.  The examples in that file are sending files to
/scratch/tmp/ which is probably not very useful.  Please copy
this directory to a directory of your choice outside of the
source tree and edit the browserEnvironment.txt file.
The pathname to that file will be used with these scripts so
they understand how to function in your environment.

What is the difference between "full" and "minimal" in the scripts ?
A "minimal" set of files for a given UCSC genome database are
sufficient to get a genome browser up and running with no
extra tracks.  This genome browser will have no annotations.
The "full" set of files for a given genome browser are all files
required to operate that genome browser with all annotation tracks
available.  The sizes of the larger databases are becoming more
that 1 Tb.

The update race condition ==========================================

This problem can be avoided if you rsync the MySQL binary data
files directly into your MySQL database.  There isn't a script
provided here to do this since it is just a periodic rsync.
That can be a simple several line shell script.  For example, to
fetch a minimal set of tables for the hg19 browser directly
into the MySQL database:

for TBL in cytoBand cytoBandIdeo chromInfo gold gap grp trackDb hgFindSpec
do
  for EXT in frm MYD MYI
  do
rsync -a -P rsync://hgdownload.soe.ucsc.edu/mysql/hg19/${TBL}.${EXT} \
	/var/mysql/hg19/
  done
done

A database such as hg18 with individual tables for each chromosome would
have an extra rsync to help pick up those tables:

for TBL in cytoBand cytoBandIdeo chromInfo gold gap grp trackDb hgFindSpec
do
  for EXT in frm MYD MYI
  do
rsync -a -P rsync://hgdownload.soe.ucsc.edu/mysql/hg19/${TBL}.${EXT} \
	/var/mysql/hg19/
rsync -a -P rsync://hgdownload.soe.ucsc.edu/mysql/hg19/chr*_${TBL}.${EXT} \
	/var/mysql/hg19/
  done
done

Please note, as of November 2011, there is a second rsync server
you can use in place of the hgdownload mentioned above.  Depending
upon your internet location, the second server may provide
better transfer speed.  To use the second rsync server, use
the name hgdownload-sd in place of the hgdownload in the above commands.

If you instead need to fetch the goldenPath database text
dumps and load them, please note the following discussion.

There is a tricky coordination problem that must be considered.
The updates of the MySQL text dumps in goldenPath/*/database/ are
uncoordinated with when your database tables were last loaded.
An attempt to coordinate this situation is included with the
tagGoldenPath.sh script and the loadUpdateGoldenPath.sh script.
If you know that all the databases are loaded completely from your existing
set of goldenPath/*/database/ file, you can use the tagGoldenPath.sh
script to tag those database/ directories with a lastTimeStamp of
the newest file existing there.  When an update rsync takes place,
any new files will be newer than that lastTimeStamp.  The
loadUpdateGoldenPath.sh script will load or reload only the
new files since the lastTimeStamp.  If that load is successful,
then the tagGoldenPath.sh can be run and a new lastTimeStamp will
be established.  As long as the loading and reloading are successful,
there will be no difficulty with this coordination.  In the worst case,
if database tables and download dumps are unknown to be in sync
or not, a full reload of the database should be undertaken.  A full
reload of a database can consume a number of hours.

Brief description of these scripts.  Run them with no arguments
to view their full help messages:

1. activeDbList.sh - using MySQL commands to the public UCSC MySQL
	server, will fetch a list of active databases.  Lists of
	databases can be used with some of the fetch and load scripts.
2. minimal.db.list.txt - a list of active databases as of mid-March 2010
3. kentSrcUpdate.sh - fetches and/or updates a local copy of the 'kent'
	source tree and builds that source tree with resulting binaries
	into directories of your specification in your copy of the
	browserEnvironment.txt file.  Can be run as a cron job about every two
	weeks to keep your source tree, CGI binaries, and kent utilities
	up to date.
4. fetchHgCentral.sh - fetches a copy of the hgcentral.sql definition of
	the hgcentral database
5. updateHtml.sh - rsync fetch the static HTML hierarchy of
	the UCSC Genome Browser into your specified documentRoot
6. fetchMinimalGbdb.sh - rsync fetch of a minimal set of files in the /gbdb/
	hierarchy to operate a genome browser with the smallest set of files
7. fetchMinimalGoldenPath.sh - rsync fetch of a minimal set of files in the
	goldenPath/ hierarchy to be used to load a genome datase
8. loadDb.sh - can perform an initial load of a database from the
	goldenPath/ database MySQL text dumps
9. fetchFullGbdb.sh - rsync fetch of all files for a given database into
	the local /gbdb/ hierarchy
10. fetchFullGoldenPath.sh - rsync fetch of all files for a given database
	into the local goldenPath/ database MySQL text dump directory
11. tagGoldenPath.sh - after everything is loaded successfully in
	goldenPath/*/database this script will touch a lastUpdate file
	to be used by loadUpdateGoldenPath.sh
12. loadUpdateGoldenPath.sh - after goldenPath/*/database download are updated,
	this script will load any new tables since the last timeStamp
13. dbTrashCleaner.sh - run as a cron job periodically to clean
	database tables in the customTrash database
14. trashCleaning.sh - run as a cron job periodically to clean files
	in your genome browser .../trash/ directory.
15. allDatabaseUpdate.sh - this attempts to put all db updates together into
	one script.  It runs the four different fetch scripts and the
	database update script.  It needs to have lists of databases
	to update minimally or fully.  The logs of update activities need
	to be checked to verify it is working OK.
16. printEnv.pl - script to verify your cgi-bin directory is functioning
	correctly.  Note comments in that script.

====================================================================
This file last updated: 2010/10/27 10:51:59
====================================================================
