
- add correlation coefficient as a criterion:
    Adam Siepel <acs@soe.ucsc.edu> 2005/03/02
    On another front, it occurred to me that the correlation coefficient 
    sometimes used in gene prediction stats could be another useful thing 
    to report.  For each inFile record, you could give a correlation 
    coefficient based on all overlapping selectFile records.  This would 
    give you one number saying something about both directions of coverage 
    and about the degree of "consistency" we were talking about the other 
    day.  For example, you could project the intronEsts into a bed of 
    nonoverlapping features using featureBits, then run overlapSelect -cc 
    (or similar), to get a cc number for each inFile record, which could
    then go in a database like the one I'm building.  I think, when computing
    the cc, you might want to limit yourself to the range of the inFile 
    record.  That would make sense for my application at least, where my 
    predictions are fragments.  In other cases, you might want to compute 
    the cc for the smallest interval including both the inFile record and 
    all overlapping selectFile records.

    It looks to me like the number you'd compute for a given interval would 
    be
            cc =  (cN - ab) / sqrt(ab(N-a)(N-b))

    where a is the number of "bits" (e.g., bases in exons) in the inFile 
    record, b is the total number of bits in all overlapping selectFile 
    records (within the interval), c is the number of bits in both the 
    inFile and the selectFile records, and N is the length of the interval.
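
    [Note, not part of the original message.]  The counts a, b, c and N
    defined above could be gathered roughly as follows.  This is only a
    sketch in Python, not overlapSelect code: the function names, the
    half-open (start, end) block representation, and the in_range_only
    switch are all assumptions made for illustration.  The switch chooses
    between the two ranges discussed earlier (the inFile record alone, or
    the smallest interval covering the inFile record and all overlapping
    selectFile records).

```python
def merge(blocks):
    """Merge half-open (start, end) blocks into sorted, non-overlapping blocks."""
    merged = []
    for start, end in sorted(blocks):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def clip(blocks, lo, hi):
    """Restrict sorted blocks to the interval [lo, hi)."""
    return [(max(s, lo), min(e, hi)) for s, e in blocks if s < hi and e > lo]


def size(blocks):
    """Total number of bases in a non-overlapping block list."""
    return sum(e - s for s, e in blocks)


def intersect_size(xs, ys):
    """Bases covered by both of two sorted, non-overlapping block lists."""
    i = j = total = 0
    while i < len(xs) and j < len(ys):
        lo = max(xs[i][0], ys[j][0])
        hi = min(xs[i][1], ys[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever block ends first.
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return total


def overlap_counts(in_blocks, select_blocks, in_range_only=True):
    """Return (a, b, c, N) for one inFile record (hypothetical helper).

    in_blocks     -- blocks (e.g. exons) of the inFile record; must be non-empty
    select_blocks -- blocks of all overlapping selectFile records
    in_range_only -- if True, N is the span of the inFile record; otherwise
                     the smallest interval covering both sets of blocks
    """
    in_blocks = merge(in_blocks)
    select_blocks = merge(select_blocks)
    lo, hi = in_blocks[0][0], in_blocks[-1][1]
    if not in_range_only and select_blocks:
        lo = min(lo, select_blocks[0][0])
        hi = max(hi, select_blocks[-1][1])
    sel = clip(select_blocks, lo, hi)
    a = size(in_blocks)
    b = size(sel)
    c = intersect_size(in_blocks, sel)
    return a, b, c, hi - lo
```

    Merging first plays the role of the featureBits projection onto
    nonoverlapping features, so no base is counted twice in b or c.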

    For example, if you had a 1000 base interval with 100 bases within 
    predicted exons, 150 bases of supporting EST evidence, and an overlap 
    of 90 bases, then N = 1000, a = 100, b = 150, c = 90, and cc = 0.70.

    This number is defined as long as 0 < a,b < N.  It will always be true 
    that a > 0 (otherwise you don't have an inFile record).  If a > 0 and b 
    = 0, then you'd have 0/0 but you could just report 0.  If a = N and b 
    <= N (also possible), then c = b and you'd also have 0/0.  You could 
    report b/N in this case.  Symmetrically, if b = N and a <= N, then c = a
    and you could report a/N.
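
    [Note, not part of the original message.]  A minimal Python sketch of
    the formula with the fallbacks just described.  The function name is
    hypothetical, and returning a/N when b = N is one reading of "the
    symmetric thing"; the other fallbacks (0 when b = 0, b/N when a = N)
    are taken directly from the text.

```python
import math


def correlation_coefficient(a, b, c, n):
    """cc = (c*n - a*b) / sqrt(a*b*(n-a)*(n-b)), with fallbacks for 0/0.

    a -- bits in the inFile record
    b -- bits in all overlapping selectFile records (within the interval)
    c -- bits in both
    n -- length of the interval
    """
    if a <= 0:
        raise ValueError("a must be positive: there is no inFile record otherwise")
    if b == 0:
        return 0.0      # 0/0: no select coverage, report 0
    if a == n:
        return b / n    # 0/0: inFile covers the whole interval, so c = b
    if b == n:
        return a / n    # assumed symmetric case: c = a
    return (c * n - a * b) / math.sqrt(a * b * (n - a) * (n - b))


# Worked example from the text: N = 1000, a = 100, b = 150, c = 90.
print(round(correlation_coefficient(100, 150, 90, 1000), 2))  # 0.7
```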

    I suppose an alternative would be to report a, b, c, and N for each 
    inFile record.  Then the cc or some alternative could easily be 
    computed with an awk script.

- add featureBits type of feature specifications (e.g. :intron)

