NGSUtils - bamutils

bamutils / count

Counts the number of reads in genes or regions

This takes a gene/region model and a BAM file and calculates how many reads
show support of each gene/region.

Possible annotation models: gtf, exon, bed, repeat, repeatfam, or bin

[gtf]
    Calculate the number of reads that map within the coding regions of each
    gene. If [-norm] is given, an FPKM calculation is also performed,
    yielding the normalized FPKM value for each gene.

    For paired-end reads, each read will only count once for each gene.

    Requires: GTF file
    Calculates: # reads, FPKM, coverage
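
The FPKM and CPM values mentioned above can be sketched as follows. This is a
minimal illustration using the standard definitions, not the bamutils source;
the function and parameter names are hypothetical, and `total_reads` stands in
for whichever read total the -norm option selects.

```python
def cpm(read_count, total_reads):
    """Counts per million: reads in the region per million normalizing reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return read_count / (total_reads / 1e6)


def fpkm(read_count, region_length_bp, total_reads):
    """Reads per kilobase of region per million normalizing reads."""
    if region_length_bp <= 0:
        raise ValueError("region_length_bp must be positive")
    # Divide the per-million count by the region length expressed in kb.
    return cpm(read_count, total_reads) / (region_length_bp / 1000.0)


# Hypothetical example: 500 reads in a 2 kb gene, 10 million normalizing reads.
example_fpkm = fpkm(500, 2000, 10_000_000)
example_cpm = cpm(500, 10_000_000)
```

Here the gene has 50 counts per million over 2 kb, giving an FPKM of 25.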


[exon]
    Calculate the number of reads that map to each expressed region/exon
    for all genes. In addition, for each gene, the reads mapping to
    consecutively constant regions are counted. From these two numbers an
    alternative index is calculated (# reads in a region / # consec. const. reads).

    Regions can be exons, or parts of exons, depending on splicing (determined
    by isoform annotation).

    For paired-end reads, multiple fragments will be counted *if* they show
    evidence of multiple exons. If both reads of a pair map to the same exon,
    the pair will be counted only once for that exon.

    Requires: GTF file
    Calculates: # reads,
                FPKM,
                # const region reads for gene,
                # region reads,
                # region excluding reads (spliced out),
                Inclusion percentage,
                Exclusion percentage,
                alt-index

    Inclusion percentage = # including reads / # all reads
    Exclusion percentage = # excluding reads / # all reads

    Alt-index =     (# region reads) - (# region excluding reads)
                 ------------------------------------------------
                          (# non region reads for gene)

    Note: the alt-index is an experimental calculation that attempts to
    capture the amount that each region contributes to the overall read count
    for a gene. This is close to a percentage, except that we also take into
    account the number of reads that explicitly exclude a region. We divide by
    the number of non-region reads to account for changes that might be due to
    gene expression-level changes.
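
The alt-index formula above can be written out as a small function. This is
only a sketch of the arithmetic as stated in this text, not the bamutils
implementation; the function name, argument names, and the choice to return
None for a zero denominator are assumptions made for illustration.

```python
def alt_index(region_reads, excluding_reads, non_region_reads):
    """(# region reads - # region excluding reads) / # non region reads.

    Returns None when the gene has no non-region reads, since the ratio is
    undefined in that case (an assumption; the tool may behave differently).
    """
    if non_region_reads == 0:
        return None
    return (region_reads - excluding_reads) / non_region_reads


# Hypothetical example: 120 reads in the region, 20 reads that splice it
# out, and 400 reads elsewhere in the gene.
example_alt_index = alt_index(120, 20, 400)
```

With these numbers the region contributes (120 - 20) / 400 = 0.25.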



[bed]
    Calculates the number of reads in a region, where the region is defined
    in a BED6 formatted file: chrom, start (0-based), end, name, score, strand.

    Requires: BED file
    Calculates: # reads, FPKM, coverage


[repeat]
    Calculates the number of reads that map to various repeat regions in the
    genome.  Repeat regions are defined by annotations from repeatmasker.org.
    Output is the number of reads that map to each repeat element.

    Requires: RepeatMasker file
    Calculates: # reads, FPKM

[repeatfam]
    Calculates the number of reads that map to various repeat regions in the
    genome.  Repeat regions are defined by annotations from repeatmasker.org.
    Output is the number of reads that map to each family/member of repeats.

    Requires: RepeatMasker file
    Calculates: # reads, FPKM

[bin]
    Calculates the number of reads in bins of N bases. Reads
    that span a bin-bin boundary will be counted for each bin. Valid
    normalization options: total, quantile, none. If quantile normalization
    is performed, only bins that include a read will be used.

    Requires: bin-size
    Calculates: # reads
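
The bin-counting rule above (a read spanning a boundary is counted in every
bin it touches) can be sketched as follows. This is an illustration only, not
the bamutils code: reads are assumed to be given as zero-based, half-open
(start, end) tuples on a single chromosome, and the function name is
hypothetical.

```python
from collections import Counter


def count_bins(reads, bin_size):
    """Count reads per bin, keyed by the zero-based start of each bin."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts = Counter()
    for start, end in reads:
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size  # end is exclusive
        # A read that crosses a boundary adds one count to each bin it covers.
        for bin_index in range(first_bin, last_bin + 1):
            counts[bin_index * bin_size] += 1
    return counts


# Hypothetical example with 100-base bins; the second read spans two bins.
example_counts = count_bins([(10, 60), (90, 110), (150, 160)], 100)
```

In this example both the bin starting at 0 and the bin starting at 100 end up
with a count of 2, even though only three reads were given.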

Note: Output start positions are zero-based coordinates.


Usage: bamutils count {opts} bamfile

Model options (you must select one):
    -gtf filename      Count reads for genes based on a GTF model
    -exon filename     Count reads for each exon/expressed region (GTF model)
                       (alternative-splicing detection)
    -bed filename      Count reads in BED regions
    -repeat filename   Count reads in RepeatMasker.org defined repeat elements
    -repeatfam fname   Count reads in RepeatMasker.org defined repeat families
    -bin size          Count reads present in bins of {size} bases

Other options:
    -library <value>   the orientation of mapping for single or paired end reads
                       with respect to the primary strand of the gene/region.

                       Possible values:
                       FR         - fragments mapped forward/reverse (default)
                       RF         - fragments mapped reverse/forward
                       unstranded - fragments mapped in either FR or RF

    -coverage          calculate average coverage for genes/regions
    -uniq              only count unique starting positions
                       (avoids possible PCR artifacts, not recommended)
    -startonly         Only take into account the start pos of the read to assign counts
    -fpkm              calculate FPKM values based on millions of mapped reads
                       and the length of the region in kb (number of mapped reads
                       determined by -norm value)
    -norm <value>      how to normalize counts
                       (adds counts-per-million (CPM) column)
    -multiple <value>  how to handle reads that map to multiple locations
    -whitelist file    file containing a white-list of read names
                       (only these read-names will be used in the calcs)
    -blacklist file    file containing a black-list of read names
                       (these read-names will not be used in the calcs)

Possible values for [-norm]:
    (If -norm is not given, FPKM and CPM values can't be calculated)

    all         Use the number of all reads that mapped (anywhere)
    mapped      Use the number of reads that map in the model (genes/regions)
    median      Use the median value
                (genes/regions without reads excluded)

Possible values for [-multiple]:
    complete    Adds to the counts of all genes/regions (default)
                (Note: this can result in more 'counts' than reads)
    ignore      Don't add to the count of any genes/regions
    partial     Adds a fractional count to all genes/regions
                (1/number of matches, ex: IH:i:3 add 0.333 to each gene)

    Note: The IH tag is used to determine if a read has mapped to multiple
          locations. If the IH tag isn't present, then the NH tag is used. If
          both tags are missing, then each read is assumed to have mapped to only
          one location on the reference.
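
The three -multiple modes and the IH/NH fallback described above can be
sketched like this. It is a simplified illustration rather than the bamutils
implementation: the read's tags are assumed to be available as a plain dict,
and the function name is hypothetical.

```python
def multimap_weight(tags, mode="complete"):
    """Return the count a read adds to each gene/region it overlaps."""
    # Prefer IH, fall back to NH, and assume one location if neither is set.
    hits = tags.get("IH", tags.get("NH", 1))
    if hits <= 1:
        return 1.0  # uniquely mapped reads always count fully
    if mode == "complete":
        return 1.0  # full count everywhere; totals may exceed the read count
    if mode == "ignore":
        return 0.0  # multi-mapped reads are not counted at all
    if mode == "partial":
        return 1.0 / hits  # fractional count, e.g. IH:i:3 gives 1/3
    raise ValueError("unknown mode: %s" % mode)


# Hypothetical example: a read tagged IH:i:3 in partial mode.
example_weight = multimap_weight({"IH": 3}, "partial")
```

A read tagged IH:i:3 therefore adds roughly 0.333 to each gene in partial
mode, matching the example given for that option.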