The figure shows the bitmap representation of cassette properties, and finding matching combinations
Challenge: to find all the results requires trying all Possible combinations. In this example 3x3x2=18. For millions of cassettes this search is exponential.
- Reorganize the list of functions per gene cassette into bitmaps
- Use FastBit to compress the bitmaps
- Re-structure the query processing algorithm into bitwise logical operations
- Remove solution entries that are contained in other entries (maximal solutions only)
- Progressive pruning the possible solutions based using bitmaps as keys for comparisons
- Providing interactive exploration
- When more than 5 organisms are involved in a query, the previous system based on a commercial database system takes too long and the GUI times out
- Using the new solution, queries involving 600 organisms took less than 10 seconds
- New solution deployed in IMP system at <img.jgi.doe.gov> since May 2013
A. Romosan, A. Shoshani, K.Wu, V. Markowitz, K. Mavrommatis, "Accelerating Gene Context Analysis Using Bitmaps", 25th International Conference on Scientific and Statistical Database Management (SSDBM), 2013.
"Computational methods for determining the function of genes in newly sequenced genomes have been traditionally based on sequence similarity to genes whose function has been identified experimentally. Additional predictions can be made through gene context analysis which examines the conservation of chromosomal gene cassettes across genomes."
"Gene context analysis starts with first identifying "gene cassettes," which are defined as a sequence of co-located genes. Since each gene is associated with several functional characterizations (functions for short) each cassette is associated with a set of functions."
"Cassette Search" refer to searches for cassettes with common functions over millions of cassettes. We achieved great improvement over such searches by traditional methods by representing the functions associated with cassettes as bitmaps, and then used FastBit to provides efficient searches when comparing cassette functions for all possible combinations of such functions.