Jun 21, 2012

Simple Statistical Algorithm for Biological Sequence Compression Paper Presentation

Abstract: This paper introduces a novel algorithm for biological sequence compression that makes use of both statistical properties and repetition within sequences. A panel of experts is maintained to estimate the probability distribution of the next symbol in the sequence to be encoded. Expert probabilities are combined to obtain the final distribution. The resulting information sequence provides insight for further study of the biological sequence. Each symbol is then encoded by arithmetic coding. Most compression algorithms fall into one of two categories, namely substitution compression and statistical compression. Those in the former class replace a long repeated subsequence by a pointer to an earlier instance of the subsequence or to an entry in a dictionary. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.
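
The panel-of-experts idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the expert type (an adaptive order-k context model with add-one smoothing), the class name `OrderKExpert`, the function name `information_sequence`, and the weight update (each weight multiplied by the probability its expert gave to the observed symbol, then renormalised) are all assumptions chosen for illustration. The sketch blends the experts' predictions into one distribution and records the information content, -log2 p, of each symbol. An ideal arithmetic coder would spend roughly that many bits per symbol.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed DNA alphabet for this illustration


class OrderKExpert:
    """Hypothetical expert: adaptive order-k context model, add-one smoothing."""

    def __init__(self, k):
        self.k = k
        # Every context starts with a count of 1 per symbol (smoothing).
        self.counts = defaultdict(lambda: {s: 1 for s in ALPHABET})

    def _context(self, history):
        # Last k symbols of the history (shorter near the start of the sequence).
        return history[max(0, len(history) - self.k):] if self.k else ""

    def predict(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts.values())
        return {s: counts[s] / total for s in ALPHABET}

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1


def information_sequence(seq, experts):
    """Return the per-symbol information content (bits) under the blended model."""
    weights = [1.0 / len(experts)] * len(experts)
    info = []
    for i, symbol in enumerate(seq):
        history = seq[:i]
        preds = [e.predict(history) for e in experts]
        # Combine expert probabilities into the final distribution.
        mixed = {s: sum(w * p[s] for w, p in zip(weights, preds)) for s in ALPHABET}
        info.append(-math.log2(mixed[symbol]))
        # Assumed weighting rule: reward experts that predicted the symbol well.
        weights = [w * p[symbol] for w, p in zip(weights, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
        for e in experts:
            e.update(history, symbol)
    return info


if __name__ == "__main__":
    bits = information_sequence("ACGT" * 20, [OrderKExpert(0), OrderKExpert(2)])
    print(f"estimated code length: {sum(bits):.1f} bits for {len(bits)} symbols")
```

On a highly repetitive input such as the one in the usage example, the per-symbol information should fall as the higher-order expert gains weight. Dips of this kind in the information sequence are the sort of signal the abstract suggests could guide further study of a sequence.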

