Kececioglu, J., & Deblasio, D. (2013). Accuracy estimation and parameter advising for protein multiple sequence alignment. Journal of Computational Biology, 20(4), 259-279.
PMID: 23489379;PMCID: PMC3619150;Abstract:
We develop a novel and general approach to estimating the accuracy of multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new task that we call parameter advising: the problem of choosing values for alignment scoring function parameters from a given set of choices to maximize the accuracy of a computed alignment. For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. Compared to prior approaches for estimating accuracy, our new approach (a) introduces novel feature functions that measure nonlocal properties of an alignment yet are fast to evaluate, (b) considers more general classes of estimators beyond linear combinations of features, and (c) develops new regression formulations for learning an estimator from examples; in addition, for parameter advising, we (d) determine the optimal parameter set of a given cardinality, which specifies the best parameter values from which to choose. Our estimator, which we call Facet (for "feature-based accuracy estimator"), yields a parameter advisor that on the hardest benchmarks provides more than a 27% improvement in accuracy over the best default parameter choice, and for parameter advising significantly outperforms the best prior approaches to assessing alignment quality. © Copyright 2013, Mary Ann Liebert, Inc. 2013.
Kececioglu, J., Shete, S., & Arnold, J. (2000). Reconstructing distances in physical maps of chromosomes with nonoverlapping probes. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, 183-192.
Abstract:
We present a new method for reconstructing the distances between probes in physical maps of chromosomes constructed by hybridizing pairs of clones under the so-called sampling-without-replacement protocol. In this protocol, which is simple, inexpensive, and has been used to successfully map several organisms, equal-length clones are hybridized against a clone-subset called the probes. The probes are chosen by a sequential process that is designed to generate a pairwise-nonoverlapping subset of the clones. We derive a likelihood function on probe spacings and orders for this protocol under a natural model of hybridization error, and describe how to reconstruct the most likely spacing for a given order under this objective using continuous optimization. The approach is tested on simulated data and real data from chromosome VI of Aspergillus nidulans. On simulated data we recover the true order and close to the true spacing; on the real data, for which the true order and spacing is unknown, we recover a probe order differing significantly from the published one. To our knowledge this is the first practical approach for computing a globally-optimal maximum-likelihood reconstruction of interprobe distances from clone-probe hybridization data.
Gupta, S. K., Kececioglu, J. D., & Schäffer, A. (1995). Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment.. Journal of computational biology : a journal of computational molecular cell biology, 2(3), 459-472.
PMID: 8521275;Abstract:
The MSA program, written and distributed in 1989, is one of the few existing programs that attempts to find optimal alignments of multiple protein or DNA sequences. The MSA program implements a branch-and-bound technique together with a variant of Dijkstra's shortest paths algorithm to prune the basic dynamic programming graph. We have made substantial improvements in the time and space usage of MSA. The improvements make feasible a variety of problem instances that were not feasible previously. On some runs we achieve an order of magnitude reduction in space usage and a significant multiplicative factor speedup in running time. To explain how these improvements work, we give a much more detailed description of MSA than has been previously available. In practice, MSA rarely produces a provably optimal alignment and we explain why.
Kim, E., & Kececioglu, J. (2007). Inverse sequence alignment from partial examples. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 4645 LNBI, 359-370.
Abstract:
When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for biological sequences is inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the example alignments score close to optimal. We extend prior work on inverse alignment to partial examples and to an improved model based on minimizing the average error of the examples. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the recovery rate for multiple sequence alignment by up to 25%. © Springer-Verlag Berlin Heidelberg 2007.
DeBlasio, D. F., & Kececioglu, J. D. (2017). Learning parameter-advising sets for multiple sequence alignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 14(5), 1028-1041.