John D Kececioglu

John D Kececioglu

Professor, Computer Science
Associate Professor, Applied Mathematics - GIDP
Associate Professor, Genetics - GIDP
Associate Professor, BIO5 Institute
Primary Department
Department Affiliations
Contact
(520) 621-4526

Work Summary

John Kececioglu's research is in applied algorithms, with an emphasis on bioinformatics and computational biology, including: multiple sequence alignment, inverse parametric alignment, sequence assembly, and genome rearrangement. Software developed by his group includes Opal, a tool for multiple sequence alignment, Facet, a tool for alignment accuracy estimation, InverseOpt, a library for inverse parametric optimization, Ipa, a tool for inverse sequence alignment, and AlignAlign, a tool for optimally aligning alignments.

Research Interest

John Kececioglu is an Associate Professor in the Department of Computer Science, and the BIO5 Institute. His research is in applied algorithms, especially for areas of bioinformatics and computational biology such as: multiple alignment, inverse alignment, sequence assembly, and genome rearrangement. Software developed by his group includes Opal, a tool for multiple sequence alignment; Facet, a tool for alignment accuracy estimation; Ipa, a tool for inverse sequence alignment; AlignAlign, a tool for optimally aligning alignments, and Ninja, a tool for evolutionary tree construction. John is a recipient of a National Science Foundation CAREER Award, serves as an Associate Editor for IEEE/ACM Transactions on Computational Biology and Bioinformatics, and is on the Editorial Board of Algorithms for Molecular Biology. He was Conference Chair for RECOMB 2009, and Program Committee Co-Chair in the area of Sequence Analysis for ISMB 2011 and BCB 2012. John served as Associate Head of the Department of Computer Science during 2012.

Publications

Ravi, R., & Kececioglu, J. D. (1998). Approximation algorithms for multiple sequence alignment under a fixed evolutionary tree. Discrete Applied Mathematics, 88(1-3), 355-366.

Abstract:

We consider the problem of multiple sequence alignment under a fixed evolutionary tree: given a tree whose leaves are labeled by sequences, find ancestral sequences to label its internal nodes so as to minimize the total length of the tree, where the length of an edge is the edit distance between the sequences labeling its endpoints. We present a new polynomial-time approximation algorithm for this problem, and analyze its performance on regular d-ary trees with d a constant. On such a tree, the algorithm finds a solution within a factor (d + 1)/(d - 1) of the minimum in Q(kdT(d, n)+k2dn2) time, where k is the number of leaves in the tree, n is the length of the longest sequence labeling a leaf, and T(d, n) is the time to compute a Steiner point for d sequences of length at most n. (A Steiner point for a set script capital L sign of sequences is a sequence P that minimizes the sum of the edit distances from P to each sequence in script capital L sign. The time T(d, n) is O(d2dnd), given O(dsd+l)-time preprocessing for an alphabet of size s.) The approximation algorithm is conceptually simple and easy to implement, and actually applies to any metric space in which a Steiner point for any fixed-sized set can be computed in polynomial time. We also introduce a new problem, bottleneck tree-alignment, in which the objective is to label the internal nodes of the tree so as to minimize the length of the longest edge. We describe an exponential-time exact algorithm for the case of unit-cost edit operations, and show there is a simple linear-time approximation algorithm for the general case that finds a solution within a factor O(log k) of the minimum. 1998 Published by Elsevier Science B.V. All rights reserved.

Kececioglu, J. D., Lenhof, H., Mehlhorn, K., Mutzel, P., Reinert, K., & Vingron, M. (2000). A polyhedral approach to sequence alignment problems. Discrete Applied Mathematics, 104(1-3), 143-186.

Abstract:

We study two new problems in sequence alignment both from a practical and a theoretical view, using tools from combinatorial optimization to develop branch-and-cut algorithms. The generalized maximum trace formulation captures several forms of multiple sequence alignment problems in a common framework, among them the original formulation of maximum trace. The RNA sequence alignment problem captures the comparison of RNA molecules on the basis of their primary sequence and their secondary structure. Both problems have a characterization in terms of graphs which we reformulate in terms of integer linear programming. We then study the polytopes (or convex hulls of all feasible solutions) associated with the integer linear program for both problems. For each polytope we derive several classes of facet-defining inequalities and show that for some of these classes the corresponding separation problem can be solved in polynomial time. This leads to a polynomial-time algorithm for pairwise sequence alignment that is not based on dynamic programming. Moreover, for multiple sequences the branch-and-cut algorithms for both sequence alignment problems are able to solve to optimality instances that are beyond the range of present dynamic programming approaches. © 2000 Elsevier Science B.V.

Kececioglu, J., Ming, L. i., & Tromp, J. (1997). Inferring a DNA sequence from erroneous copies. Theoretical Computer Science, 185(1), 3-13.

Abstract:

We suggest a novel approach for efficiently reconstructing an original DNA sequence from erroneous copies.

Kececioglu, J. D., & Myers, E. W. (1995). Combinatorial algorithms for DNA sequence assembly. Algorithmica, 13(1-2), 7-51.

Abstract:

The trend toward very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Program, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates, and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed nonrepetitive sequences of length 50,000 sampled at error rates of as high as 10%. © 1995 Springer-Verlag New York Inc.

Kim, E., Wheeler, T., & Kececioglu, J. (2009). Learning models for aligning protein sequences with predicted secondary structure. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5541 LNBI, 512-531.

Abstract:

Accurately aligning distant protein sequences is notoriously difficult. A recent approach to improving alignment accuracy is to use additional information such as predicted secondary structure. We introduce several new models for scoring alignments of protein sequences with predicted secondary structure, which use the predictions and their confidences to modify both the substitution and gap cost functions. We present efficient algorithms for computing optimal pairwise alignments under these models, all of which run in near-quadratic time. We also review an approach to learning the values of the parameters in these models called inverse alignment. We then evaluate the accuracy of these models by studying how well an optimal alignment under the model recovers known benchmark reference alignments. Our experiments show that using parameters learned by inverse alignment, these new secondarystructure-based models provide a significant improvement in alignment accuracy for distant sequences. The best model improves upon the accuracy of the standard sequence alignment model for pairwise alignment by as much as 15% for sequences with less than 25% identity, and improves the accuracy of multiple alignment by 20% for difficult benchmarks whose average accuracy under standard tools is less than 40%. © Springer-Verlag Berlin Heidelberg 2009.