Research Vignette: Network Biology Meets Cancer Genomics

by Teresa Przytycka and Roded Sharan

The Spring 2016 Simons Institute program on Algorithmic Challenges in Genomics focused on three main research areas: Computational Cancer Biology, Regulatory Genomics and Epigenomics, and Network Biology. The different tracks allowed us, the long-term visitors, to better understand the relationships among the different areas and come up with ideas on how they can be integrated to gain further insights into biology. Here, we would like to tell the story of one such integration attempt that was initiated during the Simons program, where ideas from network biology led to new formulations and novel findings in cancer genomics.

A fundamental concept in cancer genomics is that of a disease module that contains functionally related genes that jointly drive the disease. The discovery of such gene sets can reveal the molecular mechanisms that underlie the disease and suggest potential therapeutic targets. It is generally assumed that cancer progresses through stages, starting from a precancerous lesion caused, for example, by DNA damage. If such damage cannot be properly handled by the cell repair mechanisms, for example when such mechanisms are compromised by mutations, an early stage of cancer emerges. In this state, the organism still relies on cellular mechanisms such as apoptosis, cell cycle arrest, or senescence, to guard against uncontrolled cell growth and proliferation. However, once those barriers are compromised, say by further mutations, the disease progresses. While the progression seems sequential, steered by driver mutations that provide a growth advantage to the evolving tumor cells, the actual process is more complex and not fully understood. To shed light on the progression process, we need not only to be able to identify disease modules, but also to understand the relationships among them.

A common approach to detecting cancer modules is to look for sets of genes that are mutually exclusive in the mutations they harbor across a cohort of patients. The underlying assumption is that the module can be disrupted by a mutation in any of its member genes, thereby driving the disease progression, and additional mutations do not provide an advantage to the disease. As driver mutations are selected for during cancer evolution, the module's genes often exhibit mutually exclusive mutation patterns across different patients. Other inter-gene relations, such as physical interactions, co-expression, functional similarity, and more, can inform the discovery of cancer modules. Thanks to public projects such as The Cancer Genome Atlas, researchers can now access genomic data of thousands of patients across tens of cancer types. Such valuable data sources enable the systematic search for disease modules using different computational and statistical approaches [1].
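To make the mutual exclusivity idea concrete, here is a minimal Python sketch that scores a gene pair by how many patients carry a mutation in exactly one of the two genes, penalizing overlap. The patient profiles and gene names below are made up for illustration, and real analyses (such as those surveyed in [1]) rely on proper statistical tests rather than this raw count.

# Hypothetical mutation profiles: patient -> set of mutated genes (toy data).
mutations = {
    "patient1": {"geneA"},
    "patient2": {"geneB"},
    "patient3": {"geneA"},
    "patient4": {"geneA", "geneB"},
}

def exclusivity_score(gene_a, gene_b, profiles):
    # Reward patients hit by exactly one of the two genes, penalize patients hit by both.
    exactly_one = sum((gene_a in genes) != (gene_b in genes) for genes in profiles.values())
    both = sum(gene_a in genes and gene_b in genes for genes in profiles.values())
    return exactly_one - both

print(exclusivity_score("geneA", "geneB", mutations))  # 3 - 1 = 2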

While several research projects have tried to infer disease modules from mutation data, most have focused on mutual exclusivity relations. Only a few projects have considered co-occurrence relations, the integration of other types of data, or the modeling of inter-module relations. These more complex scenarios, however, are very common in network biology, where multiple relations can be modeled by multiple edge types in a gene network. The discovery of interesting edge-patterns can then be cast as a clustering problem where one aims to maximize within-cluster associations as well as between-cluster relations. It thus seemed natural to try and apply such general clustering formulations in the cancer genomics domain.

Multiple relations can exist within and between genes of true cancer modules. We already mentioned mutual exclusivity within modules (Figure, Panel A), but it can also be present between modules, representing different cancer subtypes or different cancer progression pathways (Figure, Panel B). Similarly, mutations can co-occur within modules, representing some shared mutagenic process, or between modules, representing a synergistic relation, where two alternative pathways need to be disrupted for the disease to occur (Figure, Panel C). Finally, known gene interactions or functional associations are typically within modules, representing their shared role in some molecular pathway driving the disease.

To capture all these different relations, Phuong Dao, Yoo-Ah Kim, Damian Wojtowicz, Sanna Madan, and I modeled them within a general clustering framework that scores a given collection of modules by their within- and between-edges. The resulting clustering problem is reminiscent of cluster editing, also known as correlation clustering, where one wishes to minimally edit the edge set of a given graph so that it is transformed into a collection of disjoint cliques. While the problem is known to be NP-hard [2], it admits a practical and exact solution via integer linear programming [3]. Motivated by the success of the latter approach, we formulated the within-between module detection problem using integer linear programming. By applying a commercial solver (CPLEX), we could solve the problem to optimality on current data sets, consisting of hundreds of mutated genes and thousands of relations across hundreds of patients, in less than an hour. The integrated network view of these relations made it possible to systematically test the contribution of the different types of relations, both within and between modules, to module discovery.
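To give a flavor of how such a formulation is handed to an ILP solver, here is a simplified cluster-editing-style sketch in Python using the open-source PuLP library (standing in for CPLEX). Binary variables record whether two genes share a module, triangle constraints enforce transitivity, and the objective rewards grouping positively related genes while separating negatively related ones. The gene names and relation weights are invented, and the sketch omits the between-module terms of the full model, so it is only an illustration of the general approach.

from itertools import combinations
from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum, value

genes = ["g1", "g2", "g3", "g4"]
weight = {
    ("g1", "g2"): 1.0, ("g1", "g3"): 0.8, ("g2", "g3"): 0.9,    # want these together
    ("g1", "g4"): -0.7, ("g2", "g4"): -0.6, ("g3", "g4"): -0.5, # want these separated
}

prob = LpProblem("module_detection", LpMaximize)
same = {pair: LpVariable("same_%s_%s" % pair, cat=LpBinary)
        for pair in combinations(genes, 2)}

# Objective: reward placing positively related pairs in the same module and
# negatively related pairs in different modules.
prob += lpSum(w * same[pair] - w * (1 - same[pair]) for pair, w in weight.items())

# Triangle constraints make "same module" transitive (an equivalence relation).
for a, b, c in combinations(genes, 3):
    x, y, z = same[(a, b)], same[(a, c)], same[(b, c)]
    prob += x + y - z <= 1
    prob += x + z - y <= 1
    prob += y + z - x <= 1

prob.solve()
print({pair: int(value(var)) for pair, var in same.items()})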

In an application to real data from The Cancer Genome Atlas project, we were able to form comprehensive maps of the modules underlying different cancers and gain novel insights about the roles of different genes in driving the disease. As an example, when maximizing the mutual exclusivity of mutations within modules and the co-occurrence of mutations between modules, we uncovered pairs of modules that work in a synergistic fashion, where the disruption of both is required for the disease to occur (Figure, Panel C). Overall, our general strategy that accounts for a number of complementary relations proved to be a valuable tool in dissecting the mutational landscape of cancer to infer molecular mechanisms that drive the disease.

References

  1. Y-A Kim, D-Y Cho, and TM Przytycka (2016). “Understanding Genotype-Phenotype Effects in Cancer via Network Approaches.” PLoS Comput Biol 12: e1004747.
  2. R Shamir, R Sharan, and D Tsur (2004). “Cluster graph modification problems.” Discrete Appl Math 144: 173–182.
  3. S Böcker, S Briesemeister, and GW Klau (2009). “Exact Algorithms for Cluster Editing: Evaluation and Experiments.” Algorithmica 60: 316–334.

Research Vignette: Phase Transitions in Random Constraint Satisfaction Problems

by Allan Sly and Nike Sun

Random constraint satisfaction problems

Suppose we are given a finite graph. Does it admit a proper three-coloring? This is one example of a constraint satisfaction problem (CSP). Generally speaking, a CSP is specified by a list of variables, subject to a set of constraints. A prototypical computational question is to decide, in a given CSP instance, whether there exists any variable assignment satisfying all the constraints. If yes, then one may further ask how many solutions there are, or what the optimal value of some objective function is over the set of solutions.

On general inputs, these questions are not believed to be computationally tractable — many CSPs are NP-complete. For the problem of deciding whether a given graph is three-colorable, the best known algorithms require more than 2^{cn} time, for some constant c > 0, on the “worst” graphs of size n. In view of this, it became natural to develop notions of tractability in “average” or “typical” cases (Levin 1986). One of the standard ways to approach this has been via ensembles of random CSPs. For example, say that we sample a large random graph from some natural distribution — might it be the case that “in most cases” it is easy to decide whether the graph is three-colorable?
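As a toy illustration of the exponential search space behind these worst-case bounds, the following Python sketch decides three-colorability of a small graph by brute force over all 3^n colorings; the example graph is made up, and of course this approach is hopeless beyond very small n.

from itertools import product

def is_three_colorable(n, edges):
    # Try every assignment of 3 colors to vertices 0..n-1 (3^n possibilities).
    for coloring in product(range(3), repeat=n):
        if all(coloring[u] != coloring[v] for u, v in edges):
            return True
    return False

# A 5-cycle (hypothetical example), which is three-colorable.
print(is_three_colorable(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]))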

Satisfiability transitions in random CSPs

For all the commonly studied random CSP ensembles, the (average-case) complexity remains an open question. A possible reason for this is that even before going into computational considerations, there are far more basic properties of random CSPs that have not been resolved. For example, it seems very difficult to estimate the probability that a random CSP is satisfiable. In many examples, the random CSP can be naturally parametrized by a “constraint level” α — commonly, α is the ratio of the number of constraints to the number of variables. Numerical experiments (early 1990s) indicated that some of the most canonical random CSPs have a sharp satisfiability transition — a critical threshold 0 < αsat < ∞ such that the random CSP is satisfiable with high probability for α < αsat, and unsatisfiable with high probability for α > αsat.

This transition was most studied for the random k-SAT problem, defined as follows. Start with a set of boolean variables x1,…,xn. A literal is a variable xi, or its negation ¬xi. For i ≥ 1, the random clause ai is the disjunction of k literals chosen independently and uniformly from the set of literals {x1, ¬x1, …, xn, ¬xn}; the clauses are chosen mutually independently. A random k-SAT formula at density α is defined by the conjunction of the first M clauses, where M is a Poisson(αn) random variable: Φ = a1 ∧ … ∧ aM. Its solution space is the set S = S(Φ) of variable assignments x ∈ {T,F}^n which make the formula Φ satisfied, meaning Φ(x) = T. Let p = pk(α,n) be the probability that the random formula Φ is satisfiable — equivalently, p is the probability for S to be nonempty. The satisfiability threshold conjecture is that for each k ≥ 2, there is a critical αsat (depending on k but not on n) such that

\lim_{n\to\infty} p_k(\alpha,n) = \begin{cases} 1 & \textrm{if } \alpha < \alpha_{\textrm{sat}} \\ 0 & \textrm{if } \alpha > \alpha_{\textrm{sat}} \end{cases}
This was proved for k = 2 (with αsat = 1) in a few independent works around 1992 (Chvátal–Reed, de la Vega, Goerdt), but remained open for k ≥ 3. It is known (Friedgut 1999) that for each k ≥ 3 there exists a sharp threshold sequence αsat(n) — the conjecture holds if the sequence converges.
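As a small numerical illustration, the following Python sketch samples random k-SAT formulas at a given density and estimates the satisfiability probability by brute force (feasible only for small n); the sharp drop of this estimate as α grows is the empirical signature of the conjectured threshold. The parameter choices below are arbitrary.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def random_ksat(n, k, alpha):
    # M ~ Poisson(alpha * n) clauses; literal i stands for x_i, -i for its negation.
    m = rng.poisson(alpha * n)
    lits = [v for i in range(1, n + 1) for v in (i, -i)]
    return [[lits[rng.integers(len(lits))] for _ in range(k)] for _ in range(m)]

def satisfiable(n, clauses):
    # Brute force over all 2^n truth assignments.
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in clauses):
            return True
    return False

def estimate_p(n, k, alpha, trials=50):
    return sum(satisfiable(n, random_ksat(n, k, alpha)) for _ in range(trials)) / trials

# For k = 3 the transition is observed empirically around alpha ≈ 4.27.
for alpha in (2.0, 4.0, 5.0):
    print(alpha, estimate_p(n=10, k=3, alpha=alpha))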

CONTINUE READING

Research Vignette: Simplicity, Optimality and Efficiency Tradeoffs in Economics

by Constantinos Daskalakis

In the beauty pageant of algorithms, contestants are judged based on their efficiency in the use of computational resources. Equally important is the quality of the solutions they compute, so algorithm design is typically thought to target an optimality-versus-efficiency tradeoff.

A feature of algorithms that is hard to quantify and trade against efficiency and optimality is their simplicity. Yet this feature is quite important in several practical situations. A domain where simplicity crops up as an important consideration, as we will explain next, is mechanism design. This subfield of economics and game theory takes an engineering approach to the design of incentive structures, called mechanisms. The goal of a mechanism is to motivate a group of strategic agents — who are interested in optimizing their own utilities — to choose actions that ultimately effectuate the designer’s objectives. For example, when designing a road network, a planner may want to minimize traffic by providing the right incentives (via tolls, traffic lights, and other traffic regulations) to individual drivers (who only care to minimize their own driving times) to choose routes that benefit the overall network traffic. The practical applications of mechanism design are abundant, ranging from economics to politics, and include the design of markets, auctions and online advertising platforms, the design of voting rules, the design of road and information networks, etc.

Everything else being equal, it is always desirable to adopt simple mechanisms. Complex mechanisms may lead to frustration for participants, who, unable to understand how to optimize their actions, may behave erratically and unpredictably, ultimately destroying system performance. At the same time, mechanism design often depends on data from market analysis, the process of scouting information about the strategic preferences of the agents that will participate in the designed mechanism. For instance, when designing auctions, it is useful to have information about how much the bidders value the items being sold. Whatever information we have, we would not want to run the risk of overfitting it, so opting for simple mechanisms is often a good way to achieve good generalization.

Ultimately, mechanism design targets a simplicity, optimality and computational efficiency tradeoff. The first workshop of the Simons Institute’s Fall 2015 program on Economics and Computation was devoted to the study of this tradeoff in the context of mechanism design, and beyond. There was a series of interesting talks, by computer scientists and economists, capturing different angles on the tradeoff. In the course of the semester, several groups of program participants worked in this space, including Cai, Devanur and Weinberg, who provided a principled approach based on duality theory for designing simple multi-item auctions that extract a good fraction of the optimal revenue [1]. This is an important task, as revenue-optimal mechanisms are known to be extremely complex [2,3].

In this Research Vignette, I outline one of my own projects in this space, which I believe represents an exciting direction for future investigation. It is joint work with Vasilis Syrgkanis, who was a speaker in the aforementioned workshop, and it pertains to the well-studied problem of combinatorial auctions. This problem involves a seller with m indivisible items, which she wishes to sell to a group of n buyers. Every buyer i is characterized by a valuation function v_i mapping each bundle S of items to the buyer’s value v_i(S) for this bundle. The seller’s goal is to find a partition S_1, …, S_n of the items together with prices p_1, …, p_n so as to maximize the total welfare resulting from allocating bundle S_i to each buyer i and charging him p_i. This allocation would induce total buyer utility Σ_i (v_i(S_i) − p_i) and seller revenue Σ_i p_i, so the total welfare would simply be Σ_i v_i(S_i).
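Written out with the notation above (a standard accounting identity rather than anything specific to this project), the prices cancel between buyer utility and seller revenue:

\textrm{welfare} \;=\; \sum_{i=1}^{n} \big( v_i(S_i) - p_i \big) \;+\; \sum_{i=1}^{n} p_i \;=\; \sum_{i=1}^{n} v_i(S_i).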

CONTINUE READING

Research Vignette: Strengthening SETH Lower Bounds

by Amir Abboud

A central goal of complexity theory is to understand and prove lower bounds for the time complexity of fundamental problems. One of the most important computational problems is Edit Distance, the problem of computing the minimum number of edit operations (insertions, deletions, and substitutions) required to transform one sequence into another. A classical dynamic programming algorithm that is taught in basic algorithms courses solves Edit Distance on sequences of length n in O(n^2) time. Generalizations of Edit Distance are of fundamental importance in computational biology and genomics, where n is typically a few billion, and this quadratic runtime is prohibitive. A faster, e.g. linear-time, algorithm would have far-reaching consequences: the paper introducing BLAST, a heuristic algorithm for a generalization of Edit Distance, which runs in linear time but is not guaranteed to return an optimal solution, has been cited more than fifty thousand times. Despite decades of attempts, no upper bound below the O(n^2 / log^2 n) bound of Masek and Paterson is known for Edit Distance.
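For reference, here is the classical quadratic-time dynamic program mentioned above, as a minimal Python sketch (no attempt is made at the log-shaving optimizations of Masek and Paterson):

def edit_distance(a, b):
    # dp[i][j] = edit distance between the prefixes a[:i] and b[:j].
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # delete all of a[:i]
    for j in range(m + 1):
        dp[0][j] = j          # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (a[i - 1] != b[j - 1])          # substitute (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # delete, insert
    return dp[n][m]

print(edit_distance("kitten", "sitting"))  # 3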

This is a situation where lower bounds are highly desirable, but unfortunately, the current state of the art in complexity is far from providing a lower bound that is close to quadratic for any natural problem in NP, let alone Edit Distance. Therefore, researchers have turned their attention to conditional lower bounds, and a recent celebrated result by Backurs and Indyk shows that Edit Distance cannot be solved in truly subquadratic O(n^{2−ε}) time, for some ε > 0, unless the Strong Exponential Time Hypothesis (SETH) is false (this is among the few STOC papers that made it to the Boston Globe news website). SETH is a pessimistic version of P ≠ NP, which essentially states that there is no ε > 0 such that SAT on CNF formulas on n variables and O(n) clauses can be solved in O((2−ε)^n) time.

Other interesting recent results show that under SETH, the current algorithms for many central problems in diverse areas of computer science are optimal, up to n^{o(1)} factors. These areas include pattern matching, graph algorithms, computational geometry, data mining, and economics, and the list is growing by the day. These conditional lower bounds are part of a more general line of work in which one bases the hardness of important problems in P on well-known conjectures about the exact complexity of other famous problems. Other conjectures concern the exact complexity of 3-SUM (given n integers, do three of them sum to zero?) and All-Pairs Shortest Paths (compute all pairwise distances in a weighted graph). This line of work was the focus of the third workshop in the Fall 2015 Simons Institute Program on Fine-Grained Complexity and Algorithm Design, entitled Computational Complexity of Low-Polynomial Time Problems. It seems that we are rapidly approaching an exciting state of affairs in which a tight lower bound for many problems of practical significance can be derived from a small set of such conjectures.

One of the major goals of this research is to strengthen the foundations on which these results are based. We do not have strong evidence supporting these conjectures, and perhaps the main reason to believe them is the fact that no one knows how to refute them. For example, SETH was introduced by Impagliazzo and Paturi as a plausible explanation for the lack of (2−ε)^n algorithms for CNF-SAT, despite its being one of the most extensively studied problems in computer science. Can we find more evidence for these conjectures? Can we show surprising consequences of refuting them? Alternatively, can we replace them with more believable ones?

In the rest of this Vignette I will discuss a recent paper with Thomas Dueholm Hansen, Virginia Vassilevska Williams, and Ryan Williams, in which we show that many of the SETH lower bounds can be based on an assumption that is much more believable than SETH.

CONTINUE READING

Research Vignette: Mathematical Software Obfuscation

By Amit Sahai

From revolutionaries and spies to ordinary citizens living under repressive regimes, for centuries people have had the need to hide operational secrets. By an operational secret, we mean a secret that must be kept safe, where the secret is also used to perform ordinary tasks. For example, someone may want to keep the amount of cash she has on-hand secret, but she also needs to refer to this information when making everyday purchases. Even secrets like the location of hidden refugees under one’s protection could be an operational secret, since the holder of the secret would want to plan her ordinary movements to avoid the secret location that she is protecting. Through many clever means, people throughout history managed to protect such secrets. Essential to such protection was the refuge of the human mind, the ultimate sanctuary where secrets can be pondered and processed without fear of loss.

But what if this most basic assumption proves false? What if an adversary can read someone’s mind while she is thinking about the secret she needs to protect? Indeed, it is hard to imagine how someone can protect secrets when the inner dialogue of her mind can betray her. Fortunately, this scenario remains science fiction when applied to humanity. However, if we replace humans by computer programs, this situation is all too common. Computer programs, with inbuilt databases of sensitive information, are routinely captured by adversarial entities. These adversaries can reverse-engineer these programs and see every detail of how the programs “think” about their secrets as they perform their calculations. Furthermore, adversaries can even modify the programs to alter their behavior, in order to coax secrets out of the original computer code. Because computer programs with inbuilt operational secrets are so useful, cryptographic researchers have pondered this challenge as far back as the classic work of Diffie and Hellman in 1976. For decades, however, our ability to suitably “obfuscate” programs in order to defend against these kinds of attacks was based on unsatisfying approaches that failed to provide security against even moderately skilled adversaries.

This changed in 2013, due to the work of Garg, Gentry, Halevi, Raykova, Sahai and Waters, which gave the first sound mathematical approach to this problem. This work enabling a mathematical approach to securing software has been hailed as “a watershed moment for cryptography” by the Simons Foundation’s Quanta magazine, and its ramifications were a major theme of the Simons Institute program on Cryptography in Summer 2015. To illustrate why this recent advance in obfuscation has caused such a stir in the cryptographic community, let us consider an analogy with the ancient problem of sending encrypted messages.

CONTINUE READING

Research Vignette: Hard Problems All The Way Up

By Benjamin Rossman and Li-Yang Tan

Computational complexity theory is the branch of computer science that studies the nature and limits of efficient computation; an overarching goal in this area is the classification of computational problems according to their inherent difficulty. One of the most intensively studied such classifications is provided by the Polynomial Hierarchy, which was introduced by Albert Meyer and Larry Stockmeyer in 1972. This hierarchy classifies problems according to a natural notion of logical complexity, and is defined with an infinite number of levels: problems at the zeroth level are the “easiest”, and for every integer k, problems at the (k+1)-st level have logical complexity “one notch higher” than those at level k. In this vignette we describe some recent work done at the Simons Institute, in collaboration with Rocco Servedio of Columbia University, which sheds new light on the structure of this hierarchy.

The polynomial hierarchy begins at the zeroth level with the class P of problems solvable in polynomial time — the easiest problems. In theoretical computer science P captures an elegant and robust notion of computational efficiency: polynomial-time algorithms are considered efficient, and problems in P are considered tractable. Examples of fundamental problems in P include Linear Programming, Matching, and Primality. At the first level above P in the hierarchy is NP, the class of problems solvable in nondeterministic polynomial time, meaning that a “yes” answer can be verified efficiently. Well-known examples of problems in NP include the Boolean Satisfiability and Traveling Salesman problems, among countless others that arise in surprisingly diverse contexts.

Clearly P ⊆ NP, since the answer to a computational problem can be verified efficiently if it can be determined efficiently. But is the converse true? No efficient algorithms are known for the hardest problems in NP, the so-called NP-complete problems: while a “yes” answer to these problems can be verified in polynomial time, to date our best algorithms for solving them — determining whether the answer is “yes” or “no” — all run in exponential time. Indeed, the famous P ≠ NP conjecture asserts that there do not exist efficient algorithms for solving NP-complete problems: any algorithm for these problems must run in super-polynomial time. In other words, the P ≠ NP conjecture asserts that the first two levels of the Polynomial Hierarchy are distinct.

To infinity and beyond

The Polynomial Hierarchy extends beyond P and NP to include an infinite number of classes of increasing logical complexity: just as NP generalizes P, the second level of the hierarchy generalizes NP, the third generalizes the second, and so on ad infinitum. To understand this hierarchy it is convenient to think of problems in NP as asking questions with a single existential quantifier. For example, the Boolean Satisfiability problem asks if there exists a satisfying assignment to a Boolean formula, and the Traveling Salesman problem asks if there exists a short tour visiting every city exactly once. Problems at the k-th level of the hierarchy allow for questions with not just one existential quantifier, but k alternating quantifiers:

Does there exist X, such that for all Y, there exists Z such that…?
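In symbols, this corresponds to the standard characterization of the k-th existential level of the hierarchy, usually written Σ_k^p (stated here for concreteness): a language L lies in Σ_k^p exactly when there is a polynomial-time checkable predicate R, with all witnesses y_i of length polynomial in |x|, such that

x \in L \iff \exists y_1 \, \forall y_2 \, \exists y_3 \cdots Q_k y_k \; R(x, y_1, \ldots, y_k),

where the k quantifiers alternate and the last quantifier Q_k is existential or universal according to the parity of k.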

CONTINUE READING

Reed, Muller, and Costa: Together at the Simons Institute

By Henry Pfister, Yury Polyanskiy, Rüdiger Urbanke and Yihong Wu

1 Information Theory

The publication of Shannon’s 1948 paper [Sha48] is widely recognized as the inception of information theory as a field of study. A key result in Shannon’s paper is that messages can be transmitted reliably over a noisy channel if and only if a sufficient amount of redundancy is added to correct errors. In particular, the error probability after decoding can be made to vanish as the message length increases if the rate R (i.e., the number of information bits per channel use) is less than the Shannon capacity C of the channel. Shannon’s proof relies on a randomly chosen codebook and does not lend itself to the practical construction of good codes. Thus, the 1948 paper also marks the beginning of the search for deterministic code constructions that achieve the Shannon capacity.

First, we consider a simple model of point-to-point communication. The binary erasure channel is a noisy channel with input alphabet {0, 1} and output alphabet {0, 1, ?}. During each time step, the transmitter sends an arbitrary element of the input alphabet, X, and the receiver observes a random channel output, Y, whose distribution is given by

\Prob (Y=y|X=x) = \begin{cases} 1-\epsilon & \textrm{if }x=y \\ \epsilon & \textrm{if }y=\;? \end{cases}

The Shannon capacity of the channel is C = 1 − ϵ bits per channel use.
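As a quick sanity check on this model, here is a small Python sketch that simulates the binary erasure channel for an arbitrary erasure probability; on average a fraction ε of the transmitted bits are replaced by ‘?’, matching the intuition behind the capacity formula C = 1 − ε.

import random

random.seed(0)

def erasure_channel(bits, eps):
    # Each input bit is independently erased (replaced by '?') with probability eps.
    return ['?' if random.random() < eps else b for b in bits]

eps = 0.3                                  # arbitrary erasure probability for the demo
sent = [random.randint(0, 1) for _ in range(100000)]
received = erasure_channel(sent, eps)
print("erased fraction:", received.count('?') / len(received))  # close to eps
print("capacity C = 1 - eps =", 1 - eps)                        # bits per channel use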

In wireless networks, the point-to-point model of communication does not capture the true complexity of the problem. One may also consider how much redundancy is required for pairs of transmitters and receivers to communicate reliably when their transmissions interfere with one another. The mathematical model of this problem is known as the interference channel [Car75]. In this case, the challenge is to determine the minimal amount of redundancy that each transmitter must add to their message. Characterizing the capacity region of the interference channel, namely, the set of rate pairs that allow reliable communication, is among the most significant open problems in information theory.

In the two sections below, we will describe recent advances in the understanding of these two problems.

CONTINUE READING

Cynthia Dwork on Algorithmic Fairness

Today the New York Times has an interview with Cynthia Dwork, who is participating in the ongoing Cryptography program.

Cynthia is well known, among other things, for her work on non-malleability (about which she spoke in the historical papers seminar series) and for her work on differential privacy.

The interview is about concerns about fairness, and you can read about the concept of fairness through awareness here.

Summer School on Theoretical Neuroscience

Every summer, Berkeley’s Redwood Center for Theoretical Neuroscience organizes a ten-day summer course on Neuroscience, bringing to Berkeley several dozen young researchers (graduate students and postdocs) from all walks of science with a serious interest in learning about Neuroscience, and especially about techniques for mining and modeling neuroscience data. This year, the Simons Institute and the Mathematical Sciences Research Institute were, for the first time, co-organizers of this course, in order to attract more computer scientists and mathematicians to this important field. The program included one day of lectures by Vitaly Feldman and myself on the theory of computation, learning theory, and the computational models proposed by Valiant and, more recently, by Vempala and myself.

The Computing Community Consortium, which has recently been very active in promoting the emerging research interface between Computation and Brain Science (see cra.org/ccc/events/brain-workshop), generously agreed to fund this CS component of the summer course.

We advertised the course to CS departments, and from a field of about a dozen applications we selected four CS graduate students, who ended up attending the course: Chihua Ma (UI Chicago), Yunjie Liu (UC Davis and Lawrence Berkeley Labs), Yu Liu (UC Davis), and Antonio Moretti (Columbia). Below are their contributions about highlights of the summer course.

Continue reading

Don Knuth on SAT

Next semester, the Simons Institute will host a program on Fine-Grained Complexity and Algorithm Design, one of whose core topics is the exact complexity of satisfiability problems.

Coincidentally, Don Knuth has just posted online a draft of Section 7.2.2.2 of The Art of Computer Programming, which will be part of Volume 4B.
Chapter 7 deals with combinatorial searching; Section 7.2, titled “Generating all possibilities,” is about enumeration and exhaustive search; Subsection 7.2.2 is about basic backtracking algorithms; and Sub-subsection 7.2.2.2 is about SAT.

Even people who are familiar with Don’s famously comprehensive approach to exposition might be surprised to see that this sub-subsection runs to 317 pages, 106 of which are devoted to solutions of exercises. Unfortunately, there was no space to provide the solution to Exercise 516.