Thursday, May 17, 2007

Lies, damn lies, and a Dutch trial

In CS70, the Berkeley freshman/sophomore class on discrete mathematics and probability for computer scientists, we conclude the section on probability with a class on how to lie with statistics. The idea is not to teach the students how to lie, but rather how not to be lied to. The lecture focuses on the correlation versus causation fallacy and on Simpson's paradox.

My favorite way of explaining the correlation versus causation fallacy is to note that there is a high correlation between being sick and having visited a health care professional in the recent past. Hence we should prevent people from seeing doctors in order to make people healthier. Some HMOs in the US are already following this approach.

Today, a post in a New York Times science blog tells the story of a gross misuse of statistics in a Dutch trial that has now become a high-profile case. In the Dutch case two other, and common, fallacies have come up. One is, roughly speaking, neglecting to take a union bound. This is the fallacy of saying 'I just saw the license plate California 3TDA614, what are the chances of that!' The other is the computation of probabilities by making unwarranted independence assumptions.

Feynman has written eloquently about both, but I don't have the references at hand. In particular, when he wrote on his Space Shuttle investigation committee work, he remarked that official documents had given exceedingly low probabilities of a major accident (of the order of one millionth per flight or less), even though past events have shown this probability to be more of the order of 1%. The low number was obtained by summing the probabilities of various scenarios, and the probability of each scenario was obtained by multiplying estimates for the probabilities that the various things that had to go wrong for that scenario to occur would indeed go wrong.

Christos Papadimitriou has the most delightful story on this fallacy. He mentioned in a lecture the Faloutsos-Faloutsos-Faloutsos paper on power law distributions in the Internet graph. One student remarked, wow, what are the chances of all the authors of a paper being called Faloutsos!

Tuesday, May 15, 2007

Proving unsatisfiability of random kSAT

In the previous random kSAT post we saw that for every $k$ there is a constant $c_k$ such that

  1. A random kSAT formula with $n$ variables and $m$ clauses is conjectured to be almost surely satisfiable when $m/n < c_k - \epsilon$ and almost surely unsatisfiable when $m/n > c_k + \epsilon$;
  2. There is an algorithm that is conjectured to find satisfying assignments with high probability when given a random kSAT formula with $n$ variables and fewer than $(c_k - \epsilon) n$ clauses.

So, conjecturally, the probability of satisfiability of a random kSAT formula has a sudden jump at a certain threshold value of the ratio of clauses to variables, and in the regime where the formula is likely to be satisfiable, the kSAT problem is easy-on-average.

What about the regime where the formula is likely to be unsatisfiable? Is the problem still easy on average? And what would that exactly mean? The natural question about average-case complexity is: is there an efficient algorithm that, in the unsatisfiable regime, finds with high probability a certifiably correct answer? In other words, is there an algorithm that efficiently delivers a proof of unsatisfiability given a random formula with $m$ clauses and $n$ variables, $m> (c_k + \epsilon) n$?

Some non-trivial algorithms, that I am going to describe shortly, find such unsatisfiability proofs but only in regimes of fairly high density. It is also known that certain broad classes of algorithms fail for all constant densities. It is plausible that finding unsatisfiability proofs for random kSAT formulas with any constant density is an intractable problem. If so, its intractability has a number of interesting consequences, as shown by Feige.

A first observation is that if we have an unsatisfiable 2SAT formula then we can easily prove its unsatisfiability, and so we may try to come with some kind of reduction from 3SAT to 2SAT. In general, this is of course hopeless. But consider a random 3SAT formula $\phi$ with $n$ variables and $10 n^2$ clauses. Now, set $x_1 \leftarrow 0$ in $\phi$, and consider the resulting formula $\phi'$. The variable $x_1$ occurred in about $30 n$ clauses, positively in about $15 n$ of them (which have now become 2SAT clauses in $\phi'$) and negatively in about $15 n$ clauses, that have now disappeared in $\phi'$. Let's look at the 2SAT clauses of $\phi'$: there are about $15 n$ such clauses, they are random, so they are extremely likely to be unsatisfiable, and, if so, we can easily prove that they are. If the 2SAT subset of $\phi'$ is unsatisfiable, then so is $\phi'$, and so we have a proof of unsatisfiability for $\phi'$.

Now set $x_1 \leftarrow 1$ in $\phi$, thus constructing a new formula $\phi''$. As before, the 2SAT part of $\phi''$ is likely to be unsatisfiable, and, if so, its unsatisfiability is easily provable in polynomial time.

Overall, we have that we can prove that $\phi$ is unsatisfiable when setting $x_1 \leftarrow 0$, and also unsatisfiable when setting $x_1\leftarrow 1$, and so $\phi$ is unsatisfiable.

This works when $m$ is about $n^2$ for 3SAT, and when $m$ is about $n^{k-1}$ for kSAT. By fixing $O(\log n)$ at a time it is possible to shave another polylog factor. These idea is due to Beame, Karp, Pitassi, and Saks.

A limitation of this approach is that it produces polynomial-size resolution proofs of unsatisfiability and, in fact tree-like resolution proofs. It is known that polynomial-size resolution proofs do not exist for random 3SAT formulas with fewer than $n^{1.5-\epsilon}$ clauses, and tree-like resolution proofs do not exist even when the number of clauses is just less than $n^{2-\epsilon}$. This is a limitation that afflicts all backtracking algorithms, and so all approaches of the form ``let's fix some variables, then apply the 2SAT algorithm.'' So something really different is needed to make further progress.

Besides the 2SAT algorithm, what other algorithms do we have to prove that no solution exists for a given problem? There are algorithms for linear and semidefinite programming, and there is Gaussian elimination. We'll see how they can be applied to random kSAT in the next theory post.