Wednesday, November 28, 2007

Why Mathematics?

Terry Tao points to a beautiful article written by Michael Harris for the Princeton Companion to Mathematics, titled Why Mathematics, You Might Ask.

The titular question is the point of departure for a fascinating discussion on the foundations of mathematics, on the philosophy of mathematics, on post-modernism, on the "anthropology" approach to social science studies of mathematics, and on what mathematicians think they are doing, and why.

In general, I find articles on philosophical issues in mathematics to be more readable and enlightening when written by mathematicians. Perhaps it's just that they lack the sophistication of the working philosopher, a sophistication which I mistake for unreadability. But I also find that mathematicians tend to bring up issues that matter more to me.

For example, the metaphysical discussions on the "reality" of mathematical objects and the "truth" of theorems are all well and good, but the really interesting questions seem to be different ones.

The formalist view of mathematics, for example, according to which mathematics is the derivation of theorems from axioms via formal proofs, or as Hilbert apparently put it, "a game played according to certain simple rules with meaningless marks on paper," does not begin to capture what mathematics is, just as "writing one sentence after another" does not capture what poetry is. (The analogy is due to Gian-Carlo Rota.) Indeed, one of the main fallacies that follow from taking the extreme formalist position as anything more than a self-deprecating joke is to consider mathematical work as tautological, that is, to see a mathematical theorem as implicit in the axioms and hence its proof as not a discovery. (Some of the comments in this thread relate to this point.) Besides, the view does not account for the difference between "recreational" mathematics and "real" mathematics, a difference that I don't find easy to explain in a few words, probably because I don't have a coherent view of what mathematics really is.

It's not quite related, but I am reminded of a conversation I had a long time ago with Professor X about faculty candidate Y.

[Not an actual transcript, but close enough]

X: so what do you think of theory candidate Y?
Me: he is not a theory candidate.
X: but his results have no conceivable application.
Me: there is more to doing theory than proving useless theorems.
X: that's interesting! Tell me more


I enjoyed Harris's suggestion that "ideas" are the basic units of mathematical work, and his semi-serious discussion of whether ideas "exist" and of their importance.

There are indeed a number of philosophical questions about mathematics that I think are extremely interesting and do not seem to figure prominently in the social studies of mathematics.

For example, and totally randomly:


  1. When are two proofs essentially the same, and when are they genuinely different?
  2. What makes a problem interesting? What is the role of connections in this determination?
  3. What makes a theorem deep?
  4. What does it mean when mathematicians say that a certain proof explains something, or when they say that it does not?

Tuesday, November 27, 2007

Terminology

Different communities have different traditions for terminology. Mathematicians appropriate common words, like ring, field, scheme, ideal,... and the modern usage of these terms bears no connection with their everyday meanings. Physicists have a lot of fun with their sparticles and their strangeness and charm and so on. Theoretical computer scientists, like the military and NASA, prefer acronyms.

We have some isolated examples of felicitous naming. Expander, for example, is great: it sounds right and it is suggestive of the technical meaning. Extractor is my favorite, combining a suggestion of the meaning with a vaguely threatening sound. I think it's too bad that seedless extractor has come to pass, because it evokes some kind of device to get grape juice. (I was on the losing side that supported deterministic extractor.)

Unfortunate namings are of course more common. Not only is the complexity class PP embarrassing to pronounce, but its name, derived from Probabilistic Polynomial time, is a poor description of it. By analogy with #P and $\oplus$P, it should be called MajP.

I heard the story of a famous (and famously argumentative) computer scientist complaining to one of the authors of the PCP theorem about the term PCP, which stands for Probabilistically Checkable Proof. "I too can define a probabilistic checker for SAT certificates," he supposedly said, "with probability half check the certificate, with probability half accept without looking at it." The point being that the terminology emphasizes a shortcoming of the construction (the probabilistic verification) instead of the revolutionary feature (the constant query complexity). Personally, I would prefer Locally Testable Proof.

Of course we will never change the name of PP or PCP, and the seedless extractors are here to stay, but there is one terminology change for which I'd like to start a campaign.

Naor and Naor constructed in 1990 a pseudorandom generator whose output is secure against linear tests. They called such a generator $\epsilon$-biased if the distinguishing probability of every linear test is at most $\epsilon$. Such generators have proved to be extremely useful in a variety of applications, most recently in the Bogdanov-Viola construction of pseudorandom generators against degree-2 polynomials.

Shall we start calling such generators $\epsilon$-unbiased, seeing as it is the near absence of bias, rather than its presence, that is their defining feature?

(I know the reason for the Naor-Naor terminology: zero-bias generator makes perfect sense, while zero-unbiased makes no sense. But how about the fact that it is technically correct to say that the uniform distribution is $\frac {1}{10}$-biased?)
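As an aside for readers who have not seen the definition unpacked, here is a minimal sketch in Python of the quantity being discussed (the function names are mine, and I am using the standard convention that the bias of a distribution against a linear test $a$ is $|E[(-1)^{\langle a,x \rangle}]|$, which matches the distinguishing probability up to a factor of 2):

from itertools import product

def bias(support, a):
    # Bias of the uniform distribution over `support` against the linear test a.
    signs = [(-1) ** (sum(ai * xi for ai, xi in zip(a, x)) % 2) for x in support]
    return abs(sum(signs)) / len(signs)

def max_bias(support, n):
    # Maximum bias over all nonzero linear tests (brute force, so small n only).
    return max(bias(support, a) for a in product([0, 1], repeat=n) if any(a))

print(max_bias(list(product([0, 1], repeat=3)), 3))  # uniform on {0,1}^3: bias 0.0
print(max_bias([(0, 0, 0), (1, 1, 1)], 3))           # a two-point support: bias 1.0

A generator is then $\epsilon$-biased when the maximum bias of its output distribution is at most $\epsilon$.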

[Update: earlier posts on the same topic here and here]

Monday, November 12, 2007

Impagliazzo Hard-Core Sets via "Finitary Ergodic-Theory"

In the Impagliazzo hard-core set theorem we are given a function $g:\{ 0, 1 \}^n \rightarrow \{ 0,1\}$ such that every algorithm in a certain class makes errors at least a $\delta$ fraction of the time when given a random input. We think of $\delta$ as small, and so of $g$ as exhibiting a weak form of average-case complexity. We want to find a large set $H\subseteq \{ 0,1 \}^n$ such that $g$ is average-case hard in a stronger sense when restricted to $H$. This stronger form of average-case complexity will be that no efficient algorithm can make noticeably fewer errors while computing $g$ on $H$ than a trivial algorithm that always outputs the same value regardless of the input. The formal statement of what we are trying to do (see also the discussion in this previous post) is:


Impagliazzo Hard-Core Set Theorem, "Constructive Version"
Let $g:\{0,1\}^n \rightarrow \{0,1\}$ be a boolean function, $s$ be a size parameter, $\epsilon,\delta>0$ be given. Then there is a size parameter $s' = poly(1/\epsilon,1/\delta) \cdot s + exp(poly(1/\epsilon,1/\delta))$ such that the following happens.

Suppose that for every function $f:\{0,1\}^n \rightarrow \{0,1\}$ computable by a circuit of size $s'$ we have

$Pr_{x \in \{0,1\}^n} [ f(x) = g(x) ] \leq 1-\delta$

Then there is a set $H$ such that: (i) $H$ is recognizable by circuits of size $\leq s'$; (ii) $|H| \geq \delta 2^n$, and in fact the number of $x$ in $H$ such that $g(x)=0$ is at least $\frac 12 \delta 2^n$, and so is the number of $x$ in $H$ such that $g(x)=1$; and (iii) for every $f$ computable by a circuit of size $\leq s$,

$Pr_{x\in H} [ g(x) = f(x) ] \leq max \{ Pr_{x\in H}[ g(x) = 0] , Pr_{x\in H} [g(x)=1] \} + \epsilon$



Our approach will be to look for a "regular partition" of $\{0,1\}^n$. We shall construct a partition $P= (B_1,\ldots,B_m)$ of $\{0,1\}^n$ such that: (i) given $x$, we can efficiently compute the block $B_i$ that $x$ belongs to; (ii) the number $m$ of blocks does not depend on $n$; (iii) $g$ restricted to most blocks $B_i$ behaves like a random function of the same density. (By "density" of a function we mean the fraction of inputs on which the function evaluates to one.)

In particular, we will use the following form of (iii): for almost all the blocks $B_i$, no algorithm has advantage more than $\epsilon$ over a constant predictor in computing $g$ in $B_i$.

Let $M_0$ be the union of all majority-0 blocks (that is, of blocks $B_i$ such that $g$ takes the value 0 on a majority of elements of $B_i$) and let $M_1$ be the union of all majority-1 blocks.

I want to claim that no algorithm can do noticeably better on $M_0$ than the constant algorithm that always outputs 0. Indeed, we know that within (almost) all of the blocks that compose $M_0$ no algorithm can do noticeably better than the always-0 algorithm, so the same must be true, for an even stronger reason, for the union. The same holds for $M_1$, with respect to the constant algorithm that always outputs 1. Also, if the partition is efficiently computable, then (in a non-uniform setting) $M_0$ and $M_1$ are efficiently recognizable. It remains to argue that either $M_0$ or $M_1$ is large and not completely unbalanced.

Recalling that we are in a non-uniform setting (where by "algorithms" we mean "circuits") and that the partition is efficiently computable, the following is a well defined efficient algorithm for attempting to compute $g$:


Algorithm. Local Majority
On input $x$:
determine the block $B_i$ that $x$ belongs to;
output $1$ if $Pr_{z\in B_i} [g(z)=1] \geq \frac 12$;
otherwise output $0$


(The majority values of $g$ in the various blocks are just a set of $m$ bits that can be hard-wired into the circuit.)
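To fix ideas, here is a toy rendering of this algorithm in Python (not part of the proof, and the names are mine: in this sketch the partition is given as an explicit list of blocks, and the precomputed majority values play the role of the $m$ hard-wired bits):

def local_majority_predictor(blocks, g):
    # Precompute, for each block, the majority value of g on that block,
    # and record which block each input belongs to.
    majority, block_of = {}, {}
    for i, B in enumerate(blocks):
        ones = sum(g(x) for x in B)
        majority[i] = 1 if 2 * ones >= len(B) else 0
        for x in B:
            block_of[x] = i
    # The predictor outputs the majority value of g in the block of x.
    return lambda x: majority[block_of[x]]

# Example: g is XOR on {0,1}^2, and the partition splits on the first bit.
g = lambda x: x[0] ^ x[1]
blocks = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
predict = local_majority_predictor(blocks, g)
mistakes = [x for B in blocks for x in B if predict(x) != g(x)]  # the "minority inputs"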

We assumed that every efficient algorithm must make at least a $\delta$ fraction of errors. The set of $\geq \delta 2^n$ inputs where the Local Majority algorithm makes mistakes is the union, over all blocks $B_i$, of the "minority inputs" of the block $B_i$. (If $b$ is the majority value of $g$ in a block $B$, then the "minority inputs" of $B$ are the set of inputs $x$ such that $g(x) = 1-b$.)

Let $E_0$ be the set of minority inputs (those where our algorithm makes a mistake) in $M_0$ and $E_1$ be the set of minority inputs in $M_1$. Then at least one of $E_0$ and $E_1$ must have size at least $\frac {\delta}{2} 2^n$, because the size of their union is at least $\delta 2^n$. If $E_b$ has size at least $\frac {\delta}{2} 2^n$, then $M_b$ has all the properties of the set $H$ we are looking for.

It remains to construct the partition. We describe an iterative process to construct it. We begin with the trivial partition $P = (B_1)$ where $B_1 = \{ 0,1\}^n$. At a generic step of the construction, we have a partition $P = (B_1,\ldots,B_m)$, and we consider $M_0, M_1,E_0,E_1$ as above. Let $b$ be such that $|E_b| \geq \frac 12 \delta 2^n$. If there is no algorithm that has noticeable advantage in computing $g$ over $M_b$, we are done. Otherwise, if there is such an algorithm $f$, we refine the partition by splitting each block according to the values that $f$ takes on the elements of the block.

After $k$ steps of this process, the partition has the following form: there are $k$ functions $f_1,\ldots,f_k$ and each of the (at most) $2^k$ blocks of the partition corresponds to a bit string $b_1,\ldots,b_k$ and it contains all inputs $x$ such that $f_1(x)=b_1,\ldots,f_k(x)=b_k$. In particular, the partition is efficiently computable.

We need to argue that this process terminates with $k=poly(1/\epsilon,1/\delta)$. To this end, we define a potential function that measures the "imbalance" of $g$ inside the blocks of the partition:

$\Psi(B_1,\ldots,B_m) := \sum_{i=1}^m \frac {|B_i|}{2^n} \left( Pr_{x\in B_i} [g(x) = 1] \right)^2 $

and we can show that this potential function increases by at least $poly(\epsilon,\delta)$ at each step of the iteration. Since the potential function can be at most 1, the bound on the number of iterations follows.
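To make the iteration concrete, here is a hedged toy sketch of the whole construction in Python; this is my own rendering, not the actual proof: "all circuits of size $s$" is replaced by a fixed finite family of candidate distinguishers, $b$ is simply taken to be the side with more errors, and everything is computed by brute force over $\{0,1\}^n$.

from itertools import product

def blocks_from(fs, n):
    # Blocks of {0,1}^n indexed by the bit string (f_1(x), ..., f_k(x)).
    blocks = {}
    for x in product([0, 1], repeat=n):
        blocks.setdefault(tuple(f(x) for f in fs), []).append(x)
    return list(blocks.values())

def density(g, S):
    return sum(g(x) for x in S) / len(S)

def potential(blocks, g, N):
    # Psi(B_1,...,B_m) = sum_i (|B_i|/2^n) * (Pr_{x in B_i}[g(x)=1])^2
    return sum(len(B) / N * density(g, B) ** 2 for B in blocks)

def advantage(f, g, S):
    # Advantage of f over the best constant predictor on the set S.
    p = density(g, S)
    return sum(f(x) == g(x) for x in S) / len(S) - max(p, 1 - p)

def iterative_partition(g, candidates, n, eps):
    fs = []
    while True:
        blocks = blocks_from(fs, n)
        print("k =", len(fs), "Psi =", potential(blocks, g, 2 ** n))  # Psi grows with each refinement
        # M_b is the union of the majority-b blocks, E_b its minority ("error") inputs.
        M = {0: [], 1: []}
        for B in blocks:
            M[1 if density(g, B) >= 0.5 else 0].extend(B)
        E = {b: [x for x in M[b] if g(x) != b] for b in (0, 1)}
        b = 0 if len(E[0]) >= len(E[1]) else 1
        # Counterexample step: look for a distinguisher with advantage > eps on M_b.
        f = next((f for f in candidates if M[b] and advantage(f, g, M[b]) > eps), None)
        if f is None:
            return fs, M[b]            # M_b plays the role of the hard-core-style set H
        fs.append(f)                   # refinement: split every block along f

Here the finite family of candidates stands in for "all circuits of size $s$," which is of course the step that cannot be carried out by brute force in the real proof.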

A reader familiar with the proof of the Szemeredi Regularity Lemma will recognize the main ideas of iterative partitioning, of using a "counterexample" to the regularity property required of the final partition to do a refinement step, and of using a potential function argument to bound the number of refinement steps.

In which way can we see them as "finitary ergodic theoretic" techniques? As somebody who does not know anything about ergodic theory, I may not be in an ideal position to answer this question. But this kind of difficulty has not stopped me before, so I may attempt to answer this question in a future post.

Friday, November 09, 2007

The December Issue of the Notices of the AMS

The December issue of the Notices of the AMS is now available online, and it includes letters written by Oded Goldreich, Boaz Barak, Jonathan Katz, and Hugo Krawczyk in response to Neal Koblitz's article which appeared in the September issue.

Despite this, the readers of the Notices remain the losers in this "controversy." Koblitz's petty personal attacks and straw man arguments appeared in the same space that is usually reserved, in the Notices, for expository articles and obituaries of mathematicians. It is from those pages that I learned about the Kakeya problem and about the life of Grothendieck (who, I should clarify, is not dead, except perhaps in Erdos' use of the word).

I find it strange enough that Koblitz would submit his piece to such a venue, but I find it just as mind-boggling that the editors would run it, as if they had commissioned Grothendieck's biographical article from a disgruntled ex-lover, who would then focus most of the article on fabricated claims about his personal hygiene.

I can only hope that the editors will soon run on those pages one or more expository articles on modern cryptography, not as rebuttals to Koblitz's piece (which has already been discussed more than enough), but as a service to the readers.

And while I am on the subject of Notices articles, let me move on to this article on how to write papers.

All beginning graduate students find the process of doing research mystifying, and I do remember feeling that way. (Not that such feelings have changed much in the intervening years.) One begins with a sense of hopelessness: how am I going to solve a problem that people who know much more than I do, and who are smarter than I am, have not been able to solve? Then a breakthrough comes, out of nowhere, and one wonders: how is this ever going to happen again? Finally it's time to write up the results, and mathematical proofs definitely don't write themselves, not to mention coherent and compelling introductory sections. I think it's great when more experienced scholars take time to write advice pieces that can help students navigate these difficulties. And the number of atrociously written papers in circulation suggests that such pieces are good not just for students, but for many other scholars as well.

But I find that advice on "how to publish," rather than "how to write well" (like advice on "how to get a job" rather than "how to do research"), misses the point (I am thinking of one of the few times I thought Lance Fortnow gave bad advice). For this reason, I found the first section of the Notices article jarring, and the following line (even if it was meant as a joke) made me cringe:

I have written more than 150 articles myself. (...) I have never written an article and then been unable to publish it.

I think that this calls for an Umeshism in response.

Tuesday, November 06, 2007

The Impagliazzo Hard-Core-Set Theorem

The Impagliazzo hard-core set theorem is one of the bits of magic of complexity theory. Say you have a function $g:\{ 0, 1 \}^n \rightarrow \{ 0,1\}$ such that every efficient algorithm makes errors at least $1%$ of the time when computing $g$ on a random input. (We'll think of $g$ as exhibiting a weak form of average-case complexity.) Clearly, different algorithms will fail on a different $1%$ of the inputs, and it seems that, intuitively, there should be functions for which no particular input is harder than any other. It's just that whenever you try to come up with an algorithm, some set of mistakes, dependent on the algorithmic technique, will arise.

As a good example, think of the process of generating $g$ at random, by deciding for every input $x$ to set $g(x)=1$ with probability $99%$ and $g(x)=0$ with probability $1%$. (Make the choices independently for different inputs.) With very high probability, every efficient algorithm fails with probability at least about $1%$, but, if we look at every efficiently recognizable large set $H$, we see that $g$ takes the value 1 on approximately $99%$ of the elements of $H$, and so the trivial algorithm that always outputs 1 has a pretty good success probability.

Consider, however, the set $H$ of size $\frac {2}{100} 2^n$ that you get by taking the $\approx \frac{1}{100} 2^n$ inputs $x$ such that $g(x)=0$ plus a random sample of $\frac{1}{100} 2^n$ inputs $x$ such that $g(x)=1$. Then we can see that no efficient algorithm can compute $g$ on much better than $50%$ of the inputs of $H$. This is the highest form of average-case complexity for a boolean function: on such a set $H$ no algorithm does much better in computing $g$ than an algorithm that makes a random guess.
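A quick simulation of this example (a sketch of my own, with arbitrary parameters) makes the point tangible: after sampling a biased random $g$, the set $H$ built from all the zeros of $g$ plus an equally large random sample of its ones is perfectly balanced, so on $H$ the constant predictors are right only half the time.

import random
from itertools import product

random.seed(0)
n = 16
# A biased random function: g(x) = 1 with probability 99%, independently for each x.
g = {x: 1 if random.random() < 0.99 else 0 for x in product([0, 1], repeat=n)}

zeros = [x for x in g if g[x] == 0]
ones = random.sample([x for x in g if g[x] == 1], len(zeros))
H = zeros + ones                               # |H| is roughly (2/100) * 2^n
print(len(H), sum(g[x] for x in H) / len(H))   # g is exactly balanced on H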

The Impagliazzo hard-core theorem states that it is always possible to find such a set $H$ where the average-case hardness is "concentrated." Specifically, it states that if every efficient algorithm fails to compute $g$ on a $\geq \delta$ fraction of inputs, then there is a set $H$ of size $\geq \delta 2^n$ such that every efficient algorithm fails to compute $g$ on at least a $\frac 12 - \epsilon$ fraction of the elements of $H$. This is true for every $\epsilon,\delta$, and if "efficient" is quantified as "circuits of size $s$" in the premise, then "efficient" is quantified as "circuits of size $poly(\epsilon,\delta) \cdot s$" in the conclusion.

The example of the biased random function given above implies that, if one wants to prove the theorem for arbitrary $g$, then the set $H$ cannot be efficiently computable itself. (The example does not forbid, however, that $H$ be efficiently computable given oracle access to $g$, or that a random element of $H$ be samplable given a sampler for the distribution $(x,g(x))$ for uniform $x$.)

A number of proofs of the hard core theorem are known, and connections have been found with the process of boosting in learning theory and with the construction and the decoding of certain error-correcting codes. Here is a precise statement.


Impagliazzo Hard-Core Set Theorem
Let $g:\{0,1\}^n \rightarrow \{0,1\}$ be a boolean function, $s$ be a size parameter, $\epsilon,\delta>0$ be given. Then there is a $c(\epsilon,\delta) = poly(1/\epsilon,1/\delta)$ such that the following happens.

Suppose that for every function $f:\{0,1\}^n \rightarrow \{0,1\}$ computable by a circuit of size $\leq c\cdot s$ we have

$Pr_{x \in \{0,1\}^n} [ f(x) = g(x) ] \leq 1-\delta$

Then there is a set $H$ of size $\geq \delta 2^n$ such that for every function $f$ computable by a circuit of size $\leq s$ we have

$Pr_{x\in H} [ f(x) = g(x) ] \leq \frac 12 + \epsilon$


Using the "finitary ergodic theoretic" approach of iterative partitioning, we (Omer Reingold, Madhur Tulsiani, Salil Vadhan and I) are able to prove the following variant.


Impagliazzo Hard-Core Set Theorem, "Constructive Version"
Let $g:\{0,1\}^n \rightarrow \{0,1\}$ be a boolean function, $s$ be a size parameter, $\epsilon,\delta>0$ be given. Then there is a $c(\epsilon,\delta) = exp(poly(1/\epsilon,1/\delta))$ such that the following happens.

Suppose that for every function $f:\{0,1\}^n \rightarrow \{0,1\}$ computable by a circuit of size $\leq c\cdot s$ we have

$Pr_{x \in \{0,1\}^n} [ f(x) = g(x) ] \leq 1-\delta$

Then there is a set $H$ such that: (i) $H$ is recognizable by circuits of size $\leq c\cdot s$; (ii) $|H| \geq \delta 2^n$, and in fact the number of $x$ in $H$ such that $g(x)=0$ is at least $\frac 12 \delta 2^n$, and so is the number of $x$ in $H$ such that $g(x)=1$; and (iii) for every $f$ computable by a circuit of size $\leq s$,

$Pr_{x\in H} [ g(x) = f(x) ] \leq max \{ Pr_{x\in H}[ g(x) = 0] , Pr_{x\in H} [g(x)=1] \} + \epsilon$


The difference is that $H$ is now an efficiently recognizable set (which is good), but we are not able to derive the same strong average-case complexity of $g$ in $H$ (which, as discussed at the beginning, is impossible in general). Instead of proving that a "random guess algorithm" is near-optimal on $H$, we prove that a "fixed answer algorithm" is near-optimal on $H$. That is, instead of saying that no algorithm can do better than a random guess, we say that no algorithm can do better than either always outputting 0 or always outputting 1. Note that this conclusion is meaningless if $g$ is, say, always equal to 1 on $H$, but in our construction we have that $g$ is not exceedingly biased on $H$, and if $\epsilon < \delta/2$, say, then the conclusion is quite non-trivial.

One can also find a set $H'$ with the same type of average-case complexity as in the original Impagliazzo result by putting into $H'$ a $\frac 12 \delta 2^n$ size sample of elements $x$ of $H$ such that $g(x)=0$ and an equal size sample of elements of $H$ such that $g$ equals 1. (Alternatively, put in $H'$ all the elements of $H$ on which $g$ achieves the minority value of $g$ in $H$, then add a random sample of as many elements achieving the majority value.) Then we recover the original statement except that $c(\epsilon,\delta)$ is exponential instead of polynomial.

Coming up next, the proof of the "constructive hard core set theorem" and my attempt at explaining what the techniques have to do with "finitary ergodic theory."

Saturday, November 03, 2007

Harder, Better, Faster, Stronger

An amazing video to Daft Punk's Harder, Better, Faster, Stronger



Don't be discouraged by the slow first minute; it does get better, faster, and harder.

Doing the same with a different Daft Punk song, however, can be less impressive.

Thursday, November 01, 2007

The "Complexity Theory" Proof of a Theorem of Green-Tao-Ziegler

We want to prove that a dense subset of a pseudorandom set is indistinguishable from a truly dense set.

Here is an example of what this implies: take a pseudorandom generator of output length $n$, choose in an arbitrary way a 1% fraction of the possible seeds of the generator, and run the generator on a random seed from this restricted set; then the output of the generator is indistinguishable from being a random element of a set of size $\frac 1 {100} \cdot 2^n$.

(Technically, the theorem states the existence of a distribution of min-entropy $n - \log_2 100$, but one can also get the above statement by standard "rounding" techniques.)

As a slightly more general example, if you have a generator $G$ mapping a length-$t$ seed into an output of length $n$, and $Z$ is a distribution of seeds of min-entropy at least $t-d$, then $G(Z)$ is indistinguishable from a distribution of min-entropy $n-d$. (This, however, works only if $d = O(\log n)$.)

It's time to give a formal statement. Recall that we say that a distribution $D$ is $\delta$-dense in a distribution $R$ if

$\forall x. Pr[R=x] \geq \delta \cdot Pr [D=x]$

(Of course I should say "random variable" instead of "distribution," or write things differently, but we are between friends here.)
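As a tiny sanity check of the definition, here is a sketch of mine in Python, representing distributions as dictionaries from outcomes to probabilities:

def is_dense(D, R, delta):
    # True iff D is delta-dense in R, i.e. Pr[R=x] >= delta * Pr[D=x] for all x.
    return all(R.get(x, 0.0) >= delta * p for x, p in D.items())

R = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}          # uniform on {0,1,2,3}
D = {0: 0.5, 1: 0.5}                              # uniform on {0,1}
print(is_dense(D, R, 0.5), is_dense(D, R, 0.6))   # True False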

We want to say that if $F$ is a class of tests, $R$ is pseudorandom according to a moderately larger class $F'$, and $D$ is $\delta$-dense in $R$, then there is a distribution $M$ that is indistinguishable from $D$ according to $F$ and that is $\delta$-dense in the uniform distribution.

The Green-Tao-Ziegler proof of this result becomes slightly easier in our setting of interest (where $F$ contains boolean functions) and gives the following statement:

Theorem (Green-Tao-Ziegler, Boolean Case)
Let $\Sigma$ be a finite set, $F$ be a class of functions $f:\Sigma \to \{0,1\}$, $R$ be a distribution over $\Sigma$, $D$ be a $\delta$-dense distribution in $R$, $\epsilon>0$ be given.

Suppose that for every $M$ that is $\delta$-dense in $U_\Sigma$ there is an $f\in F$ such that
$| Pr[f(D)=1] - Pr[f(M)=1] | >\epsilon$

Then there is a function $h:\Sigma \rightarrow \{0,1\}$ of the form $h(x) = g(f_1(x),\ldots,f_k(x))$ where $k = poly(1/\epsilon,1/\delta)$ and $f_i \in F$ such that
$| Pr [h(R)=1] - Pr [ h(U_\Sigma) =1] | > poly(\epsilon,\delta)$

Readers should take a moment to convince themselves that the above statement is indeed saying that if $R$ is pseudorandom then $D$ has a model $M$, by equivalently saying that if no model $M$ exists then $R$ is not pseudorandom.

The problem with the above statement is that $g$ can be arbitrary and, in particular, it can have circuit complexity exponential in $k$, and hence in $1/\epsilon$.

In our proof, instead, $g$ is a linear threshold function, realizable by an $O(k)$-size circuit. Another improvement is that $k=poly(1/\epsilon,\log 1/\delta)$.

Here is the proof by Omer Reingold, Madhur Tulsiani, Salil Vadhan, and me. Assume that $F$ is closed under complement (otherwise work with the closure of $F$); then the assumption of the theorem can be restated without absolute values:


for every $M$ that is $\delta$-dense in $U_\Sigma$ there is an $f\in F$ such that
$Pr[f(D)=1] - Pr[f(M) = 1] >\epsilon$


We begin by finding a "universal distinguisher."

Claim
There is a function $\bar f:\Sigma \rightarrow [0,1]$ which is a convex combination of functions from $F$ and such that for every $M$ that is $\delta$-dense in $U_\Sigma$,
$E[\bar f(D)] - E[\bar f(M)] >\epsilon$

This can be proved via the min-max theorem for two-player games, or, equivalently, via linear programming duality, or, as an analyst would say, via the Hahn-Banach theorem.
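To spell out how the min-max theorem enters (this is my own unpacking of the step, not text from the paper): consider the zero-sum game in which one player picks $f \in F$, the other picks an $M$ that is $\delta$-dense in $U_\Sigma$, and the payoff is $E[f(D)] - E[f(M)]$. The assumption, as restated above, says that against every $M$ the $f$-player has a response with payoff greater than $\epsilon$; since the payoff is linear in $M$ and in the mixture over $F$, the min-max theorem (glossing over whether the optima are attained) lets us exchange the quantifiers:

$\max_{\bar f \in conv(F)} \min_{M} \left( E[\bar f(D)] - E[\bar f(M)] \right) = \min_{M} \max_{f\in F} \left( E[f(D)] - E[f(M)] \right) > \epsilon$

where both minima are over all $M$ that are $\delta$-dense in $U_\Sigma$. A (nearly) maximizing mixture $\bar f$ is exactly the universal distinguisher of the Claim.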

Let now $S$ be the set of $\delta |\Sigma|$ elements of $\Sigma$ where $\bar f$ is largest. Since $U_S$ is $\delta$-dense in $U_\Sigma$, we must have
(1) $E[\bar f(D)] - E[\bar f(U_S)] >\epsilon$
which implies, since $E[X] = \int_0^1 Pr[X \geq t] dt$ for a $[0,1]$-valued random variable $X$, that there must be a threshold $t$ such that
(2) $Pr[\bar f(D)\geq t] - Pr[\bar f(U_S) \geq t] >\epsilon$
So we have found a boolean distinguisher between $D$ and $U_S$. Next,
we claim that the same distinguisher works between $R$ and $U_\Sigma$.

By the density assumption, we have
$Pr[\bar f(R)\geq t] \geq \delta \cdot Pr[\bar f(D) \geq t]$

and since $S$ contains exactly a $\delta$ fraction of $\Sigma$, and since the condition $\bar f(x) \geq t$ always fails outside of $S$ (why?), we then have
$Pr[\bar f(U_\Sigma)\geq t] = \delta \cdot Pr[\bar f(U_S) \geq t]$
and so
(3) $Pr[\bar f(R)\geq t] - Pr[\bar f(U_\Sigma) \geq t] >\delta \epsilon $


Now, it's not clear what the complexity of $\bar f$ is: it could be a convex combination involving all the functions in $F$. However, by Chernoff bounds, there must be functions $f_1,\ldots,f_k$ with $k=poly(1/\epsilon,\log 1/\delta)$ such that $\bar f(x)$ is well approximated by $\sum_i f_i(x) / k$ for all $x$ but for an exceptional set having density less than, say, $\delta\epsilon/10$, according to both $R$ and $U_\Sigma$.

Now $R$ and $U_\Sigma$ are distinguished by the predicate $\sum_{i=1}^k f_i(x) \geq tk$, which is just a linear threshold function applied to a small set of functions from $F$, as promised.
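Here is a toy sketch of this sampling step (mine, not the paper's): the convex combination is represented as a list of (weight, function) pairs, $k$ functions are drawn independently according to the weights, and the resulting threshold predicate is the distinguisher promised above.

import random

def sampled_threshold_distinguisher(weighted_fs, k, t, seed=0):
    # weighted_fs: list of (weight, f) pairs with weights summing to 1, encoding bar_f.
    # Returns the predicate x -> [ sum_i f_i(x) >= t*k ], where f_1, ..., f_k are
    # drawn independently according to the weights (the Chernoff sampling step).
    rng = random.Random(seed)
    weights, fs = zip(*weighted_fs)
    sampled = rng.choices(fs, weights=weights, k=k)
    return lambda x: sum(f(x) for f in sampled) >= t * k

For each fixed $x$, Hoeffding's inequality puts the empirical average $\sum_i f_i(x)/k$ within, say, $\epsilon/4$ of $\bar f(x)$ except with probability exponentially small in $k\epsilon^2$, and taking $k = poly(1/\epsilon, \log 1/\delta)$ makes this failure probability small enough to yield the exceptional set of density less than $\delta\epsilon/10$.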

Actually I have skipped an important step: outside of the exceptional set, $\sum_i f_i(x)/k$ is going to be close to $\bar f(x)$ but not identical, and this could lead to problems. For example, in (3) $\bar f(R)$ might typically be larger than $t$ only by a tiny amount, and $\sum_i f_i(x)/k$ might consistently underestimate $\bar f$ in $R$. If so, $Pr [ \sum_{i=1}^k f_i(R) \geq tk ]$ could be a completely different quantity from $Pr [\bar f(R)\geq t]$.

To remedy this problem, we note that, from (1), we can also derive the more "robust" distinguishing statement
(2') $Pr[\bar f(D)\geq t+\epsilon/2] - Pr[\bar f(U_S) \geq t] >\epsilon/2$
from which we get
(3') $Pr[\bar f(R)\geq t+\epsilon/2] - Pr[\bar f(U_\Sigma) \geq t] >\delta \epsilon/2 $

And now we can be confident that even replacing $\bar f$ with an approximation we still get a distinguisher.

The statement needed in number-theoretic applications is stronger in a couple of ways. One is that we would like $F$ to contain bounded functions $f:\Sigma \rightarrow [0,1]$ rather than boolean-valued functions. Looking back at our proof, this makes no difference. The other is that we would like $h(x)$ to be a function of the form $h(x) = \Pi_{i=1}^k f_i(x)$ rather than a general composition of functions $f_i$. This we can achieve by approximating a threshold function by a polynomial of degree $poly(1/\epsilon,1/\delta)$ using the Weierstrass theorem, and then choosing the most distinguishing monomial. This gives a proof of the following statement, which is equivalent to Theorem 7.1 in the Tao-Ziegler paper.


Theorem (Green-Tao-Ziegler, General Case)
Let $\Sigma$ be a finite set, $F$ be a class of functions $f:\Sigma \to [0,1]$, $R$ be a distribution over $\Sigma$, $D$ be a $\delta$-dense distribution in $R$, $\epsilon>0$ be given.

Suppose that for every $M$ that is $\delta$-dense in $U_\Sigma$ there is an $f\in F$ such that
$| E[f(D)] - E[f(M)] | >\epsilon$

Then there is a function $h:\Sigma \rightarrow [0,1]$ of the form $h(x) = \Pi_{i=1}^k f_i(x)$ where $k = poly(1/\epsilon,1/\delta)$ and $f_i \in F$ such that
$| E[h(R)] - E[h(U_\Sigma)] | > exp(-poly(1/\epsilon,1/\delta))$

In this case, we too lose an exponential factor. Our proof, however, has some interest even in the number-theoretic setting because it is somewhat simpler than and genuinely different from the original one.