We want to prove that a dense subset of a pseudorandom set is indistinguishable from a truly dense set.
Here is an example of what this implies: take a pseudorandom generator of output length $n$, choose in an arbitrary way a 1% fraction of the possible seeds of the generator, and run the generator on a random seed from this restricted set; then the output of the generator is indistinguishable from being a random element of a set of size $\frac 1 {100} \cdot 2^n$.
(Technically, the theorem states the existence of a distribution of min-entropy $n - \log_2 100$, but one can also get the above statement by standard "rounding" techniques.)
As a slightly more general example, if you have a generator $G$ mapping a length-$t$ seed into an output of length $n$, and $Z$ is a distribution of seeds of min-entropy at least $t-d$, then $G(Z)$ is indistinguishable from a distribution of min-entropy $n-d$. (This, however, works only if $d = O(\log n)$.)
It's time to give a formal statement. Recall that we say that a distribution $D$ is $\delta$-dense in a distribution $R$ if
$\forall x. Pr[R=x] \geq \delta \cdot Pr [D=x]$
(Of course I should say "random variable" instead of "distribution," or write things differently, but we are between friends here.)
We want to say that if $F$ is a class of tests, $R$ is pseudorandom according to a moderately larger class $F'$, and $D$ is $\delta$-dense in $R$, then there is a distribution $M$ that is indistinguishable from $D$ according to $F$ and that is $\delta$-dense in the uniform distribution.
The Green-Tao-Ziegler proof of this result becomes slightly easier in our setting of interest (where $F$ contains boolean functions) and gives the following statement:
Theorem (Green-Tao-Ziegler, Boolean Case)
Let $\Sigma$ be a finite set, $F$ be a class of functions $f:\Sigma \to \{0,1\}$, $R$ be a distribution over $\Sigma$, $D$ be a $\delta$-dense distribution in $R$, $\epsilon>0$ be given.
Suppose that for every $M$ that is $\delta$-dense in $U_\Sigma$ there is an $f\in F$ such that
$| Pr[f(D)=1] - Pr[f(M)] = 1| >\epsilon$
Then there is a function $h:\Sigma \rightarrow \{0,1\}$ of the form $h(x) = g(f_1(x),\ldots,f_k(x))$ where $k = poly(1/\epsilon,1/\delta)$ and $f_i \in F$ such that
$| Pr [h(R)=1] - Pr [ h(U_\Sigma) =1] | > poly(\epsilon,\delta)$
Readers should take a moment to convince themselves that the above statement is indeed saying that if $R$ is pseudorandom then $D$ has a model $M$, by equivalently saying that if no model $M$ exists then $R$ is not pseudorandom.
The problem with the above statement is that $g$ can be arbitrary and, in particular, it can have circuit complexity exponential in $k$, and hence in $1/\epsilon$.
In our proof, instead, $g$ is a linear threshold function, realizable by a $O(k)$ size circuit. Another improvement is that $k=poly(1/\epsilon,\log 1/\delta)$.
Here is the proof by Omer Reingold, Madhur Tulsiani, Salil Vadhan, and me. Assume $F$ is closed under complement (otherwise work with the closure of $F$), then the assumption of the theorem can be restated without absolute values
for every $M$ that is $\delta$-dense in $U_\Sigma$ there is an $f\in F$ such that
$Pr[f(D)=1] - Pr[f(M) = 1] >\epsilon$
We begin by finding a "universal distinguisher."
Claim
There is a function $\bar f:\Sigma \rightarrow [0,1]$ which is a convex combination of functions from $F$ and such that that for every $M$ that is $\delta$-dense in $U_\Sigma$,
$E[\bar f(D)] - E[\bar f(M)] >\epsilon$
This can be proved via the min-max theorem for two-players games, or, equivalently, via linearity of linear programming, or, like an analyst would say, via the Hahn-Banach theorem.
Let now $S$ be the set of $\delta |\Sigma|$ elements of $\Sigma$ where $\bar f$ is largest. We must have
(1) $E[\bar f(D)] - E[\bar f(U_S)] >\epsilon$
which implies that there must be a threshold $t$ such that
(2) $Pr[\bar f(D)\geq t] - Pr[\bar f(U_S) \geq t] >\epsilon$
So we have found a boolean distinguisher between $D$ and $U_S$. Next,
we claim that the same distinguisher works between $R$ and $U_\Sigma$.
By the density assumption, we have
$Pr[\bar f(R)\geq t] \geq \delta \cdot Pr[\bar f(D) \geq t]$
and since $S$ contains exactly a $\delta$ fraction of $\Sigma$, and since the condition $\bar f(x) \geq t$ always fails outside of $S$ (why?), we then have
$Pr[\bar f(U_\Sigma)\geq t] = \delta \cdot Pr[\bar f(U_S) \geq t]$
and so
(3) $Pr[\bar f(R)\geq t] - Pr[\bar f(U_\Sigma) \geq t] >\delta \epsilon $
Now, it's not clear what the complexity of $\bar f$ is: it could be a convex combination involving
all the functions in $F$. However, by Chernoff bounds, there must be functions $f_1,\ldots,f_k$ with $k=poly(1/\epsilon,\log 1/\delta)$ such that $\bar f(x)$ is well approximated by $\sum_i f_i(x) / k$ for all $x$ but for an exceptional set having density less that, say, $\delta\epsilon/10$, according to both $R$ and $U_\Sigma$.
Now $R$ and $U_\Sigma$ are distinguished by the predicate $\sum_{i=1}^k f_i(x) \geq tk$, which is just a linear threshold function applied to a small set of functions from $F$, as promised.
Actually I have skipped an important step: outside of the exceptional set, $\sum_i f_i(x)/k$ is going to be
close to $\bar f(x)$ but not identical, and this could lead to problems. For example, in (3) $\bar f(R)$ might typically be larger than $t$ only by a tiny amount, and $\sum_i f_i(x)/k$ might consistently underestimate $\bar f$ in $R$. If so, $Pr [ \sum_{i=1}^k f_i(R) \geq tk ]$ could be a completely different quantity from $Pr [\bar f(R)\geq t]$.
To remedy this problem, we note that, from (1), we can also derive the more "robust" distinguishing statement
(2') $Pr[\bar f(D)\geq t+\epsilon/2] - Pr[\bar f(U_S) \geq t] >\epsilon/2$
from which we get
(3') $Pr[\bar f(R)\geq t+\epsilon/2] - Pr[\bar f(U_\Sigma) \geq t] >\delta \epsilon/2 $
And now we can be confident that even replacing $\bar f$ with an approximation we still get a distinguisher.
The statement needed in number-theoretic applications is stronger in a couple of ways. One is that we would like $F$ to contain bounded functions $f:\Sigma \rightarrow [0,1]$ rather than boolean-valued functions. Looking back at our proof, this makes no difference. The other is that we would like $h(x)$ to be a function of the form $h(x) = \Pi_{i=1}^k f_i(x)$ rather than a general composition of functions $f_i$. This we can achieve by approximating a threshold function by a polynomial of degree $poly(1/\epsilon,1/\delta)$ using the Weierstrass theorem, and then choose the most distinguishing monomial. This gives a proof of the following statement, which is equivalent to Theorem 7.1 in the
Tao-Ziegler paper.
Theorem (Green-Tao-Ziegler, General Case)
Let $\Sigma$ be a finite set, $F$ be a class of functions $f:\Sigma \to [0,1]$, $R$ be a distribution over $\Sigma$, $D$ be a $\delta$-dense distribution in $R$, $\epsilon>0$ be given.
Suppose that for every $M$ that is $\delta$-dense in $U_\Sigma$ there is an $f\in F$ such that
$| Pr[f(D)=1] - Pr[f(M)] = 1| >\epsilon$
Then there is a function $h:\Sigma \rightarrow \{0,1\}$ of the form $h(x) = \Pi_{i=1}^k f_i(x)$ where $k = poly(1/\epsilon,1/\delta)$ and $f_i \in F$ such that
$| Pr [f(R)=1] - Pr [ f(U_\Sigma) =1] | > exp(-poly(1/\epsilon,1/\delta))$
In this case, we too lose an exponential factor. Our proof, however, has some interest even in the number-theoretic setting because it is somewhat simpler than and genuinely different from the original one.