### Dense Subsets of Pseudorandom Sets

The Green-Tao theorem states that the primes contain arbitrarily long arithmetic progressions; its proof can be, somewhat inaccurately, broken up into the following two steps:

Thm1: Every constant-density subset of a pseudorandom set of integers contains arbitrarily long arithmetic progressions.

Thm2: The primes have constant density inside a pseudorandom set.

Of those, the main contribution of the paper is the first theorem, a "relative" version of Szemeredi's theorem. In turn, its proof can be (even more inaccurately) broken up as

Thm 1.1: For every constant density subset D of a pseudorandom set there is a "model" set M that has constant density among the integers and is indistinguishable from D.

Thm 1.2 (Szemeredi) Every constant density subset of the integers contains arbitrarily long arithmetic progressions, and many of them.

Thm 1.3 A set with many long arithmetic progressions cannot be indistinguishable from a set with none.

Following this scheme is, of course, easier said than done. One wants to work with a definition of pseudorandomness that is weak enough that (2) is provable, but strong enough that the notion of indistinguishability implied by (1.1) is in turn strong enough that (1.3) holds. From now on I will focus on (1.1), which is a key step in the proof, though not the hardest.

Recently, Tao and Ziegler proved that the primes contain arbitrarily long "polynomial progressions" (progressions where the increments are given by polynomials rather than linear functions, as in the case of arithmetic progressions). Their paper contains a very clean formulation of (1.1), which I will now (accurately, this time) describe. (It is Theorem 7.1 in the paper. The language I use below is very different but equivalent.)

We fix a finite universe $\Sigma$; this could be $\{ 0,1\}^n$ in complexity-theoretic applications or $Z/NZ$ in number-theoretic applications. Instead of working with subsets of $\Sigma$, it will be more convenient to refer to probability distributions over $\Sigma$; if $S$ is a set, then $U_S$ is the uniform distribution over $S$. We also fix a family $F$ of "easy" function $f: \Sigma \rightarrow [0,1]$. In a complexity-theoretic applications, this could be the set of boolean functions computed by circuits of bounded size. We think of two distributions $X,Y$ as being $\epsilon$-indistinguishable according to $F$ if for every function $f\in F$ we have

$| E [f(X)] - E[f(Y)] | \leq \epsilon$

and we think of a distribution as pseudorandom if it is indistinguishable from the uniform distribution $U_\Sigma$. (This is all standard in cryptography and complexity theory.)

Now let's define the natural analog of "dense subset" for distributions. We say that a distribution $A$ is $\delta$-dense in $B$ if for every $x\in \Sigma$ we have

$Pr [ B=x] \geq \delta Pr [A=x]$

Note that if $B=U_T$ and $A=U_S$ for some sets $S,T$, then $A$ is $\delta$-dense in $B$ if and only if $S\subseteq T$ and $|S| \geq \delta |T|$.

So we want to prove the following:

[The reader may want to go back to (1.1) and check that this is a meaningful formalization of it, up to working with arbitrary distributions rather than sets. This is in fact the "inaccuracy" that I referred to above.]

In a complexity-theoretic setting, we would like to say that if $F$ is defined as all functions computable by circuits of size at most $s$, then $\epsilon'$ should be $poly (\epsilon,\delta)$ and $F'$ should contain only functions computable by circuits of size $s\cdot poly(1/\epsilon,1/\delta)$. Unfortunately, if one follows the proof and makes some simplifications asuming $F$ contains only boolean functions, one sees that $F'$ contains functions of the form $g(x) = h(f_1(x),\ldots,f_k(x))$, where $f_i \in F$, $k = poly(1/\epsilon,1/\delta)$, and $h$ could be arbitrary and, in general, have circuit complexity exponential in $1/\epsilon$ and $1/\delta$. Alternatively one may approximate $h()$ as a low-degree polynomial and take the "most distinguishing monomial." This will give a version of the Theorem (which leads to the actual statement of Thm 7.1 in the Tao-Ziegler paper) where $F'$ contains only functions of the form $\Pi_{i=1}^k f_i(x)$, but then $\epsilon'$ will be exponentially small in $1/\epsilon$ and $1/\delta$. This means that one cannot apply the theorem to "cryptographically strong" notions of pseudorandomness and indistinguishability, and in general to any setting where $1/\epsilon$ and $1/\delta$ are super-logarithmic (not to mention super-linear).

This seems like an unavoidable consequence of the "finitary ergodic theoretic" technique of iterative partitioning and energy increment used in the proof, which always yields at least a singly exponential complexity.

Omer Reingold, Madhur Tulsiani, Salil Vadhan and I have recently come up with a different proof where both $\epsilon'$ and the complexity of $F'$ are polynomial. This gives, for example, a new characterization of the notion of pseudoentropy. Our proof is quite in the spirit of Nisan's proof of Impagliazzo's hard-core set theorem, and it is relatively simple. We can also deduce a version of the theorem where, as in Green-Tao-Ziegler, $F'$ contains only bounded products of functions in $F$. In doing so, however, we too incur an exponential loss, but the proof is somewhat simpler and demonstrates the applicability of complexity-theoretic techniques in arithmetic combinatorics.

Since we can use (ideas from) a proof of the hard core set theorem to prove the Green-Tao-Ziegler result, one may wonder whether one can use the "finitary ergodic theory" techniques of iterative partitioning and energy increment to prove the hard-core set theorem. Indeed, we do this too. In our proof, the reduction loses a factor that is exponential in certain parameters (while other proofs are polynomial), but one also gets a more "constructive" result.

If readers can stomach it, a forthcoming post will describe the complexity-theory-style proof of the Green-Tao-Ziegler result as well as the ergodic-theory-style proof of the Impagliazzo hard core set theorem.

Thm1: Every constant-density subset of a pseudorandom set of integers contains arbitrarily long arithmetic progressions.

Thm2: The primes have constant density inside a pseudorandom set.

Of those, the main contribution of the paper is the first theorem, a "relative" version of Szemeredi's theorem. In turn, its proof can be (even more inaccurately) broken up as

Thm 1.1: For every constant density subset D of a pseudorandom set there is a "model" set M that has constant density among the integers and is indistinguishable from D.

Thm 1.2 (Szemeredi) Every constant density subset of the integers contains arbitrarily long arithmetic progressions, and many of them.

Thm 1.3 A set with many long arithmetic progressions cannot be indistinguishable from a set with none.

Following this scheme is, of course, easier said than done. One wants to work with a definition of pseudorandomness that is weak enough that (2) is provable, but strong enough that the notion of indistinguishability implied by (1.1) is in turn strong enough that (1.3) holds. From now on I will focus on (1.1), which is a key step in the proof, though not the hardest.

Recently, Tao and Ziegler proved that the primes contain arbitrarily long "polynomial progressions" (progressions where the increments are given by polynomials rather than linear functions, as in the case of arithmetic progressions). Their paper contains a very clean formulation of (1.1), which I will now (accurately, this time) describe. (It is Theorem 7.1 in the paper. The language I use below is very different but equivalent.)

We fix a finite universe $\Sigma$; this could be $\{ 0,1\}^n$ in complexity-theoretic applications or $Z/NZ$ in number-theoretic applications. Instead of working with subsets of $\Sigma$, it will be more convenient to refer to probability distributions over $\Sigma$; if $S$ is a set, then $U_S$ is the uniform distribution over $S$. We also fix a family $F$ of "easy" function $f: \Sigma \rightarrow [0,1]$. In a complexity-theoretic applications, this could be the set of boolean functions computed by circuits of bounded size. We think of two distributions $X,Y$ as being $\epsilon$-indistinguishable according to $F$ if for every function $f\in F$ we have

$| E [f(X)] - E[f(Y)] | \leq \epsilon$

and we think of a distribution as pseudorandom if it is indistinguishable from the uniform distribution $U_\Sigma$. (This is all standard in cryptography and complexity theory.)

Now let's define the natural analog of "dense subset" for distributions. We say that a distribution $A$ is $\delta$-dense in $B$ if for every $x\in \Sigma$ we have

$Pr [ B=x] \geq \delta Pr [A=x]$

Note that if $B=U_T$ and $A=U_S$ for some sets $S,T$, then $A$ is $\delta$-dense in $B$ if and only if $S\subseteq T$ and $|S| \geq \delta |T|$.

So we want to prove the following:

Theorem (Green, Tao, Ziegler)

Fix a family $F$ of tests and an $\epsilon>0$; then there is a "slightly larger" family $F'$ and an $\epsilon'>0$ such that if $R$ is an $\epsilon'$-pseudorandom distribution according to $F'$ and $D$ is $\delta$-dense in $R$, then there is a distribution $M$ that is $\delta$-dense in $U_\Sigma$ and that is $\epsilon$-indistinguishable from $D$ according to $F$.

[The reader may want to go back to (1.1) and check that this is a meaningful formalization of it, up to working with arbitrary distributions rather than sets. This is in fact the "inaccuracy" that I referred to above.]

In a complexity-theoretic setting, we would like to say that if $F$ is defined as all functions computable by circuits of size at most $s$, then $\epsilon'$ should be $poly (\epsilon,\delta)$ and $F'$ should contain only functions computable by circuits of size $s\cdot poly(1/\epsilon,1/\delta)$. Unfortunately, if one follows the proof and makes some simplifications asuming $F$ contains only boolean functions, one sees that $F'$ contains functions of the form $g(x) = h(f_1(x),\ldots,f_k(x))$, where $f_i \in F$, $k = poly(1/\epsilon,1/\delta)$, and $h$ could be arbitrary and, in general, have circuit complexity exponential in $1/\epsilon$ and $1/\delta$. Alternatively one may approximate $h()$ as a low-degree polynomial and take the "most distinguishing monomial." This will give a version of the Theorem (which leads to the actual statement of Thm 7.1 in the Tao-Ziegler paper) where $F'$ contains only functions of the form $\Pi_{i=1}^k f_i(x)$, but then $\epsilon'$ will be exponentially small in $1/\epsilon$ and $1/\delta$. This means that one cannot apply the theorem to "cryptographically strong" notions of pseudorandomness and indistinguishability, and in general to any setting where $1/\epsilon$ and $1/\delta$ are super-logarithmic (not to mention super-linear).

This seems like an unavoidable consequence of the "finitary ergodic theoretic" technique of iterative partitioning and energy increment used in the proof, which always yields at least a singly exponential complexity.

Omer Reingold, Madhur Tulsiani, Salil Vadhan and I have recently come up with a different proof where both $\epsilon'$ and the complexity of $F'$ are polynomial. This gives, for example, a new characterization of the notion of pseudoentropy. Our proof is quite in the spirit of Nisan's proof of Impagliazzo's hard-core set theorem, and it is relatively simple. We can also deduce a version of the theorem where, as in Green-Tao-Ziegler, $F'$ contains only bounded products of functions in $F$. In doing so, however, we too incur an exponential loss, but the proof is somewhat simpler and demonstrates the applicability of complexity-theoretic techniques in arithmetic combinatorics.

Since we can use (ideas from) a proof of the hard core set theorem to prove the Green-Tao-Ziegler result, one may wonder whether one can use the "finitary ergodic theory" techniques of iterative partitioning and energy increment to prove the hard-core set theorem. Indeed, we do this too. In our proof, the reduction loses a factor that is exponential in certain parameters (while other proofs are polynomial), but one also gets a more "constructive" result.

If readers can stomach it, a forthcoming post will describe the complexity-theory-style proof of the Green-Tao-Ziegler result as well as the ergodic-theory-style proof of the Impagliazzo hard core set theorem.