Write $n$ iid draws of $(x,y)$ as $(x^n, y^n)$. Fix $R\in (0,H(x))$. What is the min of $n^{-1}H(y^n \mid f(x^n))$ over maps $f$ with range $\{1,\dots,\exp(nR)\}$, taking $n\to \infty$?

What is $H$? Any function of the data or something more specific? Also, what do we know about the distribution of $(x, y)$? – Vincent, May 10 '20 at 20:03

$H$ is (conditional) Shannon entropy. $(x,y)$ is distributed over $\mathcal{X}\times \mathcal{Y}$, both sets finite. – Christian Chapman, May 10 '20 at 20:04

My current thinking is to find $\min_s n^{-1}H(y^n \mid x^n \in s)$ over sets $s$ of $\exp(n(H(x)-R))$ sequences typical for $x^n$, taking $n\to\infty$. Then if one can design $f$ with fibers similar to the optimum $s$, the main question is solved. I would be surprised if this new min wasn't equal to the information bottleneck optimum, which would make the design of $f$ straightforward. – Christian Chapman, Jun 17 '20 at 21:35

Is $f$ allowed to be a function of $n$ variables, or only a function of one variable applied elementwise? – Matt F., Jun 20 '20 at 5:47

$f$ is over $n$ variables jointly. The other case is already a single-letter problem whose answer can (in principle) be numerically estimated. – Christian Chapman, Jun 20 '20 at 17:51
The characterization is given in terms of a so-called auxiliary random variable. It is as explicit an answer as you'll get, unless you consider very special cases (like jointly Gaussian $X,Y$, or binary-valued $X,Y$). Namely, you have
$$ \lim_{n\to\infty}\min_{f:\, x^n \mapsto f(x^n)\in \{1,\dots,2^{nR}\}} \frac{1}{n}H(Y^n\mid f(X^n)) = \inf_{U :\, Y-X-U,\ I(X;U)\leq R} H(Y\mid U). $$
In the expression on the right, we look over all random variables $U$, jointly distributed with $X,Y$, such that $Y$ and $U$ are conditionally independent given $X$ (i.e., $Y - X - U$ forms a Markov chain), and $I(X;U)\leq R$, where $I$ denotes the usual mutual information.
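As a sanity check, the right-hand side can be brute-forced on a toy source. Everything below is an assumption for illustration: $X$ uniform on $\{0,1\}$, $Y$ the output of a BSC with crossover $0.1$, and $U$ restricted to a binary alphabet (restricting $U$'s alphabet can only overestimate the true infimum, but it suffices to see the qualitative behavior):

```python
import numpy as np
from itertools import product

def ent(p):
    """Shannon entropy (bits) of a probability vector, ignoring zeros."""
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log2(p)))

# Toy source (an assumption for illustration): X uniform, Y = X through a BSC(0.1)
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],
                        [0.1, 0.9]])            # rows indexed by x

def bottleneck_value(R, grid=51):
    """Grid search for inf H(Y|U) over binary-U test channels p_{U|X},
    subject to the Markov chain Y - X - U and the constraint I(X;U) <= R."""
    best = np.inf
    for a, b in product(np.linspace(0, 1, grid), repeat=2):
        p_u_given_x = np.array([[a, 1 - a],
                                [b, 1 - b]])     # rows x, cols u
        p_xu = p_x[:, None] * p_u_given_x        # joint of (X, U)
        p_u = p_xu.sum(axis=0)
        # I(X;U) = H(U) - H(U|X)
        i_xu = ent(p_u) - sum(p_x[x] * ent(p_u_given_x[x]) for x in range(2))
        if i_xu > R + 1e-9:
            continue
        # Y - X - U forces p(y|u) = sum_x p(x|u) p(y|x)
        h_y_u = 0.0
        for u in range(2):
            if p_u[u] > 1e-12:
                p_y_u = p_y_given_x.T @ (p_xu[:, u] / p_u[u])
                h_y_u += p_u[u] * ent(p_y_u)
        best = min(best, h_y_u)
    return best
```

At $R = H(X) = 1$ this recovers $H(Y\mid X) = h(0.1)\approx 0.469$ (take $U=X$), while at $R=0$ only channels with $I(X;U)=0$ are feasible and the value is $H(Y)=1$; intermediate rates interpolate strictly between the two.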
This is a special case of the indirect source coding problem (or, more generally, the CEO problem) under logarithmic loss. The information bottleneck problem corresponds to the unconstrained optimization problem associated to the one on the right (i.e., the IB functional is equivalent to the Lagrangian of the thing on the right).
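To spell out the parenthetical: since $H(Y\mid U) = H(Y) - I(Y;U)$ and $H(Y)$ is fixed by the source, the Lagrangian of the constrained problem, with multiplier $\lambda \geq 0$,
$$ \mathcal{L}(p_{U\mid X};\lambda) \;=\; H(Y\mid U) + \lambda\, I(X;U) \;=\; H(Y) - I(Y;U) + \lambda\, I(X;U), $$
is (up to the additive constant $H(Y)$ and a positive scaling) the usual information bottleneck functional $I(X;U) - \beta\, I(Y;U)$ with $\beta = 1/\lambda$.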
Here is a quick sketch of the proof.
Claim (Lower Bound): For any $f: x^n \mapsto f(x^n) \in \{1,\dots, 2^{nR}\}$, there is a random variable $U$ satisfying $Y-X-U$, $I(X;U)\leq R$ and $\frac{1}{n}H(Y^n\mid f(X^n)) \geq H(Y\mid U)$.
Proof: By the chain rule, write $$ \frac{1}{n}H(Y^n\mid f(X^n)) = \frac{1}{n}\sum_{i=1}^n H(Y_i\mid Y^{i-1},f(X^n))= \frac{1}{n}\sum_{i=1}^n H(Y_i\mid V_i), $$ where we define $V_i := (Y^{i-1},f(X^n))$. Observe that, since $(X^n,Y^n)$ are iid draws of $(X,Y)$, we have $Y_i - X_i - V_i$ for each $i=1,\dots, n$. Now, on a common probability space with $(X,Y)$, construct a random variable $U$ as follows. Given $X=x$, draw $Q$ uniformly from $\{1,\dots, n\}$, and take $U = (V_Q,Q)$, where $V_Q\mid \{Q=i, X=x\} \sim p_{V_i\mid X_i}(\cdot \mid x)$. Evidently, we have $Y - X - U$, and by definition of $U$, $$ \frac{1}{n}H(Y^n\mid f(X^n)) = \frac{1}{n}\sum_{i=1}^n H(Y_i\mid V_i) = H(Y\mid U), $$ since the $Y_i$'s are iid. Now, for each $i=1,\dots, n$, we have $X_i - (X^{i-1},f(X^n)) - (Y^{i-1},f(X^n))$ (again, using the fact that $(X^n,Y^n)$ are iid draws of $(X,Y)$, together with the fact that $f$ is a function of the $x_i$'s), so properties of entropy and the data processing inequality give $$ R \geq \frac{1}{n}I(X^n ; f(X^n)) = \frac{1}{n}\sum_{i=1}^n I(X_i ; X^{i-1},f(X^n)) \geq \frac{1}{n}\sum_{i=1}^n I(X_i ; V_i) = I(X;U). $$
Claim (Upper Bound): If $Y-X-U$ satisfy $I(X;U)\leq R$, then for $n$ sufficiently large, there is a function $f: x^n \mapsto f(x^n) \in \{1,\dots, 2^{nR}\}$ with $\frac{1}{n}\left(H(Y^n \mid f(X^n))+o(n)\right) = H(Y\mid U)$.
Proof: Fix any random variable $U$ on a common probability space with $(X,Y)$ such that $Y-X-U$ and $I(X;U) < R$. Generate $2^{nR}$ sequences $U^n(1), \dots, U^n(2^{nR})$ independently, according to $p_U^{\otimes n}$. By standard typicality arguments, if we observe $X^n$, then with high probability there is an index $f(X^n)\in \{1,\dots, 2^{nR}\}$ such that $U^n(f(X^n))$ is jointly typical with $X^n$ (and therefore also $Y^n$, by the Markov Lemma). By standard properties of (conditional) typical sets, we have $$ \frac{1}{n}H(Y^n \mid f(X^n)) \sim H(Y\mid U). $$
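For completeness, the codebook-size accounting behind "with high probability" is the standard covering estimate: a single codeword $U^n(j) \sim p_U^{\otimes n}$ is jointly typical with a typical $X^n$ with probability at least $2^{-n(I(X;U)+\epsilon)}$, so
$$ \Pr\left\{\text{no } j \text{ with } (X^n, U^n(j)) \text{ jointly typical}\right\} \leq \left(1 - 2^{-n(I(X;U)+\epsilon)}\right)^{2^{nR}} \leq \exp\left(-2^{n(R - I(X;U) - \epsilon)}\right) \to 0 $$
whenever $R > I(X;U) + \epsilon$; this is where the strict inequality $I(X;U) < R$ is used.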

Tom, thanks for the proof. I think this is actually subtly different from source coding under logarithmic loss. Choosing the best encoder here means minimizing a different objective: (1) in this case, the "loss function" simultaneously depends on all $n$ source letters and the encoder word in some complex way; (2) in the source coding problem, the loss function is an average of $n$ identical single-letter losses. Your proof shows that the minima of the two problems are the same. – Aug 30 at 17:53

Put another way: after conditioning on the encoding, the source letters are no longer necessarily independent. This question's objective function keeps track of this non-independence; the source coding problem does not. – Aug 30 at 17:56