Assume that we are given \(k\) linearly separable sets
\(A_1,A_2,\cdots,A_k\subset \mathbb{R}^d\). We define the set of classifiable
weights as
\[
\mathbf\Theta = \{\mathbf\theta = (W,b): w_ix+b_i>w_jx+b_j,~\forall x\in A_i, j\neq i, i= 1,\cdots,k\}
\]
which means that those \((W,b)\) separate \(A_1,A_2,\cdots,A_k\) correctly.
Our linear separability assumption implies that
\(\mathbf\Theta\neq \emptyset\), so linearly classifiable weights exist.
But how can we find an element of \(\mathbf\Theta\)?
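To make the definition concrete, here is a minimal NumPy sketch that checks whether a given \(\mathbf\theta = (W,b)\) belongs to \(\mathbf\Theta\); the function name and the data layout (a list `A` with one \(n_i\times d\) array per class) are illustrative choices, not part of the notes.

```python
import numpy as np

def is_classifiable(A, W, b):
    """Check whether w_i x + b_i > w_j x + b_j for every x in A_i and every j != i."""
    for i, Ai in enumerate(A):
        S = Ai @ W.T + b                         # scores w_j x + b_j, shape (n_i, k)
        others = np.delete(S, i, axis=1)         # scores of the competing classes
        if not np.all(S[:, i] > others.max(axis=1)):
            return False
    return True
```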
Definition (softmax)
Given \(s = (s_1,s_2,\cdots,s_k)^T\in \mathbb{R}^k\), we define the soft-max
mapping \(\sigma: \mathbb{R}^k \rightarrow\mathbb{R}^k\) as
\[\begin{split}
\sigma(s) = \frac{e^{s}}{e^{s}\cdot \mathbf{1}} = \frac{1}{\sum\limits_{i=1}^k e^{s_i}}
\begin{pmatrix}
e^{s_1}\\
e^{s_2}\\
\vdots\\
e^{s_k}
\end{pmatrix}
\end{split}\]
where
\[\begin{split}
e^{s} =
\begin{pmatrix}
e^{s_1}\\
e^{s_2}\\
\vdots\\
e^{s_k}
\end{pmatrix}
\end{split}\]
\[\begin{split} \mathbf{1} =
\begin{pmatrix}
1\\
1 \\
\vdots \\
1
\end{pmatrix} \in\mathbb{R}^k
\end{split}\]
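As an implementation note (not part of the formal development), the soft-max can be coded in a few lines of NumPy; subtracting \(\max_i s_i\) before exponentiating leaves \(\sigma(s)\) unchanged but avoids overflow.

```python
import numpy as np

def softmax(s):
    """Soft-max mapping sigma: R^k -> R^k."""
    s = np.asarray(s, dtype=float)
    e = np.exp(s - s.max())        # shifting every s_i by max(s) does not change sigma(s)
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))    # components lie in (0, 1) and sum to 1
```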
Definition
Given parameter \(\mathbf\theta = (W,b)\), we define a feature mapping
\(\mathbf p: \mathbb{R}^d \rightarrow \mathbb{R}^k\) as
\[\begin{split}
\mathbf p(x; \mathbf\theta) = \sigma(Wx+b) = \frac{1}{\sum\limits_{i=1}^k e^{w_i x+b_i}}
\begin{pmatrix}
e^{w_1 x+b_1}\\
e^{w_2 x+b_2}\\
\vdots\\
e^{w_k x+b_k}
\end{pmatrix}
= \begin{pmatrix}
p_1(x; \mathbf\theta) \\
p_2(x; \mathbf\theta) \\
\vdots \\
p_k(x; \mathbf\theta)
\end{pmatrix}
\end{split}\]
where the \(i\)-th component
(10)\[
p_i(x; \mathbf\theta) = \frac{e^{w_i x+b_i}}{\sum\limits_{j=1}^k e^{w_j x+b_j}}.
\]
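A corresponding sketch of the feature mapping, again in NumPy with hypothetical names: here \(W\) is a \(k\times d\) matrix and \(b\in\mathbb{R}^k\), so \(\mathbf p(x;\mathbf\theta)=\sigma(Wx+b)\).

```python
import numpy as np

def feature_map(x, W, b):
    """p(x; theta) = sigma(W x + b), the vector of class probabilities."""
    s = W @ x + b                  # scores w_i x + b_i, i = 1,...,k
    e = np.exp(s - s.max())        # stable soft-max
    return e / e.sum()

# toy example with k = 3 classes and d = 2 features
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.zeros(3)
x = np.array([2.0, -1.0])
print(feature_map(x, W, b))        # a probability vector in R^3
```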
The soft-max mapping has several important properties.
\(\displaystyle 0< p_i(x; \mathbf\theta) <1,~\sum_i p_i(x; \mathbf\theta) = 1\).
This implies that \(\mathbf p(x; \mathbf\theta)\) can be regarded as a
probability distribution over the \(k\) classes: given
\(x\in \mathbb{R}^d\), we have \(x\in A_i\) with probability
\(p_i(x; \mathbf{\theta})\), \(i = 1,\cdots,k\).
\(p_i(x; \mathbf\theta)>p_j(x; \mathbf\theta)\Leftrightarrow w_ix+b_i>w_j x+b_j.\)
This implies that the linearly classifiable weights have an
equivalent description as
\[
\mathbf{\Theta} = \left\{\mathbf\theta: p_i(x; \mathbf\theta)>p_j(x; \mathbf\theta),~\forall x\in A_i, j\neq i, i= 1,\cdots,k\right\}
\]
We usually use the max-out method to do classification. For a given
data point \(x\), we first use the soft-max mapping to map it to
\(\mathbf p(x; \mathbf\theta)\), then we assign \(x\) to the class
\(i= \arg\max_j p_j(x; \mathbf\theta)\).
This means that we pick the label \(i\) as the class of \(x\) such that
\(x\in A_i\) has the largest probability \(p_i(x; \mathbf\theta)\).
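In code, the max-out rule is a single arg-max; since the soft-max is monotone in each score, taking the arg-max of \(\mathbf p(x;\mathbf\theta)\) is the same as taking the arg-max of the raw scores \(Wx+b\). A small sketch with a hypothetical helper name:

```python
import numpy as np

def classify(x, W, b):
    """Assign x to the class i with the largest probability p_i(x; theta)."""
    return int(np.argmax(W @ x + b))   # argmax of scores = argmax of soft-max outputs
```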
A more detailed discussion of logistic regression from the probability
perspective will be presented in the near future.
From the above properties, we can define the following likelihood
function to help find elements in \(\mathbf{\Theta}\):
\[
P (\mathbf\theta)=
\prod\limits_{i = 1}^k \prod\limits_{x\in A_i} p_i(x; \mathbf\theta).
\]
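For illustration, \(P(\mathbf\theta)\) can be evaluated directly as a product over all data points (same assumed data layout as before); in practice one works with \(\log P(\mathbf\theta)\) instead, since this product underflows quickly for large data sets.

```python
import numpy as np

def likelihood(A, W, b):
    """P(theta) = prod_i prod_{x in A_i} p_i(x; theta)."""
    P = 1.0
    for i, Ai in enumerate(A):
        S = Ai @ W.T + b                        # scores for every x in A_i, shape (n_i, k)
        E = np.exp(S - S.max(axis=1, keepdims=True))
        p_i = E[:, i] / E.sum(axis=1)           # i-th soft-max component for each x
        P *= p_i.prod()
    return P
```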
Based on the property that
(11)\[
p_i (x; \mathbf \theta) = \max_{1\le j \le k} p_j(x; \mathbf \theta), \quad\forall x \in A_i,\ \mathbf \theta \in \Theta,
\]
we may use the following optimization problem
(12)\[
\max_{\mathbf \theta} P(\mathbf \theta)
\]
to find an element in
\(\mathbf{\Theta}\). More precisely, let us introduce the following lemmas
(properties) of \(P(\mathbf \theta)\).
Lemma 1
Assume that the sets
\(A_1,A_2,\cdots,A_k\) are linearly separable. Then we have
\[
\left\{\mathbf \theta:~P(\mathbf\theta)>\frac{1}{2}\right\}\subset \mathbf{\Theta}.
\]
Proof. It suffices to show that if \(\mathbf\theta \not\in \mathbf\Theta\), we
must have \(P(\mathbf\theta)\leq\frac{1}{2}\). For any \(\mathbf\theta \not\in
\mathbf\Theta\), there must exist an \(i_0\), an \(x_0\in A_{i_0}\), and a
\(j_0\neq i_0\) such that
\[
w_{i_0} x_0 + b_{i_0} \leq w_{j_0}x_0 + b_{j_0}.
\]
Then we have
\[
p_{i_0}(x_0; \mathbf\theta) \leq \frac{e^{w_{i_0} x_0 + b_{i_0}}}{e^{w_{i_0} x_0+b_{i_0}}+e^{w_{j_0} x_0+b_{j_0}}} \leq\frac{1}{2}.
\]
Notice that \(p_i(x; \mathbf \theta) < 1\) for all \(i = 1,\cdots,k\) and all \(x\in A_1\cup\cdots\cup A_k\).
So
\[
P(\mathbf\theta) < p_{i_0}(x_0; \mathbf\theta) \leq \frac{1}{2}.
\]
Lemma 2
If \(A_1,A_2,\cdots,A_k\) are linearly separable and
\(\mathbf\theta \in \mathbf\Theta\), we have
\[
\lim_{\alpha\rightarrow +\infty}p_i(x; \alpha\mathbf\theta) = 1\Leftrightarrow x\in A_i.
\]
Proof. We first note that if \(x\in A_i\), then \(w_j x + b_j < w_i x + b_i\) for all \(j\neq i\), so
\[
p_i(x; \alpha\mathbf \theta) = \frac{1}{1+\sum\limits_{j\neq i}e^{\alpha[(w_j x+ b_j)-(w_i x+b_i)]}} \to 1, \quad \text{as} \quad \alpha \to +\infty.
\]
On the other hand, if \(x\not\in A_i\), then \(x\in A_j\) for some \(j\neq i\) with \(w_j x + b_j > w_i x + b_i\), so for all \(\alpha>0\)
\[
p_i(x; \alpha\mathbf\theta) = \frac{1}{1+\sum\limits_{j\neq i}e^{\alpha[(w_j x+ b_j)-(w_i x+b_i)]}} \leq \frac{1}{2}.
\]
This implies that if \(x\not\in A_i\),
\(\lim_{\alpha\rightarrow \infty}p_i(x; \alpha\mathbf \theta)\neq 1\) which is
equivalent to the proposition that if
\(\lim_{\alpha\rightarrow \infty}p_i(x; \alpha\mathbf \theta)= 1\), then
\(x\in A_i\).
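A quick numerical illustration of Lemma 2, with toy numbers of our own choosing: for a point \(x\) that \(\mathbf\theta\) classifies into class \(i\), the probability \(p_i(x;\alpha\mathbf\theta)\) climbs toward \(1\) as the scale \(\alpha\) grows.

```python
import numpy as np

def prob(i, x, W, b):
    """The i-th soft-max component p_i(x; theta)."""
    s = W @ x + b
    e = np.exp(s - s.max())
    return e[i] / e.sum()

# d = 1, k = 2; theta = (W, b) puts x = 1.5 into class 0
W = np.array([[1.0], [-1.0]])
b = np.array([0.0, 0.0])
x = np.array([1.5])

for alpha in [1, 2, 5, 10]:
    print(alpha, prob(0, x, alpha * W, alpha * b))   # tends to 1 as alpha grows
```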
Lemma 3
If \(A_1,A_2,\cdots,A_k\) are linearly separable, then
\[
\mathbf\Theta = \left\{\mathbf\theta: \lim_{\alpha\rightarrow +\infty}P(\alpha\mathbf\theta) = 1\right\}.
\]
Proof. We first note that if \(\mathbf\theta \in\mathbf\Theta\), we have
\(\displaystyle\lim_{\alpha\rightarrow +\infty}p_i(x; \alpha\mathbf\theta) = 1\)
for all \(x\in A_i\). So
\[
\lim\limits_{\alpha\rightarrow +\infty} P(\alpha\mathbf\theta) = \lim\limits_{\alpha\rightarrow +\infty} \prod\limits_{i = 1}^k \prod\limits_{x\in A_i} p_i(x; \alpha\mathbf\theta) = \prod\limits_{i = 1}^k \prod\limits_{x\in A_i} \lim\limits_{\alpha\rightarrow +\infty}p_i(x; \alpha\mathbf\theta) = 1.
\]
On the other hand, if
\(\lim\limits_{\alpha\rightarrow +\infty} P(\alpha\mathbf\theta) = 1\), there
must exist one \(\alpha_0>0\) such that
\(P(\alpha_0\mathbf\theta) >\frac{1}{2}\). From Lemma 1, we have \(\alpha_0\mathbf\theta\in\mathbf\Theta\), which
means \(\mathbf\theta\in\mathbf\Theta\).
The properties above imply that we can obtain a classifiable weight by
maximizing \(P(\mathbf\theta)\), while Lemma 3 tells us that
\(P(\mathbf\theta)\) does not actually attain a global maximum: its supremum
\(1\) is only approached as the weights are scaled to infinity.
More specifically, we do not need an exact maximizer; it suffices to find some \(\mathbf \theta\)
such that
\[
P(\mathbf \theta) > \frac{1}{2} \Leftrightarrow L(\mathbf \theta) := -\log P(\mathbf \theta ) < \log 2,
\]
since Lemma 1 then guarantees \(\mathbf\theta\in\mathbf\Theta\).
Regularized logistic regression
Here, we consider the regularization term
\(e^{-\lambda R(\|\mathbf\theta\|)}\) with the following properties:
\(\lambda > 0\).
\(R(t)\) is a strictly increasing function on \(\mathbb{R}^+\) with
\(R(0) = 0\), \(\lim\limits_{t\rightarrow +\infty} R(t) = +\infty\). For
example, \(R(t) = t^2\).
\(\|\cdot\|\) is a norm on \(\mathbb{R}^{k\times(d+1)}\); a commonly used choice is
the Frobenius norm
\[
\|\mathbf \theta\|_F = \sqrt{\sum_{i,j}W_{ij}^2 + \sum_i b_i^2}.
\]
Based on this regularization term, we may consider the following
regularized likelihood function \(P_\lambda(\mathbf\theta)\) as
\[
P_\lambda(\mathbf\theta) = P(\mathbf\theta)e^{-\lambda R(\|\mathbf\theta\|)}.
\]
Here, let us define
\[
\mathbf\Theta_{\lambda} = \mathop{{\arg\max}}_{\mathbf\theta} P_\lambda(\mathbf\theta),
\]
where
\[
\mathop{\arg\max}_{\mathbf\theta} P_\lambda(\mathbf\theta) = \left\{\mathbf \theta ~:~ P_\lambda(\mathbf \theta) = \max_{\mathbf \theta} P_\lambda(\mathbf \theta) \right\}.
\]
The next lemma shows that the set of maximizers of the modified objective is
nonempty.
Lemma 4
Suppose that \(A_1,A_2, \cdots, A_k\) are linearly separable. Then:
if \(\lambda = 0\), \(\mathbf\Theta_0 = \emptyset\);
\(\mathbf\Theta_{\lambda}\) is nonempty for all \(\lambda>0\).
Proof. Lemma 3 shows the first proposition: by Lemma 3 and \(\mathbf\Theta\neq\emptyset\),
\(\sup_{\mathbf\theta}P(\mathbf\theta) = 1\), while \(P(\mathbf\theta)<1\) for every
\(\mathbf\theta\), so \(P\) has no maximizer and \(\mathbf\Theta_0=\emptyset\).
For the second proposition, we notice that
\(P_\lambda(\mathbf 0) = \frac{1}{k^N}\), where \(N\) is the total number of data points.
Because of the properties of \(R(\|\mathbf\theta\|)\), there exists \(m_{\lambda}>0\) such that
\(e^{-\lambda R(\|\mathbf\theta\|)}<\frac{1}{k^N}\), and hence
\(P_\lambda(\mathbf\theta)<P_\lambda(\mathbf 0)\), whenever
\(\|\mathbf\theta\|> m_{\lambda}\).
So a maximizer on \(\{\mathbf\theta: \|\mathbf\theta\| \le m_{\lambda}\}\) must be a
global maximizer, and such a maximizer exists since \(P_\lambda\) is continuous and
\(\{\mathbf\theta: \|\mathbf\theta\| \le m_{\lambda}\}\) is bounded and closed.
Furthermore, we have the following theorem, which shows that we can indeed obtain
an element of \(\mathbf\Theta\) by maximizing \(P_\lambda(\mathbf \theta)\).
Theorem
If \(A_1,A_2,\cdots,A_k\) are linearly separable, then
\[
\mathbf\Theta_{\lambda} \subset \mathbf\Theta
\]
whenever \(\lambda>0\) is sufficiently small.
Proof. By Lemma 3, we can take \(\mathbf\theta_0\in \mathbf\Theta\) such that
\(P(\mathbf\theta_0)> \frac{3}{4}\) (for example, \(\mathbf\theta_0 = \alpha\mathbf\theta\)
for some \(\mathbf\theta\in\mathbf\Theta\) and \(\alpha\) large enough). Then, for any
\(\lambda < \frac{\log \frac{3}{2}}{R(\|\mathbf\theta_0\|)}\)
and any \(\mathbf\theta_{\lambda}\in \mathbf\Theta_{\lambda}\), we have
\[
P(\mathbf\theta_{\lambda}) \geq P_\lambda(\mathbf\theta_{\lambda}) \geq P_\lambda(\mathbf \theta_0) = P(\mathbf\theta_0)e^{-\lambda R(\|\mathbf\theta_0\|)} > \frac{3}{4}\cdot \frac{2}{3} = \frac{1}{2},
\]
which, by Lemma 1, implies that \(\mathbf \theta_{\lambda} \in \mathbf\Theta\). Thus, for any
\(0< \lambda < \frac{\log \frac{3}{2}}{R(\|\mathbf\theta_0\|)}\),
\(\mathbf\Theta_{\lambda} \subset \mathbf\Theta\). ◻
The design of logistic regression rests on the fact that maximizing
\(P_\lambda(\mathbf\theta)\) is equivalent to minimizing
\(-\log P_\lambda(\mathbf\theta)\), i.e.,
\[
\max_{\mathbf \theta} \left\{ P_\lambda(\mathbf\theta) \right\} \Leftrightarrow \min_{\mathbf \theta} \left\{ -\log P_\lambda(\mathbf\theta)\right\},
\]
while the latter is more convenient for evaluating the gradient.
Meanwhile, the regularization term \(\lambda R(\|\mathbf\theta\|)\) added to the objective
function makes the optimization problem have a unique solution.
Mathematically, we can formulate logistic regression as
\[
\min_{\mathbf\theta} L_\lambda(\mathbf \theta),
\]
where
(13)\[
L_\lambda(\mathbf \theta) := -\log P_\lambda(\mathbf\theta) = -\log P(\mathbf\theta) + \lambda R(\|\mathbf\theta\|) = L(\mathbf\theta) + \lambda R(\|\mathbf\theta\|),
\]
with
(14)\[
L(\mathbf \theta) = - \sum_{i=1}^k \sum_{x\in A_i} \log p_{i}(x;\mathbf \theta).
\]
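For concreteness, here is a minimal NumPy sketch of \(L_\lambda\) with the (assumed) choices \(R(t)=t^2\) and the Frobenius norm; the function name and the data layout are ours.

```python
import numpy as np

def regularized_loss(A, W, b, lam):
    """L_lambda(theta) = -sum_i sum_{x in A_i} log p_i(x; theta) + lam * ||theta||_F^2."""
    L = 0.0
    for i, Ai in enumerate(A):
        S = Ai @ W.T + b                                      # scores, shape (n_i, k)
        S -= S.max(axis=1, keepdims=True)                     # stabilized log-soft-max
        log_p = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
        L -= log_p[:, i].sum()                                # -log p_i(x; theta) for x in A_i
    return L + lam * ((W ** 2).sum() + (b ** 2).sum())
```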
Then we have the following logistic regression algorithm.
Algorithm
Given data \(A_1, A_2, \cdots, A_k\), find
\[
\mathbf \theta^* = \mathop{\arg\min}_{\mathbf \theta} L_\lambda(\mathbf \theta),
\]
for some sufficiently small \(\lambda > 0\).
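As one possible realization of this algorithm, the sketch below minimizes \(L_\lambda\) by plain gradient descent with \(R(t)=t^2\) and the Frobenius norm; it uses the standard fact that the gradient of \(-\log p_i(x;\mathbf\theta)\) with respect to the scores is \(\mathbf p(x;\mathbf\theta)-e_i\). Step size, iteration count, and all names are our own choices, not prescribed by the notes.

```python
import numpy as np

def fit_logistic(A, lam=1e-3, lr=0.1, steps=2000):
    """Gradient descent on L_lambda(theta); A[i] holds the points of class i, one per row."""
    k, d = len(A), A[0].shape[1]
    W, b = np.zeros((k, d)), np.zeros(k)
    X = np.vstack(A)                                          # all data points, shape (N, d)
    y = np.concatenate([np.full(len(Ai), i) for i, Ai in enumerate(A)])
    for _ in range(steps):
        S = X @ W.T + b                                       # scores, shape (N, k)
        S -= S.max(axis=1, keepdims=True)
        P = np.exp(S)
        P /= P.sum(axis=1, keepdims=True)                     # p(x; theta) for every x
        G = P.copy()
        G[np.arange(len(y)), y] -= 1.0                        # gradient w.r.t. the scores
        W -= lr * (G.T @ X + 2 * lam * W)                     # dL_lambda/dW
        b -= lr * (G.sum(axis=0) + 2 * lam * b)               # dL_lambda/db
    return W, b

# toy usage: two well-separated classes in R^2
A = [np.random.randn(20, 2) + [3.0, 0.0], np.random.randn(20, 2) + [-3.0, 0.0]]
W, b = fit_logistic(A)
```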