In this section, we formulate the effective secure throughput optimization problem in the system described above as a POMDP [21], which can determine the optimal policy for the number of messages/data blocks in the MT selection (for security) and relay selection (for QoS) to maximize the system effective secure throughput.

Markov decision process (MDP) provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. In VANETs with cooperative communications, the vehicles make decisions at specific time instances according to the current state *s*(*t*), and the system moves into a new state based on the current state *s*(*t*) as well as the chosen decision *a*(*t*).

As described in Section 2, we use FSMC. Given the current channel state *s*(*t*), the next channel state is conditionally independent of all previous states and actions. This Markov property of state transition process makes it possible to model the optimization problem as an MDP. Furthermore, in VANETs, due to channel sensing and channel state information errors, the system state cannot directly be observed. As a result, we formulate the optimization problem as a POMDP, in which it is assumed that the system dynamics are determined by an MDP, but the underlying state can only be observed inaccurately, or with some probabilities.

A POMDP can be defined by a six-tuple <*S*,*A*,*P*,*Θ*,*B*,*R*>, where *S* is a finite set of states, with state *i* denoted by *s*_{i}; *A* is a finite set of actions, with action *i* denoted by *a*_{i}; *P* contains the transition probabilities for each action in each state, where ${p}_{\mathit{\text{ij}}}^{a}$ denotes the probability that the system moves from state *s*_{i} to state *s*_{j} when action *a* is performed; *Θ* is a finite set of observations, with *θ*_{i} denoting the observation of state *i*; *B* is the observation model, where ${b}_{\mathrm{j\theta}}^{a}$ denotes the probability that *θ* is observed when the system state is *s*_{j} and the last action taken is *a*; and *R* is the immediate reward, where ${r}_{\mathit{\text{ij}}}^{a}$ denotes the immediate reward received for performing action *a* when the system moves from state *s*_{i} to state *s*_{j} with observation *θ*.
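As an illustration, the six-tuple can be collected in a simple container. This is only a sketch: the toy state, action, and observation sets and all numerical values below are placeholders, not taken from the paper.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class POMDP:
    """Container for the <S, A, P, Theta, B, R> tuple."""
    states: list          # S: finite state set
    actions: list         # A: finite action set
    P: np.ndarray         # P[a, i, j] = p_ij^a, transition probabilities
    observations: list    # Theta: finite observation set
    B: np.ndarray         # B[a, j, o] = b_jo^a, observation probabilities
    R: np.ndarray         # R[a, i, j] = r_ij^a, immediate rewards


# A toy instance with 2 states, 2 actions, 2 observations.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.4, 0.6]]])
B = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.8, 0.2], [0.3, 0.7]]])
R = np.zeros((2, 2, 2))
model = POMDP([0, 1], [0, 1], P, ["good", "bad"], B, R)

# Sanity check: rows of P and B are probability distributions.
assert np.allclose(model.P.sum(axis=2), 1.0)
assert np.allclose(model.B.sum(axis=2), 1.0)
```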

In our POMDP model, the vehicle node has to make a decision whenever a slot has elapsed. These time instants are called *decision epochs*. The optimal policy can be obtained from value iteration algorithms in this formulation. Using the POMDP-derived policy, a channel state is observed according to the information from the last slot. Based on this observation, the system jointly selects the number of messages/data blocks and the relay to maximize the system throughput.

In order to obtain the optimal solution, it is necessary to identify the states, actions, state transition probabilities, observation model, and reward functions in our POMDP model, which are described in the following sections.

### 4.1 Actions, states, and observations

In VANETs with cooperative communications, the vehicle nodes need to decide the number of messages/data blocks in the MT and which relay to use at every decision epoch. Therefore, the current composite action *a*(*t*)∈*A* is denoted as

$a\left(t\right)=\left\{{a}_{n}\left(t\right),{a}_{R}\left(t\right)\right\},$

(24)

where *a*_{n}(*t*) is the action that decides the number of messages/data blocks in the MT, with *a*_{n}(*t*)>0, and *a*_{R}(*t*) is the relay selection action, with *a*_{R}(*t*)∈{1,2,…,*K*}, where *K* is the number of relays.
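For a finite cap on *a*_{n}(*t*), the composite action space can be enumerated directly. In this sketch, `N_MAX` is an illustrative truncation introduced here, since the formulation only requires *a*_{n}(*t*)>0:

```python
from itertools import product

K = 3      # number of relays (illustrative value)
N_MAX = 4  # truncation of the number-of-messages action (assumption)

# Composite actions a(t) = (a_n(t), a_R(t)),
# with a_n in {1..N_MAX} and a_R in {1..K}.
A = [(a_n, a_R) for a_n, a_R in product(range(1, N_MAX + 1),
                                        range(1, K + 1))]
assert len(A) == N_MAX * K  # 12 composite actions in this toy setting
```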

The current composite state *s*(*t*)∈*S* is given as

$s\left(t\right)=\left\{{h}_{S{R}_{k}}\left(t\right),{h}_{{R}_{k}D}\left(t\right),{h}_{\mathit{\text{SD}}}\left(t\right)\right\},k\in \{1,2,\dots ,K\},$

(25)

where ${h}_{S{R}_{k}}$ is the channel gain between the source and relay *R*_{k}, ${h}_{{R}_{k}D}$ is the channel gain between relay *R*_{k} and the destination, and ${h}_{\mathit{\text{SD}}}$ is the channel gain between the source and the destination.

The composite observation *θ*(*t*)∈*Θ* is defined as

$\theta \left(t\right)=\left\{\hat{{h}_{S{R}_{k}}}\left(t\right),\hat{{h}_{{R}_{k}D}}\left(t\right),\hat{{h}_{\mathit{\text{SD}}}}\left(t\right)\right\},$

(26)

where $\hat{{h}_{S{R}_{k}}}\left(t\right)$, $\hat{{h}_{{R}_{k}D}}\left(t\right)$, and $\hat{{h}_{\mathit{\text{SD}}}}\left(t\right)$ are the observations of ${h}_{S{R}_{k}}$, ${h}_{{R}_{k}D}$, and ${h}_{\mathit{\text{SD}}}$, respectively; they share the same space as the state space.

### 4.2 State transition model and observation model

Given the current state

$s\left(t\right)=\left\{{h}_{S{R}_{k}}\left(t\right),{h}_{{R}_{k}D}\left(t\right),{h}_{\mathit{\text{SD}}}\left(t\right)\right\},k\in \{1,2,\dots ,K\},$

(27)

the current observation $\theta \left(t\right)=\left\{\hat{{h}_{S{R}_{k}}}\left(t\right),\hat{{h}_{{R}_{k}D}}\left(t\right),\hat{{h}_{\mathit{\text{SD}}}}\left(t\right)\right\}$, and the chosen action *a*(*t*), the probability function of the next state $s\left(t+1\right)=\left\{{h}_{S{R}_{k}}\left(t+1\right),{h}_{{R}_{k}D}\left(t+1\right),{h}_{\mathit{\text{SD}}}\left(t+1\right)\right\},k\in \{1,2,\dots ,K\}$ is given by

$\begin{array}{l}P\left(s\left(t+1\right)|s\left(t\right),\left(\theta \left(t\right),a\left(t\right)\right)\right)\\ \phantom{\rule{1em}{0ex}}=\prod _{k=1}^{K}\varphi \left({h}_{S{R}_{k}}\left(t\right),{h}_{S{R}_{k}}\left(t+1\right)\right)\psi \left({h}_{{R}_{k}D}\left(t\right),{h}_{{R}_{k}D}\left(t+1\right)\right)\\ \phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\times \xi \left({h}_{\mathit{\text{SD}}}\left(t\right),{h}_{\mathit{\text{SD}}}\left(t+1\right)\right),\end{array}$

(28)

where $\varphi \left({h}_{S{R}_{k}}\left(t\right),{h}_{S{R}_{k}}\left(t+1\right)\right)$, $\psi \left({h}_{{R}_{k}D}\left(t\right),{h}_{{R}_{k}D}\left(t+1\right)\right)$, and $\xi \left({h}_{\mathit{\text{SD}}}\left(t\right),{h}_{\mathit{\text{SD}}}\left(t+1\right)\right)$ are the channel state transition probabilities for the different channels, as described in Section 2.2.
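Because the *K* source-relay channels, the *K* relay-destination channels, and the source-destination channel evolve independently, the composite transition probability in (28) is simply a product of per-channel FSMC transition entries. A sketch, in which the per-channel transition matrices are randomly generated placeholders rather than matrices from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)


def random_stochastic(n):
    """An illustrative row-stochastic FSMC transition matrix."""
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)


L, K = 4, 2  # number of channel levels, number of relays (illustrative)
phi = [random_stochastic(L) for _ in range(K)]  # S -> R_k channels
psi = [random_stochastic(L) for _ in range(K)]  # R_k -> D channels
xi = random_stochastic(L)                       # S -> D channel


def joint_transition_prob(s_now, s_next):
    """P(s(t+1) | s(t)) as the product in (28).

    A composite state is (sr_levels, rd_levels, sd_level), with one
    level index per channel."""
    sr, rd, sd = s_now
    sr2, rd2, sd2 = s_next
    p = xi[sd, sd2]
    for k in range(K):
        p *= phi[k][sr[k], sr2[k]] * psi[k][rd[k], rd2[k]]
    return p


s0 = ((0, 1), (2, 3), 1)
s1 = ((1, 1), (0, 2), 0)
assert 0.0 <= joint_transition_prob(s0, s1) <= 1.0
```

Because each factor is row-stochastic, summing the product over all composite next states gives 1, which is a useful sanity check on an implementation.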

Given the channel estimation errors, the vehicle nodes cannot have full knowledge of the channel information. Following the work in [33], we assume that the channel estimation error has a Gaussian distribution with zero mean and variance *δ*^{2}. At a particular time epoch, the observed channel gain is

$\hat{\varrho}={\varrho}_{m}+\omega ,$

(29)

where *ϱ*_{m} is the actual channel gain and *ω* is a Gaussian random variable with zero mean and variance *δ*^{2}. The receiver then quantizes the channel gain to the nearest possible level. The probability that $\hat{\varrho}$ is closest to *ϱ*_{n} is given by

${B}_{\mathit{\text{ch}}}\left(m,n\right)=\left\{\begin{array}{ll}\frac{1}{2}\left[\mathit{\text{erf}}\left(\frac{{\varrho}_{n}+{\varrho}_{n+1}-2{\varrho}_{m}}{2\sqrt{2}\delta}\right)-\mathit{\text{erf}}\left(\frac{{\varrho}_{n-1}+{\varrho}_{n}-2{\varrho}_{m}}{2\sqrt{2}\delta}\right)\right],&\text{if}\phantom{\rule{0.3em}{0ex}}n\ne 1,L-1,L,\\ \frac{1}{2}\left[1+\mathit{\text{erf}}\left(\frac{{\varrho}_{1}+{\varrho}_{2}-2{\varrho}_{m}}{2\sqrt{2}\delta}\right)\right],&\text{if}\phantom{\rule{0.3em}{0ex}}n=1,\\ \frac{1}{2}\left[1-\mathit{\text{erf}}\left(\frac{{\varrho}_{L-2}+{\varrho}_{L-1}-2{\varrho}_{m}}{2\sqrt{2}\delta}\right)\right],&\text{if}\phantom{\rule{0.3em}{0ex}}n=L-1,\\ 0,&\text{if}\phantom{\rule{0.3em}{0ex}}n=L.\end{array}\right.$

(30)
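The quantization probabilities in (30) reduce to differences of Gaussian CDF values evaluated at the midpoints between adjacent levels. The sketch below implements the interior case of (30) together with the common boundary convention in which the first and last levels absorb the tails; the level grid and error variance are illustrative assumptions, and the paper's handling of the topmost levels differs slightly:

```python
from math import erf, sqrt

levels = [0.5, 1.0, 1.5, 2.0]  # illustrative quantization levels rho_1..rho_L
delta = 0.2                    # std of the Gaussian estimation error (assumed)


def b_ch(m, n):
    """P(observed gain quantizes to level n | true gain = levels[m]).

    Interior-case formula of (30); the boundary levels absorb the
    left and right Gaussian tails (a common convention, assumed here)."""
    L = len(levels)
    rho_m = levels[m]

    def cdf_mid(a, b):
        # P(observation falls below the midpoint of levels a and b):
        # 1/2 [1 + erf((rho_a + rho_b - 2 rho_m) / (2 sqrt(2) delta))]
        return 0.5 * (1 + erf((levels[a] + levels[b] - 2 * rho_m)
                              / (2 * sqrt(2) * delta)))

    if n == 0:            # lowest level: left tail
        return cdf_mid(0, 1)
    if n == L - 1:        # highest level: right tail
        return 1 - cdf_mid(L - 2, L - 1)
    return cdf_mid(n, n + 1) - cdf_mid(n - 1, n)


# Each row of B_ch is a probability distribution (the CDF terms telescope).
row = [b_ch(1, n) for n in range(len(levels))]
assert abs(sum(row) - 1.0) < 1e-12
```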

In our observation model, the channel observation is independent of the composite action *a*(*t*), so the observation matrix under action *a*(*t*) is

$\begin{array}{ll}{B}^{a\left(t\right)}=& {B}_{S{R}_{1}}\otimes {B}_{S{R}_{2}}\dots \otimes {B}_{S{R}_{K}}\otimes {B}_{{R}_{1}D}\otimes {B}_{{R}_{2}D}\dots \\ \otimes {B}_{{R}_{K}D}\otimes {B}_{\mathit{\text{SD}}},\end{array}$

(31)

where ${B}_{S{R}_{k}}$, ${B}_{{R}_{k}D}$, and ${B}_{\mathit{\text{SD}}}$ are the channel observation probability matrices for the S2R, R2D, and S2D channels, respectively, and ⊗ denotes the Kronecker product, which is used here to expand the per-channel matrices into the composite observation matrix. Note that the channel observations are mutually independent, which is what allows the Kronecker product to be used.

### 4.3 Information state

Information state is an important concept in POMDPs. We refer to a probability distribution over states as the information state and the set of all possible such distributions as the information space. Let ${\Pi}^{t}=\left\{{\Pi}_{1}^{t},{\Pi}_{2}^{t},\dots ,{\Pi}_{|S|}^{t}\right\}$ denote the information state, where ${\Pi}_{i}^{t}$ represents the probability that the current state is *i* at time *t*. As will be shown later, knowledge of the system dynamics, i.e., the transition probabilities, is required to maintain the information state.

One important property of the information state is that it can be easily updated with Bayes' rule by incorporating one additional observation into the history,

${\Pi}_{j}^{t+1}=\frac{\sum _{i}{\Pi}_{i}^{t}{p}_{\mathit{\text{ij}}}^{a}{b}_{\mathrm{j\theta}}^{a}}{\sum _{i,j}{\Pi}_{i}^{t}{p}_{\mathit{\text{ij}}}^{a}{b}_{\mathrm{j\theta}}^{a}},$

(32)

where ${p}_{\mathit{\text{ij}}}^{a}$ is the probability that the system state changes from *i* to *j* when action *a* is adopted, and ${b}_{\mathrm{j\theta}}^{a}$ is the probability of observing *θ* when the system state is *j* and action *a* is adopted. Both ${p}_{\mathit{\text{ij}}}^{a}$ and ${b}_{\mathrm{j\theta}}^{a}$ are described in Section 4.2.

The new information state is a vector of probabilities computed according to the above formula. The information state captures all the history information available at time *t*; by constantly updating it, we retain the effect of all past actions and observations, so it is reasonable to make decisions based on the information state alone.
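The update in (32) can be written as one line of matrix algebra. In this sketch, the transition and observation matrices are toy placeholders:

```python
import numpy as np


def belief_update(pi, P_a, B_a, obs):
    """Bayes update of the information state, as in (32).

    pi  : current belief over states, shape (|S|,)
    P_a : transition matrix p_ij^a for the taken action, shape (|S|, |S|)
    B_a : observation matrix b_j,theta^a, shape (|S|, |Theta|)
    obs : index of the realized observation theta
    """
    # Numerator of (32): sum_i pi_i p_ij^a b_j,theta^a for each j.
    unnormalized = (pi @ P_a) * B_a[:, obs]
    # Denominator of (32): normalization over all j.
    return unnormalized / unnormalized.sum()


P_a = np.array([[0.9, 0.1], [0.3, 0.7]])   # toy p_ij^a
B_a = np.array([[0.8, 0.2], [0.4, 0.6]])   # toy b_j,theta^a
pi = np.array([0.5, 0.5])

pi_next = belief_update(pi, P_a, B_a, obs=0)
assert abs(pi_next.sum() - 1.0) < 1e-12    # still a distribution
```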

### 4.4 Reward function and objective

Our optimization objective is to maximize the network throughput in VANETs. Therefore, a natural definition of the reward is the throughput obtained at each decision epoch. Given the current state $s\left(t\right)=\left\{{h}_{S{R}_{k}}\left(t\right),{h}_{{R}_{k}D}\left(t\right),{h}_{\mathit{\text{SD}}}\left(t\right)\right\}$ and action $a\left(t\right)=\left\{{a}_{n}\left(t\right),{a}_{R}\left(t\right)\right\}$, the immediate reward can be defined as

$R\left(s\left(t\right),a\left(t\right)\right)=\mathit{\text{Th}}{r}_{\mathit{\text{SR}}}\left({h}_{S{R}_{k}}\left(t\right),{h}_{{R}_{k}D}\left(t\right),{h}_{\mathit{\text{SD}}}\left(t\right),{a}_{n}\left(t\right),{a}_{R}\left(t\right)\right),$

(33)

where $\mathit{\text{Th}}{r}_{\mathit{\text{SR}}}$ is the throughput of the authentication process with SR-ARQ, derived in Section 3.3.

Although we use the effective secure throughput as the optimization objective in our formulation, other QoS parameters can be used in the reward function as well. For example, given the communication delay $D{e}_{\mathit{\text{SR}}}\left({h}_{S{R}_{k}}\left(t\right),{h}_{{R}_{k}D}\left(t\right),{h}_{\mathit{\text{SD}}}\left(t\right),{a}_{n}\left(t\right),{a}_{R}\left(t\right)\right)$ between the source and destination nodes, the reward function can be rewritten as

$\begin{array}{l}R\left(s\left(t\right),a\left(t\right)\right)\\ \phantom{\rule{1em}{0ex}}=\beta \ast \mathit{\text{Th}}{r}_{\mathit{\text{SR}}}\left({h}_{S{R}_{k}}\left(t\right),{h}_{{R}_{k}D}\left(t\right),{h}_{\mathit{\text{SD}}}\left(t\right),{a}_{n}\left(t\right),{a}_{R}\left(t\right)\right)\\ \phantom{\rule{2em}{0ex}}+(1-\beta )\ast D{e}_{\mathit{\text{SR}}}\left({h}_{S{R}_{k}}\left(t\right),{h}_{{R}_{k}D}\left(t\right),{h}_{\mathit{\text{SD}}}\left(t\right),{a}_{n}\left(t\right),{a}_{R}\left(t\right)\right),\end{array}$

(34)

where *β* and (1−*β*) are weight factors indicating the relative importance of throughput and communication delay. In (34), we combine throughput and delay into a single function. This is a common approach in the optimization literature, called the Aggregate Objective Function, for solving optimization problems with multiple objectives [34, 35]. In practice, different VANETs have different throughput and packet delay requirements; by adjusting the parameters in (34), the proposed scheme is generic enough to accommodate the requirements of practical VANETs.
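The aggregate objective in (34) is a convex combination of two QoS terms. A minimal sketch; the numbers are illustrative, and the delay term here is assumed to be scaled so that larger is better (e.g., a negative or normalized inverse delay), since (34) maximizes the sum:

```python
def aggregate_reward(throughput, delay_utility, beta):
    """Aggregate Objective Function in the style of (34).

    beta in [0, 1] weights throughput against the delay term.
    delay_utility is assumed pre-scaled so that larger is better."""
    assert 0.0 <= beta <= 1.0
    return beta * throughput + (1 - beta) * delay_utility


# beta = 1 recovers the pure-throughput reward of (33).
assert aggregate_reward(3.2, -0.5, 1.0) == 3.2
```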

The expected total reward of the POMDP captures the overall reward over *Z* time epochs and can be expressed as

${V}_{\mu}={E}_{{\mu}_{n},{\mu}_{R}}\left[\sum _{t={t}_{0}}^{{t}_{0}+Z}R\left(s\left(t\right),a\left(t\right)\right)\right],$

(35)

where *μ*_{n} specifies the number of messages/data blocks selection policy, *μ*_{R} is the relay selection policy, ${E}_{{\mu}_{n},{\mu}_{R}}$ is the expectation when the policies *μ*_{n} and *μ*_{R} are employed, and *t*_{0} is the initial time.

We aim to develop a joint design of an optimal policy for throughput improvement in VANETs. $\{{\mu}_{n}^{\ast},{\mu}_{R}^{\ast}\}$ should be the joint policy that maximizes the expected total reward over *Z* decision epochs, that is,

$\{{\mu}_{n}^{\ast},{\mu}_{R}^{\ast}\}=arg\phantom{\rule{1em}{0ex}}\underset{{\mu}_{n},{\mu}_{R}}{max}{E}_{{\mu}_{n},{\mu}_{R}}\left[\sum _{t={t}_{0}}^{{t}_{0}+Z}R\left(s\left(t\right),a\left(t\right)\right)\right].$

(36)

### 4.5 Separation principle for optimal policy

In this section, we solve the POMDP model to obtain the optimal policy for the number of messages/data blocks selection and relay selection. Specifically, we establish a separation principle that simplifies the calculation process.

In POMDP models, the underlying states cannot be observed directly; decisions are instead made based on the continuous information state, i.e., the likelihood of being in each state. Our task is to compute a policy that, based on the information state, obtains the maximum expected reward. The POMDP policy can be derived from a value function defined over the entire information space. Let ${V}_{t}\left({\Pi}^{t}\right)$ be the value function representing the maximum expected total reward that can be obtained starting from epoch *t*, given information state *Π*^{t} at the beginning of epoch *t*. The value function of the POMDP consists of the immediate reward and the maximum expected future reward, given as

${V}_{t}{\left({\Pi}^{t}\right)}^{\ast}=\underset{a\in A}{max}\phantom{\rule{0.3em}{0ex}}\left[\sum _{i\in S}{\Pi}_{i}^{t}\phantom{\rule{0.3em}{0ex}}\sum _{j\in S}{p}_{\mathit{\text{ij}}}^{a}\phantom{\rule{0.3em}{0ex}}\sum _{\theta \in \Theta}{b}_{\mathrm{j\theta}}^{a}\left(R(i,a)+{V}_{t+1}^{\ast}\left({\Pi}^{t+1}\right)\right)\right],$

(37)

where *Π*^{t+1} represents the updated knowledge of the system state after incorporating the action *a*(*t*) and the observation *θ*(*t*) at epoch *t*.

Smallwood and Sondik [36] showed that the finite-horizon value function is *piecewise linear* and *convex*, which means that it can be represented by a set of linear segments and written simply as

${V}_{t}{\left({\Pi}^{t}\right)}^{\ast}=\underset{k}{max}\sum _{i}{\Pi}_{i}^{t}{\alpha}_{i}^{k}\left(t\right),$

(38)

for some set of vectors $\left\{{\alpha}^{0}\left(t\right),{\alpha}^{1}\left(t\right),\dots \right\}$, where ${\alpha}_{i}^{k}\left(t\right)$ is the *i*-th component of ${\alpha}^{k}\left(t\right)$. Each *α*-vector holds the coefficients of one linear piece of the piecewise linear value function for a step of the finite-horizon POMDP problem. To determine which action to take, we only need to find the vector that has the highest dot product with the information state.
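Evaluating the piecewise linear value function of (38) amounts to taking, over all *α*-vectors, the largest dot product with the information state. A sketch with illustrative *α*-vectors, each tagged with the action its plan begins with:

```python
import numpy as np

# Illustrative alpha-vectors over a 2-state problem; each row is the
# coefficient vector of one linear piece of the value function.
alphas = np.array([[1.0, 0.0],    # alpha^0
                   [0.0, 1.0],    # alpha^1
                   [0.6, 0.6]])   # alpha^2
actions = [0, 1, 2]               # action attached to each alpha-vector


def value_and_action(pi):
    """max_k pi . alpha^k, as in (38), plus the maximizing action."""
    scores = alphas @ pi          # dot product with each alpha-vector
    k = int(np.argmax(scores))
    return scores[k], actions[k]


v, a = value_and_action(np.array([0.5, 0.5]))
assert abs(v - 0.6) < 1e-12 and a == 2   # the flat alpha^2 wins here
```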

One of the main problems in our POMDP model is the action space. As shown in Section 4.1, the number of messages/data blocks selection action space is {*a*_{n}(*t*):*a*_{n}(*t*)>0}. This infinite action space makes the model hard to solve with traditional value iteration algorithms. To this end, we establish a separation principle that leads to a closed-form optimal design of the number of messages/data blocks selection and relay selection strategy. The policy calculation is carried out in two steps without losing optimality.

Step 1: Calculate the optimal number of messages/data blocks policy *μ*_{n} in the MT to maximize the instantaneous throughput given the current relay. Specifically, the optimal number of messages *n*^{∗} in the MT for relay *R*_{k} is determined as follows:

${n}^{\ast}=arg\underset{n}{max}\mathit{\text{Th}}{r}_{\mathit{\text{SR}}}({R}_{k},n).$

(39)

Step 2: Using the optimal number of messages/data blocks policy *μ*_{n} given by (39), we calculate the relay selection policy that maximizes the expected total throughput with the piecewise linear value functions described above. Specifically, the optimal relay selection policy is given by

${\mu}_{R}^{\ast}=arg\underset{{\mu}_{R}}{max}{E}_{{\mu}_{R}}\left[\sum _{t=1}^{Z}R\left(t\right)\left|\Pi \left(1\right)\right.\right].$

(40)
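The two steps above can be sketched as a per-relay grid search for *n*^{∗} followed by the relay-level choice. The throughput function below is an illustrative stand-in for the Thr_SR expression derived in Section 3.3, and the truncation `N_MAX` is an assumption; the second step is shown in its myopic form, whereas the full policy of (40) would use the *α*-vector machinery over *Z* epochs:

```python
N_MAX, K = 16, 3  # truncation of n and number of relays (assumptions)


def thr_sr(relay, n):
    """Illustrative stand-in for Thr_SR(R_k, n): throughput rises with
    n, then falls as the per-block error probability compounds."""
    p_err = 0.1 * (relay + 1)          # toy per-relay error probability
    return n * (1 - p_err) ** n


# Step 1, eq. (39): per-relay optimal number of messages/data blocks.
n_star = {k: max(range(1, N_MAX + 1), key=lambda n: thr_sr(k, n))
          for k in range(K)}


# Step 2 (myopic sketch): weight each relay's best-n throughput by a
# belief-derived weight and pick the maximizer.
def pick_relay(belief_weight):
    return max(range(K),
               key=lambda k: belief_weight[k] * thr_sr(k, n_star[k]))


assert all(1 <= n_star[k] <= N_MAX for k in range(K))
assert pick_relay([1.0, 1.0, 1.0]) == 0  # best channel wins under flat weights
```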