BOFormer: Learning to Solve Multi-Objective Bayesian Optimization via Non-Markovian RL

1Department of Computer Science, National Yang Ming Chiao Tung University, Taiwan
2NVIDIA Research

*Indicates Equal Contribution

Abstract

Bayesian optimization (BO) offers an efficient pipeline for optimizing black-box functions with the help of a Gaussian process prior and an acquisition function (AF). Recently, in the context of single-objective BO, learning-based AFs have shown promising empirical results owing to their favorable non-myopic nature. Despite this, the direct extension of these approaches to multi-objective Bayesian optimization (MOBO) suffers from the hypervolume identifiability issue, which results from the non-Markovian nature of MOBO problems. To tackle this, inspired by the non-Markovian RL literature and the success of Transformers in language modeling, we present a generalized deep Q-learning framework and propose BOFormer, which substantiates this framework for MOBO via sequence modeling. Through extensive evaluation, we demonstrate that BOFormer consistently achieves better performance than the benchmark rule-based and learning-based algorithms in various synthetic MOBO and real-world multi-objective hyperparameter optimization problems.

Hypervolume Identifiability Issue

In single-objective Bayesian optimization, an RL-based AF (e.g., FSAF (Hsieh et al., 2021)) takes the posterior mean, the posterior standard deviation, and the best function value observed so far as input and then outputs the AF value. A direct extension to MOBO simply takes into account the same set of information for each of the $K$ objective functions. The hypervolume identifiability issue can be illustrated by comparing the hypervolume improvement incurred by the sample $x_3$ in the two scenarios below. Even though the AF inputs at $x_3$ are identical in both scenarios, the increases in hypervolume upon sampling $x_3$ are rather different. Hence, the increase in hypervolume is not identifiable solely from the AF input used by existing RL-based AFs.

[Figure: hypervolume (HV) improvement from sampling $x_3$ in two scenarios with identical AF inputs]
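To make the issue concrete, the following minimal NumPy sketch evaluates the hypervolume improvement from the same hypothetical observation at $x_3$ against two different existing Pareto fronts. The per-objective best values (and, by assumption, the posterior summaries) are identical in both cases, yet the hypervolume gains differ. The specific numbers and the 2D sweep routine are illustrative and not taken from the paper.

```python
# Minimal illustration of the identifiability issue (hypothetical numbers,
# 2 maximized objectives, reference point (0, 0)).
import numpy as np

def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Hypervolume dominated by `points` above `ref` for two maximized objectives."""
    pts = np.asarray([p for p in points if p[0] > ref[0] and p[1] > ref[1]])
    if pts.size == 0:
        return 0.0
    # Sort by the first objective (descending) and sweep, accumulating rectangles.
    pts = pts[np.argsort(-pts[:, 0])]
    hv, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 > best_f2:
            hv += (f1 - ref[0]) * (f2 - best_f2)
            best_f2 = f2
    return hv

# Same candidate observation y(x3) = (0.6, 0.6) in both scenarios; the best
# observed value per objective is (0.9, 0.9) in both, so the AF inputs match.
y_x3 = (0.6, 0.6)
scenario_a = [(0.9, 0.1), (0.1, 0.9)]                           # sparse front
scenario_b = [(0.9, 0.1), (0.7, 0.5), (0.5, 0.7), (0.1, 0.9)]   # dense front

for name, front in [("A", scenario_a), ("B", scenario_b)]:
    gain = hypervolume_2d(front + [y_x3]) - hypervolume_2d(front)
    print(f"scenario {name}: HV improvement from x3 = {gain:.3f}")
```

Running this prints a large improvement for the sparse front and a near-zero improvement for the dense one, even though the AF input would be the same in both cases.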

Learning AF via Non-Markovian Q-Network

[Figure: BOFormer architecture with the policy network (upper) and the target network (lower)]

To tackle the hypervolume identifiability issue, we rethink MOBO from the perspective of non-Markovian RL via sequence modeling. We propose BOFormer, which leverages the sequence modeling capability of the Transformer architecture and minimizes a generalized temporal-difference loss. As shown above, BOFormer comprises two distinct networks: the upper network serves as the policy network, which uses the historical data and the Q-values predicted by the target network to estimate the Q-values for action selection; the lower network serves as the target network, responsible for constructing the Q-values for past observation-action pairs.
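Below is a minimal PyTorch sketch of this two-network setup under assumed dimensions: a causal Transformer encoder maps the token sequence (one token per step of the history plus the current observation-action pair) to a scalar Q-value, and the target network is a gradient-free copy that would be periodically synchronized. The token dimension and module sizes are placeholders, not the paper's configuration.

```python
import copy
import torch
import torch.nn as nn

class SequenceQNetwork(nn.Module):
    """Maps a sequence of history tokens to a Q-value for the most recent token."""
    def __init__(self, token_dim, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.q_head = nn.Linear(d_model, 1)

    def forward(self, tokens):                       # tokens: (batch, seq_len, token_dim)
        seq_len = tokens.size(1)
        # Causal mask so each position only attends to earlier steps of the history.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=causal)
        return self.q_head(h[:, -1]).squeeze(-1)     # one Q-value per sequence

# Policy network (trained) and target network (frozen, periodically synchronized).
policy_net = SequenceQNetwork(token_dim=8)
target_net = copy.deepcopy(policy_net).requires_grad_(False)

q_values = policy_net(torch.randn(4, 10, 8))         # e.g. 4 histories of length 10
print(q_values.shape)                                 # torch.Size([4])
```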

$Q$-Augmented Representation

Define \begin{align*} y^{(i)*}_{t}:=\max\limits_{1\leq j\leq t}y^{(i)}_{j}, \quad \forall i \in [K] \end{align*} as the best observed function value of the $i$-th objective up to time $t$. The observation-action pair for $x$ is denoted by \begin{equation*} o_t(x) \equiv \Big\{\mu_t^{(i)}(x), \sigma_t^{(i)}(x), y_t^{(i)*}, \frac{t}{T}\Big\}_{i\in [K]}, \end{equation*} where $\mu_t^{(i)}(x):=\mathbb{E}[f_i(x)\rvert \mathcal{F}_t]$ and $\sigma_t^{(i)}(x):=\sqrt{\mathbb{V}[f_i(x)\rvert \mathcal{F}_t]}$ are the mean and standard deviation of the posterior predictive distribution under a Gaussian process (GP) prior.
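A small NumPy sketch of how such an observation vector could be assembled from per-objective posterior summaries is shown below; the shapes and the flattening order are assumptions made for illustration.

```python
import numpy as np

def observation(mu, sigma, y_best, t, T):
    """Assemble o_t(x) = {mu_t^(i)(x), sigma_t^(i)(x), y_t^(i)*, t/T} for i = 1..K."""
    K = len(mu)
    per_objective = np.stack([np.asarray(mu), np.asarray(sigma),
                              np.asarray(y_best), np.full(K, t / T)], axis=1)
    return per_objective.reshape(-1)      # flat vector of length 4 * K

# Example with K = 2 objectives at step t = 5 of a budget of T = 50 evaluations.
o = observation(mu=[0.3, -0.1], sigma=[0.8, 0.5], y_best=[0.9, 0.7], t=5, T=50)
print(o)   # [mu1, sigma1, y1*, t/T, mu2, sigma2, y2*, t/T]
```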

BOFormer as an Acquisition Function for MOBO

In BOFormer, we use the normalized hypervolume improvement as the reward, $\textit{i.e.}$, \begin{equation*} r_t:=\frac{\text{HV}(\mathcal{X}_t)-\text{HV}(\mathcal{X}_{t-1})}{\text{HV}(\mathcal{X}^*)-\text{HV}(\mathcal{X}_{t})}. \end{equation*} Then, $h_t$, the history up to time $t$, is the concatenation of the past observation-action pair representations, defined as \begin{align*} h_t = \left\{\mu_j^{(i)}(x_j), \sigma_j^{(i)}(x_j), y_{j-1}^{(i)*}, \frac{j}{T}, r_j, Q_{\bar{\theta}}\right\}_{i\in[K], j\in[t-1]}. \end{align*} In non-Markovian RL, $Q_{\bar{\theta}}$ can be defined recursively as \begin{align*} Q_{\bar{\theta}}(h_t,o_t(x_t)):= Q_{\bar{\theta}}\left( \left\{ o_j(x_j), r_j, Q_{\bar{\theta}}(h_{j-1},o_{j-1}(x_{j-1})) \right\}_{j=1}^{t-1}, o_t(x_t)\right). \end{align*} Then, BOFormer selects the next sample point by letting $x_t = \text{argmax}_{x\in\mathcal{X}}Q_{\bar{\theta}}(h_t,o_t(x))$.
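The following sketch mirrors the reward computation and the greedy acquisition step over a finite candidate pool; `q_fn` and `observe_fn` are hypothetical stand-ins for the trained BOFormer network and the GP-posterior featurizer, not the paper's actual interfaces.

```python
import numpy as np

def normalized_hvi_reward(hv_prev, hv_curr, hv_star):
    """r_t = (HV(X_t) - HV(X_{t-1})) / (HV(X*) - HV(X_t)), as defined above."""
    return (hv_curr - hv_prev) / max(hv_star - hv_curr, 1e-12)   # guard against division by zero

def select_next_point(q_fn, history, candidates, observe_fn):
    """Greedy acquisition: x_t = argmax_x Q(h_t, o_t(x)) over the candidate pool."""
    scores = [q_fn(history, observe_fn(x)) for x in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage with dummy stand-ins for the network and the featurizer.
candidates = list(np.linspace(0.0, 1.0, 11))
x_next = select_next_point(lambda h, o: -(o[0] - 0.7) ** 2, history=[],
                           candidates=candidates, observe_fn=lambda x: [x])
print(x_next)   # ~0.7, the candidate closest to the dummy optimum
```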

Off-Policy Learning and Prioritized Trajectory Replay Buffer

We extend the concept of Prioritized Experience Replay (PER) and introduce the Prioritized Trajectory Replay Buffer (PTRB). The modifications are as follows: (i) The elements pushed into the buffer are entire trajectories $\tau = \{o_i(x_i),r_i\}_{i=1}^T$. (ii) The TD error considered in PER is replaced by $\delta(Q_{\theta},\tau)$, the sum of the squared TD errors of the policy network over all transitions in the trajectory, $\textit{i.e.}$, \begin{align*} \delta(Q,\tau) := \sum_{i=1}^{T-1} \Big( Q\left( h_i, o_i(x_i)\right) - \big(r_i + \gamma \max_{x\in\mathcal{X}}Q_{\bar{\theta}}(h_{i+1}, o_{i+1}(x))\big) \Big)^2. \end{align*} Let $\mathcal{B}$ denote a batch sampled from the PTRB. The loss function of BOFormer is defined as $L(\theta):=\sum_{\tau \in \mathcal{B}} \delta(Q_{\theta},\tau)$. As for training data, BOFormer is trained solely on synthetic GP functions, which are relatively cheap to generate compared to real-world functions, and is then deployed to optimize unseen test functions.
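A compact sketch of a PTRB along these lines is given below; the priority exponent `alpha`, the eviction rule, and the sampling details are borrowed from standard PER and are assumptions rather than the exact implementation.

```python
import numpy as np

class PrioritizedTrajectoryReplayBuffer:
    """Stores whole trajectories and samples them in proportion to delta(Q, tau)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.trajectories, self.priorities = [], []

    def push(self, trajectory, td_error):
        if len(self.trajectories) >= self.capacity:       # evict the oldest trajectory
            self.trajectories.pop(0)
            self.priorities.pop(0)
        self.trajectories.append(trajectory)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, rng=None):
        rng = rng or np.random.default_rng()
        probs = np.asarray(self.priorities) / np.sum(self.priorities)
        idx = rng.choice(len(self.trajectories), size=batch_size, p=probs)
        return [self.trajectories[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):                # refresh after a gradient step
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```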

Results

Conclusion

In this paper, we address MOBO problems from the perspective of RL-based AFs by identifying and tackling the inherent hypervolume identifiability issue. We achieve this goal by first presenting a generalized DQN framework and then implementing it through BOFormer, which leverages the sequence modeling capability of Transformers and incorporates multiple enhancements tailored to MOBO. Our experimental results show that BOFormer is a promising approach for general-purpose multi-objective black-box optimization.

BibTeX

@inproceedings{hung2025boformer,
  title={{BOF}ormer: Learning to Solve Multi-Objective Bayesian Optimization via Non-Markovian {RL}},
  author={Yu Heng Hung and Kai-Jie Lin and Yu-Heng Lin and Chien-Yi Wang and Cheng Sun and Ping-Chun Hsieh},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=UnCKU8pZVe}
}