Changkun's Blog

Science and art, life in between.

Changkun Ou

Human-AI interaction researcher, engineer, and writer.

LLM-Guided Bayesian Optimization for Faster Convergence

Published at: 2026-02-16

I’ve been thinking about using LLMs as optimizers. Some initial work appeared in 2023, and I’m curious about how to make the idea really work. Today, we see people using LLMs to write for LLMs themselves, creating a self-optimization loop.

However, I’m more interested in using LLMs as a heuristic optimizer to substitute for human insights and accelerate Bayesian optimization. See also: https://arxiv.org/pdf/2309.03409

The following content is generated by LLMs and may contain inaccuracies.

Context

This sits at the intersection of automated machine learning (AutoML), Bayesian optimization (BO), and large language model (LLM) reasoning. Traditional BO excels at sample-efficient optimization but struggles with cold-start problems and lacks the domain priors that human experts bring. The 2023 work referenced above proposes using LLMs to encode structured domain knowledge, replacing or augmenting the acquisition function that guides where to sample next. This matters now because LLMs have proven effective at extracting patterns from text-based technical knowledge that would otherwise require costly human-in-the-loop guidance.
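To make the cold-start point concrete, here is a minimal sketch of seeding a BO-style loop with LLM-proposed candidates. It is not from the referenced paper: `llm_suggest` is a hypothetical stand-in for a real model call, the objective is a toy function, and random search stands in for a proper surrogate-plus-acquisition loop.

```python
import math
import random

def llm_suggest(history):
    """Hypothetical stand-in for an LLM call: a real implementation would
    prompt a model with the search-space description and past trials, then
    parse suggested hyperparameters from its response."""
    return {"lr": 1e-3, "batch_size": 64}

def objective(params):
    # Toy objective peaking at lr=1e-3, batch_size=32 (higher is better).
    return -(math.log10(params["lr"]) + 3) ** 2 \
           - ((params["batch_size"] - 32) / 64) ** 2

def optimize(n_warm=3, n_rest=10, seed=0):
    rng = random.Random(seed)
    history = []
    # Warm-start phase: LLM-proposed candidates inject domain priors where
    # BO would otherwise sample blindly.
    for _ in range(n_warm):
        p = llm_suggest(history)
        history.append((p, objective(p)))
    # Stand-in for the BO acquisition loop (random search here for brevity;
    # a real version would fit a surrogate and maximize an acquisition fn).
    for _ in range(n_rest):
        p = {"lr": 10 ** rng.uniform(-5, -1),
             "batch_size": rng.choice([16, 32, 64, 128])}
        history.append((p, objective(p)))
    return max(history, key=lambda t: t[1])

best_params, best_score = optimize()
print(best_params, best_score)
```

Even with this crude setup, the warm-start trials guarantee the returned score is at least as good as the LLM's prior guess, which is the whole appeal for cold starts.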

Key Insights

  • LLMs as surrogate priors: The core innovation is using LLMs to propose promising regions of the search space by leveraging scientific literature, API documentation, or past optimization logs. LLAMBO demonstrates that GPT-4 can recommend hyperparameters competitive with Gaussian process surrogates, especially when search spaces are semantically structured (e.g., learning rates, architectural choices). However, LLMs hallucinate numerical relationships: they excel at categorical and ordinal decisions but need guardrails when suggesting continuous values.

  • Self-referential optimization loops: Your observation about “LLMs writing for LLMs” connects to recent prompt optimization frameworks like DSPy and TextGrad, which optimize prompts through iterated LLM calls. Extending this to BO means the LLM doesn’t just suggest parameters; it iteratively refines its own suggestion strategy based on observed outcomes. The risk: LLMs lack calibrated uncertainty estimates, so integrating them into BO’s exploration-exploitation tradeoff remains fragile without explicit uncertainty quantification (e.g., ensembles or conformal prediction wrappers).

  • Where heuristics beat surrogates: LLMs shine in high-dimensional discrete spaces where Gaussian processes fail (e.g., code generation hyperparameters, graph neural network architectures). But standard BO already converges fast in low-dimensional continuous problems, so LLM overhead may not justify the gains there. The sweet spot is mixed discrete-continuous spaces with interpretable parameters.
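The guardrail point in the first bullet can be sketched as a thin validation layer that sits between the LLM and the trial queue. The bounds and categorical choices below are illustrative assumptions, not from any particular system:

```python
# Illustrative search-space constraints (assumed for this sketch).
BOUNDS = {"lr": (1e-6, 1.0), "dropout": (0.0, 0.9)}
CHOICES = {"optimizer": {"adam", "sgd", "rmsprop"}}

def sanitize(suggestion):
    """Validate an LLM-suggested configuration before it enters the BO
    trial queue."""
    clean = {}
    for name, (lo, hi) in BOUNDS.items():
        # Clamp hallucinated continuous values into the feasible range.
        clean[name] = min(max(float(suggestion.get(name, lo)), lo), hi)
    for name, allowed in CHOICES.items():
        value = suggestion.get(name)
        # Reject out-of-vocabulary categorical picks and fall back.
        clean[name] = value if value in allowed else "adam"
    return clean

print(sanitize({"lr": 50.0, "dropout": -0.2, "optimizer": "adamw"}))
# → {'lr': 1.0, 'dropout': 0.0, 'optimizer': 'adam'}
```

This matches the asymmetry noted above: categorical suggestions can be checked against a vocabulary exactly, while continuous ones can only be clamped, not verified.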

Open Questions

  1. Can LLMs learn to query themselves strategically—deciding when to inject domain knowledge versus deferring to BO’s probabilistic model—without degrading sample efficiency?
  2. How do we prevent LLMs from amplifying biases in optimization literature (e.g., over-indexing on popular heuristics like Adam over niche but effective alternatives)?
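One way to read the first question is as a gating problem: only inject the LLM's prior when it is self-consistent, and otherwise defer to BO. A toy sketch of such a rule, where `llm_suggest_lr` is a hypothetical stand-in for repeatedly sampling the model:

```python
import statistics

def llm_suggest_lr(sample_idx):
    """Hypothetical stand-in: pretend repeated LLM calls return slightly
    different learning-rate suggestions."""
    return [3e-4, 2.8e-4, 3.2e-4][sample_idx % 3]

def choose_source(n_samples=3, rel_spread_threshold=0.5):
    samples = [llm_suggest_lr(i) for i in range(n_samples)]
    # Relative spread across samples as a crude self-consistency signal.
    spread = (max(samples) - min(samples)) / statistics.median(samples)
    # Agreeing samples → inject the LLM prior; disagreement → defer to
    # BO's calibrated probabilistic model.
    return "llm" if spread < rel_spread_threshold else "bo"

print(choose_source())  # → llm
```

Self-consistency is only a proxy for calibration, of course; an LLM can be confidently and consistently wrong, which is exactly why the question remains open.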

© 2008 - 2026 Changkun Ou. All rights reserved.