A Survey of Prompt Engineering for Large Language Models

Nate Dong, Ph.D.
17 min read · Sep 2, 2024


Introduction

Large language models (LLMs) such as ChatGPT have demonstrated remarkable performance across a wide range of tasks. By leveraging natural language prompts and task demonstrations as context, they can handle many downstream tasks with only a few instructions or examples, without updating any parameters of the underlying model. Sheer model scale is an important factor in this success, but the notions of prompts and demonstrations also give us new insight into how we can better explore and unlock the power of LLMs.

Prompt engineering is a relatively new discipline concerned with developing, crafting, and optimizing input prompts, either manually or automatically, to efficiently leverage LLMs for a wide variety of applications and research topics. Prompt engineering skills help practitioners better understand the capabilities and limitations of LLMs. Carefully designed prompts guide LLMs and steer their behaviour so that they generate more accurate and relevant outputs for a given topic or domain. Prompt engineering is used in various NLP tasks such as text generation, classification, summarization, sentiment analysis, named entity recognition, and dialogue systems.

This survey attempts to organize and summarize the current state of knowledge and work in this rapidly developing field by providing an overview of various prompt engineering methods.

Related Work

Zero-shot Prompting

Zero-shot prompting is the most basic and straightforward prompt type: a question or task is simply presented to an LLM without any examples or templates of the desired output. Because the model is given no demonstrations and has not been trained on the specific task, zero-shot prompting may not perform well or produce the desired output.

The following is an example of zero-shot prompting for sentiment analysis:

Text: I’ll bet the video game is a lot more fun than the film.

Sentiment: ?
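
To make this concrete, here is a minimal sketch of how such a zero-shot prompt could be assembled and passed to a model. The `llm` callable is a stand-in for whatever completion interface is available; it is an assumption, not a specific API.

```python
def zero_shot_sentiment(llm, text):
    # Zero-shot: only the task instruction and the input, no demonstrations.
    prompt = (
        "Classify the sentiment of the following text as positive or negative.\n"
        f"Text: {text}\n"
        "Sentiment:"
    )
    return llm(prompt)

# Usage (assuming `llm` wraps some text-completion endpoint):
# zero_shot_sentiment(llm, "I'll bet the video game is a lot more fun than the film.")
```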

Few-shot Prompting

Few-shot prompting refers to presenting an LLM with a set of high-quality examples or demonstrations, each consisting of an input and the desired output, for a specific task. Because the model first sees good examples, it can better understand the human intention and the criteria for what kind of answer is wanted. Therefore, few-shot prompting often leads to better performance than zero-shot prompting. However, it comes at the cost of more token consumption and may hit the context length limit when the input and output text are long.

The following is an example of few-shot prompting for sentiment analysis:

Text: all over the stage, dancing, running, sweating, mopping his face and generally displaying the wacky talent that brought him fame in the first place.

Sentiment: positive

Text: despite all evidence to the contrary, this clunker has somehow managed to pose as an actual feature movie, the kind that charges full admission and gets hyped on tv and purports to amuse small children and ostensible adults.

Sentiment: negative

Text: for the first time in years, de niro digs deep emotionally, perhaps because he’s been stirred by the powerful work of his co-stars.

Sentiment: positive

Text: I’ll bet the video game is a lot more fun than the film.

Sentiment: ?
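
Below is a small sketch of how a few-shot prompt like the one above might be assembled programmatically; the helper name and its format are illustrative rather than a prescribed implementation.

```python
def build_few_shot_prompt(demonstrations, query):
    # demonstrations: list of (text, label) pairs shown to the model before the query.
    demo_block = "\n\n".join(
        f"Text: {text}\nSentiment: {label}" for text, label in demonstrations
    )
    return f"{demo_block}\n\nText: {query}\nSentiment:"

prompt = build_few_shot_prompt(
    [
        ("all over the stage, dancing, running, sweating ...", "positive"),
        ("despite all evidence to the contrary, this clunker ...", "negative"),
    ],
    "I'll bet the video game is a lot more fun than the film.",
)
```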

Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting, introduced in [1], is a recently developed prompting approach that enables an LLM to decompose a multi-step problem or task into several intermediate reasoning steps that are solved individually before the final answer is given. The main idea of CoT prompting is that by showing an LLM a few examples in which the reasoning process is spelled out, the LLM will also show its reasoning process when performing the task. Making the reasoning explicit often leads to more accurate results on arithmetic, commonsense, and logical reasoning tasks.

The image below illustrates the process of CoT prompting.

CoT prompting has several attractive properties as a prompting approach for facilitating reasoning chains and capabilities of LLMs.

  • First, CoT prompting allows LLMs to decompose multi-step problems into intermediate steps, which means that additional computation can be allocated to problems that require more reasoning steps.
  • Second, a chain of thought generates intermediate reasoning steps that suggest how LLMs can arrive at a particular output and provide opportunities to correct the wrong reasoning path.
  • Last, CoT reasoning can be used for tasks such as arithmetic problems, common sense reasoning, and symbolic manipulation, and is potentially applicable to any task that humans can solve via natural language.

CoT prompting can be combined with few-shot prompting to improve reasoning abilities and achieve better results for more complex tasks.
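
The sketch below combines the two: a single worked demonstration with explicit reasoning steps (the tennis-ball example from [1]) is prepended to a new question. The helper name and formatting are illustrative.

```python
COT_DEMONSTRATION = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def few_shot_cot_prompt(question):
    # The demonstration spells out intermediate reasoning steps, encouraging the
    # model to produce its own reasoning chain before the final answer.
    return f"{COT_DEMONSTRATION}Q: {question}\nA:"
```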

In practice, manual CoT prompting has obtained strong performance. However, the performance crucially depends on manually generating task-specific demonstrations with reasoning chains, especially for reasoning tasks that require complex and diverse reasoning patterns. This makes it far less scalable and more dependent on the talent of prompt engineers. Manually creating a large and diverse set of demonstrations is costly and tedious, while relying on a limited set of demonstrations may hamper LLMs’ generalization and adaptation abilities. To address the limitations in manual CoT prompting, several approaches are proposed to automatically construct demonstrations with questions and reasoning chains by leveraging LLMs’ innate reasoning abilities.

Automatic Prompt Engineer

Zhou et al. proposed automatic prompt engineer (APE), a framework for automatic instruction generation and selection [2]. The instruction-generation problem is framed as natural language synthesis and addressed as a black-box optimization problem, using LLMs to generate and search over candidate instructions.

In the first step, an LLM (used as an inference model) is given output demonstrations and generates instruction candidates for the task. These candidate solutions guide the search procedure. The instructions are then executed using a target model, and the most appropriate instruction is selected based on computed evaluation scores.
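
A rough sketch of this generate-and-select loop is shown below. The `llm` and `score_fn` callables are assumptions standing in for the inference model and the evaluation procedure, and the meta-prompt wording is only an approximation of the one used in [2].

```python
def ape_select_instruction(llm, demos, score_fn, n_candidates=20):
    # Step 1: use the LLM as an inference model to propose candidate instructions
    # from input-output demonstrations (meta-prompt wording is illustrative).
    demo_text = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    meta_prompt = (
        "I gave a friend an instruction. Based on the instruction they produced "
        f"the following input-output pairs:\n\n{demo_text}\n\nThe instruction was:"
    )
    candidates = [llm(meta_prompt, temperature=1.0) for _ in range(n_candidates)]
    # Step 2: execute each candidate with the target model and keep the one with
    # the highest evaluation score (e.g., accuracy on a held-out set).
    return max(candidates, key=score_fn)
```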

APE discovers a zero-shot CoT prompt that performs better than the human-engineered “Let’s think step by step” prompt.

Synthetic Prompting

A method called Synthetic Prompting is proposed in [3]. The method leverages LLMs’ own knowledge and generative power to augment a limited set of demonstrations with self-synthesized examples, and then uses the augmented set to elicit better reasoning in LLMs.

Specifically, given a few seed examples, each consisting of a question and a chain of reasoning steps, an LLM is prompted to generate more examples by alternating between two processes: (1) the backward process, where the LLM synthesizes a question based on a self-generated reasoning chain, which ensures that the question is answerable and well-defined; (2) the forward process, where the LLM produces a reasoning chain for the synthesized question, which refines the reasoning chain to be more precise and consistent with the question. This process repeats until enough synthetic examples are generated. The most effective demonstrations are selected based on in-cluster complexity, which aims to maximize the diversity and informativeness of the demonstrations by clustering them and choosing the most complex one (the one with the longest reasoning chain) from each cluster. Finally, the LLM is prompted with the selected demonstrations to generate a reasoning chain for a test question and then use it to obtain the answer.
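
The following sketch captures the backward/forward loop described above. The `llm` callable and prompt templates are assumptions, not the paper's exact templates, and the in-cluster complexity selection step is only noted in a comment.

```python
def synthesize_demonstrations(llm, seed_examples, n_new=8):
    # seed_examples: list of (question, reasoning_chain) pairs.
    examples = list(seed_examples)
    for _ in range(n_new):
        demos = "\n\n".join(f"Question: {q}\nReasoning: {r}" for q, r in examples)
        # Backward process: synthesize a question from a self-generated reasoning chain.
        chain = llm(demos + "\n\nGenerate a new step-by-step reasoning chain:")
        question = llm(f"Reasoning: {chain}\n\nWrite the question that this reasoning answers:")
        # Forward process: regenerate a reasoning chain for the synthesized question,
        # refining it to be consistent with the question.
        refined_chain = llm(demos + f"\n\nQuestion: {question}\nReasoning:")
        examples.append((question, refined_chain))
    # The synthesized examples would then be clustered, and the example with the
    # longest reasoning chain chosen from each cluster (in-cluster complexity).
    return examples
```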

The method is evaluated on various reasoning tasks, including numerical reasoning, algorithmic reasoning, and symbolic reasoning. It can significantly improve the LLMs’ performance, achieving up to 15.6% absolute gains over the state-of-the-art methods.

Self-consistency Prompting

Wang et al. introduce a novel decoding strategy called self-consistency to replace the greedy decoding used in CoT prompting, which further improves LLMs’ reasoning performance by a significant margin [4]. Self-consistency leverages the intuition that complex reasoning tasks typically admit multiple reasoning paths that reach a correct answer. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths.
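
A minimal sketch of self-consistency decoding, assuming a hypothetical `llm` callable that supports temperature sampling and using a crude answer-extraction heuristic:

```python
from collections import Counter

def self_consistency_answer(llm, cot_prompt, question, n_paths=10):
    # Sample a diverse set of reasoning paths instead of a single greedy one.
    answers = []
    for _ in range(n_paths):
        reasoning = llm(f"{cot_prompt}\n\nQ: {question}\nA:", temperature=0.7)
        # Crude answer extraction; real parsing is task-specific.
        answers.append(reasoning.split("The answer is")[-1].strip().rstrip("."))
    # Marginalize over paths by taking the most consistent (most frequent) answer.
    return Counter(answers).most_common(1)[0][0]
```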

The extensive empirical evaluation shows that self-consistency boosts the performance of CoT prompting with a striking margin on a range of popular arithmetic and common sense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%), and ARC-challenge (+3.9%).

Active-Prompt

CoT prompting methods rely on a fixed set of human-annotated exemplars. The problem with this is that the exemplars might not be the most effective examples for different tasks. To address this, Diao et al. recently proposed a new prompting approach called Active-Prompt to adapt LLMs to different task-specific example prompts (annotated with human-designed CoT reasoning) [5]. The approach determines the most important and helpful questions to annotate from a pool of task-specific queries. Several metrics are introduced to characterize uncertainty so as to select the most uncertain questions for annotation.

Below is an illustration of the approach. The first step is to query the LLM, with or without a few CoT examples, generating k possible answers for each question in a set of training questions. An uncertainty metric is then calculated from the k answers (the authors use disagreement). The most uncertain questions are selected for annotation by humans, and the newly annotated exemplars are used to infer each test question.
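
Here is a sketch of the uncertainty-based selection step, again assuming a hypothetical `llm` callable; the disagreement metric follows the description above.

```python
def disagreement(llm, cot_prompt, question, k=5):
    # Sample k answers and measure disagreement = number of unique answers / k.
    answers = set()
    for _ in range(k):
        reasoning = llm(f"{cot_prompt}\n\nQ: {question}\nA:", temperature=0.7)
        answers.add(reasoning.split("The answer is")[-1].strip().rstrip("."))
    return len(answers) / k

def select_for_annotation(llm, cot_prompt, question_pool, n_annotate=8):
    # Rank the pool by uncertainty and hand the most uncertain questions to humans.
    ranked = sorted(question_pool, key=lambda q: disagreement(llm, cot_prompt, q), reverse=True)
    return ranked[:n_annotate]
```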

Multimodal CoT Prompting

Zhang et al. recently proposed a multimodal CoT prompting approach [6]. Traditional CoT focuses on the language modality. In contrast, multimodal CoT incorporates language (text) and vision (images) into a two-stage framework that fuses vision and language representations. The first stage generates a rationale based on the multimodal information. The second stage, answer inference, then leverages this informative generated rationale, so the final answer benefits from rationales grounded in both modalities.

The method achieves new state-of-the-art performance on the ScienceQA benchmark, outperforming GPT-3.5 in accuracy by 16% and even surpassing human performance.
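
A schematic sketch of the two-stage pipeline is given below; `model` stands in for a fused vision-language model, and its keyword interface is purely an assumption.

```python
def multimodal_cot(model, image_features, question, context=""):
    # Stage 1: rationale generation from fused language + vision inputs.
    rationale = model(
        text=f"{context}\nQuestion: {question}\nRationale:", vision=image_features
    )
    # Stage 2: answer inference, conditioned on the generated rationale.
    answer = model(
        text=f"{context}\nQuestion: {question}\nRationale: {rationale}\nAnswer:",
        vision=image_features,
    )
    return answer
```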

An example of multimodal CoT prompting is shown in the figure below.

Tree-of-Thought Prompting

For complex tasks that require exploration or strategic lookahead, traditional or simple prompting techniques fall short. Yao et al. proposed Tree of Thoughts (ToT) prompting, a framework that generalizes CoT prompting and encourages exploration over thoughts that serve as intermediate steps for solving general problems with LLMs [7].

ToT prompting maintains a tree of thoughts, where thoughts are coherent language sequences that serve as intermediate steps toward solving a problem. This approach enables an LLM to self-evaluate, through a deliberate reasoning process, the progress that intermediate thoughts make towards solving the problem. The LLM’s ability to generate and evaluate thoughts is then combined with search algorithms (e.g., breadth-first search and depth-first search) to enable systematic exploration of thoughts with lookahead and backtracking.
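
The sketch below shows a breadth-first-search variant of ToT. The `propose` and `evaluate` callables are hypothetical helpers that ask the LLM for candidate next thoughts and for a score of a partial solution, respectively.

```python
def tot_bfs(llm, question, propose, evaluate, breadth=3, depth=3, k=5):
    # propose(llm, question, state, k): ask the LLM for k candidate next thoughts.
    # evaluate(llm, question, state): ask the LLM to score a partial solution (higher = better).
    frontier = [""]  # each state is the chain of thoughts accumulated so far
    for _ in range(depth):
        candidates = [
            (state + "\n" + thought).strip()
            for state in frontier
            for thought in propose(llm, question, state, k)
        ]
        # Keep only the most promising states (lookahead via self-evaluation).
        frontier = sorted(
            candidates, key=lambda s: evaluate(llm, question, s), reverse=True
        )[:breadth]
    return frontier[0]
```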

The ToT framework is illustrated below:

Soft Prompting

The purpose of prompt construction is to find a method that allows an LLM to perform a task effectively. Because the prompt is consumed by the model rather than by humans, it does not need to be human-interpretable natural language. For this reason, there are also methods that apply soft prompts (a.k.a. continuous prompts), which perform prompting directly in the embedding space of the model. Specifically, soft prompts remove two constraints: (1) the embeddings of template words no longer need to be the embeddings of natural language words, and (2) the template is no longer parameterized by the pre-trained LLM parameters. Instead, templates have their own parameters that can be tuned on training data from the downstream task. Several representative methods are explained below.

Prefix Tuning

Prefix tuning is a method that prepends a sequence of continuous task-specific vectors to the input, while keeping the LLM parameters frozen. Mathematically, this consists of optimizing the log-likelihood objective with respect to a trainable prefix matrix.
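
For intuition, here is a simplified PyTorch-style sketch that prepends trainable vectors only at the input embedding layer, which is closer to prompt tuning [9]; full prefix tuning [8] additionally prepends trainable activations at every layer. It assumes a HuggingFace-style model that accepts `inputs_embeds`, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class PrefixTunedLM(nn.Module):
    def __init__(self, base_lm, prefix_len=20, embed_dim=768):
        super().__init__()
        self.base_lm = base_lm
        for p in self.base_lm.parameters():
            p.requires_grad = False  # LLM parameters stay frozen
        # The only trainable parameters: a continuous, task-specific prefix matrix.
        self.prefix = nn.Parameter(0.02 * torch.randn(prefix_len, embed_dim))

    def forward(self, inputs_embeds):
        # inputs_embeds: [batch, seq_len, embed_dim] token embeddings of the input.
        prefix = self.prefix.unsqueeze(0).expand(inputs_embeds.size(0), -1, -1)
        return self.base_lm(inputs_embeds=torch.cat([prefix, inputs_embeds], dim=1))
```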

Experimentally, Li and Liang observe that such continuous prefix-based learning is more sensitive to initialization in low-data settings than the use of discrete prompts with real words [8]. Similarly, Lester et al. prepend the input sequence with special tokens to form a template and tune the embeddings of these tokens directly [9]. Compared to Li and Liang’s method, this adds fewer parameters, as it does not introduce additional tunable parameters within each network layer. Tsimpoukelli et al. train a vision encoder that encodes an image into a sequence of embeddings that can be used to prompt a frozen auto-regressive LLM to generate an appropriate caption [10]. They show that the resulting model can perform few-shot learning on vision-language tasks such as visual question answering. Unlike the two methods above, here the prefix is sample-dependent, namely a representation of the input image, rather than a task embedding.

Tuning Initialized with Discrete Prompts

There are also methods that initialize the search for a continuous prompt using a prompt that has already been created or discovered using discrete prompts. For example, Zhong et al. define a template using a discrete search method such as AUTOPROMPT [11], initialize virtual tokens based on this discovered prompt, then fine-tune the embeddings to increase task accuracy [12]. This work found that initializing with manual templates can provide a better starting point for the search process. Qin and Eisner propose to learn a mixture of soft templates for each input where the weights and parameters for each template are jointly learned using training samples [13]. The initial set of templates they use are either manually crafted ones or those obtained using the “prompt mining” method. Similarly, Hambardzumyan et al. introduce the use of a continuous template whose shape follows a manual prompt template [14].

Hard-Soft Prompt Hybrid Tuning

Instead of using a purely learnable prompt template, these methods insert some tunable embeddings into a hard prompt template. Liu et al. propose the “P-tuning” method, where continuous prompts are learned by inserting trainable variables into the embedded input [15]. P-tuning also introduces the use of task-related anchor tokens within the template for further improvement. These anchor tokens are not tuned during training. Han et al. propose prompt tuning with rules (PTR), which uses manually crafted sub-templates to compose a complete template using logic rules [16]. To enhance the representation ability of the resulting template, they also insert several virtual tokens whose embeddings can be tuned together with the pre-trained LLM parameters using training samples. The template tokens in PTR thus contain both actual tokens and virtual tokens. Experimental results demonstrate the effectiveness of this prompt design method in relation classification tasks.

Progressive Prompting

Learning a long sequence of tasks while gaining experience and avoiding forgetting remains a key feature of human-level intelligence. Although LLMs have largely succeeded in learning on a single task, their performance degrades in scenarios where multiple tasks are encountered sequentially, also known as continual learning (CL) [17,18]. Two major challenges arise in CL: (1) avoiding catastrophic forgetting, i.e., loss of the knowledge acquired from previous tasks after learning new ones, and (2) allowing forward transfer, i.e., leveraging the knowledge from past tasks for efficient learning of new tasks.

Razdaibiedina et al. introduce Progressive Prompts, a novel CL approach for LLMs that supports forward transfer without forgetting [19]. The method is inspired by progressive networks, but is significantly more memory-efficient because it only learns a fixed number of prompt tokens for each new task. Learning a prompt to adapt an LLM to a single downstream task was introduced in prompt tuning and was shown to match the performance of full model finetuning while training <0.01% of the parameters. Progressive Prompts learns a new soft prompt for each task and sequentially concatenates it with the previously learned prompts, while keeping the underlying model frozen. Importantly, the input tokens are shared across all tasks, and new prompts are progressively prepended while the previous prompts are kept frozen. The method can thus (1) alleviate catastrophic forgetting by preserving the knowledge acquired in previous prompts, and (2) transfer knowledge to future tasks by sequentially learning new prompts given previous ones.
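
A minimal PyTorch-style sketch of the progressive-prompt bookkeeping follows; the frozen base model and the training loop are omitted, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ProgressivePrompts(nn.Module):
    def __init__(self, embed_dim=768, prompt_len=10):
        super().__init__()
        self.embed_dim, self.prompt_len = embed_dim, prompt_len
        self.prompts = nn.ParameterList()  # one soft prompt per task seen so far

    def start_new_task(self):
        # Call once before training on each new task.
        for p in self.prompts:
            p.requires_grad = False  # freeze prompts learned on earlier tasks
        self.prompts.append(nn.Parameter(0.02 * torch.randn(self.prompt_len, self.embed_dim)))

    def forward(self, inputs_embeds):
        # Prepend the concatenation of all prompts learned so far to the input embeddings.
        all_prompts = torch.cat(list(self.prompts), dim=0)
        all_prompts = all_prompts.unsqueeze(0).expand(inputs_embeds.size(0), -1, -1)
        return torch.cat([all_prompts, inputs_embeds], dim=1)
```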

Experiments on standard continual learning benchmarks show that this approach outperforms state-of-the-art methods, with an improvement of over 20% in average test accuracy over the previous best-performing method on the T5 model.

Multi-Prompt Learning

Prompt Ensembling

Prompt ensembling is the process of using multiple unanswered prompts for an input at inference time to make predictions. An example is shown in the figure below.

The multiple prompts can either be discrete prompts or continuous prompts. This approach can (1) leverage the complementary advantages of different prompts, (2) alleviate the cost of prompt engineering, since choosing one best-performing prompt is challenging, and (3) stabilize performance on downstream tasks. Prompt ensembling is connected to ensembling methods such as bagging or boosting techniques, which have a long history in machine learning. Current research also borrows ideas from these works to derive effective ways for prompt ensembling, as described below.

Uniform Averaging: the most intuitive way to combine the predictions when using multiple prompts is to take the average of probabilities from different prompts. Jiang et al. first filter their prompts by selecting K prompts that achieve the highest accuracy on the training set, and then use the average log probabilities obtained from the top K prompts to calculate the probability for a single token at a certain position when performing factual probing tasks [20]. Schick and Schutze also try a simple average when using an ensemble model to annotate an unlabelled dataset [21]. When performing text generation evaluation, Yuan et al. formulate this task as a text generation problem and take the average of the final generation scores obtained using different prompts [22].

Weighted Averaging: simple uniform averaging of results from multiple prompts is easy to implement, but can also be suboptimal given that some prompts are more performant than others. To address this issue, some works also explore the use of a weighted average for prompt ensembling, where each prompt is associated with a weight. The weights are typically pre-specified based on prompt performance or optimized using a training set. For example, Jiang et al. learn the weight for each prompt by maximizing the probability of the target output over training data [20]. Qin and Eisner use the same approach except that the weight for each prompt is optimized together with soft prompt parameters [23].

Knowledge Distillation: an ensemble of deep learning models can typically improve the performance, and this superior performance can be distilled into a single model using knowledge distillation [24]. To incorporate this idea, Schick and Schutze train a separate model for each manually-created template-answer pair, and use the ensemble of them to annotate an unlabelled dataset [21]. Then the final model is trained to distil the knowledge from the annotated dataset. Gao et al. use a similar ensemble method on their automatically generated templates [25].

Majority Voting: for classification tasks, majority voting can also be used to combine the results from different prompts.
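
The sketch below illustrates uniform averaging, weighted averaging, and majority voting over the outputs obtained from several prompts; the data structures are illustrative.

```python
from collections import Counter

def uniform_average(label_probs_per_prompt):
    # label_probs_per_prompt: list of {label: probability} dicts, one per prompt.
    labels = label_probs_per_prompt[0].keys()
    return max(
        labels,
        key=lambda y: sum(p[y] for p in label_probs_per_prompt) / len(label_probs_per_prompt),
    )

def weighted_average(label_probs_per_prompt, weights):
    # weights: one weight per prompt, e.g. pre-specified from validation accuracy.
    labels = label_probs_per_prompt[0].keys()
    return max(
        labels,
        key=lambda y: sum(w * p[y] for w, p in zip(weights, label_probs_per_prompt)),
    )

def majority_vote(predictions):
    # predictions: one predicted label per prompt.
    return Counter(predictions).most_common(1)[0][0]
```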

Prompt Composition

For composable tasks, i.e., tasks that can be broken down into more fundamental subtasks, prompt composition can be applied: multiple sub-prompts are defined, one for each subtask, and a composite prompt is then built from those sub-prompts. This process is illustrated in the figure below. For example, the relation extraction task, which aims to extract the relation between two entities, can be broken down into several subtasks, including identifying the characteristics of the entities and classifying the relationship between them.
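
A rough illustration for the relation-extraction example is shown below; the [MASK] slots would be filled by a masked language model, and the exact sub-template wording is an assumption for illustration only.

```python
def compose_relation_prompt(sentence, entity1, entity2):
    # Sub-prompt for each subtask: entity typing for each entity, then the relation.
    entity1_type = f'In this sentence, "{entity1}" is a [MASK].'
    entity2_type = f'In this sentence, "{entity2}" is a [MASK].'
    relation = f'The relation between "{entity1}" and "{entity2}" is [MASK].'
    # Composite prompt: the input followed by the composed sub-prompts.
    return f"{sentence} {entity1_type} {entity2_type} {relation}"

# compose_relation_prompt("Mark Twain was born in Florida, Missouri.",
#                         "Mark Twain", "Florida, Missouri")
```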

Prompt Decomposition

For tasks where multiple predictions must be made for one sample (e.g., sequence labelling), directly defining a holistic prompt with regard to the entire input text x is challenging. One intuitive way to address this problem is to break the holistic prompt down into different sub-prompts and answer each sub-prompt separately. The following figure illustrates this idea with an example from the named entity recognition task, which aims to identify all named entities in an input sentence. The input is first converted into a set of text spans, and the model is then prompted to predict the entity type (including “Not an Entity”) for each span. Because the number of spans is large, it is not easy to predict all span types at once, so a separate prompt is created for each span and answered individually.
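
A small sketch of the decomposition: one sub-prompt per candidate span, each answered separately. The sentence, spans, and prompt wording are illustrative.

```python
def span_prompts(sentence, spans):
    # One sub-prompt per candidate span; each is answered separately by the model.
    return [
        f'Sentence: {sentence}\n'
        f'What type of entity is "{span}"? Answer person, location, organization, '
        f'or "Not an Entity".'
        for span in spans
    ]

# Example:
# span_prompts("Mike went to New York yesterday.", ["Mike", "New York", "yesterday"])
```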

Conclusion

Prompt engineering can significantly influence the quality and effectiveness of LLMs’ responses. By carefully designing and refining the prompts used to generate output, researchers and developers can improve the accuracy and relevance of the model’s output, making it more useful and effective for a wide range of tasks and applications.

In this survey, we have summarized and compared various prompt engineering methods and approaches. We hope this survey will help researchers more effectively and comprehensively understand the paradigm of prompt-based learning, and grasp its core challenges so that more scientifically meaningful advances can be made in this field.

References

[1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. URL https://arxiv.org/abs/2201.11903

[2] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba. Large Language Models are Human-Level Prompt Engineers. URL https://arxiv.org/pdf/2211.01910.pdf

[3] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, Weizhu Chen. Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models. URL https://arxiv.org/pdf/2302.00618.pdf

[4] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. URL: https://arxiv.org/pdf/2203.11171.pdf

[5] Shizhe Diao, Pengcheng Wang, Yong Lin, Tong Zhang. Active Prompting with Chain-of-Thought for Large Language Models. URL https://arxiv.org/pdf/2302.12246v3.pdf

[6] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola. Multimodal Chain-of-Thought Reasoning in Language Models. URL https://arxiv.org/abs/2302.00923

[7] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. URL: https://arxiv.org/abs/2305.10601

[8] Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190

[9] Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning.

[10] Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. CoRR, abs/2106.13884.

[11] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Empirical Methods in Natural Language Processing (EMNLP).

[12] Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021b. Factual probing is [MASK]: learning vs. learning to recall. CoRR, abs/2104.05240.

[13] Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.

[14] Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. Warp: Word-level adversarial reprogramming. ArXiv, abs/2101.00121.

[15] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. CoRR, abs/2103.10385.

[16] Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021. Ptr: Prompt tuning with rules for text classification.

[17] Cyprien de Masson D’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. Episodic memory in lifelong language learning. Advances in Neural Information Processing Systems, 32, 2019.

[18] Yufan Huang, Yanzhe Zhang, Jiaao Chen, Xuezhi Wang, and Diyi Yang. Continual learning for text classification with information disentanglement based regularization. arXiv preprint arXiv:2104.05489, 2021.

[19] Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, Amjad Almahairi. Progressive Prompts: Continual Learning for Language Models. URL https://arxiv.org/abs/2301.12314

[20] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020c. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

[21] Timo Schick and Hinrich Schutze. 2021a. Exploiting cloze questions for few shot text classification and natural language inference.

[22] Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021b. Bartscore: Evaluating generated text as text generation.

[23] Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.

[24] Zeyuan Allen-Zhu and Yuanzhi Li. 2020. Towards understanding ensemble, knowledge distillation and self distillation in deep learning. CoRR, abs/2012.09816.

[25] Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Association for Computational Linguistics (ACL).
