Model vendors continue to release increasingly sophisticated large language models (LLMs) with longer context windows and enhanced reasoning capabilities.
That allows models to process more and “think” more, but it also increases cost: the more compute a model consumes, the higher the bill.
Coupled with the tinkering that goes into prompting (it can take several tries to achieve the desired result, and the question at hand may not require a model that can think like a PhD), compute spend can spiral out of control. This is spawning prompt ops, an entirely new discipline in the dawning AI age.
VentureBeat spoke with IDC President Crawford Del Prete, who said that prompt engineering is similar to writing, the actual creating of content, while prompt ops is more like publishing, where you evolve the content. “The content is living, the content is changing, and you want to make sure you’re refining that over time.”
The challenge of compute usage and cost
Compute usage and cost are “related but distinct concepts” in the context of LLMs, explained David Emerson, applied scientist at the Vector Institute. Generally, the price users pay is based on both the number of input tokens and the number of output tokens, though they are not charged for behind-the-scenes actions like meta-prompts, steering instructions or retrieval-augmented generation (RAG).
He explained that while longer context allows models to process more text at once, it directly translates into more FLOPS (a measurement of compute power). Some aspects of transformer models can even scale quadratically with input length if not managed well. Unnecessarily long responses can also slow processing time and require additional compute and cost to build and maintain algorithms that post-process responses into the answer users were hoping for.
Typically, longer-context environments tempt providers to deliver verbose responses, said Emerson. Many reasoning models (o3 or o1 from OpenAI, for instance) will provide long answers to even simple questions, incurring heavy compute costs.
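To make the token-based pricing concrete, here is a minimal sketch of how an API bill scales with input and output tokens. The per-1K-token prices and token counts below are invented for the illustration and do not reflect any vendor’s actual rates:

```python
def estimate_cost(input_tokens, output_tokens,
                  price_in_per_1k, price_out_per_1k):
    """Estimate an API bill: providers typically charge separately
    for input (prompt) tokens and output (completion) tokens."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# A verbose, reasoning-heavy answer can cost many times more than a
# terse one for the same question (hypothetical prices per 1K tokens).
terse = estimate_cost(input_tokens=50, output_tokens=20,
                      price_in_per_1k=0.005, price_out_per_1k=0.015)
verbose = estimate_cost(input_tokens=50, output_tokens=800,
                        price_in_per_1k=0.005, price_out_per_1k=0.015)
```

Note that output tokens are usually priced higher than input tokens, which is why verbose reasoning answers dominate the bill in this sketch.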
Here is an example:
Input: Please answer the following math question. How many apples will I have if I have 2 apples and buy 4 more at the store after eating 1?
Output: If I eat 1 apple, I would only have 1 left. If I then buy 4 more apples at the store, I would have 5 apples.
Not only did the model generate more tokens than it needed to, it buried its answer. An engineer would then need to design a programmatic way to extract the final answer or ask follow-up questions like “What is your final answer?”, which incur even more API costs.
Alternatively, the prompt could be redesigned so that the model produces an immediate answer. For instance:
Input: Answer the following math question. How many apples will I have if I have 2 apples and buy 4 more at the store after eating 1? Start your response with “The answer is”… Or:
Input: Please answer the following math question. How many apples will I have if I have 2 apples and buy 4 more apples at the store after I eat 1? Wrap your final answer in bold tags.
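One payoff of constraining the response format this way is that extracting the answer programmatically becomes trivial. A minimal sketch, assuming the two formats above (the regex patterns and the HTML-style bold-tag convention are illustrative choices, not from the article):

```python
import re

def extract_final_answer(response):
    """Pull the answer from a response that was asked to either
    start with "The answer is ..." or wrap the answer in <b> tags."""
    m = re.search(r"The answer is\s+(\S+)", response)
    if m:
        return m.group(1).rstrip(".")
    m = re.search(r"<b>(.*?)</b>", response)
    if m:
        return m.group(1).strip()
    return None  # response did not follow the requested format

print(extract_final_answer("The answer is 5."))  # -> 5
print(extract_final_answer("After eating 1 and buying 4, <b>5 apples</b>"))  # -> 5 apples
```

If the model ignores the formatting instruction, the function returns `None`, which a pipeline can treat as a signal to retry or flag the response.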
How a question is phrased can reduce the effort or cost required to arrive at the desired answer, Emerson said. He also noted that techniques like few-shot prompting, in which the model is given a few examples of what the user is looking for, can help produce quicker outputs. One danger, he pointed out, is not knowing when to use sophisticated techniques like chain-of-thought prompting (generating responses in steps) or self-refinement, which directly encourage models to produce many tokens or go through multiple iterations.
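Few-shot prompting can be as simple as prepending a handful of worked examples so the model imitates their terse answer format. A sketch of such a prompt builder; the example questions and the Q/A template are invented for illustration:

```python
def build_few_shot_prompt(question):
    """Prepend worked examples so the model mirrors the short
    'The answer is N.' format instead of reasoning verbosely."""
    examples = [
        ("I have 3 apples and buy 2 more. How many do I have?",
         "The answer is 5."),
        ("I have 10 apples and eat 4. How many do I have?",
         "The answer is 6."),
    ]
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {question}\nA:")  # model completes this line
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(
    "I have 2 apples, buy 4 more, then eat 1. How many do I have?")
```

The trailing `A:` invites the model to complete the pattern, so the response tends to match the examples’ short format.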
He stressed that not every query requires a model to analyze and re-analyze before providing an answer; models could be perfectly capable of responding correctly when instructed to do so directly. Additionally, incorrect API configurations (such as using OpenAI’s o3, which requires high reasoning effort) will incur higher costs when a lower-effort, cheaper request would suffice.
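One way to act on this is to route queries by difficulty before they ever hit a reasoning model. This is a hypothetical router, not any specific vendor’s API: the model names, the `reasoning_effort` field and the keyword heuristic are all invented for the sketch:

```python
def choose_request_config(question):
    """Route a query: questions that look multi-step get a reasoning
    model at high effort; everything else gets a cheap, low-effort
    configuration. Names and heuristic are illustrative only."""
    hard_markers = ("prove", "derive", "step by step", "optimize", "plan")
    is_hard = any(marker in question.lower() for marker in hard_markers)
    if is_hard:
        return {"model": "reasoning-model", "reasoning_effort": "high"}
    return {"model": "small-model", "reasoning_effort": "low"}
```

A real router would use a classifier or cost/quality telemetry rather than keywords, but the principle is the same: don’t pay reasoning-model prices for trivial questions.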
Users may also be tempted to take an “everything but the kitchen sink” approach, dumping as much text as possible into a model’s context in the hope that it will perform a task more accurately. “While additional context can help models perform tasks, it is not always the most efficient or effective approach,” Emerson said.
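Instead of dumping everything into the context window, a pipeline can keep only the passages most relevant to the query. Here is a toy word-overlap scorer to show the idea; production systems would typically use embeddings, and the function and data are invented for the sketch:

```python
def top_k_relevant(query, passages, k=2):
    """Rank candidate passages by word overlap with the query and
    keep only the top k, rather than stuffing every passage into
    the prompt's context window."""
    query_words = set(query.lower().split())

    def overlap(passage):
        return len(query_words & set(passage.lower().split()))

    return sorted(passages, key=overlap, reverse=True)[:k]

docs = [
    "apples cost money",
    "the weather is sunny",
    "buy apples at the store",
]
kept = top_k_relevant("how many apples do I buy", docs, k=2)
```

Trimming context this way cuts input tokens (and FLOPS) while keeping the material the model actually needs.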
Evolution to prompt operations
It is no secret that AI-optimized hardware can be hard to come by these days. Del Prete of IDC said that enterprises need to be able to minimize GPU idle time while also fitting more queries into the idle cycles between GPU requests.
He asked, “How can I squeeze out more from these precious commodities?” “Because I have to get my system usage up, because I don’t just have the benefit of throwing more capacity at the issue.”
Prompt ops teams can help address this challenge, as they manage the lifecycle of a prompt. While prompt engineering is concerned with the quality of the prompt, prompt ops is where you repeat and refine it over time, Del Prete explained.
“It’s orchestration,” he said. “I think of it as the curation and selection of questions, and the selection of how you interact with AI to get the most from it.” Prompt ops helps manage, monitor and tune prompts. “I think in three or four years, it will be a discipline. It’ll become a skill.”
Although it’s a very new field, early providers in the space include QueryPal, Promptable, Rebuff and TrueLens. As prompt ops evolves, said Del Prete, these platforms will continue to iterate, improve and provide real-time feedback, giving users more capacity to tune prompts over time.
He predicted that agents will eventually be able to write, structure and tune prompts on their own. “The level of automation will increase and human interaction will decrease; you’ll be able to have agents operate more autonomously when creating prompts.”
Common mistakes people make when prompting
Until prompt ops is a fully realized discipline, there is no such thing as a perfect prompt. Emerson pointed out some of the biggest mistakes people make:
- Not specifying the problem well enough. This includes how the user wants the model to provide its answer, what the model should consider when responding, constraints that need to be taken into account and other factors. “In many settings, models need a good amount of context to provide a response that meets users’ expectations,” Emerson said.
- Failing to consider the ways a problem can be simplified to narrow the scope of the response. Should the answer fall within a certain range (0 to 100)? Should the answer be phrased as a multiple-choice problem rather than something open-ended? Can the user provide good examples to contextualize the query? Can the problem be broken into steps for separate, simpler queries?
- Not taking advantage of structure. LLMs are very good at pattern recognition, and many can understand code. While bullet points, itemized lists or bold indicators (****) may seem “a bit cluttered” to human eyes, Emerson noted, these callouts can be beneficial for an LLM. Asking for structured outputs can also be helpful when users want to process responses automatically. There are many other factors to consider in maintaining a production pipeline, based on sound engineering practices, Emerson noted. These include:
- Ensuring that the throughput is consistent;
- Monitoring the performance of prompts over time (potentially against a validation set); and
- Setting up tests and early-warning detection to identify pipeline problems. Users can also take advantage of tools designed to support the prompting process. For instance, the open-source DSPy can automatically configure and optimize prompts for downstream tasks based on a few labeled examples. While this may be a fairly sophisticated example, there are many other offerings (including some built into tools like ChatGPT and Google) that can assist with prompt design.
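The monitoring and early-warning points above can be made concrete: score a prompt’s outputs against a small labeled validation set and flag regressions when accuracy drops. A minimal sketch; the function names, threshold and data are invented for the example:

```python
def prompt_accuracy(predictions, labels):
    """Fraction of validation examples where the extracted answer
    exactly matches the expected label."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

def should_alert(accuracy, threshold=0.9):
    """Early-warning check: fire when validation-set accuracy
    drops below the chosen threshold."""
    return accuracy < threshold

# Hypothetical run: a prompt change caused one regression.
acc = prompt_accuracy(["5", "6", "7"], ["5", "6", "8"])
```

Re-running this check whenever a prompt or model version changes turns prompt quality into something a pipeline can test, rather than something noticed only when users complain.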
Emerson concluded, “I believe one of the easiest things users can do is to stay current on effective prompting methods, model developments, and new ways to interact with models.”
