Semantics and syntax in GenAI applications

Whether through RAG or guardrails, both syntax and semantics play an important part in how we use LLMs and how we take advantage of them.

Whether through RAG, caching, evals, or guard-railing, we can take advantage of syntactic and semantic constructs to elevate our LLM applications.

For that, old and new techniques and solutions can be part of our toolkit. Let’s learn a little more about those.

For those who keep hearing about RAG, embeddings, or guard-railing but don’t understand how they fit together, we’re bringing it back to the basics. You can then use this knowledge to build more complex systems. There is no need to get scared by buzzwords or marketing; let’s break things down.


Currently, the vast majority of GenAI use cases rely on RAG (Retrieval-Augmented Generation) in some way, which is where most of the investment is directed. It enables companies to leverage their data swiftly and to develop internal and external knowledge-based tools.

We have been helping companies build RAG systems for many use cases and have seen how they can drive value for businesses. We have also embraced RAG ourselves by building an open-source integrated writing environment for technical writing.

RefStudio — An Open Source Integrated Writing Environment for Technical Writing

If Confluence is where documentation goes to die, RAG is where it comes to life. But it’s not just documentation; it can also apply to code and other types of data. RAG can make operations more efficient, onboarding smoother, and knowledge sharing easier.

However, it’s important to note that you need to surface your data to your prompts. Unless you invest in fine-tuning models specifically for your needs, they won’t be trained on your data and will require access to it in some way. This means you’ll encounter a search problem: sifting through a large amount of information to find what’s relevant to your query.

Engineers have been solving that problem for more than 20 years and have known about search indexes for a long time. Our co-founder wrote about them years ago, and that content is still relevant and interesting today.

“An inverted index answers questions ‘like find me the document that contains the word blue but not the word black’. They are kind of like the index in the back of a book. An inverted index looks a lot like a hash table. You hash the word and place it in a hash table. Then, like in the range index, you keep an array of the documents that match that term.” — Nuno Job, Database Indexes for The Inquisitive Mind

This allows you to know which documents contain which terms. That means that if your prompt is about running, you’ll likely need to include documents B and C as the context of that prompt.
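To make the idea concrete, here is a minimal sketch of an inverted index. The three documents, their IDs, and the function names are made up for illustration; a real search engine adds ranking, compression, and much more.

```typescript
// Hypothetical toy corpus; the IDs and texts are invented for this example.
const docs: Record<string, string> = {
  A: "the sky is blue",
  B: "running shoes in blue and black",
  C: "running is good for you",
};

// Split text into lowercase word tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Map each term to the set of document IDs that contain it.
function buildInvertedIndex(corpus: Record<string, string>): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const id of Object.keys(corpus)) {
    for (const term of tokenize(corpus[id])) {
      if (!index.has(term)) index.set(term, new Set<string>());
      index.get(term)!.add(id);
    }
  }
  return index;
}

// Return the documents that contain `include` but not `exclude`.
function searchIndex(index: Map<string, Set<string>>, include: string, exclude?: string): string[] {
  const postings = index.get(include);
  const hits: string[] = postings ? Array.from(postings) : [];
  const blocked = exclude ? index.get(exclude) : undefined;
  return hits.filter((id) => !blocked || !blocked.has(id)).sort();
}

const invertedIndex = buildInvertedIndex(docs);
console.log(searchIndex(invertedIndex, "running"));       // documents that mention "running"
console.log(searchIndex(invertedIndex, "blue", "black")); // "blue" but not "black"
```

With this toy corpus, a prompt about running would pull in documents B and C, while the "blue but not black" query only matches document A.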

Inverted indexes employ many techniques to be effective, among them:

Stop-word removal, which drops words that carry little meaning for search:

Fishing is a way of catching cats, he argued in his arguments
Fishing, catching, cats, argued, arguments

Stemming, which reduces each word to a root form:

Fishing is a way of catching cats, he argued in his arguments
fish, is, a, wai, of, catch, cat, he, argued, in, his, argum
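The two transformations illustrated above, dropping common words and reducing words to a root form, can be sketched as follows. The stop list and the suffix rules are assumptions picked for this example; production systems use curated stop lists and a proper stemming algorithm such as Porter's, so the stems produced here will not match a real stemmer's output exactly.

```typescript
// Hypothetical stop list, chosen only so the example sentence reduces to the terms shown above.
const STOP_WORDS = new Set<string>(["is", "a", "way", "of", "he", "in", "his"]);

// Hypothetical suffix rules; real stemmers apply many more, in several passes.
const SUFFIXES = ["ing", "ed", "s"];

// Extract the alphabetic words of a text, dropping punctuation.
function extractWords(text: string): string[] {
  return text.match(/[A-Za-z]+/g) ?? [];
}

// Keep only the words that are not in the stop list.
function removeStopWords(text: string): string[] {
  return extractWords(text).filter((word) => !STOP_WORDS.has(word.toLowerCase()));
}

// Strip the first matching suffix, keeping at least three characters of stem.
function naiveStem(word: string): string {
  const lower = word.toLowerCase();
  for (const suffix of SUFFIXES) {
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, lower.length - suffix.length);
    }
  }
  return lower;
}

const sentence = "Fishing is a way of catching cats, he argued in his arguments";
console.log(removeStopWords(sentence).join(", "));
console.log(extractWords(sentence).map(naiveStem).join(", "));
```

Both steps shrink the vocabulary the index has to store and let "fishing" match a query for "fish", at the cost of occasionally conflating unrelated words.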

The problem with inverted indexes is that they have no understanding of meaning: they match terms, not semantics. Embedding models, on the other hand, do capture semantics.


Embeddings are a highly versatile and fascinating machine learning technique. From a piece of content, an embedding model generates a fixed-size array of floating-point numbers, which represents a point in a multidimensional space.

The crucial aspect is that once all of those arrays are placed in that space, points that sit close to each other have similar semantic meaning.
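One common way to measure how "near" two embeddings are is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding models output hundreds or thousands of dimensions, and the numbers would come from the model rather than being written by hand.

```typescript
// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Invented toy "embeddings"; the values are assumptions, not model output.
const cat = [0.9, 0.1, 0.0];
const kitten = [0.8, 0.2, 0.1];
const invoice = [0.0, 0.2, 0.9];

console.log(cosineSimilarity(cat, kitten));  // close to 1: similar meaning
console.log(cosineSimilarity(cat, invoice)); // close to 0: unrelated
```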

Interactive map of embeddings

The interactive map above shows embeddings for the ‘In Our Time’ podcast, built by Matt Webb. If you navigate through that interactive map, you’ll see that the episodes close to each other have related topics.

If you can do that with podcast episodes, you can do that with everything. The Word2Vec dashboard allows you to play around with this concept. Give it a word, and it will give you other words that sit close to it in the same space.

Whereas an inverted index returns words that are syntactically similar to your search, here you get semantically similar words. That is a critical and fundamental difference.

This technique has many use cases, among them:

  • Recommendation systems
  • Search (multimodal)
  • Data preprocessing

And so much more

There are many embedding models out there. You should choose the ones that best fit your use cases:

  • text-embedding-ada-002, from OpenAI, is one of the most widely used models
  • fastText is very fast and lightweight
  • e5-large-v2 is strong with QA-style content
  • You can go through the Hugging Face leaderboard to learn about all of them

In the context of GenAI, we can use embeddings to index all our data and search through it to find the most relevant content for our prompt. That means we need to have performant ways to navigate through vector indexes. And that’s what vector databases do: they let you carry out an Approximate Nearest Neighbour Search.
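As a point of reference, the sketch below performs an exact nearest-neighbour search by scoring every stored vector against the query. This is the brute-force baseline that approximate methods trade a little accuracy to speed up; it is not how a vector database works internally. The item IDs and vectors are invented for the example.

```typescript
interface IndexedItem {
  id: string;
  vector: number[];
}

interface ScoredItem {
  id: string;
  score: number;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every item and keep the k best: exact, but linear in the number of items.
function nearestNeighbours(query: number[], items: IndexedItem[], k: number): ScoredItem[] {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hypothetical pre-computed embeddings for three internal documents.
const indexedItems: IndexedItem[] = [
  { id: "onboarding-guide", vector: [0.9, 0.1, 0.1] },
  { id: "expense-policy", vector: [0.1, 0.9, 0.2] },
  { id: "deploy-runbook", vector: [0.2, 0.1, 0.9] },
];

// Hypothetical embedding of the user's prompt.
const queryVector = [0.8, 0.2, 0.1];
console.log(nearestNeighbours(queryVector, indexedItems, 2));
```

With millions of vectors, scanning everything per query becomes too slow, which is the gap that approximate indexes are designed to fill.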

The vector database market is exploding, with many varied options. However, you don’t have to use a dedicated one; you can also use sqlite-vss and pgvector for SQLite and Postgres, respectively. Alternatively, you can do that locally with ANN libraries like FAISS. And you can go to the edge with Athena.

When indexing your content, you can slice it in many ways:

  • Per document
  • Per paragraph
  • Per phrase
  • Per Q&A
  • Or any other way
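Two of the slicing strategies above, per paragraph and per phrase, can be sketched with a few lines of string handling. The splitting rules here are deliberately naive assumptions (blank lines separate paragraphs, sentence-ending punctuation separates phrases); real content usually needs more careful handling of abbreviations, headings, and code.

```typescript
// Split on one or more blank lines and drop empty chunks.
function splitIntoParagraphs(text: string): string[] {
  return text
    .split(/\n\s*\n/)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0);
}

// Split on sentence-ending punctuation; naive, so "e.g." would be split too.
function splitIntoSentences(paragraph: string): string[] {
  return (paragraph.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}

// Hypothetical document used only to exercise the functions.
const sampleDocument = "RAG needs data. Search finds it!\n\nEmbeddings capture meaning.";

const paragraphs = splitIntoParagraphs(sampleDocument);
console.log(paragraphs);
console.log(splitIntoSentences(paragraphs[0]));
```

Smaller chunks give more precise matches but less surrounding context per hit, so the right granularity depends on the content and on the questions being asked.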

And, lastly, you can mix an inverted index with embeddings. This way you cover the shortfalls of each type of search.
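One possible way to mix the two, not necessarily the one any given product uses, is reciprocal rank fusion (RRF): each search produces its own ranked list, and a document's final score is the sum of 1 / (k + rank) across the lists. The two result lists below are invented, and k = 60 is a conventional default rather than a tuned value.

```typescript
// Merge several ranked lists of document IDs with reciprocal rank fusion.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, position) => {
      const rank = position + 1; // ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}

// Hypothetical results from a keyword (inverted index) search and an embedding search.
const keywordHits = ["B", "C", "A"];
const embeddingHits = ["C", "D", "B"];

console.log(reciprocalRankFusion([keywordHits, embeddingHits]));
```

Documents that rank well in both lists rise to the top, while a document found by only one search still makes it into the merged result.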

Outside of RAG, you can also use embeddings for prompt caching. If you can find semantically similar documents with them, you can also find semantically similar prompts and reuse the responses you already generated. It should go without saying, but different use cases and usage patterns will yield varying results.
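A semantic prompt cache can be sketched as a list of (embedding, response) pairs plus a similarity threshold. The class name, the 0.95 threshold, and the toy vectors are all assumptions for illustration; in practice the embeddings would come from an embedding model and the threshold would need tuning against real traffic.

```typescript
interface CacheEntry {
  embedding: number[];
  response: string;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Return the response cached for the most similar prompt, if it is similar enough.
  lookup(promptEmbedding: number[]): string | undefined {
    let best: CacheEntry | undefined;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(promptEmbedding, entry.embedding);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best !== undefined && bestScore >= this.threshold ? best.response : undefined;
  }

  store(promptEmbedding: number[], response: string): void {
    this.entries.push({ embedding: promptEmbedding, response });
  }
}

// Hypothetical embeddings standing in for real model output.
const promptCache = new SemanticCache(0.95);
promptCache.store([0.9, 0.1, 0.0], "Our refund window is 30 days.");

console.log(promptCache.lookup([0.88, 0.12, 0.01])); // near-identical prompt: cache hit
console.log(promptCache.lookup([0.0, 0.2, 0.9]));    // unrelated prompt: cache miss
```

A threshold that is too low returns stale or wrong answers for prompts that only look alike, so it is worth measuring hit quality before relying on the cache.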


It’s important to validate the output of LLMs to make sure that it is both syntactically and semantically correct. Semantically correct means that the text output is free from harmful content, accurate, and factual. Syntactically correct means that the output conforms to a certain machine-readable schema, whether it’s JSON, XML, TypeScript, etc.


The basis of syntactic guard-railing is asking the model to reply within a specific schema:

You are a service that translates user requests into JSON objects of type "SentimentResponse" according to the following TypeScript definitions:

export interface SentimentResponse {
  sentiment: "negative" | "neutral" | "positive"; // The sentiment of the text
}

The following is a user request:
hello, world
The following is the user request translated into a JSON object with 2 spaces of indentation and no properties with the value undefined:

{
  "sentiment": "neutral"
}

In the example above, generated by TypeChat, we’re constraining the output to a schema defined by the TypeScript interface “SentimentResponse”. This way you can parse the output (JSON in this case) and perform all kinds of validations (URLs, e-mail addresses, etc.) and transformations. You can use validation libraries, like Zod. If validation fails, you can even return to the LLM with the error and ask it to fix the output.
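The parse, validate, and repair loop described above can be sketched without any library. This is not TypeChat's or Zod's implementation: `parseSentiment`, `askWithRepair`, the `Llm` type, and the wording of the repair prompt are all hypothetical, and the model is replaced by a stub that fails once before answering correctly.

```typescript
interface SentimentResponse {
  sentiment: "negative" | "neutral" | "positive";
}

// Hypothetical signature for "send a prompt, get the raw text back".
type Llm = (prompt: string) => Promise<string>;

// Parse the raw model output and check it against the schema; throw on any mismatch.
function parseSentiment(raw: string): SentimentResponse {
  const data: unknown = JSON.parse(raw); // throws SyntaxError on invalid JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("Expected a JSON object");
  }
  const sentiment = (data as Record<string, unknown>).sentiment;
  if (sentiment === "negative" || sentiment === "neutral" || sentiment === "positive") {
    return { sentiment };
  }
  throw new Error('Property "sentiment" must be "negative", "neutral" or "positive"');
}

// Ask the model, and on invalid output send the error back so it can fix its reply.
async function askWithRepair(llm: Llm, prompt: string, maxAttempts: number = 3): Promise<SentimentResponse> {
  let currentPrompt = prompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await llm(currentPrompt);
    try {
      return parseSentiment(raw);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      currentPrompt = `${prompt}\n\nYour previous reply was:\n${raw}\nIt was rejected: ${message}\nReply again with valid JSON only.`;
    }
  }
  throw new Error(`No valid output after ${maxAttempts} attempts`);
}

// Stub model for the example: invalid output first, valid JSON afterwards.
let stubCalls = 0;
const flakyLlm: Llm = async () => {
  stubCalls += 1;
  return stubCalls === 1 ? "sentiment: neutral" : '{ "sentiment": "neutral" }';
};

askWithRepair(flakyLlm, "hello, world").then((result) => console.log(result.sentiment));
```

Capping the number of attempts matters: a model that keeps producing invalid output should surface as an error rather than loop forever.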

There are many open-source libraries that support syntactic guard-railing:

…and so many more

Additionally, llama.cpp recently added grammar-based sampling support. With it, you can author GBNF files, a Backus-Naur-style notation for defining a context-free grammar, so that the model can only produce output that matches the grammar. A library like gbnfgen can help you generate grammars directly from TypeScript interfaces.


On the other hand, you might also need to semantically guardrail your outputs. Make sure that the output:

  • has no harmful or inappropriate content
  • is factual and relevant to the input

NVIDIA’s NeMo Guardrails is one option that lets you:

  • prevent the model from engaging in discussions on unwanted topics
  • steer the model to follow pre-defined conversational paths and enforce standard operating procedures (e.g., authentication, support)

You can also go back to old-school methods, like:

Or chain the output to another LLM, ideally a better model, and ask it to classify it.
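The chaining approach can be sketched as a gate between the first model's draft and the user. Everything here is hypothetical: the `Classifier` type, the "safe" / "unsafe" labels, and the fallback message are assumptions, and the classifier is a keyword stub standing in for a call to a second model.

```typescript
type Verdict = "safe" | "unsafe";

// Hypothetical signature for "ask a second model to label this text".
type Classifier = (text: string) => Promise<Verdict>;

// Only release the draft if the classifier accepts it; otherwise return a fallback.
async function guardedReply(
  draft: string,
  classify: Classifier,
  fallback: string = "Sorry, I can't help with that."
): Promise<string> {
  const verdict = await classify(draft);
  return verdict === "safe" ? draft : fallback;
}

// Stub standing in for the second model; a real one would be prompted to classify the text.
const stubClassifier: Classifier = async (text) =>
  text.toLowerCase().includes("password") ? "unsafe" : "safe";

guardedReply("The admin password is hunter2", stubClassifier).then((reply) => console.log(reply));
guardedReply("Our office opens at 9am", stubClassifier).then((reply) => console.log(reply));
```

Keeping the classifier behind a function type makes it easy to swap a cheap rule-based check for a model call, or to run both.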

Last, but not least, it’s also worth paying attention to your system prompt. It can have a strong impact on the alignment of the answers the model gives. For instance, read the breakdown of the Claude 3 system prompt. You can even give examples of behaviour in the system prompt.

Just the tip of the iceberg

We have only scratched the surface of embeddings and guard-railing, and there are many other building blocks in the GenAI ecosystem.

While embeddings stand as one of the most important concepts in GenAI, their impact extends beyond its boundaries. Guard-railing, either syntactic or semantic, is likewise an essential concept and technique to master in serious GenAI production systems.

Above all, we strive to break those concepts apart so that we can develop applications that bring value to customers: all types of applications, not just chat and bots.

Semantics and syntax in GenAI applications
was originally published in YLD Blog on Medium.