Building Applied AI Use Cases In Your Product

What is the benefit of adding AI functionality within your product? 

Companies that delay incorporating AI will be at a competitive disadvantage in the (near) future – most high-performing companies are shipping AI into products quickly, and that initial advantage will be compounded as new AI features create competitive moats that make it harder for slow adopters to catch up.

Effectively implementing AI accelerates growth and creates a stickier customer experience – there are 3 categories of companies competing in the marketplace of AI-powered products:

  • AI native – companies that have emerged within the last 1-2 years lean heavily on using AI directly in products, have small teams, and operate efficiently. For example, Cursor was able to generate $100MM in revenue with just 30 people.  
  • Enterprise – companies with huge platforms and distribution moved quickly to implement AI, contradicting the conventional wisdom that very large companies are slower to change. Their scale and speed of adoption present a meaningful threat to companies that have yet to figure out their own AI product strategies.
  • Everyone in between – companies with established revenue, platforms, and customers are already starting to prove that implementing AI can significantly accelerate growth.

Project Scoping

What are the common blockers that companies should be aware of when trying to identify, scope, and pursue AI use cases? 

Leadership’s lack of LLM fluency can limit a company’s ability to identify AI use cases – most CTOs and CEOs gained their experience before LLMs existed (i.e., more than two years ago). Leaders who aren’t comfortable with LLMs are less likely to push AI features to ship or to explore innovative AI applications (this is why so many companies created chatbots).

Identifying AI use cases is a skill – you can’t get better at understanding the types of problems AI can help solve if you don’t use LLMs yourself. Use AI-powered tools to get things shipped to production quickly and learn which approaches do and don’t work.

What types of use cases make for strong AI applications? How should you approach identifying and selecting an application within your organization? 

To select your first use case:

  • Determine where customers spend a lot of time on your platform, and whether you have good data around any of those uses.
  • Choose something that you can finish within a month.

Start with time-saving applications where customers will instantly perceive the value – look for areas where customers spend significant time manually filling in data, summarizing information, or categorizing items. For example, if your customers collectively spend 1,000s of hours per week classifying accounting transactions, creating a classification autofill feature could provide an immediate benefit.

Choose scenarios where partial automation is still useful – continuing with the accounting classification example, even partial automation (e.g., it can auto-populate classifications 70% of the time) dramatically reduces manual work.

Avoid use cases that require 100% accuracy and automation to be valuable – futuristic applications like “AI lawyers” or “auto-accountants” that can autonomously create legal documents or file taxes are highly unrealistic because their impact and nuance require human oversight. Prioritize situations where automation helps even when it is occasionally wrong, a risk mitigated by having a human review and correct any errors. For example, misclassifying a small percentage of transactions is a correctable error that doesn’t trigger critical consequences.
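
To make the partial-automation pattern above concrete, here is a minimal sketch of routing model output based on confidence; the `Classification` type, the 0.8 threshold, and the example transactions are hypothetical, not from the source.

```python
from dataclasses import dataclass

@dataclass
class Classification:
    label: str          # e.g., "Meals & Entertainment"
    confidence: float   # model-reported or calibrated score between 0 and 1

AUTO_ACCEPT_THRESHOLD = 0.8  # hypothetical cutoff; tune it against your own evals

def route_transaction(txn: str, prediction: Classification) -> dict:
    """Auto-fill the classification when the model is confident,
    otherwise queue the transaction for human review."""
    if prediction.confidence >= AUTO_ACCEPT_THRESHOLD:
        return {"transaction": txn, "label": prediction.label, "source": "auto"}
    return {"transaction": txn, "label": None, "source": "needs_review"}

# Even if only ~70% of transactions clear the threshold, the rest fall back to
# the accountant, so partial automation still saves significant manual work.
print(route_transaction("DAIRY QUEEN #1234", Classification("Meals & Entertainment", 0.92)))
print(route_transaction("WIRE TRANSFER 8891", Classification("Owner Draw", 0.55)))
```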

Look for opportunities to simplify existing workflows instead of replacing them entirely – the most effective AI implementations enhance rather than replace human capabilities by allowing humans to focus on higher-value activities.

How long should it take to build AI into your product?

Scope your first AI feature to 30 days – use a tight timeline as a forcing function to create something functional and gather feedback as quickly as possible.

Avoid scope creep and over-promising – companies set themselves up for failure by trying to do too much at once, especially when leadership feels pressured to roll out AI features at the same pace as competitors. Scoping timelines are already difficult to estimate when you are using a new type of tool (e.g., an LLM) for the first time; adding on additional features can extend timelines dramatically, to the point where a project could be abandoned or never shipped properly.

Build end-to-end connections so you are ready to ship early on – instead of building siloed components in isolation and in sequence, connect the LLM to the endpoint immediately. Add pieces as you go, knowing that you can ship at any time while still iterating on the parts that require more work.

Build evals first – create tests for the entire system upfront. Evals serve as a rubric for how you will grade what the LLM produces. Once you generate the rubric, build a dataset to test against it and adjust your system accordingly instead of waiting to test at the end.
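
A minimal sketch of what “evals first” can look like in practice: the rubric lives in a small labeled dataset and a grading function that runs on every change. The example cases and the `classify` stub are hypothetical placeholders for your own data and LLM call.

```python
# Hypothetical evals-first sketch: the rubric is encoded as expected outputs,
# and the same harness runs after every prompt or model change.

EVAL_CASES = [
    {"input": "STARBUCKS #221", "expected": "Meals & Entertainment"},
    {"input": "AWS MONTHLY BILL", "expected": "Software & Infrastructure"},
    {"input": "DELTA AIR 0042", "expected": "Travel"},
]

def classify(text: str) -> str:
    """Placeholder for the real LLM call; swap in your provider's client here."""
    return "Travel" if "AIR" in text else "Meals & Entertainment"

def run_evals() -> float:
    correct = sum(classify(c["input"]) == c["expected"] for c in EVAL_CASES)
    return correct / len(EVAL_CASES)

if __name__ == "__main__":
    print(f"accuracy: {run_evals():.0%}")  # re-run on every iteration of the system
```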

How does your data state affect your approach to implementing AI?

How AI use cases can be pursued with various levels of historical data
  • Established product and significant data – use customer data, usage data, and data from high-use workflows to identify possible features to incorporate into those workflows, and conduct exploration against your existing data.
  • Start-up with a new product and no customer usage data – use LLMs to generate synthetic data and build your system around that data. Once the system performs well, ship it to production knowing that it already works for most use cases.

Note: the ability to create synthetic data significantly expedites the production process – teams can now avoid paying for months of data collection or running surveys just to get enough data to train a model.
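
One way this can look in code, assuming the OpenAI Python SDK and an illustrative model name (any provider with a chat API works similarly); the prompt and fields are hypothetical:

```python
# Hypothetical sketch: use an LLM to generate synthetic rows before any real
# customer data exists, then build and evaluate the system against them.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Generate 20 realistic small-business bank transaction descriptions as a "
    "JSON array of objects with 'description' and 'category' fields. "
    "Vary merchants, amounts, and categories. Return only the JSON."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": PROMPT}],
)

# In practice, validate the output defensively; models sometimes wrap JSON in prose.
synthetic_rows = json.loads(response.choices[0].message.content)
print(f"generated {len(synthetic_rows)} synthetic transactions")
```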

What kinds of skill sets are required to build AI use cases into your product?

Applied AI engineers need a generalist skill set rather than deep specialization – over the past 10-15 years, the industry moved toward specialized engineering roles, but AI development requires knowledge across multiple disciplines, which makes these roles difficult to hire for.

Look for engineers with a breadth of knowledge across areas such as:

  • Infrastructure and DevOps – understanding cloud infrastructure, deployment, and operational concerns is crucial for building robust AI systems.
  • Backend engineering – engineers must be comfortable using retries, rate limiting, queues, and asynchronous processing (a minimal retry sketch follows this list).
  • Machine learning engineering and data science – while deep ML experience isn’t necessary for all applications, engineers benefit from having a functional understanding of how models work, which levers can be adjusted, and the different approaches they can take to improve results.
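
As referenced above, a minimal retry-with-backoff sketch; `call_llm` is a hypothetical stand-in for a real provider client, and the backoff parameters are illustrative.

```python
import random
import time

def call_llm(prompt: str) -> str:
    """Placeholder for a real provider call; raises to simulate a transient failure."""
    raise TimeoutError("simulated transient failure")

def call_with_retries(prompt: str, max_attempts: int = 5) -> str:
    for attempt in range(max_attempts):
        try:
            return call_llm(prompt)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff with jitter helps stay under provider rate limits.
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError("unreachable")
```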

A growth mindset drives continuous improvement – experimentation, feature flagging, data exploration, and SQL analysis are necessary to iterate effectively and explore new tools. Engineers who aren’t comfortable with uncertainty or who dislike adopting new tools will struggle compared to those who take a more dynamic approach.

Good product sense can make or break an AI project – beyond technical skills, engineers need to understand what they’re optimizing for (and why). This allows them to make better decisions about which problems AI can solve and how to prioritize different types of solutions.

What are the steps/phases to building out an applied AI use case? 

1. Understand and scope the problem – determine the persona you’re building for, what they need, and what success looks like. Use this understanding of success to set appropriate boundaries on the implementation and limit scope creep.

2. Assess what’s possible from a data standpoint – you can’t build a system until you know what you’re working with. Explore how you can segment and chunk your data, and examine data quality, quirks, structure, and edge cases. You will waste time and money down the line if you misjudge the quality or quantity of your data.

3. Use prompt engineering to quickly test feasibility before beginning infrastructure work – use small datasets to conduct “vibe checks” on OpenAI or Anthropic. Prompting is a low-lift way to experiment, and you can often develop much of a solution through prompting alone.

4. Confirm your initial scoping – test against real data to verify whether your time estimates are realistic before you commit resources. Rescope or adjust expectations now to prevent sunk cost fallacy later in the project.

5. Establish evaluation frameworks – set up your testing framework once you know your system’s inputs and outputs. Your approach should depend on the technique and model you use:

  • For text responses, use LLM-as-a-judge and begin with unit tests for simple tasks like classification.
  • If you use RAG, start with measurements around recall (e.g., the number of relevant chunks retrieved divided by the total number of relevant chunks available; see the sketch after these steps).

6. Continue iterating based on eval performance – you will probably need to refine multiple parts of the system.

Note: your goal is to be able to go through this process faster over time – train more people on your team on this process and provide opportunities for engineers to get better at identifying which areas in a system have the most potential for improvement.
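
Referring back to step 5, a small sketch of the recall measurement for RAG; the chunk IDs are hypothetical.

```python
# Recall = relevant chunks retrieved / total relevant chunks available.

def retrieval_recall(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    if not relevant_ids:
        return 1.0  # nothing relevant existed, so nothing was missed
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)

# The retriever returned 3 chunks, 2 of which are among the 4 labeled as relevant.
print(retrieval_recall({"c1", "c2", "c9"}, {"c1", "c2", "c5", "c7"}))  # 0.5
```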

Model Selection

What are the different categories of architectures that you can choose from? What are the pros and cons of each?

Approaches to building use cases with LLMs
  • RAG – the most common architecture: store data in a database, retrieve relevant information based on queries, and provide that context to the LLM alongside the prompt.
      Pro – leverages existing data without requiring model training
  • Structured output generation – handles the critical interface between AI and your systems, ensuring that the LLM produces outputs in a format your systems can use.
      Pro – this functionality is essential for production applications
      Con – getting a model to generate the right data structure requires careful design and iteration
  • Fine-tuning – takes existing models (typically open-source) and trains them with proprietary data to improve outputs for specific use cases.
      Pro – enables powerful customization
      Con – iteration is difficult, time consuming, and requires specialized knowledge
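
To make the structured output generation approach above concrete, here is a minimal sketch that validates model output against a schema before downstream systems consume it; it assumes pydantic v2, and the schema and raw output string are hypothetical.

```python
from pydantic import BaseModel, ValidationError

class TransactionLabel(BaseModel):
    description: str
    category: str
    confidence: float

# Stand-in for the raw text an LLM returned after being prompted for JSON.
RAW_MODEL_OUTPUT = (
    '{"description": "DAIRY QUEEN #1234", '
    '"category": "Meals & Entertainment", "confidence": 0.92}'
)

try:
    label = TransactionLabel.model_validate_json(RAW_MODEL_OUTPUT)
except ValidationError:
    label = None  # reject and retry the generation, or fall back to human review

print(label)
```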

Note: while it is more complicated, fine-tuning trains an LLM on your private data, which can become a competitive advantage.

Selecting models based on data privacy
  • Closed-source (e.g., Anthropic, OpenAI, Gemini, Nova) – provides strong performance with simple pay-per-use pricing.
      Pros – convenience, higher effectiveness, and technology that is months or years ahead of open-source models
      Cons – offers less control and requires sending data outside your infrastructure, which may raise privacy concerns
  • Open-source (e.g., Llama) – offers a more controlled experience, but users must manage their own GPU infrastructure.
      Pros – data never leaves your environment, which is necessary for healthcare and other sensitive data
      Cons – typically performs slightly worse than closed-source models and requires infrastructure management

When do you need to keep a human in the loop? 

Most companies benefit from starting with the “human in the loop” approach and reducing oversight as quality improves – it is easier and less risky to build a solution that incorporates human validation. As accuracy increases over time, experiment with gradually reducing human intervention. Once you can consistently achieve very high accuracy rates (e.g., 98-99%), make a business decision regarding whether it is worth it for the solution to become entirely automated.

Note: there are competing belief systems on this topic.

  • Keep a human in the loop – AI generates drafts that humans approve and/or modify. For example, tools like Cursor keep humans in the loop by creating recommended solutions that the user can adjust.
  • Full automation – humans are not involved, and AI agents operate entirely independently. For example, Cognition Labs is building Devin, a software engineer agent intended to act autonomously and do the same work as a human software engineer.

The ideal amount of human involvement might vary by feature within a single product – instead of making a binary decision for your entire product, consider whether different capabilities warrant different levels of automation based on their criticality and how reliable the system is in each area.

Consider offering configurable levels of automation to meet different user needs – for some users, error risk is worth the increased efficiency. For others, oversight is considered critical. For example, if your feature can automatically classify accounting transactions, you could offer an “auto-accept” option that users can toggle on or off based on their preferences.
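
A small sketch of a per-user auto-accept toggle as described above; the 0.9 threshold and field names are hypothetical.

```python
# The same model output is either applied automatically or held for review,
# depending on the user's preference (and, optionally, model confidence).
def apply_classification(prediction: dict, auto_accept_enabled: bool) -> dict:
    if auto_accept_enabled and prediction["confidence"] >= 0.9:
        return {**prediction, "status": "applied"}
    return {**prediction, "status": "pending_review"}

print(apply_classification({"category": "Travel", "confidence": 0.95}, auto_accept_enabled=True))
print(apply_classification({"category": "Travel", "confidence": 0.95}, auto_accept_enabled=False))
```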

Training/Data Pipeline 

What are the key considerations when deciding how to train your AI implementation on the proper context? 

Most implementations don’t require true model training, just effective data retrieval – unless you’re fine-tuning a model (or building one from scratch), the most important factor is how you structure, store, and retrieve data.

Retrieval quality is influenced by your choice of embedding model – providers like Cohere, OpenAI, and Anthropic have multi-modal models, but each model has unique strengths and is not equally suited to handle every use case.

How you chunk data affects retrieval relevance – be thoughtful about how you categorize and chunk different types of information. For example, if you want to ask, “was the customer generally happy over the last month?” you must be able to filter through chunks from the last month. Similarly, an agent that provides meeting notes must be able to parse information by length, topic, speaker, tone, etc. Experiment with different approaches to find what works best for your specific content and queries.

Metadata enrichment enables more precise filtering and search – store relevant metadata alongside embeddings to enable more targeted searches. For example, storing speaker information alongside meeting transcripts allows you to make queries like, “what did John say in the last meeting?”.
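
A toy sketch of metadata-aware retrieval along the lines described above: embeddings are stored next to metadata such as speaker and month, so queries can filter before ranking by similarity. The embeddings, helper names, and data are hypothetical.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings; in practice these come from your embedding model.
CHUNKS = [
    {"text": "John: we should ship the beta on Friday", "speaker": "John",
     "month": "2024-05", "embedding": np.array([0.9, 0.1, 0.0])},
    {"text": "Maria: the budget review is next week", "speaker": "Maria",
     "month": "2024-05", "embedding": np.array([0.1, 0.8, 0.1])},
]

def search(query_emb: np.ndarray, speaker: str | None = None) -> list[dict]:
    # Filter on metadata first, then rank the remaining chunks by similarity.
    candidates = [c for c in CHUNKS if speaker is None or c["speaker"] == speaker]
    return sorted(candidates, key=lambda c: cosine(query_emb, c["embedding"]), reverse=True)

# "What did John say in the last meeting?" -> filter by speaker, then rank.
print(search(np.array([1.0, 0.0, 0.0]), speaker="John")[0]["text"])
```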

Reranking and hybrid retrieval can improve results – rather than relying on a single search strategy, complex use cases may benefit from searching by a variety of criteria (e.g., recent items, text similarity, etc.), which are used to rerank data and surface the most relevant information.
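
One common way to combine multiple search strategies is reciprocal rank fusion; the source does not name a specific method, so treat this as one illustrative option, and the document IDs are hypothetical.

```python
# Fuse rankings from different strategies (e.g., text similarity and recency)
# into a single ordering using reciprocal rank fusion (RRF).
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

by_similarity = ["doc_a", "doc_c", "doc_b"]
by_recency = ["doc_b", "doc_a", "doc_c"]
print(rrf([by_similarity, by_recency]))  # fused ordering used to pick the final context
```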

Note: while not part of the data architecture, prompt choice has a significant impact on how a model retrieves data – in parallel with these other techniques, refine your prompts so the model surfaces more relevant information and generates outputs structured the way you need them.

How should you approach collecting, cleaning, and labeling the training data needed for your AI features?

Store extra columns of data with clear labels – engineers often embed data without taking the time to structure it and store the columns that allow the system to use that data the way it’s intended.

Relabeling data after upload creates significant extra work – backfilling labels sets projects behind and can have downstream implications for other pieces that were progressing in parallel. Invest time up front to ensure that you’ve identified all inputs and labeled the data accordingly.

What data pipeline architectures do you recommend for continuously improving AI models with new customer data?

Handle new data inputs differently for each technique:

  • RAG – RAG systems easily incorporate new data as long as it is validated and properly formatted. Once new data enters the system, it can be searched against and retrieved for future queries. For example, an AI tool might not know how to classify a Dairy Queen transaction the first time it encounters one, but once an accountant classifies it, the classification becomes available for future similar transactions (see the sketch after this list).
  • Fine-tuning – be careful with versioning fine-tuned models. Instead of adding new data incrementally, you typically need to retrain the model from scratch to avoid overfitting. Updates require a complete retraining process.
  • Direct Preference Optimization (DPO) – this newer approach presents 2 answers to the same question and learns which is better based on user preference. ChatGPT uses DPO when it provides multiple options and asks about your preference.
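
As referenced in the RAG bullet, a hypothetical sketch of how a confirmed classification gets folded back into the retrieval store; the `embed` placeholder and in-memory list stand in for a real embedding model and vector database.

```python
KNOWLEDGE_STORE: list[dict] = []  # stand-in for a vector database

def embed(text: str) -> list[float]:
    # Placeholder embedding; use a real embedding model in practice.
    return [float(len(text)), float(text.count(" "))]

def add_confirmed_classification(description: str, category: str) -> None:
    """Store a human-validated label so future similar transactions can retrieve it."""
    KNOWLEDGE_STORE.append({
        "description": description,
        "category": category,
        "embedding": embed(description),
    })

add_confirmed_classification("DAIRY QUEEN #1234", "Meals & Entertainment")
# The next "DAIRY QUEEN" transaction can now retrieve this example as context.
```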

Deployment and Maintenance

What approach should you take to handle edge cases and unexpected inputs in your AI system?

Create strong guardrails for what the LLM can and can’t do – if you can anticipate common or high-risk edge cases, design a solution that is incapable of derailing in response to certain requests.

Start with use cases that are less likely to have edge cases – open-ended applications like chatbots are a challenging first AI use case because they can receive an infinite range of inputs. Mitigate edge case risk by choosing a more constrained application, such as an internal-facing tool or a classifier dealing with a limited selection of inputs.

Establish strong monitoring systems that rapidly identify and flag problems – expect edge cases to occur and have a response plan in place. Your monitoring should alert you to unexpected behaviors or outputs so you can address them as quickly as possible.

Deploy internally before exposing the feature to customers – roll out your AI feature internally and compare its outputs against human decisions before exposing it to customers. This helps reveal edge cases and performance issues before they affect users.

Some edge cases can be resolved with prompt engineering – include a validation step that calls the LLM and determines whether a user query fits within the conversation flow. If it risks prompting the model to go off track, you can automatically have the LLM steer the conversation back to a query within its parameters.
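
A hedged sketch of the validation step described above, using the OpenAI Python SDK with an illustrative model name; the scope definition, fallback message, and downstream handler are hypothetical.

```python
from openai import OpenAI

client = OpenAI()
SCOPE = "classifying accounting transactions"

def is_in_scope(user_query: str) -> bool:
    """Cheap pre-check: ask the LLM whether the query fits the supported flow."""
    check = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": f"Answer YES or NO only. Is this query about {SCOPE}?\n\n{user_query}",
        }],
    )
    return check.choices[0].message.content.strip().upper().startswith("YES")

def answer_query(user_query: str) -> str:
    return "..."  # the normal conversation flow continues here

def handle(user_query: str) -> str:
    if not is_in_scope(user_query):
        # Steer the conversation back to a query within the system's parameters.
        return "I can only help with transaction classification. What would you like to classify?"
    return answer_query(user_query)
```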

What kinds of ongoing maintenance do you need to keep up as models improve? 

Regularly test new models against your eval framework – models are constantly changing, so it is important to test them regularly to see if it makes sense to switch providers. If multiple models provide satisfactory results, your choice comes down to cost.
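
A tiny sketch of folding model comparison into the same eval harness; the model names and the `run_evals` stub are hypothetical.

```python
CANDIDATE_MODELS = ["current-production-model", "new-provider-model"]

def run_evals(model_name: str) -> float:
    # Stand-in for a real eval run against your labeled dataset.
    return 0.93 if "new" in model_name else 0.91

results = {m: run_evals(m) for m in CANDIDATE_MODELS}
best = max(results, key=lambda m: results[m])
# If multiple models clear your quality bar, decide on cost rather than score alone.
print(results, "-> strongest candidate:", best)
```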

Keep track of model-specific “hacks” when switching models – many implementations include prompt engineering tricks that work well for particular models but may negatively impact performance with others. You will likely need to change, remove, and create new hacks when you switch models.

Consider the infrastructure implications of model updates – if you host the model yourself, new versions may have different resource requirements or compatibility needs.
