You can vibe code a demo, but what about a product?

Lessons from shipping generative AI products to production.

By Ivy Si

June 23, 2026

Estimated reading time: 6 minutes

Key takeaways:

  • Anyone can vibe code a demo. Almost no one is ready for production. Infrastructure, safety, compliance, and eval are where most teams quietly fail.
  • Your inference bill will make or break you.
  • The model is the easy part. Product-market fit, compliance, and quality eval all have to be built by humans.

Foundation generative AI models are powerful and have made creating demos extremely easy. It is tempting to assume that building on these models is trivial, and that wrapper applications are low-hanging fruit that anyone can ship. I do not think that is the right assumption to make.

There’s a long way between a demo and an actual product, and most of the gap has nothing to do with the model itself. The gap comes from infrastructure, evaluation, safety/compliance, cost, and actually making a good product. 

Finding the product-market-fit is hard

The barrier to building AI applications has collapsed. Everyone can now build a demo fast, so a lot of the same ideas are floating around in the market, while product roadmaps and critical user journeys often haven’t been carefully thought through.

Powerful large language model (LLM) coding capabilities have drastically reduced the technical barrier for making a tech product. Anyone can build a program/demo by describing its features in natural language to ‘vibe coding’ tools. However, the product-market-fit question has only become more and more important.

Defining a clear product roadmap is harder than it looks. Teams often underestimate this because the technology feels so capable, but capability is not the same as value. The question isn’t “can the model do this?” but “does this solve a real problem better than the alternatives, and will users pay for it?”

The 99-step rule 

The 99-step rule is my term for one of the foundational principles of finding the product-market-fit. Users don’t want to do 10 steps while the generative AI application does 90. They want the product to do 99 steps, if not all 100. Generative AI products, especially consumer-facing ones, must deliver explicit end-to-end value.

You cannot ship half-baked workflows and expect users to make up the difference with their own effort. A good analogy is food delivery: users expect the value to arrive right at their door, with carefully prepared and packaged food, utensils, side dishes, napkins, and maybe hand sanitizer wipes.

If the generative AI product doesn’t make things easier but instead adds to users’ cognitive load, you are moving away from the product-market-fit.

Inference economics

The financial viability of a generative AI product hinges on inference economics. Most of the inference bill is incurred in production, not while building the demo.

There are two paths. The first is to call generative AI models hosted by other companies through their application programming interfaces (APIs). OpenAI, Anthropic, and Google all offer APIs for their generative AI models across different modalities. This is the right choice for most teams most of the time: it’s fast to build a demo and fast to iterate, the cost is lower, there is no infrastructure to build, and the pricing is clear enough that the team can set a spending cap.

The second option is to self-host an open-source model. There are many open-source foundational models across parameter sizes (from hundreds of millions to hundreds of billions) and modalities (image, text, embedding, VLM, voice to text, etc.), released by different companies (Gemma from Google, Llama from Meta, etc.).

You get the flexibility to choose your own serving stack, plus the freedom to pick smaller models and develop your product on your local machine (at first) with no cost beyond the electricity bill. However, this is a much heavier lift than calling an API, as it requires computing resources and engineering talent. It’s mostly not feasible for small startups.
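To make the trade-off between the two paths concrete, here is a back-of-the-envelope sketch. Every number in it is a made-up placeholder rather than a vendor quote, and the function names are hypothetical; plug in your own prices, token counts, and staffing costs.

```python
def monthly_api_cost(requests_per_month, input_tokens, output_tokens,
                     usd_per_million_input, usd_per_million_output):
    """Estimated monthly bill for a hosted API priced per million tokens."""
    per_request = (input_tokens * usd_per_million_input
                   + output_tokens * usd_per_million_output) / 1_000_000
    return requests_per_month * per_request


def monthly_self_host_cost(gpu_count, usd_per_gpu_hour,
                           engineering_usd_per_month, hours_per_month=730):
    """Estimated monthly cost of always-on self-hosting: GPUs plus people."""
    return gpu_count * usd_per_gpu_hour * hours_per_month + engineering_usd_per_month


def break_even_requests(self_host_monthly_usd, api_usd_per_request):
    """Monthly request volume above which self-hosting becomes cheaper."""
    return self_host_monthly_usd / api_usd_per_request


# Placeholder inputs (assumptions for illustration only).
api_bill = monthly_api_cost(100_000, input_tokens=1_000, output_tokens=500,
                            usd_per_million_input=1.0, usd_per_million_output=4.0)
self_host_bill = monthly_self_host_cost(gpu_count=2, usd_per_gpu_hour=2.0,
                                        engineering_usd_per_month=10_000)
print(f"API: ${api_bill:,.0f}/month, self-hosted: ${self_host_bill:,.0f}/month")
print(f"Break-even: {break_even_requests(self_host_bill, api_bill / 100_000):,.0f} requests/month")
```

The engineering line item is the one teams tend to leave out, and with these placeholder inputs it dominates the self-hosted total.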

Fine-tuning vs. foundational model advancement 

Fine-tuning is the process of further training a model on your data (sometimes your own proprietary data) so that it performs better at specific tasks, such as coding, or in specific verticals.

Fine-tuning is tempting and sometimes worth it. However, in most cases, fine-tuning is neither needed nor preferred. There are two main reasons why.

First, new generations of foundational models arrive quickly, and each one makes substantial gains in the quality ‘hill climbing’. It’s very likely that the next foundational model to be released will outperform the fine-tuned previous generation, even in the specific verticals that the older model was fine-tuned for. Second, fine-tuning requires good-quality data for both training and grounded eval, and collecting good data is expensive in itself.

Safety, regulations, and compliance 

Generative AI models are capable, but they are only beneficial when used correctly. This is why many regions have published their own laws regulating the use of generative AI models. The US, the UK, the European Union (EU), and countries across Asia all have different laws. If you expect to ship the product in more than one country, you are likely working across multiple legal regimes.

These regulations affect what features you can expose, where data is allowed to be stored, and for how long, which in turn affects engineering decisions. These are important to figure out early on, during the product and engineering roadmapping phase, rather than at the last minute before the product launch.

Failing to do so can lead to social harm (imagine generative AI models being misused to generate harmful speech or hateful content), as well as lawsuits and other legal consequences.

The solution is to first understand the regulation, strictly follow it, and then take specific steps to mitigate the risks of generative AI misuse. In particular, you should implement safety guardrails and safety evals. Safety guardrails reject unsafe requests and block unsafe responses, whether the content is a text query, a user-uploaded image, or AI-generated text, images, or video.
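One way these constraints show up in engineering is as per-region configuration that the application consults before exposing a feature or storing data. The sketch below is illustrative only: the region names, retention periods, and feature lists are invented placeholders, not a statement of what any law requires, and `RegionPolicy` is a hypothetical type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionPolicy:
    storage_region: str          # where user data must be stored
    retention_days: int          # how long prompts and logs may be kept
    enabled_features: frozenset  # features that may be exposed in this region


# Placeholder values; real ones come from legal review, not from this sketch.
POLICIES = {
    "region_a": RegionPolicy("region_a_datacenter", 30, frozenset({"chat", "image_upload"})),
    "region_b": RegionPolicy("region_b_datacenter", 90, frozenset({"chat"})),
}


def is_feature_allowed(region: str, feature: str) -> bool:
    """Deny by default: unknown regions or features are never exposed."""
    policy = POLICIES.get(region)
    return policy is not None and feature in policy.enabled_features
```

Keeping the policy in one place means a launch in a new country becomes a reviewed configuration change rather than a scattered set of conditionals.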

After implementing safety guardrails, the team should conduct safety evals to make sure that the system is actually blocking both unsafe requests from users and unsafe content before it is returned to them. In addition, logging should be implemented to capture requests, responses, and bad actors.
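A minimal sketch of how these pieces fit together is shown below, assuming a text-only product. The keyword check in `toy_is_unsafe` is a deliberately naive stand-in for a real moderation model or service, `toy_generate` stands in for the actual model call, and all names here are hypothetical.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("guardrails")
REFUSAL = "Sorry, I can't help with that request."


def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     is_unsafe: Callable[[str], bool],
                     user_id: str = "anonymous") -> str:
    """Check the request before the model call and the response after it."""
    if is_unsafe(prompt):
        logger.warning("blocked request user=%s", user_id)
        return REFUSAL
    response = generate(prompt)
    if is_unsafe(response):
        logger.warning("blocked response user=%s", user_id)
        return REFUSAL
    logger.info("served user=%s", user_id)
    return response


def safety_eval(unsafe_prompts: Sequence[str],
                generate: Callable[[str], str],
                is_unsafe: Callable[[str], bool]) -> float:
    """Fraction of known-unsafe prompts that the guarded system refuses."""
    blocked = sum(guarded_generate(p, generate, is_unsafe) == REFUSAL
                  for p in unsafe_prompts)
    return blocked / len(unsafe_prompts)


def toy_is_unsafe(text: str) -> bool:
    """Toy classifier (assumption): flags text containing a marker word."""
    return "forbidden" in text.lower()


def toy_generate(prompt: str) -> str:
    """Toy model (assumption): echoes the prompt back."""
    return "echo: " + prompt
```

The safety eval is just the guardrail run against a curated list of requests that must be refused; a block rate below 1.0 on that list is a finding to investigate before launch.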

Quality evaluation 

Quality evaluation is another critical aspect of generative AI applications because models are still stochastic, which means the same input might produce different outputs of differing quality. This is not a bug per se, but it does mean that the system needs quality measurement in place to catch regressions.

For your generative AI use case, you need to create your own golden set. The golden set should contain curated inputs and, for each one, the range of expected or acceptable outputs. Think of this set as a smoke test that the system should pass in full. A failure on any case signals a quality regression.
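As a rough sketch, a golden set can be as simple as a list of inputs paired with a check that encodes the acceptable range of outputs. The cases and the `fake_model` function below are hypothetical examples; in practice `generate` would call your real system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    name: str
    prompt: str
    is_acceptable: Callable[[str], bool]  # encodes the range of acceptable outputs


def run_golden_set(cases: List[GoldenCase],
                   generate: Callable[[str], str]) -> List[str]:
    """Return names of failing cases; an empty list means the smoke test passed."""
    return [c.name for c in cases if not c.is_acceptable(generate(c.prompt))]


# Hypothetical cases for illustration.
GOLDEN_SET = [
    GoldenCase("greeting", "Say hello to the user.",
               lambda out: "hello" in out.lower()),
    GoldenCase("short_summary", "Summarize: the cat sat on the mat.",
               lambda out: 0 < len(out.split()) <= 20),
]


def fake_model(prompt: str) -> str:
    """Stand-in model that only knows how to greet."""
    return "Hello! How can I help?" if "hello" in prompt else ""


failures = run_golden_set(GOLDEN_SET, fake_model)
print("failing cases:", failures)
```

Because outputs vary from run to run, the checks test properties of the output (contains a keyword, stays within a length bound) rather than exact string equality.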

The complementary approach is to use LLMs as a judge: you give a large model the quality rubric, the input, and the output, and ask it for a verdict on the quality according to that rubric. There is some research that suggests this works, but of course it needs to be tested and tuned for individual use cases.

Piece all of these together and you can set up an evaluation pipeline that runs periodically and pumps the golden set through the system, using LLMs to analyze whether there is a regression. On top of this pipeline, you can implement monitoring and logging that alert humans to investigate when regressions appear.
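A sketch of the judge step might look like the following. The prompt template, the 1 to 5 scale, and the passing threshold are all assumptions chosen for illustration, and `call_judge_model` is a hypothetical callable that wraps whichever model API you use and returns its raw text reply.

```python
import json
from typing import Callable

JUDGE_TEMPLATE = """You are grading an AI response against a rubric.
Rubric: {rubric}
Input: {user_input}
Output: {output}
Reply with JSON only: {{"score": <integer 1-5>, "reason": "<one sentence>"}}"""


def judge(rubric: str, user_input: str, output: str,
          call_judge_model: Callable[[str], str],
          pass_score: int = 4) -> dict:
    """Ask a judge model for a rubric-based verdict and parse it defensively."""
    prompt = JUDGE_TEMPLATE.format(rubric=rubric, user_input=user_input, output=output)
    raw = call_judge_model(prompt)
    try:
        verdict = json.loads(raw)
        score = int(verdict["score"])
    except (ValueError, KeyError, TypeError):
        # Judge models are stochastic too: treat malformed replies as failures.
        return {"passed": False, "score": None, "reason": "unparseable judge reply"}
    return {"passed": score >= pass_score, "score": score,
            "reason": str(verdict.get("reason", ""))}
```

Before trusting the verdicts, compare them against human ratings on a sample of cases and adjust the rubric or threshold until the two agree often enough for your use case.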
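In outline, one run of such a pipeline could be sketched as below. The function and parameter names are hypothetical, `judge_passes` stands in for whatever verdict mechanism you use (an LLM judge or a rule-based check), and the baseline and tolerance values are assumptions to tune; scheduling is left to whatever job runner you already have.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("eval_pipeline")


def run_eval(golden_inputs: Iterable[str],
             generate: Callable[[str], str],
             judge_passes: Callable[[str, str], bool],
             baseline_pass_rate: float,
             tolerance: float = 0.02,
             alert: Callable[[str], None] = logger.error) -> dict:
    """Run each golden input through the model, judge it, and flag regressions."""
    inputs = list(golden_inputs)
    failures = [x for x in inputs if not judge_passes(x, generate(x))]
    pass_rate = 1 - len(failures) / len(inputs)
    # Allow a small tolerance so ordinary run-to-run noise does not page anyone.
    regressed = pass_rate < baseline_pass_rate - tolerance
    if regressed:
        alert(f"quality regression: pass rate {pass_rate:.2%} "
              f"vs baseline {baseline_pass_rate:.2%}; failing inputs: {failures}")
    return {"pass_rate": pass_rate, "regressed": regressed, "failures": failures}
```

Passing `alert` as a parameter keeps the sketch independent of any particular paging or chat tool; by default it simply logs an error for a human to pick up.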

From demo to discipline

Building a demo is easy, but shipping a generative AI product to production means taking care of the underlying infrastructure, safety, quality, product-market-fit, and regulatory constraints.

The winners in the generative AI application space will be the teams that:

  1. Think through the product-market-fit and business model. 
  2. Build multi-layer safety and quality guardrails into the architecture. 
  3. Approach eval as a quantitative science, anchored by golden datasets and rigorous regression testing.
  4. Make deliberate tradeoffs between API access and self-hosted inference.
  5. Design for regional regulatory and compliance constraints rather than a single global deployment.