Why I Am Excited To Build A Dev Platform For Prompt Engineering

Last year, I attempted to start a new chapter of my career by applying for a Senior Prompt Engineering position at Khan Academy with their Khanmigo AI product. Khan Academy's vision of making high-quality tutoring accessible worldwide with Khanmigo deeply resonated with me. I hoped to contribute my experience developing an online learning platform at my first startup, HeatSpring, which I had just sold earlier that year.

Exiting a Startup Feels Good... and Bad

In February of 2023, after nurturing HeatSpring for 17 years into a platform with over $1.3M in annual revenue, 200+ courses, and a community of 100,000+ users, I decided to sell. Starting as a project at Babson College in 2006, HeatSpring had become a significant part of my life. Seventeen years and one successful exit later, I was left unsure what to do next. Yes, I got a nice payout, and an exit is supposed to be every founder's dream, but honestly, selling my first company kind of sucked, and I was left feeling depressed and hopeless. My startup had become a big part of my personality. Starting over would be hard, and I feared I couldn't do it.

After a few months of flailing, I started diving deeper into opportunities around AI and Machine Learning. I immersed myself in technical courses, books, and tutorials for AI developers. I decided to pivot my career towards AI with another startup. I had become convinced that AI was the big opportunity for the next 20 years, but I had not yet found a compelling application for a startup. I experimented with building a wrapper for the OpenAI API which implemented Retrieval-Augmented Generation (RAG), so companies could upload their private documents and then use OpenAI to chat with them. I thought this was a great idea until OpenAI released essentially the same feature with GPTs at DevDay 2023. A lot of startup ideas died that day.

AI Job Opportunity with Khan Academy

My LinkedIn feed happened to pop up the Senior Prompt Engineer position with Khan Academy at an opportune time. Despite my entrepreneurial nature urging me to try another startup, the practical reality of financial stability was becoming increasingly pressing. Not having a salary was starting to weigh on me and I was also picking up on some not-so-subtle signs that it was starting to weigh on my wife as well. Khan Academy's mission aligned perfectly with my passion for education and technology, prompting me to start working on an application.

The job requirements specifically mentioned Python skills and said that my cover letter should address the question "How do you ensure the high quality of the prompts you create? (Use specific strategies and examples.)" I had been developing some AI-based application prototypes for startup ideas and had built a testing system for my prompts. However, these were written in Ruby with minitest, so I translated some of this system into Python and created a GitHub repository as a demo project to include with my application. I wrote an article about it called Prompt Engineering Testing Strategies with Python.

I used the OpenAI API and unittest in Python to show some examples of how I was maintaining high-quality prompts with consistent cross-model functionality, such as switching between text-davinci-003, gpt-3.5-turbo, and gpt-4-1106-preview. These tests also demonstrated a framework for ongoing testing of prompt responses over time to monitor model drift and even evaluation of responses for safety, ethics, and bias as well as similarity to a set of expected responses.
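To give a flavor of what such a test can look like, here is a minimal sketch rather than the code from my demo repository. So that it runs offline, fake_complete is a hypothetical stand-in for the real OpenAI API call, the canned responses are invented for illustration, and the 0.6 similarity threshold is an arbitrary assumption.

```python
import unittest
from difflib import SequenceMatcher

MODELS = ["text-davinci-003", "gpt-3.5-turbo", "gpt-4-1106-preview"]

# Invented canned responses so the example runs without network access.
CANNED = {
    "text-davinci-003": "The capital of France is Paris.",
    "gpt-3.5-turbo": "Paris is the capital of France.",
    "gpt-4-1106-preview": "The capital of France is Paris.",
}


def fake_complete(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real API call; swap in your client here."""
    return CANNED[model]


def similarity(a: str, b: str) -> float:
    """Rough character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class CapitalPromptTest(unittest.TestCase):
    PROMPT = "What is the capital of France? Answer in one sentence."
    EXPECTED = "The capital of France is Paris."

    def test_every_model_mentions_paris(self):
        # The same prompt should yield the key fact on every model.
        for model in MODELS:
            with self.subTest(model=model):
                self.assertIn("Paris", fake_complete(model, self.PROMPT))

    def test_every_model_is_close_to_expected(self):
        # Responses may be worded differently but should stay near the reference.
        for model in MODELS:
            with self.subTest(model=model):
                score = similarity(fake_complete(model, self.PROMPT), self.EXPECTED)
                self.assertGreaterEqual(score, 0.6)
```

Saved to a file, this can be run with python -m unittest. Running the same suite on a schedule and tracking the similarity scores is one simple way to watch for model drift.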

Navigating the Interview Process

The next week I got some good news: I got an interview! The interview was with a Director to whom I would be reporting. It went well. He seemed to like my demo project and the concept behind the testing suite, and it seemed like the Khanmigo team could benefit from using something like it. Khanmigo officially lives under the Content department, so the prompts are primarily written by non-technical content managers within each specific discipline. The prompts are then handed over to the software engineering team for implementation and ongoing management. This back and forth caused some pain within the organization and led to delays and frustrations.

A few days later I got invited back for a second interview, this time a technical interview with a Senior Developer. That interview also went well. We worked on an example of asking the AI to structure its response as a JSON object and how we might go about ensuring the AI returns valid JSON, something my test suite could be super helpful with. I knew I shouldn’t get my hopes up, but to be honest I started getting excited about having a job and joining a large team. It’s been about 20 years now! A few days after my second interview I got the bad news: “Unfortunately, we won't be moving forward with your candidacy at this time…” Bummer.
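For readers curious about the JSON problem from that technical interview, one common approach is to validate the response and retry when it fails. The sketch below is my own illustration, not what we wrote in the interview: the required keys, the call_model callable, and the retry count of 3 are all assumptions, and the flaky "model" is faked with canned replies.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"answer", "confidence"}  # hypothetical schema for illustration


def parse_json_response(raw: str) -> dict:
    """Return the parsed object, or raise ValueError if it is not usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def ask_for_json(call_model: Callable[[str], str], prompt: str,
                 max_attempts: int = 3) -> dict:
    """Call the model until it returns a valid object or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_json_response(call_model(prompt))
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts: {last_error}")


# Fake model: the first reply is chatty prose, the second is valid JSON.
replies = iter(['Sure! Here is the JSON:', '{"answer": "4", "confidence": 0.9}'])
result = ask_for_json(lambda prompt: next(replies), "What is 2 + 2? Reply as JSON.")
```

A unit test around parse_json_response is exactly the kind of check a prompt test suite can run on every prompt change.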

I was disappointed. I thought the interviews had gone well, and I was excited to help develop Khanmigo. I also genuinely thought that my test suite concept could help the team with ongoing prompt engineering management. Despite the setback, I had now found a new direction.

Shiro: A Dev Platform for Prompt Engineering

Managing LLM prompts in a production environment is challenging. Coordinating the non-technical users who develop and iterate on prompts with the software engineering team that deploys and manages them is not an easy task. The probabilistic nature of LLM responses adds further challenges. How do we measure whether the changes we've made to prompts result in better or worse responses? How do we test responses over time and monitor for model drift? Would using a different model or provider result in a better experience?

To help teams tackle these challenges, I've developed the Shiro platform. Shiro is a dev platform for prompt engineering to help teams level up their prompt engineering management. Shiro facilitates coordinating large teams of non-technical users to develop, test, and iterate on prompts. Users can perform side-by-side comparisons of multiple prompts, parameters, models, and even model providers across a variety of test cases.

It also helps software engineers deploy prompts to production, with options to either lock down prompt versions or let non-technical teams keep updating the prompts used in production without changing production code.

Shiro Platform Features

Cross-LLM-Provider Support

Compare how the same prompt performs across models from OpenAI™, Google™, Mistral™, AWS™, Ollama, Cohere™, AI21™, Anthropic™, Microsoft Azure™ or your own fine-tuned model.

Test Cases & Quantitative Evaluation

Build up a bank of test cases so that with each iteration of your prompt, you get closer to your ideal output. Every update is version-controlled and can be easily reverted, no code changes necessary.

History Tracking & Collaboration

Each permutation you try is saved to your history and has a unique URL so that you can revisit or share it with others at any time.

Observability and Monitoring

Every model input, output, and piece of end-user feedback is captured and made visible at both the row level and in aggregate. Use your data for model fine-tuning.

Roles and Permissions

Create roles and permissions to fine-tune access to features like editing prompts, managing deployments, and setting test controls.

RESTful API

Our high-reliability, low-latency API allows you to make changes to the prompt, or even the underlying model/provider, without making any code changes.

Early Access Now Open

Shiro is now open for early-access users. Sign up now for the free starter plan with 10 OpenAI prompts, 1 prompt deployment, and unlimited tests. Please help support my startup so I can explain to my wife why I don't have a job yet!