Zac Pustejovsky Zac Pustejovsky

The Everything Algorithm

What do Large Language Models mean for UX, Product, and Engineering?

What do Large Language Models mean for UX, Product, and Engineering?

Recent advancements in the generative modeling of language have caught the world's attention with the release of several impressive new products. In the last few months, we’ve been overrun with grand proclamations of how this new technology will transform our lives and the entire global economy. White collar jobs will be overrun by robot employees; traditional tech giants will crumble at the hands of AI startups; intelligence, creativity, and communication, long considered distinctly-human traits, are now well within the capabilities of machines. It has seemed as if the whole world were attempting to performatively announce their fealty to the new superintelligence. Can a new algorithm really be so disruptive?

They’ve become ubiquitous, the components of technical systems that rank posts in a feed, propose next videos to watch, or suggest products to buy. Algorithms are so omnipresent in mobile and web-based products that it feels wrong to think of them as distinct pieces of the systems in which they reside. In many cases, a product’s visual User Interface (UI) feels almost secondary to the algorithm; a formality to make up for the fact that we need something to click on or type into.

There is always a reward function. Without one, an algorithm has no real purpose.

In designing these algorithms, engineers and product managers need a metric by which to measure their success. Maybe an ecommerce website uses Average Revenue Per User (ARPU) to compare multiple versions of a product recommendation algorithm. I like to think of these target metrics as the system’s reward function. A reward function is a numerical representation of success that a system optimizes towards. Sometimes this optimization can be automated, as is the case in Reinforcement Learning (RL). More often, the optimization happens slowly as the result of iterative experimentation and multivariate testing. There is always a reward function. Without one, an algorithm has no real purpose.

Over the last couple of years, and more intensely over the last few months, Large Language Models (LLMs) have proven adept at a variety of tasks. Murray Shanahan of Imperial College London explains them as follows:

LLMs are generative mathematical models of the statistical distribution of tokens in the vast public corpus of human-generated text, where the tokens in question include words, parts of words, or individual characters including punctuation marks. They are generative because we can sample from them, which means we can ask them questions. But the questions are of the following very specific kind. “Here’s a fragment of text. Tell me how this fragment might go on. According to your model of the statistics of human language, what words are likely to come next?”

- This paper by way of Vicki Boykis

This type of model has a reward function equal to the probability of the next token given all previous tokens, starting with a prompt. The system is “pre-trained” to represent this statistical distribution as accurately as possible.

The explosion in popularity of these models has culminated with the astonishing launch of ChatGPT, a question-answer system built on top of an LLM. One of the key innovations in these newer systems comes from a second phase of training, in which expert users interact with the model. As the model generates responses, the users assign reward as a function of how “good” the generated text is. We could imagine a grading system where nonsense gets no reward, grammatically-correct paragraphs get a little reward, and fluidly-written statements of fact get a lot of reward. Each time the model receives a grade, it adjusts itself to alter the predictions from future queries. This process is called Reinforcement Learning from Human Feedback (RLHF). RLHF can be thought of as the process of adjusting the reward function of the system to match a specific use-case. The distribution is tweaked such that the model generates useful answers to user queries.

A good heuristic for understanding whether a specific task or a job will experience large-scale disruption from LLMs is to ask: what is the reward function?

Just like many other technological leaps forward, people have immediately started to bemoan the upcoming mass unemployment at the hand of the omniscient machine. I don’t think this is a reasonable reaction.

A good heuristic for understanding whether a specific task or a job will experience disruption from LLMs is to ask: what is the reward function? If the reward function is the same as the algorithm, then there will probably be some disruption. A student writing a 1500-word essay on the Vietnam War is attempting to maximize their grade, which is based on the clarity and fluidity of their arguments. This task is perfectly suited for an LLM. The model can provide a perfect statistical combination of all previous A-graded papers on Vietnam, tailored to fit the exact argument you are trying to make.

However disruptive this may be in AP English and History, few people make their livelihood by producing 1500-word essays on well-researched topics. Those who do tend to have either a unique perspective or some novel information that has not yet been indexed by the model. In this case, LLMs will likely be a welcome improvement to their writing workflow, as drafts, outlines, and filler text can all be generated and tweaked instantly.

Is my livelihood doomed to fall victim to the supercomputer?

How might these systems alter the lives of other white-collar workers? There is certainly potential for disruption in some roles. Roles that specialize in producing large volumes of templated text (e.g. internal HR Communications) or in succinctly summarizing documents (e.g. business and legal analysts) will likely get more productive. Increased productivity can lead to decreased headcount.

Code generation is often touted as a major disrupter, as middle-managers the world over are drooling at the prospect of a mechanized workforce, generating and modifying complex code bases on a dime. Is my livelihood doomed to fall victim to the supercomputer?

In theory, LLMs should be very good at churning out code. Code is text, and code validity in response to a prompt can be set up as a reward function. All sorts of examples have been produced where LLMs correctly generate functions, classes, and even small programs. I’m using ChatGPT to help me uniformly sample n data points from a dataset of unknown size (represented as a PCollection). I’m using my map-reduce framework of choice, Apache Beam. There is actually a class in Beam that does this already, using reservoir sampling.

Using Apache Beam, write a Java pipeline that uniformly samples n elements from a PCollection of unknown size, where n is specified by the user as input to the program.

ChatGPT generated the following code, along with a brief explanation.

 import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Sample;
import org.apache.beam.sdk.transforms.Shuffle;
import org.apache.beam.sdk.values.PCollection;

public class SamplePCollection {
  public interface Options extends PipelineOptions {@Description("Number of elements to sample")@Default.Integer(3)
    int getSampleSize();
    void setSampleSize(int sampleSize);
  }

  public static void main(String[] args) {
    // Create the pipeline options
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Read in the input data
    PCollection < Integer > inputData = pipeline.apply("CreateInputData", Create.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));

    // Get the sample size from the pipeline options
    int sampleSize = options.getSampleSize();

    // Sample the specified number of elements from the input data
    PCollection < Integer > sampledData = inputData
        .apply(Shuffle.viaRandomKey())
        .apply(Sample. < Integer > fixedSizeGlobally(sampleSize));

    // Write the sampled data to an output file
    sampledData.apply(TextIO.write().to("output.txt"));

    // Run the pipeline
    pipeline.run().waitUntilFinish();
  }
}

A pretty impressive result! It made a funny mistake In the sampling logic, however. The program unnecessarily passes over the data twice, once for the Shuffle.viaRandomKey() and once for the Sample.<Integer>fixedSizeGlobally(sampleSize). Now this program would work, but it’s needlessly expensive. I ask ChatGPT if it’s possible to write the above pipeline with only a single pass over the data. It responds in turn:

It is generally not possible to uniformly sample n elements from a PCollection in a single pass while maintaining a uniform random distribution of the sampled elements.

Some might say that the model is being accurate, if not statistically pedantic. Reservoir sampling approximates a uniform distribution rather than sampling with the precise probability necessary for uniformity. That seems like we’re giving the system a bit too much credit, however. When asked what the Sample.<Integer>fixedSizeGlobally() method actually does, ChatGPT seems to know.

Sample.fixedSizeGlobally is a transformation in Apache Beam that uniformly samples a fixed number of elements from an input PCollection. It takes a single parameter, which is the number of elements to sample from the PCollection.

I point out that a previous response claimed it was impossible to uniformly sample a fixed number of elements with a single pass over the data.

I apologize for the confusion. You are correct that it is generally not possible to uniformly sample n elements from a PCollection of unknown size in a single pass while maintaining a uniform random distribution of the sampled elements.

However, the Sample.fixedSizeGlobally transformation in Apache Beam can be used to uniformly sample a fixed number of elements from a PCollection in a single pass.

A little defensive, don’t you think?

If I were a young engineer I would take that first solution, tell the bot to write me a quick test, submit a Pull Request, and call it a day. If I were a senior engineer on an overworked team, I might just let this one slide. It would function correctly and any issues with scaling wouldn’t be noticed for a long time. That’s a problem for a future robot to solve.

Code generation is clearly a strength of LLMs, but can they generate the requirements as well?

The ability to design a scalable and maintainable system given a set of product requirements is usually what separates beginners from more experienced engineers. Code generation is clearly a strength of LLMs, but can they generate the requirements as well?

To gauge the system design skills of ChatGPT, I asked it to design a product that I’ve worked on: a high-interest savings account.

Design me a system that pays a certain interest rate to users based on the balance in their savings account. Some of the requirements are:

* Daily payouts

* Variable rate and principal cap

* The concept of an "interest offer", such that the rate and cap can be modified

* An admin API that can modify the "interest offer" that users see

* A user-facing API that can show users what they can expect to earn in interest.

Describe the system in its entirety, as well as any dependent systems that need to exist in order for it to function.

It responded with five paragraphs describing a high-level design. I then proceeded to ask an absurd amount of follow-up questions, in the hope that I could create a graphic detailing its solution. Eventually, we arrived at something like this (graphic is my own):

Rough system design proposed by ChatGPT for a high-interest savings product. Balances are stored as a numerical field in an account table, external jobs compute and payout interest, and APIs show users their interest-earning balance and some additional metadata.

I could spend a few paragraphs explaining all of the flaws of this design, but it’s not particularly interesting. Suffice to say that if I handed this to a dev team, a lot of incorrect payments would go out to users.

Precision is incredibly important in good system design. Asking an engineer to implement a nonsensical database schema can waste days, if not weeks. I have to imagine that a purpose-built LLM for engineering could do far better. That being said, it’s hard to imagine even a 10x improvement here actually displacing many jobs.

While the model seemed to struggle with tasks that require precision, it made me wonder whether it would be helpful in some of the more open-ended, imprecise parts of my job. Learning new skills is an important element in any software engineering role. Even the most senior of engineers will constantly encounter situations with which they are unfamiliar. Maybe rather than focusing on automation, LLMs can help me learn new technical skills?

The model excels in the meandering, open-ended inquiry that’s so helpful in education.

With ChatGPT and Github Copilot as my pairing partners, I set out to do a tutorial on Field Programmable Gate Arrays (FPGAs). The first thing that stood out to me was how irrelevant code autocomplete felt to a hardware-adjacent development process. Getting my beginner FPGA board to control the activation of an LED from a switch was about 5% code and 95% navigating specialized software. Unfortunately, Copilot could not set up my M2 Macbook to have a functioning virtual Windows environment with the proper device drivers installed.

ChatGPT was, however, enlightening in its ability to provide me with context that was missing from my tutorial. It explained hardware programming concepts, defined Verilog keywords, and even gave me reasonable debugging advice based on certain errors I experienced.

In learning, I didn’t find myself using the generated code so much as reading it. It did not fully replace Google Search, but the system’s ability to be there as a stand-in for an “expert” and directly answer my specific questions felt incredibly natural, especially when compared to the paradox of choice that search engines provide.

The model excels in the meandering, open-ended inquiry that’s so helpful in education. This is a far more compelling use-case than automation, which requires a great-deal more precision than is currently feasible.

An engineering role has a fairly-complex reward function. Engineers are judged on actions. Their reward function is usually some weighted combination of deployments, system uptime, incident-response, bug fixing, and relationships with business stakeholders. As such, the reward of a given action that an engineer takes is not measurable in small, discrete time periods. This makes the role a poor candidate for total automation.

LLMs are the next phase of interfaces which allow us to interact with the far-reaching world of internet content.

So maybe white collar jobs are relatively safe, at least for now. But why has everyone been so excited? The answer to this question lies in the algorithm’s User Experience (UX).

LLMs are the next phase of interfaces which allow us to interact with the far-reaching world of internet content. Whereas the best systems currently provide references to objects, we can now generate the objects themselves, on-demand. The product implications of this are obvious. Rather than returning an array of websites, an inquiry in a search engine can now directly answer you in the manner and grammatical style of an expert. A request for an image can be created on-demand rather than referenced externally.

Some might say that “hallucinated information,” the phenomenon wherein LLMs generate plausible-sounding bullshit, will turn off consumers. I think this will have no impact whatsoever on adoption. Millennials were raised as a generation to “not believe everything you read on the internet.” Gen Z is well-trained in the whims of the various algorithms that dictate our online lives. People are used to “plausible-sounding bullshit,” and they have come to expect it in their digital consumption.

If the issue is trust, then standard Google Search has slowly lost that trust over the years and the opportunity is ripe for disruption. As an example, I attempted to get some advice on a variety of real-life subjects.

Search systems are doomed in a competition against generative content, because ultimately they cannot control the experience of the content they link to.

While I was making a sauce for dinner, I wanted to know what to do if the sauce broke. I asked both ChatGPT and Google Search.

ChatGPT provides decent responses to simple inquiries about cooking.

ChatGPT gave me a good answer with lots of possible paths forward in the event that my sauce broke. Google Search gave me a Knowledge Graph snippet from Blue Apron’s blog. When I clicked on the link, all I could see was an ad to subscribe to their services.

I then asked a follow-up about how long I should expect to be whisking the sauce to maintain my emulsification. In this too, the generated response was far more direct and helpful.

ChatGPT handles follow-up questions well and gave me the answer I needed, along with helpful additional context.

Google Search provides a busy carousel of recipes and a selection of five related searches.

Search systems are doomed in a competition against generative content, because ultimately they cannot control the experience of the pages they link to. In the examples above, I leave Google’s modern page and get dropped into a world of flashing takeover ads and cheap branding.

I am not suggesting that OpenAI is about to leapfrog Google. If anything, Google is the company best positioned to build an amazing LLM product and launch it en masse to consumers. What I am suggesting is that search engines have gotten really bad at surfacing relevant information. Google’s total dominance of the space has forced every company under the sun to game their algorithm. The proliferation of search-optimized sponsored content has been mistaken by the search algorithms as knowledge. Trust in information found from random sources on the internet is low enough that I see no reason users wouldn’t switch to a generative model. In fact, Google’s Knowledge Graph project (aka “Featured Snippets”) has been training us to blindly trust the machine for many years now.

We’ve seen this kind of disruption before. Shifts in the algorithms that manipulate our digital consumption have always led to unforeseen consequences.

What this experience does is remove the search engine as “referrer” and simply distills the information you desire into a personalized response. Reducing time-to-value is the number one objective of all things digital. Staying within the context of a single branded experience for all attempts to retrieve information is an unquestionable UX improvement.

Despite the many voices claiming this to be an unprecedented shift, we’ve seen this kind of disruption before. Shifts in the algorithms and reward functions that manipulate our digital consumption have always led to unforeseen consequences. Google’s position as the main referral source of the internet has driven every brand on the planet to start a “lifestyle blog” filled with vapid content to feign relevance in search queries. Facebook’s quest to optimize engagement numbers played a role in facilitating genocide via the dissemination of fake news. Netflix’s pursuit of eyeballs helped create the spread of the insufferable “bingeable” television series. Twitter’s attempts to optimize for a more friendly discourse angered the richest man in the world, who ultimately lost that title by buying them with the promise of making the app “funny” again.

What can be expected is a shift in how people experience the internet. The SEO wars will make way for a new type of battle; one where companies fight to weave their products into the language model in subtle ways. This will likely have a ripple effect on the design patterns of the internet.

Algorithms and their reward functions, though devised and deployed in the digital domain, have real world consequences. It will take years before we understand the repercussions of this technology. Hopefully our growing societal understanding of algorithms will help us deal with the changes that we are about to face. To do this effectively, we need to stay focused and not get distracted by the loud voices proclaiming mass white collar automation. This technology may be new and impressive, but it still fits cleanly within the realm of what we already know about the internet. Let’s try not to make the same mistakes again.

Read More