Life, the Universe and LLMs
Deep thoughts on AI advancements over the past 12 months
July 2024
In Douglas Adams’ Hitchhiker’s Guide to the Galaxy, two hyper-intelligent pan-dimensional beings named Lunkwill and Fook are tasked with turning on Deep Thought, the massive supercomputer constructed by other hyper-intelligent beings, and asking it a question. Keep in mind, Deep Thought was so powerful that it could contemplate the very vectors of the atoms in the Big Bang itself. Lunkwill and Fook ask Deep Thought to give them the answer to “life, the universe and everything.” After confirming that there is an answer, Deep Thought tells them it will take 7,500,000 years to reveal it. Wow. Thanks for nothing, Deep Thought.
I can’t help but recognize the parallels in our own lives as actuaries navigating the world of artificial intelligence (AI), big data, generative AI (GenAI) and large language models (LLMs). Actuaries are, of course, tasked with answering big questions (I’ll admit not quite so big as in Adams’ novel), and in my experience, we are increasingly relying on AI models to give us accurate and coherent answers.
In the news, we see contrasting opinions and statistics regarding the use of AI in our daily lives. On one hand, we have seen increased performance and a reduction in the cost of operating LLMs.1 On the other, we have seen challenges arise in the world of AI, ranging from the comical, akin to Deep Thought’s famously underwhelming answer, to the downright malicious, such as using AI to simulate a loved one’s voice as part of a scam.2
The Last 12 Months
The world of GenAI and LLMs does not encompass all of AI, but this article will focus on GenAI. Whatever its merits, year-over-year investment in GenAI has increased by 800%, while investment in AI overall has generally decreased.3
Just over a year ago, The Actuary published “ActuaryGPT.” That article discussed several then-new concepts, such as LLMs and OpenAI’s GPT-4, and explored the strengths and weaknesses of those models. Here’s a nonexhaustive update on some of the major developments over the past 12 months.
- OpenAI has released GPT-4 Turbo and GPT-4o (“o” for “omni”), both closed-source models. Turbo boasts several enhancements over GPT-4: Among other features, it has improved coding, math and reasoning capabilities; it costs less to run; and it increases the context length from 8,000 to 128,000 tokens (the words, or parts of words, that LLMs read and generate; see the token-counting sketch after this list). GPT-4o takes things one step further by acting as a multimodal LLM that can process and generate text, images and audio. This allows a nearly seamless level of interaction. For example, you can have a verbal conversation directly with an LLM.
- Many for-profit companies are releasing open-source LLMs. Open-source options were available a year ago, but there are many more now. Enterprises that want to implement LLMs have realized that, for reasons of data security, computational efficiency and company-specific context, running their own instance of an open-source LLM can be more beneficial than paying for a comparable closed-source model. With that said, closed-source LLMs are still outperforming their open-source counterparts.4 Recently released open-source models include Apple’s OpenELM and Snowflake’s Arctic.
- As more open-source models become available, companies are fine-tuning them to implement their own context-specific LLMs (a minimal fine-tuning sketch follows this list). It’s important to note that this is different from building your own LLM from scratch, which can come with a hefty price tag. An interesting example of a context-specific LLM is Shopify’s Sidekick,5 a conversational assistant built on a fine-tuned instance of Llama 2 (an open-source LLM from Meta).
- Up to this point, a large emphasis has been placed on measuring model capability through benchmarks such as MMLU, HumanEval and GSM8K. Greater scrutiny is now being applied to measuring concepts such as bias, toxicity and general truthfulness. Similarly, the world is seeing a significant increase in AI regulation: Across 128 countries monitored in a recent survey, 148 AI-related bills have been passed since 2016.6 Benchmarked capability is no longer the only metric by which investors and the public are assessing models.
- Streamlined versions of LLMs have been introduced, appropriately named “small language models,” or SLMs. SLMs have fewer parameters, don’t require significant time or capital to train, and are overall easier for companies to implement. The generalized capabilities of an SLM might be lower than those of an LLM, but companies can fine-tune an SLM for a specific context and be rewarded with a powerful model that is easier to deploy. Examples of popular SLMs include Microsoft’s Phi-3 and Mistral AI’s Mistral 7B.
- Researchers have put forward a “model collapse” theory,7 whereby GenAI model performance deteriorates due to a feedback loop. Imagine a scenario where the majority of information online is model generated. Because LLMs and other GenAI models are trained via large-scale scraping of the internet, the theory goes, models would progressively lose statistical information about the original data and degrade over time (a toy simulation follows this list). There have been no documented cases of this occurring yet, and many in the AI community are already working on solutions to this theoretical issue.8,9
- Stack Overflow and OpenAI announced a partnership that is notable for several reasons. First, many LLMs (including those produced by OpenAI) can generate code on their own, whereas Stack Overflow, fueled by a large community of experienced users, historically has been the go-to website for helping software developers solve coding problems. Second, GenAI-generated answers are still banned on Stack Overflow, and many of its users have reacted strongly and negatively to the news.10
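Tokens are worth making concrete, since context windows are budgets of them. Below is a minimal sketch using OpenAI’s open-source tiktoken library; the sample sentence is illustrative, and exact counts vary by model encoding.

```python
# Minimal sketch: counting tokens with OpenAI's open-source tiktoken
# library. Context windows (8K, 128K, ...) are measured in these
# tokens, so counting them shows how much of the window a prompt uses.
import tiktoken

# Look up the tokenizer encoding that a given model family uses.
enc = tiktoken.encoding_for_model("gpt-4")

text = "Actuaries increasingly rely on AI models for coherent answers."
tokens = enc.encode(text)

print(len(tokens), "tokens")   # a handful of tokens for one sentence
print(enc.decode(tokens))      # decoding round-trips back to the text
```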
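For the fine-tuning item above, here is a minimal sketch of the common parameter-efficient (LoRA) pattern using Hugging Face’s transformers and peft libraries. The model name and hyperparameters are illustrative assumptions rather than recommendations, and the Llama 2 weights are gated behind Meta’s license.

```python
# Minimal sketch: parameter-efficient fine-tuning (LoRA) of an
# open-source LLM with Hugging Face transformers + peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # gated model; requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small low-rank adapter matrices instead of all of the
# base weights, which is what makes company-specific tuning affordable.
config = LoraConfig(
    r=8,                                  # adapter rank (illustrative)
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically a small fraction of the base model

# From here, train on domain-specific examples with a standard
# transformers Trainer loop, then save just the adapter weights.
```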
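The model collapse feedback loop can also be illustrated with a toy simulation, drastically simplified from the cited paper: a “model” that is just a fitted Gaussian is retrained, generation after generation, only on samples drawn from its predecessor. The numbers here are arbitrary assumptions for illustration.

```python
# Toy illustration of "model collapse": each generation's Gaussian
# "model" is fit only to samples generated by the previous generation.
import numpy as np

rng = np.random.default_rng(42)
n = 50                        # training-set size per generation
mu, sigma = 0.0, 1.0          # generation 0: the real data distribution

for gen in range(1, 101):
    data = rng.normal(mu, sigma, size=n)  # "scrape" the prior model's output
    mu, sigma = data.mean(), data.std()   # refit the model on synthetic data
    if gen % 20 == 0:
        print(f"generation {gen:3d}: sigma = {sigma:.3f}")

# sigma typically collapses toward zero: each refit loses a little tail
# information, so the distribution's diversity shrinks over generations.
```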
For More Information
- Read the Society of Actuaries report, “A Primer on Generative AI for Actuaries.”
- Read “ActuaryGPT,” an earlier The Actuary article from Harrison Jones.
The Next 12 Months: We Demand Rigidly Defined Areas of Doubt and Uncertainty
I’ve witnessed that insurance professionals are now immersed in a world of AI. Actuaries could be working with data scientists to build an AI model that helps predict future claims, or a contact center agent could be leveraging an LLM to assist a policyholder with an issue. We now operate in an evolving ecosystem with, seemingly, a large degree of uncertainty about the future. How powerful will LLMs get? What will the next wave of GenAI look like? What is this interesting company I keep hearing about named Skynet?11
Picking up where I left off in Adams’ novel: before the supercomputer Deep Thought provides its answer to “life, the universe and everything,” the philosophers Majikthise and Vroomfondel barge into the room with Lunkwill and Fook. They demand to be heard as representatives of the Amalgamated Union of Philosophers, Sages, Luminaries and Other Thinking Persons. Their main worry is that Deep Thought will put them out of their jobs if it can provide the answer to life, the universe and everything. In the frenzy of stating their concerns, Vroomfondel proclaims loudly, “We demand rigidly defined areas of doubt and uncertainty!” It’s a statement Adams presumably meant to be contradictory and ridiculous. Yet in an actuary’s world, is it not reasonable to place boundaries, conditions and contingencies on an uncertain future?
We don’t know what will happen in the next 12 months. As actuaries, we must understand the tools available to perform our work effectively, all while keeping a close eye on the risks that come with powerful tools like AI. Within your organization, you also have a choice: face the uncertain future as Lunkwill and Fook did, seeking the answer, or as Majikthise and Vroomfondel did, questioning the need for Deep Thought entirely.
Statements of fact and opinions expressed herein are those of the individual authors and are not necessarily those of the Society of Actuaries or the respective authors’ employers.
References:
- 1. Gunton, Matthew. FrugalGPT and Reducing LLM Operating Costs. Medium, March 28, 2024.
- 2. Bethea, Charles. The Terrifying A.I. Scam That Uses Your Loved One’s Voice. The New Yorker, March 7, 2024.
- 3. AI Index Steering Committee. Artificial Intelligence Index Report 2024. Chapter 4: Economy.
- 4. AI Index Steering Committee. Artificial Intelligence Index Report 2024. Chapter 2: Technical Performance.
- 5. Wiggers, Kyle. Shopify Sidekick Is Like ChatGPT, but for E-Commerce Merchants. TechCrunch, July 26, 2023.
- 6. AI Index Steering Committee. Artificial Intelligence Index Report 2024. Chapter 7: Policy and Governance.
- 7. Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The Curse of Recursion: Training on Generated Data Makes Models Forget. April 14, 2024.
- 8. Gerstgrasser, Matthias, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, et al. Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data. arXiv.org, April 1, 2024.
- 9. Kirchenbauer, John, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A Watermark for Large Language Models. PMLR, July 3, 2023.
- 10. Meta Stack Exchange. Our Partnership With OpenAI. May 14, 2024.
- 11. Skynet is the fictional AI system that serves as the primary antagonist of The Terminator series. It’s used here as a playful nod to Hollywood’s habit of conceptualizing AI systems gone rogue.
Copyright © 2024 by the Society of Actuaries, Chicago, Illinois.