ARC SOTA July 2023

ARC: Where do we stand today?

In 2019, François Chollet published his paper “On the Measure of Intelligence” where he introduced the Abstraction and Reasoning Corpus. Almost four years later, after a competition on Kaggle and with Lab42’s second ARCathon in progress, the time is ripe to take a look at where we stand today, including an update from seven devoted ARC specialists.

Image: Midjourney

Michael Hodel
Programmer and winner of the Lab42 ARCathon 2022

đź“… July 7, 2023


The Abstraction and Reasoning Corpus is a hard benchmark almost by definition, as it is an attempt to be a measure of intelligence. In the first ARC challenge, the best-performing approach managed to reach 20.6%. This record has since been broken three times with scores of 28.5%, 30.4% and 31.4%. However, the approaches that did so were not very innovative, instead largely mere ensembles of the best publicly available code. Hence, there has arguably been little progress since the first challenge over three years ago, at least measured in terms of the best score on this hidden test set.


Today, the global AI competition ARCathon 2023 is running until December 1, 2023 with a total prize money of CHF69’000. The competition has almost 200 registered teams. Hence, there are numerous people currently tackling ARC, both in academia and outside of it. What kinds of approaches are the teams trying? What are the challenges they face and the opportunities they see? To give some overview of the status mid-2023, we asked several people about their perspectives on ARC and the approaches they are pursuing. Here is what they said:

Simon Ouellette: Meta-learning Core Knowledge

My focus is on using Deep Learning methods that are capable, at least in theory, of learning algorithms. While it is well known that these types of approaches cannot be trained directly on the inputs/outputs of highly composite tasks, I believe that with the help of intermediate supervision (such as curriculum learning) it can learn generalizable solutions.

Systematic or algorithmic generalization is a difficult challenge for the current state of machine learning. It gets even more difficult if you consider this in a multi-task context. Then it touches on compositionality: the ability to learn sharable, reusable algorithmic “building blocks” across different tasks.

We are only starting to understand what can unlock this ability in neural networks. There is potentially more to be researched and discovered before we can use this approach to solve the ARC challenge.

I think one of the most basic things to understand about the ARC challenge is that it doesn’t have a usable training set. You need to build your own.

I usually hear 2 different objections to this idea:

“You don’t need to if you use a Domain Specific Language (DSL) -based approach.”

If your approach is so DSL-centric that you don’t need learning (this excludes some use cases of DreamCoder, you will eventually run into an even more serious problem: combinatorial explosion. That is the point of learning: so that you don’t have to constantly search for a solution from scratch every time you face a new task. The more complex the task, the more exponentially expensive your search is going to be if you start from the same basic primitives every time.

“That’s the point of the ARC challenge, to find an algorithm that learns from few examples.”

The bias-variance trade-off in machine learning makes it clear that there is no such thing as an algorithm that is more sample efficient than another while preserving the same level of learning flexibility. There is no free lunch.

For example, meta-learning works by pre-training on a very large dataset of diverse tasks so that it can learn shared, reusable features. Only then can it few-shot learn new tasks in the same distribution. Similarly, humans benefit from several millions of years of evolution combined with a lifetime of learning. That is a lot of data. That’s why we’re so good at the ARC challenge.

If anyone wants to better understand my approach, they can read “The Hitchhiker’s Guide to the ARC Challenge” on the Lab42 website.

Douglas R. Miles: Rediscovering Historical Methodologies

During the 1970s and 1980s, the field of artificial intelligence (AI) witnessed the emergence of several machine learning strategies that showcased remarkable potential. Surprisingly, these methods often outperformed neural networks in terms of learning abilities [ref.]. Unfortunately, their progress was overshadowed by the rise of modern techniques such as expert systems and Good Old-Fashioned Artificial Intelligence (GOFAI) [ref.]. This was primarily due to computational limitations at the time and a prevailing preference within the scientific community for newer, more conventional methodologies [ref.].

One notable creation during this period was Tom M. Mitchell’s Version Space (VS), developed in 1977. The Version Space strategy (VSS) offered a fresh perspective on concept learning by focusing on a binary search space. It introduced two key boundaries: the General boundary and the Specific boundary. The General boundary represented the most expansive hypothesis consistent with all positive instances, while the Specific boundary indicated the most restrictive hypothesis that still encompassed all positive examples. These defined boundaries formed a ”version space” that adapted and adjusted with the introduction of new data. Implementations of this strategy have shown success in solving several ARC tasks.

In the 1980s, the First Order Combined Learner (FOCL) Algorithm emerged as a groundbreaking strategy. It built upon the purely inductive FOIL Algorithm (First-Order Inductive Learner) by incorporating domain theory to refine the search for optimal rules, significantly enhancing precision. FOCL also integrated Explanation-Based Learning (EBL) techniques with the established methodologies of FOIL. EBL is a learning strategy that enables the acquisition of fundamental problem-solving techniques through the observation and analysis of example solutions to specific problems.

A revolutionary strategy, Muggleton’s Inductive Logic Programming, was introduced in the 1990s. ILP leveraged logic programming to provide a unified representation for input examples, background knowledge, and hypotheses [ref.]. This integration birthed a resilient method for concept learning, utilizing logical rules to decipher complex information structures. Considering the complex demands of contemporary AI challenges like the ARC, combining and ILP appears to be a complementary approach, potentially creating a versatile and robust learning system capable of adapting to various challenges [ref.].

While these historical methodologies were meticulously documented, their practical implementations in contemporary programming languages were conspicuously missing. This lack of representation in the modern age suggests that these once-pioneering strategies have gradually faded into the background [ref.]. Recognizing this gap as an opportunity, efforts have been made to re-implement these techniques using the programming language developed for the 5th generation computer project, which seems suitable for reviving these long-lost methodologies.

As I delve deeper into the intricate web of these historical methodologies, it becomes evident that their untapped potential holds promise. Despite the current trend in AI research heavily relying on data-intensive methodologies like deep learning, the challenges posed by ARC underscore the necessity for strategies that can efficiently function in environments where data is scarce. Therefore, it is reasonable to believe that historical methodologies such as VSS, ILP, and EBL could be non-brute force tactics equipped to unravel the mysteries presented by ARC.

In summary, VSS, EBG, and EBL are agile learning approaches well-suited for tasks like ARC, which emphasize adaptability, efficient ambiguity handling, and performance with limited data. These methodologies, which once surpassed neural networks in learning speed, lost relevance over time and were frequently overlooked. However, just like neural networks, these machine learning techniques have the potential to reach new heights with today’s computational capabilities.

Jack Cole: Machine Learning/NLP Transformer Models

In our pursuit to unravel the ARC Challenge, we’ve focused on a tri-pronged approach that utilizes transformer models, data manipulation strategies, and test-time adaptations. Our journey commenced with an empirical exploration of the performance of different transformer models for ARC. Numerous experiments have been conducted concerning data formatting, hyperparameter tuning, optimizers, compositionality, generalization limits, and the volume of data required to learn various ARC concepts. Notably, we have also studied the OpenAI GPT series (3, 3.5, and 4) to understand some of the limitations and possibilities of transformers for ARC. A broad spectrum of other models was also studied, most of which perform very poorly before and after being fine-tuned. ARC presents a variety of challenges for transformer models, not the least of which is the context window and tendency towards shortcut strategies. Along the way, we navigated through various training strategies, including iterative techniques and ensembling. Ultimately, our training strategy is iterative as a practical matter, in that additional training data is added as it is produced.

While a significant chunk of our work revolved around transformer models, another large portion of our time was invested in synthetic dataset generation and data augmentation techniques. We explored a number of methods to expand our training set, experimenting with multiple variables to improve model performance. These data manipulation strategies allowed us to boost the model’s learning efficiency and broadened its ability to adapt to a diverse array of ARC tasks. We have recently been exploring a concept of dataset-induced meta-learning — designing datasets that maximize a model’s development of meta-learning characteristics.

At the heart of the approach has been a transformer model that has been training for approximately 9 months. The best model has achieved up to 19 correct items on the public test set with one attempt, and 24 correct with three attempts. Our model has recently met with success on the notorious hidden test set. Our best model consistently solves 5 items on this dataset, even reaching a high of 7 correct, which could be a significant breakthrough in the ML approach to the ARC Challenge. A critical part of our approach was tailoring our model for the specific evaluation dataset. Given the challenges of ARC’s hidden test set, we evolved techniques for adapting our models during test-time. This aspect of our work played a pivotal role in advancing from 1-2 items correct to 5-7.

ARC is one of the toughest challenges in the AI space, and I would encourage anyone who likes hard challenges to give it a try. Be warned, however, that feeling the rewards of your work may take a long time to arrive. You may try many blind alleys and failed approaches before you arrive at something that works better. Also, there are great communities that can be of support in the process with several people who are very willing to help. The Lab42 Discord server is a good resource. Weekly Discord meetings with Yannic Kilcher’s ARC challenge group have been an invaluable resource for brainstorming and even receiving very tangible contributions from others to this project.

Michael Hodel: Program Synthesis and Data Generation

When I started out with ARC, I first created a Domain-Specific Language (DSL) with the aim of being expressive enough to allow programs solving arbitrary ARC-like tasks, and generic, i.e. consisting of only few primitives, each useful for many ARC tasks. In some sense this can be viewed as an attempt of coding up some of the core knowledge priors. As a proof of concept, solver programs for the 400 training tasks were written — entirely within this DSL.

Subsequently, I used this DSL to construct a program synthesis approach for the 2022 ARCathon, which got 6% on the hidden test set but unfortunately ended up much closer to a brute-force breadth-first-search and much further from something intelligent than I had initially hoped. This line of work was put on hold once the competition ended, after which I investigated how much further the record could be pushed via ensembling already existing approaches, which ended up scoring 30.4%. However, since the interest in creating something innovative with a lot of potential is greater than trying to squeeze out some few percentage points from code that is not my own, I also put this on hold.

Since then, I have been mainly focused on generating synthetic ARC-like data, since I intend to test how far one can get by applying machine learning techniques, and the lack of good and sufficient data is obviously the biggest road-blocker on the way to testing this hypothesis. The first route I took in that realm was to try to think of classes or clusters of ARC-tasks, e.g. one focusing on transformations on a grid-level, object-level or cell-level. However, as it turned out to be very hard to have a great task diversity while still having a computationally feasible and automated manner to ensure task validity, I took a step back: I started to reverse engineer individual ARC training tasks, generalize them, and create many more examples for them, all while being able to control the example difficulty. My aim with this effort is to subsequently run a wide range of experiments.

There are two main characteristics that make ARC hard: The dataset contains many diverse tasks and each task comes with only very few examples. Given those, one could think of such an intermediate problem as a new dataset satisfying either or both of the following two characteristics:

Reduced scope: The diversity of ARC tasks is reduced, i.e. tasks may only use a subset of the core knowledge priors, or there is in other ways additional knowledge about attributes of the ARC tasks that constrain the space of possible solutions and hence make this new dataset easier to tackle. Examples are having only certain categories or classes of ARC tasks, fixed grid sizes, or also only 1-Dimensional ARC tasks or e.g. 2-symbol ARC tasks, etc. It seems plausible to me that working on such a dumbed-down version of ARC could give insights, especially with respect to how to go about including priors.

More samples: The number of examples per task is greater, e.g. thousands of instances instead of only a handful. Assumably, most ARC-tasks could be learned with machine learning approaches, given enough samples. It seems plausible to me that insights could arise from optimizing for sample-efficient learning methods. This need not be in a meta-learning setting right away: One could also think of training one model per task, e.g. trying to minimize the median number of examples needed to learn at least a certain (known to be reachable) share of tasks for a fixed model architecture and fixed hyperparameters.

Blane D'attelles: Tendril

Tendril is a novel machine learning algorithm being developed to address ARC. This dynamic programming approach involves three primary components. The first is the utilization of a best-first search algorithm, aimed at identifying promising objects that can aid in problem-solving. Secondly, these objects are exploited through a greedy strategy to create mappings from input to output. The resulting information is then used to steer the best-first search. The third component involves Tendril directly examining the input to recognize ancillary objects needed to identify the critical objects for problem resolution. Tendril’s ultimate goal is to establish a machine learning algorithm that ensures convergence even across very small sample spaces, extrapolation of data, applying feedback loops to guide convergence, and abstraction across the small sample set.

The primary issues include controlling the combinatorial explosion and debugging the system, which operates on a dynamic programming algorithm. Unlike traditional dynamic programming algorithms, which typically have a small, finite set of possibilities at each step, Tendril could potentially have infinite possibilities. This adds another layer of complexity, as we strive to sort and identify the most promising strategies for each step, without exploring the infinite array of potential options. This issue is solved by creating a complex tracking system which measures different routes by adjusting a frontier size and solution depth. But the tracking system is a difficult design.

John Tan Chong Min: LLMs as a System to Solve the ARC Challenge

Using the flexibility of Large Language Models (LLM) to be prompted to do various novel tasks using zero-shot, few-shot, context-grounded prompting, we implement LLM (specifically GPT4) as a system with memory and recursive environment feedback loop, as an attempt to solve the ARC challenge. GPT4 has been shown to be able to perform an association between the input space to the action space relatively well (e.g. Voyager/Ghost in the Minecraft), and so we can frame the ARC challenge as a problem to find the actions given the input/output associations and utilize the generalizability of GPT4 to solve it!

Useful Abstraction Spaces: One thing which GPT4 does not do well is to identify objects accurately from text alone. Objects are defined as continuous sections of the grid with the same non-zero value. Providing such an object view as an abstraction space greatly helps with GPT4 to form associations with the input-output pair and is better able to find a solution. Providing various abstraction spaces such as object view and section view (multiple sections of the grid to overlay) to GPT4 can greatly reduce the complexity of the problem.

Encoding Human Biases via Primitive Functions: An initial implementation of using GPT4 to solve ARC was done with just prompting the human biases and action spaces via text. This did not do so well due to lack of grounding using words alone. A key innovation was the usage of primitive functions as action spaces, as a way to encode human priors. If we could use functions for grounding and express the semantic meaning of the function in words, GPT4 could use the function to provide the code needed for the solution. Hence, the problem now becomes finding out what are the primitive functions we need to encode in order for GPT4 to solve any generic ARC problem.

Recursive Feedback Loop from Environment: Another key idea is that a learning system would need to utilize feedback from the environment, and so a recursive loop feeding in feedback from the environment (whether the code matches the intended output) can help a lot in getting the right answer.

Using Memory for Additional Context: New problems might mix and match aspects of previous solutions, so having a memory bank to provide some examples of the code for similar solved problems in the past can help to ground GPT4 to better generate the answer.

LLMs as a System: Humans do not operate with only one system. We have various systems to call for various tasks. Similarly, we can have multiple expert agents for each task, like object-level, pixel-level, section-level and call on them to come up with their view of the task and select the most promising agent (GPT Problem Type Classification). This greatly helps narrow the search space for the solution. Then, we utilize the specialized functions this agent has and solve the problem (GPT Follow-up Prompt). Interfacing this agent with environment feedback, the problem-type specific abstraction space, past examples and action spaces can greatly help filter and ground GPT4 to generate a plausible solution. I believe that with better grounding via expert agents, and with better abstraction space representations and better primitive function grounding, we will eventually be able to solve most of the ARC tasks using this system.

Watch my YouTube video for more information.

Simon Strandgaard: Experiments and Tools

I have bundled multiple ARC datasets into a single repo, so it’s easier to process.

The most important piece of my project is arc-console, which is similar to stdout with the capability of logging HTML blobs, so ARC tasks can be inspected in great detail. I can compare the expected output with the predicted output and troubleshoot what is going on. To begin with, I printed ARC tasks to the terminal and it was difficult to see what was going on, due to terminal colors and ascii art. This frustration sparked the idea for arc-console. Being able to see things live is an invaluable tool for me.

I have 230 solutions written in the LODA assembly language. The solutions I have manually coded make some sense. The solutions that have been mined by mutating the existing solutions, make little sense, but somehow solve some tasks. The tasks that I have solved so far have been the low hanging fruit. If I run the miner for 24 hours, it may find a few solutions. If I provide more metadata for the initial state of the program, then it sometimes yields many new solutions within 24 hours. If all the inputs agree on a single background color, then it makes sense to provide to the assembly program. I often scroll through the arc-console, asking myself “what metadata can be useful?”.

If all the inputs agree on the same colors, that’s the same as doing intersection between the color histograms. If all the inputs agree there is a blue square, that’s also an intersection between the image meta labels. I do intersections between sets and nested sets. What remains after the intersections is something that may be relevant for solving the task. The stuff that gets discarded by doing intersection is less general.

My repos are loda-rust, arc-console, loda-arc-challenge, arcathon-docker-image, arc-notes.


There are arguably two main directions one can take to push the state of the art on ARC: First, one can try to better explicitly infuse prior human knowledge or leverage existing systems (e.g. LLMs). Alternatively, one can try to find or create additional ARC-like data to then learn from. The first camp does not depend on additional data, and may be trying to hardcode as good as possible relevant human knowledge into a rule-based system, e.g. via a DSL, or heuristics for a program synthesis approach, or assuming that there are already systems out there (e.g. LLMs) which are capable of achieving a high score on ARC, but are limited by our ways of utilizing them. The second, in its crudest form, would simply be training a model on a lot of (good and diverse) ARC-like data. Of course, those two views (i.e. infusing priors and learning from data) are not mutually exclusive, and finding better ways of combining both may be a promising route to take. Improvements on either front could be crucial. What speaks for the former camp, what for the latter? How could they be combined?

One could argue against the infusing priors approach by e.g. saying that it is undesirable in that it may result in a largely static system and that it is infeasible in the first place since we are very much lacking understanding of the human brain and by that limited in our abilities to explicitly translate human knowledge into computer code. One could oppose the learning from data approach by claiming that the machine learning approaches existing today are fundamentally lacking, and even given a great amount of high-quality data and compute, no great progress is to be expected without major innovations.

Nevertheless, it seems that there is indeed a need for, or at least utility from, additional data to learn from (since, while there is a lot of data covering a lot of the concepts relevant for ARC, almost no data in the ARC-format exists). This would also allow to falsify the hypothesis that current (meta-)learning methodologies are insufficient (by scoring well on the ARC test set with solely current learning methodologies). Generating data is indeed something that people have been working on. However, it is not trivial at all: How can you semi-automatically create many ARC-like tasks that broadly cover the core knowledge priors and exhibit great diversity, but which are also ? Is there an upside from having annotated data, i.e. not only generated tasks, but also programs (in some DSL) that solve those tasks, such that more supervised program synthesis approaches as opposed to mere fully end-to-end approaches are also enabled?

To also foster the other side: How could one systematically assess the quality of coded-up prior knowledge? Naively, one may think that one way could be to define quality metrics for a Domain-Specific Language, e.g. how slim and generic it is, and how short the programs are that solve particular ARC tasks which it enables (since the size of the DSL and the size of the solution programs obviously greatly influence the space of possibilities over which one has to do program search/synthesis). However, such a notion, while potentially useful, would of course be very limited, since there is a lot of knowledge that may not be captured by an ordinary DSL, e.g. meta-knowledge about how to use DSL primitives (i.e. a mapping from properties of an ARC tasks to which subset of a DSL would be relevant, or how DSL primitives interplay in the first place, etc.).

Those are the two main ways of trying to make progress on ARC. To what degree has the potential of fully hardcoded systems already been exhausted? How far can we get with pure end-to-end learning approaches? Can we bridge the gap between the two? Maybe creating additional ARC-like data and augmenting it with information like programs in a DSL, natural language descriptions, etc. is crucial for hybrid approaches (like supervised program synthesis). Or have we perhaps not looked hard enough in the right places to this day?

For Humankind

Villa Fontana
Obere Strasse 22B
7270 Davos, Switzerland


Powered by 

For Humankind

Villa Fontana
Obere Strasse 22B
7270 Davos, Switzerland


Powered by

For Humankind

Villa Fontana
Obere Strasse 22B
7270 Davos, Switzerland


Powered by