The Search for Compositional Generalization
Methodologically, I believe in figuring out the science before building the solution. In other words, I believe in deeply understanding what I’m building: if something works, I want to know why, and if something doesn’t, I also want to know the reason. After all, ARC is not an end in and of itself. It is, instead, a challenge meant to drive our research into the essence of intelligence. This means that I’m currently focused on discovering how an AI system can learn to reason about things like the ARC challenge. I’m interested in better understanding the limitations of program synthesis and deep learning techniques (like the transformer), in order to then figure out the appropriate architecture.
📅 October 19, 2023
In spite of its success in many areas, deep learning has its issues. Imagine that you are asked to build a system that can classify a number as prime or non-prime (a.k.a. composite). Although I have never tried it, it seems fairly self-evident that even the most powerful Transformer in the world could never learn to solve this problem in a way that generalizes beyond the training set. Instead, it would simply “memorize” the training set. Or, at least, that is certainly true if the training data consists only of numbers and their corresponding classification labels. There are at least two reasons for this, which I will elaborate on later. This example matters because, much like solving an ARC problem, learning how to classify a number as prime or composite implies learning an algorithm.
At the opposite end of the bias-variance spectrum, you might have a DSL-based approach. Or, better yet, a simple hard-coded program that checks whether the number is prime. Such a program would take as input the number n to classify and loop over all integers from 2 to the square root of n. If it finds a number that divides n evenly, it returns the label “composite”. Otherwise, it returns the label “prime”. For the sake of this discussion, though, let’s assume that this simple hard-coded solution is not allowed. Let’s assume that we’re interested in a solution that shows at least some degree of “learning”.
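For concreteness, here is a minimal Python sketch of that hard-coded trial-division check (treating 0 and 1 as non-prime by convention, since they are neither prime nor composite):

```python
import math

def classify(n: int) -> str:
    """Trial division: loop over candidate divisors from 2 up to the
    square root of n; any divisor that divides n evenly means n is composite."""
    if n < 2:
        return "composite"  # 0 and 1 are neither; lumped with composite in this sketch
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return "composite"
    return "prime"
```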
I think we can easily imagine that with the proper function primitives – namely things such as an implementation of a loop, a modulo operation, and an equality comparison operator – a DSL-based search might eventually stumble upon the aforementioned solution. This is what I find convincing about a DSL-based approach. It composes atomic building blocks of knowledge into a new solution that solves a (potentially) previously unseen problem. I find it impossible to imagine a solution to the ARC challenge that doesn’t perform some type of compositional generalization, whether implicitly or explicitly. Although, whether this necessarily involves an explicit search algorithm is currently unclear to me.
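To make the idea concrete, here is a toy sketch of such primitives (the names `mod`, `eq`, and `bounded_loop` are my own illustrations, not an actual DSL), with the prime classifier expressed as just one composition in the space a search would explore:

```python
# Hypothetical DSL primitives (names are purely illustrative).
def mod(a, b):
    return a % b

def eq(a, b):
    return a == b

def bounded_loop(n, body):
    """Loop primitive: run `body` for each candidate d in [2, sqrt(n)];
    stop early and return True as soon as `body` signals a hit."""
    d = 2
    while d * d <= n:
        if body(n, d):
            return True
        d += 1
    return False

# One point in the search space: a composite-number detector,
# expressed purely as a composition of the primitives above.
def is_composite(n):
    return bounded_loop(n, lambda n, d: eq(mod(n, d), 0))
```

A search would have to stumble upon exactly this nesting of primitives; the combinatorial explosion mentioned below comes from the number of alternative compositions it must consider along the way.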
What I find less convincing about DSL-based approaches is their “hard-codedness”. This manifests itself in at least two criticisms:
- Without learning, a purely search-driven solution will run into an unmanageable combinatorial explosion for sufficiently complex problems (and, in my experience, that threshold of sufficient complexity is astonishingly low, especially for a relatively low-level DSL)
- There is a reason the field of AI has evolved from classical symbolic methods (or so-called “expert systems”) to machine learning: at some point it becomes unmanageable to hard-code everything. We need a system that can learn the complexity on its own.
Due to the aforementioned weaknesses found at both ends of the spectrum, it seems intuitive to me that the ideal solution is somewhere in the middle. Unlike DSL-based methods, I want to be able to learn the primitives, or “core knowledge”, from data. I might also want to guide the search using learning. But unlike most deep learning approaches, I want the compositionality that allows for systematic generalization. Although saying that deep learning struggles with compositionality is probably controversial, it has been my experience so far.
Back to the prime number example: there are at least two reasons why I believe a Transformer could not learn to classify numbers as prime or composite. First, such a solution requires the ability to loop until a certain condition is met (namely, until the candidate divisor exceeds the square root of n). However, it is well known that standard Transformers are architecturally incapable of such a thing [1, 2]. Universal Transformers, however, should be capable of looping, at least in principle.
Second, whenever the data-generating program has a sufficient degree of complexity (in terms of intermediate reasoning steps), there is ample empirical evidence that intermediate supervision is required [3, 4, 5, 6]. The inputs and outputs of the program alone are not sufficient. In the prime number task, this is precisely why merely providing the numbers and their prime/composite labels isn’t enough.
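As an illustration, intermediate supervision for the prime task could look like this hypothetical `make_traces` helper, which emits every step of the trial-division algorithm as its own labeled example, rather than only the final (n, label) pair:

```python
import math

def make_traces(n: int):
    """Hypothetical intermediate supervision for n >= 2: expose each
    reasoning step (divisor tried, remainder, zero-check) as its own
    supervised example, instead of only the final (n, label) pair."""
    steps = []
    for d in range(2, math.isqrt(n) + 1):
        r = n % d
        steps.append({"n": n, "divisor": d, "remainder": r, "is_zero": r == 0})
        if r == 0:  # a zero remainder settles the classification
            break
    label = "composite" if steps and steps[-1]["is_zero"] else "prime"
    return steps, label
```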
With these two insights in mind, one can start to shape a high-level solution – or, at least, a new hypothesis. Imagine if we trained a few independent neural networks on the intermediate concepts first. For example, one dataset would be about learning the modulo operation. Another dataset would be about checking whether a number is equal to zero. Another dataset would be about returning the square root of a number. These pretrained modules could very much act like the function primitives in a DSL-based solution.
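A sketch of what those per-concept datasets might look like (the function names and value ranges are my own, purely for illustration):

```python
import math
import random

# Hypothetical per-concept datasets for pretraining independent modules.
def modulo_dataset(k=1000):
    """Examples of the modulo operation: ((a, b), a mod b)."""
    data = []
    for _ in range(k):
        a, b = random.randint(0, 99), random.randint(1, 20)
        data.append(((a, b), a % b))
    return data

def is_zero_dataset(k=1000):
    """Examples of the zero-check: (x, x == 0)."""
    return [(x, x == 0) for x in (random.randint(0, 5) for _ in range(k))]

def sqrt_dataset(k=1000):
    """Examples of the (integer) square root: (n, isqrt(n))."""
    return [(n, math.isqrt(n)) for n in (random.randint(0, 9999) for _ in range(k))]
```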
Now, imagine that you could compose these pre-trained modules in a framework that supports conditional loops, perhaps based on the Universal Transformer idea. In principle, you would have a solution. The key elements are:
- A curriculum in which the data is of gradually increasing complexity to represent the intermediate concepts.
- An inductive bias that supports conditional loops.
- The ability to learn concepts as independent modules, and a search over these modules to compose them into a (new) solution.
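The composition step can be sketched as follows, with each pretrained module stubbed by the exact function it is meant to approximate (the stubs stand in for the neural networks; what matters here is the control structure wrapped around them):

```python
import math

# Each "pretrained module" is stubbed by the exact function it would
# approximate; the point is the conditional loop composing them.
modules = {
    "mod":     lambda a, b: a % b,  # stand-in for a net trained on modulo
    "is_zero": lambda x: x == 0,    # stand-in for a net trained on zero-checking
    "sqrt":    math.isqrt,          # stand-in for a net trained on square roots
}

def compose_prime_check(n: int) -> str:
    bound = modules["sqrt"](n)
    d = 2
    while d <= bound:  # the conditional loop the architecture must support
        if modules["is_zero"](modules["mod"](n, d)):
            return "composite"
        d += 1
    return "prime"
```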
Regarding the third point, some might argue that Transformers have the ability to implicitly modularize and compose knowledge, but I am decreasingly confident about that. This is why, in the “imagined solution” above, I propose independent neural network modules to learn the distinct concepts. To be clear, I’m not saying that one should manually train each module on each separate dataset. I’m only saying that I would like to find an algorithm that can learn “as if” this painstaking modularization had been done.
My doubts regarding Transformers’ ability to perform this kind of compositional generalization result from a combination of personal experimentation and reading various papers that cast doubt on this [7, 8, 9, 10, 11].
Conceptually, whether or not a Transformer is able to store the learned concepts in somewhat isolated or independent “parts” of itself, it is the mechanism for searching for new combinations thereof which seems to be lacking. There is no explicit search in the Transformer, and I fail to imagine how a learned mechanism could ever search for new compositions that have never been seen in the training set.
For example, suppose that a Transformer manages to learn the modulo and the equality operator as two fairly distinct and independent sets of weights, from training on the intermediate task data. When training on the prime number task itself, by what mechanism could it leverage these “knowledge modules”, instead of simply re-training the weights and “overwriting” this prior knowledge (so-called catastrophic interference)? This could just be a failure of my imagination, but I think a fairly sophisticated meta-learning mechanism is required here.
This problem potentially touches on many aspects: continual learning, in-context learning, modular neural networks. Regarding the latter, some interesting research has shown the promise of modularity. According to this research [12, 13], the “distributed representation prior” that exists in most standard deep learning architectures is not suitable for compositionality. Instead, they propose algorithms in which various independent sets of weights are meta-trained via a “gating model” (also sometimes called a communication module) that switches on and off the relevant modules and manages the messaging between these modules based on the current task.
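A toy NumPy sketch of the gating idea, assuming hard top-k gating over randomly initialized modules (in the cited work both the expert modules and the gating model are learned, and the communication between modules is richer; everything below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy gated modular layer: independent expert modules plus a gating model
# that switches modules on and off per input via hard top-k selection.
n_modules, d = 4, 8
experts = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_modules)]
gate_w = rng.standard_normal((d, n_modules)) * 0.1

def forward(x, k=2):
    scores = x @ gate_w               # the gating model scores every module
    active = np.argsort(scores)[-k:]  # keep only the k highest-scoring modules
    out = np.zeros_like(x)
    for i in active:                  # inactive modules contribute nothing
        out += np.tanh(x @ experts[i])
    return out / k
```

Because only the active modules receive gradient on a given input, the inactive ones are shielded from being overwritten, which is one intuition for why such architectures resist the catastrophic interference described above.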
The colorized Pentomino problem is one that has stumped standard deep learning architectures [4, 13]. It consists of displaying three Pentomino shapes of randomized color, scale, rotation and position. The algorithm must learn to output 1 when the three shapes are the same (regardless of color, scale, rotation and position), and 0 otherwise. The reason this is extremely difficult to learn for most deep learning approaches is that the task is composed of two sub-tasks that contain a lot of invariances. Namely,
- The neural network must learn to “identify” a Pentomino shape regardless of its color, position, rotation or scale.
- The neural network must learn to classify whether all three shapes are the same or not, regardless of permutation.
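The labeling rule itself is trivial to state symbolically, which underlines that the difficulty is entirely in the invariances (shape names and nuisance parameters below are illustrative; the real task renders pixels):

```python
import random

PENTOMINOES = list("FILNPTUVWXYZ")  # the 12 standard pentomino shape names

def sample_task(rng=random):
    """Labeling rule of the colorized Pentomino task: the nuisance factors
    (color, scale, rotation) never affect the label, which depends only on
    whether the three underlying shapes coincide."""
    shapes = [rng.choice(PENTOMINOES) for _ in range(3)]
    nuisances = [{"color": rng.random(), "scale": 1 + rng.random(),
                  "rotation": rng.randrange(360)} for _ in shapes]
    label = int(shapes[0] == shapes[1] == shapes[2])
    return shapes, nuisances, label
```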
A modular convolutional architecture known as ResMixNet [13] is able to solve this problem with 99.12% accuracy, while the best standard convolutional architectures cannot exceed 70%. Equally important for independent researchers who don’t have access to virtually unlimited resources: these modular architectures seem to be much more efficient than the “distributed representation” baselines, both in terms of model size (number of parameters needed) and convergence speed.
Although these results have gone largely unnoticed in the machine learning community, they sound to me like they might be the desired bridge between DSL-based solutions and deep learning. That is why they are currently the main focus of my experiments.
References

[1] Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., & Kaiser, Ł. (2018). Universal Transformers. arXiv preprint arXiv:1807.03819.
[2] Hahn, M. (2020). Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8, 156–171.
[3] Shalev-Shwartz, S., Shamir, O., & Shammah, S. (2017). Failures of gradient-based deep learning. In International Conference on Machine Learning (pp. 3067–3075). PMLR.
[4] Gulcehre, C., & Bengio, Y. (2016). Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research, 17(1), 226–257.
[5] Cornacchia, E., & Mossel, E. (2023). A mathematical model for curriculum learning. arXiv preprint arXiv:2301.13833.
[6] Wies, N., Levine, Y., & Shashua, A. (2023). Sub-task decomposition enables learning in sequence to sequence tasks. In Proceedings of the 11th International Conference on Learning Representations (ICLR).
[7] Delétang, G., Ruoss, A., Grau-Moya, J., Genewein, T., Wenliang, L. K., Catt, E., … & Ortega, P. A. (2022). Neural networks and the Chomsky hierarchy. arXiv preprint arXiv:2207.02098.
[8] Moskvichev, A., Odouard, V. V., & Mitchell, M. (2023). The ConceptARC benchmark: Evaluating understanding and generalization in the ARC domain. arXiv preprint arXiv:2305.07141.
[9] Dziri, N., Lu, X., Sclar, M., Li, X. L., Jian, L., Lin, B. Y., … & Choi, Y. (2023). Faith and fate: Limits of Transformers on compositionality. arXiv preprint arXiv:2305.18654.
[10] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A. C., Korbak, T., & Evans, O. (2023). The reversal curse: LLMs trained on “A is B” fail to learn “B is A”. arXiv preprint arXiv:2309.12288.
[11] Wu, Z., Qiu, L., Ross, A., Akyürek, E., Chen, B., Wang, B., … & Kim, Y. (2023). Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477.
[12] Madan, K., Ke, N. R., Goyal, A., Schölkopf, B., & Bengio, Y. (2021). Fast and slow learning of recurrent independent mechanisms. arXiv preprint arXiv:2105.08710.
[13] Jo, J., Verma, V., & Bengio, Y. (2018). Modularity matters: Learning invariant relational reasoning tasks. arXiv preprint arXiv:1806.06765.