The large language models that have boomed in power and popularity over the last year are based on a neural network architecture called the transformer. Transformers were originally designed to efficiently learn complicated long-range correlations arising in natural language processing and are now being applied to other areas, including many-body quantum physics, where they are being used as flexible variational quantum states.
Interest in neural network quantum states has grown rapidly since 2017, when Carleo and Troyer showed that a neural network architecture called the Restricted Boltzmann Machine could be trained to find the ground state wavefunction of the transverse-field Ising and antiferromagnetic Heisenberg models. One limitation of this original work was the difficulty of computing ground state expectation values with this architecture: the trained network takes a spin configuration as its input and returns the corresponding probability amplitude, so time-consuming Markov chain Monte Carlo sampling is needed to evaluate expectation values.
Markov chain Monte Carlo sampling can be avoided by using a different parameterization of the many-body quantum state. For example, the autoregressive quantum state encodes the many-body wavefunction $\Psi(\boldsymbol{s})$ as a product of conditional probability distributions:
$$\Psi(\boldsymbol{s}) = \prod_{i=1}^N \psi_i(s_i | s_{i-1}, \ldots, s_1) = \psi_1(s_1) \psi_2(s_2 | s_1) \psi_3(s_3 | s_2, s_1) \cdots \psi_N(s_N | s_{N-1}, \ldots, s_2, s_1) $$
One can thereby draw unbiased samples from the many-body ground state by first drawing the first spin, $s_1$, according to the learned probability distribution $\psi_1(s_1)$, followed by $s_2$ according to the conditional probability distribution $\psi_2(s_2 | s_1)$, and so on until a complete spin configuration is obtained. However, there is a conservation of misery: we avoid Markov chain Monte Carlo sampling, but must instead learn an exponential number of conditional probability distributions! Luckily, it is empirically observed that a single neural network can encode all this information if we allow it to take as an additional input a hidden vector $h_i$ that encodes information about the previously drawn spins. This gives the neural network an autoregressive structure, in that the output from one pass is sequentially fed back into its input.
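The spin-by-spin sampling described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the architecture of any work mentioned here: the "network" is a hypothetical logistic model (the names `conditional_up_probability`, `weights` and `biases` are invented for this example), it conditions directly on the previous spins rather than on a hidden vector $h_i$, and it treats each conditional as a real probability, ignoring the phase of the wavefunction.

```python
import itertools
import math
import random


def conditional_up_probability(i, previous_spins, weights, biases):
    """Toy stand-in for the network: P(s_i = +1 | s_1, ..., s_{i-1})."""
    field = biases[i] + sum(w * s for w, s in zip(weights[i], previous_spins))
    return 1.0 / (1.0 + math.exp(-field))


def sample_configuration(weights, biases, rng):
    """Draw one configuration, one spin at a time, each conditioned on the earlier ones."""
    spins = []
    for i in range(len(biases)):
        p_up = conditional_up_probability(i, spins, weights, biases)
        spins.append(1 if rng.random() < p_up else -1)
    return spins


def configuration_probability(spins, weights, biases):
    """Probability of a full configuration as a product of the conditionals."""
    prob = 1.0
    for i, s in enumerate(spins):
        p_up = conditional_up_probability(i, spins[:i], weights, biases)
        prob *= p_up if s == 1 else 1.0 - p_up
    return prob


# Hypothetical, randomly chosen parameters for N = 4 spins (assumption: untrained toy model).
N = 4
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(i)] for i in range(N)]
biases = [rng.uniform(-1, 1) for _ in range(N)]

# One independent sample, obtained with N sequential evaluations of the toy model.
sample = sample_configuration(weights, biases, rng)

# Brute-force check that the product of conditionals is normalized over all 2^N configurations.
total = sum(
    configuration_probability(list(s), weights, biases)
    for s in itertools.product([-1, 1], repeat=N)
)
```

Because every conditional is individually normalized, the product is automatically normalized, which the last lines verify by brute force for small `N`. The sketch also makes the cost visible: each sample needs `N` sequential evaluations, but successive samples are independent of one another.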
Autoregressive quantum states were inspired by the autoregressive neural networks developed for natural language processing tasks. The performance and scalability of autoregressive neural networks are limited by the need for sequential processing to generate a single sample, the potential for vanishing gradients, which makes the network difficult to train, and a bias of the learned probability distribution towards the most recently sampled spins. More advanced formulations based on masked convolutional networks are able to generate entire configurations using a single evaluation of the network, but still encounter issues with trainability and with encoding distributions that exhibit complex correlations.
Then along came the transformer architecture. The innovation here was the inclusion of multiple parameterized transformations of the input data that can be trained to pick out different features and (long-range) correlations: the multi-head attention. Once the key features are identified by the multi-head attention, a relatively simple feed-forward neural network is sufficient to compute the output probability. The success of this architecture for language modelling is now inspiring many studies on applications to many-body physics.
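For readers who want the mechanics, the standard textbook form of multi-head attention can be written as below. This is the generic formulation from the original transformer literature, not a description of any particular quantum-state implementation, and the symbols are introduced here for illustration: $X$ is the matrix of embedded inputs (one row per site or token), $W_h^Q$, $W_h^K$, $W_h^V$ and $W^O$ are trainable matrices, $H$ is the number of heads, and $d_k$ is the dimension of each key vector.

$$\mathrm{head}_h = \mathrm{softmax}\!\left(\frac{(X W_h^Q)(X W_h^K)^T}{\sqrt{d_k}}\right) X W_h^V, \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O $$

Each head computes a matrix of pairwise weights between all positions, so distant sites can influence one another in a single layer. When a transformer is used autoregressively, a causal mask is typically applied inside the softmax so that position $i$ only attends to positions $j \le i$.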
One line of research is exploring transformer neural networks as a flexible ansatz capable of describing families of many-body quantum systems, exemplified by the paper "Transformer quantum state: A multipurpose model for quantum many-body problems." In this work, the transformer neural network was trained to learn the many-body ground states of a family of Ising models. It therefore takes the model parameters (e.g. the applied magnetic field strength) as its input and then draws samples of ground state spin configurations. The model can also extrapolate to predict properties of ground states not included in the training data, albeit with lower accuracy, particularly when attempting to extrapolate across a phase transition. The network can also be "inverted" to perform a maximum likelihood estimation of the system's parameters given a few spin configurations drawn from its ground state, analogous to shadow tomography of many-body quantum states.
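Schematically, the inversion can be written as a standard maximum likelihood problem. The notation here is an assumption made for illustration and is not taken from the paper: $\lambda$ denotes the Hamiltonian parameters, $p_\theta(\boldsymbol{s} \mid \lambda)$ is the probability the trained network assigns to configuration $\boldsymbol{s}$ when given $\lambda$ as input, and $\boldsymbol{s}^{(1)}, \ldots, \boldsymbol{s}^{(M)}$ are the $M$ measured configurations.

$$\hat{\lambda} = \arg\max_{\lambda} \sum_{k=1}^{M} \log p_\theta\big(\boldsymbol{s}^{(k)} \mid \lambda\big) $$

Since an autoregressive model returns the exact probability of any configuration, the sum can be evaluated directly, and the network weights $\theta$ stay fixed while only $\lambda$ is varied.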
A second line of research is exploring the use of transformers as a means of accurately computing ground state energies of specific strongly-correlated model Hamiltonians, such as the Shastry-Sutherland model; see, for example, the paper "Transformer Variational Wave Functions for Frustrated Quantum Spin Systems". In this case, an architecture called the vision transformer is trained to learn the complex correlations present in the ground state. The biggest challenge is training the network, which is particularly difficult for complex-valued networks. However, recent work has shown that a two-stage architecture that applies a real-valued transformer followed by a complex fully-connected neural network can be trained more easily.
What next for this hot topic? Better training methods or more easily trainable transformer architectures are needed, since training data for quantum many-body systems is much harder to come by than web-scraped training data for large language models. Future research on applications of classical transformer neural networks will likely be divided between problem-specific models tailored to solve certain hard many-body problems, and less accurate general-purpose "foundation" models, which may be useful for generating initial guesses for other, more precise iterative methods. Beyond this, quantum and quantum-inspired generalisations of the transformer architecture are also cool directions to watch!

