Progress in training large language models (LLMs) for natural language processing (NLP), reasoning and language generation has been paradigm shifting in the last few years. Discussions have arisen on whether and how the technology can be introduced into almost every sector. Until now, however, the practical use of large language models has remained largely virtual: the use of LLMs for manipulating the physical world is still fairly uncharted and untested.
Automation through robotics has changed, and continues to change, some large industries, particularly logistics and production. Think here of assembly robots on production lines, or the use of robots in Amazon and Ocado warehouses. One of the central challenges in robotics is enabling robots to understand high-level instructions, allowing for non-expert robot operation and effective, cooperative human-robot workspaces. A second challenge is generalisation: getting robots to perform a wide variety of tasks on command and to learn new tasks without retraining or explicit programming. LLMs have the potential to overcome some of these challenges.
The most famous example of an LLM is ChatGPT, a chatbot that uses NLP and generative AI to respond to users’ input. Other commonly used examples are translation websites and customer service chatbots, although these tend to be smaller models trained on significantly less data, targeted at more specific tasks.
LLMs are neural networks trained on enormous amounts of language data, leading, in some cases, to hundreds of billions of parameters. They process this data using a transformer-based architecture which, unlike previously developed artificial neural network architectures, processes input data concurrently rather than sequentially. Transformers also make far better use of graphics processing units (GPUs), which are designed for parallel computation. This makes deep learning with transformers significantly faster than with previous architectures, allowing for larger training datasets and larger networks.
The concurrent processing of training and input data also makes large language models more accurate than previous deep learning language models. Sentences and other strings of language are not processed sequentially, word by word, but as a whole: attention values are assigned between words, so that words which are far apart in the sequence but close in meaning can still form strong connections. Word order no longer forces weak connections between words that are distant in position but related in meaning.
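To make this concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation inside a transformer. This is an illustrative simplification in Python with NumPy, not a full multi-head implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention.

    Every position attends to every other position at once, so two
    words that are far apart in the sequence can still receive a high
    attention weight if their representations are related.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarity of all positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # weighted mix of the whole sequence

# Toy example: a 4-word "sentence" with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)          # self-attention: Q = K = V
print(out.shape)                                     # (4, 8): each position sees all others
```

Because the whole sequence is handled in one matrix multiplication rather than step by step, the computation maps naturally onto GPUs, which is where the speed advantage described above comes from.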
These advantages make LLMs the first broadly reliable approach to language processing. The massive amounts of data they can be trained on, together with the abundant availability of language data, make these models extremely generalisable and robust to a wide range of inputs, phrasings and purposes.
The most obvious use for LLMs in robotics is human-robot interaction in service robots: LLMs allow agents to process and react to human interaction in accurate and convincing ways. A less obvious but more challenging application is in logistics and industry, and in the last five years research has increasingly focussed on the possibilities in this space.
Since the release of ChatGPT, LLMs have seen a plethora of applications, including robotics. Two key contributions of LLMs are being harnessed for robotics: language, audio and visual processing, and improved reasoning, decision-making and planning. In other words, the input is better understood and the output more accurate. The core challenge is how to translate NLP and reasoning into output that can act on the physical world; robots are only as good as their ability to navigate and interact with their physical environment.
Some of the new possibilities have been tested across several fields of robotics. Those described below focus on outcomes that can be applied to industry and logistics: improved generalisation and NLP of task prompts.
Vemprala et al. implement OpenAI’s ChatGPT in robotics, leveraging its natural language processing, reasoning and code synthesis capabilities to develop a system that takes high-level natural language instructions and outputs corresponding robot-executable code. The code synthesis depends on a library of high-level, descriptively labelled API functions, which allows ChatGPT to reason about and build connections between APIs such as OpenCV (computer vision), a classic grounding tool in robotics for environment navigation and object detection and recognition, and ROS (Robot Operating System). Although highly general, using a library of preset atomic functions limits the robot to tasks that can be completed with the functions included in the library. A benefit of using ChatGPT for NLP is that it allows multiple prompts and retains knowledge across them, giving more accurate results.
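The pattern can be sketched as follows. The function names here are hypothetical placeholders, not Vemprala et al.’s actual library; the point is that the LLM is shown only a small set of high-level, descriptively named functions and asked to compose them:

```python
# Hypothetical atomic API functions exposed to the LLM; a real system
# would back these with, e.g., OpenCV for perception and ROS for motion.

def detect_object(name: str) -> tuple[float, float, float]:
    """Return the (x, y, z) position of a named object."""
    ...

def move_to(x: float, y: float, z: float) -> None:
    """Drive the robot's end effector to a position."""
    ...

def grasp() -> None:
    """Close the gripper."""
    ...

SYSTEM_PROMPT = """You control a robot arm. You may ONLY call these functions:
detect_object(name) -> (x, y, z)
move_to(x, y, z)
grasp()
Respond with Python code only."""

# Given the instruction "pick up the red cup", the model might return:
#
#   x, y, z = detect_object("red cup")
#   move_to(x, y, z)
#   grasp()
#
# which is then reviewed by an operator and executed on the robot.
```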
Liang et al. use code-writing LLMs to synthesise robot-executable code from natural language prompts paired with example policy code. Although this offers potentially greater generalisability, because it does not depend on a high-level API function library, the method is less widely applicable in real-world settings because it requires operators to write prompts with expert knowledge of robot policy code.
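A rough sketch of this few-shot style is shown below; the helper names and example tasks are illustrative, not taken from Liang et al.’s prompts:

```python
# The prompt pairs example instructions with example policy code; a
# code-writing LLM continues the pattern for a new instruction.

FEW_SHOT_PROMPT = '''
# instruction: stack the blue block on the red block.
blue = get_pose("blue block")
red = get_pose("red block")
pick_place(pick_pose=blue, place_pose=red)

# instruction: move the apple 10 cm to the left.
apple = get_pose("apple")
target = translate(apple, dx=-0.10)
pick_place(pick_pose=apple, place_pose=target)

# instruction: {new_instruction}
'''

def synthesise_policy(llm, new_instruction: str) -> str:
    """Ask a code-writing LLM (any text-completion API) to continue the pattern."""
    return llm.complete(FEW_SHOT_PROMPT.format(new_instruction=new_instruction))
```

Writing good example policy code is exactly where the expert knowledge comes in, which is the limitation noted above.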
A fundamental but challenging part of robotics is grounding reasoning and automation in the physical environment of the robotic agent. Some new approaches have been tested using multi-modal LLMs that perform NLP in the context of processed visual input.
Shah et al. and Huang et al. offer different solutions, each comprising a deep learning model that processes visual and natural language input using a multi-modal LLM. Shah et al.’s solution extracts landmarks from the natural language prompt and locates them in visual data to facilitate robot navigation, while Huang et al.’s solution is more generalisable, composing 3D value maps of the environment. Although the former is largely limited to outdoor robot navigation and the latter to smaller, more immediate environments, both offer insight into the possibilities of multi-modal LLMs for grounding and navigation in mobile autonomous agents and manipulators.
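As a much-simplified illustration of the value-map idea, the environment can be discretised into a 3D grid and each cell scored for task relevance; the grid size, scoring stand-in and task string below are assumptions, not Huang et al.’s pipeline:

```python
import numpy as np

GRID = (20, 20, 10)  # x, y, z cells covering the workspace

def score_cells_for_task(task: str) -> np.ndarray:
    """Stand-in for a multi-modal model assigning each cell a value
    reflecting how relevant it is to the task (random placeholder here)."""
    rng = np.random.default_rng(0)
    return rng.random(GRID)

value_map = score_cells_for_task("place the cup near the sink")
target_cell = np.unravel_index(value_map.argmax(), GRID)
print("steer toward cell:", target_cell)  # the planner heads for the highest-value cell
```

In the full approach the map is produced from language and vision jointly, so a phrase like "near the sink" raises the value of cells the camera identifies as close to the sink.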
All of the above examples use natural language processing to understand operator prompts; Zhang et al., by contrast, developed a solution for inter-robot communication using LLMs, in which GPT-4 mediates natural language communication about tasks between robots. This could be instrumental to industry and logistics automation because it allows different automated guided vehicles (AGVs), running in decentralised systems and potentially using different APIs and middleware services, to communicate and coordinate.
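A sketch of how such LLM-mediated coordination might look is given below; the message wording and action set are illustrative assumptions, not Zhang et al.’s protocol:

```python
# Each robot reports its status in plain English; an LLM on the
# receiving side turns the message into an action from that robot's
# own API, so fleets with different software stacks can coordinate.

TRANSLATION_PROMPT = """You coordinate warehouse robots. Given a message
from another robot, reply with ONE action for this robot, chosen from:
wait(seconds), go_to(location), handover(item).
Message: {message}"""

def coordinate(llm, message: str) -> str:
    """Ask the LLM to choose this robot's next action from a peer's message."""
    return llm.complete(TRANSLATION_PROMPT.format(message=message))

# e.g. coordinate(llm, "AGV-7: pallet 42 picked up, heading to bay 3")
# might return: go_to("bay 3")
```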
These solutions, although in their infancy, have massive potential for further automation in logistics and industry. LLMs in robotics offer reliable multi-prompt human-robot interaction, with options for mid-execution corrections and zero-shot learning of new tasks. The possibility of general-purpose robots is particularly exciting for automation with limited space and/or resources for several autonomous agents. Inter-robot communication could allow different robotic systems to interact at low computational cost, without having to develop systems that bridge software and system differences between autonomous agents; it would also remove the need for operator supervision between automated tasks.
Despite this progress, there are still significant challenges for deep learning and its application in robotics. Grounding language understanding and reasoning in the physical environment, particularly at a generalisable scale rather than within already confined parameters, remains difficult; solving it would greatly improve the accuracy of NLP applied to robotics and automation. Secondly, testing and oversight protocols still have a long way to go before solutions like those discussed here can be implemented at larger scale, although the generalisation and communication options created by LLMs open up discussion of many new possibilities and methods in this area. Lastly, as is so often the case in machine learning, it is very difficult to come by or create large, diverse datasets for a specific purpose, in this case language-robot interactions for learning and performing new tasks in industrial environments.
Kara is a Trainee Patent Attorney in the Engineering practice group. Kara specialises in software, machine learning, mobile robotics and path planning algorithms.
Email: kara.quast@mewburn.com