![](https://crypto4nerd.com/wp-content/uploads/2024/04/1kXokeuvYfcwWkIsSKcrCoQ.png)
Large Language Models (LLMs) have revolutionized various applications, including machine translation, text summarization, dialogue systems, and code generation. Yet, the hefty computational requirements for pretraining these models pose significant barriers to broader accessibility and development.
To address these challenges, recent open-source initiatives like BLOOM, StarCoder, and StarCoder-2 have emerged, aiming to democratize access to pretrained LLMs. However, these models encounter limitations such as restricted multilingual capabilities, computational intensity, and the risk of catastrophic forgetting during continual pretraining.
In a new paper Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order, a collaborative team of researchers from 33 institutions presents AURORA-M, the first open-source multilingual language model red-teamed in accordance with the U.S. Executive Order. AURORA-M, a 15-billion-parameter model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code, is specifically designed to mitigate the aforementioned limitations.
The team summarizes their main contributions as follows:
- Introduction of AURORA-M: A 15-billion-parameter multilingual LLM derived from the StarCoderPlus model via continual pretraining.
- Two-Stage Curriculum: AURORA-M is trained with a two-stage continual pretraining curriculum, Continual Auxiliary Pretraining (CAP) followed by Continual Alignment Tuning (CAT), designed to maximize adaptation, minimize catastrophic forgetting, and align the model with safety objectives (a minimal training sketch follows this list).
- Extensive Evaluation: AURORA-M undergoes comprehensive evaluation across diverse tasks, domains, and languages, demonstrating superior multilingual performance while maintaining competitiveness in English and coding tasks.
- Development of Red-Teaming Dataset: The creation of “The Biden-Harris Redteam Dataset” addresses concerns outlined in the Executive Order, along with standard safety considerations. AURORA-M is fine-tuned on this dataset and evaluated against various safety benchmarks.
- Scalability Analysis: The impact of scaling total training tokens on multilingual and code evaluation tasks is thoroughly examined.
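While the paper does not release its training code, the two-stage recipe maps naturally onto standard Hugging Face tooling. Below is a minimal sketch of CAP followed by CAT, assuming hypothetical dataset files (`cap_mixture.jsonl`, `cat_mixture.jsonl`) and illustrative hyperparameters rather than the paper's actual values:

```python
# Minimal sketch of the two-stage continual-pretraining curriculum (CAP -> CAT).
# Dataset files and hyperparameters are illustrative, not the paper's actual values.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "bigcode/starcoderplus"  # AURORA-M starts from StarCoderPlus

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # StarCoder tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal-LM labels

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

def continual_stage(model, data_file, output_dir, lr):
    """Run one continual-pretraining stage with a standard next-token objective."""
    dataset = load_dataset("json", data_files=data_file, split="train")
    dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=1,
        learning_rate=lr,
        num_train_epochs=1,
        bf16=True,
    )
    trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
    trainer.train()
    return trainer.model

# Stage 1 - Continual Auxiliary Pretraining (CAP): a large general
# multilingual + code mixture (placeholder file name).
model = continual_stage(model, "cap_mixture.jsonl", "aurora-m-cap", lr=1e-4)

# Stage 2 - Continual Alignment Tuning (CAT): a smaller curated mixture that
# includes instruction and safety data such as the Biden-Harris Redteam Dataset
# (again a placeholder file name).
model = continual_stage(model, "cat_mixture.jsonl", "aurora-m-cat", lr=1e-5)
```

A lower learning rate in the second stage is a common way to keep the curated alignment data from overwriting what the first stage learned, which matches the curriculum's stated goal of minimizing catastrophic forgetting.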
AURORA-M is designed to handle five linguistically diverse languages (English, Finnish, Hindi, Japanese, and Vietnamese) as well as code. Its continual pretraining on 435 billion additional tokens equips it with an in-depth understanding of language nuances and coding structures.
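For readers who want to probe the model's multilingual and code abilities directly, a standard transformers generation loop suffices. The Hub model id below is an assumption; check the Aurora-M project page for the exact released checkpoint:

```python
# Quick multilingual/code generation probe. MODEL_ID is a guess at the released
# checkpoint's Hub id; verify against the Aurora-M project page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "aurora-m/aurora-m-base"  # hypothetical id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

prompts = {
    "English": "The capital of Finland is",
    "Finnish": "Suomen pääkaupunki on",
    "Japanese": "日本の首都は",
    "code": "def fibonacci(n: int) -> int:",
}

for tag, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    print(f"[{tag}] {tokenizer.decode(output[0], skip_special_tokens=True)}")
```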
Emphasizing safety as a core principle, AURORA-M is the first open-source multilingual LLM fine-tuned on a comprehensive collection of human-reviewed safety instructions, aligning it with the Biden-Harris Executive Order on the safe, secure, and trustworthy development and use of AI.
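The summary does not spell out the dataset's schema, but safety instruction tuning of this kind typically pairs a problematic prompt with a human-reviewed refusal rendered through a prompt template. The record and template below are purely illustrative assumptions, not the dataset's actual format:

```python
# Illustrative (hypothetical) record for safety instruction tuning; the actual
# Biden-Harris Redteam Dataset schema may differ.
safety_record = {
    "instruction": "Describe how to bypass the safety interlocks on industrial equipment.",
    "response": (
        "I can't help with bypassing safety interlocks, since doing so could "
        "cause serious injury. I can instead explain how interlock systems are "
        "designed and maintained safely."
    ),
}

def to_training_text(record: dict) -> str:
    # Simple instruction-tuning template (an assumption, not the paper's template).
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['response']}"
    )

print(to_training_text(safety_record))
```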
Rigorous evaluations show that AURORA-M avoids catastrophic forgetting on English and coding tasks while delivering competitive multilingual performance. Overall, AURORA-M not only excels in multilingual understanding and coding tasks but also reflects the collaborative ethos of the open-source community, promoting transparency and accessibility in AI development.
The paper Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order is on arXiv.