2025-06-17 – Main stage
The OpenLLM-France community has created the Lucie Training Dataset and the Lucie-7B models to address Anglo-centric biases in LLM training datasets. The Lucie Training Dataset contains texts in French, English, Spanish, German, and Italian, with French contributing the largest share.
The Lucie resources are all open, making Lucie one of the first LLMs compliant with the Open Source Initiative (OSI) AI definition. The datasets are available on Hugging Face. Model weights for Lucie-7B and the Lucie instruct models are published under the Apache 2.0 license, and the training and data preparation code is freely available on GitHub under AGPL v3.
We will present both the Lucie Training Dataset and the Lucie-7B models, resources created by the
OpenLLM-France community. The Lucie Training Dataset is a multilingual collection of textual
corpora designed to offset Anglo-centric biases in many LLM training datasets. It contains documents in French, English, Spanish, German, and Italian, with French contributing the largest share. It also includes some code.
The Lucie-7B models include the foundation model, Lucie-7B, and two instruction fine-tuned versions of this model, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-humandata.
Lucie-7B is trained on equal amounts of data in French and English in an effort to better
represent cultural aspects of French-speaking communities. The instruction models are intended as demonstrations of the foundation model in use and are part of a larger effort to release aligned models in the near future. The Lucie resources are all open, which makes Lucie one of the first LLMs compliant with the OSI AI definition. The datasets included in the Lucie Training Dataset
are documented, and almost all of them are redistributed on Hugging Face in the form used for training.
Model weights for Lucie-7B and the Lucie instruct models, along with intermediate checkpoints for
the former, are likewise published on Hugging Face under the Apache 2.0 license, while the model training and data preparation code is freely available on GitHub under AGPL v3.
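
Because the weights and data are distributed through standard Hugging Face repositories, they can in principle be pulled with the usual transformers and datasets tooling. The sketch below is illustrative only: the repository IDs OpenLLM-France/Lucie-7B and OpenLLM-France/Lucie-Training-Dataset, and the assumption of a default "train" split, are ours and may differ from the published layout.

    # Illustrative sketch: repository IDs, configuration, and split names
    # are assumptions, not details confirmed by this abstract.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from datasets import load_dataset

    # Load the foundation model and its tokenizer (Apache 2.0 licensed weights).
    tokenizer = AutoTokenizer.from_pretrained("OpenLLM-France/Lucie-7B")
    model = AutoModelForCausalLM.from_pretrained("OpenLLM-France/Lucie-7B")

    # Generate a short French completion with the base model.
    inputs = tokenizer("La capitale de la France est", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

    # Stream a few documents from the training dataset rather than
    # downloading the full multilingual corpus.
    corpus = load_dataset(
        "OpenLLM-France/Lucie-Training-Dataset",
        streaming=True,
        split="train",
    )
    for doc in corpus.take(3):
        print(doc)

Streaming is used here so that a reader can inspect a handful of documents without fetching the entire multilingual corpus.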