The Lucie-7B LLM and the Lucie training dataset: lessons learnt from training a true open source AI model
Jean-Pierre Lorré, LINAGORA
The OpenLLM-France community has created the Lucie Training Dataset and the Lucie-7B models to address Anglo-centric biases in LLM datasets. The Lucie Training Dataset contains texts in French, English, Spanish, German, and Italian, with French contributing the largest share.
The Lucie resources are open, making Lucie-7B one of the first LLMs compliant with the Open Source Initiative (OSI) Open Source AI Definition. The datasets are available on Hugging Face. Model weights for both the base Lucie-7B model and its instruction-tuned version are published under the Apache 2.0 license. The training and data preparation code is freely available on GitHub under the AGPL v3 license.
AI, Data, Cloud-Edge and Security
Main stage