Introducing Japanese Stable LM Beta

Stability AI Japan has released the "Japanese Stable LM Beta (JSLM Beta)" series of large language models, which includes a high-performing open Japanese LLM. The models are based on Llama 2 and have been further trained to strengthen their Japanese language capabilities and Japan-specific knowledge. The most notable of these, JSLM Beta 70B, is the series' largest instruction-tuned model at 70 billion parameters and is available for commercial use. As of November 2023, it is the largest open Japanese-specific language model known to us.

The six models released are divided into three categories:

General-Purpose Language Model "JSLM Base Beta"

"JSLM Base Beta" underwent continued pretraining on the base model of Llama 2 in order to augment its capabilities in reading and writing Japanese, additionally enhancing its knowledge by providing contextual information specifically relevant to Japan with large-scale data primarily sourced from the web.

The training used Japanese and English data from sources like Wikipedia, mC4, CC-100, OSCAR, and SlimPajama (excluding Books3), totaling approximately 100 billion tokens.
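As a rough illustration of how a base checkpoint is meant to be used, the sketch below loads it with Hugging Face Transformers and lets it continue a Japanese prompt. The model ID shown is an assumption for illustration only; the released model cards are authoritative.

```python
# Minimal sketch: plain Japanese text completion with a JSLM Base Beta checkpoint.
# The model ID below is an assumed placeholder; substitute the actual released name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/japanese-stablelm-base-beta-7b"  # assumed ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The base model is a plain language model: it simply continues the prompt.
prompt = "富士山は"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```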

Instruction-tuned Language Model "JSLM Instruct Beta"

"JSLM Instruct Beta" is an instruction-tuned language model that can respond to user instructions and tasks in Japanese. It was created by applying Supervised Fine-Tuning (SFT) to the aforementioned base model after the initial training was completed. The SFT utilized public datasets such as Databricks Dolly-15k and Anthropic HH.

Vocabulary-Extended Model "JSLM JA-Vocab Beta"

"JSLM JA-Vocab Beta" is a model that has undergone vocabulary expansion in addition to the pretraining of the JSLM Base Beta model. This is achieved by implementing a tokenizer originally specialized for English that has been further trained in Japanese vocabulary. By using this tuned tokenizer, the model gains a stronger ability to process language in Japanese with a more accurate understanding of conversational phrasing. The vocabulary addition was implemented prior to pretraining and involved adding around 20,000 Japanese words.

Based on our tests, the additional vocabulary training approximately doubled the speed of outputs generated in Japanese.
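The speedup comes from tokenization efficiency: with the added Japanese entries, a typical Japanese sentence splits into far fewer tokens, so generating the same text takes fewer decoding steps. The sketch below compares token counts; the extended-tokenizer model ID is an assumption.

```python
# Minimal sketch: compare how many tokens one Japanese sentence needs under the
# original Llama 2 tokenizer vs. a Japanese-extended tokenizer (assumed ID).
from transformers import AutoTokenizer

text = "吾輩は猫である。名前はまだ無い。"

llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ja_tok = AutoTokenizer.from_pretrained(
    "stabilityai/japanese-stablelm-base-ja_vocab-beta-7b"  # assumed ID
)

# Fewer tokens per sentence means fewer autoregressive decoding steps,
# which is where the roughly 2x Japanese generation speedup comes from.
print("Llama 2 tokens :", len(llama_tok(text)["input_ids"]))
print("JA-Vocab tokens:", len(ja_tok(text)["input_ids"]))
```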


For more details, including performance evaluations, please see the original blog post in Japanese.

To stay updated on our progress in Japan, follow Stability AI Japan on Twitter.
