Pioneering language technology for low-resource languages
At Artisan Labs, NLP is not just a field of research; it's a movement. Our mission is to bridge the digital language divide by developing cutting-edge, open-source tools tailored for the Kyrgyz language and beyond. Our projects include language models, tokenizers, and various NLP utilities that empower researchers and developers to advance language technology in low-resource settings.
We’re proud to share our innovative models that address unique linguistic challenges. Explore our key projects:
A small-scale BERT model trained on over 1.5 million Kyrgyz sentences. Designed for masked language modeling, text classification, and feature extraction, KyrgyzBert is optimized for the nuances of the Kyrgyz language. It features a custom tokenizer and an architecture with 6 layers and 8 attention heads.
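To make that architecture concrete, here is a minimal configuration sketch. The layer and head counts come from the description above; the hidden size, intermediate size, and vocabulary size are illustrative assumptions, not the published configuration.

import torch
from transformers import BertConfig, BertForMaskedLM

# Sketch of a KyrgyzBert-like architecture: 6 layers and 8 attention heads
# are stated above; the remaining hyperparameters are assumed for illustration.
config = BertConfig(
    num_hidden_layers=6,
    num_attention_heads=8,
    hidden_size=512,         # assumed; must be divisible by the head count
    intermediate_size=2048,  # assumed
    vocab_size=30000,        # assumed; fixed by the custom WordPiece tokenizer
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")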
Published on Hugging Face as kstunlp/kyrgyz_language_binary_classifier, this model identifies whether a given text is in Kyrgyz. It's an essential tool for language detection, data filtering, and building multilingual or code-switching pipelines.
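As a minimal sketch of a language-filtering step (the exact label strings the classifier emits are an assumption here; check the model card for the actual scheme):

from transformers import pipeline

# Load the binary Kyrgyz-vs-not-Kyrgyz classifier
detector = pipeline("text-classification", model="kstunlp/kyrgyz_language_binary_classifier")

samples = [
    "Бул жерден кызыктуу нерселерди таба аласыз.",  # Kyrgyz: "You can find interesting things here."
    "This sentence is written in English.",
]
for text, result in zip(samples, detector(samples)):
    # Label names (e.g. "LABEL_1" vs. a human-readable tag) depend on the
    # model's config; treat these as placeholders and consult the model card.
    print(f"{result['label']} ({result['score']:.2f}): {text}")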
Tokenization is the first step toward effective NLP. Our open-source tokenizers are specifically designed for the Kyrgyz language:
A WordPiece-based tokenizer trained from scratch for Kyrgyz text. It forms the backbone for KyrgyzBert and other NLP applications.
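Because this tokenizer ships with the KyrgyzBert checkpoint, you can inspect its segmentation directly from that repository:

from transformers import BertTokenizerFast

# The WordPiece tokenizer is bundled with the KyrgyzBert checkpoint
tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/KyrgyzBert")
print(tokenizer.tokenize("Бул жерден кызыктуу нерселерди таба аласыз."))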
Built using SentencePiece with the unigram algorithm, this tokenizer is designed for T5-based models, ensuring optimal segmentation and performance for generative tasks.
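To illustrate the approach (a sketch only; our actual corpus, vocabulary size, and hyperparameters differ), a unigram SentencePiece model can be trained and used like this:

import sentencepiece as spm

# Train a unigram SentencePiece model on a plain-text corpus.
# The file name, vocab size, and coverage below are illustrative assumptions.
spm.SentencePieceTrainer.train(
    input="kyrgyz_corpus.txt",   # one sentence per line
    model_prefix="ky_unigram",
    model_type="unigram",
    vocab_size=16000,
    character_coverage=1.0,      # keep full coverage of the Cyrillic script
)

sp = spm.SentencePieceProcessor(model_file="ky_unigram.model")
print(sp.encode("Бул жерден кызыктуу нерселерди таба аласыз.", out_type=str))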
All our models and tokenizers are available on Hugging Face. For example, you can load KyrgyzBert with the following code:
import torch
from transformers import BertTokenizerFast, BertForMaskedLM

model_name = "metinovadilet/KyrgyzBert"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()

text = "Бул жерден [MASK] нерселерди таба аласыз."  # "You can find [MASK] things here."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] token and decode the highest-scoring prediction for it
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1].item()
predicted_id = logits[0, masked_index].argmax().item()
print("Prediction:", tokenizer.decode([predicted_id]))
Similarly, explore our tokenizers on Hugging Face to jumpstart your Kyrgyz NLP projects.
We’re passionate about collaboration. Whether you’re a researcher, developer, or language enthusiast, we invite you to contribute, share feedback, or even build upon our open-source projects. Visit our Hugging Face profile at metinovadilet for the latest updates and releases.