New release of python module

June 25, 2019

Onur Çelebi

Today, we are happy to release a new version of the fastText python library. The main goal of this release is to merge two existing python modules: the official fastText module which was available on our github repository and the unofficial fasttext module which was available on pypi.org. We hope that this new version will address the confusion due to the previous existence of two similar, but different, python modules.

The new version of our library is now available on pypi.org as well as on our github repository, and you can find an overview of its API here.

fastText vs fasttext: what happened?

There was ongoing confusion among our user community about the existence of both the fastText and fasttext modules.

When fastText was first released in 2016, it was a command line only utility. Very soon, people wanted to use fastText's capabilities from python without having to call a binary for each action. In August 2016, Bayu Aldi Yansyah, a developer outside of Facebook, published a python wrapper of fastText. His work was very helpful to a lot of people in our community, and he published his unofficial python library on pypi with the pretty straightforward module name fasttext (note the lowercase t).

Later, our team began to work on an official python binding of fastText, which was published in the same github repository as the C++ source code. However, the module name for this official library was fastText (note the uppercase T).

Last year, Bayu Aldi Yansyah gave us admin access to the pypi project so that we could merge the two libraries.

To sum up, we ended up with two libraries that had:

  • almost the same name
  • different APIs
  • different versions
  • different ways to install

That was a very confusing situation for the community.

What actions did we take?

Today we are merging the two python libraries. We decided to keep the official API and its top level functions, such as train_unsupervised and train_supervised, which return numpy objects. We removed the cbow, skipgram and supervised functions from the unofficial API. However, we brought some nice ideas from the unofficial API into the official one. In particular, we liked the pythonic approach of WordVectorModel. This new python module is named fasttext, and is available on both pypi and our github repository.

From now on, we will refer to the tool as "fastText"; however, the name of the python module is fasttext.

What is the right way now?

Before, you would either use fastText (uppercase T):

import fastText
# and call:
fastText.train_supervised
fastText.train_unsupervised

or use fasttext (lowercase t):

import fasttext
# and call:
fasttext.cbow
fasttext.skipgram
fasttext.supervised

Now, the right way is to import fasttext (lowercase t) and use:

import fasttext
# and call:
fasttext.train_supervised
fasttext.train_unsupervised

We are keeping the lowercase fasttext module name, while keeping the fastText API.

This is because:

  • the standard way to name python modules is all lowercase
  • the fastText API exposes numpy arrays, which are widely used by the machine learning community.

You can find a more comprehensive overview of our python API here.

Should I modify my existing code?

Depending on the version of the python module you were using, you might need to make some small modifications to your existing code.

1) You were using the official fastText module:

You don't have to do much. Just replace your import fastText lines with import fasttext and everything should work as usual.

2) You were using the unofficial fasttext module:

If you were using the functions cbow, skipgram, supervised and/or the WordVectorModel and SupervisedModel objects, you were using the unofficial fasttext module.

Updating your code should be pretty straightforward, but it still involves some small changes.

cbow function: use train_unsupervised instead.

For example, replace:

fasttext.cbow("train.txt", "model_file", lr=0.05, dim=100, ws=5, epoch=5)

with

model = fasttext.train_unsupervised("train.txt", model='cbow', lr=0.05, dim=100, ws=5, epoch=5)
model.save_model("model_file.bin")

skipgram function: use train_unsupervised instead.

For example, replace:

fasttext.skipgram("train.txt", "model_file", lr=0.05, dim=100, ws=5, epoch=5)

with

model = fasttext.train_unsupervised("train.txt", model='skipgram', lr=0.05, dim=100, ws=5, epoch=5)
model.save_model("model_file.bin")

supervised function: use train_supervised instead.

For example, replace:

fasttext.supervised("train.txt", "model_file", lr=0.1, dim=100, epoch=5, word_ngrams=2, loss='softmax')

with

model = fasttext.train_supervised("train.txt", lr=0.1, dim=100, epoch=5, word_ngrams=2, loss='softmax')
model.save_model("model_file.bin")

Parameters

  • As you can see, you can use either word_ngrams or wordNgrams as the parameter name, because the parameter names from the unofficial API are mapped to the official ones: min_count to minCount, word_ngrams to wordNgrams, lr_update_rate to lrUpdateRate, label_prefix to label and pretrained_vectors to pretrainedVectors.
  • The silent parameter is not supported; use the verbose parameter instead.
  • The encoding parameter is not supported; every input should be encoded in UTF-8.
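The name mapping above can be sketched as a simple dictionary lookup. This is a hypothetical illustration of the idea, not the actual fasttext source code; the helper name normalize_params is made up for this example.

```python
# Legacy snake_case parameter names from the unofficial API, mapped
# to the official camelCase names (pairs taken from the list above).
SNAKE_TO_CAMEL = {
    "min_count": "minCount",
    "word_ngrams": "wordNgrams",
    "lr_update_rate": "lrUpdateRate",
    "label_prefix": "label",
    "pretrained_vectors": "pretrainedVectors",
}

def normalize_params(kwargs):
    """Rewrite legacy snake_case keyword names to the official ones.

    Names that are not in the mapping (e.g. lr, dim) pass through unchanged.
    """
    return {SNAKE_TO_CAMEL.get(name, name): value for name, value in kwargs.items()}
```

With such a mapping, a call written against the old unofficial names resolves to the same keyword arguments as one written against the official names.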

WordVectorModel and SupervisedModel objects

Instead of WordVectorModel and SupervisedModel objects, we return a model object that mimics some nice ideas from the unofficial API.

model = fasttext.train_unsupervised("train.txt", model='skipgram')
print(model.words)      # list of words in dictionary
print(model['king'])    # get the vector of the word 'king'
print('king' in model)  # check if a word is in dictionary
model = fasttext.train_supervised("train.txt")
print(model.words)      # list of words in dictionary
print(model.labels)     # list of labels

The model object also contains the arguments of the training:

print(model.epoch)
print(model.loss)
print(model.wordNgrams)
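The pythonic access patterns shown above (model['king'], 'king' in model) rely on Python's data-model methods. As a minimal sketch of the idea, assuming a toy in-memory model rather than the real fasttext implementation:

```python
class ToyModel:
    """Toy sketch of a model object supporting the pythonic access
    patterns shown above; not the actual fasttext implementation.
    (The real API returns numpy arrays rather than plain lists.)"""

    def __init__(self, vectors):
        self._vectors = vectors          # dict: word -> vector

    @property
    def words(self):
        return list(self._vectors)       # list of words in the dictionary

    def __getitem__(self, word):
        return self._vectors[word]       # model['king'] -> vector of 'king'

    def __contains__(self, word):
        return word in self._vectors     # 'king' in model

model = ToyModel({"king": [0.1, 0.2, 0.3], "queen": [0.4, 0.5, 0.6]})
```

Implementing __getitem__ and __contains__ is what makes the bracket-lookup and membership-test syntax work on the model object.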

Thank you!

We want to thank our incredible community. We truly appreciate your feedback; a big thank you to everyone reporting issues and contributing to the project. In particular, we want to express how grateful we are to Bayu Aldi Yansyah, who did a great job with his python library and gave us ownership of the pypi fasttext project.

Language identification

October 2, 2017

Edouard Grave

Fast and accurate language identification using fastText

We are excited to announce that we are publishing a fast and accurate tool for text-based language identification. It can recognize more than 170 languages, takes less than 1MB of memory and can classify thousands of documents per second. It is based on the fastText library and is released here as open source, free for everyone to use. We are releasing several versions of the model, each optimized for different memory usage, and compare them to the popular tool langid.py.

Read More

fastText on mobile

May 2, 2017

Armand Joulin

Today, the Facebook AI Research (FAIR) team released pre-trained vectors in 294 languages, accompanied by two quick-start tutorials, to increase fastText’s accessibility to the large community of students, software developers, and researchers interested in machine learning. fastText’s models now fit on smartphones and small computers like Raspberry Pi devices thanks to a new functionality that reduces memory usage.

First open-sourced last summer, fastText was designed to be accessible to anyone with generic hardware like notebooks and X86 cloud instances, or almost any platform with enough memory. Smartphone and small computer support extends fastText's accessibility to an even larger community and a greater range of applications.

Read More

Releasing fastText

August 18, 2016

Edouard Grave

Faster, better text classification!

Understanding the meaning of words that roll off your tongue as you talk, or your fingertips as you tap out posts is one of the biggest technical challenges facing artificial intelligence researchers. But it is an essential need. Automatic text processing forms a key part of the day-to-day interaction with your computer; it’s a critical component of everything from web search and content ranking to spam filtering, and when it works well, it’s completely invisible to you. With the growing amount of online data, there is a need for more flexible tools to better understand the content of very large datasets, in order to provide more accurate classification results.

To address this need, the Facebook AI Research (FAIR) lab is open-sourcing fastText, a library designed to help build scalable solutions for text representation and classification. Our ongoing commitment to collaboration and sharing with the community extends beyond just delivering code. We know it's important to share our learnings to advance the field, so we have also published our research relating to fastText.

FastText combines some of the most successful concepts introduced by the natural language processing and machine learning communities in the last few decades. These include representing sentences with bags of words and bags of n-grams, as well as using subword information, and sharing information across classes through a hidden representation. We also employ a hierarchical softmax that takes advantage of the unbalanced distribution of the classes to speed up computation. These concepts are used for two different tasks: efficient text classification and learning word vector representations.
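The bag-of-words and bag-of-n-grams representations mentioned above can be illustrated with a toy sketch. This is a simplified, hypothetical example of the general technique, not fastText's actual (hashed, C++) implementation:

```python
from collections import Counter

def bag_of_ngrams(tokens, n=2):
    """Toy bag-of-words plus bag-of-n-grams representation.

    Counts every unigram, then every contiguous n-gram, so that some
    local word order is captured on top of plain word counts.
    """
    features = Counter(tokens)                     # bag of words (unigram counts)
    for i in range(len(tokens) - n + 1):
        features[" ".join(tokens[i:i + n])] += 1   # contiguous n-gram counts
    return features

feats = bag_of_ngrams("the cat sat on the mat".split())
```

Here the bigram features ("the cat", "cat sat", ...) distinguish word orderings that a pure bag of words would collapse together.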

Read More