Aligned word vectors
We are publishing aligned word vectors for 44 languages, based on the pre-trained fastText vectors computed on Wikipedia. The vectors are aligned with the RCSLS method described in Joulin et al. (2018) [1].
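Because aligned vectors from different languages are comparable in one shared space, a word in one language can be matched to candidate translations in another by nearest-neighbor search. The sketch below illustrates the idea with plain cosine similarity, which is a simplification and not the CSLS retrieval criterion used in the paper. The helper names `cosine` and `translate` and the tiny 3-dimensional vectors are invented for illustration; real vectors would come from the downloaded files.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def translate(word, src_vectors, tgt_vectors):
    """Return the target-language word whose vector is closest to `word`.

    Assumption: both dictionaries hold vectors from the same aligned space.
    This uses plain cosine nearest neighbor, not CSLS.
    """
    query = src_vectors[word]
    return max(tgt_vectors, key=lambda w: cosine(query, tgt_vectors[w]))

# Toy vectors, made up for this example only (not real fastText values).
english = {"cat": [0.9, 0.1, 0.0], "dog": [0.1, 0.9, 0.0]}
french = {
    "chat": [0.8, 0.2, 0.1],
    "chien": [0.2, 0.8, 0.1],
    "maison": [0.0, 0.1, 0.9],
}

print(translate("cat", english, french))
print(translate("dog", english, french))
```

With full vocabularies, a linear scan like this is slow; a matrix product over normalized vectors or an approximate nearest-neighbor index would be the usual replacement.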
Vectors
The aligned vectors can be downloaded from:
Afrikaans: text | Arabic: text | Bulgarian: text | Bengali: text |
Bosnian: text | Catalan: text | Czech: text | Danish: text |
German: text | Greek: text | English: text | Spanish: text |
Estonian: text | Persian: text | Finnish: text | French: text |
Hebrew: text | Hindi: text | Croatian: text | Hungarian: text |
Indonesian: text | Italian: text | Korean: text | Lithuanian: text |
Latvian: text | Macedonian: text | Malay: text | Dutch: text |
Norwegian: text | Polish: text | Portuguese: text | Romanian: text |
Russian: text | Slovak: text | Slovenian: text | Albanian: text |
Swedish: text | Tamil: text | Thai: text | Tagalog: text |
Turkish: text | Ukrainian: text | Vietnamese: text | Chinese: text |
Format
The word vectors come in the default text format of fastText. The first line gives the number of vectors and their dimension. Each subsequent line contains a word followed by the components of its vector, with all values separated by spaces.
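The format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `load_vectors`, its `max_words` parameter, and the in-memory sample are assumptions made for illustration. It reads the header line, then maps each word to a list of floats.

```python
import io

def load_vectors(stream, max_words=None):
    """Parse fastText text-format vectors from an open text stream.

    Returns (dim, vectors), where vectors maps each word to a list of floats.
    `max_words` (hypothetical convenience parameter) stops reading early.
    """
    # Header line: "<number of vectors> <dimension>"
    n, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        if max_words is not None and len(vectors) >= max_words:
            break
        # Strip the newline and any trailing space before splitting.
        tokens = line.rstrip().split(" ")
        if len(tokens) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[tokens[0]] = [float(t) for t in tokens[1:]]
    return dim, vectors

# Tiny made-up sample in the same layout: 2 vectors of dimension 3.
sample = "2 3\nhello 0.1 0.2 0.3\nworld -0.4 0.5 0.6 \n"
dim, vectors = load_vectors(io.StringIO(sample))
print(dim, sorted(vectors))
```

For a downloaded file, the same function can be given a handle such as `open(path, encoding="utf-8", newline="\n", errors="ignore")`, where `path` is wherever the file was saved. Passing `max_words` keeps memory use manageable when only the first entries of a file are needed.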
License
The word vectors are distributed under the Creative Commons Attribution-ShareAlike 3.0 License (CC BY-SA 3.0).
References
If you use these word vectors, please cite the following papers:
[1] A. Joulin, P. Bojanowski, T. Mikolov, H. Jegou, E. Grave, Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion
@InProceedings{joulin2018loss,
title={Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion},
author={Joulin, Armand and Bojanowski, Piotr and Mikolov, Tomas and J\'egou, Herv\'e and Grave, Edouard},
year={2018},
booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
}
[2] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2017enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={Transactions of the Association for Computational Linguistics},
volume={5},
year={2017},
issn={2307-387X},
pages={135--146}
}