How to download pre-trained models and corpora
Demonstrates simple and quick access to common corpora and pretrained models.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
One of Gensim’s features is simple and easy access to common data. The gensim-data project stores a variety of corpora and pretrained models. Gensim has a gensim.downloader module for programmatically accessing this data. This module leverages a local cache (in the user’s home folder, by default) that ensures data is downloaded at most once.
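By default the cache lives under ~/gensim-data. If you need it elsewhere, recent Gensim versions read the GENSIM_DATA_DIR environment variable; a minimal sketch (the path below is just an example):

import os

# Must be set before gensim.downloader is imported; otherwise the default
# ~/gensim-data location is used.
os.environ['GENSIM_DATA_DIR'] = '/data/gensim-cache'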
This tutorial:
- Downloads the text8 corpus, unless it is already on your local machine
- Trains a Word2Vec model from the corpus (see Doc2Vec Model for a detailed tutorial)
- Leverages the model to calculate word similarity
- Demonstrates using the API to load other models and corpora
Let’s start by importing the api module.
import gensim.downloader as api
Now, let’s download the text8 corpus and load it as a Python object that supports streamed access.
corpus = api.load('text8')
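The load call downloads the data if necessary and returns a streamed corpus object; nothing is read into memory yet. As a quick illustrative check (not part of the original example), you can pull the first document off the stream:

# Each text8 document is a list of string tokens; only this one document is materialized.
first_doc = next(iter(corpus))
print(len(first_doc), first_doc[:10])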
In this case, our corpus is an iterable. If you look under the covers, it has the following definition:
import inspect
print(inspect.getsource(corpus.__class__))
Out:
class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        corpus = Text8Corpus(self.fn)
        for doc in corpus:
            yield doc
For more details, look inside the file that defines the Dataset class for your particular resource.
print(inspect.getfile(corpus.__class__))
Out:
/Users/kofola3/gensim-data/text8/__init__.py
Now that the corpus has been downloaded and loaded, let’s use it to train a word2vec model.
from gensim.models.word2vec import Word2Vec

model = Word2Vec(corpus)
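The call above relies on Word2Vec’s default hyperparameters. For reference, here is the same training call with the commonly tuned parameters spelled out; the values are illustrative (roughly the library defaults, Gensim 4.x parameter names), not special settings used for the output below:

# Equivalent, more explicit invocation (illustrative values only).
model = Word2Vec(corpus, vector_size=100, window=5, min_count=5, workers=4)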
Now that we have our word2vec model, let’s find words that are similar to ‘tree’.
print(model.wv.most_similar('tree'))
Out:
[('trees', 0.7091131806373596), ('bark', 0.673214316368103), ('leaf', 0.6706242561340332), ('flower', 0.6195512413978577), ('bird', 0.6081331372261047), ('nest', 0.602649450302124), ('avl', 0.5914573669433594), ('garden', 0.5712863206863403), ('egg', 0.5702848434448242), ('beetle', 0.5701731443405151)]
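Beyond most_similar, the trained vectors support the other usual queries, and the whole model can be persisted for later reuse. A brief sketch (the file name is arbitrary):

# Cosine similarity between two specific words.
print(model.wv.similarity('tree', 'leaf'))

# Save the full model; it can be restored later with Word2Vec.load('text8_word2vec.model').
model.save('text8_word2vec.model')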
You can use the API to download several different corpora and pretrained models. Here’s how to list all resources available in gensim-data:
import json

info = api.info()
print(json.dumps(info, indent=4))
Out:
{ "corpora": { "semeval-2016-2017-task3-subtaskBC": { "num_records": -1, "record_format": "dict", "file_size": 6344358, "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py", "license": "All files released for the task are free for general research use", "fields": { "2016-train": [ "..." ], "2016-dev": [ "..." ], "2017-test": [ "..." ], "2016-test": [ "..." ] }, "description": "SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the task paper http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf linked in section \u201cPapers\u201d of https://github.com/RaRe-Technologies/gensim-data/issues/18.", "checksum": "701ea67acd82e75f95e1d8e62fb0ad29", "file_name": "semeval-2016-2017-task3-subtaskBC.gz", "read_more": [ "http://alt.qcri.org/semeval2017/task3/", "http://alt.qcri.org/semeval2017/task3/data/uploads/semeval2017-task3.pdf", "https://github.com/RaRe-Technologies/gensim-data/issues/18", "https://github.com/Witiko/semeval-2016_2017-task3-subtaskB-english" ], "parts": 1 }, "semeval-2016-2017-task3-subtaskA-unannotated": { "num_records": 189941, "record_format": "dict", "file_size": 234373151, "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskA-unannotated-eng/__init__.py", "license": "These datasets are free for general research use.", "fields": { "THREAD_SEQUENCE": "", "RelQuestion": { "RELQ_CATEGORY": "question category, according to the Qatar Living taxonomy", "RELQ_DATE": "date of posting", "RELQ_ID": "question indentifier", "RELQ_USERID": "identifier of the user asking the question", "RELQ_USERNAME": "name of the user asking the question", "RelQBody": "body of question", "RelQSubject": "subject of question" }, "RelComments": [ { "RelCText": "text of answer", "RELC_USERID": "identifier of the user posting the comment", "RELC_ID": "comment identifier", "RELC_USERNAME": "name of the user posting the comment", "RELC_DATE": "date of posting" } ] }, "description": "SemEval 2016 / 2017 Task 3 Subtask A unannotated dataset contains 189,941 questions and 1,894,456 comments in English collected from the Community Question Answering (CQA) web forum of Qatar Living. These can be used as a corpus for language modelling.", "checksum": "2de0e2f2c4f91c66ae4fcf58d50ba816", "file_name": "semeval-2016-2017-task3-subtaskA-unannotated.gz", "read_more": [ "http://alt.qcri.org/semeval2016/task3/", "http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf", "https://github.com/RaRe-Technologies/gensim-data/issues/18", "https://github.com/Witiko/semeval-2016_2017-task3-subtaskA-unannotated-english" ], "parts": 1 }, "patent-2017": { "num_records": 353197, "record_format": "dict", "file_size": 3087262469, "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/patent-2017/__init__.py", "license": "not found", "description": "Patent Grant Full Text. 
Contains the full text including tables, sequence data and 'in-line' mathematical expressions of each patent grant issued in 2017.", "checksum-0": "818501f0b9af62d3b88294d86d509f8f", "checksum-1": "66c05635c1d3c7a19b4a335829d09ffa", "file_name": "patent-2017.gz", "read_more": [ "http://patents.reedtech.com/pgrbft.php" ], "parts": 2 }, "quora-duplicate-questions": { "num_records": 404290, "record_format": "dict", "file_size": 21684784, "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/quora-duplicate-questions/__init__.py", "license": "probably https://www.quora.com/about/tos", "fields": { "question1": "the full text of each question", "question2": "the full text of each question", "qid1": "unique ids of each question", "qid2": "unique ids of each question", "id": "the id of a training set question pair", "is_duplicate": "the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise" }, "description": "Over 400,000 lines of potential question duplicate pairs. Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line contains a duplicate pair or not.", "checksum": "d7cfa7fbc6e2ec71ab74c495586c6365", "file_name": "quora-duplicate-questions.gz", "read_more": [ "https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs" ], "parts": 1 }, "wiki-english-20171001": { "num_records": 4924894, "record_format": "dict", "file_size": 6516051717, "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/wiki-english-20171001/__init__.py", "license": "https://dumps.wikimedia.org/legal.html", "fields": { "section_texts": "list of body of sections", "section_titles": "list of titles of sections", "title": "Title of wiki article" }, "description": "Extracted Wikipedia dump from October 2017. Produced by `python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o wiki-en.gz`", "checksum-0": "a7d7d7fd41ea7e2d7fa32ec1bb640d71", "checksum-1": "b2683e3356ffbca3b6c2dca6e9801f9f", "checksum-2": "c5cde2a9ae77b3c4ebce804f6df542c2", "checksum-3": "00b71144ed5e3aeeb885de84f7452b81", "file_name": "wiki-english-20171001.gz", "read_more": [ "https://dumps.wikimedia.org/enwiki/20171001/" ], "parts": 4 }, "text8": { "num_records": 1701, "record_format": "list of str (tokens)", "file_size": 33182058, "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py", "license": "not found", "description": "First 100,000,000 bytes of plain text from Wikipedia. 
Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.", "checksum": "68799af40b6bda07dfa47a32612e5364", "file_name": "text8.gz", "read_more": [ "http://mattmahoney.net/dc/textdata.html" ], "parts": 1 }, "fake-news": { "num_records": 12999, "record_format": "dict", "file_size": 20102776, "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py", "license": "https://creativecommons.org/publicdomain/zero/1.0/", "fields": { "crawled": "date the story was archived", "ord_in_thread": "", "published": "date published", "participants_count": "number of participants", "shares": "number of Facebook shares", "replies_count": "number of replies", "main_img_url": "image from story", "spam_score": "data from webhose.io", "uuid": "unique identifier", "language": "data from webhose.io", "title": "title of story", "country": "data from webhose.io", "domain_rank": "data from webhose.io", "author": "author of story", "comments": "number of Facebook comments", "site_url": "site URL from BS detector", "text": "text of story", "thread_title": "", "type": "type of website (label from BS detector)", "likes": "number of Facebook likes" }, "description": "News dataset, contains text and metadata from 244 websites and represents 12,999 posts in total from a specific window of 30 days. The data was pulled using the webhose.io API, and because it's coming from their crawler, not all websites identified by their BS Detector are present in this dataset. Data sources that were missing a label were simply assigned a label of 'bs'. There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read.", "checksum": "5e64e942df13219465927f92dcefd5fe", "file_name": "fake-news.gz", "read_more": [ "https://www.kaggle.com/mrisdal/fake-news" ], "parts": 1 }, "20-newsgroups": { "num_records": 18846, "record_format": "dict", "file_size": 14483581, "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/20-newsgroups/__init__.py", "license": "not found", "fields": { "topic": "name of topic (20 variant of possible values)", "set": "marker of original split (possible values 'train' and 'test')", "data": "", "id": "original id inferred from folder name" }, "description": "The notorious collection of approximately 20,000 newsgroup posts, partitioned (nearly) evenly across 20 different newsgroups.", "checksum": "c92fd4f6640a86d5ba89eaad818a9891", "file_name": "20-newsgroups.gz", "read_more": [ "http://qwone.com/~jason/20Newsgroups/" ], "parts": 1 }, "__testing_matrix-synopsis": { "description": "[THIS IS ONLY FOR TESTING] Synopsis of the movie matrix.", "checksum": "1767ac93a089b43899d54944b07d9dc5", "file_name": "__testing_matrix-synopsis.gz", "read_more": [ "http://www.imdb.com/title/tt0133093/plotsummary?ref_=ttpl_pl_syn#synopsis" ], "parts": 1 }, "__testing_multipart-matrix-synopsis": { "description": "[THIS IS ONLY FOR TESTING] Synopsis of the movie matrix.", "checksum-0": "c8b0c7d8cf562b1b632c262a173ac338", "checksum-1": "5ff7fc6818e9a5d9bc1cf12c35ed8b96", "checksum-2": "966db9d274d125beaac7987202076cba", "file_name": "__testing_multipart-matrix-synopsis.gz", "read_more": [ "http://www.imdb.com/title/tt0133093/plotsummary?ref_=ttpl_pl_syn#synopsis" ], "parts": 3 } }, "models": { "fasttext-wiki-news-subwords-300": { "num_records": 999999, "file_size": 1005007116, "base_dataset": "Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens)", 
"reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/fasttext-wiki-news-subwords-300/__init__.py", "license": "https://creativecommons.org/licenses/by-sa/3.0/", "parameters": { "dimension": 300 }, "description": "1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).", "read_more": [ "https://fasttext.cc/docs/en/english-vectors.html", "https://arxiv.org/abs/1712.09405", "https://arxiv.org/abs/1607.01759" ], "checksum": "de2bb3a20c46ce65c9c131e1ad9a77af", "file_name": "fasttext-wiki-news-subwords-300.gz", "parts": 1 }, "conceptnet-numberbatch-17-06-300": { "num_records": 1917247, "file_size": 1225497562, "base_dataset": "ConceptNet, word2vec, GloVe, and OpenSubtitles 2016", "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/conceptnet-numberbatch-17-06-300/__init__.py", "license": "https://github.com/commonsense/conceptnet-numberbatch/blob/master/LICENSE.txt", "parameters": { "dimension": 300 }, "description": "ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.", "read_more": [ "http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972", "https://github.com/commonsense/conceptnet-numberbatch", "http://conceptnet.io/" ], "checksum": "fd642d457adcd0ea94da0cd21b150847", "file_name": "conceptnet-numberbatch-17-06-300.gz", "parts": 1 }, "word2vec-ruscorpora-300": { "num_records": 184973, "file_size": 208427381, "base_dataset": "Russian National Corpus (about 250M words)", "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-ruscorpora-300/__init__.py", "license": "https://creativecommons.org/licenses/by/4.0/deed.en", "parameters": { "dimension": 300, "window_size": 10 }, "description": "Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words.", "preprocessing": "The corpus was lemmatized and tagged with Universal PoS", "read_more": [ "https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models", "http://rusvectores.org/en/", "https://github.com/RaRe-Technologies/gensim-data/issues/3" ], "checksum": "9bdebdc8ae6d17d20839dd9b5af10bc4", "file_name": "word2vec-ruscorpora-300.gz", "parts": 1 }, "word2vec-google-news-300": { "num_records": 3000000, "file_size": 1743563840, "base_dataset": "Google News (about 100 billion words)", "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-google-news-300/__init__.py", "license": "not found", "parameters": { "dimension": 300 }, "description": "Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. 
The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https://code.google.com/archive/p/word2vec/).", "read_more": [ "https://code.google.com/archive/p/word2vec/", "https://arxiv.org/abs/1301.3781", "https://arxiv.org/abs/1310.4546", "https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf" ], "checksum": "a5e5354d40acb95f9ec66d5977d140ef", "file_name": "word2vec-google-news-300.gz", "parts": 1 }, "glove-wiki-gigaword-50": { "num_records": 400000, "file_size": 69182535, "base_dataset": "Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)", "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/__init__.py", "license": "http://opendatacommons.org/licenses/pddl/", "parameters": { "dimension": 50 }, "description": "Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/).", "preprocessing": "Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`.", "read_more": [ "https://nlp.stanford.edu/projects/glove/", "https://nlp.stanford.edu/pubs/glove.pdf" ], "checksum": "c289bc5d7f2f02c6dc9f2f9b67641813", "file_name": "glove-wiki-gigaword-50.gz", "parts": 1 }, "glove-wiki-gigaword-100": { "num_records": 400000, "file_size": 134300434, "base_dataset": "Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)", "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-100/__init__.py", "license": "http://opendatacommons.org/licenses/pddl/", "parameters": { "dimension": 100 }, "description": "Pre-trained vectors based on Wikipedia 2014 + Gigaword 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/).", "preprocessing": "Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-100.txt`.", "read_more": [ "https://nlp.stanford.edu/projects/glove/", "https://nlp.stanford.edu/pubs/glove.pdf" ], "checksum": "40ec481866001177b8cd4cb0df92924f", "file_name": "glove-wiki-gigaword-100.gz", "parts": 1 }, "glove-wiki-gigaword-200": { "num_records": 400000, "file_size": 264336934, "base_dataset": "Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)", "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-200/__init__.py", "license": "http://opendatacommons.org/licenses/pddl/", "parameters": { "dimension": 200 }, "description": "Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/).", "preprocessing": "Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-200.txt`.", "read_more": [ "https://nlp.stanford.edu/projects/glove/", "https://nlp.stanford.edu/pubs/glove.pdf" ], "checksum": "59652db361b7a87ee73834a6c391dfc1", "file_name": "glove-wiki-gigaword-200.gz", "parts": 1 }, "glove-wiki-gigaword-300": { "num_records": 400000, "file_size": 394362229, "base_dataset": "Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)", "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-300/__init__.py", "license": "http://opendatacommons.org/licenses/pddl/", "parameters": { "dimension": 300 }, "description": 
"Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/).", "preprocessing": "Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-300.txt`.", "read_more": [ "https://nlp.stanford.edu/projects/glove/", "https://nlp.stanford.edu/pubs/glove.pdf" ], "checksum": "29e9329ac2241937d55b852e8284e89b", "file_name": "glove-wiki-gigaword-300.gz", "parts": 1 }, "glove-twitter-25": { "num_records": 1193514, "file_size": 109885004, "base_dataset": "Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)", "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-25/__init__.py", "license": "http://opendatacommons.org/licenses/pddl/", "parameters": { "dimension": 25 }, "description": "Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/).", "preprocessing": "Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-25.txt`.", "read_more": [ "https://nlp.stanford.edu/projects/glove/", "https://nlp.stanford.edu/pubs/glove.pdf" ], "checksum": "50db0211d7e7a2dcd362c6b774762793", "file_name": "glove-twitter-25.gz", "parts": 1 }, "glove-twitter-50": { "num_records": 1193514, "file_size": 209216938, "base_dataset": "Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)", "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-50/__init__.py", "license": "http://opendatacommons.org/licenses/pddl/", "parameters": { "dimension": 50 }, "description": "Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/)", "preprocessing": "Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-50.txt`.", "read_more": [ "https://nlp.stanford.edu/projects/glove/", "https://nlp.stanford.edu/pubs/glove.pdf" ], "checksum": "c168f18641f8c8a00fe30984c4799b2b", "file_name": "glove-twitter-50.gz", "parts": 1 }, "glove-twitter-100": { "num_records": 1193514, "file_size": 405932991, "base_dataset": "Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)", "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-100/__init__.py", "license": "http://opendatacommons.org/licenses/pddl/", "parameters": { "dimension": 100 }, "description": "Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/)", "preprocessing": "Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-100.txt`.", "read_more": [ "https://nlp.stanford.edu/projects/glove/", "https://nlp.stanford.edu/pubs/glove.pdf" ], "checksum": "b04f7bed38756d64cf55b58ce7e97b15", "file_name": "glove-twitter-100.gz", "parts": 1 }, "glove-twitter-200": { "num_records": 1193514, "file_size": 795373100, "base_dataset": "Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)", "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-200/__init__.py", "license": "http://opendatacommons.org/licenses/pddl/", "parameters": { "dimension": 200 }, "description": "Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/).", "preprocessing": "Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-200.txt`.", "read_more": [ 
"https://nlp.stanford.edu/projects/glove/", "https://nlp.stanford.edu/pubs/glove.pdf" ], "checksum": "e52e8392d1860b95d5308a525817d8f9", "file_name": "glove-twitter-200.gz", "parts": 1 }, "__testing_word2vec-matrix-synopsis": { "description": "[THIS IS ONLY FOR TESTING] Word vecrors of the movie matrix.", "parameters": { "dimensions": 50 }, "preprocessing": "Converted to w2v using a preprocessed corpus. Converted to w2v format with `python3.5 -m gensim.models.word2vec -train <input_filename> -iter 50 -output <output_filename>`.", "read_more": [], "checksum": "534dcb8b56a360977a269b7bfc62d124", "file_name": "__testing_word2vec-matrix-synopsis.gz", "parts": 1 } }}
There are two types of data resources: corpora and models.
print(info.keys())
Out:
dict_keys(['corpora', 'models'])
Let’s have a look at the available corpora:
for corpus_name, corpus_data in sorted(info['corpora'].items()):
    print(
        '%s (%d records): %s' % (
            corpus_name,
            corpus_data.get('num_records', -1),
            corpus_data['description'][:40] + '...',
        )
    )
Out:
20-newsgroups (18846 records): The notorious collection of approximatel...
__testing_matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Synopsis of t...
__testing_multipart-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Synopsis of t...
fake-news (12999 records): News dataset, contains text and metadata...
patent-2017 (353197 records): Patent Grant Full Text. Contains the ful...
quora-duplicate-questions (404290 records): Over 400,000 lines of potential question...
semeval-2016-2017-task3-subtaskA-unannotated (189941 records): SemEval 2016 / 2017 Task 3 Subtask A una...
semeval-2016-2017-task3-subtaskBC (-1 records): SemEval 2016 / 2017 Task 3 Subtask B and...
text8 (1701 records): First 100,000,000 bytes of plain text fr...
wiki-english-20171001 (4924894 records): Extracted Wikipedia dump from October 20...
… and the same for models:
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )
Out:
__testing_word2vec-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300 (1917247 records): ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300 (999999 records): 1 million word vectors trained on Wikipe...
glove-twitter-100 (1193514 records): Pre-trained vectors based on 2B tweets,...
glove-twitter-200 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-25 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-50 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50 (400000 records): Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300 (3000000 records): Pre-trained vectors trained on a part of...
word2vec-ruscorpora-300 (184973 records): Word2vec Continuous Skipgram vectors tra...
If you want to get detailed information about a model/corpus, use:
fake_news_info = api.info('fake-news')
print(json.dumps(fake_news_info, indent=4))
Out:
{ "num_records": 12999, "record_format": "dict", "file_size": 20102776, "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py", "license": "https://creativecommons.org/publicdomain/zero/1.0/", "fields": { "crawled": "date the story was archived", "ord_in_thread": "", "published": "date published", "participants_count": "number of participants", "shares": "number of Facebook shares", "replies_count": "number of replies", "main_img_url": "image from story", "spam_score": "data from webhose.io", "uuid": "unique identifier", "language": "data from webhose.io", "title": "title of story", "country": "data from webhose.io", "domain_rank": "data from webhose.io", "author": "author of story", "comments": "number of Facebook comments", "site_url": "site URL from BS detector", "text": "text of story", "thread_title": "", "type": "type of website (label from BS detector)", "likes": "number of Facebook likes" }, "description": "News dataset, contains text and metadata from 244 websites and represents 12,999 posts in total from a specific window of 30 days. The data was pulled using the webhose.io API, and because it's coming from their crawler, not all websites identified by their BS Detector are present in this dataset. Data sources that were missing a label were simply assigned a label of 'bs'. There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read.", "checksum": "5e64e942df13219465927f92dcefd5fe", "file_name": "fake-news.gz", "read_more": [ "https://www.kaggle.com/mrisdal/fake-news" ], "parts": 1}
Sometimes, you do not want to load a model into memory. Instead, you can request just the filesystem path to the model. For that, use:
print(api.load('glove-wiki-gigaword-50', return_path=True))
Out:
/Users/kofola3/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
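With the raw path in hand, you are free to load the file yourself. For example, assuming the glove-wiki-gigaword-50 archive is in word2vec text format (as its metadata above indicates), a sketch:

from gensim.models import KeyedVectors

# load_word2vec_format reads the gzipped text file directly.
path = api.load('glove-wiki-gigaword-50', return_path=True)
vectors = KeyedVectors.load_word2vec_format(path)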
If you want to load the model to memory, then:
model = api.load("glove-wiki-gigaword-50")
model.most_similar("glass")
Out:
[('plastic', 0.79425048828125), ('metal', 0.7708716988563538), ('walls', 0.7700635194778442), ('marble', 0.7638523578643799), ('wood', 0.7624280452728271), ('ceramic', 0.7602593302726746), ('pieces', 0.7589112520217896), ('stained', 0.7528817653656006), ('tile', 0.748193621635437), ('furniture', 0.7463858723640442)]
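Because the loaded object is a set of keyed word vectors, the usual vector arithmetic works too; for instance, the classic analogy query (illustrative, output not shown):

# 'king' - 'man' + 'woman' should rank 'queen' highly.
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))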
For corpora, the data is never fully loaded into memory: all corpora are iterables wrapped in a special Dataset class with an __iter__ method.
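Because of this streaming design, even large corpora can be scanned with a small, constant memory footprint. A minimal sketch that counts documents and tokens in a single pass over text8 (the document count should match the num_records value reported by api.info()):

corpus = api.load('text8')
num_docs = 0
num_tokens = 0
for doc in corpus:  # only one document is held in memory at a time
    num_docs += 1
    num_tokens += len(doc)
print(num_docs, num_tokens)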
Total running time of the script: (1 minute 39.422 seconds)
Estimated memory usage: 297 MB
Download Python source code: run_downloader_api.py
Download Jupyter notebook: run_downloader_api.ipynb