corpora.ucicorpus – Corpus in UCI bag-of-words format

University of California, Irvine (UCI) Bag-of-Words format.

http://archive.ics.uci.edu/ml/datasets/Bag+of+Words

class gensim.corpora.ucicorpus.UciCorpus(fname, fname_vocab=None)

Corpus in the UCI bag-of-words format.

create_dictionary()

Utility method to generate gensim-style Dictionary directly from the corpus and vocabulary data.

docbyoffset(offset)

Return the document located offset bytes into the file.

classmethod load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don’t use mmap, load large arrays as normal objects.

save(*args, **kwargs)

Save the object to file (also see load).

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

static save_corpus(fname, corpus, id2word=None, progress_cnt=10000, metadata=False)

Save a corpus in the UCI Bag-of-Words format.

There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.

This function is automatically called by UciCorpus.serialize; don’t call it directly, call serialize instead.
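The role of the second file can be illustrated with a minimal sketch. It assumes the vocabulary file holds one token per line, as in the published UCI datasets, and that the token on line i (counting from zero) gets integer id i. The helper names `write_vocab` and `read_vocab` are hypothetical and are not part of gensim.

```python
import os
import tempfile


def write_vocab(path, id2word):
    """Write one token per line, ordered by id (assumes ids are 0..n-1 with no gaps)."""
    with open(path, "w", encoding="utf8") as fout:
        for word_id in range(len(id2word)):
            fout.write(id2word[word_id] + "\n")


def read_vocab(path):
    """Rebuild the id -> token mapping from the line order of the file."""
    with open(path, encoding="utf8") as fin:
        return dict(enumerate(line.strip() for line in fin))


# Round trip through a temporary file standing in for fname.vocab.
vocab_path = os.path.join(tempfile.mkdtemp(), "example.uci.vocab")
write_vocab(vocab_path, {0: "human", 1: "computer", 2: "interface"})
restored = read_vocab(vocab_path)
```

Because the mapping is implied by line order alone, the vocabulary file needs no explicit id column.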

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).

This relies on the underlying corpus class serializer providing (in addition to standard iteration):

  • save_corpus method that returns a sequence of byte offsets, one for each saved document,

  • the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
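The two requirements above can be sketched independently of gensim. The functions `save_with_offsets` and `doc_by_offset` below are hypothetical stand-ins for save_corpus and docbyoffset, and the one-document-per-line `id:count` layout is an assumption made only to keep the example short; it is not the UCI format.

```python
import io


def save_with_offsets(stream, docs):
    """Write one document per line and return the byte offset where each one starts."""
    offsets = []
    for doc in docs:
        offsets.append(stream.tell())  # position before this document is written
        line = " ".join(f"{term_id}:{count}" for term_id, count in doc)
        stream.write((line + "\n").encode("utf8"))
    return offsets


def doc_by_offset(stream, offset):
    """Seek straight to `offset` and parse the single document stored there."""
    stream.seek(offset)
    line = stream.readline().decode("utf8").strip()
    if not line:
        return []
    return [(int(t), int(c)) for t, c in (pair.split(":") for pair in line.split())]


docs = [[(0, 1), (2, 3)], [(1, 2)], [(0, 5), (1, 1), (2, 2)]]
buf = io.BytesIO()  # stands in for the file on disk
index = save_with_offsets(buf, docs)
third = doc_by_offset(buf, index[2])  # random access without scanning earlier documents
```

Storing the offsets in a separate index is what lets a serialized corpus be addressed by document number instead of being read from the start.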

Example:

>>> from gensim.corpora import MmCorpus
>>> MmCorpus.serialize('test.mm', corpus)  # `corpus` is any iterable of bag-of-words documents
>>> mm = MmCorpus('test.mm')  # `mm` document stream now has random access
>>> print(mm[42])  # retrieve document no. 42, etc.

skip_headers(input_file)

Skip file headers that appear before the first document.

class gensim.corpora.ucicorpus.UciReader(input)

Initialize the reader.

The input parameter refers to a file on the local filesystem, which is expected to be in the UCI Bag-of-Words format.

docbyoffset(offset)

Return the document located offset bytes into the file.

skip_headers(input_file)

Skip file headers that appear before the first document.

class gensim.corpora.ucicorpus.UciWriter(fname)

Store a corpus in UCI Bag-of-Words format.

This corpus format is identical to the MM (Matrix Market) format, except for the file headers. There is no format line, and the first three lines of the file contain num_docs, num_terms, and num_nnz, one value per line.

This implementation is based on matutils.MmWriter, and works the same way.
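To make the layout concrete, here is a minimal pure-Python sketch of a file with three header lines followed by one `docID wordID count` triple per line. The helpers `write_uci` and `read_uci` are hypothetical and are not gensim's implementation. The sketch assumes ids are 1-based on disk (as in the UCI datasets) and 0-based in memory, that counts are integers, and that the corpus fits in memory, which the real writer does not require.

```python
import io


def write_uci(stream, corpus, num_terms):
    """Write header lines, then one 'docID wordID count' line per non-zero entry."""
    docs = [list(doc) for doc in corpus]  # materialised only to count entries up front
    num_nnz = sum(len(doc) for doc in docs)
    stream.write(f"{len(docs)}\n{num_terms}\n{num_nnz}\n")
    for doc_no, doc in enumerate(docs):
        for term_id, count in sorted(doc):
            stream.write(f"{doc_no + 1} {term_id + 1} {count}\n")  # shift to 1-based ids


def read_uci(stream):
    """Parse the three header values, then group the triples back into documents."""
    num_docs, num_terms, num_nnz = (int(stream.readline()) for _ in range(3))
    docs = [[] for _ in range(num_docs)]
    for line in stream:
        if not line.strip():
            continue
        doc_id, term_id, count = line.split()
        docs[int(doc_id) - 1].append((int(term_id) - 1, int(count)))  # back to 0-based
    return (num_docs, num_terms, num_nnz), docs


corpus = [[(0, 2), (3, 1)], [], [(1, 4)]]
buf = io.StringIO()
write_uci(buf, corpus, num_terms=4)
buf.seek(0)
header, parsed = read_uci(buf)
```

Note that the empty second document produces no body lines at all; only the num_docs header records that it exists.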

update_headers(num_docs, num_terms, num_nnz)

Update headers with actual values.

static write_corpus(fname, corpus, progress_cnt=1000, index=False)

Save the vector space representation of an entire corpus to disk.

Note that the documents are processed one at a time, so the whole corpus is allowed to be larger than the available RAM.

write_headers()

Write blank header lines. Will be updated later, once corpus stats are known.
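The blank-then-update pattern behind write_headers and update_headers can be sketched as follows. This is an illustration under stated assumptions, not gensim's code: the fixed `HEADER_WIDTH` and both function names are hypothetical. The idea is to reserve fixed-width header lines before streaming the documents, then seek back and overwrite them once the totals are known, without shifting the body.

```python
import io

HEADER_WIDTH = 20  # hypothetical padding; must fit the largest expected value


def write_blank_headers(stream):
    """Reserve three fixed-width lines for num_docs, num_terms and num_nnz."""
    for _ in range(3):
        stream.write(b" " * HEADER_WIDTH + b"\n")


def update_headers(stream, num_docs, num_terms, num_nnz):
    """Overwrite the reserved lines in place, then restore the write position."""
    end = stream.tell()
    stream.seek(0)
    for value in (num_docs, num_terms, num_nnz):
        text = str(value).encode("ascii")
        if len(text) > HEADER_WIDTH:
            raise ValueError("header value does not fit in the reserved space")
        stream.write(text.ljust(HEADER_WIDTH) + b"\n")  # same length as the blank line
    stream.seek(end)


buf = io.BytesIO()  # stands in for a file opened in binary write mode
write_blank_headers(buf)
buf.write(b"1 1 2\n")  # body written while the totals are still unknown
update_headers(buf, num_docs=1, num_terms=4, num_nnz=1)
lines = buf.getvalue().split(b"\n")
```

Padding every header to the same width is what makes the in-place overwrite safe: the rewritten lines occupy exactly the bytes that were reserved.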

write_vector(docno, vector)

Write a single sparse vector to the file.

A sparse vector is any iterable yielding (field id, field value) pairs.
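As a small illustration of what "any iterable" allows, the sketch below accepts a list, a generator, or a dict's items alike. The helper `vector_lines` is hypothetical; it assumes 0-based ids in memory and 1-based ids in the output lines, and it only formats text rather than writing to a file.

```python
def vector_lines(docno, vector):
    """Yield one 'docID fieldID value' line per pair, sorted by field id."""
    for field_id, value in sorted(vector):
        yield f"{docno + 1} {field_id + 1} {value}"


as_list = list(vector_lines(0, [(0, 3), (2, 1)]))
as_dict_items = list(vector_lines(0, {2: 1, 0: 3}.items()))
as_generator = list(vector_lines(0, ((i, c) for i, c in [(2, 1), (0, 3)])))
```

All three calls describe the same vector, so they produce the same lines.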