flair.datasets.base.MongoDataset

class flair.datasets.base.MongoDataset(query, host, port, database, collection, text_field, categories_field=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, tag_type='class')

Bases: FlairDataset

__init__(query, host, port, database, collection, text_field, categories_field=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, tag_type='class')

Reads Mongo collections.

Each collection should contain one document/text per item.

Each item should have the following format: {'Beskrivning': 'Abrahamsby. Gård i Gottröra sn, Långhundra hd, Stockholms län, nära Långsjön.', 'Län': 'Stockholms län', 'Härad': 'Långhundra', 'Församling': 'Gottröra', 'Plats': 'Abrahamsby'}
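To make the item format concrete, the following plain-Python sketch splits one such document into a text and a list of category labels. It is not Flair's implementation: `split_document` is a hypothetical helper, and skipping category fields that are missing from the document is an assumption made for illustration.

```python
from typing import Optional


def split_document(
    document: dict,
    text_field: str,
    categories_field: Optional[list[str]] = None,
) -> tuple[str, list[str]]:
    """Hypothetical helper: separate the text from the category labels."""
    # The text comes from the single field named by text_field.
    text = document.get(text_field, "")
    # Assumption: category fields absent from the document are skipped.
    labels = [document[name] for name in (categories_field or []) if name in document]
    return text, labels


# The example item from the description above.
example_document = {
    "Beskrivning": "Abrahamsby. Gård i Gottröra sn, Långhundra hd, Stockholms län, nära Långsjön.",
    "Län": "Stockholms län",
    "Härad": "Långhundra",
    "Församling": "Gottröra",
    "Plats": "Abrahamsby",
}

text, labels = split_document(
    example_document,
    text_field="Beskrivning",
    categories_field=["Län", "Härad", "Tingslag", "Församling", "Plats"],
)
print(text)
print(labels)
```

Here 'Tingslag' is requested but not present in the item, so the sketch yields four labels rather than five.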

Parameters:

query (str) – Query, e.g. {'Län': 'Stockholms län'}
host (str) – Host, e.g. 'localhost'
port (int) – Port, e.g. 27017
database (str) – Database, e.g. 'rosenberg'
collection (str) – Collection, e.g. 'book'
text_field (str) – Text field, e.g. 'Beskrivning'
categories_field (Optional[list[str]]) – List of category fields, e.g. ['Län', 'Härad', 'Tingslag', 'Församling', 'Plats']
max_tokens_per_doc (int) – If set, truncates each Sentence to at most this many tokens. If set to -1, all documents are taken as is.
max_chars_per_doc (int) – If set, truncates each Sentence to a maximum number of chars
tokenizer (Tokenizer) – Custom tokenizer to use (default SegtokTokenizer)
in_memory (bool) – If True, keeps dataset as Sentences in memory, otherwise only keeps strings
tag_type (str) – The tag type to assign labels to.

Returns: list of sentences
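As a usage illustration, a minimal sketch of constructing the dataset with the example values from the parameter list might look as follows. The connection details and field names are illustrative only, `load_dataset` is a hypothetical wrapper, and passing the query as a dict (as in the example above, although the parameter is typed as str) is an assumption. Running `load_dataset` requires flair, pymongo, and a reachable MongoDB server.

```python
# Keyword arguments mirroring the documented constructor signature.
# All values are the illustrative ones from the parameter list.
mongo_kwargs = {
    "query": {"Län": "Stockholms län"},  # assumption: a dict-shaped filter
    "host": "localhost",
    "port": 27017,
    "database": "rosenberg",
    "collection": "book",
    "text_field": "Beskrivning",
    "categories_field": ["Län", "Härad", "Tingslag", "Församling", "Plats"],
    "max_tokens_per_doc": -1,  # -1: do not truncate by token count
    "max_chars_per_doc": -1,   # default from the signature
    "in_memory": True,         # keep the dataset as Sentences in memory
    "tag_type": "class",
}


def load_dataset(kwargs: dict):
    """Hypothetical wrapper: build a MongoDataset from keyword arguments."""
    # Imported lazily so the configuration above can be inspected
    # without flair or a database being available.
    from flair.datasets.base import MongoDataset

    return MongoDataset(**kwargs)
```

With a server available, `dataset = load_dataset(mongo_kwargs)` would build the dataset, and `dataset.is_in_memory()` reports whether the Sentences are held in memory.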

Methods

`__init__`(query, host, port, database, ...[, ...])	Reads Mongo collections.
`is_in_memory`()

is_in_memory()

Return type: bool
