flair.datasets.base

class flair.datasets.base.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, drop_last=False, timeout=0, worker_init_fn=None)

Bases: DataLoader

dataset: Dataset[T_co]
batch_size: Optional[int]
num_workers: int
pin_memory: bool
drop_last: bool
timeout: float
sampler: Union[Sampler, Iterable]
pin_memory_device: str
prefetch_factor: Optional[int]
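
A usage sketch (the texts are illustrative): flair's DataLoader wraps torch.utils.data.DataLoader and collates each batch into a plain Python list of datapoints, so no tensor conversion happens at this stage.

    from flair.data import Sentence
    from flair.datasets.base import DataLoader, FlairDatapointDataset

    # wrap a few Sentences in a map-style Dataset, then batch them
    dataset = FlairDatapointDataset(
        [Sentence("First text."), Sentence("Second text."), Sentence("Third text.")]
    )
    loader = DataLoader(dataset, batch_size=2)

    for batch in loader:
        # each batch is a plain list of Sentence objects
        print(len(batch), batch)
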
class flair.datasets.base.FlairDatapointDataset(datapoints)

Bases: FlairDataset, Generic[DT]

A simple Dataset object that wraps a list of DataPoints, for example Sentences.

__init__(datapoints)

Instantiate FlairDatapointDataset.

Parameters:

datapoints (Union[DT, List[DT]]) – a single datapoint (DT, bound to DataPoint) or a list of datapoints that make up the FlairDatapointDataset

is_in_memory()
Return type:

bool
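
A minimal sketch of wrapping already-constructed Sentences; indexing and len() behave like any map-style Dataset, and is_in_memory() returns True because the datapoints are held in a list:

    from flair.data import Sentence
    from flair.datasets.base import FlairDatapointDataset

    # a single datapoint or a list of datapoints may be passed
    dataset = FlairDatapointDataset(
        [Sentence("The grass is green."), Sentence("The sky is blue.")]
    )

    print(len(dataset))            # 2
    print(dataset[0])              # the first wrapped Sentence
    print(dataset.is_in_memory())  # True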

class flair.datasets.base.SentenceDataset(sentences)

Bases: FlairDatapointDataset

__init__(sentences)

Deprecated since version 0.11: The ‘SentenceDataset’ class was renamed to ‘FlairDatapointDataset’
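
Since this is a pure rename, migration is a drop-in swap; a sketch:

    from flair.data import Sentence
    from flair.datasets.base import FlairDatapointDataset

    sentences = [Sentence("Hello world.")]

    # before 0.11: dataset = SentenceDataset(sentences)
    dataset = FlairDatapointDataset(sentences)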

class flair.datasets.base.StringDataset(texts, use_tokenizer=<flair.tokenization.SpaceTokenizer object>)

Bases: FlairDataset

A Dataset that takes strings as input and returns Sentences during iteration.

__init__(texts, use_tokenizer=<flair.tokenization.SpaceTokenizer object>)

Instantiate StringDataset.

Parameters:
  • texts (Union[str, List[str]]) – a string or a list of strings that make up the StringDataset

  • use_tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use. If this parameter is set to True instead of a Tokenizer, flair.tokenization.SegtokTokenizer will be used.

abstract is_in_memory()
Return type:

bool
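
A usage sketch (the texts are illustrative): the strings are kept as-is and tokenized into Sentence objects on access, which keeps memory low for large text collections.

    from flair.datasets.base import StringDataset
    from flair.tokenization import SegtokTokenizer

    dataset = StringDataset(
        ["Berlin is the capital of Germany.", "Paris is the capital of France."],
        use_tokenizer=SegtokTokenizer(),
    )

    # tokenization into a Sentence happens here, on access
    print(dataset[0])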

class flair.datasets.base.MongoDataset(query, host, port, database, collection, text_field, categories_field=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, tag_type='class')

Bases: FlairDataset

__init__(query, host, port, database, collection, text_field, categories_field=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, tag_type='class')

Reads Mongo collections.

Each collection should contain one document/text per item.

Each item should have the following format:

{
  'Beskrivning': 'Abrahamsby. Gård i Gottröra sn, Långhundra hd, Stockholms län, nära Långsjön.',
  'Län': 'Stockholms län',
  'Härad': 'Långhundra',
  'Församling': 'Gottröra',
  'Plats': 'Abrahamsby'
}

Parameters:
  • query (str) – Query, e.g. {'Län': 'Stockholms län'}

  • host (str) – Host, e.g. 'localhost'

  • port (int) – Port, e.g. 27017

  • database (str) – Database, e.g. 'rosenberg'

  • collection (str) – Collection, e.g. 'book'

  • text_field (str) – Text field, e.g. 'Beskrivning'

  • categories_field (Optional[List[str]]) – List of category fields, e.g. ['Län', 'Härad', 'Tingslag', 'Församling', 'Plats']

  • max_tokens_per_doc (int) – If set, truncates each Sentence to a maximum number of tokens; if set to -1, all documents are taken as is

  • max_chars_per_doc (int) – If set, truncates each Sentence to a maximum number of characters

  • tokenizer (Tokenizer) – Custom tokenizer to use (default: SegtokTokenizer)

  • in_memory (bool) – If True, keeps dataset as Sentences in memory; otherwise only keeps strings

  • tag_type (str) – The tag type to assign labels to

Each retrieved item is returned as a Sentence.

is_in_memory()
Return type:

bool
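
A construction sketch reusing the example database layout from the docstring above; it assumes pymongo is installed and a MongoDB instance is reachable at localhost:27017 (database, collection and field names are the docstring's examples, not real data):

    from flair.datasets.base import MongoDataset

    dataset = MongoDataset(
        query="{'Län': 'Stockholms län'}",  # the query is passed as a string
        host="localhost",
        port=27017,
        database="rosenberg",
        collection="book",
        text_field="Beskrivning",
        categories_field=["Län", "Härad", "Församling", "Plats"],
        in_memory=True,
        tag_type="class",
    )

    print(len(dataset))  # number of matching documents
    print(dataset[0])    # a Sentence labeled with the category fields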

flair.datasets.base.find_train_dev_test_files(data_folder, dev_file, test_file, train_file, autofind_splits=True)
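
A usage sketch with a hypothetical folder path: this helper resolves the dev, test and train paths inside data_folder, auto-detecting files whose names contain 'train', 'dev' or 'test' when autofind_splits is True and the corresponding argument is None.

    from pathlib import Path
    from flair.datasets.base import find_train_dev_test_files

    # hypothetical corpus folder containing e.g. train.txt, dev.txt, test.txt
    dev_file, test_file, train_file = find_train_dev_test_files(
        data_folder=Path("resources/tasks/my_corpus"),
        dev_file=None,
        test_file=None,
        train_file=None,
        autofind_splits=True,
    )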