flair.datasets.base
- class flair.datasets.base.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, drop_last=False, timeout=0, worker_init_fn=None)
Bases: DataLoader
- class flair.datasets.base.FlairDatapointDataset(datapoints)
Bases: FlairDataset, Generic[DT]
A simple Dataset object to wrap a List of Datapoints, for example Sentences.
- __init__(datapoints)
Instantiate FlairDatapointDataset.
- is_in_memory()
- Return type: bool
- class flair.datasets.base.SentenceDataset(sentences)
Bases: FlairDatapointDataset
- __init__(sentences)
Deprecated since version 0.11: The ‘SentenceDataset’ class was renamed to ‘FlairDatapointDataset’.
- class flair.datasets.base.StringDataset(texts, use_tokenizer=<flair.tokenization.SpaceTokenizer object>)
Bases: FlairDataset
A Dataset that takes strings as input and returns a Sentence for each one during iteration.
- __init__(texts, use_tokenizer=<flair.tokenization.SpaceTokenizer object>)
Instantiate StringDataset.
- Parameters:
texts (Union[str, List[str]]) – A string or list of strings that make up the StringDataset.
use_tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use. If this parameter is set to True instead of a tokenizer, flair.tokenization.SegtokTokenizer will be used.
- abstract is_in_memory()
- Return type: bool
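The behavior described for StringDataset (accept one string or a list of strings, and tokenize only when an item is accessed) can be sketched in plain Python. This is a stand-in for the pattern, not flair's implementation: `LazyStringDataset` and its `tokenize` hook are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
from typing import Callable, List, Union


class LazyStringDataset:
    """Hypothetical stand-in for the pattern StringDataset describes."""

    def __init__(
        self,
        texts: Union[str, List[str]],
        tokenize: Callable[[str], List[str]] = str.split,  # assumed tokenizer hook
    ) -> None:
        # A single string is treated as a one-element dataset.
        self.texts = [texts] if isinstance(texts, str) else list(texts)
        self.tokenize = tokenize

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, index: int) -> List[str]:
        # Tokenization happens on access, so only raw strings are stored.
        return self.tokenize(self.texts[index])


dataset = LazyStringDataset(["The grass is green .", "Berlin is a city ."])
first_tokens = dataset[0]
all_tokens = [dataset[i] for i in range(len(dataset))]
```

Storing only the raw strings keeps memory use low at the cost of re-tokenizing on every access, which is the trade-off this kind of dataset makes.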
- class flair.datasets.base.MongoDataset(query, host, port, database, collection, text_field, categories_field=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, tag_type='class')
Bases: FlairDataset
- __init__(query, host, port, database, collection, text_field, categories_field=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, tag_type='class')
Reads Mongo collections.
Each collection should contain one document/text per item.
Each item should have the following format: {'Beskrivning': 'Abrahamsby. Gård i Gottröra sn, Långhundra hd, Stockholms län, nära Långsjön.', 'Län': 'Stockholms län', 'Härad': 'Långhundra', 'Församling': 'Gottröra', 'Plats': 'Abrahamsby'}
- Parameters:
query (str) – Query, e.g. {'Län': 'Stockholms län'}
host (str) – Host, e.g. 'localhost'
port (int) – Port, e.g. 27017
database (str) – Database, e.g. 'rosenberg'
collection (str) – Collection, e.g. 'book'
text_field (str) – Text field, e.g. 'Beskrivning'
categories_field (Optional[List[str]]) – List of category fields, e.g. ['Län', 'Härad', 'Tingslag', 'Församling', 'Plats']
max_tokens_per_doc (int) – Truncates each Sentence to at most this number of tokens. If set to -1, all documents are taken as is.
max_chars_per_doc (int) – If set, truncates each Sentence to a maximum number of characters.
tokenizer (Tokenizer) – Custom tokenizer to use (default SegtokTokenizer).
in_memory (bool) – If True, keeps the dataset as Sentences in memory; otherwise only keeps strings.
tag_type (str) – The tag type to assign labels to.
Returns: list of sentences
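The item format and the truncation parameters above can be illustrated with a small sketch in plain Python. It uses neither flair nor MongoDB: `parse_item` is a hypothetical helper, whitespace splitting stands in for the tokenizer, and both the order of truncation (characters first, then tokens) and the skipping of missing category fields are assumptions rather than documented behavior.

```python
from typing import Dict, List, Optional, Tuple


def parse_item(
    item: Dict[str, str],
    text_field: str,
    categories_field: Optional[List[str]] = None,
    max_tokens_per_doc: int = -1,
    max_chars_per_doc: int = -1,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (tokens, labels) for one Mongo item."""
    text = item.get(text_field, "")
    # Assumption: character truncation is applied to the raw text first.
    if max_chars_per_doc > 0:
        text = text[:max_chars_per_doc]
    # Whitespace splitting stands in for the configured tokenizer.
    tokens = text.split()
    # Assumption: token truncation is applied after tokenization.
    if max_tokens_per_doc > 0:
        tokens = tokens[:max_tokens_per_doc]
    # Assumption: category fields missing from the item are skipped.
    labels = [item[f] for f in (categories_field or []) if f in item]
    return tokens, labels


item = {
    "Beskrivning": "Abrahamsby. Gård i Gottröra sn, Långhundra hd, "
    "Stockholms län, nära Långsjön.",
    "Län": "Stockholms län",
    "Härad": "Långhundra",
    "Församling": "Gottröra",
    "Plats": "Abrahamsby",
}
tokens, labels = parse_item(
    item,
    text_field="Beskrivning",
    categories_field=["Län", "Härad", "Tingslag", "Församling", "Plats"],
    max_tokens_per_doc=4,
)
```

With `max_tokens_per_doc=4` the example item keeps only its first four whitespace tokens, and under this sketch's assumption the absent 'Tingslag' field contributes no label.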
- is_in_memory()
- Return type: bool
- flair.datasets.base.find_train_dev_test_files(data_folder, dev_file, test_file, train_file, autofind_splits=True)