FedNLPDataset

class fl_sim.data_processing.FedNLPDataset(datadir: Path | str | None = None, seed: int = 0, **extra_config: Any)

Bases: FedDataset, ABC

Base class for all federated NLP datasets.

Methods that have to be implemented by subclasses:

  • get_dataloader

  • _preload

  • evaluate

  • get_word_dict

Properties that have to be implemented by subclasses:

  • url

  • candidate_models

  • doi
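The contract above can be illustrated with a minimal sketch. It does not import fl_sim: `NLPDatasetContract` is a hypothetical stand-in that only mirrors the method and property names listed above, and the signatures of `_preload`, `evaluate`, and `get_word_dict`, as well as the property return types, are assumptions because this page does not document them. `ToyNLPDataset` and `IncompleteDataset` are likewise illustrative names.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional, Sequence, Tuple


class NLPDatasetContract(ABC):
    """Hypothetical stand-in mirroring the documented contract (not fl_sim code)."""

    def __init__(self, datadir: Optional[str] = None, seed: int = 0, **extra_config: Any) -> None:
        self.datadir = datadir
        self.seed = seed
        self.extra_config = extra_config

    @abstractmethod
    def get_dataloader(
        self, train_bs: int, test_bs: int, client_idx: Optional[int] = None
    ) -> Tuple[Any, Any]:
        """Documented signature; return types simplified to Any."""

    @abstractmethod
    def _preload(self) -> None:
        """Assumed signature: load the raw data."""

    @abstractmethod
    def evaluate(self, preds: Sequence[int], truths: Sequence[int]) -> Dict[str, float]:
        """Assumed signature: compute metrics from predictions and labels."""

    @abstractmethod
    def get_word_dict(self) -> Dict[str, int]:
        """Assumed signature: map each token to an integer id."""

    @property
    @abstractmethod
    def url(self) -> str: ...

    @property
    @abstractmethod
    def candidate_models(self) -> Dict[str, Any]: ...

    @property
    @abstractmethod
    def doi(self) -> List[str]: ...


class ToyNLPDataset(NLPDatasetContract):
    """Implements every required method and property with toy data."""

    def _preload(self) -> None:
        self._samples = ["hello world", "hello there"]

    def get_dataloader(self, train_bs, test_bs, client_idx=None):
        self._preload()
        # Plain lists stand in for DataLoader objects.
        return list(self._samples), list(self._samples)

    def evaluate(self, preds, truths):
        correct = sum(int(p == t) for p, t in zip(preds, truths))
        return {"acc": correct / max(len(truths), 1)}

    def get_word_dict(self):
        self._preload()
        tokens = sorted({tok for s in self._samples for tok in s.split()})
        return {tok: i for i, tok in enumerate(tokens)}

    @property
    def url(self) -> str:
        return ""  # placeholder value

    @property
    def candidate_models(self) -> Dict[str, Any]:
        return {}  # placeholder value

    @property
    def doi(self) -> List[str]:
        return []  # placeholder value


class IncompleteDataset(NLPDatasetContract):
    """Implements only one required member, so it stays abstract."""

    def _preload(self) -> None:
        pass
```

Instantiating `IncompleteDataset` raises `TypeError`, because Python's `abc` machinery refuses to instantiate a class that still has unimplemented abstract members. The same mechanism would apply to any `ABC`-based base class.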

Parameters:
  • datadir (Union[str, pathlib.Path], optional) – The directory to store the dataset. If None, use default directory.

  • seed (int, default 0) – The random seed.

  • **extra_config (dict, optional) – Extra configurations.

abstract get_dataloader(train_bs: int, test_bs: int, client_idx: int | None = None) → Tuple[DataLoader, DataLoader]

Get the training and testing dataloaders for client client_idx, or the global dataloaders if client_idx is None.
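A minimal sketch of the local-versus-global behaviour, assuming that omitting client_idx selects the data of all clients. Plain lists of batches stand in for `DataLoader` objects, and `CLIENT_DATA` and `batched` are hypothetical names introduced only for this illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy corpus: client index -> (training samples, testing samples).
CLIENT_DATA: Dict[int, Tuple[List[str], List[str]]] = {
    0: (["a b", "c d", "e f"], ["g h"]),
    1: (["i j", "k l"], ["m n", "o p"]),
}


def batched(samples: List[str], batch_size: int) -> List[List[str]]:
    """Split samples into consecutive batches of at most batch_size items."""
    return [samples[i : i + batch_size] for i in range(0, len(samples), batch_size)]


def get_dataloader(
    train_bs: int, test_bs: int, client_idx: Optional[int] = None
) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (train batches, test batches) for one client or for all clients."""
    if client_idx is None:
        # Global view: concatenate the data of every client.
        train = [s for idx in sorted(CLIENT_DATA) for s in CLIENT_DATA[idx][0]]
        test = [s for idx in sorted(CLIENT_DATA) for s in CLIENT_DATA[idx][1]]
    else:
        # Local view: only the selected client's data.
        train, test = CLIENT_DATA[client_idx]
    return batched(train, train_bs), batched(test, test_bs)


local_train, local_test = get_dataloader(2, 2, client_idx=0)
global_train, global_test = get_dataloader(2, 2)
```

A real subclass would presumably wrap the selected samples in `torch.utils.data.DataLoader` instead of slicing lists.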

load_partition_data(batch_size: int | None = None) → tuple

Partition the data among all local clients.

Parameters:

batch_size (int, optional) – Batch size for dataloader. If None, use default batch size.

Returns:

  • train_clients_num: int

    Number of training clients.

  • train_data_num: int

    Total number of training samples.

  • test_data_num: int

    Total number of testing samples.

  • train_data_global: torch.utils.data.DataLoader

    Global training dataloader.

  • test_data_global: torch.utils.data.DataLoader

    Global testing dataloader.

  • data_local_num_dict: dict

    Number of local training samples for each client.

  • train_data_local_dict: dict

    Local training dataloader for each client.

  • test_data_local_dict: dict

    Local testing dataloader for each client.

  • vocab_len: int

    Length of the vocabulary.

Return type:

tuple
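The nine return values listed above are positional, so callers unpack them in order. The sketch below assembles such a tuple from a `get_dataloader`-style callable. It is an illustration under stated assumptions, not the library's implementation: `load_partition_data_sketch`, `toy_loader`, and `TOY` are hypothetical names, lists of batches stand in for `DataLoader` objects, and the dictionaries are assumed to be keyed by client index.

```python
from typing import Callable, Dict, List, Optional, Tuple

Batches = List[List[str]]  # stand-in for torch.utils.data.DataLoader
LoaderFn = Callable[[int, int, Optional[int]], Tuple[Batches, Batches]]


def count(batches: Batches) -> int:
    """Total number of samples across all batches."""
    return sum(len(batch) for batch in batches)


def load_partition_data_sketch(
    get_loader: LoaderFn, num_clients: int, vocab_len: int, batch_size: int
) -> tuple:
    """Build the nine-field tuple in the documented order."""
    train_data_global, test_data_global = get_loader(batch_size, batch_size, None)
    data_local_num_dict: Dict[int, int] = {}
    train_data_local_dict: Dict[int, Batches] = {}
    test_data_local_dict: Dict[int, Batches] = {}
    for client_idx in range(num_clients):
        train_local, test_local = get_loader(batch_size, batch_size, client_idx)
        data_local_num_dict[client_idx] = count(train_local)
        train_data_local_dict[client_idx] = train_local
        test_data_local_dict[client_idx] = test_local
    return (
        num_clients,
        count(train_data_global),
        count(test_data_global),
        train_data_global,
        test_data_global,
        data_local_num_dict,
        train_data_local_dict,
        test_data_local_dict,
        vocab_len,
    )


# Hypothetical toy data: client index -> (training samples, testing samples).
TOY = {0: (["a b", "b c", "c d"], ["d e"]), 1: (["e f"], ["f g", "g h"])}


def chunk(samples: List[str], size: int) -> Batches:
    return [samples[k : k + size] for k in range(0, len(samples), size)]


def toy_loader(train_bs: int, test_bs: int, client_idx: Optional[int] = None):
    ids = sorted(TOY) if client_idx is None else [client_idx]
    train = [s for i in ids for s in TOY[i][0]]
    test = [s for i in ids for s in TOY[i][1]]
    return chunk(train, train_bs), chunk(test, test_bs)


result = load_partition_data_sketch(toy_loader, num_clients=2, vocab_len=8, batch_size=2)
(
    train_clients_num,
    train_data_num,
    test_data_num,
    train_data_global,
    test_data_global,
    data_local_num_dict,
    train_data_local_dict,
    test_data_local_dict,
    vocab_len,
) = result
```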

load_partition_data_distributed(process_id: int, batch_size: int | None = None) → tuple

Get the local dataloader for the client at process_id, or get the global dataloader.

Parameters:
  • process_id (int) – Index of the client whose dataloader is requested. If None, get the dataloader containing all data, which is usually used for centralized training.

  • batch_size (int, optional) – Batch size for dataloader. If None, use default batch size.

Returns:

Return type:

tuple