graphnet.data.dataset.parquet.parquet_dataset module

Base Dataset class(es) used in GraphNeT.

class graphnet.data.dataset.parquet.parquet_dataset.ParquetDataset(*args, **kwargs)

Bases: Dataset

Dataset class for Parquet-files converted with ParquetWriter.

Construct Dataset.

NOTE: DataLoaders using this Dataset should set multiprocessing_context="spawn" to avoid thread locking.
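The note above can be sketched as follows. This is a minimal sketch, assuming a PyTorch-style DataLoader that accepts the num_workers and multiprocessing_context keyword arguments. The helper name spawn_loader_kwargs is hypothetical and not part of GraphNeT. It only builds the keyword arguments, so it runs without PyTorch installed.

```python
import multiprocessing as mp
from typing import Any, Dict


def spawn_loader_kwargs(batch_size: int, num_workers: int) -> Dict[str, Any]:
    """Build DataLoader keyword arguments that request the 'spawn' start method.

    Hypothetical helper for illustration; not part of GraphNeT.
    """
    if "spawn" not in mp.get_all_start_methods():
        raise RuntimeError("The 'spawn' start method is not available here.")
    kwargs: Dict[str, Any] = {
        "batch_size": batch_size,
        "num_workers": num_workers,
    }
    # torch.utils.data.DataLoader only accepts a multiprocessing context
    # when worker processes are used (num_workers > 0).
    if num_workers > 0:
        kwargs["multiprocessing_context"] = "spawn"
    return kwargs


loader_kwargs = spawn_loader_kwargs(batch_size=32, num_workers=4)
# Intended use (requires PyTorch and an existing `dataset`; graph objects
# typically also need a suitable collate_fn):
#   from torch.utils.data import DataLoader
#   loader = DataLoader(dataset, **loader_kwargs)
```

With num_workers=0 no worker processes are started, so the helper leaves the context out.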

Parameters:
  • path (str) – Path to the file(s) from which this Dataset should read.

  • pulsemaps (Union[str, List[str]]) – Name(s) of the pulse map series that should be used to construct the nodes on the individual graph objects, and their features. Multiple pulse series maps can be used, e.g., when different DOM types are stored in different maps.

  • features (List[str]) – List of columns in the input files that should be used as node features on the graph objects.

  • truth (List[str]) – List of event-level columns in the input files that should be added as attributes on the graph objects.

  • node_truth (Optional[List[str]], default: None) – List of node-level columns in the input files that should be added as attributes on the graph objects.

  • index_column (str, default: 'event_no') – Name of the column in the input files that contains unique indices to identify and map events across tables.

  • truth_table (str, default: 'truth') – Name of the table containing event-level truth information.

  • node_truth_table (Optional[str], default: None) – Name of the table containing node-level truth information.

  • string_selection (Optional[List[int]], default: None) – Subset of strings for which data should be read and used to construct graph objects. Defaults to None, meaning all strings for which data exists are used.

  • selection (Union[str, List[int], List[List[int]], None], default: None) – The batch ids to include in the dataset. Defaults to None, meaning that all batches are read.

  • dtype (dtype, default: torch.float32) – Type of the feature tensor on the graph objects returned.

  • loss_weight_table (Optional[str], default: None) – Name of the table containing per-event loss weights.

  • loss_weight_column (Optional[str], default: None) – Name of the column in loss_weight_table containing per-event loss weights. This is also the name of the corresponding attribute assigned to the graph object.

  • loss_weight_default_value (Optional[float], default: None) – Default per-event loss weight. NOTE: This default value is only applied when loss_weight_table and loss_weight_column are specified, and in this case to events with no value in the corresponding table/column. That is, if no per-event loss weight table/column is provided, this value is ignored. Defaults to None.

  • seed (Optional[int], default: None) – Random number generator seed, used for selecting a random subset of events when resolving a string-based selection (e.g., “10000 random events ~ event_no % 5 > 0” or “20% random events ~ event_no % 5 > 0”).

  • graph_definition (GraphDefinition) – Method that defines the graph representation.

  • cache_size (int, default: 1) – Number of files to cache in memory. Must be at least 1. Defaults to 1.

  • labels (Optional[Dict[str, Any]], default: None) – Dictionary of labels to be added to the dataset.

  • args (Any)

  • kwargs (Any)

Return type:

object
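The rule stated for loss_weight_default_value can be written out as a short sketch. The function resolve_loss_weight and the dictionary standing in for the loss-weight table/column are hypothetical; they only illustrate the documented behaviour and are not GraphNeT code.

```python
from typing import Dict, Optional


def resolve_loss_weight(
    event_no: int,
    weights: Optional[Dict[int, float]],
    default: Optional[float],
) -> Optional[float]:
    """Return the loss weight for one event, following the documented rule.

    `weights` stands in for the loss-weight table/column and is None when
    no table/column has been configured.
    """
    if weights is None:
        # No table/column configured: the default value is ignored.
        return None
    if event_no in weights:
        return weights[event_no]
    # Table/column configured, but this event has no entry: use the default.
    return default


table = {1: 0.5, 2: 2.0}
print(resolve_loss_weight(1, table, default=1.0))  # 0.5
print(resolve_loss_weight(3, table, default=1.0))  # 1.0
print(resolve_loss_weight(3, None, default=1.0))   # None
```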

property chunk_sizes: List[int]

Return a list of the chunk sizes.

query_table(table, columns, sequential_index, selection)

Query a table at a specific index, optionally with some selection.

Parameters:
  • table (str) – Table to be queried.

  • columns (Union[List[str], str]) – Columns to read out.

  • sequential_index (Optional[int], default: None) – Sequentially numbered index (i.e., in [0, len(self))) of the event to query. This may differ from the indexing used in self._indices. If no value is provided, the entire column is returned.

  • selection (Optional[str], default: None) – Selection to be imposed before reading out data. Defaults to None.

Return type:

ndarray

Returns:

List of tuples containing the values in columns. If the table contains only scalar data for columns, a list of length 1 is returned.

Raises:

ColumnMissingException – If one or more elements in columns are not present in table.
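The difference between a sequential index and the stored event index can be illustrated with an in-memory sketch. Everything here is an assumption made for illustration: query_rows, ColumnMissingError, and the dictionary-of-lists table are stand-ins and do not reflect how ParquetDataset reads Parquet files.

```python
from typing import Dict, List, Optional, Tuple, Union


class ColumnMissingError(KeyError):
    """Stand-in for the exception raised when a requested column is absent."""


def query_rows(
    table: Dict[str, list],
    columns: Union[List[str], str],
    indices: List[int],
    sequential_index: Optional[int] = None,
    index_column: str = "event_no",
) -> List[Tuple]:
    """Read `columns` from `table`, optionally for a single event.

    `indices[i]` is the event id stored at sequential position `i`.
    """
    if isinstance(columns, str):
        columns = [columns]
    missing = [column for column in columns if column not in table]
    if missing:
        raise ColumnMissingError(f"Missing columns: {missing}")
    rows = list(range(len(table[index_column])))
    if sequential_index is not None:
        # Translate the position in the dataset into the stored event id.
        event_id = indices[sequential_index]
        rows = [row for row in rows if table[index_column][row] == event_id]
    return [tuple(table[column][row] for column in columns) for row in rows]


truth = {"event_no": [10, 20, 30], "energy": [1.5, 2.5, 3.5]}
indices = [30, 10]  # the dataset holds two events; position 0 is event 30
print(query_rows(truth, ["energy"], indices, sequential_index=0))  # [(3.5,)]
print(query_rows(truth, "energy", indices))  # [(1.5,), (2.5,), (3.5,)]
```

A scalar column yields a list of length 1 for a single event, matching the description of the return value above.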