nemo_automodel.components.datasets.llm.nanogpt_dataset#

PyTorch IterableDataset for .bin shards written by NanoGPT preprocessing scripts.

Supports both legacy fineweb.py format and the newer nanogpt_data_processor.py format.

Legacy format (fineweb.py)::

int32[256] header
    header[0] = 20240520        # magic number
    header[1] = 1               # version
    header[2] = num_tokens      # number of uint16 tokens that follow
    header[3] = (unused)        # defaults to 0

uint16[num_tokens] tokens

New format (nanogpt_data_processor.py)::

int32[256] header
header[0] = 278895051       # magic number
    header[1] = 1               # version
    header[2] = num_tokens      # number of tokens that follow
    header[3] = dtype.itemsize  # bytes per token (2 for uint16, 4 for uint32)

uint16/uint32[num_tokens] tokens
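For illustration, a minimal sketch of reading the header with NumPy, based solely on the layout above (the module's own helpers may differ in details)::

    import numpy as np

    MAGIC = 278895051        # new format (nanogpt_data_processor.py)
    LEGACY_MAGIC = 20240520  # legacy format (fineweb.py)
    HEADER_SIZE = 256        # header length, in int32 words

    def peek_header(path):
        # Read only the 256 * 4 = 1024 header bytes; token data is untouched.
        header = np.fromfile(path, dtype=np.int32, count=HEADER_SIZE)
        magic, version, num_tokens = int(header[0]), int(header[1]), int(header[2])
        if magic == LEGACY_MAGIC:
            itemsize = 2               # legacy shards are always uint16
        elif magic == MAGIC:
            itemsize = int(header[3])  # 2 for uint16, 4 for uint32
        else:
            raise ValueError(f"{path}: unrecognized magic {magic}")
        return version, num_tokens, itemsize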

Optionally, a corresponding .bos.idx file can exist alongside each .bin file::

int32[n_bos_tokens] bos_positions
    # Array of absolute byte positions where BOS tokens occur in the .bin file

The dataset streams one contiguous seq_len token slice at a time and returns the pair (inputs, labels), where labels are the inputs shifted by one position. Optionally, slices can be forced to start at a BOS token (align_to_bos=True). When BOS alignment is enabled, the dataset uses .bos.idx files for efficient BOS token lookup when available, falling back to linear search otherwise.
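Conceptually, one common way to produce such a pair is to take a window of seq_len + 1 tokens (a sketch of the idea, not necessarily this module's exact slicing)::

    buf = tokens[pos : pos + seq_len + 1]  # seq_len + 1 contiguous tokens
    inputs = buf[:-1]                      # seq_len tokens
    labels = buf[1:]                       # same tokens shifted left by one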

This file is copied (with minimal adjustments) from modded-nanogpt/data/bin_dataset.py so that projects depending on nemo_automodel can directly import NanogptDataset (BinTokenDataset upstream) without taking a runtime dependency on the NanoGPT codebase.

Module Contents#

Classes#

NanogptDataset

Iterable dataset over NanoGPT-format .bin token shards.

Functions#

_peek_num_tokens

Returns total number of tokens from the shard header, without traversing the data. Supports both legacy fineweb.py and new nanogpt_data_processor.py formats.

_load_bos_index

Load BOS token positions from a .bos.idx file if it exists.

_find_next_bos_with_index

Find the next BOS token position using the index.

_get_dtype_from_val

Returns the torch.dtype for the given value.

load_bin_shard

Memory-map a .bin shard and return it as a 1-D torch.uint16/uint32 tensor.

_get_next_bos_position

Get the next BOS token position.

_get_start_end_pos_single_file

Get the start and end positions for a single file, accounting for the number of workers.

_get_worker_id_and_total_workers

Get the worker ID and the total number of workers.

Data#

API#

nemo_automodel.components.datasets.llm.nanogpt_dataset.__all__#

['NanogptDataset', 'load_bin_shard']

nemo_automodel.components.datasets.llm.nanogpt_dataset.MAGIC#

278895051

nemo_automodel.components.datasets.llm.nanogpt_dataset.LEGACY_MAGIC#

20240520

nemo_automodel.components.datasets.llm.nanogpt_dataset.VERSION#

1

nemo_automodel.components.datasets.llm.nanogpt_dataset.HEADER_BYTES#

1024

nemo_automodel.components.datasets.llm.nanogpt_dataset.HEADER_SIZE#

256

nemo_automodel.components.datasets.llm.nanogpt_dataset._peek_num_tokens(path: str | os.PathLike) → int#

Returns total number of tokens from the shard header, without traversing the data. Supports both legacy fineweb.py and new nanogpt_data_processor.py formats.

nemo_automodel.components.datasets.llm.nanogpt_dataset._load_bos_index(path: str | os.PathLike) → numpy.ndarray | None#

Load BOS token positions from a .bos.idx file if it exists.

Parameters:

path – Path to the .bin file (will look for corresponding .bos.idx file)

Returns:

Array of BOS token positions if index file exists, None otherwise.
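A sketch of such a loader, assuming the index is the flat int32 array described in the format section above (the dtype and helper name here are assumptions)::

    import os

    import numpy as np

    def load_bos_index(bin_path):
        # "data/shard_00.bin" -> "data/shard_00.bos.idx"
        idx_path = os.path.splitext(str(bin_path))[0] + ".bos.idx"
        if not os.path.exists(idx_path):
            return None  # caller falls back to a linear scan
        return np.fromfile(idx_path, dtype=np.int32)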

nemo_automodel.components.datasets.llm.nanogpt_dataset._find_next_bos_with_index(
bos_positions: numpy.ndarray,
start_pos: int,
max_pos: int,
) → int#

Find the next BOS token position using the index.

Parameters:
  • bos_positions – Array of BOS token positions

  • start_pos – Current position to search from

  • max_pos – Maximum position to search up to

Returns:

Position of next BOS token, or max_pos if none found.
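Because the index positions are sorted, the lookup can be a binary search; a sketch of the documented behavior::

    import numpy as np

    def find_next_bos_with_index(bos_positions, start_pos, max_pos):
        # First BOS position >= start_pos, found by binary search.
        i = int(np.searchsorted(bos_positions, start_pos, side="left"))
        if i < len(bos_positions) and bos_positions[i] < max_pos:
            return int(bos_positions[i])
        return max_pos  # no BOS token before max_pos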

nemo_automodel.components.datasets.llm.nanogpt_dataset._get_dtype_from_val(n_bytes: int) → torch.dtype#

Returns the torch.dtype for the given value.
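Given the header's dtype.itemsize field, the mapping is presumably the following (a sketch; torch.uint16/uint32 require a PyTorch version that supports them)::

    import torch

    def get_dtype_from_val(n_bytes):
        if n_bytes == 2:
            return torch.uint16
        if n_bytes == 4:
            return torch.uint32
        raise ValueError(f"unsupported token width: {n_bytes} bytes")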

nemo_automodel.components.datasets.llm.nanogpt_dataset.load_bin_shard(path: str | os.PathLike) → torch.Tensor#

Memory-map a .bin shard and return it as a 1-D torch.uint16/uint32 tensor.

The returned tensor shares memory with the underlying file and is therefore extremely cheap to create. Do not modify it in-place.
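A sketch of the memory-mapping approach, assuming a PyTorch version with uint16/uint32 NumPy interop (np.memmap defers all I/O until the tensor is sliced; torch.from_numpy wraps the same buffer without copying and will warn that the mapping is read-only)::

    import numpy as np
    import torch

    HEADER_BYTES = 256 * 4  # int32[256] header

    def load_bin_shard_sketch(path):
        header = np.fromfile(path, dtype=np.int32, count=256)
        num_tokens = int(header[2])
        itemsize = int(header[3]) or 2  # legacy header stores 0 -> uint16
        np_dtype = np.uint16 if itemsize == 2 else np.uint32
        # Map only the token payload, skipping the fixed-size header.
        arr = np.memmap(path, dtype=np_dtype, mode="r",
                        offset=HEADER_BYTES, shape=(num_tokens,))
        return torch.from_numpy(arr)  # zero-copy view; do not modify in-place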

nemo_automodel.components.datasets.llm.nanogpt_dataset._get_next_bos_position(
tokens: torch.Tensor,
bos_token: int,
bos_positions: numpy.ndarray,
pos: int,
max_pos: int,
) → int#

Get the next BOS token position.

Parameters:
  • tokens – Tensor of tokens

  • bos_token – BOS token ID

  • bos_positions – Array of BOS token positions

  • pos – Current position

  • max_pos – Maximum position

Returns:

Next BOS token position
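The documented dispatch between the fast and slow paths, sketched (reusing find_next_bos_with_index from the binary-search sketch above)::

    def get_next_bos_position(tokens, bos_token, bos_positions, pos, max_pos):
        if bos_positions is not None:
            # Fast path: binary search over the precomputed .bos.idx positions.
            return find_next_bos_with_index(bos_positions, pos, max_pos)
        # Slow path: linear scan through the token tensor.
        while pos < max_pos and int(tokens[pos]) != bos_token:
            pos += 1
        return pos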

nemo_automodel.components.datasets.llm.nanogpt_dataset._get_start_end_pos_single_file(
total_tokens: int,
total_workers: int,
global_worker_id: int,
) → tuple[int, int]#

Get the start and end positions for a single file, accounting for the number of workers.

Parameters:
  • total_tokens – Total number of tokens in the file

  • total_workers – Total number of workers

  • global_worker_id – Global worker ID

Returns:

Tuple of (start position, end position)
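A plausible even split into contiguous per-worker chunks (the module's exact boundary handling may differ)::

    def get_start_end_pos_single_file(total_tokens, total_workers, global_worker_id):
        # Near-equal contiguous chunks; the first (total_tokens % total_workers)
        # workers receive one extra token each.
        base, rem = divmod(total_tokens, total_workers)
        start = global_worker_id * base + min(global_worker_id, rem)
        end = start + base + (1 if global_worker_id < rem else 0)
        return start, end

For example, 10 tokens over 3 workers yields the ranges (0, 4), (4, 7), and (7, 10).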

nemo_automodel.components.datasets.llm.nanogpt_dataset._get_worker_id_and_total_workers(
worker: torch.utils.data.get_worker_info,
) → tuple[int, int]#

Get the worker ID and the total number of workers.

Returns:

Tuple of (global worker ID, total number of workers)
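A sketch of how a global worker identity can be derived; combining DataLoader workers with distributed ranks is an assumption about this helper, not something the signature confirms::

    import torch.distributed as dist

    def get_worker_id_and_total_workers(worker):
        # worker is the result of torch.utils.data.get_worker_info();
        # it is None when iterating in the main process.
        local_id = worker.id if worker is not None else 0
        num_local = worker.num_workers if worker is not None else 1
        # Fold in the distributed rank, if a process group is initialized.
        if dist.is_available() and dist.is_initialized():
            rank, world = dist.get_rank(), dist.get_world_size()
        else:
            rank, world = 0, 1
        return rank * num_local + local_id, world * num_local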

class nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset(
file_pattern: str | Sequence[str],
seq_len: int,
*,
bos_token: int | None = None,
shuffle_files: bool = False,
align_to_bos: bool = False,
)#

Bases: torch.utils.data.IterableDataset

Iterable dataset over NanoGPT-format .bin token shards.

Each shard is a binary file containing:

  • a 256 × 4-byte header (magic number, version, num_tokens, dtype.itemsize)

  • the tokens themselves

Optionally, a corresponding .bos.idx file can be present alongside each .bin file containing precomputed BOS token positions for efficient alignment when align_to_bos=True. If the index file is not present, the dataset falls back to linear search for BOS tokens.

Parameters:
  • file_pattern – str | Sequence[str] Glob pattern (e.g. "data/fineweb_*_train_*.bin") or an explicit list of file paths.

  • seq_len – int Length of the training sample returned (not counting the next-token target). labels are simply inputs[1:].

  • shuffle_files – bool, default False Shuffle the order of shards each epoch/iteration.

  • align_to_bos – bool, default False Ensure that every slice starts with bos_token. When enabled, the dataset searches forward from the current position until it finds the next BOS token and starts there. It uses .bos.idx files for efficient lookup when available, falling back to linear search otherwise. Requires bos_token to be provided.

  • bos_token – int, optional, default None. Token ID marking beginning-of-document.

Initialization
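A typical usage sketch; the shard pattern and BOS token ID below are illustrative, not defaults::

    from torch.utils.data import DataLoader

    from nemo_automodel.components.datasets.llm.nanogpt_dataset import NanogptDataset

    dataset = NanogptDataset(
        "data/fineweb_train_*.bin",  # hypothetical shard paths
        seq_len=1024,
        bos_token=50256,             # e.g. GPT-2's <|endoftext|>; illustrative
        shuffle_files=True,
        align_to_bos=True,           # every sample starts at a document boundary
    )

    loader = DataLoader(dataset, batch_size=8, num_workers=4)
    batch = next(iter(loader))
    input_ids, labels = batch["input_ids"], batch["labels"]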

_setup_worker_context(
files,
shuffle,
) → tuple[List[str], random.Random, bool, int, int]#

Set up worker-specific context including file assignment and splitting parameters.

Returns:

Tuple of (worker_files, rng, split_single_file, file_start_pos, file_end_pos)

_process_file_tokens(
file: str,
split_single_file: bool,
file_start_pos: int,
file_end_pos: int,
) → Iterator[dict]#

Process tokens from a single file and yield training samples.

Parameters:
  • file – Path to the .bin file to process

  • split_single_file – Whether we’re splitting a single file among workers

  • file_start_pos – Starting position in the file (for single file splitting)

  • file_end_pos – Ending position in the file (for single file splitting)

Yields:

Dictionary containing 'input_ids' and 'labels' for training

_get_file_iterator(
worker_files: List[str],
rng: random.Random,
split_single_file: bool,
file_start_pos: int,
file_end_pos: int,
) → Iterator[dict]#

Generate training samples from all assigned files, handling infinite iteration.

Parameters:
  • worker_files – List of files assigned to this worker

  • rng – Random number generator for shuffling

  • split_single_file – Whether we’re splitting a single file among workers

  • file_start_pos – Starting position in file (for single file splitting)

  • file_end_pos – Ending position in file (for single file splitting)

Yields:

Training sample dictionaries from all files

__iter__() → Iterator[dict]#

Iterate over training samples from the dataset.

Yields:

Dictionary containing 'input_ids' and 'labels' for training

abstractmethod __len__() → int#
abstractmethod __getitem__(index: int)#