API Reference
Tokenizers for LLMs.
GPTSmallTextDataset
Bases: Dataset
GPT dataset interface for any 'small' text data.
This tokenizes all text in memory using GPT-2's tokenization algorithm, a pre-trained Byte Pair Encoding (BPE) tokenizer.
Source code in src/llmz/datasets.py
__init__(text, max_length=256, stride=128)
Initialise.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Raw text data to convert into tokens. | required |
max_length | int | Number of tokens for each data instance. Defaults to 256. | 256 |
stride | int | Separation (in tokens) between consecutive instances. Defaults to 128. | 128 |
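
The interaction between max_length and stride can be illustrated with a minimal sliding-window sketch. This is not the library's source: the helper name `sliding_windows` is hypothetical, plain integers stand in for GPT-2 token IDs, and the pairing of each window with a target shifted by one token is an assumption based on the usual next-token-prediction setup.

```python
def sliding_windows(token_ids, max_length=256, stride=128):
    """Split token IDs into (inputs, targets) pairs.

    Assumption: targets are the inputs shifted one token to the right,
    as is common for next-token prediction. Hypothetical helper, not llmz code.
    """
    pairs = []
    # Stop early enough that every target window is also full length.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start : start + max_length]
        targets = token_ids[start + 1 : start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


# Integers stand in for real token IDs.
tokens = list(range(10))
pairs = sliding_windows(tokens, max_length=4, stride=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

With a stride smaller than max_length, consecutive instances overlap; setting the stride equal to max_length would produce non-overlapping instances.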
Source code in src/llmz/datasets.py
create_data_loader(batch_size=4, shuffle=True, drop_last=True, num_workers=0)
Create data loader.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_size | int | The batch size. Defaults to 4. | 4 |
shuffle | bool | Whether to randomise instance order after each iteration. Defaults to True. | True |
drop_last | bool | Drop last batch if less than | True |
num_workers | int | Number of CPU processes to use for pre-processing. Defaults to 0. | 0 |
Returns:
Type | Description |
---|---|
DataLoader | A fully configured DataLoader. |
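
To make the effect of batch_size and drop_last concrete, the arithmetic below counts batches per pass over the data. It assumes these arguments are forwarded to PyTorch's DataLoader, where drop_last=True discards a final batch that is smaller than batch_size. The helper `batches_per_epoch` is hypothetical and only mirrors that arithmetic; it does not call the library.

```python
import math


def batches_per_epoch(num_instances, batch_size=4, drop_last=True):
    """Count batches in one pass over the data.

    Hypothetical helper mirroring PyTorch DataLoader semantics (an assumption
    about what create_data_loader forwards to).
    """
    if drop_last:
        # An incomplete final batch is discarded.
        return num_instances // batch_size
    # An incomplete final batch is kept as a smaller batch.
    return math.ceil(num_instances / batch_size)


with_drop = batches_per_epoch(10, batch_size=4, drop_last=True)
without_drop = batches_per_epoch(10, batch_size=4, drop_last=False)
print(with_drop, without_drop)
```

Dropping the incomplete batch keeps every batch the same shape, at the cost of leaving a few instances unused in each pass.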
Source code in src/llmz/datasets.py