Operator Schemas

Operators are a collection of basic processes that assist in data modification, cleaning, filtering, deduplication, etc. We support a wide range of data sources and file formats, and allow for flexible extension to custom datasets.

This page gives a basic description of the operators (OPs) in Data-Juicer. For the specific parameters of each operator, refer to the API documentation. For examples of operator-wise usage and the effect of each operator on the built-in test data samples, refer to and run the unit tests.

Overview

The operators in Data-Juicer are categorized into 5 types.

Type	Number	Description
Formatter	7	Discovers, loads, and canonicalizes source data
Mapper	43	Edits and transforms samples
Filter	41	Filters out low-quality samples
Deduplicator	8	Detects and removes duplicate samples
Selector	2	Selects top samples based on ranking

All the specific operators are listed below, each labeled with several capability tags.

Domain Tags
- General: general purpose
- LaTeX: specific to LaTeX source files
- Code: specific to programming code
- Financial: closely related to the financial sector
- Image: specific to images or multimodal data
- Audio: specific to audio or multimodal data
- Video: specific to videos or multimodal data
- Multimodal: specific to multimodal data

Language Tags
- en: English
- zh: Chinese

Formatter

Operator	Domain	Lang	Description
remote_formatter	General	en, zh	Prepares datasets from remote (e.g., HuggingFace)
csv_formatter	General	en, zh	Prepares local `.csv` files
tsv_formatter	General	en, zh	Prepares local `.tsv` files
json_formatter	General	en, zh	Prepares local `.json`, `.jsonl`, `.jsonl.zst` files
parquet_formatter	General	en, zh	Prepares local `.parquet` files
text_formatter	General	en, zh	Prepares other local text files (complete list)
mixture_formatter	General	en, zh	Handles a mixture of all the supported local file types
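
As a rough illustration of what "handles a mixture of supported file types" can mean, the sketch below dispatches on the file suffix and turns every input into the same list-of-dicts shape. It is a minimal sketch using only the Python standard library; `load_samples`, `LOADERS`, and the helper names are hypothetical and are not Data-Juicer's API.

```python
import csv
import io
import json
from pathlib import PurePath


def load_jsonl(stream):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


def load_delimited(stream, delimiter):
    """Parse delimited text whose first row holds the column names."""
    return [dict(row) for row in csv.DictReader(stream, delimiter=delimiter)]


# Hypothetical suffix-to-loader table (assumption); the formatters listed
# above cover more file types than this sketch does.
LOADERS = {
    ".jsonl": load_jsonl,
    ".csv": lambda stream: load_delimited(stream, ","),
    ".tsv": lambda stream: load_delimited(stream, "\t"),
}


def load_samples(filename, stream):
    """Pick a loader from the file suffix and return a list of dict samples."""
    suffix = PurePath(filename).suffix
    if suffix not in LOADERS:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return LOADERS[suffix](stream)


# In-memory stand-in for a local .csv file.
samples = load_samples("demo.csv", io.StringIO("text,lang\nhello,en\n"))
```

Whatever the source format, downstream operators in this sketch only ever see dictionaries with named fields.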

Mapper

Operator	Domain	Lang	Description
audio_ffmpeg_wrapped_mapper	Audio	-	Simple wrapper to run an FFmpeg audio filter
chinese_convert_mapper	General	zh	Converts Chinese between Traditional Chinese, Simplified Chinese and Japanese Kanji (by opencc)
clean_copyright_mapper	Code	en, zh	Removes copyright notice at the beginning of code files (:warning: must contain the word copyright)
clean_email_mapper	General	en, zh	Removes email information
clean_html_mapper	General	en, zh	Removes HTML tags and returns plain text of all the nodes
clean_ip_mapper	General	en, zh	Removes IP addresses
clean_links_mapper	General, Code	en, zh	Removes links, such as those starting with http or ftp
expand_macro_mapper	LaTeX	en, zh	Expands macros usually defined at the top of TeX documents
fix_unicode_mapper	General	en, zh	Fixes broken Unicode text (by ftfy)
image_blur_mapper	Image	-	Blurs images
image_captioning_from_gpt4v_mapper	Multimodal	-	Generates samples whose texts are produced by gpt-4-vision based on the image
image_captioning_mapper	Multimodal	-	Generates samples whose captions are produced by another model (such as blip2) based on the figure within the original sample
image_diffusion_mapper	Multimodal	-	Generates and augments images with a Stable Diffusion model
image_face_blur_mapper	Image	-	Blurs faces detected in images
nlpaug_en_mapper	General	en	Simply augments texts in English based on the `nlpaug` library
nlpcda_zh_mapper	General	zh	Simply augments texts in Chinese based on the `nlpcda` library
punctuation_normalization_mapper	General	en, zh	Normalizes various Unicode punctuations to their ASCII equivalents
remove_bibliography_mapper	LaTeX	en, zh	Removes the bibliography of TeX documents
remove_comments_mapper	LaTeX	en, zh	Removes the comments of TeX documents
remove_header_mapper	LaTeX	en, zh	Removes the running headers of TeX documents, e.g., titles, chapter or section numbers/names
remove_long_words_mapper	General	en, zh	Removes words with length outside the specified range
remove_non_chinese_character_mapper	General	en, zh	Removes non-Chinese characters from text samples
remove_repeat_sentences_mapper	General	en, zh	Removes repeated sentences from text samples
remove_specific_chars_mapper	General	en, zh	Removes any user-specified characters or substrings
remove_table_text_mapper	General, Financial	en	Detects and removes possible table contents (:warning: relies on regular expression matching and thus fragile)
remove_words_with_incorrect_substrings_mapper	General	en, zh	Removes words containing specified substrings
replace_content_mapper	General	en, zh	Replaces all content in the text that matches a specific regular expression pattern with a designated replacement string
sentence_split_mapper	General	en	Splits and reorganizes sentences according to semantics
video_captioning_from_audio_mapper	Multimodal	-	Captions a video according to its audio streams, based on the Qwen-Audio model
video_captioning_from_frames_mapper	Multimodal	-	Generates samples whose captions are produced by an image-to-text model from sampled video frames. Captions from different frames are concatenated into a single string
video_captioning_from_summarizer_mapper	Multimodal	-	Generates video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...)
video_captioning_from_video_mapper	Multimodal	-	Generates samples whose captions are produced by another model (video-blip) from sampled video frames within the original sample
video_face_blur_mapper	Video	-	Blurs faces detected in videos
video_ffmpeg_wrapped_mapper	Video	-	Simple wrapper to run an FFmpeg video filter
video_remove_watermark_mapper	Video	-	Removes watermarks from the given regions of videos
video_resize_aspect_ratio_mapper	Video	-	Resizes the video aspect ratio to a specified range
video_resize_resolution_mapper	Video	-	Maps videos to ones within the given resolution range
video_split_by_duration_mapper	Multimodal	-	Splits videos by duration
video_split_by_key_frame_mapper	Multimodal	-	Splits videos by key frame
video_split_by_scene_mapper	Multimodal	-	Splits videos into scene clips
video_tagging_from_audio_mapper	Multimodal	-	Generates video tags from audio streams extracted from the video
video_tagging_from_frames_mapper	Multimodal	-	Generates video tags from frames extracted from the video
whitespace_normalization_mapper	General	en, zh	Normalizes various Unicode whitespaces to the normal ASCII space (U+0020)
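
Conceptually, a mapper takes one sample and returns an edited sample. The sketch below illustrates this with whitespace normalization. It is a minimal standalone example, not the code behind `whitespace_normalization_mapper`: the function name, the `text` field name, and the (deliberately partial) set of whitespace characters are all assumptions.

```python
import re

# Partial, illustrative set of Unicode whitespace characters (assumption):
# no-break space, U+2000..U+200A, narrow no-break space, medium
# mathematical space, and ideographic space.
_UNICODE_SPACES = re.compile("[\u00a0\u2000-\u200a\u202f\u205f\u3000]")


def normalize_whitespace(sample, text_key="text"):
    """Return a copy of the sample with those characters replaced by U+0020."""
    edited = dict(sample)
    edited[text_key] = _UNICODE_SPACES.sub(" ", sample[text_key])
    return edited


sample = {"text": "Data\u00a0Juicer\u3000operators"}
result = normalize_whitespace(sample)
```

Returning a new dictionary rather than editing in place keeps the original sample available for comparison, which is convenient when checking an operator's effect.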

Filter

Operator	Domain	Lang	Description
alphanumeric_filter	General	en, zh	Keeps samples with alphanumeric ratio within the specified range
audio_duration_filter	Audio	-	Keeps samples whose audio durations are within the specified range
audio_nmf_snr_filter	Audio	-	Keeps samples whose audio Signal-to-Noise Ratios (SNRs, computed based on Non-Negative Matrix Factorization, NMF) are within the specified range
audio_size_filter	Audio	-	Keeps samples whose audio sizes are within the specified range
average_line_length_filter	Code	en, zh	Keeps samples with average line length within the specified range
character_repetition_filter	General	en, zh	Keeps samples with char-level n-gram repetition ratio within the specified range
flagged_words_filter	General	en, zh	Keeps samples with flagged-word ratio below the specified threshold
image_aesthetics_filter	Image	-	Keeps samples containing images whose aesthetics scores are within the specified range
image_aspect_ratio_filter	Image	-	Keeps samples containing images with aspect ratios within the specified range
image_face_ratio_filter	Image	-	Keeps samples containing images with face area ratios within the specified range
image_nsfw_filter	Image	-	Keeps samples containing images with NSFW scores below the threshold
image_shape_filter	Image	-	Keeps samples containing images with widths and heights within the specified range
image_size_filter	Image	-	Keeps samples containing images whose size in bytes are within the specified range
image_text_matching_filter	Multimodal	-	Keeps samples with image-text classification matching score within the specified range based on a BLIP model
image_text_similarity_filter	Multimodal	-	Keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model
image_watermark_filter	Image	-	Keeps samples containing images with predicted watermark probabilities below the threshold
language_id_score_filter	General	en, zh	Keeps samples of the specified language, judged by a predicted confidence score
maximum_line_length_filter	Code	en, zh	Keeps samples with maximum line length within the specified range
perplexity_filter	General	en, zh	Keeps samples with perplexity score below the specified threshold
phrase_grounding_recall_filter	Multimodal	-	Keeps samples whose locating recalls of phrases extracted from text in the images are within a specified range
special_characters_filter	General	en, zh	Keeps samples with special-char ratio within the specified range
specified_field_filter	General	en, zh	Filters samples based on a specified field, whose value must lie in the specified targets
specified_numeric_field_filter	General	en, zh	Filters samples based on a specified field, whose value must lie in the specified range (for numeric types)
stopwords_filter	General	en, zh	Keeps samples with stopword ratio above the specified threshold
suffix_filter	General	en, zh	Keeps samples with specified suffixes
text_action_filter	General	en, zh	Keeps samples containing action verbs in their texts
text_entity_dependency_filter	General	en, zh	Keeps samples containing entity nouns related to other tokens in the dependency tree of the texts
text_length_filter	General	en, zh	Keeps samples with total text length within the specified range
token_num_filter	General	en, zh	Keeps samples with token count within the specified range
video_aesthetics_filter	Video	-	Keeps samples whose specified frames have aesthetics scores within the specified range
video_aspect_ratio_filter	Video	-	Keeps samples containing videos with aspect ratios within the specified range
video_duration_filter	Video	-	Keeps samples whose video durations are within the specified range
video_frames_text_similarity_filter	Multimodal	-	Keeps samples whose similarities between sampled video frame images and text are within the specified range
video_motion_score_filter	Video	-	Keeps samples with video motion scores within the specified range
video_nsfw_filter	Video	-	Keeps samples containing videos with NSFW scores below the threshold
video_ocr_area_ratio_filter	Video	-	Keeps samples whose detected text area ratios for specified frames in the video are within the specified range
video_resolution_filter	Video	-	Keeps samples containing videos with horizontal and vertical resolutions within the specified range
video_watermark_filter	Video	-	Keeps samples containing videos with predicted watermark probabilities below the threshold
video_tagging_from_frames_filter	Video	-	Keeps samples containing videos with the given tags
word_num_filter	General	en, zh	Keeps samples with word count within the specified range
word_repetition_filter	General	en, zh	Keeps samples with word-level n-gram repetition ratio within the specified range
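
Most filters in the table follow the same pattern: compute a statistic for each sample, then keep the sample only if the statistic falls within a range. The sketch below shows that pattern for an alphanumeric ratio. The function names, the `text` field, and the default thresholds are hypothetical and chosen only for illustration; they are not Data-Juicer's defaults or API.

```python
def alnum_ratio(text):
    """Fraction of characters that are alphanumeric (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def keep_sample(sample, min_ratio=0.25, max_ratio=1.0, text_key="text"):
    """Return True if the sample's statistic lies within [min_ratio, max_ratio]."""
    return min_ratio <= alnum_ratio(sample[text_key]) <= max_ratio


samples = [
    {"text": "hello world"},   # mostly alphanumeric
    {"text": "#$%^ !!! ???"},  # no alphanumeric characters
]
kept = [s for s in samples if keep_sample(s)]
```

Separating the statistic from the keep/drop decision makes it possible to inspect the distribution of values before settling on thresholds.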

Deduplicator

Operator	Domain	Lang	Description
document_deduplicator	General	en, zh	Deduplicates samples at document-level by comparing MD5 hash
document_minhash_deduplicator	General	en, zh	Deduplicates samples at document-level using MinHashLSH
document_simhash_deduplicator	General	en, zh	Deduplicates samples at document-level using SimHash
image_deduplicator	Image	-	Deduplicates samples at document-level using exact matching of images between documents
video_deduplicator	Video	-	Deduplicates samples at document-level using exact matching of videos between documents
ray_document_deduplicator	General	en, zh	Deduplicates samples at document-level by comparing MD5 hash on ray
ray_image_deduplicator	Image	-	Deduplicates samples at document-level using exact matching of images between documents on ray
ray_video_deduplicator	Video	-	Deduplicates samples at document-level using exact matching of videos between documents on ray
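
Exact document-level deduplication by hash comparison can be sketched as follows: hash each document, and keep a sample only the first time its digest appears. This is a minimal sketch using Python's standard `hashlib`; the function name and the `text` field are assumptions, and any text preprocessing the real operator may apply before hashing is omitted.

```python
import hashlib


def dedup_by_md5(samples, text_key="text"):
    """Keep the first sample for each distinct MD5 digest of the text."""
    seen = set()
    unique = []
    for sample in samples:
        digest = hashlib.md5(sample[text_key].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


samples = [
    {"id": 1, "text": "same document"},
    {"id": 2, "text": "another document"},
    {"id": 3, "text": "same document"},
]
unique = dedup_by_md5(samples)
```

Exact hashing only catches byte-identical texts; near-duplicates are the job of approximate methods such as the MinHashLSH and SimHash operators listed above.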

Selector

Operator	Domain	Lang	Description
frequency_specified_field_selector	General	en, zh	Selects top samples by comparing the frequency of the specified field
topk_specified_field_selector	General	en, zh	Selects top samples by comparing the values of the specified field
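
Selecting top samples by the values of a field amounts to a ranked cut over the whole dataset. The sketch below shows one way to do this with the standard `heapq` module. The function name, its parameters, and the `score` field are hypothetical and do not reflect the actual operator's signature.

```python
import heapq


def select_topk(samples, field, k, largest=True):
    """Return the k samples with the largest (or smallest) value of `field`."""
    pick = heapq.nlargest if largest else heapq.nsmallest
    return pick(k, samples, key=lambda sample: sample[field])


samples = [
    {"text": "a", "score": 0.2},
    {"text": "b", "score": 0.9},
    {"text": "c", "score": 0.5},
]
top2 = select_topk(samples, field="score", k=2)
```

Unlike a filter, which judges each sample on its own, a selector needs to see all samples before it can decide which ones to keep.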

Contributing

We welcome contributions of new operators. Please refer to the How-to Guide for Developers.
