audio_ffmpeg_wrapped_mapper |
Audio |
- |
Simple wrapper to run an FFmpeg audio filter |
chinese_convert_mapper |
General |
zh |
Converts Chinese between Traditional Chinese, Simplified Chinese and Japanese Kanji (by opencc) |
clean_copyright_mapper |
Code |
en, zh |
Removes copyright notice at the beginning of code files (:warning: must contain the word copyright) |
clean_email_mapper |
General |
en, zh |
Removes email information |
clean_html_mapper |
General |
en, zh |
Removes HTML tags and returns plain text of all the nodes |
clean_ip_mapper |
General |
en, zh |
Removes IP addresses |
clean_links_mapper |
General, Code |
en, zh |
Removes links, such as those starting with http or ftp |
expand_macro_mapper |
LaTeX |
en, zh |
Expands macros usually defined at the top of TeX documents |
fix_unicode_mapper |
General |
en, zh |
Fixes broken Unicode text (by ftfy) |
image_blur_mapper |
Image |
- |
Blurs images |
image_captioning_from_gpt4v_mapper |
Multimodal |
- |
Generates samples whose texts are produced by gpt-4-vision from the image |
image_captioning_mapper |
Multimodal |
- |
Generates samples whose captions are produced by another model (such as blip2) from the figure within the original sample |
image_diffusion_mapper |
Multimodal |
- |
Generates and augments images with a Stable Diffusion model |
image_face_blur_mapper |
Image |
- |
Blurs faces detected in images |
nlpaug_en_mapper |
General |
en |
Simply augments texts in English based on the nlpaug library |
nlpcda_zh_mapper |
General |
zh |
Simply augments texts in Chinese based on the nlpcda library |
punctuation_normalization_mapper |
General |
en, zh |
Normalizes various Unicode punctuation marks to their ASCII equivalents |
remove_bibliography_mapper |
LaTeX |
en, zh |
Removes the bibliography of TeX documents |
remove_comments_mapper |
LaTeX |
en, zh |
Removes the comments of TeX documents |
remove_header_mapper |
LaTeX |
en, zh |
Removes the running headers of TeX documents, e.g., titles, chapter or section numbers/names |
remove_long_words_mapper |
General |
en, zh |
Removes words with length outside the specified range |
remove_non_chinese_character_mapper |
General |
en, zh |
Removes non-Chinese characters from text samples |
remove_repeat_sentences_mapper |
General |
en, zh |
Removes repeated sentences from text samples |
remove_specific_chars_mapper |
General |
en, zh |
Removes any user-specified characters or substrings |
remove_table_text_mapper |
General, Financial |
en |
Detects and removes possible table contents (:warning: relies on regular expression matching and thus fragile) |
remove_words_with_incorrect_substrings_mapper |
General |
en, zh |
Removes words containing specified substrings |
replace_content_mapper |
General |
en, zh |
Replaces all content in the text that matches a specified regular expression pattern with a designated replacement string |
sentence_split_mapper |
General |
en |
Splits and reorganizes sentences according to semantics |
video_captioning_from_audio_mapper |
Multimodal |
- |
Captions a video from its audio streams using the Qwen-Audio model |
video_captioning_from_frames_mapper |
Multimodal |
- |
Generates samples whose captions are produced by an image-to-text model from sampled video frames. Captions from different frames are concatenated into a single string |
video_captioning_from_summarizer_mapper |
Multimodal |
- |
Generates video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...) |
video_captioning_from_video_mapper |
Multimodal |
- |
Generates samples whose captions are produced by another model (video-blip) from sampled video frames within the original sample |
video_face_blur_mapper |
Video |
- |
Blurs faces detected in videos |
video_ffmpeg_wrapped_mapper |
Video |
- |
Simple wrapper to run an FFmpeg video filter |
video_remove_watermark_mapper |
Video |
- |
Removes watermarks from the given regions of videos |
video_resize_aspect_ratio_mapper |
Video |
- |
Resizes the video aspect ratio to a specified range |
video_resize_resolution_mapper |
Video |
- |
Resizes videos to fit within a given resolution range |
video_split_by_duration_mapper |
Multimodal |
- |
Splits videos by duration |
video_split_by_key_frame_mapper |
Multimodal |
- |
Splits videos by key frame |
video_split_by_scene_mapper |
Multimodal |
- |
Splits videos into scene clips |
video_tagging_from_audio_mapper |
Multimodal |
- |
Generates video tags from audio streams extracted from the video |
video_tagging_from_frames_mapper |
Multimodal |
- |
Generates video tags from frames extracted from the video |
whitespace_normalization_mapper |
General |
en, zh |
Normalizes various Unicode whitespace characters to the normal ASCII space (U+0020) |
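
To make the idea of a mapper concrete, the following is a minimal sketch of two text mappers in the spirit of `whitespace_normalization_mapper` and `clean_email_mapper`. It is not the library's implementation: the function names, the `text` sample key, the list of whitespace characters, and the email pattern are all assumptions chosen for illustration.

```python
import re

# Illustrative subset of Unicode whitespace characters (an assumption;
# a real operator may cover a different set).
UNICODE_SPACES = {
    "\u00a0",  # no-break space
    "\u2002",  # en space
    "\u2003",  # em space
    "\u2009",  # thin space
    "\u200a",  # hair space
    "\u202f",  # narrow no-break space
    "\u3000",  # ideographic space
}

# Deliberately simple email pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize_whitespace(sample):
    """Return a copy of the sample with Unicode spaces mapped to U+0020."""
    text = "".join(" " if ch in UNICODE_SPACES else ch for ch in sample["text"])
    return {**sample, "text": text}


def clean_email(sample):
    """Return a copy of the sample with email-like substrings removed."""
    return {**sample, "text": EMAIL_PATTERN.sub("", sample["text"])}


def apply_mappers(sample, mappers):
    """Apply each mapper in order; every mapper edits one sample in place of the last."""
    for mapper in mappers:
        sample = mapper(sample)
    return sample


if __name__ == "__main__":
    raw = {"text": "contact\u00a0me\u3000at bob@example.com now"}
    print(apply_mappers(raw, [normalize_whitespace, clean_email])["text"])
```

Each function takes one sample and returns an edited copy, so mappers can be chained in any order without modifying the input sample.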