Authors: Haris Jabbar
Published on: July 14, 2023
Impact Score: 8.07
arXiv code: 2307.07262
Summary
- What is new: The introduction of MorphPiece, a linguistically motivated tokenization scheme that outperforms traditional statistical tokenizers on various NLP tasks.
- Why this is important: Existing tokenizers are purely statistical and ignore linguistic structure such as morphology, which limits the efficiency of language models.
- What the research proposes: Developing MorphPiece, a new tokenization method based on morphological segmentation.
- Results: MorphGPT, trained with MorphPiece, showed performance superior or comparable to GPT-2 across a range of benchmarks, despite requiring fewer training iterations.
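To make the idea concrete, here is a minimal, hypothetical sketch of morphology-aware tokenization. It is not the paper's actual MorphPiece implementation: it uses a toy morpheme lexicon and greedy longest-match segmentation, with a character-level split standing in for the statistical (e.g. BPE) fallback that a real system would use for words the lexicon cannot segment.

```python
# Toy morpheme lexicon for illustration only; a real system would use a
# large morphological resource, not this hand-picked set.
MORPHEME_LEXICON = {"un", "happi", "ness", "token", "ize", "er", "s"}

def morph_segment(word: str) -> list[str]:
    """Greedy longest-match segmentation over the morpheme lexicon.

    Words that cannot be fully segmented fall back to characters here;
    in practice the fallback would be a statistical tokenizer like BPE.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in MORPHEME_LEXICON:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return list(word)  # fallback path for unsegmentable words
    return pieces

print(morph_segment("unhappiness"))  # ['un', 'happi', 'ness']
print(morph_segment("tokens"))       # ['token', 's']
```

Compared with a purely statistical tokenizer, this kind of segmentation keeps linguistically meaningful units (prefixes, stems, suffixes) intact, which is the intuition behind MorphPiece.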
Technical Details
Technological frameworks used: GPT-style causal language model
Models used: MorphGPT, OpenAI GPT-2
Data and baselines used: GLUE Benchmark, Massive Text Embedding Benchmark (MTEB), and the FLOTA tokenizer
Potential Impact
Potential beneficiaries include NLP-driven technology companies, language model developers, educational technology providers, and AI-driven analytics firms.