GPT-4o’s Chinese token-training data is polluted by spam and porn websites

The new tokenizer has 200,000 tokens in total, and about 25% of them are in languages other than English, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the tokens in each language; besides English, the top languages are Russian, Arabic, and Vietnamese. “So the tokenizer’s main impact, in my […]
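
To make the idea of counting tokens per language concrete, here is a minimal sketch in Python. It is not Das's actual method, which the excerpt does not describe. It assumes a small hypothetical vocabulary (`sample_vocab`) and uses a crude filter based on Unicode character names, so it groups tokens by writing system rather than by language. The helper names `dominant_script` and `script_counts` are made up for this illustration.

```python
import unicodedata
from collections import Counter


def dominant_script(token: str) -> str:
    """Return a coarse script label (e.g. LATIN, CYRILLIC, ARABIC, CJK) for a token.

    Assumption: the first word of a letter's Unicode name is a usable
    stand-in for its script. This is a rough heuristic, not a language detector.
    """
    counts = Counter()
    for ch in token:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    if not counts:
        return "OTHER"  # token has no letters at all
    return counts.most_common(1)[0][0]


def script_counts(vocab: list[str]) -> Counter:
    """Count how many tokens in the vocabulary fall under each script label."""
    return Counter(dominant_script(tok) for tok in vocab)


# Hypothetical toy vocabulary, standing in for a real 200,000-token list.
sample_vocab = ["hello", " world", "привет", "مرحبا", "中文", "123", "việt"]

counts = script_counts(sample_vocab)
for script, n in counts.most_common():
    print(f"{script}: {n} tokens ({n / len(sample_vocab):.0%})")
```

A script-based filter like this cannot tell Vietnamese from English, since both use the Latin alphabet, so a real per-language count would need a proper language-identification step on top.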