AI: Benchmarking automated assistants

Deep Fake Co-Workers, Hugging Face Assistants, Qwen

Friends, imagine a world where open source assistants were judged on a benchmark for automated agents, were multilingual, and had precise understanding of your personal documents in order to offer highly informed recommendations. That's the world these stories are pointing toward. But watch out for meetings entirely attended by deepfake coworkers! Wow.

As always, these are the stories the AI community is most engaged with today. I hope you find them useful and interesting.

-Marshall Kirkpatrick, Editor

First impacted: Finance Teams, Fraud Specialists
Time to impact: Short

A finance worker at a multinational firm in Hong Kong was tricked into transferring $25 million to fraudsters who used AI deepfake technology to pose as the company's CFO and as multiple other employees on a video call. The scam was exposed only when the employee cross-checked the transaction with the corporation's head office. Hong Kong police have already arrested six suspects in connection with the use of AI deepfakes. [Finance worker pays out $25 million after video call with deepfake ‘chief financial officer’] Explore more of our coverage of: AI Deepfake Fraud, Corporate Cybersecurity, Facial Recognition Systems.

First impacted: AI developers, AI researchers
Time to impact: Short

Clement Delangue, CEO of Hugging Face, announced that more than 4,000 new AI assistants were added to the platform over the weekend. The feature appears to be a viable competitor to OpenAI's custom GPTs. [via @ClementDelangue] Explore more of our coverage of: Hugging Face, AI Assistants, Technology Engagement.

First impacted: AI developers, Mandarin Speakers
Time to impact: Short

Qwen (a project of Alibaba) has launched a new series of language models called Qwen1.5, which the company says includes base and chat models in six sizes. In their release, the team notes that their Qwen1.5-72B model outperforms Llama2-70B in language comprehension, logical reasoning, and math. It also boasts improved multilingual capabilities and a context length of up to 32K tokens, and it ranks highest on C-Eval, which is "a comprehensive Chinese evaluation suite for foundation models". [Introducing Qwen1.5] Explore more of our coverage of: Qwen Language Models, AI Comprehension Improvement, Hugging Face Integration.

First impacted: Software Developers, Data Scientists, AI developers
Time to impact: Short

Jina AI offers embedding models designed to improve search and RAG (retrieval-augmented generation) systems. The company's newest model supports search across English and 30 programming languages, accepts inputs of up to 8K tokens, and, according to the company, sets a new state of the art in performance. [Embedding API] Explore more of our coverage of: Jina AI, Code Search, Cloud Services.
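For readers curious how embedding-based search works in general, here is a minimal sketch. It assumes documents and a query have already been turned into vectors by some embedding model; the three-dimensional vectors, file names, and function names below are invented for illustration and are not output from, or part of, Jina AI's API.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction, so treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Sort (doc_id, score) pairs from most to least similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real models emit hundreds of dimensions.
TOY_DOCS = {
    "binary_search.py": [0.9, 0.1, 0.0],
    "http_client.go": [0.1, 0.8, 0.3],
    "readme.md": [0.2, 0.2, 0.9],
}
# Hypothetical embedding of a query such as "find an item in a sorted list".
TOY_QUERY = [0.8, 0.2, 0.1]

if __name__ == "__main__":
    for doc_id, score in rank_documents(TOY_QUERY, TOY_DOCS):
        print(f"{doc_id}: {score:.3f}")
```

In a RAG system, the top-ranked documents from a step like this would be passed to a language model as context for its answer.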

First impacted: AI Developers, Travel Industry Professionals
Time to impact: Medium

Despite the advancements in large language models, AI agents still struggle with multi-constraint tasks such as comprehensive travel planning. A new paper finds that even the most sophisticated models, such as GPT-4, achieve a success rate of only 0.6%, highlighting the gap between current AI capabilities and human-level planning and reasoning and leaving ample room for future research. The paper also introduces "TravelPlanner," a new benchmark designed to evaluate language agents' ability to use tools, carry out complex planning, and apply common sense within multiple real-world constraints, as travel planning requires. [Paper page - TravelPlanner: A Benchmark for Real-World Planning with Language Agents] Explore more of our coverage of: Language Agents, GPT-4 Models, AI Task Management.

That’s it! More AI news tomorrow!