Databricks wheels in Dolly chatbot

Published Thu 13 Apr 2023 // 13:45 UTC

Lakehouse shipper Databricks has updated its open-source Dolly ChatGPT-like large language model to make its AI facilities available for business applications without needing massive GPU resources or costly API use.

ChatGPT is a chatbot launched by OpenAI. It is based on a machine learning large language model (LLM) and generates sensible-seeming text in response to user requests. It has generated an extraordinary wave of interest despite the fact that it can give wrong answers and make up sources for its statements. ChatGPT was trained using a large number of GPUs and a very large model; the GPT-3 (Generative Pre-trained Transformer 3) model on which it is based has 175 billion parameters.

Databricks sidestepped these limitations to create its Dolly chatbot, a 12 billion parameter language model based on the EleutherAI Pythia model family. Its creators write on a company blog: “This means that any organization can create, own, and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties.”
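To put the parameter counts in perspective, here is a rough back-of-the-envelope sketch of how much memory the model weights alone would occupy. It assumes 16-bit weights (2 bytes per parameter), which is an assumption of this example and not a figure from Databricks, and it ignores activations, optimizer state, and any other runtime overhead. The function name `weight_footprint_gb` is made up for illustration.

```python
# Rough estimate of weights-only memory, assuming 16-bit (2-byte) parameters.
# This is an illustrative assumption, not a published Databricks figure.
BYTES_PER_PARAM_FP16 = 2


def weight_footprint_gb(num_params: float,
                        bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    models = [
        ("Dolly (12 billion parameters)", 12e9),
        ("GPT-3 (175 billion parameters)", 175e9),
    ]
    for name, params in models:
        size = weight_footprint_gb(params)
        print(f"{name}: roughly {size:.0f} GB of weights at 16-bit precision")
```

Real memory needs are higher than these weights-only numbers, particularly during training, so treat them as a lower bound for comparing model sizes.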

The Databricks team did this in two stages. In late March they released Dolly v1.0, an LLM trained using a 6 billion parameter model from EleutherAI. This was modified “ever so slightly to elicit instruction following capabilities such as brainstorming and text generation not present in the original model, using data from Alpaca.”

They say Dolly v1.0 shows how “anyone can take a dated off-the-shelf open source large language model (LLM) and give it magical ChatGPT-like instruction following ability by training it in 30 minutes on one machine, using high-quality training data.”

They open sourced the code for Dolly and showed how it can be re-created on Databricks, saying: “We believe models like Dolly will help democratize LLMs, transforming them from something very few companies can afford into a commodity every company can own and customize to improve their products.”

Now Dolly 2.0 has a larger model of 12 billion parameters – “based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality human generated instruction following dataset, crowdsourced among Databricks employees.” Databricks is “open-sourcing the entirety of Dolly 2.0, including the training code, the dataset, and the model weights, all suitable for commercial use.”

The dataset, databricks-dolly-15k, contains 15,000 prompt/response pairs designed for LLM instruction tuning, “authored by more than 5,000 Databricks employees during March and April of 2023.”
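As a minimal sketch of what a prompt/response pair looks like when prepared for instruction tuning, the snippet below turns one record into a single training string. The field names (`instruction`, `context`, `response`, `category`) follow the published databricks-dolly-15k records; the `### Instruction:` style template, the `InstructionRecord` class, and the `format_for_tuning` helper are assumptions made for illustration and are not necessarily what Databricks' training code uses. The sample record is invented, not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass
class InstructionRecord:
    """One prompt/response pair; field names mirror databricks-dolly-15k."""
    instruction: str
    response: str
    context: str = ""   # optional reference text supplied with the prompt
    category: str = ""  # task type label, e.g. "brainstorming"


def format_for_tuning(record: InstructionRecord) -> str:
    """Join a record into one training string (hypothetical template)."""
    parts = ["### Instruction:", record.instruction.strip()]
    if record.context.strip():
        # Only include the context section when the record has one.
        parts += ["", "### Context:", record.context.strip()]
    parts += ["", "### Response:", record.response.strip()]
    return "\n".join(parts)


if __name__ == "__main__":
    # Invented example record, for illustration only.
    example = InstructionRecord(
        instruction="Suggest three names for a data engineering newsletter.",
        response="Pipeline Weekly, The Lakehouse Letter, Batch and Stream.",
        category="brainstorming",
    )
    print(format_for_tuning(example))
```

Keeping the optional context in its own section lets the same template serve both open-ended prompts and prompts that ask questions about a supplied passage.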

The Alpaca dataset used for Dolly v1.0 was generated with output from the OpenAI API, whose terms of service prevent users from creating a model that competes with OpenAI. Databricks crowdsourced its own 15,000 prompt/response pair dataset to get around this additional limitation. Its dataset has been “generated by professionals, is high quality, and contains long answers to most tasks.” In contrast: “Many of the instruction tuning datasets released in recent months contain synthesized data, which often contains hallucinations and factual errors.”

To download the Dolly 2.0 model weights, visit the Databricks Hugging Face page. To download the databricks-dolly-15k dataset, visit the Dolly repo on databricks-labs. You can also join a Databricks webinar to discover how to harness LLMs for your own organization.

+ Comment

A capability of Dolly-like LLMs is that they can write code, specifically SQL code. That could lead to non-SQL specialists being able to set up and run queries on the Databricks lakehouse without knowing any SQL at all.

This leads on to two thoughts: one, SQL devs could use it to become more productive and, two, you don’t need so many SQL devs. Dolly could reduce the need for SQL programmers in the Databricks world.

Extend that thought to Snowflake and all the other data warehouse environments and SQL skills could become a lot less valuable in future.
