Large language models

Large language models (LLMs) are software programs regarded as a form of "artificial intelligence" (AI); more specifically, LLMs are a kind of generative AI. This wiki area is for learning, teaching, and research related to LLMs.

Discourse and ideas

Here are discussions and ideas related to large language models. Once significantly developed or refined, some of these could get their own sub-page or become a standalone learning resource.

Learning wikis as training data

Unless laws change, Creative Commons content appears to be valid training data for LLMs. As LLMs advance, more and more data can be used to train increasingly complex models. Learning wikis devoted to learning, teaching, and research, which allow for original research and original content creation, can potentially be extremely valuable (in terms of educational value) as training data for large language models. Perhaps in the future (if this does not already exist), large language models will be able to be trained continuously on new data and information, retaining and learning from it. Perhaps in the future, an open source large language model could be trained only on Creative Commons data, and therefore all generated content would also be licensed under Creative Commons.

Discussion questions

Here are some learning- and teaching-oriented discussion questions related to large language models. Humans can use language and mental effort to explore these ideas collaboratively, or some of these could be used as prompts to see how an LLM might respond.

Would a large language model that is only trained on Creative Commons licensed data only be capable of generating responses to prompts that can also be rightly and correctly licensed under a Creative Commons license?
How might large language models affect learning and research? Will LLMs eventually be seen the way calculators are seen in math and the sciences now, but for everything (all subjects and topics, including math, physics, ethics, biology, psychology, chemistry, engineering, and art)?
What ethical considerations related to large language models should be taken into account?
What are some pros and cons of open source large language models? Will open source LLMs likely become more advanced than proprietary LLMs eventually? What do you think?
How can large language models help to advance and accelerate technological automation in ways that will benefit all of humanity?
In what ways can large language models help programmers to code?
Can music be thought of as a language within the realm of large language models?
What is differentiable computing and how does differentiable computing relate to large language models?
How can teachers utilize large language models to help accelerate student learning and to help students learn more efficiently?

Educational prompt ideas

These are original prompt ideas for learning about large language models and for exploring the use of LLMs in learning, teaching, and research. Input these into your preferred LLM (without quotes) to see what results are generated. LLMs might produce interesting or useful answers in response to these prompts. Some of these prompts may also be interesting or useful for discussions among humans.

"Describe to me how large language models can be utilized for learning, teaching, and research. Do this in a two-paragraph mini essay of about 200 words. Explain it to me like I am a freshman in community college."
"Give me a list of 12 ways that large language models can be utilized for learning, teaching, and research."
"How can LLMs be utilized to accelerate the pace of research and scientific discovery?"
"What ethical considerations related to large language models should be taken into account?"
"What are some pros and cons of open source large language models? Will open source LLMs likely become more advanced than proprietary LLMs eventually? What do you think?"
"What are some project ideas to integrate large language models in with humanoid robots, and/or other sorts of robots? Please give me 15 project ideas that can be relatively simple or extremely complex."
"Please search the Internet if possible. In what ways have university professors and academic researchers been using large language models in the last year? Please respond in list form."
"In what ways can large language models help programmers to code? Please provide me 8 examples and respond in list form."
"Can music be thought of as a language within the realm of large language models?"
"What is differentiable computing and how does differentiable computing relate to large language models?"
"How can one fine tune an open source large language model?"
"What are some popular state-of-the-art open source large language models? Please search the internet as helpful and respond to me in list form."
"Please give me a list of important terminology that I should be aware of when working with and training open source large language models. Please be comprehensive. Please respond in list form. And please search the internet as helpful."
"What sort of hardware should I utilize to run the most competent open source large language models that I want to utilize for learning, teaching, and research? Please search the internet as helpful."
"How can teachers utilize large language models to help accelerate student learning and to help students learn more efficiently? Please respond in list form."
"How can researchers utilize large language models to create theories and hypotheses, and to formulate potential research studies? Please respond in short paragraphs, but in list form."

Readings and learning media

External

Introduction to Hugging Face NLP

Introductory course about natural language processing (NLP) using libraries from the Hugging Face ecosystem – Transformers, Datasets, Tokenizers, and Accelerate.

NLP Course

transformer models:

NLP, What, How, Encoder, Decoder, Sequence-to-sequence, Bias and limitations

using transformers:

pipeline, models, tokenizer, batching, decoding, padding, attention mask
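To make the "padding" and "attention mask" topics above more concrete, here is a minimal sketch in plain Python. It does not use the Transformers library; the function name `pad_batch`, the padding id `0`, and the example token ids are all assumptions chosen for illustration. Real tokenizers define their own padding token and may pad on the left instead of the right.

```python
from typing import Dict, List

PAD_ID = 0  # assumed padding token id; real tokenizers define their own


def pad_batch(sequences: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Right-pad token-id sequences to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token-id sequences of different lengths
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding makes every sequence in a batch the same length so the batch can be stored as one rectangular tensor, and the mask tells the model which positions carry no information.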

fine-tuning a pretrained model:

Preprocessing: tokenization, padding, Fine-tuning, Full training, map, dataset, dynamic padding, batch, collate function, train, predict, evaluate, accelerate

sharing models and tokenizers:

hub, model card

the datasets library:

batch, DataFrame, validation, splitting, embedding, FAISS
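The "embedding" and "FAISS" topics above are about finding the stored vectors most similar to a query vector. The sketch below shows the underlying idea as a brute-force cosine-similarity search in plain Python. It is not the FAISS API; the function names and the three-dimensional toy vectors are made up for illustration, and a library such as FAISS performs the same kind of search with optimized index structures.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query: Sequence[float], index: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return the k (position, score) pairs with the highest similarity to the query."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(index)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings" (hypothetical values, one per document)
doc_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
hits = search([1.0, 0.1, 0.0], doc_embeddings, k=2)
print(hits)  # document 0 ranks first, document 2 second
```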

the tokenizers library:

training tokenizer, grouping, QnA, normalizers, pre-tokenization, models, trainers: Byte-Pair Encoding (BPE), WordPiece, Unigram, post processors, decoders
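As an illustration of the Byte-Pair Encoding (BPE) training mentioned above, here is a minimal sketch in plain Python. It is a simplified version of the classic merge-learning loop, not the Tokenizers library implementation: it skips normalization, pre-tokenization, and byte-level handling, and the function names and the toy word counts are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Vocab = Dict[Tuple[str, ...], int]


def count_pairs(vocab: Vocab) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair: Tuple[str, str], vocab: Vocab) -> Vocab:
    """Replace every occurrence of the pair with a single merged symbol."""
    merged: Vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    vocab: Vocab = {tuple(word): count for word, count in word_counts.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


# Toy corpus statistics (hypothetical word counts)
merges = train_bpe({"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Each learned merge becomes a new vocabulary entry, so frequent character sequences end up as single tokens while rare words are still representable as smaller pieces.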

main NLP tasks:

token classification, metrics, perplexity, translation, summarization, training CLM, QnA
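Perplexity, listed among the topics above, is commonly defined as the exponential of the average negative log-likelihood that a model assigns to the tokens of a text. The sketch below computes it from a list of per-token probabilities; the function name and the probability values are made up for illustration, and in practice the probabilities would come from a language model.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood) over the probabilities of the observed tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that spreads probability evenly over 4 choices has perplexity 4
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is more confident about the observed tokens scores lower (better)
confident = perplexity([0.9, 0.8, 0.95, 0.85])

print(uniform, confident)
```

A lower perplexity means the model found the text less "surprising"; a value of 1 would mean every observed token was predicted with probability 1.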

how to ask for help

building and sharing demos

Hugging Face docs

https://huggingface.co/docs

Core libraries

Transformers – State-of-the-art ML for PyTorch, TensorFlow, and JAX.

pipeline – simple interface for inference with models.

Auto classes: AutoConfig, AutoModel, and AutoTokenizer. The from_pretrained method.

Trainer and TrainingArguments

Datasets – Access and share datasets for computer vision, audio, and NLP tasks.

Accelerate – Easily train and use PyTorch models with multi-GPU, TPU, mixed-precision.

Tokenizers – Fast tokenizers, optimized for both research and production.

Components: Normalizers, Pre-tokenizers, Models, Post-Processors, Decoders ...

Hub – Host Git-based models, datasets and Spaces on the Hugging Face Hub.

Diffusers – State-of-the-art diffusion models for image and audio generation in PyTorch.

Hub Python Library – Client library for the HF Hub: manage repositories from your Python runtime.

Huggingface.js – A collection of JS libraries to interact with Hugging Face, with TS types included.

Transformers.js – Community library to run pretrained models from Transformers in your browser.

Inference API (serverless) – Experiment with over 200k models easily using the serverless tier of Inference Endpoints.

Inference Endpoints (dedicated) – Easily deploy models to production on dedicated, fully managed infrastructure.

PEFT – Parameter efficient fine-tuning methods for large models

Soft prompting, LoRA, IA3
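To give a sense of why LoRA counts as "parameter efficient", the following sketch works through the arithmetic. It assumes the usual LoRA formulation, in which a frozen weight matrix of shape d_out × d_in is adapted by training two small matrices of shapes d_out × r and r × d_in. The function name and the layer sizes are hypothetical, and this is not the PEFT library API.

```python
from typing import Tuple


def lora_param_counts(d_in: int, d_out: int, rank: int) -> Tuple[int, int]:
    """Return (parameters in the full weight matrix, trainable LoRA parameters)."""
    full = d_in * d_out              # frozen weight matrix W
    lora = rank * (d_in + d_out)     # low-rank factors B (d_out x r) and A (r x d_in)
    return full, lora


# Hypothetical 4096 x 4096 projection layer adapted with rank 8
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(full)         # 16777216
print(lora)         # 65536
print(lora / full)  # 0.00390625, i.e. about 0.4% of the layer's parameters
```

Because only the two small factors receive gradients, the optimizer state and the saved adapter are correspondingly small, which is what makes fine-tuning large models feasible on modest hardware.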

Optimum – Fast training and inference of HF Transformers with easy to use hardware optimization tools.

AWS Trainium & Inferentia – Train and Deploy Transformers & Diffusers with AWS Trainium and AWS Inferentia via Optimum

Evaluate – Evaluate and report model performance more easily and in a more standardized way.

types: metrics, comparisons, measurements

Tasks

extraction, question answering, classification, generation ...

Dataset viewer – API to access the contents, metadata and basic statistics of all Hugging Face Hub datasets.

Splits and subsets, dataset-viewer

TRL – Transformer Reinforcement Learning

reward modeling, fine-tuning, optimizations
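Reward modeling, mentioned above, usually means training a model to score a preferred ("chosen") response higher than a less preferred ("rejected") one. A common objective for this is the pairwise loss -log(sigmoid(r_chosen - r_rejected)). The sketch below computes that loss for single score pairs in plain Python; the function name and the scores are assumptions for illustration, and this is not the TRL API.

```python
import math


def pairwise_reward_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    margin = chosen_reward - rejected_reward
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite log(1 + exp(-m)) as -m + log(1 + exp(m))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the chosen response is scored well above the rejected one
print(pairwise_reward_loss(2.0, 0.0))

# Equal scores give log(2): the model expresses no preference
print(pairwise_reward_loss(0.0, 0.0))

# The loss is large when the ranking is the wrong way round
print(pairwise_reward_loss(0.0, 2.0))
```

Minimizing this loss over many human-labeled preference pairs pushes the reward model to reproduce the human ranking; the trained reward model can then guide further fine-tuning.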

Amazon SageMaker – Train and Deploy Transformer models with Amazon SageMaker and Hugging Face Deep Learning Containers (DLC).

timm – PyTorch Image Models.

State-of-the-art computer vision models, layers, optimizers, training/evaluation, and utilities.

Safetensors – Simple, safe way to store and distribute neural network weights.

Text Generation Inference (TGI) – Toolkit to serve Large Language Models.

AutoTrain – AutoTrain API and UI.

autotrain

Text Embeddings Inference – Toolkit to serve Text Embedding Models.

Competitions – Create your own competitions on Hugging Face.

Bitsandbytes – Toolkit to optimize and quantize models.
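As a rough illustration of what "quantizing" a model means, the sketch below applies simple absolute-maximum (absmax) 8-bit quantization to a short list of weights in plain Python. This is only the basic idea, not the bitsandbytes API, and real toolkits use more elaborate schemes; the function names and the weight values are made up for illustration.

```python
from typing import List, Tuple


def quantize_absmax(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0  # avoid dividing by zero for all-zero input
    return [round(v / scale) for v in values], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights; the largest magnitude (1.27) maps to -127
weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print(q)         # [50, -127, 2, 100]
print(restored)  # close to the original weights
```

Storing each weight as an 8-bit integer plus one shared scale takes roughly a quarter of the memory of 32-bit floats, at the cost of a rounding error of at most half the scale per weight.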

Google TPUs – Deploy models on Google TPUs via Optimum.

Chat UI – Open source chat frontend, powers the HuggingChat app.

Leaderboards – Create your own Leaderboards on Hugging Face.

Hugging Face Generative AI Services (HUGS) – optimized, zero-configuration inference microservices designed to simplify and accelerate the development of AI applications with open models.

Videos

Data sets

Releasing the largest multilingual open pretraining dataset

Common Corpus

Files and versions