Open Machine Learning

Introduction

This page about Open Machine Learning can be displayed as Wiki2Reveal slides. Single sections are regarded as slides and modifications on the slides will immediately affect the content of the slides. The following aspects of Open Machine Learning are considered in detail:

(1) Open Source for learning algorithms
(2) Open Training Data for a reproducible state of the pre-trained machine in use, e.g. in science and education
(3) Licensing of derivative work in the context of Open Machine Learning

Objective

This learning resource about Open Machine Learning in Wikiversity has the objective to address the role of Open Training Data (OTD) for generated output.

Modules of the Course

From TensorFlow and its applications towards an understanding of the underlying mathematical principles.
Motion Capturing

Target Group

The target groups of this learning resource on Open Machine Learning are:

Bachelor/Master students with the subject
people who are interested in Open Innovation Ecosystems and want to use ML infrastructure on Open Data repositories (e.g. for humanitarian support of communities).

Reproducibility

Reproducibility of machine learning concerns the sources that are necessary to reproduce an existing system capable of learning (Digital Public Good^[1]). These include:

algorithms that define the behavior of the adaptive system,
training data that also defines and modifies the behavior of the system, and
machine state data, which describes the state $M_{t}$ of the adaptive system at time $t\in T$.

Reproducibility for algorithms

Learning algorithms define how the machine state changes as a function of the input data or training data. For open machine learning, these must be available as open source code so that the code can not only be used, but also reviewed and improved by the scientific community. As an introductory example, the gradient descent method can be mentioned, which is used for learning algorithms of backpropagation networks for error minimization.
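
As a minimal sketch of the gradient descent method mentioned above, the following Python snippet fits a single weight $w$ of the model $y=w\cdot x$ by minimizing the mean squared error. Here the machine state is just the number $w$. The function name `gradient_descent`, the learning rate, the number of steps and the toy data are illustrative assumptions and not part of the course material.

```python
def gradient_descent(xs, ys, w0=0.0, learning_rate=0.05, steps=200):
    """Fit y = w * x by minimizing the mean squared error E(w)."""
    w = w0
    n = len(xs)
    for _ in range(steps):
        # Derivative of E(w) = (1/n) * sum((w*x - y)^2) with respect to w
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        # Move against the gradient to reduce the error
        w -= learning_rate * grad
    return w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # toy data generated by y = 2 * x
w = gradient_descent(xs, ys)
print(round(w, 3))
```

Because the code is open, anyone can check how the error gradient is computed and how the step size influences the result.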

Training data - Definition of In-Output-Behaviour

Two learning systems $(M_{t})_{t\in T}$ and $({\widetilde {M_{t}}})_{t\in T}$ can, for example, use the same artificial neural network for machine learning with the same initial state $M_{0}={\widetilde {M_{0}}}$, but be "fed" with completely different training data sets and thus exhibit completely different behavior at a later time $t>0$. In an open reproducible system, the training data used is openly accessible according to the FAIR data principles.
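
The effect described above can be illustrated with a small sketch: two machines start from the same initial state and use the same learning step, but receive different training data. The one-parameter model $y=w\cdot x$, the function name `learn` and both toy data sets are assumptions chosen for illustration only.

```python
def learn(w, batch, learning_rate=0.05):
    """One learning step for the model y = w * x (gradient step on the squared error)."""
    if not batch:
        return w  # no training data: the state stays unchanged
    grad = (2.0 / len(batch)) * sum((w * x - y) * x for x, y in batch)
    return w - learning_rate * grad


M0 = 0.0        # initial state of the first machine
M0_tilde = 0.0  # identical initial state of the second machine

data = [(1.0, 2.0), (2.0, 4.0)]          # training data following y = 2x
data_tilde = [(1.0, -1.0), (2.0, -2.0)]  # training data following y = -x

M, M_tilde = M0, M0_tilde
for t in range(200):
    M = learn(M, data)
    M_tilde = learn(M_tilde, data_tilde)

# Same input, different In-Output-Behaviour after training
x = 3.0
print(M * x, M_tilde * x)
```

Only if both the training data and the initial state are published can a third party repeat this run and arrive at the same machine states.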

Learning Tasks - Machine Learning

Learning activities focus on the role of training data for machine learning algorithms:

(Novice to ML) If you are new to Machine Learning (ML), it is recommended to start by exploring the concepts and foundations of machine learning.
(Supervised, unsupervised ML) Explain the concepts of supervised and unsupervised machine learning and apply them to text generation, using text documents under an open content license as training data.

Learning Tasks - Derivative Work

To address licensing and derivative work, we consider open licensing models that can be used to assure community access. The community can access, modify and preserve the resources to which people have contributed in an evolutionary process.

(Derivative Work) Explain what "derivative work" means, e.g. for Open Source software or Creative Commons documents.
(Open Data) What are the challenges and constraints for Open Data that is used for Machine Learning (ML)? Can users add data to a repository?
(Transparent Chain of Licenses) Assume that a machine $M:=(M_{t})_{t\in T}$ with an initial state $M_{0}$ is trained with training data $\mathbb {T}$ that is issued under a specific open license $L$. The license $L$ allows derivative work. If the machine learning is generative, then the chain of licenses will assign the same license to the generated text.
(Versions of Training Data Sets) Training data may change over time, so the training data $\mathbb {T} _{t}$ may have a time index $t$ as well.
(Multiple Licenses for Training Data Sets) If training data sets are aggregated from different sources, then the set of licenses involved may also change over time (e.g. $\mathbb {L} _{t}:=\{L_{1},L_{2},L_{3}\}$ means that at time $t$ the data set $\mathbb {T} _{t}$ is aggregated from training data with the licenses $L_{1},L_{2}$ and $L_{3}$).
(Learning Algorithm) The learning algorithm $\lambda$ defines how the machine state evolves depending on the training data. With a discrete iterative step from $t$ to $t+1$ and the training data $\mathbb {T} _{t}$ used at time $t$, we have $\lambda (M_{t},\mathbb {T} _{t})=M_{t+1}$. This means that the learning algorithm $\lambda$ maps the current machine state $M_{t}$ together with the training data $\mathbb {T} _{t}$ to the new machine state $M_{t+1}$. The next learning step $\lambda (M_{t+1},\mathbb {T} _{t+1})=M_{t+2}$ creates the next machine state $M_{t+2}$ in an inductive way. We can define $\lambda (M_{t},\emptyset ):=M_{t}$, meaning that the machine state remains unchanged if no training data is provided.
(Origin, Experimental Design, Metadata) For scientific purposes it is often important to clarify who collected the data and what the experimental design was. Explain the requirements of data collection and the related scientific standards for it. Discuss the similarities and differences in the context of training data for machine learning. Assume that origin, institution and experimental design are also available as metadata for the training data set. What other metadata is relevant for you to assess the quality of data that is used for training?
(Transparency for Trained Models) If the training data $\mathbb {T} _{t}$ is not available due to privacy regulations, e.g. for medical data, then the detailed license information together with metadata about the institute that trained the model could help to assess whether the machine $M_{t}$ at time $t$ might be used for a specific purpose.
(Machine State) If we aggregate the considerations above, we can create a reference to the machine state $M_{t}$ that results from training the machine with the learning algorithm $\lambda$ and the training data $\mathbb {T} _{t}$. At time $t$ machine learning used the list of licenses $\mathbb {L} _{t}$ and optionally other metadata (for reproducibility). The in-output pair $(x_{t},y_{t})$ defines that $M_{t}$ generated $y_{t}:=M_{t}(x_{t})$ from the input $x_{t}$. So the tuple $(\lambda ,M_{0},M_{t},\mathbb {T} _{t},\mathbb {L} _{t},x_{t},y_{t})$ defines how $y_{t}$ was generated with $M_{t}$ and the involved licenses $\mathbb {L} _{t}$ with the metadata for $\mathbb {T} _{t}$. Assume that $M_{t}$ and $\mathbb {L} _{t}$ are references to versions in a version control system. Discuss the limitations of the approach, especially when a huge amount of data is used for training and $(M_{t})_{t\in T}$ is constantly trained with an input stream of data.
(Generative Artificial Intelligence) Assume a user uses generative AI; some components of the tuple $(\lambda ,M_{0},M_{t},\mathbb {T} _{t},\mathbb {L} _{t},x_{t},y_{t})$ might not be known. But at least the documentation of the sequence of prompts $x_{1},x_{2},\ldots ,x_{n}$ with the corresponding outputs $y_{1},y_{2},\ldots ,y_{n}$ can be made compulsory, especially for student homework. This makes it possible to identify the added value of the student beyond generative AI. Key questions for the assessment:
- Was the logical structure of the thesis defined by the generative AI, and what are the reasons for students to change the generated logical structure? How did the changes improve the logical structure of the thesis in comparison to the AI-generated results?
- Are citations/references included in the generated document? Do the citations provide evidence for the content discussed in the thesis?
- Is the state of the art in science properly integrated in the thesis, or does the research question in the thesis require other relevant scientific results to cover the current scientific knowledge about the topic? Did the student add those references and argue in the thesis why these citations were missing?
- Using generative AI in a thesis in terms of transparency requires 3 components:
  - (Prompt Results) prompt and result pairs $(x_{t},y_{t})$,
  - (Manual Changes) manual changes of $y_{t}$ by the students to ${\widetilde {y_{t}}}$ and
  - (Meta Analysis) meta discussion why changes to $y_{t}$ are necessary to fulfill the scientific requirements of the thesis.
(Licensing Chains) With licensing chains $(\mathbb {L} _{0},\ldots ,\mathbb {L} _{t},\mathbb {L} _{t+1})$ it is possible that different licensing models $\mathbb {L} _{t}$ are used along with the training data $\mathbb {T} _{t}$. Due to this openness of licensing, the training data may have to be reduced to a subset ${\widetilde {\mathbb {T} }}_{t}\subset \mathbb {T} _{t}$, because not the complete set of training data is compliant with a required license. Instead of training the machine $\lambda (M_{t},\mathbb {T} _{t})=M_{t+1}$ with $\mathbb {T} _{t}$, we train with the license-compliant subset ${\widetilde {\mathbb {T} }}_{t}$ and obtain a new machine state according to license compatibility $\lambda (M_{t},{\widetilde {\mathbb {T} }}_{t})={\widetilde {M}}_{t+1}$. Discuss applications of this scenario and discuss PROs and CONs of a reduced set of training data for the training process at time $t$.
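
The learning step $\lambda$, the convention $\lambda (M_{t},\emptyset )=M_{t}$, the license-compliant subset ${\widetilde {\mathbb {T} }}_{t}$ and a record of the involved references can be sketched together in a few lines of Python. Everything concrete here is an assumption made for illustration: the one-parameter model $y=w\cdot x$, the class `Sample`, the helper names, the label "proprietary", and the use of short SHA-256 content hashes as stand-ins for references in a version control system.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    x: float
    y: float
    license: str  # license identifier of the data source (illustrative labels)


def learn(w, samples, learning_rate=0.05):
    """Learning algorithm lambda: maps (M_t, T_t) to M_{t+1} for y = w * x."""
    if not samples:  # lambda(M_t, empty set) := M_t
        return w
    grad = (2.0 / len(samples)) * sum((w * s.x - s.y) * s.x for s in samples)
    return w - learning_rate * grad


def compliant_subset(samples, allowed):
    """Reduce T_t to the samples whose license is in the allowed set."""
    return [s for s in samples if s.license in allowed]


def fingerprint(obj):
    """Short content hash, used here in place of a version-control reference."""
    data = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:12]


T_t = [Sample(1.0, 2.0, "CC-BY-SA-4.0"),
       Sample(2.0, 4.0, "CC0-1.0"),
       Sample(3.0, 0.0, "proprietary")]
allowed = {"CC-BY-SA-4.0", "CC0-1.0"}

M_t = 0.0
T_tilde = compliant_subset(T_t, allowed)
M_next = learn(M_t, T_t)            # state after training on all data
M_next_tilde = learn(M_t, T_tilde)  # state after training on the compliant subset

# Record of what was used to reach the new state
record = {
    "algorithm": "gradient-descent-step",
    "previous_state": fingerprint(M_t),
    "new_state": fingerprint(M_next_tilde),
    "training_data": fingerprint([[s.x, s.y, s.license] for s in T_tilde]),
    "licenses": sorted({s.license for s in T_tilde}),
}
print(record["licenses"], M_next, M_next_tilde)
```

The two resulting states differ, which makes the trade-off of the task visible: the reduced data set yields a license-compliant machine, but one that has seen less data.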

Examples - Derivative Work for Data

Consider the following examples as an introduction and discuss differences and similarities of Machine Learning that is based on the training set $\mathbb {T} _{t}$:

(New Data) Due to a new empirical study with the same experimental design as the existing training data set $\mathbb {T} _{t}$, data is added at time $t$ and yields the new set ${\widetilde {\mathbb {T} _{t}}}$.
(Missing Data) In the existing training data set $\mathbb {T} _{t}$ at time $t$, missing values are added and ${\widetilde {\mathbb {T} _{t}}}$ is the corrected data.
(Correct Data) Input errors are corrected (e.g. input data about the temperature $-350^{\circ }\,C$ was modified to $-3.50^{\circ }\,C$).

Describe your workflow so that the new state of the machine includes the improved dataset. Does it make sense to train $M_{t+1}$ with the improved dataset ${\widetilde {\mathbb {T} _{t}}}$? What are the implications for the version control of machine states?
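
One possible building block for such a workflow is a content-based version identifier for the data set, so that a machine state can refer to exactly the data version it was trained on. The following sketch assumes a data set small enough to be serialized as JSON; the function name `dataset_version`, the field names and the station records are hypothetical.

```python
import hashlib
import json


def dataset_version(records):
    """Return a short content hash that identifies one version of a data set."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


# Existing training data with an input error in the first record
T_t = [{"station": "A", "temperature_c": -350.0},
       {"station": "B", "temperature_c": 1.2}]

# (Correct Data) copy the records and fix the input error -350 -> -3.50
T_tilde = [dict(r) for r in T_t]
T_tilde[0]["temperature_c"] = -3.50

v_old = dataset_version(T_t)
v_new = dataset_version(T_tilde)
print(v_old, v_new, v_old != v_new)
```

Any correction, however small, produces a new identifier, so a machine state trained on the old data can be distinguished from one trained on the corrected data.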

Learning Tasks - Open Data

Transfer the concept of derivative work to open data and discuss how alteration and modification of data can be managed in a trusted way by the community.

Learning Tasks - Training Data

Analyze how open licensing models (like the GNU General Public License, Creative Commons, ...) allow donated digital contributions and their derivative work to remain open for the community. How does the licensing model contribute to an Open Innovation Ecosystem with digital public goods^[1]? Apply this concept to training data for Open Machine Learning and discuss the requirements and constraints. Apply open data to Spatial Risk Management, e.g. in the context of road safety^[2]. What are the benefits, challenges, requirements and constraints of using machine learning in that context?

Learning Tasks - Open Machine Learning Licensing Chains

Assume we have training data (e.g. text documents under a Creative Commons license) and train a machine $M_{t}$ at time $t$ with an open source machine learning algorithm that is transparently available to the community. The training process yields a new system state $M_{t+1}$ with a changed In-Out-Behaviour (IOB). The machine now generates the output $y$ from input data $x$ with $M_{t+1}(x)=y$. What licensing model should be assigned to the output $y$ if the training data is provided under the license $L$? Discuss different views about a licensing chain in which the output $y$ given by machine $M_{t+1}$ is used as training data to generate a new machine state $M_{t+2}$.
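
To make the chain concrete for the discussion, the following sketch tracks which licenses have entered a machine state. It encodes only one of the possible views, namely that the licenses of all training data accumulate in the machine state and are attached to every generated output. Whether this view is legally appropriate is exactly the question of this task. The class `Machine`, the functions `train` and `generate` and the placeholder output are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    version: int
    licenses: frozenset  # licenses of all training data seen so far


def train(machine, data_licenses):
    """Return the next machine state; licenses accumulate as a union (assumed policy)."""
    return Machine(machine.version + 1, machine.licenses | frozenset(data_licenses))


def generate(machine, x):
    """Return a placeholder output tagged with the license set of the machine."""
    return {"input": x,
            "output": f"generated text for {x!r}",
            "licenses": machine.licenses}


M_t = Machine(0, frozenset())
M_t1 = train(M_t, {"CC-BY-SA-4.0"})  # trained on Creative Commons text
y = generate(M_t1, "prompt")         # output inherits the license set

# The output y is fed back as training data together with new CC0 data
M_t2 = train(M_t1, y["licenses"] | {"CC0-1.0"})
print(sorted(M_t2.licenses))
```

Other views can be compared by replacing the union in `train` with a different rule.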

Learning Tasks - Same Machine Learning Algorithm different Training Data

Assume we have two different open training data sets $\mathbb {T}$ and ${\widetilde {\mathbb {T} }}$. Furthermore, we use the same neural network model (e.g. a backpropagation network) with a predefined topology of the network (i.e. number of neurons, connections between neurons, layers of neurons, ...) and predefined activation functions of the neurons. So the starting states of the two machines $M_{0}$ and ${\widetilde {M}}_{0}$ at time $t=0$ are considered to be the same.

Use different Training Data Sets

Due to the two different data sets $\mathbb {T}$ and ${\widetilde {\mathbb {T} }}$ used for the open, transparent and reproducible training of $M_{t}$ and ${\widetilde {M}}_{t}$, the machines evolve on different pathways over the time index $t\in T$.

Trained In-Output-Behavior

In general the In-Output-Behaviour (IOB) will be different at time $t\in T$. Discuss the role of the training data $\mathbb {T}$ and ${\widetilde {\mathbb {T} }}$ as part of "programming" the IOB of the machines.

Bias in Training Data

What is a bias? Discuss an example of training data that has a bias (e.g. in the context of human rights, added fake news data, missing data, unreliable data sources, ...) and explain how the bias in the training data has an impact on the In-Output-Behaviour (IOB) of a machine that is used for decision making, e.g. in the medical domain^[3]. How can transparency and openness of the training data help to identify a bias^[4]? What are the challenges, requirements and constraints (e.g. data privacy regulations)?
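
When training data is openly accessible, anyone can run simple checks on it. The sketch below counts how often a positive label occurs per group in a toy data set. The group names, the labels and the function `positive_rate_by_group` are hypothetical, and a difference in rates is only a hint that needs further analysis; it is not a proof of bias.

```python
from collections import defaultdict


def positive_rate_by_group(samples):
    """Return {group: share of samples with label 1} for (group, label) pairs."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, label in samples:
        counts[group][0] += label
        counts[group][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}


# Toy training data: group_b is underrepresented and never labelled positive
training_data = [("group_a", 1), ("group_a", 1), ("group_a", 0), ("group_a", 1),
                 ("group_b", 0), ("group_b", 0)]

rates = positive_rate_by_group(training_data)
print(rates)
```

A machine trained on such data may reproduce the imbalance in its decisions, which is why access to the training data matters for identifying it.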

Assessment and Machine Learning

Assume a student $A$ has to do an assessment about a specific topic and has to produce a specific document as homework for it. Student $A$ works hard, creates good, appropriate homework and submits the result to the teacher. Additionally, student $A$ provides the document to another student $B$. Student $B$ takes the same document, just like taking a book from the shelf, and submits it to the teacher too (e.g. without looking into the document). From an outside perspective we assume that student $A$ has understood the topic of the homework, whereas student $B$ may not have any insights into the topic:

An oral assessment about the topic or
a supervised transfer task of the lessons learnt to a new example or counterexample may uncover the knowledge about the topic.

Transfer this example to Machine Learning and derive consequences for how an assessment can be designed that assesses gained knowledge and problem-solving capabilities instead of the successful delivery of an appropriate document for the learning task.

References

[1] Nordhaug, L. M., & Harris, L. (2021). Digital public goods: enablers of digital sovereignty. In Development Co-operation Report 2021. DOI: 10.1787/c023cb2e-en
[2] Najjar, A., Kaneko, S. I., & Miyanaga, Y. (2017, February). Combining satellite imagery and open data to map road safety. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 31, No. 1).
[3] Mac Namee, B., Cunningham, P., Byrne, S., & Corrigan, O. I. (2002). The problem of bias in training data in regression problems in medical decision support. Artificial Intelligence in Medicine, 24(1), 51-70.
[4] Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., & Torralba, A. (2012). Undoing the damage of dataset bias. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part I (pp. 158-171). Springer Berlin Heidelberg.

External References

LAION - Democratizing AI

Page Information

You can display this page as Wiki2Reveal slides

Wiki2Reveal

The Wiki2Reveal slides were created for the 'Open community approach' and the link for the Wiki2Reveal slides was created with the link generator.

This page is designed as a PanDocElectron-SLIDE document type.
Source: Wikiversity https://en.wikiversity.org/wiki/Open%20community%20approach/Open%20Machine%20Learning
see Wiki2Reveal for the functionality of Wiki2Reveal.
