Introduction

This page about Open Machine Learning can be displayed as Wiki2Reveal slides. Individual sections are treated as slides, and modifications to the slides immediately affect the slide content. The following aspects of Open Machine Learning are considered in detail:

  • (1) Open Source for learning algorithms
  • (2) Open Training Data for a reproducible state of pre-trained machines, e.g. in science and education
  • (3) Licensing of derivative work in the context of Open Machine Learning

Objective

This learning resource about Open Machine Learning on Wikiversity has the objective of addressing the role of Open Training Data (OTD) for generated output.

Modules of the Course

Target Group

The target groups of this learning resource on Open Machine Learning are:

  • Bachelor/Master students in a related subject area,
  • people who are interested in Open Innovation Ecosystems and want to use ML infrastructure on Open Data repositories (e.g. for humanitarian support of communities).

Reproducibility

Reproducibility of machine learning considers the resources necessary to reproduce an existing learning system (a Digital Public Good[1]). These include:

  • algorithms that define the behavior of the adaptive system,
  • training data that also defines and modifies the behavior of the system, and
  • machine state data, which describes the state M_t of the adaptive system at time t.
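
These three components can be bundled into a single reproducibility record. The following is a minimal sketch in Python with hypothetical names and references; it only illustrates the idea, not a standardized format:

    # Minimal sketch, with hypothetical names, of a reproducibility record
    # that bundles the three components named above.

    from dataclasses import dataclass

    @dataclass
    class ReproducibilityRecord:
        algorithm: str       # reference to the open source learning algorithm
        training_data: str   # reference to the open training data set
        machine_state: str   # reference to the machine state M_t at time t
        time_index: int      # the time t to which the machine state refers

    record = ReproducibilityRecord(
        algorithm="https://example.org/learning-algorithm",  # hypothetical URL
        training_data="doi:10.0000/example-training-data",   # hypothetical DOI
        machine_state="model-checkpoint-v7",
        time_index=7,
    )
    print(record)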

Reproducibility for algorithms

Learning algorithms define how the machine state changes as a function of the input data or training data. For open machine learning, these must be available as open source code so that the code can not only be used, but also reviewed and improved by the scientific community. As an introductory example, the gradient descent method can be mentioned, which is used in the learning algorithms of backpropagation networks for error minimization.
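
As a minimal sketch of the gradient descent method (the error function, step size and number of steps are illustrative choices, not prescribed by this page):

    # Minimal sketch of gradient descent for error minimization.

    def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
        """Iteratively move against the gradient of an error function."""
        w = w0
        for _ in range(steps):
            w -= learning_rate * grad(w)
        return w

    # Error E(w) = (w - 3)^2 with gradient E'(w) = 2 * (w - 3);
    # gradient descent converges towards the minimum at w = 3.
    print(gradient_descent(lambda w: 2 * (w - 3), w0=0.0))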

Training Data - Definition of In-Output-Behaviour

Two learning systems M_1 and M_2 can, for example, use the same artificial neural network for machine learning with the same initial state at time t_0, but "feed" it with completely different training data sets and thus exhibit completely different behavior at a later time t. In an open reproducible system, the training data used is openly accessible according to the FAIR data principles.
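
The following sketch illustrates this with a deliberately small example (the learning rule and the data sets are illustrative): the same learning algorithm and the same initial state, trained with two different training data sets, lead to different In-Output-Behaviour.

    # Two machines M_1 and M_2 with the identical initial state, trained with
    # the same simple learning rule but different data sets D_1 and D_2.

    def train(weight, data, learning_rate=0.05, steps=500):
        """Fit y ~ weight * x by gradient descent on the mean squared error."""
        for _ in range(steps):
            grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
            weight -= learning_rate * grad
        return weight

    initial_state = 0.0                                # identical state at t_0
    D_1 = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]         # training data of M_1
    D_2 = [(1.0, -1.0), (2.0, -2.2), (3.0, -2.9)]      # training data of M_2

    M_1 = train(initial_state, D_1)
    M_2 = train(initial_state, D_2)
    print(M_1, M_2)            # different learned states ...
    print(M_1 * 10, M_2 * 10)  # ... and different outputs for the same input x = 10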

Learning Tasks - Machine Learning

The learning activities focus on the role of training data for the machine learning algorithms:

  • (Novice to ML) If you are new to Machine Learning (ML), it is recommended to start by exploring the concepts and foundations of machine learning.
  • (Supervised, unsupervised ML) Explain the concepts of supervised and unsupervised machine learning and apply them to text generation with training data in the form of text documents published under an open content license (a minimal sketch follows after this list).
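
The following sketch contrasts supervised and unsupervised learning on a tiny one-dimensional data set (the data, labels and the choice of 1D k-means are illustrative and not part of the learning task itself):

    # Supervised vs. unsupervised learning on the same tiny 1D data set.

    data = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]                   # observations
    labels = ["low", "low", "low", "high", "high", "high"]  # labels (supervised case)

    # Supervised: learn a decision threshold from labelled examples.
    low = [x for x, y in zip(data, labels) if y == "low"]
    high = [x for x, y in zip(data, labels) if y == "high"]
    threshold = (sum(low) / len(low) + sum(high) / len(high)) / 2
    print("supervised threshold:", threshold)

    # Unsupervised: group the same data without labels (1D k-means, k = 2).
    centers = [min(data), max(data)]
    for _ in range(10):
        clusters = [[], []]
        for x in data:
            clusters[0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1].append(x)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    print("unsupervised cluster centers:", centers)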

Learning Tasks - Derivative Work

To address licensing and derivative work, we consider open licensing models that can be used to assure community access. The community can access, modify and preserve the resources to which people contributed in an evolutionary process.

  • (Derivative Work) Explain what "derivative work" means, e.g. for Open Source software or Creative Commons documents.
  • (Open Data) What are the challenges and constraints for Open Data that is used for Machine Learning (ML)? Can users add data to a repository?
  • (Transparent Chain of Licenses) Assume that a machine M with an initial state M_{t_0} is trained with training data D that is issued under a specific open license L, and that the license L allows derivative work. If the machine learning is generative, then the chain of licenses will assign the same license to the generated text.
  • (Versions of Training Data Sets) Training data may change over time, so the training data D_t may carry a time index t as well.
  • (Multiple Licenses for Training Data Sets) If training data sets are aggregated from different sources, then the involved licenses may also change over time (e.g. L_t = {L_1, L_2} means that at time t the data set D_t is aggregated from training data with the licenses L_1 and L_2).
  • (Learning Algorithm) The learning algorithm A defines how the machine state evolves depending on the training data. With a discrete iterative step from t to t+1 and the training data D_t used at time t, we have M_{t+1} = A(M_t, D_t). This means that the learning algorithm A maps the current machine state M_t together with the training data D_t to the new machine state M_{t+1}. The next learning step M_{t+2} = A(M_{t+1}, D_{t+1}) creates the next machine state M_{t+2} in an inductive way. We can define A(M_t, {}) = M_t, meaning that the machine state remains unchanged if no training data is provided (see the sketch after this list).
  • (Origin, Experimental Design, Metadata) For scientific purposes it is often important that it can be clarified who collected the data and what the experimental design was. Explain the requirements of data collection and the related scientific standards. Discuss the similarities and differences in the context of training data for machine learning. Assume that origin, institution and experimental design are also available as metadata for the training data set. What other metadata is relevant for you to assess the quality of the data that is used for training?
  • (Transparency for Trained Models) If the training data D_t is not available due to privacy regulations, e.g. for medical data, then detailed license information and metadata, together with the institute that trained the model, could help to assess whether the machine M_t at time t may be used for a specific purpose.
  • (Machine State) If we aggregate the considerations above, we can create a reference to the machine state M_{t+1} = A(M_t, D_t), which is obtained by training the machine with the learning algorithm A and the training data D_t. At time t the machine learning used the list of licenses L_t and optionally other metadata (for reproducibility). The in-/output pair (I, O) states that M_{t+1} generated the output O from the input I. So the tuple (M_t, A, D_t, L_t, I, O) defines how O was generated with M_{t+1}, together with the involved licenses L_t and the metadata for D_t. Assume that M_t and D_t are references to versions in a version control system. Discuss the limitations of this approach, especially when a huge amount of data is used for training and M is constantly trained with an input stream of data.
  • (Generative Artificial Intelligence) Assume a user uses generative AI; some components of the tuple (M_t, A, D_t, L_t, I, O) might not be known. But at least the sequence of prompts P_1, ..., P_n with the corresponding outputs O_1, ..., O_n can be made compulsory to document, especially for student homework. This allows the added value of the student beyond generative AI to be identified. Key questions for the assessment:
    • Was the logical structure of the thesis defined by the generative AI, and what are the reasons for the student to change the generated logical structure? How did the changes improve the logical structure of the thesis in comparison to the AI-generated results?
    • Are citations/references included in the generated document? Do the citations provide evidence for the discussed content in the thesis?
    • Is the state of the art in science properly integrated in the thesis, or does the research question of the thesis require other relevant scientific results to cover the current scientific knowledge about the topic? Did the student add those references, and did the student argue in the thesis why these citations were missing?
    • Using generative AI in a thesis requires, in terms of transparency, three components:
      • (Prompt Results) the prompt and result pairs (P_i, O_i),
      • (Manual Changes) the manual changes of the generated output O_i by the student to a revised version O_i', and
      • (Meta Analysis) a meta discussion of why the changes to O_i are necessary to fulfill the scientific requirements of the thesis.
  • (Licensing Chains) With licensing chains it is possible to keep track of which licensing models L_1, ..., L_k are used along with the training data D_t. Due to this openness of licensing, the training data can be reduced to a subset D_t' of D_t, because not the complete set of training data is compliant with a required license. Instead of training the machine M_t with D_t, we train with the license-compliant subset D_t' under a suitable license and obtain a new machine state M_{t+1}' = A(M_t, D_t') according to license compatibility. Discuss applications of this scenario and discuss the pros and cons of a reduced set of training data for the training process at time t.
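
The following sketch, with hypothetical names and licenses, models the learning step M_{t+1} = A(M_t, D_t) together with a licensing chain and the reduction of the training data to a license-compliant subset D_t':

    # Minimal sketch of a machine state with a licensing chain; it models
    # M_{t+1} = A(M_t, D_t) and the filtering of D_t to a compliant subset.

    from dataclasses import dataclass, field

    @dataclass
    class TrainingItem:
        data: str
        license: str                      # e.g. "CC-BY-4.0"

    @dataclass
    class MachineState:
        version: str                      # reference in a version control system
        licenses: list = field(default_factory=list)  # aggregated licensing chain

    def learning_step(state, batch, required_license=None):
        """A(M_t, D_t): returns M_{t+1} and records the licenses of D_t."""
        if required_license is not None:
            batch = [item for item in batch if item.license == required_license]
        if not batch:                     # A(M_t, {}) = M_t
            return state
        new_licenses = sorted(set(state.licenses) | {item.license for item in batch})
        # ... here the actual training with `batch` would update the model ...
        return MachineState(version=state.version + "+1", licenses=new_licenses)

    D_t = [TrainingItem("doc A", "CC-BY-4.0"), TrainingItem("doc B", "CC-BY-SA-4.0")]
    M_t = MachineState(version="v1")
    print(learning_step(M_t, D_t))               # trained with the full data set D_t
    print(learning_step(M_t, D_t, "CC-BY-4.0"))  # trained with the subset D_t'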

Examples - Derivative Work for Data

Consider the following examples as an introduction and discuss differences and similarities of machine learning that is based on the training set D_t:

  • (New Data) Due to a new empirical study with the same experimental design as the existing training data set D_t, data is added at time t+1 and yields a new set D_{t+1},
  • (Missing Data) in the existing training data set D_t missing values are filled in and D_{t+1} is the corrected data,
  • (Corrected Data) input errors are corrected, e.g. an erroneous input value for the temperature is replaced by the correct measurement.

Describe your workflow so that the new machine state includes the improved data set. Does it make sense to train M_t with the improved data set D_{t+1}? What are the implications for the version control of machine states? A minimal versioning sketch is shown below.
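
The following sketch, assuming hypothetical names and version identifiers, shows how data set versions and the machine states trained from them could be tracked together; retraining from the initial state lets the corrected data replace the faulty data completely.

    # Minimal sketch, with hypothetical names, of tracking data set
    # versions together with the machine states trained from them.

    dataset_versions = {
        "D_t":   {"parent": None,  "change": "initial empirical study"},
        "D_t+1": {"parent": "D_t", "change": "missing values filled in, input errors corrected"},
    }

    machine_states = {
        "M_t+1": {"trained_from": "M_t0", "dataset": "D_t"},
    }

    def retrain(states, dataset_version):
        """Retrain from the initial state M_t0 so that the corrected data
        fully replaces the old data instead of continuing from M_t+1."""
        new_state = "M_t+2"
        states[new_state] = {"trained_from": "M_t0", "dataset": dataset_version}
        return new_state

    print(retrain(machine_states, "D_t+1"))
    print(machine_states)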

Learning Tasks - Open Data

Transfer the concept of derivative work to open data and discuss how alteration and modification of data can be managed in a trusted way by the community.

Learning Tasks - Training Data

Analyze open licensing models (like the GNU General Public License, Creative Commons, ...): how can derivative work based on donated digital contributions remain open for the community? How does the licensing model contribute to an Open Innovation Ecosystem with digital public goods[1]? Apply this concept to training data for Open Machine Learning and discuss the requirements and constraints. Apply open data to spatial risk management, e.g. in the context of road safety[2]. What are the benefits, challenges, requirements and constraints of using machine learning in that context?

Learning Tasks - Open Machine Learning Licensing Chains

Assume we have training data D_t (e.g. text documents under a Creative Commons license) and train a machine M at a time t with an open source machine learning algorithm that is transparently available to the community. A new system state M_{t+1} of the machine changes the In-Output-Behaviour (IOB) through the training process. Now it generates the output O from the input data I, i.e. O = M_{t+1}(I). What licensing model should be assigned to the output O if the training data is provided under the license L? Discuss different views about a licensing chain in which the output O given by machine M_{t+1} is used as training data to generate a new machine state M_{t+2}.
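
The following sketch, assuming hypothetical names, illustrates one possible view discussed above: the output O inherits the share-alike license L of the training data and can then re-enter the training data for M_{t+2}. It is not the only defensible licensing model.

    # Minimal sketch of a licensing chain in which the generated output
    # inherits the license L of the training data (one possible view).

    def generate(machine_state, input_data):
        """Illustrative stand-in for O = M_{t+1}(I)."""
        return f"generated output for {input_data!r}"

    def generate_with_license_chain(machine_state, training_license, input_data):
        output = generate(machine_state, input_data)
        return {
            "output": output,
            "license": training_license,   # L propagates to O in this view
            "derived_from": {"machine": machine_state, "input": input_data},
        }

    record = generate_with_license_chain("M_t+1", "CC-BY-SA-4.0", "I")
    # The licensed output can now be used as training data for M_t+2.
    next_training_item = (record["output"], record["license"])
    print(record)
    print(next_training_item)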

Learning Tasks - Same Machine Learning Algorithm different Training Data

Assume we have two different open training data sets D_1 and D_2. Furthermore, we use a neural network model (e.g. a backpropagation network) with a predefined topology of the network (i.e. number of neurons, connections between neurons, layers of neurons, ...) and given activation functions of the neurons. So the starting states of the two machines M_1 and M_2 at time t_0 are considered to be the same.

Use different Training Data Sets

Due to the two different data sets D_1 and D_2 used for the open, transparent and reproducible training of M_1 and M_2, the machines evolve along different pathways over the time index t.

Trained In-Output-Behaviour

In general, the In-Output-Behaviour (IOB) will be different at time t. Discuss the role of the training data D_1 and D_2 as part of "programming" the IOB of the machines.

Bias in Training Data

What is a bias? Discuss an example of training data that has a bias (e.g. in the context of human rights, added fake news data, missing data, unreliable data sources, ...) and explain how the bias in the training data has an impact on the In-Output-Behaviour (IOB) of a machine that is used for decision making, e.g. in the medical domain[3]. How can transparency and openness of the training data help to identify a bias[4]? What are the challenges, requirements and constraints (e.g. data privacy regulations)? A minimal sketch of a simple bias check on open training data is shown below.
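
Openness of the training data makes simple bias checks possible before training. The following sketch, with hypothetical attribute names, compares how often a positive label occurs per group in an openly accessible training data set:

    # Minimal sketch, with hypothetical attribute names, of a simple bias
    # check on open training data: compare the positive-label rate per group.

    from collections import Counter, defaultdict

    training_data = [
        {"group": "A", "label": 1}, {"group": "A", "label": 1},
        {"group": "A", "label": 0}, {"group": "B", "label": 0},
        {"group": "B", "label": 0}, {"group": "B", "label": 0},
    ]

    counts = defaultdict(Counter)
    for record in training_data:
        counts[record["group"]][record["label"]] += 1

    for group, c in counts.items():
        rate = c[1] / (c[0] + c[1])
        print(f"group {group}: positive-label rate {rate:.2f}")
    # A large difference between the groups hints at a bias that a model
    # trained on this data may reproduce in its In-Output-Behaviour.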

Assessment and Machine Learning

Assume a student S_1 has to do an assessment about a specific topic and has to produce a specific document as homework. Student S_1 works hard, creates a good and appropriate homework document and submits the result to the teacher. Additionally, student S_1 provides the document to another student S_2. Student S_2 takes the same document, just like taking a book from the shelf, and submits the same document to the teacher too (e.g. without looking into the document). From an outside perspective we assume that student S_1 has understood the topic of the homework. Student S_2 may not have any insight into the topic:

  • An oral assessment about the topic or
  • a supervised transfer task of the lessons learnt to a new example or counterexample may uncover the knowledge about the topic.

Transfer this example to machine learning and derive consequences for how an assessment can be designed that assesses gained knowledge and problem-solving capabilities instead of the successful delivery of an appropriate document for the learning task.

References

  1. Nordhaug, L. M., & Harris, L. (2021). Digital public goods: enablers of digital sovereignty. In: Development Co-operation Report 2021. DOI: 10.1787/c023cb2e-en
  2. Najjar, A., Kaneko, S. I., & Miyanaga, Y. (2017, February). Combining satellite imagery and open data to map road safety. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 31, No. 1).
  3. Mac Namee, B., Cunningham, P., Byrne, S., & Corrigan, O. I. (2002). The problem of bias in training data in regression problems in medical decision support. Artificial intelligence in medicine, 24(1), 51-70.
  4. Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., & Torralba, A. (2012). Undoing the damage of dataset bias. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part I 12 (pp. 158-171). Springer Berlin Heidelberg.

Page Information

You can display this page as Wiki2Reveal slides.

Wiki2Reveal

The Wiki2Reveal slides were created for the Open Community Approach, and the link for the Wiki2Reveal slides was created with the link generator.