Audio-Video-Compression

This learning resource starts from a theoretical minimal-bandwidth setting for audio and video communication, e.g. as used in video conferencing.

GIF Animation - Speaking
GIF Animation - Silence
Video Interview - Analysis and identification of frames for a GIF animation

Learning Task

  • (Size of Audio and Video Streams) Calculate the size of a 1 min stream of an uncompressed HD video with 25 frames per second and of an uncompressed audio stream stored in the WAV format in CD quality. Estimate the size of an uncompressed movie of 90 min length.
  • (Compression and Loss of Information) A compression algorithm can be applied to the audio and/or the video stream. The compression rate depends on the source video and the source audio.
    • Record an uncompressed audio file (length approx. 1 min) and store it in the WAV format, e.g. with Audacity, and record a video of the same length with VLC player or the OBS screencast software. Compress the 1 min uncompressed audio and/or video stream into different formats (MP3, OGG for audio; OGV, MP4 for video), e.g. with the OpenSource software AviDemux or VirtualDub, and calculate the compression rate of your recording (e.g. 0.1 means that the compressed file requires 10% of the size of the uncompressed audio-video source).
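The size calculation and the compression rate from the two tasks above can be sketched in a few lines of JavaScript. This is a minimal sketch under stated assumptions: the task does not fix a resolution or colour depth, so "HD" is assumed here to mean 1920x1080 pixels with 3 bytes per pixel, and CD quality is taken as 44,100 samples per second, 16 bit, stereo. The function names are chosen for illustration only.

```javascript
// Assumptions (not stated in the task): "HD" = 1920x1080 pixels with
// 24 bits (3 bytes) per pixel; CD quality = 44,100 samples per second,
// 16 bits (2 bytes) per sample, 2 channels.
const VIDEO = { width: 1920, height: 1080, bytesPerPixel: 3, fps: 25 };
const AUDIO = { sampleRate: 44100, bytesPerSample: 2, channels: 2 };

// Size in bytes of an uncompressed video stream of the given length.
function videoBytes(seconds, v = VIDEO) {
  return v.width * v.height * v.bytesPerPixel * v.fps * seconds;
}

// Size in bytes of uncompressed PCM audio (WAV payload without the header).
function audioBytes(seconds, a = AUDIO) {
  return a.sampleRate * a.bytesPerSample * a.channels * seconds;
}

// Compression rate as defined in the task: compressed size / uncompressed size.
function compressionRate(compressedBytes, uncompressedBytes) {
  return compressedBytes / uncompressedBytes;
}

const minute = videoBytes(60) + audioBytes(60);
const movie = videoBytes(90 * 60) + audioBytes(90 * 60);
console.log(`1 min:  ${(minute / 1e9).toFixed(2)} GB`);
console.log(`90 min: ${(movie / 1e9).toFixed(1)} GB`);
```

Under these assumptions the video dominates: one minute of video is about 9.3 GB, while one minute of audio is only about 10.6 MB. Replace the constants with the properties of your own recording before comparing with the file sizes you measure.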
  • (GIF Animation) Record a 1 min video of yourself talking and decompose it into single frames (i.e. a single image at a specific time of the video), or use the OGV video on the right and select 20 frames of the video to be used for a GIF animation.
    • Create a short 2D animation with 10 frames per second as a GIF animation showing yourself talking. Create the animation e.g. with the OpenSource software Pencil2D and save the file as talking.gif.
    • Create another 2D animation in which your mouth is closed and you move your head slightly, and store it as the file silence.gif.
    • Place the GIF images in an offline HTML web page and swap the image with a key press event or a button. Analyze the potential of this concept for audio-video compression in low-bandwidth environments in rural areas and for communication environments in which network traffic must be kept minimal in order to share the collaborative resource of bandwidth in a sustainable way.
    • You can also analyze the video on the right to create those GIF animations.
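The image swap for the two GIF animations can be sketched as follows. The file names talking.gif and silence.gif come from the task; the element id "speaker" and the choice of the space bar as the toggle key are assumptions made for this sketch.

```javascript
// File names from the task above.
const ANIMATIONS = { talking: "talking.gif", silence: "silence.gif" };

// Return the state that follows the current one (a simple toggle).
function nextState(state) {
  return state === "talking" ? "silence" : "talking";
}

// Browser part: only runs when a DOM is available.
// Assumes the page contains <img id="speaker" src="silence.gif">.
if (typeof document !== "undefined") {
  let state = "silence";
  document.addEventListener("keydown", (event) => {
    if (event.key !== " ") return; // the space bar toggles the animation
    state = nextState(state);
    document.getElementById("speaker").src = ANIMATIONS[state];
  });
}
```

In a conferencing setting the key press would be replaced by a signal such as silence detection, so only a single bit of state ("talking" or "silence") has to be transmitted once both GIF files are available on the client.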
  • (Wiki2Reveal and Speech Synthesis) Analyze Wiki2Reveal and the audio comments.
    • How can speech synthesis with a locally assigned speech synthesis profile emulate the voice of a teacher, so that the learning material selected by the teacher in Wikiversity is presented in a voice similar to the teacher's?
    • Compare compressing audio comments for a slide in Wiki2Reveal and storing them in Wikimedia Commons with storing only the text of an audio comment for Wiki2Reveal slides. Identify the benefits for editing and improving speech synthesis audio comments and identify possible drawbacks of that approach.
  • (Video conferencing) A video conferencing system has to distribute the audio of the speakers between the participants of the online meeting. Compare the 3 steps
    • Speech Recognition to sequence of words
    • Transmission of the sequence of words (instead of audio signal)
    • Speech Synthesis for generation of a profile-specific audio stream on the client based on the sequence of words.
with the following approach:
    • Phoneme Recognition to sequence of phonemes
    • Transmission of the sequence of phonemes (instead of audio signal)
    • Audio Synthesis for generation of a profile-specific audio stream on the client based on the sequence of phonemes.
  • Benefits and Drawbacks of the Approaches:
    • What are the benefits of phoneme encoding of speech input in comparison to speech recognition of a sequence of spoken words?
    • What are the requirements for generating a user-specific Audio Synthesis profile?
    • A user-specific Audio Synthesis Profile (uAS-Profile) would be generated for the user and, with the consent of the user, transmitted to the clients at the very beginning of the OpenSource video conference, e.g. in a low-bandwidth environment. Discuss alternatives such as default Audio Synthesis Profiles (for a female, male, child voice, ...) that can be used if the user does not consent to sharing a user-specific AS-Profile with all clients. Include the standard compressed audio transmission in your comparison.
  • (Risk Management for Audio Connections) In risk management and in the implementation of fallback methods, a priority of methods is defined, and if one method fails, a fallback method is used. Identify a priority of methods for the audio transmission and discuss how these fallback methods can be introduced into Open Source video conferencing systems for testing and scientific analysis of the resilience of the whole audio connection if the bandwidth is decreasing or the connection is interrupted for a fraction of a second. Start with
    • Audio Stream high quality,
    • Audio Stream medium quality,
    • Audio Stream low quality,
    • Encoded Stream of Phonemes,
    • Encoded Stream of recognized text,
    • just talking.gif and silence.gif of a speaker, switched by silence detection in the audio stream or by silence in the stream of encoded phonemes.
Explain how this concept can be applied to the compression and resilience of the video stream. Start with a high quality of the video stream and end with the GIF animations for silence and talking. Discuss also levels in between, e.g. the mapping of a phoneme pattern to the facial expression for a specific phoneme. Explain how phonemes can be encoded in a string or in a sequence of submitted strings that also defines the facial expression of the user in the video.
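The priority list with fallback can be sketched as a lookup over an ordered table. The method names and the minimum bit rates below are illustrative assumptions for this sketch, not measured values and not part of any existing conferencing system.

```javascript
// Priority list from highest to lowest quality. The minimum bit rates
// (in kbit/s) are illustrative assumptions, not measured values.
const AUDIO_METHODS = [
  { name: "audio-high", minKbps: 128 },
  { name: "audio-medium", minKbps: 64 },
  { name: "audio-low", minKbps: 16 },
  { name: "phoneme-stream", minKbps: 1 },
  { name: "recognized-text", minKbps: 0.5 },
  { name: "gif-only", minKbps: 0 },
];

// Pick the first (highest-priority) method whose bandwidth requirement
// is met; fall back to the last entry if none fits.
function selectMethod(availableKbps, methods = AUDIO_METHODS) {
  const fit = methods.find((m) => availableKbps >= m.minKbps);
  return fit ? fit.name : methods[methods.length - 1].name;
}

console.log(selectMethod(200)); // audio-high
console.log(selectMethod(20));  // audio-low
console.log(selectMethod(0));   // gif-only
```

Calling selectMethod again whenever the measured bandwidth changes yields the fallback behaviour; a similar table could be defined for the video stream, ending with the two GIF animations.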
  • (Phoneme Video Image Mapping PVIM) In the following learning task we create an offline WebApp for PVIM that has at the very beginning just 5 images of your face. In the first step we implement HTML and JavaScript code that shows just one image of your face when the key 1, 2, 3, 4 or 5 is pressed.
    • This can be done with IMG tags in HTML and an event handler in JavaScript that checks which key is pressed. E.g. if you press "1" on the keyboard, the JavaScript code hides the images 2, 3, 4 and 5 and shows the image 1. That requires only a few lines of code.
    • (Create Images of your face for Phonemes) Make sure that your head does not move much in the webcam image when you take the snapshots of your face for the 5 images. Only change your facial expression for speaking a specific phoneme. Besides your closed mouth (Image 1), say the word "Sustainability" in front of a mirror and select just 4 other expressions of your face to encode "Sustainability". Reducing the number of facial expressions needed for speaking words reduces the size of a user-specific Video Profile (uV-Profile). Furthermore, watch yourself in the mirror and identify how many facial expressions you would actually need for saying the word "Sustainability".
    • Now record an audio sequence for a single word, e.g. "Sustainability", with the OpenSource software Audacity at
      • normal speed,
      • slow speed,
      • fast speed
and look at the curve of the audio recording in Audacity. Can you identify the phonemes of the word "Sustainability" in the curve? What is the challenge for Speech Recognition if somebody else is speaking the same word "Sustainability"?
    • Play back the audio recording of the word "Sustainability" in a loop, use your WebApp with the five images for the slowly spoken audio sample, and try to encode your own audio recording with your 5 facial expressions. Analyze the audio recording in Audacity and identify at which time stamp each facial expression (1, 2, 3, 4, 5) should be shown by your WebApp.
    • Assume you have just one sequence of phonemes encoded in a string. Explain how a video conferencing system could use that approach in a low-bandwidth environment and send just the phoneme sequence instead of audio and video streams.
    • Lossy compression of audio and video streams always implies a loss of information (see Fourier Analysis of an audio stream in the MP3 or OGG format, or MP4 or OGV compression of a video stream). The phoneme encoding of this simple example is scientifically less advanced than Fourier Analysis and the removal of frequencies for compression, but it can be implemented with OpenSource software for low-bandwidth networks in remote areas, with joint video and audio compression into a single phoneme stream.
    • The participants in a video conference should be aware of the fact that they are not receiving the real video stream and audio stream from the speaker but only the
      • Audio Synthesis of a phoneme sequence and
      • Video Frame sequence of pre-recorded facial expressions.
Explain why a participant of a video conference might want to see at least a stagnant image or a GIF animation of the speaker instead of having just the audio stream available if the bandwidth is not sufficient.
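The PVIM WebApp described above can be sketched in two parts: a key handler that shows one of the 5 face images, and a decoder for a string that encodes which image is shown at which time stamp. The element ids "face1" to "face5" and the "image@milliseconds" string syntax are hypothetical choices for this sketch; the time stamps in the example are placeholders for the values you identify in Audacity.

```javascript
// Map a pressed key ("1".."5") to an image number, or null for other keys.
function imageForKey(key, imageCount = 5) {
  const n = Number(key);
  return Number.isInteger(n) && n >= 1 && n <= imageCount ? n : null;
}

// Parse a schedule string such as "1@0 3@150 2@400" into
// [{ image: 1, atMs: 0 }, ...]. The "image@milliseconds" syntax is a
// hypothetical encoding chosen for this sketch.
function parseSchedule(text) {
  return text.trim().split(/\s+/).map((token) => {
    const [image, atMs] = token.split("@").map(Number);
    return { image, atMs };
  });
}

// Return the image that should be visible at time tMs
// (the schedule is assumed to be sorted by time).
function imageAt(schedule, tMs) {
  let current = schedule[0].image;
  for (const step of schedule) {
    if (step.atMs <= tMs) current = step.image;
  }
  return current;
}

// Browser part: assumes <img id="face1"> ... <img id="face5"> exist.
function showImage(n, imageCount = 5) {
  for (let i = 1; i <= imageCount; i++) {
    document.getElementById("face" + i).hidden = i !== n;
  }
}

if (typeof document !== "undefined") {
  document.addEventListener("keydown", (event) => {
    const n = imageForKey(event.key);
    if (n !== null) showImage(n);
  });
}
```

With a parsed schedule, a timer in the browser can call showImage(imageAt(schedule, elapsedMs)) while the audio plays, so that a short string is enough to drive both the facial expressions and, with a synthesis profile, the audio.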
  • Analyze the compression ratio for this introductory example of a phoneme sequence in comparison to the bandwidth requirements of a compressed audio-video stream in a video conference.
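A rough estimate of this ratio can be sketched as follows. All numbers are assumptions for illustration: a speaking rate of 12 phonemes per second, 8 bits per encoded phoneme, and a compressed audio-video conference stream of 1 Mbit/s. The one-time transfer of the user-specific profiles (uAS-Profile, uV-Profile) is not included.

```javascript
// Bit rate of a phoneme stream for a given speaking rate and encoding size.
function phonemeBitsPerSecond(phonemesPerSecond, bitsPerPhoneme) {
  return phonemesPerSecond * bitsPerPhoneme;
}

// Ratio of the phoneme stream to a reference stream (smaller is better).
function ratioAgainst(streamBitsPerSecond, referenceBitsPerSecond) {
  return streamBitsPerSecond / referenceBitsPerSecond;
}

// Assumed values, to be replaced by your own measurements.
const phonemeRate = phonemeBitsPerSecond(12, 8); // 96 bit/s
const conferenceRate = 1000000;                  // 1 Mbit/s, assumed
console.log(ratioAgainst(phonemeRate, conferenceRate));
```

Under these assumptions the phoneme stream needs roughly one ten-thousandth of the bandwidth of the compressed stream, which has to be weighed against the much larger loss of information.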
  • (Transport) Audio-video compression is performed to reduce the size of audio and video streams for storage and transmission. What are the challenges for transport and latency? Discuss also the communication between astronauts in space and the latency of the signal, e.g. between Mars and Earth. Latency can be induced by large sizes of audio-video streams. Explain why audio and video compression cannot solve the latency problem for a theoretical audio-video communication between astronauts on Mars and their relatives on Earth.
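The two contributions to the delay in the transport task can be separated in a small sketch. The Earth-Mars distances are approximate values (about 54.6 million km at the closest approach and about 401 million km at the farthest), and the function names are chosen for illustration only.

```javascript
const SPEED_OF_LIGHT_KM_S = 299792.458;

// One-way signal travel time; independent of the amount of data sent.
function propagationSeconds(distanceKm) {
  return distanceKm / SPEED_OF_LIGHT_KM_S;
}

// Time needed to push the data onto the link; this is the only term
// that compression can shrink.
function transmissionSeconds(bytes, bitsPerSecond) {
  return (bytes * 8) / bitsPerSecond;
}

// Total one-way delay for a chunk of data.
function totalDelaySeconds(distanceKm, bytes, bitsPerSecond) {
  return propagationSeconds(distanceKm) +
    transmissionSeconds(bytes, bitsPerSecond);
}

// Approximate Earth-Mars distances (assumptions for this sketch).
const MARS_CLOSEST_KM = 54.6e6;
const MARS_FARTHEST_KM = 401e6;

console.log((propagationSeconds(MARS_CLOSEST_KM) / 60).toFixed(1), "min");
console.log((propagationSeconds(MARS_FARTHEST_KM) / 60).toFixed(1), "min");
```

Even if compression reduced the data to zero bytes, the propagation term of roughly 3 to 22 minutes one way would remain, which rules out a live conversation.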

Wrap up for the Learning Resource

This learning resource has shown

  • how image swapping for facial expression in a very simple HTML WebApp can be used to swap images by key press events,
  • how phonemes can be mapped to facial expressions with the mirror example, in which learners watch their own facial expressions for the single word "Sustainability",
  • how to analyze a recording of the word "Sustainability" in an Open Source audio recording software and identify phonemes in the curve of the audio sample,
  • how very simple GIF animations for "talking" and "silence" can replace a stagnant image of the speaker in a low-bandwidth video conference,
  • how GIF animations for "talking" and "silence" or even a stagnant image can be used during network problems,
  • the basic principles of phoneme sequence transmission in comparison to standard speech recognition and speech synthesis,
  • the basic principle that lossy compression implies a loss of information,
  • that the loss of information should be made transparent to the recipient of an audio-video stream, e.g. with an avatar icon, especially for the basic example of encoding facial expressions for phonemes by images.

Software

Software used for the learning resource:

  • Audacity
  • Firefox browser or any other browser that is able to run an offline WebApp
  • AppLSAC - WebApps for Wikiversity learning resources.

See also