Applied Programming/Strings

This lesson introduces strings and string processing.

Objectives and SkillsEdit

Objectives and skills for this lesson include:[1]

  • Evaluate an expression to identify the data type assigned to each variable
    • Identify str, int, float, and bool data types
  • Perform data and data type operations
    • Convert from one data type to another type; construct data structures; perform indexing and slicing operations
  • Determine the sequence of execution based on operator precedence
    • Assignment; Comparison; Logical; Arithmetic; Identity (is); Containment (in)
  • Select the appropriate operator to achieve the intended result
    • Assignment; Comparison; Logical; Arithmetic; Identity (is); Containment (in)

ReadingsEdit

  1. Wikipedia: String (computer science)
  2. Wikipedia: Run-length encoding
  3. Wikipedia: Escape sequence

MultimediaEdit

  1. YouTube: Python Tutorial: Slicing Lists and Strings
  2. YouTube: Python Tutorial: String Formatting
  3. YouTube: Python Simple String Manipulation
  4. YouTube: Strings, Escape Sequences and Comments : Python Tutorial #4
  5. YouTube: Implement Run-length Encoding of Strings
  6. YouTube: Iterating Over a Python String
  7. YouTube: Python Tutorial for Beginners 7: Loops and Iterations - For/While Loops
  8. YouTube: Computer Programming - Strings

ExamplesEdit

ActivitiesEdit

  1. Review Wikipedia: Run-length encoding. Create a program that asks the user for an input string of alphabetic characters. Convert the string to a run-length encoded (RLE) string of characters and numbers. Use the compressed format, where a single instance of a character has no count. For example, AAABCC would be A3BC2. Use a separate function for string processing. Avoid using global variables by passing parameters and returning results. Include appropriate data validation and parameter validation. Add program and function documentation, consistent with the documentation standards for your selected programming language.
  2. Enhance the RLE program above to check to see if a string has numbers in it. If so, it is already in RLE format. Decode RLE strings and display the results. Strings that have no encoding should be encoded as above. Use a separate function for decoding. Validate parameters and update program and function documentation, consistent with the documentation standards for your selected programming language.
  3. Review Wikipedia: Escape sequence. Enhance the RLE program above by allowing a # symbol to be used as an escape sequence, indicating that the following number is a number character rather than an encoding count. Use a pair of # symbols (##) to indicate a # character. This change should allow any input sequence to be encoded. Enhance the decoding function to support the new format. Validate parameters and update program and function documentation, consistent with the documentation standards for your selected programming language.

Lesson SummaryEdit

  • Run-Length Encoding is a form of lossless data compression in which runs of data are stored as a single data value and count.[2]
  • Escape Sequence is a combination of characters that has a meaning other than the literal characters contained therein; it is marked by one or more preceding (and possibly terminating) characters.[3]
  • A string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. Strings are widely used in almost all programming languages as they are quite powerful. [4]
  • A character in a programming language is the smallest unit of textual information.[5]
    • Strings are lists or containers of individual characters 'strung' together, integrated with other useful functionality (like the ability to 'find' a character or a sub-string).[5]
  • Characters may be alphabetic letters, numeric digits, punctuation marks and other symbols, whitespace, or 'control characters'.[5]
    • Control characters are not printed like other symbols; instead, they communicate a more abstract idea, like the signaling of a newline or the ringing of a bell.[6]
  • In some lower-level languages like C++ or Java, individual characters are of a specific data type, usually termed 'char'.[5]
  • Since information stored in a computer is represented in binary format, the need for a standardized table, pairing numbers with characters, became readily apparent. Thus, ASCII, the American Standard Code for Information Interchange, was born.[7]
  • ASCII is a near-ubiquitous example of an encoding scheme or a character set, the systematic mapping of code points to symbols of language.[7]
  • Originally, ASCII only defined 128 characters, encompassing the entire alphabet (both lower- and upper-case), the digits (0-9), and other characters of importance.[7]
    • It was subsequently expanded through various unofficial revisions in order to make full use of the 28 or 256 possibilities given a one byte series of data.[8]
    • Although still relevant, ASCII is being phased out by Unicode, a superset of ASCII that implements universal (not just English-based) symbols by extending the original set of 128 characters.[9]
  • String literal is a quoted sequence of characters (formally "bracketed delimiters"), as in x = "foo", where "foo" is a string literal with value foo – the quotes are not part of the value.[10]
  • There are numerous alternate notations for specifying string literals and the exact notation depends on the individual programming language.[10]
  • An empty string is literally written by a pair of quotes with no character at all in between.[10]
  • The string must begin and end with the same kind of quotation mark and the type of quotation mark may give slightly different semantics.[10]
  • A number of languages provide for paired delimiters, where the opening and closing delimiters are different.[10]
    • For example: “Hi There!”, ‘Hi There!’, „Hi There!“, «Hi There!»
  • String literals can contain literal newlines, spanning several lines. Alternatively, newlines can be escaped, most often as \n.[10]
  • Python has a special form of a string, designed for multiline literals, called triple quoting. These literals strip leading indentation are especially used for inline documentation, known as docstrings.[10]
  • Strings are a data type typically implemented as an array data structure of bytes. [4]
  • In general, there are two types of strings, fixed and variable-length strings.[4]
    • Fixed-length strings contain a fixed maximum length to be determined at compile time and the same amount of memory will be used.[4]
    • Variable-length strings do not have a fixed length and can vary the amount of memory to be used at runtime. Variable-Length strings are also more common in modern programming languages.[4]
  • Python string objects are immutable, i.e., a state of a string can't be modified once it's created. On the other side, string variables are changeable. A string variable can be assigned a new value, but this action won't affect the original string.[11][12]
  • There are a variety of algorithms for processing strings:[4]
    • String searching algorithms for finding a given substring or pattern
    • String manipulation algorithms
    • Sorting algorithms
    • Regular expression algorithms
    • Parsing a string
    • Sequence mining
  • Most programming languages offer string functions in order to manipulate a string. Refer to String functions for a list of string functions used in various languages.
  • String indexing and slicing ...
    • Strings and substrings can be accessed by index, with the first character receiving an index of 0, and all others with an index incremented from that of the previous character.
    • All characters including white spaces are given an index.
    • Though implementation varies between languages, various functions such as slicing, and concatenating can be done by utilizing index. Refer to String functions

Key TermsEdit

concatenation
When a sequence of symbols in string S is joined/followed by the sequence of characters in string T, and is denoted string ST.[4]
escape sequence
An escape sequence is a sequence of characters that does not represent itself when used inside a character or string literal, but is translated into another character or a sequence of characters that may be difficult or impossible to represent directly. [13]
fixed-length strings
Fixed length strings have a fixed, maximum length to be determined at compile time and use the same amount of memory whether the maximum is needed or not.[4]
prefix
A string A = a1, a2, …an has a prefix  = a1, a2, … am when m ≤ n. A proper prefix of the string A would not be equal to itself (0 ≤ m < n).[4]
reversal
The reverse of a string is a string with the same symbols but in reverse order.[4]
rotation
A string s = uv is said to be a rotation of t if t = vu.[4]
run-length encoding
run-length encoding (RLE) is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run.[14]
string
Traditionally a sequence of characters, either as a literal constant or as some kind of variable.[4]
string datatype
A datatype modeled on the idea of a formal string.[4]
string literal
When a string appears literally in source code, also known as an anonymous string.[4]
substring
Occurs when one string is a prefix of a suffix of an original string, and equivalently a suffix of a prefix.[4]
suffix
Any substring of an original string that includes the original string’s last letter, including itself. A proper suffix of a string is not equal to/the same as the string original string itself.[4]
variable-length strings
Variable-length strings have a length that is not arbitrarily fixed and can use varying amounts of memory depending on the actual requirements at run time.[4]

See AlsoEdit

ReferencesEdit