C-Strings | Shallow Dive Blog

Table of Contents

Basics

The C programming language has a set of functions implementing operations on strings (character/byte strings) in its standard library:

#include <stdio.h>  // strlen, strcpy, strtok, strstr
#include <string.h> // printf

int main() {
    char str1[] = "Hello";
    char text[] = "one,two,three";
    // strlen - get length of string
    printf("Length of str1: %zu\n", strlen(str1)); // 5
    // strcpy - copy string into buffer
    char buffer[50];
    strcpy(buffer, str1); // buffer now contains "Hello"
    printf("Copied: %s\n", buffer);
    // strcat - concatenate two strings
    strcat(buffer, " ");   // buffer: "Hello "
    strcat(buffer, "World");  // buffer: "Hello World"
    printf("Concatenated: %s\n", buffer);
    // strtok - tokenize a string by commas
    char *token = strtok(text, ",");
    while (token != NULL) {
        printf("Token: %s\n", token);
        token = strtok(NULL, ",");
    }
    // strstr - search for a substring
    char *found = strstr(buffer, "World");
    if (found) {
        printf("Found substring at position: %ld\n", found - buffer);
    }
    // ...
}

Length of str1: 5
Copied: Hello
Concatenated: Hello World
Token: one
Token: two
Token: three
Found substring at position: 6

There is no real string data type but instead char[] are used (character arrays). Also while other languages like Java store the length of an array C instead has the convention to null-terminate every char[] instead:

a string of $n$ characters is represented as an array of $n + 1$ elements
each character is usually a numeric value that maps to a character (e.g. ASCII codes)
the last character is a “NULL character” with numeric value $0$ (no other character has this numeric value!)

Meaning the line char str1[] = "Hello" actually stores 6 chars where the last one is NULL:

#include <stdio.h>  // strlen
#include <string.h> // printf

int main() {
    char str1[] = "Hello";
    // Reported length of string
    printf("Length of str1: %zu\n", strlen(str1)); // 5
    // Memory layout
    for (size_t i = 0; i < strlen(str1) + 1; i++) {
        printf("str[%zu] = '%c' (ASCII %d)\n", i, str1[i], (unsigned char)str1[i]);
    }
}

Length of str1: 5
str[0] = 'H' (ASCII 72)
str[1] = 'e' (ASCII 101)
str[2] = 'l' (ASCII 108)
str[3] = 'l' (ASCII 108)
str[4] = 'o' (ASCII 111)
str[5] = '' (ASCII 0)

Hardware

the smallest addressable memory unit in most hardware nowadays is a byte $=8$ bits
- each bit can store $2$ states: $1$/$0$
- $n$ bits can store $2^n$ possible combinations (e.g. ${\color{red}2}$ bits: $2^{\color{red}2} = 4$ [00, 01, 10, 11])
$32$-bit, $64$-bit, … refer to CPU architectures (and corresponding operating systems)
- Registers INSIDE CPUs are used to hold immediate data:
  - instructions
  - memory addresses
  - data for calculations
- A $n$-bit CPU has $n$-bit wide registers (e.g. $64$-bit CPU has $64$ bit $= 8$ byte wide registers)
  - This length is often called the natural unit of data a CPU processes at a time: a word / size_t
    - Linux: unsigned long
    - Windows: SIZE_T
- Since memory access instructions (e.g. Assembly mov/load/store) of the CPU ISA (Instruction Set Architecture) are hard-coded to use a register as memory address
  - a $32$ bit CPU can address a theoretic maximum of $2^{32}$ byte $\approx 4$ gigabytes
  - a $64$ bit CPU can address a theoretic maximum of $2^{64}$ byte $\approx 18$ exabytes ($= 18000$ terabytes)

Hardware Implementation C Strings

on a hardware level all characters are stored as:
- consecutive (in memory)
- fixed width (all characters have the same bit length)
  $$ \overbrace{00001010}^{\text{character } 1}\overbrace{00000010}^{\text{character }2}\dots\overbrace{00000000}^{\text{last character }} $$
- unsigned binary integers (each character represents a positive number)
  $$ \begin{aligned} 00001010_2 &= 2^0*0 + 2^1 * 1 + 2^2 * 0 + 2^3 * 1 + 2^4 * 0 + \dots \\ &= 2_{10} + 8_{10} = {10}_{10} \end{aligned} $$
- terminated at the end by an additional zero code unit (all code units before are non-zero)
  $$ \begin{aligned} 00000000_2 &= 2^0*0 + 2^1 * 0 + \dots \\ &= 0_{10} \end{aligned} $$
when speaking of strings normally code units of type char/wchar_t are meant
- a char is $8$ bits $=1$ byte (on most modern machines)
- a wchar_t is $16/32$ bits $=$ $2/4$ byte

Memory (in C)

There are 3 memory areas:

Registers: (in CPU, managed by the CPU, generally not accessible)
- Extremely fast but tiny storage inside the CPU
- Each register has the width of the bit count of the CPU ($32$ bit = $32$ bit registers)
- Example a modern x86-64 $64$-bit CPU like the AMD Ryzen 7 7800X3D has per Instruction Set Architecture (ISA):
  - $16$ general purpose registers where each one is $64$-bits wide: integer arithmetic, pointer storage, function arguments, local storage, …
  - Some special registers like the instruction pointer register: tracks the address of the next instruction to execute
  - SIMD/floating-point registers: AVX-$512$ allows up to $32 \times 512$-bit vector registers
  - Some legacy registers that remain largely unused
  - BUT in reality on the micro-architecture level there are hundreds of more registers to enable out-of-order execution, pipelining, …
Stack: (in RAM, managed by CPU/OS)
```
void f() {
    int arr[1024]; // 4 KB on the stack
} // automatically freed when function returns
```
- A contiguous block of memory that (usually limited, e.g. $8$MB per thread) stores local variables and function call information
- Memory is allocated when a function is called and freed when it returns
- If you exceed it the CPU throws a page fault/the OS throws a stack overflow exception
Heap: (in RAM, managed by OS)
```
void f() {
    int* arr = malloc(1024 * sizeof(int)); // 4 KB on heap
    // it actually allocates some additonal metadata
    // ...
    free(arr); // must be manually released
    // free knows through the metadata "how much" it needs to free
    // free also frees the metadata so it's only possible to call it once
}
```
- A region of memory used for dynamic allocation
- Memory is manually managed:
  - First the memory is requested with e.g. malloc
  - Then it must be explicitly released with e.g. free()
- The heap is larger and more flexible than the stack, but access is generally slower because for example:
  - Sometimes memory is not one contiguous block
  - Fragmentation creates holes and finding a suitable free block can take time during allocation

Demo for metadata on the heap on Linux with malloc and glibc:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main() {
    size_t n = 10; // number of ints to allocate
    int* arr = malloc(n * sizeof(int)); // allocate n integers on heap
    printf("Pointer returned by malloc: %p (requested %zu bytes)\n",
           (void*)arr, n * sizeof(int));

    // ===== Accessing metadata =====
    // WARNING: This is implementation-specific!
    // In glibc, the size of the allocated block is usually stored
    // just before the pointer returned by malloc.
    // Technically this is undefined behavior in C and should only
    // be used for experimentation.
    size_t* meta = (size_t*)arr - 1; // one word before arr
    printf("glibc: Raw metadata (first word, contains size + flags): 0x%zx\n",
           *meta);
    // mask off the lowest 3 bits used for flags to get the total memory size
    // metadata + requested allocation size
    printf("       -> size: %zu bytes\n", *meta & ~0x7);

    free(arr); // free memory based on the metadata
    return 0;
}

Pointer returned by malloc: 0x562b7de26310 (requested 40 bytes)
glibc: Raw metadata (first word, contains size + flags): 0x31
       -> size: 48 bytes

Demo for reaching the stack limit with a big stack allocation:

#include <stdio.h>        // printf
#include <stdlib.h>       // malloc
#include <sys/resource.h> // stack size limit

int main() {
    // Stack allocation have limits
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0) {
        printf("Stack size limit: %ld bytes (%.2f MB)\n",
               rl.rlim_cur, rl.rlim_cur / (1024.0 * 1024));
    } else {
       fprintf(stderr, "Error: Unable to get stack size limit.\n");
       return 1;
    }
    // E.g. with a 8MB limit for the stack allocating more triggers a stack overflow
    int bigArray[10 * 1024 * 1024]; // 40 MB -> Triggers a Segmentation fault on Linux

    // Dynamic array (Heap allocated)
    // Either big or dynamic sized arrays can be allocated in a region of RAM
    size_t elements = 500;
    size_t size = elements * sizeof(int);
    printf("Allocate memory for %d elements of %d byte size => %zu bytes\n",
           elements, sizeof(int), size);
    int *arr = malloc(size);
    if (arr == NULL) {
       fprintf(stderr, "Error: Memory allocation of %zu bytes failed.\n", size);
       return 1;
    }
    // ... access/change is the same
    free(arr); // THE ALLOCATED RAM REGION MUST BE MANUALLY FREED
    // THIS CAN ONLY BE DONE ONCE AND arr IS NOT ALLOWED TO BE ACCESSED AGAIN!
}
// Stack size limit: 8388608 bytes (8.00 MB)
// Allocate memory for 500 elements of 4 byte size => 2000 bytes

Pointers in C

Stack Allocation
```
int x = 10;
int *p = &x; // p points to stack memory
// or
int arr[1024];
```
- Pointers can reference variables allocated on the stack
- Lifetime ends when the function scope ends, all memory is automatically released, the memory location is now hot garbage that crashes your program if you look at it!

Heap Allocation

int *p = malloc(sizeof(int));
*p = 20; // p points to heap memory
// or
int* arr = malloc(1024 * sizeof(int));
// ...
free(p);
free(arr);

Pointers can reference dynamically allocated memory using e.g. malloc
Lifetime ends when the memory is manually freed with e.g. free()

Immutable String Literals (char*)
```
char *s = "hello"; // points to immutable string
```
- String literals are stored in read-only memory (implementation-defined)
  - Memory efficiency
  - Safety from accidental modification
  - Faster access
  - Functions that take const char * can safely accept string literals without copying
- Modifying the contents leads to undefined behavior!
- Use const char * to make intent explicit (and help the compiler):
```
const char *s = "hello";
```

Demo for the different pointer types:

#include <stdio.h>  // printf
#include <stdlib.h> // malloc, free
#include <string.h> // strcpy

void print_array(const char* name, const char* chars) {
    printf("%s [code units]: ", name);
    const char *separator = "";
    size_t counter = 0;
    while (*chars) {
        printf("%s%d", separator, *chars);
        separator = ",";
        chars++;
    }
    printf("%s%d", separator, *chars);
    printf("\n");
}

int main() {
    // 1. Stack Allocation (automatic memory management)

    // > Pointer to string literal (read-only! | they are always immutable!)
    char* literalPtr = "Literal String";  // Automatically inserts a NULL at the end
    //literalPtr[0] = 'l';                // NOT POSSIBLE! IMMUTABLE!

    // To indicate that you always write 'const'
    const char* CONST_LITERAL_PTR = "Immutable Literal String";

    // > Static string (can be updated | mutable)
    char staticStr[] = "Static String";
    staticStr[0] = 's';

    printf("Literal Pointer:         %s\n", literalPtr);
    printf("Literal Pointer [CONST]: %s\n", CONST_LITERAL_PTR);
    printf("Static String:           %s\n", staticStr);
    print_array("literalPtr", literalPtr);
    print_array("staticStr ", staticStr);

    // 2. Heap Allocation (manual memory management)

    // Allocate memory for a string on the heap
    size_t len = strlen("Heap String") + 1;   // +1 for NULL terminator
    char* heapStr = (char*)malloc(len);
    if (!heapStr) {
        perror("malloc failed");
        return 1;
    }

    // Copy content into heap memory (mutable)
    strcpy(heapStr, "Heap String");
    heapStr[0] = 'h'; // modifying is safe as long as index < strlen

    printf("Heap String:             %s\n", heapStr);
    print_array("heapStr   ", heapStr);

    // Clean up heap memory
    free(heapStr);
}

Literal Pointer:         Literal String
Literal Pointer [CONST]: Immutable Literal String
Static String:           static String
literalPtr [code units]: 76,105,116,101,114,97,108,32,83,116,114,105,110,103,0
staticStr  [code units]: 115,116,97,116,105,99,32,83,116,114,105,110,103,0
Heap String:             heap String
heapStr    [code units]: 104,101,97,112,32,83,116,114,105,110,103,0

Character encoding

To consistently map (encode) binary numbers represent by bits also known as code points to characters/symbols there are multiple standards:

ASCII (American Standard Code for Information Interchange)

In 1963 the ASCII character encoding defined $2^7=128$ symbols consisting of $33$ unprintable symbols and $95$ printable characters.

Inspired by type/telewriter layout
- A-Z, a-z, 0-9, punctuation marks
- the ordering of characters, such as the uppercase letters preceding lowercase ones
- control characters like
  - Carriage Return (CR) and Line Feed (LF) $\Rightarrow$ Creates new line by moving the printer back to first column and to the next line

They are represented by numbers from $[0,127]$:

 0 NUL    16 DLE    32      48 0    64 @    80 P    96 `   112 p
 1 SOH    17 DC1    33 !    49 1    65 A    81 Q    97 a   113 q
 2 STX    18 DC2    34 "    50 2    66 B    82 R    98 b   114 r
 3 ETX    19 DC3    35 #    51 3    67 C    83 S    99 c   115 s
 4 EOT    20 DC4    36 $    52 4    68 D    84 T   100 d   116 t
 5 ENQ    21 NAK    37 %    53 5    69 E    85 U   101 e   117 u
 6 ACK    22 SYN    38 &    54 6    70 F    86 V   102 f   118 v
 7 BEL    23 ETB    39 '    55 7    71 G    87 W   103 g   119 w
 8 BS     24 CAN    40 (    56 8    72 H    88 X   104 h   120 x
 9 HT     25 EM     41 )    57 9    73 I    89 Y   105 i   121 y
10 LF     26 SUB    42 *    58 :    74 J    90 Z   106 j   122 z
11 VT     27 ESC    43 +    59 ;    75 K    91 [   107 k   123 {
12 FF     28 FS     44 ,    60 <    76 L    92 \   108 l   124 |
13 CR     29 GS     45 -    61 =    77 M    93 ]   109 m   125 }
14 SO     30 RS     46 .    62 >    78 N    94 ^   110 n   126 ~
15 SI     31 US     47 /    63 ?    79 O    95 _   111 o   127 DEL

The reason the symbols are encoded as $7$-bit values was intentional because early computers handled data in bytes which are blocks of 8 bits. The extra, eighth bit was originally reserved for error checking.

ASCII Extensions

This was enough for basic English use but was extended by various countries internationally to $8$-bit values ($2^8=256$ meaning $128$ custom symbols) to support their symbols/punctuation/…

e.g. ISO/IEC 8859 which contains ä, ü, ö, ß, €, Ä, …

Unicode

To solve the problem of multiple character encoding standards (that are different depending on the current locale) Unicode is one table that contains all of them instead of having to switch the encoding depending on the current use case.

Designed in 1988 by Joe Becker at Xerox this standard (where the first $256$ code points mirror the ISO/IEC 8859-1 standard) extends the code points to $16$ bits ($2^{16}=65536$) which should allow to universally encode all modern/future text with one standard: https://symbl.cc/en/unicode-table/.

This was later extended to $32$ bits for supplementary characters in Unicode 3.0 (1999) and nowadays contains even a lot of emotes.

UTF (Unicode Transformation Format)

If you would store each character as $32$ bits this would waste for most characters in English/Latin $2/3$ of $4$ empty bytes.

To support all current Unicode code points but still keep the storage size small a transfer function is used that widens the amount of bytes depending on how big the code point of a single symbol is (variable length characters):

UTF-8 uses $1$ to $4$ byte per code point (indicated by a bit mask)

Maximal compatibility with ASCII since the first $128$ symbol have the same code points

First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
U+0000 (0)	U+007F (127)	`0yyyzzzz`
U+0080 (128)	U+07FF (2047)	`110xxxyy`	`10yyzzzz`
U+0800 (2048)	U+FFFF (65535)	`1110wwww`	`10xxxxyy`	`10yyzzzz`
U+010000 (65536)	U+10FFFF (1114111)	`11110uvv`	`10vvwwww`	`10xxxxyy`	`10yyzzzz`

If the first bit in a byte is $0$ it’s a $1$ byte wide wide symbol
If the first bits in a byte are $110$ it’s a $2$ byte wide wide symbol
…

e.g. ä has the Unicode code point U+00E4 (228, 1110 0100):

// UTF-8 encoding pattern for 2-byte characters:
   110x xxyy   10yy zzzz
// Filling in the bits of 00E4:
// -> 0000 0000 1110 0100
//    wwww xxxx yyyy zzzz
//
//    0 0011     10 0100
// 110x xxyy   10yy zzzz
   1100 0011   1010 0100

UTF-16 uses $2$ or $4$ byte per code point (indicated by a high surrogate)
- Used by the Windows API and by many programming environments such as Java and Qt
- If the first 2 bytes are between U+D800 ($55296$) and U+DFFF ($57343$) this indicates a surrogate pair meaning this code point is combined by this and the next $2$ bytes
UTF-32 uses a fixed width $4$ byte code point
- No transfer necessary since every Unicode character fits into $32$ bit

Using a variable length encoding like UTF-8 breaks existing code for string length and other operations when counting the actual code points!

The bigger size or e.g. resolving the actual encoded symbols with multiple bytes needs to be explicitly counted!

#include <stdio.h>  // printf
#include <string.h> // strlen

const char* ASCII_CONSTANT = "Hello!"; // Only ASCII symbols
const char* UTF8_CONSTANT = "ÄÖÜ€";    // UTF-8 symbols (file is UTF-8 encoded)

size_t utf8_strlen(const char *s) {
    size_t count = 0;
    while (*s) {                   // As long as the byte/char is not NULL
        if ((*s & 0xC0) != 0x80) { // Start of code point if the byte bitmask
            count++;               // isn't 10xxxxxx ['abcd & 1000'='a000']
        }                          // -> '0xC0'(hex)='11000000'(binary)
                                   // -> '0x80'(hex)='10000000'(binary)
                                   // -> the first 2 bits of *s must be !='10'
        s++;                       // Point to the next consecutive byte/char
    }                              // memory location
    return count;
}

int main() {
    printf("ASCII_CONSTANT: %s (length = %d, utf-8 length = %d)\n",
           ASCII_CONSTANT, strlen(ASCII_CONSTANT), utf8_strlen(ASCII_CONSTANT));
    printf("UTF8_CONSTANT: %s  (length = %d, utf-8 length = %d)\n",
           UTF8_CONSTANT, strlen(UTF8_CONSTANT), utf8_strlen(UTF8_CONSTANT));
}

ASCII_CONSTANT: Hello! (length = 6, utf-8 length = 6)
UTF8_CONSTANT: ÄÖÜ€  (length = 9, utf-8 length = 4)

Arrays/Pointer Arithmetic

C always knows thanks to the pointer signature int* array the size of each element (int, e.g. $32$ bits on $64$-bit x86-64) when for example an array of ints is allocated.

Using now arr[index] or *(arr + index) it can thus translate the memory location to the first element stored in the pointer to the correct memory offset of the array element which is called pointer arithmetic:

#include <stdio.h> // printf

void print_array(const char* name, const int integers[], int elements) {
    printf("%s (%d): ", name, elements);
    const char *separator = "";
    for(int j=0; j<elements; j++) {
      printf("%s%d", separator, integers[j]);
      separator = ",";
    }
    printf("\n");
 }

int main() {
    // Static array (fixed size)
    int staticArr[5] = {10, 20, 30, 40, 50}; // specify size + values
    staticArr[0] = 1;                        // change values
    int staticArrAuto[] = {10, 20, 30, 40};  // values determine size
    int staticArrZeros[5] = {1, 2};          // remaining values are 0

    print_array("staticArr", staticArr, 5);
    printf("0: %d, 2: %d\n", staticArr[0], staticArr[2]);
    print_array("staticArrAuto", staticArrAuto, 4);
    print_array("staticArrZeros", staticArrZeros, 5);

    // Pointer Arithmetic
    int *staticArrPtr = staticArr;    // use pointer to access/change values
    printf("[staticArrPtr] 0: %d, 2: %d\n", *staticArrPtr, *(staticArrPtr + 2));
    // This works because the compiler is automatically changing the memory
    // address by 2 * sizeOf(int)
    printf("staticArrPtr                 = %p [pointer]\n",
           (void *)staticArrPtr);
    printf("staticArrPtr + 1             = %p [pointer]\n",
           (void *)(staticArrPtr + 1));
    printf("staticArrPtr + 1             = %p [size_t]\n",
           (size_t)staticArrPtr + 1);
    printf("staticArrPtr + 1*sizeof(int) = %p [size_t]\n",
           (size_t)staticArrPtr + 1 * sizeof(int));
    // THATS WHY IT'S VERY IMPORTANT TO HAVE THE CORRECT POINTER DATA TYPE!
    long *staticArrPtrL = (long *)staticArr;
    printf("(long *) staticArrPtr + 1    = %p [long pointer]\n",
           (void *)(staticArrPtrL + 1));
}

staticArr (5): 1,20,30,40,50
0: 1, 2: 30
staticArrAuto (4): 10,20,30,40
staticArrZeros (5): 1,2,0,0,0
[staticArrPtr] 0: 1, 2: 30
staticArrPtr                 = 0x7ffe99b94580 [pointer]
staticArrPtr + 1             = 0x7ffe99b94584 [pointer]
staticArrPtr + 1             = 0x7ffe99b94581 [size_t]
staticArrPtr + 1*sizeof(int) = 0x7ffe99b94584 [size_t]
(long *) staticArrPtr + 1    = 0x7ffe99b94588 [long pointer]

Funnily enough this means that the following is true:

arr[index] == *(arr + index) == *(index + arr) == index[arr]

pointer[0] == *(pointer + 0) == *pointer

But C doesn’t know (always) or rather doesn’t check the bounds or the correct type of your array so be sure to not overshoot the actual memory of the array and make your program crash!