Files

"Theory and practice sometimes clash.
And when that happens, theory loses. Every single time."
Linus Torvalds

The protogenesis of Unix was a filesystem.[1][2] Designed and implemented by the researchers of Bell Laboratories, this piece of software became an operating system when a means of interacting with its contents was introduced. Being a Unix-like operating system, GNU/Linux inherits Unix's file semantics.

Files

In Unix, a file is a sequence of bytes.[1] This definition is sufficiently abstract to allow, for instance, a keyboard to appear as a file. Unix-like operating systems facilitate interaction with files through four essential system calls:

  1. open()
  2. close()
  3. read()
  4. write()

In particular, the read() system call fills a given buffer with the contents of a file; order is preserved across sequential calls to read(). When a program requests a read, the kernel suspends its execution, loads the appropriate data, then resumes execution of the program.

File Attributes

A substantial amount of information about each file is stored on the system. Such information is called metadata, though within the context of Unix files, it is usually referred to as file attributes.

File attributes consist of:

Aside:

Oddly, this information does not include the file's name. More on that later.

This data is stored alongside the file, typically on-disk, and its quantity can be explained by Unix's time-sharing heritage. It is used today to compartmentalize the machine.

Regular Files

A regular file is stored on a disk or other persistent storage device. Most files are regular files.

Here we have a typical ls listing:

Aside:

Access restrictions are enforced by the kernel.

The first character, a hyphen (-), tells us that bash (located in
/usr/bin) is a regular file. The next nine characters are its permission bits; the r and x present in each trio indicate that this file is readable and executable (respectively) by any user on the system. The w present only in the owner trio indicates that the file may be written to only by its owner, who in this case is root.

Two other tools for peeking into a file are file, which attempts to categorize the file, and stat, which prints the file's complete metadata:

$ file /usr/bin/bash
/bin/bash: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2,
BuildID[sha1]=a43fec47192ff49c2d3fed671f2be8df7e83784a, for
GNU/Linux 3.2.0, stripped

$ stat /usr/bin/bash
File: /usr/bin/bash
Size: 1183448 Blocks: 2312 IO Block: 4096 regular file
Device: 802h/2050d Inode: 1835042 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-07-14 13:33:43.628528710 -0500
Modify: 2020-02-25 06:03:22.000000000 -0600
Change: 2020-07-09 11:20:12.676956656 -0500
Birth: -

To simply view the contents of a file, we can use the less paging utility, which interprets the file as text.

Device Files

The Unix directory tree is used as a general-purpose meeting place: In addition to regular files, it also contains references to hardware, offers information about the running system, and allows interprocess communication. This mechanism is general, extensible, and allows both users and programmers access to the running machine.

A device file is a file that represents a device connected to the system. Device files reside in /dev.[3] For example, when a USB-stick is inserted, a new file appears in /dev, and the device may be read from or written to (directly) by passing that file to tools like cat or dd.

It's interesting to note that the essential capabilities of the USB-stick are preserved through the file interface: It can be read from and written to. If we were to add the capacity to seek (i.e., change location) within the file, then we would be able to use the device in its entirety.

Block and Character Device Files

Unix kernels support two kinds of device files: Block and Character. A character device file may only be read from or written to a single byte (ASCII character) at a time. A keyboard is one kind of device represented by a character device file.

Hard disks, SSD's and removable mass media are represented by block device files: These devices support reading and writing only in blocks of bytes at a time, and the kernel must buffer arbitrary interactions with it. Additionally, block device files support seeking, whereas character devices do not.

See Also:

Pseudo Files

The Linux kernel augments the traditional Unix directory tree with pseudo (or "synthetic") files. These files do not exist on disk, and do not exist on any persistent storage. Instead, their contents are generated dynamically by the kernel as processes interact with them.

For example, /proc/meminfo contains a description of the system's current memory usage:

$ cat /proc/meminfo
MemTotal: 3887988 kB
MemFree: 1619568 kB
MemAvailable: 2450268 kB
Buffers: 120152 kB
Cached: 984208 kB
SwapCached: 0 kB
Active: 324532 kB
Inactive: 1528196 kB
Active(anon): 9476 kB
Inactive(anon): 1061224 kB
[...]

/proc contains one directory for each process (running program) on the system; therein, each directory contains substantial information about its respective process. /sys contains information useful to only the lowest-level system daemons.[4]

Other File Types:

See Also:

Basic File Operations

and Introduction to Directories

Basic command-line utilities to manipulate files are:

Utilities to display and inspect file contents:

Utilities to manipulate directories:

Introduction to Directories

Aside:

find recursively descends through a directory tree in search of files.

Directories are a special kind of file, and early Unix kernels allowed them to be manipulated relatively unabated.[2] However, the directory structure is intended to be a hierarchy, and some software (like find) depends upon this structure. Leaving directory contents exposed to users introduced the possibility that a loop could form, or that a directory subtree could be broken off without being properly deallocated.[5] So, directories are now hidden behind a separate system call interface.

Directories start life with two entries: . ("dot"), and .. ("dot dot"). These entries cannot be removed, nor changed. The first, ., is a reference to the containing directory itself, and is used as a handle to the current working directory. .. is a reference to the directory's parent directory.

Since each directory has exactly one parent, the directory structure is guaranteed to be a tree.

See Also:

Filesystem Hierarchy

Linux file system foto no exif (1)

Within a Unix directory tree, some common themes will be found. For instance, bin directories are found in several places, such as /bin, /sbin, along with /usr/bin and often /home/user/.local/bin as well. These contain binary executable files— programs— such as bash and ls. Other standard directories include /root, the root user's home directory, and /proc and /dev, as mentioned above. Some of these have been standard since Unix's inception, while others have become standard in more recent years.

Aside:

The Filesystem Hierarchy Standard formally describes the GNU/Linux directory tree. It is maintained by the Linux Foundation.

Let's start at the top. Each Unix-like operating system has a single root directory, /, which is the root of the directory tree. Every file on the system may be addressed in absolute terms by giving its location relative to this directory.

/home contains users' home directories. As user josh, my home directory is /home/josh. This path is commonly abbreviated to ~. By default, it contains the usual "Music", "Movies", "Pictures" and "Downloads". Per-user configuration files are stored here— as either hidden files or in the directory ~/.config.

The /usr directory is dedicated to user-space (as opposed to kernel-space) files.[6] It contains programs, libraries, manual pages, C header files, and other things. These files are shared by all system users, and are read-only.

Pro Tip:

As a rule, modify /etc, and let your distribution handle /usr.[7]

System configuration is traditionally maintained as a collection of plain-text, human-readable regular files. System-specific configuration is kept in /etc, while distributions' default configurations are typically stored in /usr (alongside their respective packages).

/tmp, /run, and /var each hold transient files of running programs. /boot holds everything required for the boot sequence, including my very own Linux kernel:

$ file /boot/Image
Image:   Linux kernel ARM64 boot executable Image, little-endian, 4k pages

This file is copied (loaded) into memory early in the boot process, after which it is no longer needed.

Software libraries are kind of boring, but essential to modern systems. The basic idea is to factor out repeated code (for instance, C's printf function) not just from the source code, but from executable files as well. This means that, when a bug or security vulnerability is found in printf, any program that uses the function will not have to be recompiled; rather, the appropriate library can be adjusted, and all dependant programs can continue operation in blissful ignorance of the change. Such executable files are said to be dynamically linked, and cannot be expected to run properly without their dependencies. /lib holds libraries.

Mounting a Filesystem

As mentioned, Unix-like operating systems maintain a single directory tree for the entire system. When a new device is introduced to the machine, the directory tree its filesystem contains must be placed in the system's existing directory tree before its contents can be accessed. This operation is known as mounting.

Let's take a look at this, for it is new. Suppose we have a USB stick with some music on it. When we plug it into the machine, a file representing the device appears in /dev, but its contents are not yet available:

Assuming the device popped up as "/dev/usb", we could mount it with,

$ sudo mount /dev/usb /mnt

Resulting in:

The files are now accessible through the directory tree.

In practice, a USB-device is much more likely to be named /dev/sdb or /dev/sdc, with partitions, /dev/sdc1, /dev/sdc2, /dev/sdc3, ... . Hence, a more realistic call to mount would be,

$ sudo mount /dev/sdc1 /mnt

When a filesystem is mounted to a directory that is not empty, the original directory's contents become hidden by those of the new filesystem, and are inaccessible until the filesystem is unmounted. Therefore, we typically locate an empty directory in the existing directory tree, and use it as the to-be mounted filesystem's root directory.

See Also:

Directories

and Filenames

An inode is a collection of data describing a file. It contains all metadata of the file, as well as a sequence of references to the data constituting the file. Within a filesystem, inodes are located in an array, so that an index into this array uniquely identifies an inode, and thereby identifies a specific file. These indices are referred to as inode numbers.

A directory is a file which associates filenames with inode numbers. Both literally and conceptually, a directory is a file containing a two-column table, inode number and filename:

Remember:

Every path on the directory tree terminates in a inode reference.

$ ls -ia ~/Public/Learning-Linux
645278 ./
384844 ../
775487 assets/
645403 _config.yml
775449 _data/
642188 documentation.html
641473 files.html
775463 how-open-source-works.html
[...]

Here we have the file's inode numbers on the left, and its associated filename on the right. Each inode/filename number pair is called a hard link. A file may have more than one hard link, but when it has zero, it is no longer accessible through the directory tree, and the kernel prepares to delete it.[8]

Filenames

Unix-like operating systems allow files to be named very flexibly: Filenames may include any character except /, the path separator, and \NULL, the end-of-string delimiter. More, it is not required that filenames terminate in any special postfix; in particular, a file is executable exactly when its permissions say so.

Aside:

Hidden files are strictly a user-space convention, and have no meaning, per se, with respect to the system.

As a rule, filenames are not interpreted by the kernel, and exist for user interpretation alone. Filenames beginning with . are considered hidden (including . and ..). They can be listed by ls with the -a ("all") switch:

$ ls -a

See Also:

The Original Unix Filesystem

The inode was implemented literally in the Research Unix filesystem, and formed the basis of its architecture:

In the design, every file is represented by a single inode, which contains the file's metadata, and contains pointers to data blocks, where the contents of the file are kept.[9] These nodes are arranged in an array, and can be identified by an index into it. Data blocks are of fixed size, and either free or wholly consumed by a file.[9]

This design allows a file to grow or shrink without rearranging other files.[9] In addition, the inclusion of single-, double- and triple- indirection blocks allow the inode to remain of fixed, reasonable size, while supporting large files.[9] Finally, by keeping the inode array in memory, the design fosters the speedy retrieval of data in small files.[9]

It was the most advanced filesystem of its time, and as Unix grew in popularity, the semantics of file operations in Unix-like operating systems became dependant upon them. The design was demonstrated in the MINIX filesystem, which in turn served as Linux's first filesystem. The extended filesystem, ext, extended that filesystem's capabilities. With ext's inclusion in kernel 0.96c, Linux began to use its new Virtual File System, a kernel subsystem that provides userspace with a uniform filesystem interface.

While each filesystem implements its own functionality, most functionality is common; the Virtual File System maps system calls to filesystem-specific functions.[10] In addition, it imposes inodes on filesystems that do not natively employ them, thereby preserving traditional Unix file semantics across all supported filesystems.[10]

See Also:

Engineer Man

The Linux File System ...for humans

References

  1. Kernighan, Brian. Unix: A History and a Memoir. Published 2020, by Brian W. Kernighan via Kindle Direct Publishing.
  2. Ritchie, D. M. (1984). The Evolution of the Unix Time-sharing System. AT&T Bell Laboratories Technical Journal, 63(6), 2nd ser., 1577-1593.
  3. LSB Workgroup, The Linux Foundation. (2015, March 9). 3.6 /dev: Device Files. Retrieved July 17, 2020, from https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch03.html
  4. Beekmans, G. (2021, March 1). 9.3. Overview of Device and Module Handling. 9.3. Overview of Device and Module Handling. https://linuxfromscratch.org/lfs/view/stable/chapter09/udev.html.
  5. Ritchie, D. M., & Thompson, K. (n.d.). The UNIX Time-Sharing System. UNIX-Annotated.pdf. https://dsf.berkeley.edu/cs262/UNIX-annotated.pdf.
  6. Nguyen, B. (2004, April 30). The Linux Filesystem Heirarchy, Section 1.17. /usr. /usr. https://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/usr.html.
  7. Ward, B. (2021). How Linux Works: What Every Superuser Should Know (3rd ed.). No Starch Press.
  8. Kerrisk, Michael. The Linux Programming Interface. San Francisco, CA, No Starch Press, 2010.
  9. Bach, M. J. (1986). Chapter 4: Internal Representation of Files. In The Design of the UNIX Operating System (pp. 60-90). Englewood Cliffs, NG: Prentice-Hall.
  10. Overview of the Linux Virtual File System - The Linux Kernel Documentation. (n.d.). Retrieved July 17, 2020, from https://www.kernel.org/doc/html/latest/filesystems/vfs.html