Text, File and Data Wrangling

Posted on 2023-07-25. Edited on 2023-09-02. In Tool.

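Linux provides powerful commands for working with text and files. By using these commands, alone and in combination, we can quickly process text, pass information between files, and wrangle data.

Common Commands

The commands below are the everyday tools for processing text, passing information between files, and wrangling data.

I/O Redirection

As with I/O redirection in C, I/O redirection in Linux lets us freely change where standard input, standard output, and standard error point. By default they are attached to the keyboard, the display, and the display, with file descriptors 0, 1, and 2 respectively.

`>` and `>>`: Output Redirection

The `>` operator redirects a program's output to a specified file: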

[meme@localhost Playground]$ ls -l /dev/null
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
[meme@localhost Playground]$ ls -l /dev/null > ls-output.txt
[meme@localhost Playground]$ cat ls-output.txt
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null

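For now, treat `cat` as a command that displays a file's contents; it is introduced later.

With `>`, the information that would have gone to the display is written to the new file ls-output.txt instead. Note that if the redirection target does not exist, the shell creates it first (in the current directory here, because the name is a relative path); if the target already exists, the shell first empties it: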

[meme@localhost Playground]$ > ls-output.txt
[meme@localhost Playground]$ cat ls-output.txt
[meme@localhost Playground]$ ls -l ls-output.txt
-rw-rw-r--. 1 meme meme 0 Jul 25 19:52 ls-output.txt

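Because no program output precedes `>`, the target file is simply truncated to zero length. To keep the existing content, use the `>>` operator, which appends the output to the end of the target file: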

[meme@localhost Playground]$ ls -l /dev/null >> ls-output.txt
[meme@localhost Playground]$ cat ls-output.txt
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
[meme@localhost Playground]$ ls -l /dev/null >> ls-output.txt
[meme@localhost Playground]$ cat ls-output.txt
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null

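The file /dev/null used here is a special file in Linux. It is a system device, often called the "bit bucket": it accepts input but discards it and never outputs anything. When we do not want to see some output, we can redirect it straight to /dev/null.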
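The difference between truncating and appending can be checked with a minimal sketch like the one below. The file name out.txt and the temporary directory are illustrative choices, not part of the session above.

```shell
# Work in a throwaway directory so nothing real is overwritten.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

echo "first"  > out.txt    # creates out.txt
echo "second" > out.txt    # truncates first, so "first" is gone
echo "third" >> out.txt    # appends after "second"

cat out.txt

# Discard output we do not care about.
echo "noise" > /dev/null
```

After it runs, out.txt should hold only the lines written by the last `>` and the `>>` that followed it.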

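Standard Error Redirection

Standard error carries the error messages a program produces while it runs. For example, if we ask `ls` for information about a directory that does not exist, `ls` writes its message to standard error:

[meme@localhost Playground]$ ls -l /ddd/ds
ls: cannot access /ddd/ds: No such file or directory

Standard error has no dedicated redirection operator. We redirect it by combining its file descriptor with output redirection: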

[meme@localhost Playground]$ ls -l /ddd/ds 2> ls-output-error.txt
[meme@localhost Playground]$ cat ls-output-error.txt
ls: cannot access /ddd/ds: No such file or directory

2 is the file descriptor for standard error; `2>` or `2>>` (append to the end of the file) redirects standard error.
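As a small sketch of keeping the two streams apart, the commands below send standard output and standard error to separate files. The path /no/such/dir is a hypothetical one that is assumed not to exist, and the file names are illustrative.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# /no/such/dir is a hypothetical path, assumed not to exist.
ls /no/such/dir > stdout.txt 2> stderr.txt

# stdout.txt stays empty; the complaint lands in stderr.txt.
wc -c stdout.txt stderr.txt

# A second failure appended with 2>> keeps the first message.
ls /no/such/dir 2>> stderr.txt
cat stderr.txt
```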

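Combining Standard Output and Standard Error

Sometimes we want to redirect both standard output and standard error to one file. The shell offers two ways to do this.

The first is to redirect standard error to standard output after the output redirection: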

[meme@localhost Playground]$ ls -l /ddd/ds >> ls-output-error.txt 2>&1
[meme@localhost Playground]$ cat ls-output-error.txt
ls: cannot access /ddd/ds: No such file or directory
ls: cannot access /ddd/ds: No such file or directory

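Here `2>&1` redirects standard error (2) to wherever standard output (1) currently points, and the preceding `>>` determines where that is, and how it is written, for both streams. The order of `>>` and `2>&1` must not be swapped: redirections are processed left to right, so otherwise standard error would first be pointed at the original standard output (the display).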
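The effect of the ordering can be observed with a sketch like this. The helper function `both` and the file names are hypothetical; the braces with an outer redirection exist only to capture what would otherwise reach the display.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical helper that writes one line to each stream.
both() { echo "to stdout"; echo "to stderr" >&2; }

# File first, then 2>&1: both lines end up in right.txt.
both > right.txt 2>&1

# 2>&1 first: stderr is pointed at the stdout of that moment
# (captured here in leaked.txt), and only stdout moves to wrong.txt.
{ both 2>&1 > wrong.txt; } > leaked.txt

cat right.txt
cat wrong.txt
cat leaked.txt
```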

The second, more concise way redirects standard output and standard error at the same time:

[meme@localhost Playground]$ ls -l /ddd/ds &> ls-output-error.txt
[meme@localhost Playground]$ cat ls-output-error.txt
ls: cannot access /ddd/ds: No such file or directory

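`&>` or `&>>` redirects both standard output and standard error to the same file. These operators are a shell extension available in Bash rather than POSIX syntax.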
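A rough way to confirm that the short form matches the long one is to run both and compare the files. The sketch assumes `bash` is installed and calls it explicitly, since `&>` may not be understood by other shells; the file names are illustrative.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Long form: redirect stdout, then point stderr at it.
bash -c '{ echo out; echo err >&2; } > long.txt 2>&1'

# Short form: redirect both streams at once.
bash -c '{ echo out; echo err >&2; } &> short.txt'

cmp long.txt short.txt && echo "identical"
```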

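`<`: Input Redirection

Before looking at the `<` input redirection operator, we need a command that accepts standard input; see the section `cat`: Concatenate Files.

Since `cat` accepts standard input, we can use `<` to redirect standard input to a file of our choice: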

[meme@localhost Playground]$ cat < ls-output.txt
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null

Going further, input and output redirection can be used together:

[meme@localhost Playground]$ cat < ls-output.txt > ls-output-copy.txt
[meme@localhost Playground]$ cat ls-output-copy.txt
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
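Input redirection matters most for commands that cannot open files themselves. The sketch below uses `tr`, a standard tool not covered in this article, which reads only standard input; the file names are illustrative.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf 'hello\nworld\n' > in.txt

# tr takes no file arguments, so the file has to arrive on standard input.
tr 'a-z' 'A-Z' < in.txt > out.txt

cat out.txt
```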

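`cat`: Concatenate Files

cat [OPTION]... [FILE]...

`cat`, as its name suggests, concatenates. It reads the input files [FILE] in order and copies their contents to standard output. Because `cat` inserts neither page breaks nor separators between the files, redirecting its output to another file has the effect of joining those files into one: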

[meme@localhost Playground]$ cat ls-output.txt
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
[meme@localhost Playground]$ cat ls-output.txt ls-output-error.txt
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
ls: cannot access /ddd/ds: No such file or directory
[meme@localhost Playground]$ cat ls-output.txt ls-output-error.txt > ls-output-cat.txt
[meme@localhost Playground]$ cat ls-output-cat.txt
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
crw-rw-rw-. 1 root root 1, 3 Jul 25 18:19 /dev/null
ls: cannot access /ddd/ds: No such file or directory

`cat` can also be run without any file name, in which case it reads from standard input and copies to standard output:

[meme@localhost Playground]$ cat
I don't know about you but I'm feeling 22.
I don't know about you but I'm feeling 22.

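In the session above, I typed "I don't know about you but I'm feeling 22." only once; after Enter, `cat` copies the line and prints it again. It keeps doing so for each line until we press Ctrl+D to signal end of file on standard input. Of course, we can also redirect what `cat` copies from standard input to a file: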

[meme@localhost Playground]$ cat > 22.txt
I don't know about you but I'm feeling 22.
Everything will be alright if you keep me next to you.
[meme@localhost Playground]$ cat 22.txt
I don't know about you but I'm feeling 22.
Everything will be alright if you keep me next to you.

`cat` has many interesting [OPTION]s; for example, `-n` numbers the output lines. Run `cat --help` to see more.
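A compact sketch of both uses, joining files and numbering lines, might look like this. The file names a.txt, b.txt, and joined.txt are made up for the example.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf 'alpha\nbeta\n' > a.txt
printf 'gamma\n'       > b.txt

# Join the files in argument order.
cat a.txt b.txt > joined.txt

# Number the lines of the joined file.
cat -n joined.txt
```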

Here-document

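A here-document is a special form of redirection in the shell; the name can be read as "immediate document". In this form the input is a piece of text short enough that putting it in its own file would be overkill, so it is placed directly in the command or script instead. The basic usage is:

[command] << [Here Tag]
[Your Here-document]
[Here Tag]

Here [command] is any command that accepts input, [Your Here-document] is the text the user wants to supply, and [Here Tag] marks where that text begins and ends. The tag can be any string, conventionally EOF. The terminating [Here Tag] must be on a line by itself, starting in the first column. Taking `cat` as an example: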

[meme@localhost Playground]$ cat << EOF
> Hello World
> EOF
Hello World

Here, Hello World serves as the content of the input file.

Here-documents are often used in Bash scripts.
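In a script, a here-document is a convenient way to generate a small file. The sketch below also shows that quoting the tag turns off variable expansion inside the body; the variable `name` and the file names are illustrative.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

name="Playground"   # illustrative variable

# Unquoted tag: $name is expanded before cat sees the text.
cat << EOF > expanded.txt
Hello from $name
EOF

# Quoted tag: the body is passed through literally.
cat << 'EOF' > literal.txt
Hello from $name
EOF

cat expanded.txt literal.txt
```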

`|`: Pipelines

The `|` operator connects two commands: it feeds the output of the first command into the input of the second:

[meme@localhost Playground]$ cat 22.txt | cat >> 22.txt
[meme@localhost Playground]$ cat 22.txt
I don't know about you but I'm feeling 22.
Everything will be alright if you keep me next to you.
I don't know about you but I'm feeling 22.
Everything will be alright if you keep me next to you.

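The first `cat` outputs the contents of 22.txt, `|` pipes that into the second `cat`, and the second `cat`'s output is appended to 22.txt, so the first two lines of 22.txt end up duplicated. Note that this pipeline reads and writes the same file at once: it behaves as shown here because the file is small, but the result depends on timing between the two processes, so for real work it is safer to write to a different file.

Pipelines can express quite complex operations. Several commands are often chained with several `|`, and the resulting compound command is commonly called a "filter", because each `|` stage filters the original output further until it has the form we want.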
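A variant of the duplication that does not depend on timing reads from one file and writes to another. This is only a sketch; song.txt stands in for 22.txt, and the other file names are made up.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf 'line one\nline two\n' > song.txt   # stand-in for 22.txt

# Pipe through a second cat into a different file,
# so the reader and the writer never touch the same file.
cat song.txt | cat > copy.txt

# Build the doubled content from the two separate files.
cat song.txt copy.txt > doubled.txt

cat doubled.txt
```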

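`uniq`: Unique, Report or Omit Repeated Lines

uniq [OPTION]... [INPUT [OUTPUT]]

`uniq` takes a sorted list from standard input or from a single file name argument and, by default, removes repeated lines from it. Because `uniq` only detects repeated lines that are adjacent, its input must be sorted for it to be effective, so `uniq` is usually combined with `sort`: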

[meme@localhost Playground]$ cat 22.txt
I don't know about you but I'm feeling 22.
Everything will be alright if you keep me next to you.
I don't know about you but I'm feeling 22.
Everything will be alright if you keep me next to you.
You don't know about me but I'll bet you want to.
Everything will be alright if we just keep dancing like we're
22 22
[meme@localhost Playground]$ cat 22.txt | sort | uniq
22 22
Everything will be alright if we just keep dancing like we're
Everything will be alright if you keep me next to you.
I don't know about you but I'm feeling 22.
You don't know about me but I'll bet you want to.

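A few more lines of the lyrics of "22" were added here.

The `-d` option makes `uniq` print only the lines that are repeated, one copy of each: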

[meme@localhost Playground]$ cat 22.txt | sort | uniq -d
Everything will be alright if you keep me next to you.
I don't know about you but I'm feeling 22.
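Why sorting matters can be seen on a tiny input where the duplicates are not adjacent. The sketch below uses a made-up file letters.txt; `-c`, which prefixes each line with its count, is shown as one more commonly used option.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf 'b\na\nb\na\n' > letters.txt

# Unsorted: no two equal lines are adjacent, so nothing is removed.
uniq letters.txt

# Sorted: equal lines become adjacent and collapse.
sort letters.txt | uniq

# -c prefixes each line with its number of occurrences.
sort letters.txt | uniq -c
```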

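`wc`: Word Count, Print Line, Word, and Byte Counts

wc [OPTION]... [FILE]...

`wc` counts the total lines, words, and bytes in standard input or in files and prints them:

[meme@localhost Playground]$ cat 22.txt | sort | uniq -d | wc
      2      20      98

By default `wc` prints all three numbers; the `-l`, `-w`, and `-m` options make it print only the line, word, or character count respectively, and `-c` prints only the byte count.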
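Bytes and characters differ as soon as the text contains multi-byte characters. The sketch below writes "café" with the "é" given as its two UTF-8 bytes; the file name is illustrative, and the `-m` result is left unstated because it depends on the locale in effect.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# "café": the é is written as its two UTF-8 bytes, then a newline.
printf 'caf\303\251\n' > cafe.txt

wc -l < cafe.txt   # lines
wc -w < cafe.txt   # words
wc -c < cafe.txt   # bytes: 3 + 2 for the é + 1 for the newline
wc -m < cafe.txt   # characters; the result depends on the locale
```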

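`head`/`tail`: Print the Beginning or End of a File

head/tail [OPTION]... [FILE]...

`head` and `tail` are a symmetric pair: by default the former prints the first 10 lines of its input and the latter the last 10. The `-n` option adjusts how many lines are printed:

[meme@localhost Playground]$ cat 22.txt | sort | uniq -d | head -n1
Everything will be alright if you keep me next to you.

With the `-f` option, `tail` does not exit until it is interrupted (for example with Ctrl+C). This lets us use `tail -f [FILE]` to monitor a file's contents in real time.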
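Combining the two commands gives a simple way to cut a range out of the middle of a file. The sketch below builds an illustrative 20-line file first; lines.txt is a made-up name.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Build a 20-line file: "line 1" ... "line 20".
i=1
while [ "$i" -le 20 ]; do
  echo "line $i"
  i=$((i + 1))
done > lines.txt

head -n 3 lines.txt                 # lines 1 to 3
tail -n 2 lines.txt                 # lines 19 and 20
head -n 15 lines.txt | tail -n 5    # lines 11 to 15
```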

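`tee`: T-splitter, Output in Two Directions

tee [OPTION]... [FILE]...

`tee`, named after the T-shaped splitter (T-splitter), reads data from standard input and copies it to both standard output and one or more files: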

[meme@localhost Playground]$ cat 22.txt | sort | uniq -d | tee t1.txt t2.txt | head -n1
Everything will be alright if you keep me next to you.
[meme@localhost Playground]$ cat t1.txt
Everything will be alright if you keep me next to you.
I don't know about you but I'm feeling 22.
[meme@localhost Playground]$ cat t2.txt
Everything will be alright if you keep me next to you.
I don't know about you but I'm feeling 22.

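`tee` takes the output of `uniq` and writes it to t1.txt and t2.txt while also writing it to standard output. Because it does not interrupt the flow through the pipe, `tee` is well suited to capturing intermediate results inside a filter.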
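As a further sketch, `tee` can save an intermediate stage while the pipeline carries on, and its `-a` option appends instead of overwriting. The input words and file names below are made up.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Keep the sorted intermediate result while the pipeline continues.
printf 'pear\napple\npear\n' | sort | tee sorted.txt | uniq > unique.txt

# -a appends to the file instead of overwriting it.
echo 'zucchini' | tee -a sorted.txt > /dev/null

cat sorted.txt
cat unique.txt
```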

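`grep`: Global Regular Expression Print, Print Matching Lines

grep [OPTION]... PATTERN [FILE]...

`grep` is a powerful program: it searches files or standard input for text matching PATTERN and prints the lines that contain a match. PATTERN can be the literal text we are looking for or a more general regular expression.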

[meme@localhost Playground]$ cat 22.txt | grep Everything
Everything will be alright if you keep me next to you.
Everything will be alright if you keep me next to you.
Everything will be alright if we just keep dancing like we're
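A few commonly used options can be tried on a short file. The sketch below is not exhaustive; lyrics.txt is an illustrative file holding three of the lines used above.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

cat << 'EOF' > lyrics.txt
I don't know about you but I'm feeling 22.
Everything will be alright if you keep me next to you.
You don't know about me but I'll bet you want to.
EOF

grep -c "don't" lyrics.txt        # count the matching lines
grep -n 'you\.$' lyrics.txt       # prefix line numbers; match "you." at end of line
grep -v 'Everything' lyrics.txt   # invert: lines that do not match
grep -i 'everything' lyrics.txt   # ignore case
grep -E '^(I|You) ' lyrics.txt    # extended regex: lines starting with "I " or "You "
```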
