Repository: missing-semester-cn/missing-semester-cn.github.io Branch: master Commit: 8334ef64fdf1 Files: 65 Total size: 603.7 KB Directory structure: gitextract_gfrnrc45/ ├── .editorconfig ├── .github/ │ └── ISSUE_TEMPLATE/ │ └── translation.md ├── .gitignore ├── 404.html ├── CNAME ├── Gemfile ├── README.md ├── _2019/ │ ├── automation.md │ ├── backups.md │ ├── command-line.md │ ├── course-overview.md │ ├── data-wrangling.md │ ├── dotfiles.md │ ├── editors.md │ ├── files/ │ │ ├── example-data.xml │ │ └── example.c │ ├── index.html │ ├── machine-introspection.md │ ├── os-customization.md │ ├── package-management.md │ ├── program-introspection.md │ ├── remote-machines.md │ ├── security.md │ ├── shell.md │ ├── version-control.md │ ├── virtual-machines.md │ └── web.md ├── _2020/ │ ├── command-line.md │ ├── course-shell.md │ ├── data-wrangling.md │ ├── debugging-profiling.md │ ├── editors-notes.txt │ ├── editors.md │ ├── files/ │ │ ├── example-data.xml │ │ └── vimrc │ ├── index.html │ ├── metaprogramming.md │ ├── potpourri.md │ ├── qa.md │ ├── security.md │ ├── shell-tools.md │ └── version-control.md ├── _config.yml ├── _includes/ │ ├── head.html │ ├── nav.html │ ├── scaled_image.html │ ├── scaled_video.html │ └── video.html ├── _layouts/ │ ├── default.html │ ├── lecture.html │ ├── page.html │ └── redirect.html ├── about.md ├── index.md ├── lectures.html ├── license.md ├── robots.txt └── static/ ├── css/ │ ├── main.css │ └── syntax.css └── files/ ├── logger.py ├── sorts.py └── subtitles/ └── 2020/ ├── command-line.sbv ├── debugging-profiling.sbv ├── qa.sbv └── shell-tools.sbv ================================================ FILE CONTENTS ================================================ ================================================ FILE: .editorconfig ================================================ root = true [*] charset = utf-8 end_of_line = lf indent_style = space insert_final_newline = true trim_trailing_whitespace = true [*.md] indent_size = 4 
trim_trailing_whitespace = false [*.{html,xml}] indent_size = 2 [*.yml] indent_size = 2 [*.css] indent_size = 2 ================================================ FILE: .github/ISSUE_TEMPLATE/translation.md ================================================ --- name: translation about: choose the file you plan to translate title: '' labels: trans assignees: '' --- Filename : Estimated time of finish : Note: Please make sure you can finish it within two weeks. ================================================ FILE: .gitignore ================================================ .ruby-version .bundle/ _site/ .jekyll-metadata .claude/ ================================================ FILE: 404.html ================================================ --- layout: default title: "404: Page not found" permalink: /404.html ---

404

Sorry, the page you were looking for doesn't exist or has been moved.

You can go back to the home page or use the search bar to find what you're looking for.

If you think this is an error, please contact us.

================================================
FILE: CNAME
================================================
missing-semester-cn.github.io

================================================
FILE: Gemfile
================================================
source 'https://rubygems.org'
gem 'github-pages'

================================================
FILE: README.md
================================================
# The Missing Semester of Your CS Education (计算机教育中缺失的一课)

The English course website is [here](https://missing.csail.mit.edu/)!

This is the [Chinese site](https://missing-semester-cn.github.io).

Contributions to this project are welcome! If you want to edit or add content, please open an issue or submit a pull request.

## Development and deployment

To build and view the site locally, run:

```bash
bundle install
bundle exec jekyll serve -w
```

## License

All content in this course, including the website source code, lecture notes, exercises, and lecture videos, is licensed under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) International license. For more information on contributing or translating, see [here](https://missing.csail.mit.edu/license).

-----------------

## 2026 course status

Please claim tasks in the README of the [sync-offical-2026](https://github.com/missing-semester-cn/missing-semester-cn.github.io/tree/sync-offical-2026) branch and submit your translations there.

| Lecture notes | Translator | Status |
| ---- | ---- | ---- |
| [agentic-coding.md](_2026/agentic-coding.md) | [@Lingfeng AI](https://github.com/hanxiaomax) | To be translated |
| [beyond-code.md](_2026/beyond-code.md) | Unassigned | To be translated |
| [code-quality.md](_2026/code-quality.md) | Unassigned | To be translated |
| [command-line-environment.md](_2026/command-line-environment.md) | Unassigned | To be translated |
| [course-shell.md](_2026/course-shell.md) | Unassigned | To be translated |
| [debugging-profiling.md](_2026/debugging-profiling.md) | Unassigned | To be translated |
| [development-environment.md](_2026/development-environment.md) | Unassigned | To be translated |
| [shipping-code.md](_2026/shipping-code.md) | Unassigned | To be translated |
| [version-control.md](_2026/version-control.md) | Unassigned | To be translated |

-----------------

## Project status

To take part in this translation project, please reserve your topic by creating an issue; I will update this table accordingly to avoid duplicated work.

| Lecture notes | Translator | Status |
| ---- | ---- |---- |
| [course-shell.md](_2020/course-shell.md) | [@Lingfeng AI](https://github.com/hanxiaomax) | Done |
| [shell-tools.md](_2020/shell-tools.md) | [@Lingfeng 
AI](https://github.com/hanxiaomax) | Done |
| [editors.md](_2020/editors.md) | [@stechu](https://github.com/stechu) | Done |
| [data-wrangling.md](_2020/data-wrangling.md) | [@Lingfeng AI](https://github.com/hanxiaomax) | Done |
| [command-line.md](_2020/command-line.md) | [@Lingfeng AI](https://github.com/hanxiaomax) | Done |
| [version-control.md](_2020/version-control.md) | [@Lingfeng AI](https://github.com/hanxiaomax) | Done |
| [debugging-profiling.md](_2020/debugging-profiling.md) |[@Lingfeng AI](https://github.com/hanxiaomax) | Done |
| [metaprogramming.md](_2020/metaprogramming.md) | [@Lingfeng AI](https://github.com/hanxiaomax) | Done |
| [security.md](_2020/security.md) | [@catcarbon](https://github.com/catcarbon) | Done |
| [potpourri.md](_2020/potpourri.md) | [@catcarbon](https://github.com/catcarbon) | Done |
| [qa.md](_2020/qa.md) | [@AA1HSHH](https://github.com/AA1HSHH) | Done |
| [about.md](about.md) | [@Binlogo](https://github.com/Binlogo) | Done |

## New project

[Learncpp (Chinese edition)](https://github.com/hanxiaomax/Learncpp_CN).

================================================
FILE: _2019/automation.md
================================================
---
layout: lecture
title: "Automation"
presenter: Jose
video:
  aspect: 56.25
  id: BaLlAaHz-1k
---

Sometimes you write a script and want it to run periodically, say a backup task. You can always write an *ad hoc* solution that runs in the background and wakes up periodically. However, most UNIX systems come with the cron daemon, which can run tasks as often as once a minute based on simple rules.

On most UNIX systems the cron daemon, `crond`, will be running by default, but you can always check using `ps aux | grep crond`.
## The crontab

The configuration file for cron can be displayed by running `crontab -l` and edited by running `crontab -e`. The time format that cron uses is five space-separated fields, followed by the user and the command:

- **minute** - What minute of the hour the command will run on, between 0 and 59
- **hour** - What hour the command will run on, specified in the 24 hour clock; values must be between 0 and 23 (0 is midnight)
- **dom** - The Day of Month that you want the command run on, e.g. to run a command on the 19th of each month, the dom would be 19.
- **month** - The month the command will run on; it may be specified numerically (1-12), or as the name of the month (e.g. May)
- **dow** - The Day of Week that you want the command to be run on; it can also be numeric (0-7, where both 0 and 7 mean Sunday) or the name of the day (e.g. sun).
- **user** - The user who runs the command. This field appears in system-wide crontabs such as `/etc/crontab`; per-user crontabs edited with `crontab -e` omit it.
- **command** - The command that you want run. This field may contain multiple words or spaces.

Note that using an asterisk `*` means all, and using an asterisk followed by a slash and a number means every nth value. So `*/5` means every five. Some examples are

```shell
*/5   *    *   *   *       # Every five minutes
  0   *    *   *   *       # Every hour, on the hour
  0   9    *   *   *       # Every day at 9:00 am
  0   9-17 *   *   *       # Every hour between 9:00am and 5:00pm
  0   0    *   *   5       # Every Friday at 12:00 am
  0   0    1   */2 *       # Every other month, the first day, 12:00am
```

You can find many more examples of common crontab schedules in [crontab.guru](https://crontab.guru/examples.html)

## Shell environment and logging

A common pitfall when using cron is that it does not load the same environment scripts that common shells do, such as `.bashrc`, `.zshrc`, &c, and it does not log the output anywhere by default. Combined with the maximum frequency being one minute, it can become quite painful to debug cronscripts initially.
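One way to get a feel for this pitfall without waiting for cron is to run a command under a deliberately stripped-down environment. The sketch below is only an approximation: the function name `cron_like_path` is made up for illustration, and the `PATH` value is an assumption, since the variables cron actually sets differ between implementations.

```shell
# Approximate cron's sparse environment: `env -i` starts from an empty
# environment, and we add back only HOME and a minimal PATH.
# The PATH value is an assumption; check your cron's documentation.
cron_like_path() {
    env -i HOME="${HOME:-/}" PATH=/usr/bin:/bin /bin/sh -c 'echo "$PATH"'
}

cron_like_path
# prints: /usr/bin:/bin
```

If a script works in your interactive shell but fails when launched this way, it probably relies on something your shell startup files provide.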
To deal with the environment, make sure that you use absolute paths in all your scripts and modify your environment variables such as `PATH` so the script can run successfully. To simplify logging, a good recommendation is to write your crontab in a format like this

```shell
* * * * *   user  /path/to/cronscripts/every_minute.sh >> /tmp/cron_every_minute.log 2>&1
```

And write the script in a separate file. Remember that `>>` appends to the file and that `2>&1` redirects `stderr` to `stdout` (you might want to keep them separate though).

## Anacron

One caveat of using cron is that if the computer is powered off or asleep when the cron script should run, then it is not executed. For frequent tasks this might be fine, but if a task runs less often, you may want to ensure that it is executed. [anacron](https://linux.die.net/man/8/anacron) works similarly to `cron` except that the frequency is specified in days. Unlike cron, it does not assume that the machine is running continuously. Hence, it can be used on machines that aren't running 24 hours a day to run regular jobs such as daily, weekly, and monthly ones.

## Exercises

1. Make a script that looks every minute in your downloads folder for any file that is a picture (you can look into [MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types) or use a regular expression to match common extensions) and moves them into your Pictures folder.

1. Write a cron script that checks weekly for outdated packages on your system and either prompts you to update them or updates them automatically.
{% comment %}

- [fswatch](https://github.com/emcrisostomo/fswatch)
- GUI automation (pyautogui) [Automating the boring stuff Chapter 18](https://automatetheboringstuff.com/chapter18/)
- Ansible/puppet/chef
- https://xkcd.com/1205/
- https://xkcd.com/1319/

{% endcomment %}

================================================
FILE: _2019/backups.md
================================================
---
layout: lecture
title: "Backups"
presenter: Jose
video:
  aspect: 56.25
  id: lrpqYF8tcYQ
---

There are two types of people:

- Those who do backups
- Those who will do backups

Any data you own that you haven't backed up is data that could be gone at any moment, forever. Here we will cover some good backup basics and the pitfalls of some approaches.

## 3-2-1 Rule

The [3-2-1 rule](https://www.us-cert.gov/sites/default/files/publications/data_backup_options.pdf) is a generally recommended strategy for backing up your data. It states that you should have:

- at least **3 copies** of your data
- **2** copies in **different mediums**
- **1** of the copies being **offsite**

The main idea behind this recommendation is not to put all your eggs in one basket. Having 2 different devices/disks ensures that a single hardware failure doesn't take away all your data. Similarly, if you store your only backup at home and the house burns down or gets robbed, you lose everything; that's what the offsite copy is there for. Onsite backups give you availability and speed, offsite backups give you resiliency should a disaster happen.

## Testing your backups

A common pitfall when performing backups is blindly trusting whatever the system says it's doing and not verifying that the data can be properly recovered. Toy Story 2 was almost lost because the backups were not working; [luck](https://www.youtube.com/watch?v=8dhp_20j0Ys) ended up saving them.

## Versioning

You should understand that [RAID](https://en.wikipedia.org/wiki/RAID) is not a backup, and in general **mirroring is not a backup solution**.
Simply syncing your files somewhere will not help in several scenarios, such as:

- Data corruption
- Malicious software
- Deleting files by mistake

If the changes to your data propagate to the backup, then you won't be able to recover in these scenarios. Note that this is the case for a lot of cloud storage solutions like Dropbox, Google Drive, One Drive, &c. Some of them do keep deleted data around for short amounts of time, but usually the interface to recover is not something you want to be using to recover large amounts of files.

A proper backup system should be versioned in order to prevent this failure mode. By providing different snapshots in time, one can easily navigate them to restore whatever was lost. The most widely known software of this kind is macOS Time Machine.

## Deduplication

However, making several copies of your data might be extremely costly in terms of disk space. Nevertheless, from one version to the next, most data will be identical and need not be transferred again. This is where [data deduplication](https://en.wikipedia.org/wiki/Data_deduplication) comes into play: by keeping track of what has already been stored, one can do **incremental backups** where only the changes from one version to the next need to be stored. This significantly reduces the amount of space needed for backups beyond the first copy.

## Encryption

Since we might be backing up to untrusted third parties like cloud providers, it is worth considering that if your data is copied *as is* then it could potentially be read by unwanted parties. Documents like your taxes are sensitive information that should not be backed up in plain format. To prevent this, many backup solutions offer **client side encryption** where data is encrypted before being sent to the server. That way the server cannot read the data it is storing, but you can decrypt it with your secret key.
As a side note, if your disk (or home partition) is not encrypted, then anyone who gets hold of your computer can override the user access controls and read your data. Modern hardware supports fast and efficient reads and writes of encrypted data, so you might want to consider enabling **full disk encryption**.

## Append only

The properties reviewed so far focus on hardware failure or user mistakes but fail to address what happens if a malicious agent wants to delete your data. Namely, say someone hacks into your system: are they able to wipe all your copies of the data you care about? If you worry about that scenario then you need some sort of append only backup solution. In general, this means having a server that will allow you to send new data but will refuse to delete existing data. Usually users have two keys: an append only key that supports creating new backups and a full access key that also allows for deleting old backups that are no longer needed. The latter is stored offline.

Note that this is quite a challenging scenario, since you need the ability to make changes whilst still preventing a malicious user from deleting your data. Existing commercial solutions include [Tarsnap](https://www.tarsnap.com/) and [Borgbase](https://www.borgbase.com/).

## Additional considerations

Some other things you may want to look into are:

- **Periodic backups**: outdated backups can become pretty useless. Making backups regularly should be a consideration for your system
- **Bootable backups**: some programs allow you to clone your entire disk. That way you have an image that contains an entire copy of your system you can boot directly from.
- **Differential backup strategies**: you may not necessarily care the same about all your data. You can define different backup policies for different types of data.
- **Append only backups**: an additional consideration is to enforce append only operations on your backup repositories in order to prevent malicious agents from deleting them if they get hold of your machine.

## Webservices

Not all the data that you use lives on your hard disk. If you use **webservices**, then it might be the case that some data you care about, such as Google Docs presentations or Spotify playlists, is stored online. Another example that is easy to forget is email accounts with web access, such as Gmail. Figuring out a backup solution in these cases is somewhat trickier. However, there are many services that allow you to download your data, either directly or via an API. Tools such as [gmvault](https://github.com/gaubert/gmvault) for Gmail are available to download the email files to your computer.

## Webpages

Similarly, some high quality content can be found online in the form of webpages. If said content is static, one can easily back it up by just saving the website and all of its attachments. Another alternative is the [Wayback Machine](https://archive.org/web/), a massive digital archive of the World Wide Web managed by the [Internet Archive](https://archive.org/), a non profit organization focused on the preservation of all sorts of media. The Wayback Machine allows you to capture and archive webpages and later retrieve all the snapshots that have been archived for that website. If you find it useful, consider [donating](https://archive.org/donate/) to the project.

## Resources

Some good backup programs and services we have used and can honestly recommend:

- [Tarsnap](https://www.tarsnap.com/) - deduplicated, encrypted online backup service for the truly paranoid.
- [Borg Backup](https://borgbackup.readthedocs.io) - deduplicated backup program that supports compression and authenticated encryption. If you need a cloud provider [rsync.net](https://www.rsync.net/products/borg.html) has special offerings for Borg users.
- [rsync](https://rsync.samba.org/) is a utility that provides fast incremental file transfer. It is not a full backup solution.
- [rclone](https://rclone.org/) like rsync but for cloud storage providers such as Amazon S3, Dropbox, Google Drive, rsync.net, &c. Supports client side encryption of remote folders.

## Exercises

1. Consider how you are (not) backing up your data and look into fixing/improving that.

1. Figure out how to back up your email accounts

1. Choose a webservice you use often (Spotify, Google Music, etc.) and figure out what options there are for backing up your data. Often people have already built tools (such as [youtube-dl](https://ytdl-org.github.io/youtube-dl/)) on top of the available APIs.

1. Think of a website you have visited repeatedly over the years and look it up in [archive.org](https://archive.org/web/). How many versions does it have?

1. One way to efficiently implement deduplication is to use hardlinks. Whereas a symbolic link (also called a soft link or a symlink) is a file that points to another file or folder, a hardlink is an exact copy of the pointer (it uses the same inode and points to the same place on the disk). Thus if the original file is removed, a symlink stops working whereas a hard link doesn't. However, hardlinks only work for files. Try using the command `ln` to create hard links and compare them to symlinks created with `ln -s`. (In macOS you will need to install the gnu coreutils or the hln package).

================================================
FILE: _2019/command-line.md
================================================
---
layout: lecture
title: "Command-line environment"
presenter: Jose
video:
  aspect: 62.5
  id: i0rf1gpKL1E
---

## Aliases & Functions

As you can imagine, it can become tiresome typing long commands that involve many flags or verbose options. Nevertheless, most shells support **aliasing**.
For instance, an alias in bash has the following structure (note there is no space around the `=` sign):

```bash
alias alias_name="command_to_alias"
```

Aliases have many convenient features:

```bash
# Alias can summarize good default flags
alias ll="ls -lh"

# Save a lot of typing for common commands
alias gc="git commit"

# Alias can overwrite existing commands
alias mv="mv -i"
alias mkdir="mkdir -p"

# Alias can be composed
alias la="ls -A"
alias lla="la -l"

# To ignore an alias run it prepended with \
\ls
# Or can be disabled using unalias
unalias la
```

However, in many scenarios aliases can be limiting, especially when you are trying to chain together commands that take the same arguments. An alternative is **functions**, which are a midpoint between aliases and custom shell scripts.

Here is an example function that makes a directory and moves into it.

```bash
mcd () {
    # Quote "$1" so that paths containing spaces work
    mkdir -p "$1"
    cd "$1"
}
```

Aliases and functions will not persist across shell sessions by default. To make an alias persistent you need to include it in one of the shell startup files, like `.bashrc` or `.zshrc`. My suggestion is to write them separately in a `.alias` file and `source` that file from your different shell config files.

## Shells & Frameworks

In the shell and scripting lectures we covered the `bash` shell, since it is by far the most ubiquitous shell and most systems have it as the default option. Nevertheless, it is not the only option.
For example, the `zsh` shell is mostly compatible with `bash` and provides many convenient features out of the box, such as:

- Smarter globbing, `**`
- Inline globbing/wildcard expansion
- Spelling correction
- Better tab completion/selection
- Path expansion (`cd /u/lo/b` will expand as `/usr/local/bin`)

Moreover, many shells can be improved with **frameworks**: some popular general frameworks like [prezto](https://github.com/sorin-ionescu/prezto) or [oh-my-zsh](https://github.com/robbyrussell/oh-my-zsh), and smaller ones that focus on specific features, like [zsh-syntax-highlighting](https://github.com/zsh-users/zsh-syntax-highlighting) or [zsh-history-substring-search](https://github.com/zsh-users/zsh-history-substring-search). Other shells like [fish](https://fishshell.com/) include a lot of these user-friendly features by default. Some of these features include:

- Right prompt
- Command syntax highlighting
- History substring search
- manpage based flag completions
- Smarter autocompletion
- Prompt themes

One thing to note when using these frameworks is that if the code they run is not properly optimized, or there is too much of it, your shell can start slowing down. You can always profile it and disable the features that you do not use often or that matter less to you than speed.

## Terminal Emulators & Multiplexers

Along with customizing your shell, it is worth spending some time figuring out your choice of **terminal emulator** and its settings. There are many terminal emulators out there (here is a [comparison](https://anarc.at/blog/2018-04-12-terminal-emulators-1/)).

Since you might be spending hundreds to thousands of hours in your terminal, it pays off to look into its settings.
Some of the aspects that you may want to modify in your terminal include:

- Font choice
- Color Scheme
- Keyboard shortcuts
- Tab/Pane support
- Scrollback configuration
- Performance (some newer terminals like [Alacritty](https://github.com/jwilm/alacritty) offer GPU acceleration)

It is also worth mentioning **terminal multiplexers** like [tmux](https://github.com/tmux/tmux). `tmux` lets you organize multiple shell sessions into panes and tabs. It also supports attaching and detaching, which is a very common use-case when you are working on a remote server and want to keep your shell running without having to worry about disowning your current processes (by default, when you log out your processes are terminated). This way, with `tmux` you can jump into and out of complex terminal layouts. Similar to terminal emulators, `tmux` supports heavy customization by editing the `~/.tmux.conf` file.

## Command-line utilities

The command line utilities that most UNIX based operating systems have by default are more than enough to do 99% of the stuff you usually need to do.

In the next few subsections I will cover alternative tools for extremely common shell operations which are more convenient to use. Some of these tools add new, improved functionality to the command, whereas others just focus on providing a simpler, more intuitive interface with better defaults.

### `fasd` vs `cd`

Even with improved path expansion and tab autocomplete, changing directories can become quite repetitive. [Fasd](https://github.com/clvv/fasd) (or [autojump](https://github.com/wting/autojump)) solves this issue by keeping track of recent and frequent folders you have been to and performing fuzzy matching.

Thus if I have visited the path `/home/user/awesome_project/code`, running `z code` will `cd` to it. If I have multiple folders called code I can disambiguate by running `z awe code`, which will be a closer match.
Unlike autojump, fasd also provides commands that, instead of performing `cd`, just expand frequent and/or recent files, folders or both.

### `bat` vs `cat`

Even though `cat` does its job perfectly, [bat](https://github.com/sharkdp/bat) improves on it by providing syntax highlighting, paging, line numbers and git integration.

### `exa`/`ranger` vs `ls`

`ls` is a great command, but some of the defaults can be annoying, such as displaying the size in raw bytes. [exa](https://github.com/ogham/exa) provides better defaults.

If you are in need of navigating many folders and/or previewing many files, [ranger](https://github.com/ranger/ranger) can be much more efficient than `cd` and `cat` due to its wonderful interface. It is quite customizable and with a correct setup you can even [preview images](https://github.com/ranger/ranger/wiki/Image-Previews) in your terminal.

### `fd` vs `find`

[fd](https://github.com/sharkdp/fd) is a simple, fast and user-friendly alternative to `find`. Its defaults, like not having to pass `find`'s `-name` flag (searching by name is what you want to do 99% of the time), make it easier to use on an everyday basis. It is also `git` aware and will skip files in your `.gitignore` and `.git` folder by default. It also has nice color coding by default.

### `rg/fzf` vs `grep`

`grep` is a great tool, but if you want to grep through many files at once, there are better tools for that purpose. [ack](https://github.com/beyondgrep/ack3), [ag](https://github.com/ggreer/the_silver_searcher) & [rg](https://github.com/BurntSushi/ripgrep) recursively search your current directory for a regex pattern while respecting your gitignore rules. They all work pretty similarly, but I favor `rg` due to how fast it can search my entire home directory.

Similarly, it can be easy to find yourself doing `CMD | grep PATTERN` over and over again. [fzf](https://github.com/junegunn/fzf) is a command line fuzzy finder that enables you to interactively filter the output of pretty much any command.
### `rsync` vs `cp/scp`

Whereas `mv` and `scp` are perfect for most scenarios, when copying/moving around large amounts of files, large files, or when some of the data is already on the destination, `rsync` is a huge improvement. `rsync` will skip files that have already been transferred, and with the `--partial` flag it can resume from a previously interrupted copy.

### `trash` vs `rm`

`rm` is a dangerous command in the sense that once you delete a file there is no turning back. However, modern operating systems do not behave like that when you delete something in the file explorer; they just move it to the Trash folder, which is cleared periodically.

Since how the trash is managed varies from OS to OS, there is not a single CLI utility. In macOS there is [trash](https://hasseg.org/trash/) and in Linux there is [trash-cli](https://github.com/andreafrancia/trash-cli/) among others.

### `mosh` vs `ssh`

`ssh` is a very handy tool, but if you have a slow connection the lag can become annoying, and if the connection is interrupted you have to reconnect. [mosh](https://mosh.org/) is a handy tool that allows roaming, supports intermittent connectivity, and provides intelligent local echo.

### `tldr` vs `man`

You can figure out what a command does and what options it has using `man` and the `-h`/`--help` flag most of the time. However, in some cases it can be a bit daunting navigating these if they are detailed.

The [tldr](https://github.com/tldr-pages/tldr) command is a community driven documentation system that's available from the command line and gives a few simple illustrative examples of what the command does and the most common argument options.

### `aunpack` vs `tar/unzip/unrar`

As [this xkcd](https://xkcd.com/1168/) references, it can be quite tricky to remember the options for `tar`, and sometimes you need a different tool altogether, such as `unrar` for .rar files.
The [atool](https://www.nongnu.org/atool/) package provides the `aunpack` command which will figure out the correct options and always put the extracted archives in a new folder. ## Exercises 1. Run `cat .bash_history | sort | uniq -c | sort -rn | head -n 10` (or `cat .zhistory | sort | uniq -c | sort -rn | head -n 10` for zsh) to get top 10 most used commands and consider writing shorter aliases for them 1. Choose a terminal emulator and figure out how to change the following properties: - Font choice - Color scheme. How many colors does a standard scheme have? why? - Scrollback history size 1. Install `fasd` or some similar software and write a bash/zsh function called `v` that performs fuzzy matching on the passed arguments and opens up the top result in your editor of choice. Then, modify it so that if there are multiple matches you can select them with `fzf`. 1. Since `fzf` is quite convenient for performing fuzzy searches and the shell history is quite prone to those kind of searches, investigate how to bind `fzf` to `^R`. You can find some info [here](https://github.com/junegunn/fzf/wiki/Configuring-shell-key-bindings) 1. What does the `--bar` option do in `ack`? ================================================ FILE: _2019/course-overview.md ================================================ --- layout: lecture title: "Course Overview" presenter: Anish video: aspect: 56.25 id: qw2c6ffSVOM --- # Motivation This class is about [hacker](https://en.wikipedia.org/wiki/Hacker_culture) tools, not [hacker](https://en.wikipedia.org/wiki/Security_hacker) tools. MIT classes do not cover any of this content in detail. It's hugely beneficial to be proficient with your tools: it'll save you a lot of time (and the payoff time is very short). We want to teach you about new tools, how to make the most of your tools, how to customize your tools, and how to extend your tools. # Class structure We have 6 lectures covering a [variety of topics](/2019/). 
We have lecture notes online, but there will be a lot of content covered in class (e.g. in the form of demos) that may not be in the notes. We will be recording lectures. Each class is split into two 50-minute lectures with a 10-minute break in between. Lectures are mostly live demonstrations followed by hands-on exercises. We might have a short amount of time at the end of each class to get started on the exercises in an office-hours-style setting. To make the most of the class, you should go through all the exercises on your own. We'll inspire you to learn more about your tools, and we'll show you what's possible and cover some of the basics in detail, but we can't teach you everything in the time we have. ================================================ FILE: _2019/data-wrangling.md ================================================ --- layout: lecture title: "Data Wrangling" presenter: Jon video: aspect: 56.25 id: VW2jn9Okjhw --- Have you ever had a bunch of text and wanted to do something with it? Good. That's what data wrangling is all about! Specifically, adapting data from one format to another, until you end up with exactly what you wanted. We've already seen basic data wrangling: `journalctl | grep -i intel`. - find all system log entries that mention Intel (case insensitive) - really, most of data wrangling is about knowing what tools you have, and how to combine them. Let's start from the beginning: we need a data source, and something to do with it. Logs often make for a good use-case, because you often want to investigate things about them, and reading the whole thing isn't feasible. Let's figure out who's trying to log into my server by looking at my server's log: ```bash ssh myserver journalctl ``` That's far too much stuff. Let's limit it to ssh stuff: ```bash ssh myserver journalctl | grep sshd ``` Notice that we're using a pipe to stream a _remote_ file through `grep` on our local computer! `ssh` is magical. 
This is still way more stuff than we wanted though. And pretty hard to read. Let's do better: ```bash ssh myserver journalctl | grep sshd | grep "Disconnected from" ``` There's still a lot of noise here. There are _a lot_ of ways to get rid of that, but let's look at one of the most powerful tools in your toolkit: `sed`. `sed` is a "stream editor" that builds on top of the old `ed` editor. In it, you basically give short commands for how to modify the file, rather than manipulate its contents directly (although you can do that too). There are tons of commands, but one of the most common ones is `s`: substitution. For example, we can write: ```bash ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed 's/.*Disconnected from //' ``` What we just wrote was a simple _regular expression_: a powerful construct that lets you match text against patterns. The `s` command is written in the form: `s/REGEX/SUBSTITUTION/`, where `REGEX` is the regular expression you want to search for, and `SUBSTITUTION` is the text you want to substitute matching text with. ## Regular expressions Regular expressions are common and useful enough that it's worthwhile to take some time to understand how they work. Let's start by looking at the one we used above: `/.*Disconnected from /`. Regular expressions are usually (though not always) surrounded by `/`. Most ASCII characters just carry their normal meaning, but some characters have "special" matching behavior. Exactly which characters do what varies somewhat between different implementations of regular expressions, which is a source of great frustration.
Very common patterns are: - `.` means "any single character" except newline - `*` zero or more of the preceding match - `+` one or more of the preceding match - `[abc]` any one character of `a`, `b`, and `c` - `(RX1|RX2)` either something that matches `RX1` or `RX2` - `^` the start of the line - `$` the end of the line `sed`'s regular expressions are somewhat weird, and will require you to put a `\` before most of these to give them their special meaning. Or you can pass `-E`. So, looking back at `/.*Disconnected from /`, we see that it matches any text that starts with any number of characters, followed by the literal string "Disconnected from ". Which is what we wanted. But beware, regular expressions are tricky. What if someone tried to log in with the username "Disconnected from"? We'd have: ``` Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth] ``` What would we end up with? Well, `*` and `+` are, by default, "greedy". They will match as much text as they can. So, in the above, we'd end up with just ``` 46.97.239.16 port 55920 [preauth] ``` Which may not be what we wanted. In some regular expression implementations, you can just suffix `*` or `+` with a `?` to make them non-greedy, but sadly `sed` doesn't support that. We _could_ switch to perl's command-line mode though, which _does_ support that construct: ```bash perl -pe 's/.*?Disconnected from //' ``` We'll stick to `sed` for the rest of this though, because it's by far the more common tool for these kinds of jobs. `sed` can also do other handy things like print lines following a given match, do multiple substitutions per invocation, search for things, etc. But we won't cover that too much here. `sed` is basically an entire topic in and of itself, but there are often better tools. Okay, so we also have a suffix we'd like to get rid of. How might we do that? 
It's a little tricky to match just the text that follows the username, especially if the username can have spaces and such! What we need to do is match the _whole_ line: ```bash | sed -E 's/.*Disconnected from (invalid |authenticating )?user .* [^ ]+ port [0-9]+( \[preauth\])?$//' ``` Let's look at what's going on with a [regex debugger](https://regex101.com/r/qqbZqh/2). Okay, so the start is still as before. Then, we're matching any of the "user" variants (there are two prefixes in the logs). Then we're matching on any string of characters where the username is. Then we're matching on any single word (`[^ ]+`; any non-empty sequence of non-space characters). Then the word "port" followed by a sequence of digits. Then possibly the suffix ` [preauth]`, and then the end of the line. Notice that with this technique, a username of "Disconnected from" won't confuse us any more. Can you see why? There is one problem with this though, and that is that the entire log becomes empty. We want to _keep_ the username after all. For this, we can use "capture groups". Any text matched by a regex surrounded by parentheses is stored in a numbered capture group. These are available in the substitution (and in some engines, even in the pattern itself!) as `\1`, `\2`, `\3`, etc. So: ```bash | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' ``` As you can probably imagine, you can come up with _really_ complicated regular expressions. For example, here's an article on how you might match an [e-mail address](https://www.regular-expressions.info/email.html). It's [not easy](https://web.archive.org/web/20221223174323/http://emailregex.com/). And there's [lots of discussion](https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression/1917982). And people have [written tests](https://fightingforalostcause.net/content/misc/2006/compare-email-regex.php).
And [test matrices](https://mathiasbynens.be/demo/url-regex). You can even write a regex for determining if a given number [is a prime number](https://www.noulakaz.net/2007/03/18/a-regular-expression-to-check-for-prime-numbers/). Regular expressions are notoriously hard to get right, but they are also very handy to have in your toolbox! ## Back to data wrangling Okay, so we now have ```bash ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' ``` We could do it just with `sed`, but why would we? For fun is why. ```bash ssh myserver journalctl | sed -E -e '/Disconnected from/!d' -e 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' ``` This shows off some of `sed`'s capabilities. `sed` can also inject text (with the `i` command), explicitly print lines (with the `p` command), select lines by index, and lots of other things. Check `man sed`! Anyway. What we have now gives us a list of all the usernames that have attempted to log in. But this is pretty unhelpful. Let's look for common ones: ```bash ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' | sort | uniq -c ``` `sort` will, well, sort its input. `uniq -c` will collapse consecutive lines that are the same into a single line, prefixed with a count of the number of occurrences. We probably want to sort that too and only keep the most common logins: ```bash ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' | sort | uniq -c | sort -nk1,1 | tail -n10 ``` `sort -n` will sort in numeric (instead of lexicographic) order. `-k1,1` means "sort by only the first whitespace-separated column". 
The `,n` part says "sort until the `n`th field", where the default is the end of the line. In this _particular_ example, sorting by the whole line wouldn't matter, but we're here to learn! If we wanted the _least_ common ones, we could use `head` instead of `tail`. There's also `sort -r`, which sorts in reverse order. Okay, so that's pretty cool, but we'd sort of like to only give the usernames, and maybe not one per line? ```bash ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' | sort | uniq -c | sort -nk1,1 | tail -n10 | awk '{print $2}' | paste -sd, ``` Let's start with `paste`: it lets you combine lines (`-s`) by a given single-character delimiter (`-d`). But what's this `awk` business? ## awk -- another editor `awk` is a programming language that just happens to be really good at processing text streams. There is _a lot_ to say about `awk` if you were to learn it properly, but as with many other things here, we'll just go through the basics. First, what does `{print $2}` do? Well, `awk` programs take the form of an optional pattern plus a block saying what to do if the pattern matches a given line. The default pattern (which we used above) matches all lines. Inside the block, `$0` is set to the entire line's contents, and `$1` through `$n` are set to the `n`th _field_ of that line, when separated by the `awk` field separator (whitespace by default, change with `-F`). In this case, we're saying that, for every line, print the contents of the second field, which happens to be the username! Let's see if we can do something fancier. Let's compute the number of single-use usernames that start with `c` and end with `e`: ```bash | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l ``` There's a lot to unpack here. First, notice that we now have a pattern (the stuff that goes before `{...}`).
The pattern says that the first field of the line should be equal to 1 (that's the count from `uniq -c`), and that the second field should match the given regular expression. And the block just says to print the username. We then count the number of lines in the output with `wc -l`. However, `awk` is a programming language, remember? ```awk BEGIN { rows = 0 } $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 } END { print rows } ``` `BEGIN` is a pattern that matches the start of the input (and `END` matches the end). Now, the per-line block just adds the count from the first field (although it'll always be 1 in this case), and then we print it out at the end. In fact, we _could_ get rid of `grep` and `sed` entirely, because `awk` [can do it all](https://backreference.org/2010/02/10/idiomatic-awk/), but we'll leave that as an exercise to the reader. ## Analyzing data You can do math! ```bash | paste -sd+ | bc -l ``` ```bash echo "2*($(data | paste -sd+))" | bc -l ``` You can get stats in a variety of ways. [`st`](https://github.com/nferraz/st) is pretty neat, but if you already have R: ```bash ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' | sort | uniq -c | awk '{print $1}' | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)' ``` R is another (weird) programming language that's great at data analysis and [plotting](https://ggplot2.tidyverse.org/). We won't go into too much detail, but suffice it to say that `summary` prints summary statistics about a vector, and `scan` read the input stream of numbers into a vector, so R gives us the statistics we wanted!
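
If R is not available, a rough stand-in for `summary` can be sketched with `sort` and `awk`. This is only an illustration: the function name `summarize` and the sample numbers are made up here, and it reports fewer statistics than R does (no quartiles).

```bash
# Hypothetical helper: read one number per line on stdin and print
# min, median, mean, and max.
summarize() {
    sort -n | awk '
        { v[NR] = $1; sum += $1 }          # input arrives sorted, so v[] is ordered
        END {
            if (NR == 0) exit              # nothing to summarize
            if (NR % 2) median = v[(NR + 1) / 2]
            else        median = (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "min=%s median=%s mean=%.2f max=%s\n", v[1], median, sum / NR, v[NR]
        }'
}

# Sample data standing in for the counts column produced by `uniq -c`.
printf '%s\n' 3 1 4 1 5 | summarize
# prints: min=1 median=3 mean=2.80 max=5
```

In the pipeline above, `| awk '{print $1}' | summarize` would take the place of the R invocation.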
If you just want some simple plotting, `gnuplot` is your friend: ```bash ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' | sort | uniq -c | sort -nk1,1 | tail -n10 | gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes' ``` ## Data wrangling to make arguments Sometimes you want to do data wrangling to find things to install or remove based on some longer list. The data wrangling we've talked about so far + `xargs` can be a powerful combo: ```bash rustup toolchain list | grep nightly | grep -vE "nightly-x86|01-17" | sed 's/-x86.*//' | xargs rustup toolchain uninstall ``` # Exercises 1. If you are not familiar with Regular Expressions [here](https://regexone.com/) is a short interactive tutorial that covers most of the basics 1. How is `sed s/REGEX/SUBSTITUTION/g` different from the regular sed? What about `/I` or `/m`? 1. To do in-place substitution it is quite tempting to do something like `sed s/REGEX/SUBSTITUTION/ input.txt > input.txt`. However this is a bad idea, why? Is this particular to `sed`? 1. Implement a simple grep equivalent tool in a language you are familiar with using regex. If you want the output to be color highlighted like grep is, search for ANSI color escape sequences. 1. Sometimes some operations like renaming files can be tricky with raw commands like `mv` . `rename` is a nifty tool to achieve this and has a sed-like syntax. Try creating a bunch of files with spaces in their names and use `rename` to replace them with underscores. 1. Look for boot messages that are _not_ shared between your past three reboots (see `journalctl`'s `-b` flag). You may want to just mash all the boot logs together in a single file, as that may make things easier. 1. Produce some statistics of your system boot time over the last ten boots using the log timestamp of the messages ``` Logs begin at ... 
``` and ``` systemd[577]: Startup finished in ... ``` 1. Find the number of words (in `/usr/share/dict/words`) that contain at least three `a`s and don't have a `'s` ending. What are the three most common last two letters of those words? `sed`'s `y` command, or the `tr` program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur? 1. Find an online data set like [this one](https://commons.wikimedia.org/wiki/Data:Wikipedia_statistics/data.tab) or [this one](https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/topic-pages/tables/table-1). Maybe another one [from here](https://www.springboard.com/blog/data-science/free-public-data-sets-data-science-project/). Fetch it using `curl` and extract out just two columns of numerical data. If you're fetching HTML data, [`pup`](https://github.com/EricChiang/pup) might be helpful. For JSON data, try [`jq`](https://stedolan.github.io/jq/). Find the min and max of one column in a single command, and the sum of the difference between the two columns in another. ================================================ FILE: _2019/dotfiles.md ================================================ --- layout: lecture title: "Dotfiles" presenter: Anish video: aspect: 62.5 id: YSZBWWJw3mI --- Many programs are configured using plain-text files known as "dotfiles" (because the file names begin with a `.`, e.g. `~/.gitconfig`, so that they are hidden in the directory listing `ls` by default). A lot of the tools you use probably have a lot of settings that can be tuned pretty finely. Often times, tools are customized with specialized languages, e.g. Vimscript for Vim or the shell's own language for a shell. Customizing and adapting your tools to your preferred workflow will make you more productive. We advise you to invest time in customizing your tool yourself rather than cloning someone else's dotfiles from GitHub. 
You probably have some dotfiles set up already. Some places to look: - `~/.bashrc` - `~/.emacs` - `~/.vim` - `~/.gitconfig` Some programs don't put the files under your home folder directly and instead they put them in a folder under `~/.config`. Dotfiles are not exclusive to command line applications, for instance the [MPV](https://mpv.io/) video player can be configured editing files under `~/.config/mpv` # Learning to customize tools You can learn about your tool's settings by reading online documentation or [man pages](https://en.wikipedia.org/wiki/Man_page). Another great way is to search the internet for blog posts about specific programs, where authors will tell you about their preferred customizations. Yet another way to learn about customizations is to look through other people's dotfiles: you can find tons of [dotfiles repositories](https://github.com/search?o=desc&q=dotfiles&s=stars&type=Repositories) on GitHub --- see the most popular one [here](https://github.com/mathiasbynens/dotfiles) (we advise you not to blindly copy configurations though). # Organization How should you organize your dotfiles? They should be in their own folder, under version control, and **symlinked** into place using a script. 
This has the benefits of: - **Easy installation**: if you log in to a new machine, applying your customizations will only take a minute - **Portability**: your tools will work the same way everywhere - **Synchronization**: you can update your dotfiles anywhere and keep them all in sync - **Change tracking**: you're probably going to be maintaining your dotfiles for your entire programming career, and version history is nice to have for long-lived projects ```shell cd ~/src mkdir dotfiles cd dotfiles git init touch bashrc # create a bashrc with some settings, e.g.: # PS1='\w > ' touch install chmod +x install # insert the following into the install script: # #!/usr/bin/env bash # BASEDIR="$(dirname "$0")" # cd "$BASEDIR" # # ln -s "${PWD}/bashrc" ~/.bashrc git add bashrc install git commit -m 'Initial commit' ``` # Advanced topics ## Machine-specific customizations Most of the time, you'll want the same configuration across machines, but sometimes, you'll want a small delta on a particular machine. Here are a couple of ways you can handle this situation: ### Branch per machine Use version control to maintain a branch per machine. This approach is logically straightforward but can be pretty heavyweight. ### If statements If the configuration file supports it, use the equivalent of if-statements to apply machine-specific customizations. For example, your shell could have something like: ```shell if [[ "$(uname)" == "Linux" ]]; then {do_something}; fi # Darwin is the kernel name that uname reports on macOS systems if [[ "$(uname)" == "Darwin" ]]; then {do_something}; fi # You can also make it machine-specific if [[ "$(hostname)" == "myServer" ]]; then {do_something}; fi ``` ### Includes If the configuration file supports it, make use of includes. For example, a `~/.gitconfig` can have a setting: ``` [include] path = ~/.gitconfig_local ``` And then on each machine, `~/.gitconfig_local` can contain machine-specific settings.
You could even track these in a separate repository for machine-specific settings. This idea is also useful if you want different programs to share some configurations. For instance if you want both `bash` and `zsh` to share the same set of aliases you can write them under `.aliases` and have the following block in both. ```bash # Test if ~/.aliases exists and source it if [ -f ~/.aliases ]; then source ~/.aliases fi ``` # Resources - Your instructors' dotfiles: [Anish](https://github.com/anishathalye/dotfiles), [Jon](https://github.com/jonhoo/configs), [Jose](https://github.com/jjgo/dotfiles) - [GitHub does dotfiles](http://dotfiles.github.io/): dotfile frameworks, utilities, examples, and tutorials - [Shell startup scripts](https://blog.flowblok.id.au/2013-02/shell-startup-scripts.html): an explanation of the different configuration files used for your shell # Exercises 1. Create a folder for your dotfiles and set up [version control](/2019/version-control/). 1. Add a configuration for at least one program, e.g. your shell, with some customization (to start off, it can be something as simple as customizing your shell prompt by setting `$PS1`). 1. Set up a method to install your dotfiles quickly (and without manual effort) on a new machine. This can be as simple as a shell script that calls `ln -s` for each file, or you could use a [specialized utility](http://dotfiles.github.io/utilities/). 1. Test your installation script on a fresh virtual machine. 1. Migrate all of your current tool configurations to your dotfiles repository. 1. Publish your dotfiles on GitHub. ================================================ FILE: _2019/editors.md ================================================ --- layout: lecture title: "Editors" presenter: Anish video: aspect: 62.5 id: 1vLcusYSrI4 --- # Importance of Editors As programmers, we spend most of our time editing plain-text files. It's worth investing time learning an editor that fits your needs. How do you learn a new editor? 
You force yourself to use that editor for a while, even if it temporarily hampers your productivity. It'll pay off soon enough (two weeks is enough to learn the basics). We are going to teach you Vim, but we encourage you to experiment with other editors. It's a very personal choice, and people have [strong opinions](https://en.wikipedia.org/wiki/Editor_war). We can't teach you how to use a powerful editor in 50 minutes, so we're going to focus on teaching you the basics, showing you some of the more advanced functionality, and giving you the resources to master the tool. We'll teach you lessons in the context of Vim, but most ideas will translate to any other powerful editor you use (and if they don't, then you probably shouldn't use that editor!). ![Editor Learning Curves](/2019/files/editor-learning-curves.jpg) The editor learning curves graph is a myth. Learning the basics of a powerful editor is quite easy (even though it might take years to master). Which editors are popular today? See this [Stack Overflow survey](https://insights.stackoverflow.com/survey/2018/#development-environments-and-tools) (there may be some bias because Stack Overflow users may not be representative of programmers as a whole). ## Command-line Editors Even if you eventually settle on using a GUI editor, it's worth learning a command-line editor for easily editing files on remote machines. # Nano Nano is a simple command-line editor. - Move with arrow keys - All other shortcuts (save, exit) shown at the bottom # Vim Vi/Vim is a powerful text editor. It's a command-line program that's usually installed everywhere, which makes it convenient for editing files on a remote machine. Vim also has graphical versions, such as GVim and [MacVim](https://macvim-dev.github.io/macvim/). These provide additional features such as 24-bit color, menus, and popups. 
## Philosophy of Vim - When programming, you spend most of your time reading/editing, not writing - Vim is a **modal** editor: different modes for inserting text vs manipulating text - Vim is programmable (with Vimscript and also other languages like Python) - Vim's interface itself is like a programming language - Keystrokes (with mnemonic names) are commands - Commands are composable - Don't use the mouse: too slow - Editor should work at the speed you think ## Introductory Vim ### Modes Vim shows the current mode in the bottom left. - Normal mode: for moving around a file and making edits - Spend most of your time here - Insert mode: for inserting text - Visual (visual, line, or block) mode: for selecting blocks of text You change modes by pressing `<ESC>` to switch from any mode back to normal mode. From normal mode, enter insert mode with `i`, visual mode with `v`, visual line mode with `V`, and visual block mode with `<C-v>`. You use the `<ESC>` key a lot when using Vim: consider remapping Caps Lock to Escape. ### Basics Vim ex commands are issued through `:{command}` in normal mode. - `:q` quit (close window) - `:w` save - `:wq` save and quit - `:e {name of file}` open file for editing - `:ls` show open buffers - `:help {topic}` open help - `:help :w` opens help for the `:w` ex command - `:help w` opens help for the `w` movement ### Movement Vim is all about efficient movement. Navigate the file in Normal mode.
- Disable arrow keys to avoid bad habits ```vim nnoremap <Left> :echoe "Use h"<CR> nnoremap <Right> :echoe "Use l"<CR> nnoremap <Up> :echoe "Use k"<CR> nnoremap <Down> :echoe "Use j"<CR> ``` - Basic movement: `hjkl` (left, down, up, right) - Words: `w` (next word), `b` (beginning of word), `e` (end of word) - Lines: `0` (beginning of line), `^` (first non-blank character), `$` (end of line) - Screen: `H` (top of screen), `M` (middle of screen), `L` (bottom of screen) - File: `gg` (beginning of file), `G` (end of file) - Line numbers: `:{number}` or `{number}G` (line {number}) - Misc: `%` (corresponding item) - Find: `f{character}`, `t{character}`, `F{character}`, `T{character}` - find/to forward/backward {character} on the current line - Repeating N times: `{number}{movement}`, e.g. `10j` moves down 10 lines - Search: `/{regex}`, `n` / `N` for navigating matches ### Selection Visual modes: - Visual - Visual Line - Visual Block Can use movement keys to make selection. ### Manipulating text Everything that you used to do with the mouse, you now do with the keyboard (and powerful, composable commands). - `i` enter insert mode - but for manipulating/deleting text, want to use something more than backspace - `o` / `O` insert line below / above - `d{motion}` delete {motion} - e.g. `dw` is delete word, `d$` is delete to end of line, `d0` is delete to beginning of line - `c{motion}` change {motion} - e.g. `cw` is change word - like `d{motion}` followed by `i` - `x` delete character (equal to `dl`) - `s` substitute character (equal to `xi`) - visual mode + manipulation - select text, `d` to delete it or `c` to change it - `u` to undo, `<C-r>` to redo - Lots more to learn: e.g. `~` flips the case of a character ### Resources - `vimtutor` command-line program to teach you vim - [Vim Adventures](https://vim-adventures.com/) game to learn Vim ## Customizing Vim Vim is customized through a plain-text configuration file in `~/.vimrc` (containing Vimscript commands).
There are probably lots of basic settings that you want to turn on. Look at people's dotfiles on GitHub for inspiration, but try not to copy-and-paste people's full configuration. Read it, understand it, and take what you need. Some customizations to consider: - Syntax highlighting: `syntax on` - Color schemes - Line numbers: `set nu` / `set rnu` - Backspacing through everything: `set backspace=indent,eol,start` ## Advanced Vim Here are a few examples to show you the power of the editor. We can't teach you all of these kinds of things, but you'll learn them as you go. A good heuristic: whenever you're using your editor and you think "there must be a better way of doing this", there probably is: look it up online. ### Search and replace `:s` (substitute) command ([documentation](http://vim.wikia.com/wiki/Search_and_replace)). - `%s/foo/bar/g` - replace foo with bar globally in file - `%s/\[.*\](\(.*\))/\1/g` - replace named Markdown links with plain URLs ### Multiple windows - `sp` / `vsp` to split windows - Can have multiple views of the same buffer. ### Mouse support - `set mouse+=a` - can click, scroll, select ### Macros - `q{character}` to start recording a macro in register `{character}` - `q` to stop recording - `@{character}` replays the macro - Macro execution stops on error - `{number}@{character}` executes a macro {number} times - Macros can be recursive - first clear the macro with `q{character}q` - record the macro, with `@{character}` to invoke the macro recursively (will be a no-op until recording is complete) - Example: convert xml to json ([file](/2019/files/example-data.xml)) - Array of objects with keys "name" / "email" - Use a Python program? - Use sed / regexes - `g/people/d` - `%s/<person>/{/g` - `%s/<name>\(.*\)<\/name>/"name": "\1",/g` - ...
- Vim commands / macros - `Gdd`, `ggdd` delete first and last lines - Macro to format a single element (register `e`) - Go to line with `<name>` - `qe^r"f>s": "<ESC>f<C"<ESC>q` - Macro to format a person - Go to line with `<person>` - `qpS{<ESC>j@eA,<ESC>j@ejS},<ESC>q` - Macro to format a person and go to the next person - Go to line with `<person>` - `qq@pjq` - Execute macro until end of file - `999@q` - Manually remove last `,` and add `[` and `]` delimiters ## Extending Vim There are tons of plugins for extending vim. First, get set up with a plugin manager like [vim-plug](https://github.com/junegunn/vim-plug), [Vundle](https://github.com/VundleVim/Vundle.vim), or [pathogen.vim](https://github.com/tpope/vim-pathogen). Some plugins to consider: - [ctrlp.vim](https://github.com/kien/ctrlp.vim): fuzzy file finder - [vim-fugitive](https://github.com/tpope/vim-fugitive): git integration - [vim-surround](https://github.com/tpope/vim-surround): manipulating "surroundings" - [gundo.vim](https://github.com/sjl/gundo.vim): navigate undo tree - [nerdtree](https://github.com/scrooloose/nerdtree): file explorer - [syntastic](https://github.com/vim-syntastic/syntastic): syntax checking - [vim-easymotion](https://github.com/easymotion/vim-easymotion): magic motions - [vim-over](https://github.com/osyo-manga/vim-over): substitute preview Lists of plugins: - [Vim Awesome](https://vimawesome.com/) ## Vim-mode in Other Programs For many popular editors (e.g. vim and emacs), many other tools support editor emulation. - Shell - bash: `set -o vi` - zsh: `bindkey -v` - `export EDITOR=vim` (environment variable used by programs like `git`) - `~/.inputrc` - `set editing-mode vi` There are even vim keybinding extensions for web [browsers](http://vim.wikia.com/wiki/Vim_key_bindings_for_web_browsers), some popular ones are [Vimium](https://chrome.google.com/webstore/detail/vimium/dbepggeogbaibhgnhhndojpepiihcmeb?hl=en) for Google Chrome and [Tridactyl](https://github.com/tridactyl/tridactyl) for Firefox.
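
The shell settings listed in this section can live together in one startup snippet. What follows is a minimal sketch, assuming a single file (hypothetically `~/.shellrc`) that is sourced from both `~/.bashrc` and `~/.zshrc`; the file name and the checks on `BASH_VERSION` / `ZSH_VERSION` are illustrative choices, not something the lecture prescribes.

```bash
# Hypothetical ~/.shellrc, sourced from both ~/.bashrc and ~/.zshrc.

# Programs such as git consult $EDITOR when they need to open an editor.
export EDITOR=vim

# Turn on vi-style line editing using whichever mechanism this shell provides.
if [ -n "${BASH_VERSION:-}" ]; then
    set -o vi        # bash
elif [ -n "${ZSH_VERSION:-}" ]; then
    bindkey -v       # zsh
fi
```

Other readline-based programs are not affected by this snippet; for those, the `set editing-mode vi` line in `~/.inputrc` is still needed.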
## Resources - [Vim Tips Wiki](http://vim.wikia.com/wiki/Vim_Tips_Wiki) - [Vim Advent Calendar](https://vimways.org/2018/): various Vim tips - [Neovim](https://neovim.io/) is a modern vim reimplementation with more active development. - [Vim Golf](http://www.vimgolf.com/): Various Vim challenges {% comment %} # Resources TODO resources for other editors? {% endcomment %} # Exercises 1. Experiment with some editors. Try at least one command-line editor (e.g. Vim) and at least one GUI editor (e.g. Atom). Learn through tutorials like `vimtutor` (or the equivalents for other editors). To get a real feel for a new editor, commit to using it exclusively for a couple days while going about your work. 1. Customize your editor. Look through tips and tricks online, and look through other people's configurations (often, they are well-documented). 1. Experiment with plugins for your editor. 1. Commit to using a powerful editor for at least a couple weeks: you should start seeing the benefits by then. At some point, you should be able to get your editor to work as fast as you think. 1. Install a linter (e.g. pyflakes for python) link it to your editor and test it is working. ================================================ FILE: _2019/files/example-data.xml ================================================ Johnny Zhang Jr. amyalvarez@cole.com Edward Cook dsparks@alvarez-dunn.com Stephen Sweeney dlewis@gmail.com Krystal Riley jflores@wright.biz Ashley Robinson robertsmichael@yahoo.com Kimberly Brooks sharoncunningham@larson.com Brent Proctor edward86@stewart.com William Roberts parkertodd@webb.com Amanda Morales lorizavala@hodges.com Bryan Poole Jr. 
carolyn56@gray-campos.net Dale Hall martinjames@yahoo.com Isabella Reynolds wbowen@wallace.com Ann Rodriguez charles37@taylor-riley.biz Bryan Davis jessica60@hotmail.com Dalton Powell piercenatasha@yahoo.com Scott Turner harold68@yahoo.com Nicholas Castillo dawnstephens@robinson.info Joseph Pierce lukepatterson@hotmail.com Robyn White jenniferrobinson@hotmail.com Justin Rice brandi76@gmail.com Jamie Graham harrisdavid@yahoo.com Phillip Schmidt stephanie33@gmail.com John Baker todd86@hotmail.com Sharon Austin srivera@yahoo.com Erica Avila jenniferreed@bowers-wilson.com Jeremy Bass jdavis@collins.com Joshua Parsons stephaniecoleman@miller-barker.com Emma Mccoy taylorjohn@wagner.net Megan Williams ronnie54@gmail.com Michael Sutton connie58@mendoza.net Nicholas York kennedykevin@collins.com Donald Robles williamsbrandon@gmail.com Melissa Allen pproctor@ramos-patel.com Shannon Jones beckkathleen@johnson.com David White sandra73@thompson.com Jonathan Thomas johnsonjeremy@gmail.com Rachael Floyd amanda78@johnson.info Tina Carter josewells@jones.net Eric Johnson bowersaustin@hernandez-edwards.com William Kramer rhunt@johnson.com Nathan Williams cynthiayoung@hotmail.com Patty Schwartz salinasdavid@sheppard.biz David Collins pcalhoun@yahoo.com James Thomas brianfox@rogers-cruz.com Mark Casey jerry88@graham.com Robert Galloway cherylmcgee@hotmail.com Caitlin Dunn nicholemartin@yahoo.com Nancy Allison martha33@molina-bullock.com Marvin Burns wrocha@gmail.com Kimberly Jones anitamunoz@french-christian.com Caitlin Wood thomasrandall@bowers-sullivan.org Sara Burton riosangelica@gmail.com Jessica Roberson theresa11@hotmail.com Nicole Macias kevinhodge@martin.biz Christina Williams shawn35@rice-bailey.org Cody Winters nicholassmith@barron-wu.com Patricia Miller DDS pierceraymond@watkins.org Jennifer Lyons vrivera@gmail.com Jerry Rojas jacobalexander@yahoo.com Matthew Perez jrivas@hotmail.com Patrick Hogan moorelisa@yahoo.com Lisa Howard stephen90@smith.biz Justin Sloan 
edwardsmichael@hotmail.com Suzanne Morrow shane74@yahoo.com Theresa Lara maryrichardson@clark.com Christopher Powers yfowler@davis-lee.net Teresa Howell amy15@yahoo.com Richard Shelton ksmith@yahoo.com Jeremy Cole bleach@gmail.com Melissa Clark rosejeffrey@yahoo.com Kimberly Mcdaniel ularson@ross-david.com Kelly Dixon gatesstephen@hotmail.com Devin Quinn wjohnson@hotmail.com Kevin Greene lhanson@hotmail.com Jeffery Wiggins amy76@gmail.com Latoya Allen vking@yahoo.com Zachary Walker diazjames@hotmail.com Alyssa Molina elizabeth59@gmail.com Heather Miranda davidturner@cortez-martinez.biz Lori Gardner murphytaylor@yahoo.com Jessica Simpson jamesdean@rosales.com Anna Dickerson abigailmurphy@hotmail.com Molly Oconnor morrisrhonda@yahoo.com Brandi Braun ericksonmatthew@jenkins.org Renee Flowers brownantonio@yang-crosby.org Cassandra Compton progers@yahoo.com David Gilbert vickie78@gmail.com Brenda Davis cynthiajones@thornton.com Nicholas Rivera longalyssa@yahoo.com Dustin Hodges sgolden@lee.com Chad Wong williambernard@mccarty.net Robin Craig xbyrd@austin.com Heather Parker allenjoshua@rodriguez.com Jennifer Roberts manningtravis@gmail.com James Andrews ginaromero@hotmail.com Dorothy Hines dsmith@thomas.com Stephen Garcia hughesbrendan@hotmail.com Alfred Ellis elizabeth41@crawford.info Marilyn White victoriaford@hotmail.com Brian Graves cpatel@gmail.com Elizabeth Wagner newtonwesley@cohen.com Michelle Flores shelbygross@duke-thomas.info Larry Russell richard99@meyer.com Terrence Boyd markmartin@flores.com Jessica Carroll eric30@yahoo.com Erin Dean toddmartin@guerra.biz Craig Hernandez joshualang@gonzalez.com Amber Choi doughertynancy@harmon.org Renee Brown terribeard@archer-gibson.info Curtis Turner pjohnson@hotmail.com Benjamin Reed marksmith@austin.net Christina Fernandez richardjoseph@esparza-peters.com Jasmine Campbell thomasmatthew@gmail.com Catherine Bond coreyroberts@gonzalez.com Connie Jones koneal@riley.com Cody Taylor kelsey99@hotmail.com Kendra Gray 
walkerrussell@hotmail.com Alexander Murray grossrobert@hotmail.com Arthur Jackson travis73@hotmail.com Dr. William Vasquez DDS gonzalezdaniel@hotmail.com April Hampton desireemorris@mcguire.info Gerald Hunter justin91@ross-scott.biz Morgan Bolton erika30@lloyd-smith.biz Angela Barker daniel17@carr.com Angela Montgomery jonathangoodwin@smith-perez.com Yolanda Henry shawnmcguire@gmail.com Susan Hines sarahbailey@wallace.com Michelle Young lewismichele@yahoo.com Glen Hood ljackson@vazquez.com Christopher Wright evansjulie@walton.com Susan Guzman DDS medinaelizabeth@gmail.com Barbara Cortez bchavez@cameron.com Stacey Hammond nancyturner@stewart.com Amanda Stout macdonaldlatoya@hotmail.com Lisa Johnson wnolan@gmail.com Carlos Wyatt iperez@cohen.com Samantha Brewer thomas47@hotmail.com Brett Jackson zpowell@cruz-rivera.com Johnny Guzman tmerritt@yahoo.com Mary Davis collinslisa@hotmail.com Willie Mccoy joshua20@terrell.biz Kelsey Rivera randy72@gmail.com Melissa Maddox christopher13@gmail.com Jason Rodriguez kellypierce@harris.com Donna Walsh wardraymond@martinez.com Monique Patel cynthia75@james.net Dr. Lindsay Farrell PhD brownmaria@gmail.com Ann Ruiz jeremiah94@pennington.org Mary Alexander catherineharper@munoz.org Brittany Russell haileywinters@russell-coffey.net Dominique Rosales matthewpatterson@carr.com Henry Waters karen72@logan.com Jared Weaver karlafletcher@baldwin.org Mr. 
Thomas Atkins gboone@gmail.com Carla Cohen ibarron@gmail.com Tricia Lewis pperez@hotmail.com Mario Gill lisa43@brown.org James Olsen vickie82@hotmail.com Michael Perry rdavis@yahoo.com Matthew Lucas joshuagray@carpenter-stanley.com Christine Torres samanthayoung@smith-aguilar.biz Lindsay Miller randyevans@yahoo.com Margaret Jones kevincantu@alexander-carson.org Cameron Mcdonald deckerjerome@garcia.com Brittany Sanders dennis55@leonard-turner.com Daniel Patterson timothy36@novak.com David Chaney kristen02@hotmail.com Sheri Silva idawson@alvarez.com Holly Ward saraallen@dunn-smith.net Bryan Solis stacey30@lam.biz Diane Carter paulvargas@gmail.com David Brown james98@gmail.com Bridget Fritz beth24@hotmail.com Paul Boyd johngutierrez@hotmail.com Ernest Baker phillipwhite@hotmail.com George Myers frank52@hammond.com Daniel Miller joshua96@gmail.com Jonathan Ayala jerryharris@davis.net Jill Stone pwright@hotmail.com Trevor Richard mreed@thompson.org Jason Thomas josephflowers@hotmail.com Arthur Thomas lnelson@hicks.com Austin Collins ambermann@barnes.com Jason Diaz ericreyes@hotmail.com Darryl Hall faithdixon@barnes-burgess.org Jason Thomas brittany32@yahoo.com John Sanders waltontheresa@hotmail.com Lisa Hayes victor14@hotmail.com Chelsea Wong iwatkins@williams-solomon.com Joseph Fitzgerald mary86@hotmail.com Crystal Schroeder kbarron@wilson-flynn.org Denise Bean noah23@gmail.com Jamie Atkins cwebb@hotmail.com Joshua Kim esmith@ramirez.com Deanna Mooney jason13@turner.com Jasmine Baker torresjacob@braun.com Victoria Williams rwilliams@hotmail.com Sandra Hall williamsonrichard@gmail.com Miranda Mcpherson xrussell@barajas.biz Samantha Walton danielle73@gmail.com Kyle Serrano stonecassandra@mcfarland.info Mr. 
Bruce Maldonado DDS diazmatthew@yahoo.com Amber Fisher jonesdavid@rubio.info Brett Berry millerteresa@gmail.com Cory Bradley umatthews@summers.com Ryan Peters shepherdmonique@gmail.com Laura Lee lfleming@higgins.com Christian Smith johnnymartinez@castro-miller.com Kelly Hanson velazquezsandra@chavez-malone.info Brian King hwood@yahoo.com Cynthia Owens sbrown@hotmail.com Lisa Clark derek74@bell-martinez.com Brenda Ford kevin55@hotmail.com Daniel Brady wbennett@hotmail.com Jake Wilson lorraine60@solis.biz April Cole halltyler@yahoo.com Melissa Callahan cmckenzie@rodriguez.info Taylor Brown davisadam@gmail.com Patrick Guerrero hannah48@delgado.net Brian Gonzalez burchmalik@johnson.com Robert Bailey debbiemoore@hotmail.com Jesus Maynard gene45@gmail.com Linda Greer johnharris@reed-allen.net Travis Thomas bryantrachel@gmail.com Vicki Mitchell edaniels@hotmail.com Paula Espinoza donnameyer@dennis.org James Hoffman haustin@larson-wiggins.biz Ashlee Perkins stevenknapp@miller.com Rebecca Leon smitchell@simpson-johnson.com Jorge Williams shawn36@peters-meadows.com Bob Flores kellercourtney@yahoo.com Lisa Miller johnsoncrystal@gmail.com Brandon Davis bryanpetersen@hotmail.com Joshua Daugherty josehayes@carey.com Justin Wise pamelacosta@simmons-morrow.com Kimberly Johnson combssandra@deleon.com Toni Stone eestrada@charles.com Julie Rivers rwilliams@castillo-nelson.org Kelly Scott danielsmith@hotmail.com Michael Carr clarklisa@newman-barrett.com Jonathan Vaughn dennisrebecca@lawrence-harris.com Erica Lowe wilsonkelly@hotmail.com Kimberly Clark jose15@gmail.com Lindsey Robertson rdickerson@yahoo.com Cindy Anderson gmorton@daniels.com Tami Barber harveykaren@hotmail.com Tiffany Wu jessica90@gmail.com Edward Bowers hallkathy@gmail.com Shawn Collier rhondasmith@hotmail.com Michael Cox usimpson@graham-cunningham.net ================================================ FILE: _2019/files/example.c ================================================ #include const char *numbers[] = { "one", 
"two", "three", "four", "five", "six", "seven", "eight", "nine", "ten" }; void say(int i) { const char *msg = numbers[i-1]; printf("%s\n", msg); } int main() { for (int i = 1; i <= 10; i++) { say(i); } } ================================================ FILE: _2019/index.html ================================================ --- layout: page title: "2019 Lectures" permalink: /2019/ ---

Click on specific topics below to see lecture videos and lecture notes.

Tuesday, 1/15

Thursday, 1/17

Tuesday, 1/22

Thursday, 1/24

Tuesday, 1/29

Thursday, 1/31


Discussion

We've also shared this class beyond MIT in the hopes that others may benefit from these resources. You can find posts and discussion on

================================================ FILE: _2019/machine-introspection.md ================================================ --- layout: lecture title: "Machine Introspection" presenter: Jon video: aspect: 56.25 id: eNYT2Oq3PF8 --- Sometimes, computers misbehave. And very often, you want to know why. Let's look at some tools that help you do that! But first, let's make sure you're able to do introspection. Often, system introspection requires that you have certain privileges, like being the member of a group (like `power` for shutdown). The `root` user is the ultimate privilege; they can do pretty much anything. You can run a command as `root` (but be careful!) using `sudo`. ## What happened? If something goes wrong, the first place to start is to look at what happened around the time when things went wrong. For this, we need to look at logs. Traditionally, logs were all stored in `/var/log`, and many still are. Usually there's a file or folder per program. Use `grep` or `less` to find your way through them. There's also a kernel log that you can see using the `dmesg` command. This used to be available as a plain-text file, but nowadays you often have to go through `dmesg` to get at it. Finally, there is the "system log", which is increasingly where all of your log messages go. On _most_, though not all, Linux systems, that log is managed by `systemd`, the "system daemon", which controls all the services that run in the background (and much much more at this point). That log is accessible through the somewhat inconvenient `journalctl` tool if you are root, or part of the `admin` or `wheel` groups. For `journalctl`, you should be aware of these flags in particular: - `-u UNIT`: show only messages related to the given systemd service - `--full`: don't truncate long lines (the stupidest feature) - `-b`: only show messages from the latest boot (see also `-b -2`) - `-n100`: only show last 100 entries ## What is happening? 
If something _is_ wrong, or you just want to get a feel for what's going on in your system, you have a number of tools at your disposal for inspecting the currently running system:

First, there's `top`, and the improved version `htop`, which show you various statistics for the currently running processes on the system: CPU use, memory use, process trees, etc. There are lots of shortcuts, but `t` is particularly useful for enabling the tree view. You can also see the process tree with `pstree` (add `-p` to include PIDs).

If you want to know what those programs are doing, you'll often want to tail their log files. `journalctl -f`, `dmesg -w`, and `tail -f` are your friends here.

Sometimes, you want to know more about the resources being used overall on your system. [`dstat`](http://dag.wiee.rs/home-made/dstat/) is excellent for that. It gives you real-time resource metrics for lots of different subsystems like I/O, networking, CPU utilization, context switches, and the like. `man dstat` is the place to start.

If you're running out of disk space, there are two primary utilities you'll want to know about: `df` and `du`. The former shows you the status of all the partitions on your system (try it with `-h`), whereas the latter measures the size of all the folders you give it, including their contents (see also `-h` and `-s`).

To figure out what network connections you have open, `ss` is the way to go. `ss -t` will show all open TCP connections. `ss -tl` will show all listening (i.e., server) TCP ports on your system. `-p` will also include which process is using that connection, and `-n` will give you the raw port numbers.

## System configuration

There are _many_ ways to configure your system, but we'll go through two very common ones: networking and services. Most applications on your system tell you how to configure them in their manpage, and usually it will involve editing files in `/etc`, the system configuration directory.
If you want to configure your network, the `ip` command lets you do that. Its arguments take on a slightly weird form, but `ip help` (or `ip OBJECT help`, e.g. `ip addr help`) will get you pretty far. `ip addr` shows you information about your network interfaces and how they're configured (IP addresses and such), and `ip route` shows you how network traffic is routed to different network hosts. Network problems can often be resolved purely through the `ip` tool. There's also `iw` for managing wireless network interfaces. `ping` is a handy tool for checking how deeply things are broken. Try pinging a hostname (google.com), an external IP address (1.1.1.1), and an internal IP address (192.168.1.1 or your default gateway). You may also want to look at `/etc/resolv.conf` to check your DNS settings (how hostnames are resolved to IP addresses).

To configure services, you pretty much have to interact with `systemd` these days, for better or for worse. Most services on your system will have a systemd service file that defines a systemd _unit_. These files define what command to run when that service is started, how to stop it, where to log things, etc. They're usually not too bad to read, and you can find most of them in `/usr/lib/systemd/system/`. You can also define your own in `/etc/systemd/system`.

Once you have a systemd service in mind, you use the `systemctl` command to interact with it. `systemctl enable UNIT` will set the service to start on boot (`disable` removes it again), and `start`, `stop`, and `restart` will do what you expect. If something goes wrong, systemd will let you know, and you can use `journalctl -u UNIT` to see the application's log. You can also use `systemctl status` to see how all your system services are doing. If your boot feels slow, it's probably due to a couple of slow services, and you can use `systemd-analyze` (try it with `blame`) to figure out which ones.

# Exercises

`locate`? `dmidecode`? `tcpdump`? `/boot`? `iptables`? `/proc`?
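As a small illustration of the `df` and `du` discussion under "What is happening?", the sketch below lists the largest entries directly inside a directory. The function name `largest_entries` is made up for this example, and the shell glob it relies on skips hidden (dot) entries, so treat it as a starting point rather than a complete disk-usage tool.

```bash
# largest_entries [DIR] [COUNT]   (hypothetical helper, not a standard command)
# Prints the COUNT largest entries directly inside DIR, biggest first.
largest_entries() {
    local dir="${1:-.}" count="${2:-5}"
    # du -s: one total per argument; -k: sizes in KiB so a plain numeric sort works.
    # Errors (e.g. unreadable folders) are hidden to keep the listing readable.
    du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Overall picture first (free space on the partition holding the current folder),
# then the entries that take up the most room in it.
df -h .
largest_entries . 5
```

Running `largest_entries /var/log 10` (possibly with `sudo`) is one way to check whether logs are what is filling up a disk.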
================================================ FILE: _2019/os-customization.md ================================================ --- layout: lecture title: "OS Customization" presenter: Anish video: aspect: 62.5 id: epSRVqQzeDo --- There is a lot you can do to customize your operating system beyond what is available in the settings menus. # Keyboard remapping Your keyboard probably has keys that you aren't using very much. Instead of having useless keys, you can remap them to do useful things. ## Remapping to other keys The simplest thing is to remap keys to other keys. For example, if you don't use the caps lock key very much, then you can remap it to something more useful. If you are a Vim user, for example, you might want to remap caps lock to escape. On macOS, you can do some remappings through Keyboard settings in System Preferences; for more complicated mappings, you need special software. ## Remapping to arbitrary commands You don't just have to remap keys to other keys: there are tools that will let you remap keys (or combinations of keys) to arbitrary commands. For example, you could make command-shift-t open a new terminal window. # Customizing hidden OS settings ## macOS macOS exposes a lot of useful settings through the `defaults` command. For example, you can make Dock icons of hidden applications translucent: ```shell defaults write com.apple.dock showhidden -bool true ``` There is no single list of all possible settings, but you can find lists of specific customizations online, such as Mathias Bynens' [.macos](https://github.com/mathiasbynens/dotfiles/blob/master/.macos). # Window management ## Tiling window management [Tiling window management](https://en.wikipedia.org/wiki/Tiling_window_manager) is one approach to window management, where you organize windows into non-overlapping frames. 
If you're using a Unix-based operating system, you can install a tiling window manager; if you're using something like Windows or macOS, you can install applications that let you approximate this behavior. ## Screen management You can set up keyboard shortcuts to help you manipulate windows across screens. ## Layouts If there are specific ways you lay out windows on a screen, rather than "executing" that layout manually, you can script it, making instantiating a layout trivial. # Resources - [Hammerspoon](https://www.hammerspoon.org/) - macOS desktop automation - [Rectangle](https://rectangleapp.com/) - macOS window manager - [Karabiner](https://karabiner-elements.pqrs.org/) - sophisticated macOS keyboard remapping - [r/unixporn](https://www.reddit.com/r/unixporn/) - screenshots and documentation of people's fancy configurations # Exercises 1. Figure out how to remap your Caps Lock key to something you use more often (such as Escape or Ctrl or Backspace). 1. Make a custom global keyboard shortcut to open a new terminal window or a new browser window. {% comment %} TODO - Bitbar / Polybar - Clipboard Manager (stack/searchable history) {% endcomment %} ================================================ FILE: _2019/package-management.md ================================================ --- layout: lecture title: "Package Management and Dependency Management" presenter: Anish video: aspect: 56.25 id: tgvt473T8xA --- Software usually builds on (a collection of) other software, which necessitates dependency management. Package/dependency management programs are language-specific, but many share common ideas. # Package repositories Packages are hosted in _package repositories_. There are different repositories for different languages (and sometimes multiple for a particular language), such as [PyPI](https://pypi.org/) for Python, [RubyGems](https://rubygems.org/) for Ruby, and [crates.io](https://crates.io/) for Rust. 
They generally store software (source code and sometimes pre-compiled binaries for specific platforms) for all versions of a package.

# Semantic versioning

Software evolves over time, and we need a way to refer to software versions. Some simple ways could be to refer to software by a sequence number or a commit hash, but we can do better in terms of communicating more information: using version numbers.

There are many approaches; one popular one is [Semantic Versioning](https://semver.org/):

```
x.y.z
^ ^ ^
| | +- patch
| +--- minor
+----- major
```

Increment **major** version when you make incompatible API changes.

Increment **minor** version when you add functionality in a backward-compatible manner.

Increment **patch** when you make backward-compatible bug fixes.

For example, if you depend on a feature introduced in `v1.2.0` of some software, then you can install `v1.x.y` for any minor version `x >= 2` and any patch version `y`. You need to install major version `1` (because `2` can introduce backward-incompatible changes), and you need to install a minor version `>= 2` (because you depend on a feature introduced in that minor version). You can use any newer minor version or patch version because they should not introduce any backward-incompatible changes.

# Lock files

In addition to specifying versions, it can be nice to enforce that the _contents_ of the dependency have not changed to prevent tampering. Some tools use _lock files_ to specify cryptographic hashes of dependencies (along with versions) that are checked on package install.

# Specifying versions

Tools often let you specify versions in multiple ways, such as:

- exact version, e.g. `2.3.12`
- minimum major version, e.g. `>= 2`
- specific major version and minimum minor version, e.g.
`>= 2.3, <3.0` Specifying an exact version can be advantageous to avoid different behaviors based on installed dependencies (this shouldn't happen if all dependencies faithfully follow semver, but sometimes people make mistakes). Specifying a minimum requirement has the advantage of allowing bug fixes to be installed (e.g. patch upgrades). # Dependency resolution Package managers use various dependency resolution algorithms to satisfy dependency requirements. This often gets challenging with complex dependencies (e.g. a package can be indirectly depended on by multiple top-level dependencies, and different versions could be required). Different package managers have different levels of sophistication in their dependency resolution, but it's something to be aware of: you may need to understand this if you are debugging dependencies. # Virtual environments If you're developing multiple software projects, they may depend on different versions of a particular piece of software. Sometimes, your build tool will handle this naturally (e.g. by building a static binary). For other build tools and programming languages, one approach is handling this with virtual environments (e.g. with the [virtualenv](https://docs.python-guide.org/dev/virtualenvs/) tool for Python). Instead of installing dependencies system-wide, you can install dependencies per-project in a virtual environment, and _activate_ the virtual environment that you want to use when you're working on a specific project. # Vendoring Another very different approach to dependency management is _vendoring_. Instead of using a dependency manager or build tool to fetch software, you copy the entire source code for a dependency into your software's repository. This has the advantage that you're always building against the same version of the dependency and you don't need to rely on a package repository, but it is more effort to upgrade dependencies. 
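The compatibility rule from the Semantic versioning section can be sketched as a small shell function. This is only an illustration of the rule, not how any particular package manager resolves versions: `semver_compatible` is a made-up name, and the sketch assumes plain numeric `MAJOR.MINOR.PATCH` strings (an optional leading `v` is tolerated, pre-release suffixes such as `-rc1` are not handled).

```bash
# semver_compatible REQUIRED CANDIDATE   (hypothetical helper)
# Succeeds if CANDIDATE has the same major version as REQUIRED and is
# at least as new in its minor/patch numbers.
semver_compatible() {
    local required="${1#v}" candidate="${2#v}"   # tolerate a leading "v"
    local r_major r_minor r_patch c_major c_minor c_patch rest

    # Split MAJOR.MINOR.PATCH on the dots with parameter expansion.
    r_major="${required%%.*}";  rest="${required#*.}"
    r_minor="${rest%%.*}";      r_patch="${rest#*.}"
    c_major="${candidate%%.*}"; rest="${candidate#*.}"
    c_minor="${rest%%.*}";      c_patch="${rest#*.}"

    # A different major version may contain incompatible API changes.
    [ "$c_major" -eq "$r_major" ] || return 1
    # A newer minor version only adds backward-compatible functionality.
    [ "$c_minor" -gt "$r_minor" ] && return 0
    # Same minor version: the patch level must not be older than required.
    [ "$c_minor" -eq "$r_minor" ] && [ "$c_patch" -ge "$r_patch" ]
}

# Check a few candidates against a dependency on a feature from v1.2.0.
for candidate in v1.2.0 v1.4.7 v1.1.9 v2.0.0; do
    if semver_compatible v1.2.0 "$candidate"; then
        echo "$candidate: acceptable"
    else
        echo "$candidate: rejected"
    fi
done
```

With a requirement of `v1.2.0`, the function accepts `v1.2.0` and `v1.4.7`, and rejects `v1.1.9` (the feature is not there yet) and `v2.0.0` (a new major version may break the API).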
================================================ FILE: _2019/program-introspection.md ================================================ --- layout: lecture title: "Program Introspection" presenter: Anish video: aspect: 62.5 id: 74MhV-7hYzg --- # Debugging When printf-debugging isn't good enough: use a debugger. Debuggers let you interact with the execution of a program, letting you do things like: - halt execution of the program when it reaches a certain line - single-step through the program - inspect values of variables - many more advanced features ## GDB/LLDB [GDB](https://www.gnu.org/software/gdb/) and [LLDB](https://lldb.llvm.org/). Supports many C-like languages. Let's look at [example.c](/2019/files/example.c). Compile with debug flags: `gcc -g -o example example.c`. Open GDB: `gdb example` Some commands: - `run` - `b {name of function}` - set a breakpoint - `b {file}:{line}` - set a breakpoint - `c` - continue - `step` / `next` / `finish` - step in / step over / step out - `p {variable}` - print value of variable - `watch {expression}` - set a watchpoint that triggers when the value of the expression changes - `rwatch {expression}` - set a watchpoint that triggers when the value is read - `layout` ## PDB [PDB](https://docs.python.org/3/library/pdb.html) is the Python debugger. Insert `import pdb; pdb.set_trace()` where you want to drop into PDB, basically a hybrid of a debugger (like GDB) and a Python shell. ## Web browser Developer Tools Another example of a debugger, this time with a graphical interface. # strace Observe system calls a program makes: `strace {program}`. # Profiling Types of profiling: CPU, memory, etc. Simplest profiler: `time`. 
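To make the `time` suggestion concrete, here is a minimal sketch using Bash's built-in `time` keyword. It reports three numbers: real (wall-clock) time, user (CPU time spent in the program's own code), and sys (CPU time spent in the kernel on its behalf). The wrapper name `time_report` is invented for this example, and `TIMEFORMAT` is Bash-specific, so other shells will need a different approach.

```bash
#!/usr/bin/env bash
# time_report COMMAND [ARGS...]   (hypothetical helper)
# Runs the command, discards its output, and prints only the timing line.
time_report() {
    # %R = real seconds, %U = user CPU seconds, %S = system CPU seconds.
    local TIMEFORMAT='real=%R user=%U sys=%S'
    # `time` writes its report to the shell's stderr, so redirect the group.
    { time "$@" > /dev/null 2>&1; } 2>&1
}

# Mostly waiting: real is large while user and sys stay close to zero.
time_report sleep 0.3

# Mostly computing: user should account for most of the real time.
time_report bash -c 'i=0; while [ "$i" -lt 200000 ]; do i=$((i + 1)); done'
```

Comparing real against user plus sys is a quick first hint about whether a program is CPU-bound or spends its time waiting on I/O, the network, or sleeps.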
## Go Run test code with CPU profiler: `go test -cpuprofile=cpu.out` Analyze profile: `go tool pprof -web cpu.out` Run test code with Memory profiler: `go test -memprofile=mem.out` Analyze profile: `go tool pprof -web mem.out` ## Perf Basic performance stats: `perf stat {command}` Run a program with the profiler: `perf record {command}` Analyze profile: `perf report` ================================================ FILE: _2019/remote-machines.md ================================================ --- layout: lecture title: "Remote Machines" presenter: Jose video: aspect: 62.5 id: X5c2Y8BCowM --- It has become more and more common for programmers to use remote servers in their everyday work. If you need to use remote servers in order to deploy backend software or you need a server with higher computational capabilities, you will end up using a Secure Shell (SSH). As with most tools covered, SSH is highly configurable so it is worth learning about it. ## Executing commands An often overlooked feature of `ssh` is the ability to run commands directly. - `ssh foobar@server ls` will execute ls in the home folder of foobar - It works with pipes, so `ssh foobar@server ls | grep PATTERN` will grep locally the remote output of `ls` and `ls | ssh foobar@server grep PATTERN` will grep remotely the local output of `ls`. ## SSH Keys Key-based authentication exploits public-key cryptography to prove to the server that the client owns the secret private key without revealing the key. This way you do not need to reenter your password every time. Nevertheless the private key (e.g. `~/.ssh/id_rsa`) is effectively your password so treat it like so. - Key generation. To generate a pair you can simply run `ssh-keygen -t rsa -b 4096`. If you do not choose a passphrase anyone that gets hold of your private key will be able to access authorized servers so it is recommended to choose one and use `ssh-agent` to manage shell sessions. 
If you have configured pushing to GitHub using SSH keys, you have probably followed the steps outlined [here](https://help.github.com/articles/connecting-to-github-with-ssh/) and already have a valid pair. To check whether your key has a passphrase and validate it, you can run `ssh-keygen -y -f /path/to/key`.
- Key based authentication. The SSH server looks in `~/.ssh/authorized_keys` to determine which clients it should let in. To copy a public key over, we can run:

```bash
cat .ssh/id_rsa.pub | ssh foobar@remote 'cat >> ~/.ssh/authorized_keys'
```

A simpler solution is `ssh-copy-id`, where available.

```bash
ssh-copy-id -i .ssh/id_rsa.pub foobar@remote
```

## Copying files over ssh

There are many ways to copy files over ssh:

- `ssh+tee`: the simplest is to use `ssh` command execution and stdin input by doing `cat localfile | ssh remote_server tee serverfile`
- `scp`: when copying large numbers of files/directories, the secure copy `scp` command is more convenient since it can easily recurse over paths. The syntax is `scp path/to/local_file remote_host:path/to/remote_file`
- `rsync` improves upon `scp` by detecting files that are identical locally and remotely and skipping them instead of copying them again. It also provides more fine-grained control over symlinks and permissions, and has extra features like the `--partial` flag that can resume from a previously interrupted copy. `rsync` has a similar syntax to `scp`.

## Backgrounding processes

By default, when an ssh connection is interrupted, child processes of the parent shell are killed along with it. There are a couple of alternatives:

- `nohup` - the `nohup` tool effectively allows a process to keep living when the terminal gets killed. Although this can sometimes be achieved with `&` and `disown`, `nohup` is a better default. More details can be found [here](https://unix.stackexchange.com/questions/3886/difference-between-nohup-disown-and).
- `tmux`, `screen` - whereas `nohup` effectively backgrounds the process, it is not convenient for interactive shell sessions. In that case, using a terminal multiplexer like `screen` or `tmux` is a convenient choice since one can easily detach and reattach the associated shells.

Lastly, if you disown a program and want to reattach it to the current terminal, you can look into [reptyr](https://github.com/nelhage/reptyr). `reptyr PID` will grab the process with id PID and attach it to your current terminal.

## Port Forwarding

In many scenarios you will run into software that works by listening on ports on the machine. When this happens on your local machine you can simply use `localhost:PORT` or `127.0.0.1:PORT`, but what do you do with a remote server that does not have its ports directly available through the network/internet? The solution is called port forwarding and it comes in two flavors: Local Port Forwarding and Remote Port Forwarding (see the pictures for more details, credit of the pictures from [this SO post](https://unix.stackexchange.com/questions/115897/whats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot)).

**Local Port Forwarding**

![Local Port Forwarding](https://i.stack.imgur.com/a28N8.png)

**Remote Port Forwarding**

![Remote Port Forwarding](https://i.stack.imgur.com/4iK3b.png)

The most common scenario is local port forwarding, where a service on the remote machine listens on a port and you want to link a port on your local machine to forward to the remote port. For example, suppose we run `jupyter notebook` on the remote server and it listens on port `8888`. To forward that to the local port `9999`, we would run `ssh -L 9999:localhost:8888 foobar@remote_server` and then navigate to `localhost:9999` on our local machine.

## Graphics Forwarding

Sometimes forwarding ports is not enough since we want to run a GUI based program in the server.
You can always resort to Remote Desktop Software that sends the entire Desktop Environment (i.e. options like RealVNC, TeamViewer, &c). However, for a single GUI tool, SSH provides a good alternative: Graphics Forwarding.

Using the `-X` flag tells SSH to enable X11 forwarding. For trusted X11 forwarding the `-Y` flag can be used.

A final note is that for this to work the `sshd_config` on the server must have the following options:

```bash
X11Forwarding yes
X11DisplayOffset 10
```

## Roaming

A common pain when connecting to a remote server is disconnections due to shutting down/sleeping your computer or changing networks. Moreover, if one has a connection with significant lag, using ssh can become quite frustrating. [Mosh](https://mosh.org/), the mobile shell, improves upon ssh, allowing roaming connections, intermittent connectivity and providing intelligent local echo.

Mosh is present in all common distributions and package managers. Mosh requires a working ssh server on the remote machine. You do not need to be superuser to install mosh, but it does require UDP ports 60000 through 60010 to be open on the server (they usually are since they are not in the privileged range).

A downside of `mosh` is that it does not support roaming port/graphics forwarding, so if you use those often `mosh` won't be of much help.

## SSH Configuration

#### Client

We have covered many, many arguments that we can pass. A tempting alternative is to create shell aliases that look like `alias my_server="ssh -X -i ~/.ssh/id_rsa -L 9999:localhost:8888 foobar@remote_server"`; however, there is a better alternative: using `~/.ssh/config`.
```bash
Host vm
    User foobar
    HostName 172.16.174.141
    Port 22
    IdentityFile ~/.ssh/id_rsa
    LocalForward 9999 localhost:8888

# Configs can also take wildcards
Host *.mit.edu
    User foobaz
```

An additional advantage of using the `~/.ssh/config` file over aliases is that other programs like `scp`, `rsync`, `mosh`, &c are able to read it as well and convert the settings into the corresponding flags.

Note that the `~/.ssh/config` file can be considered a dotfile, and in general it is fine for it to be included with the rest of your dotfiles. However, if you make it public, think about the information that you are potentially providing to strangers on the internet: the addresses of your servers, the users you are using, the open ports, &c. This may facilitate some types of attacks, so be thoughtful about sharing your SSH configuration.

Warning: Never include your RSA keys ( `~/.ssh/id_rsa*` ) in a public repository!

#### Server side

Server side configuration is usually specified in `/etc/ssh/sshd_config`. Here you can make changes like disabling password authentication, changing ssh ports, enabling X11 forwarding, &c. You can also specify config settings on a per-user basis.

## Remote Filesystem

Sometimes it is convenient to mount a remote folder. [sshfs](https://github.com/libfuse/sshfs) can mount a folder on a remote server locally, and then you can use a local editor.

## Exercises

1. For SSH to work the host needs to be running an SSH server. Install an SSH server (such as OpenSSH) in a virtual machine so you can do the rest of the exercises. To find the machine's IP address, run `ip addr` and look for the `inet` field (ignore the `127.0.0.1` entry, which corresponds to the loopback interface).
1. Go to `~/.ssh/` and check if you have a pair of SSH keys there. If not, generate them with `ssh-keygen -t rsa -b 4096`. It is recommended that you use a passphrase together with `ssh-agent`; more info [here](https://www.ssh.com/ssh/agent).
1.
Use `ssh-copy-id` to copy the key to your virtual machine. Test that you can ssh without a password. Then, edit your `sshd_config` in the server to disable password authentication by editing the value of `PasswordAuthentication`. Disable root login by editing the value of `PermitRootLogin`. 1. Edit the `sshd_config` in the server to change the ssh port and check that you can still ssh. If you ever have a public facing server, a non default port and key only login will throttle a significant amount of malicious attacks. 1. Install mosh in your server/VM, establish a connection and then disconnect the network adapter of the server/VM. Can mosh properly recover from it? 1. Another use of local port forwarding is to tunnel certain host to the server. If your network filters some website like for example `reddit.com` you can tunnel it through the server as follows: - Run `ssh remote_server -L 80:reddit.com:80` - Set `reddit.com` and `www.reddit.com` to `127.0.0.1` in `/etc/hosts` - Check that you are accessing that website through the server - If it is not obvious use a website such as [ipinfo.io](https://ipinfo.io/) which will change depending on your host public ip. 1. Background port forwarding can easily be achieved with a couple of extra flags. Look into what the `-N` and `-f` flags do in `ssh` and figure out what a command such as this `ssh -N -f -L 9999:localhost:8888 foobar@remote_server` does. ## References - [SSH Hacks](http://matt.might.net/articles/ssh-hacks/) - [Secure Secure Shell](https://stribika.github.io/2015/01/04/secure-secure-shell.html) {% comment %} Lecture notes will be available by the start of lecture. {% endcomment %} ================================================ FILE: _2019/security.md ================================================ --- layout: lecture title: "Security and Privacy" presenter: Jon video: aspect: 56.25 id: OBx_c-i-M8s --- The world is a scary place, and everyone's out to get you. 
Okay, maybe not, but that doesn't mean you want to flaunt all your secrets. Security (and privacy) is generally all about raising the bar for attackers. Find out what your threat model is, and then design your security mechanisms around that! If the threat model is the NSA or Mossad, you're _probably_ going to have a bad time. There are _many_ ways to make your technical persona more secure. We'll touch on a lot of high-level things here, but this is a process, and educating yourself is one of the best things you can do. So: ## Follow the Right People One of the best ways to improve your security know-how is to follow other people who are vocal about security. Some suggestions: - [@TroyHunt](https://twitter.com/TroyHunt) - [@SwiftOnSecurity](https://twitter.com/SwiftOnSecurity) - [@taviso](https://twitter.com/taviso) - [@thegrugq](https://twitter.com/thegrugq) - [@tqbf](https://twitter.com/tqbf) - [@mattblaze](https://twitter.com/mattblaze) - [@moxie](https://twitter.com/moxie) See also [this list](https://heimdalsecurity.com/blog/best-twitter-cybersec-accounts/) for more suggestions. ## General Security Advice Tech Solidarity has a pretty great list of [do's and don'ts for journalists](https://web.archive.org/web/20221123204419/https://techsolidarity.org/resources/basic_security.htm) that has a lot of sane advice, and is decently up-to-date. [@thegrugq](https://medium.com/@thegrugq) also has a good blog post on [travel security advice](https://medium.com/@thegrugq/stop-fabricating-travel-security-advice-35259bf0e869) that's worth reading. We'll repeat much of the advice from those sources here, plus some more. Also, get a [USB data blocker](https://www.amazon.com/dp/B00QRRZ2QM/), because [USB is scary](https://www.bleepingcomputer.com/news/security/heres-a-list-of-29-different-types-of-usb-attacks/). ## Authentication The very first thing you should do, if you haven't already, is download a password manager. 
Some good ones are:

 - [1password](https://1password.com/)
 - [KeePass](https://keepass.info/)
 - [BitWarden](https://bitwarden.com/)
 - [`pass`](https://git.zx2c4.com/password-store/about/)

If you're particularly paranoid, use one that encrypts the passwords locally on your computer, as opposed to storing them in plain-text at the server.

Use it to generate passwords for all the web sites you care about right now. Then, switch on two-factor authentication, ideally with a [FIDO/U2F](https://fidoalliance.org/) dongle (a [YubiKey](https://www.yubico.com/quiz/) for example, which has [20% off for students](https://www.yubico.com/why-yubico/for-education/)). TOTP (like Google Authenticator or Duo) will also work in a pinch, but [doesn't protect against phishing](https://twitter.com/taviso/status/1082015009348104192). SMS is pretty much useless unless your threat model only includes random strangers picking up your password in transit.

Also, a note about paper keys. Often, services will give you a "backup key" that you can use as a second factor if you lose your real second factor (btw, always keep a backup dongle somewhere safe!). While you _can_ stick those in your password manager, that means that should someone get access to your password manager, you're totally hosed (but maybe you're okay with that threat model). If you are truly paranoid, print out these paper keys, never store them digitally, and place them in a safe in the real world.

## Private Communication

Use [Signal](https://www.signal.org/) ([setup instructions](https://medium.com/@mshelton/signal-for-beginners-c6b44f76a1f0)). [Wire](https://wire.com/en/) is [fine too](https://www.securemessagingapps.com/); WhatsApp is okay; [don't use Telegram](https://twitter.com/bascule/status/897187286554628096). Desktop messengers are pretty broken (partially due to usually relying on Electron, which is a huge trust stack).

E-mail is particularly problematic, even if PGP signed.
It's not generally forward-secure, and the key-distribution problem is pretty severe. [keybase.io](https://keybase.io/) helps, and is useful for a number of other reasons. Also, PGP keys are generally handled on desktop computers, which is one of the least secure computing environments. Relatedly, consider getting a Chromebook, or just work on a tablet with a keyboard.

## File Security

File security is hard, and operates on many levels. What is it you're trying to secure against?

[![$5 wrench](https://imgs.xkcd.com/comics/security.png)](https://xkcd.com/538/)

 - Offline attacks (someone steals your laptop while it's off): turn on full disk encryption ([cryptsetup + LUKS](https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_a_non-root_file_system) on Linux, [BitLocker](https://fossbytes.com/enable-full-disk-encryption-windows-10/) on Windows, [FileVault](https://support.apple.com/en-us/HT204837) on macOS). Note that this won't help if the attacker _also_ has you and really wants your secrets.
 - Online attacks (someone has your laptop and it's on): use file encryption. There are two primary mechanisms for doing so
   - Encrypted filesystems: stacked filesystem encryption software encrypts files individually rather than having encrypted block devices. You can "mount" these filesystems by providing the decryption key, and then browse the files inside it freely. When you unmount it, those files are all unavailable. Modern solutions include [gocryptfs](https://github.com/rfjakob/gocryptfs) and [eCryptFS](http://ecryptfs.org/). More detailed comparisons can be found [here](https://nuetzlich.net/gocryptfs/comparison/) and [here](https://wiki.archlinux.org/index.php/disk_encryption#Comparison_table)
   - Encrypted files: encrypt individual files with symmetric encryption (see `gpg -c`) and a secret key. Or, like `pass`, also encrypt the key with your public key so only you can read it back later with your private key. Exact encryption settings matter a lot!
- [Plausible deniability](https://en.wikipedia.org/wiki/Plausible_deniability) (what seems to be the problem officer?): usually lower performance, and easier to lose data. Hard to actually prove that it provides [deniable encryption](https://en.wikipedia.org/wiki/Deniable_encryption)! See the [discussion here](https://security.stackexchange.com/questions/135846/is-plausible-deniability-actually-feasible-for-encrypted-volumes-disks), and then consider whether you may want to try [VeraCrypt](https://www.veracrypt.fr/en/Home.html) (the maintained fork of good ol' TrueCrypt). - Encrypted backups: use [Tarsnap](https://www.tarsnap.com/) or [Borgbase](https://www.borgbase.com/) - Think about whether an attacker can delete your backups if they get a hold of your laptop! ## Internet Security & Privacy The internet is a _very_ scary place. Open WiFi networks [are](https://www.troyhunt.com/the-beginners-guide-to-breaking-website/) [scary](https://www.troyhunt.com/talking-with-scott-hanselman-on/). Make sure you delete them afterwards, otherwise your phone will happily announce and re-connect to something with the same name later! If you're ever on a network you don't trust, a VPN _may_ be worthwhile, but keep in mind that you're trusting the VPN provider _a lot_. Do you really trust them more than your ISP? If you truly want a VPN, use a provider you're sure you trust, and you should probably pay for it. Or set up [WireGuard](https://www.wireguard.com/) for yourself -- it's [excellent](https://web.archive.org/web/20210526211307/https://latacora.micro.blog/there-will-be/)! There are also secure configuration settings for a lot of internet-enabled applications at [cipherlist.eu](https://cipherlist.eu/). If you're particularly privacy-oriented, [privacytools.io](https://privacytools.io) is also a good resource. Some of you may wonder about [Tor](https://www.torproject.org/). 
Keep in mind that Tor is _not_ particularly resistant to powerful global attackers, and is weak against traffic analysis attacks. It may be useful for hiding traffic on a small scale, but won't really buy you all that much in terms of privacy. You're better off using more secure services in the first place (Signal, TLS + certificate pinning, etc.). ## Web Security So, you want to go on the Web too? Jeez, you're really pushing your luck here. Install [HTTPS Everywhere](https://www.eff.org/https-everywhere). SSL/TLS is [critical](https://www.troyhunt.com/ssl-is-not-about-encryption/), and it's _not_ just about encryption, but also about being able to verify that you're talking to the right service in the first place! If you run your own web server, [test it](https://www.ssllabs.com/ssltest/index.html). TLS configuration [can get hairy](https://wiki.mozilla.org/Security/Server_Side_TLS). HTTPS Everywhere will do its very best to never navigate you to HTTP sites when there's an alternative. That doesn't save you, but it helps. If you're truly paranoid, blacklist any SSL/TLS CAs that you don't absolutely need. Install [uBlock Origin](https://github.com/gorhill/uBlock). It is a [wide-spectrum blocker](https://github.com/gorhill/uBlock/wiki/Blocking-mode) that doesn't just stop ads, but all sorts of third-party communication a page may try to do. And inline scripts and such. If you're willing to spend some time on configuration to make things work, go to [medium mode](https://github.com/gorhill/uBlock/wiki/Blocking-mode:-medium-mode) or even [hard mode](https://github.com/gorhill/uBlock/wiki/Blocking-mode:-hard-mode). Those _will_ make some sites not work until you've fiddled with the settings enough, but will also significantly improve your online security. If you're using Firefox, enable [Multi-Account Containers](https://support.mozilla.org/en-US/kb/containers). Create separate containers for social networks, banking, shopping, etc. 
Firefox will keep the cookies and other state for each of the containers totally separate, so sites you visit in one container can't snoop on sensitive data from the others. In Google Chrome, you can use [Chrome Profiles](https://support.google.com/chrome/answer/2364824) to achieve similar results. Exercises TODO 1. Encrypt a file using PGP 1. Use veracrypt to create a simple encrypted volume 1. Enable 2FA for your most data sensitive accounts i.e. GMail, Dropbox, Github, &c ================================================ FILE: _2019/shell.md ================================================ --- layout: lecture title: "Shell and Scripting" presenter: Jon video: aspect: 56.25 id: dbDRfmH5uSI --- The shell is an efficient, textual interface to your computer. The shell prompt: what greets you when you open a terminal. Lets you run programs and commands; common ones are: - `cd` to change directory - `ls` to list files and directories - `mv` and `cp` to move and copy files But the shell lets you do _so_ much more; you can invoke any program on your computer, and command-line tools exist for doing pretty much anything you may want to do. And they're often more efficient than their graphical counterparts. We'll go through a bunch of those in this class. The shell provides an interactive programming language ("scripting"). There are many shells: - You've probably used `sh` or `bash`. - Also shells that match languages: `csh`. - Or "better" shells: `fish`, `zsh`, `ksh`. In this class we'll focus on the ubiquitous `sh` and `bash`, but feel free to play around with others. I like `fish`. Shell programming is a *very* useful tool in your toolbox. Can either write programs directly at the prompt, or into a file. `#!/bin/sh` + `chmod +x` to make shell executable. 
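As a minimal sketch of the "write it into a file" route, the snippet below creates a two-line script, marks it executable, and runs it. The file name `hello.sh` and the use of a scratch directory from `mktemp` are assumptions made for illustration, not something the lecture prescribes.

```bash
# Use a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
script="$workdir/hello.sh"

# Write the script: the shebang line first, then a single command.
printf '%s\n' '#!/bin/sh' 'echo "hello from a script"' > "$script"

# Without the executable bit the kernel refuses to run the file directly.
chmod +x "$script"

# Run it by path; the shebang line tells the system to hand the file to /bin/sh.
output=$("$script")
echo "$output"
```

Running `sh "$script"` would also work without `chmod +x`, since in that case `sh` simply reads the file as input rather than the system executing it.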
## Working with the shell Run a command a bunch of times: ```bash for i in $(seq 1 5); do echo hello; done ``` There's a lot to unpack: - `for x in list; do BODY; done` - `;` terminates a command -- equivalent to newline - split `list`, assign each to `x`, and run body - splitting is "whitespace splitting", which we'll get back to - no curly braces in shell, so `do` + `done` - `$(seq 1 5)` - run the program `seq` with arguments `1` and `5` - substitute entire `$()` with the output of that program - equivalent to ```bash for i in 1 2 3 4 5 ``` - `echo hello` - everything in a shell script is a command - in this case, run the `echo` command, which prints its arguments with the argument `hello`. - all commands are searched for in `$PATH` (colon-separated) We have variables: ```bash for f in $(ls); do echo $f; done ``` Will print each file name in the current directory. Can also set variables using `=` (no space!): ```bash foo=bar echo $foo ``` There are a bunch of "special" variables too: - `$1` to `$9`: arguments to the script - `$0` name of the script itself - `$#` number of arguments - `$$` process ID of current shell To only print directories ```bash for f in $(ls); do if test -d $f; then echo dir $f; fi; done ``` More to unpack here: - `if CONDITION; then BODY; fi` - `CONDITION` is a command; if it returns with exit status 0 (success), then `BODY` is run. - can also hook in an `else` or `elif` - again, no curly braces, so `then` + `fi` - `test` is another program that provides various checks and comparisons, and exits with 0 if they're true (`$?`) - `man COMMAND` is your friend: `man test` - can also be invoked with `[` + `]`: `[ -d $f ]` - take a look at `man test` and `which "["` But wait! This is wrong! What if a file is called "My Documents"? - `for f in $(ls)` expands to `for f in My Documents` - first do the test on `My`, then on `Documents` - not what we wanted! 
 - biggest source of bugs in shell scripts

## Argument splitting

Bash splits arguments by whitespace; not always what you want!

 - need to use quoting to handle spaces in arguments
   `for f in "My Documents"` would work correctly
 - same problem somewhere else -- do you see where?
   `test -d $f`: if `$f` contains whitespace, `test` will error!
 - `echo` happens to be okay, because split + join by space
   but what if a filename contains a newline?! turns into space!
 - quote all use of variables that you don't want split
 - but how do we fix our script above?
   what do you think `for f in "$(ls)"` does?

Globbing is the answer!

 - bash knows how to look for files using patterns:
   - `*` any string of characters
   - `?` any single character
   - `{a,b,c}` any of these alternatives
 - `for f in *`: all files in this directory
 - when globbing, each matching file becomes its own argument
   - still need to make sure to quote when _using_: `test -d "$f"`
 - can make advanced patterns:
   - `for f in a*`: all files starting with `a` in the current directory
   - `for f in foo/*.txt`: all `.txt` files in `foo`
   - `for f in foo/*/p??.txt`
     all three-letter text files starting with p in subdirs of `foo`

Whitespace issues don't stop there:

 - `if [ $foo = "bar" ]; then` -- see the issue?
 - what if `$foo` is empty? arguments to `[` are `=` and `bar`...
 - _can_ work around this with `[ x$foo = "xbar" ]`, but bleh
 - instead, use `[[`: bash built-in comparator that has special parsing
   - also allows `&&` instead of `-a`, `||` over `-o`, etc.

## Composability

Shell is powerful in part because of composability. Can chain multiple programs together rather than have one program that does everything.

The key character is `|` (pipe).
- `a | b` means run both `a` and `b` send all output of `a` as input to `b` print the output of `b` All programs you launch ("processes") have three "streams": - `STDIN`: when the program reads input, it comes from here - `STDOUT`: when the program prints something, it goes here - `STDERR`: a 2nd output the program can choose to use - by default, `STDIN` is your keyboard, `STDOUT` and `STDERR` are both your terminal. but you can change that! - `a | b` makes `STDOUT` of `a` `STDIN` of `b`. - also have: - `a > foo` (`STDOUT` of `a` goes to the file `foo`) - `a 2> foo` (`STDERR` of `a` goes to the file `foo`) - `a < foo` (`STDIN` of `a` is read from the file `foo`) - hint: `tail -f` will print a file as it's being written - why is this useful? lets you manipulate output of a program! - `ls | grep foo`: all files that contain the word `foo` - `ps | grep foo`: all processes that contain the word `foo` - `journalctl | grep -i intel | tail -n5`: last 5 system log messages with the word intel (case insensitive) - `who | sendmail -t me@example.com` send the list of logged-in users to `me@example.com` - forms the basis for much data-wrangling, as we'll cover later Bash also provides a number of other ways to compose programs. You can group commands with `(a; b) | tac`: run `a`, then `b`, and send all their output to `tac`, which prints its input in reverse order. A lesser-known, but super useful one is _process substitution_. `b <(a)` will run `a`, generate a temporary file-name for its output stream, and pass that file-name to `b`. For example: ```bash diff <(journalctl -b -1 | head -n20) <(journalctl -b -2 | head -n20) ``` will show you the difference between the first 20 lines of the last boot log and the one before that. ## Job and process control What if you want to run longer-term things in the background? 
 - the `&` suffix runs a program "in the background"
   - it will give you back your prompt immediately
   - handy if you want to run two programs at the same time
     like a server and client: `server & client`
   - note that the running program still has your terminal as `STDOUT`!
     try: `server > server.log & client`
 - see all such processes with `jobs`
   - notice that it shows "Running"
 - bring it to the foreground with `fg %JOB` (no argument is latest)
 - if you want to background the current program: `^Z` + `bg`
   (here `^Z` means pressing `Ctrl+Z`)
   - `^Z` stops the current process and makes it a "job"
   - `bg` runs the last job in the background (as if you did `&`)
 - background jobs are still tied to your current session, and exit if
   you log out. `disown` lets you sever that connection. or use `nohup`.
 - `$!` is pid of last background process

What about other stuff running on your computer?

 - `ps` is your friend: lists running processes
   - `ps -A`: print processes from all users (also `ps ax`)
   - `ps` has *many* arguments: see `man ps`
 - `pgrep`: find processes by searching (like `ps -A | grep`)
   - `pgrep -af`: search and display with arguments
 - `kill`: send a _signal_ to a process by ID (`pkill` by search + `-f`)
   - signals tell a process to "do something"
   - most common: `SIGKILL` (`-9` or `-KILL`): tell it to exit *now*;
     it cannot be caught or ignored
     (the similar `^\` sends `SIGQUIT`, which can be)
   - also `SIGTERM` (`-15` or `-TERM`): tell it to exit gracefully
     (similar to `^C`, which sends `SIGINT`)

## Flags

Most command line utilities take parameters using **flags**. Flags usually come in short form (`-h`) and long form (`--help`). Usually running `CMD -h` or `man CMD` will give you a list of the flags the program takes. Short flags can usually be combined: running `rm -r -f` is equivalent to running `rm -rf` or `rm -fr`. Some common flags are a de facto standard and you will see them in many applications:

* `-a` commonly refers to all files (i.e.
also including those that start with a period) * `-f` usually refers to forcing something, like `rm -f` * `-h` displays the help for most commands * `-v` usually enables a verbose output * `-V` usually prints the version of the command Also, a double dash `--` is used in built-in commands and many other commands to signify the end of command options, after which only positional parameters are accepted. So if you have a file called `-v` (which you can) and want to grep it `grep pattern -- -v` will work whereas `grep pattern -v` won't. In fact, one way to create such file is to do `touch -- -v`. ## Exercises 1. If you are completely new to the shell you may want to read a more comprehensive guide about it such as [BashGuide](http://mywiki.wooledge.org/BashGuide). If you want a more in-depth introduction [The Linux Command Line](http://linuxcommand.org/tlcl.php) is a good resource. 1. **PATH, which, type** We briefly discussed that the `PATH` environment variable is used to locate the programs that you run through the command line. Let's explore that a little further - Run `echo $PATH` (or `echo $PATH | tr -s ':' '\n'` for pretty printing) and examine its contents, what locations are listed? - The command `which` locates a program in the user PATH. Try running `which` for common commands like `echo`, `ls` or `mv`. Note that `which` is a bit limited since it does not understand shell aliases. Try running `type` and `command -v` for those same commands. How is the output different? - Run `PATH=` and try running the previous commands again, some work and some don't, can you figure out why? 1. **Special Variables** - What does the variable `~` expands as? What about `.`? And `..`? - What does the variable `$?` do? - What does the variable `$_` do? - What does the variable `!!` expand to? What about `!!*`? And `!l`? - Look for documentation for these options and familiarize yourself with them 1. 
**xargs** Sometimes piping doesn't quite work because the command being piped into does not expect the newline separated format. For example, the `file` command tells you properties of a file.

    Try running `ls | file` and `ls | xargs file`. What is `xargs` doing?

1. **Shebang** When you write a script you can specify to your shell what interpreter should be used to interpret the script by using a [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix)) line. Write a script called `hello` with the following contents and make it executable with `chmod +x hello`. Then execute it with `./hello`. Then remove the first line and execute it again. How is the shell using that first line?

    ```bash
    #! /usr/bin/python

    print("Hello World!")
    ```

    You will often see programs that have a shebang that looks like `#! /usr/bin/env bash`. This is a more portable solution with its own set of [advantages and disadvantages](https://unix.stackexchange.com/questions/29608/why-is-it-better-to-use-usr-bin-env-name-instead-of-path-to-name-as-my). How is `env` different from `which`? What environment variable does `env` use to decide what program to run?

1. **Pipes, process substitution, subshell** Create a script called `slow_seq.sh` with the following contents and do `chmod +x slow_seq.sh` to make it executable.

    ```bash
    #! /usr/bin/env bash

    for i in $(seq 1 10); do
        echo $i;
        sleep 1;
    done
    ```

    There is a way in which pipes (and process substitution) differ from using subshell execution, i.e. `$()`. Run the following commands and observe the differences:

    - `./slow_seq.sh | grep -P "[3-6]"`
    - `grep -P "[3-6]" <(./slow_seq.sh)`
    - `echo $(./slow_seq.sh) | grep -P "[3-6]"`

1. **Misc**
    - Try running `touch {a,b}{a,b}` then `ls`. What appeared?
    - Sometimes you want to keep STDIN and still pipe it to a file. Try running `echo HELLO | tee hello.txt`
    - Try running `cat hello.txt > hello.txt`. What do you expect to happen? What does happen?
    - Run `echo HELLO > hello.txt` and then run `echo WORLD >> hello.txt`.
What are the contents of `hello.txt`? How is `>` different from `>>`? - Run `printf "\e[38;5;81mfoo\e[0m\n"`. How was the output different? If you want to know more, search for ANSI color escape sequences. - Run `touch a.txt` then run `^txt^log` what did bash do for you? In the same vein, run `fc`. What does it do? {% comment %} TODO 1. **parallel** - set -e, set -x - traps {% endcomment %} 1. **Keyboard shortcuts** As with any application you use frequently is worth familiarising yourself with its keyboard shortcuts. Type the following ones and try figuring out what they do and in what scenarios it might be convenient knowing about them. For some of them it might be easier searching online about what they do. (remember that `^X` means pressing `Ctrl+X`) - `^A`, `^E` - `^R` - `^L` - `^C`, `^\` and `^D` - `^U` and `^Y` ================================================ FILE: _2019/version-control.md ================================================ --- layout: lecture title: "Version Control" presenter: Jon video: aspect: 56.25 id: 3fig2Vz8QXs --- Whenever you are working on something that changes over time, it's useful to be able to _track_ those changes. This can be for a number of reasons: it gives you a record of what changed, how to undo it, who changed it, and possibly even why. Version control systems (VCS) give you that ability. They let you _commit_ changes to a set of files, along with a message describing the change, as well as look at and undo changes you've made in the past. Most VCS support sharing the commit history between multiple users. This allows for convenient collaboration: you can see the changes I've made, and I can see the changes you've made. And since the VCS tracks _changes_, it can often (though not always) figure out how to combine our changes as long as they touch relatively disjoint things. 
There are [_a lot_](https://en.wikipedia.org/wiki/Comparison_of_version-control_software) of VCSes out there that differ a lot in what they support, how they function, and how you interact with them. Here, we'll focus on [git](https://git-scm.com/), one of the more commonly used ones, but I recommend you also take a look at [Mercurial](https://www.mercurial-scm.org/).

With that all said -- to the cliffnotes!

## Is git dark magic?

not quite.. you need to understand the data model. we're going to skip over some of the details, but roughly speaking, the _core_ "thing" in git is a commit.

 - every commit has a unique name, "revision hash"
   a long hash like `998622294a6c520db718867354bf98348ae3c7e2`
   often shortened to a short (unique-ish) prefix: `9986222`
 - commit has author + commit message
 - also has the hash of any _ancestor commits_
   usually just the hash of the previous commit
 - commit also represents a _diff_, a representation of how you get
   from the commit's ancestors to the commit (e.g., remove this line
   in this file, add these lines to this file, rename that file, etc.)
   - in reality, git stores the full before and after state
   - probably don't want to store big files that change!

initially, the _repository_ (roughly: the folder that git manages) has no content, and no commits. let's set that up:

```console
$ git init hackers
$ cd hackers
$ git status
```

the output here actually gives us a good starting point. let's dig in and make sure we understand it all.

first, "On branch master".

 - don't want to use hashes all the time.
 - branches are names that point to hashes.
 - master is traditionally the name for the "latest" commit.
   every time a new commit is made, the master name will be made to
   point to the new commit's hash.
 - special name `HEAD` refers to "current" name
 - you can also make your own names with `git branch` (or `git tag`)
   we'll get back to that

let's skip over "No commits yet" because that's all there is to it.

then, "nothing to commit".
- every commit contains a diff with all the changes you made. but how is that diff constructed in the first place? - _could_ just always commit _all_ changes you've made since the last commit - sometimes you want to only commit some of them (e.g., not `TODO`s) - sometimes you want to break up a change into multiple commits to give a separate commit message for each one - git lets you _stage_ changes to construct a commit - add changes to a file or files to the staged changes with `git add` - add only some changes in a file with `git add -p` - without argument `git add` operates on "all known files" - remove a file and stage its removal with `git rm` - empty the set of staged changes `git reset` - note that this does *not* change any of your files! it *only* means that no changes will be included in a commit - to remove only some staged changes: `git reset FILE` or `git reset -p` - check staged changes with `git diff --staged` - see remaining changes with `git diff` - when you're happy with the stage, make a commit with `git commit` - if you just want to commit *all* changes: `git commit -a` - `git help add` has a bunch more helpful info while you're playing with the above, try to run `git status` to see what git thinks you're doing -- it's surprisingly helpful! ## A commit you say... okay, we have a commit, now what? - we can look at recent changes: `git log` (or `git log --oneline`) - we can look at the full changes: `git log -p` - we can show a particular commit: `git show master` - or with `-p` for full diff/patch - we can go back to the state at a commit using `git checkout NAME` - if `NAME` is a commit hash, git says we're "detached". this just means there's no `NAME` that refers to this commit, so if we make commits, no-one will know about them. - we can revert a change with `git revert NAME` - applies the diff in the commit at `NAME` in reverse. - we can compare an older version to this one using `git diff NAME..` - `a..b` is a commit _range_. 
     if either is left out, it means `HEAD`.
 - we can show all the commits between using `git log NAME..`
   - `-p` works here too
 - we can change `master` to point to a particular commit (effectively
   undoing everything since) with `git reset NAME`:
   - huh, why? wasn't `reset` to change staged changes?
     reset has a "second" form (see `git help reset`) which sets `HEAD`
     to the commit pointed to by the given name.
   - notice that this didn't change any files -- `git diff` now
     effectively shows `git diff NAME..`.

## What's in a name?

clearly, names are important in git. and they're the key to understanding *a lot* of what goes on in git. so far, we've talked about commit hashes, master, and `HEAD`. but there's more!

 - you can make your own branches (like master) with `git branch b`
   - creates a new name, `b`, which points to the commit at `HEAD`
   - you're still "on" master though, so if you make a new commit,
     master will point to that new commit, `b` will not.
 - switch to a branch with `git checkout b`
   - any commits you make will now update the `b` name
   - switch back to master with `git checkout master`
     - all your changes in `b` are hidden away
   - a very handy way to be able to easily test out changes
 - tags are other names that never change, and that have their own
   message. often used to mark releases + changelogs.
 - `NAME^` means "the commit before `NAME`"
   - can apply recursively: `NAME^^^`
   - you _most likely_ mean `~` when you use `^`
     - `~` is "temporal", whereas `^` goes by ancestors
     - `~~` is the same as `^^`
     - with `~` you can also write `X~3` for "3 commits older than `X`"
     - you don't want `^3`
   - `git diff HEAD^`
 - `-` means "the previous name"
 - most commands operate on `HEAD` unless you give another argument

## Clean up your mess

your commit history will _very_ often end up as:

 - `add feature x` -- maybe even with a commit message about `x`!
- `forgot to add file`
- `fix bug`
- `typo`
- `typo2`
- `actually fix`
- `actually actually fix`
- `tests pass`
- `fix example code`
- `typo`
- `x`
- `x`
- `x`
- `x`

that's _fine_ as far as git is concerned, but is not very helpful to your future self, or to other people who are curious about what has changed. git lets you clean up these things:

- `git commit --amend`: fold staged changes into previous commit
  - note that this _changes_ the previous commit, giving it a new hash!
- `git rebase -i HEAD~13` is _magical_. for each of the last 13 commits, choose what to do:
  - default is `pick`: keep the commit unchanged
  - `r`: change commit message
  - `e`: change commit (add or remove files)
  - `s`: combine commit with previous and edit commit message
  - `f`: "fixup" -- combine commit with previous; discard commit msg
  - at the end, `HEAD` is made to point to what is now the last commit
  - often referred to as _squashing_ commits
  - what it really does: rewind `HEAD` to rebase start point, then re-apply the commits in order as directed.
- `git reset --hard NAME`: reset the state of all files to that of `NAME` (or `HEAD` if no name is given). handy for undoing changes.

## Playing with others

a common use-case for version control is to allow multiple people to make changes to a set of files without stepping on each other's toes. or rather, to make sure that _if_ they step on each other's toes, they won't just silently overwrite each other's changes.

git is a _distributed_ VCS: everyone has a local copy of the entire repository (well, of everything others have chosen to publish). some VCSes are _centralized_ (e.g., subversion): a server has all the commits, clients only have the files they have "checked out". basically, they only have the _current_ files, and need to ask the server if they want anything else.

every copy of a git repository can be listed as a "remote". you can copy an existing git repository using `git clone ADDRESS` (instead of `git init`).
this creates a remote called _origin_ that points to `ADDRESS`. you can fetch names and the commits they point to from a remote with `git fetch REMOTE`. all names at a remote are available to you as `REMOTE/NAME`, and you can use them just like local names.

if you have write access to a remote, you can change names at the remote to point to commits you've made using `git push`. for example, let's make the master name (branch) at the remote `origin` point to the commit that our master branch currently points to:

- `git push origin master:master`
  - for convenience, you can set `origin/master` as the default target for when you `git push` from the current branch with `-u`
- consider: what does this do? `git push origin master:HEAD^`

often you'll use GitHub, GitLab, BitBucket, or something else as your remote. there's nothing "special" about that as far as git is concerned. it's all just names and commits. if someone makes a change to master and updates `github/master` to point to their commit (we'll get back to that in a second), then when you `git fetch github`, you'll be able to see their changes with `git log github/master`.

## Working with others

so far, branches seem pretty useless: you can create them, do work on them, but then what? eventually, you'll just make master point to them anyway, right?

- what if you had to fix something while working on a big feature?
- what if someone else made a change to master in the meantime?

inevitably, you will have to _merge_ changes in one branch with changes in another, whether those changes are made by you or someone else. git lets you do this with, unsurprisingly, `git merge NAME`.
`merge` will:

- look for the latest point where `HEAD` and `NAME` shared a commit ancestor (i.e., where they diverged)
- (try to) apply all the changes made on `NAME` since that point to the current `HEAD`
- produce a commit that contains all those changes, and lists both `HEAD` and `NAME` as its ancestors
- set `HEAD` to that commit's hash

once your big feature has been finished, you can merge its branch into master, and git will ensure that you don't lose any changes from either branch!

if you've used git in the past, you may recognize `merge` by a different name: `pull`. when you do `git pull REMOTE BRANCH`, that is:

- `git fetch REMOTE`
- `git merge REMOTE/BRANCH`
- where, like `push`, `REMOTE` and `BRANCH` are often omitted and use the "tracking" remote branch (remember `-u`?)

this usually works _great_, as long as the changes to the branches being merged are disjoint. if they are not, you get a _merge conflict_. sounds scary...

- a merge conflict is just git telling you that it doesn't know what the final diff should look like
- git pauses and asks you to finish staging the "merge commit"
- open the conflicted file in your editor and look for lots of angle brackets (`<<<<<<<`). the stuff above `=======` is the change made in `HEAD` since the shared ancestor commit. the stuff below is the change made in `NAME` since the shared commit.
- `git mergetool` is pretty handy -- opens a diff editor
- once you've _resolved_ the conflict by figuring out what the file should now look like, stage those changes with `git add`.
- when all the conflicts are resolved, finish with `git commit`
  - you can give up with `git merge --abort`

you've just resolved your first git merge conflict! \o/ now you can publish your finished changes with `git push`

## When worlds collide

when you `push`, git checks that no-one else's work is lost if you update the remote name you're pushing to. it does this by checking that the current commit of the remote name is an ancestor of the commit you are pushing.
if it is, git can safely just update the name; this is called _fast-forwarding_. if it is not, git will refuse to update the remote name, and tell you there have been changes.

if your push is rejected, what do you do?

- merge remote changes with `git pull` (i.e., `fetch` + `merge`)
- force the push with `--force`: this will lose other people's changes!
  - there's also `--force-with-lease`, which will only force the change if the remote name hasn't changed since the last time you fetched from that remote. much safer!
  - if you've rebased local commits that you've previously pushed ("history rewriting"; probably don't do this), you'll have to force push. think about why!
- try to re-apply your changes "on top of" the changes made remotely
  - this is a `rebase`!
    - rewind all local commits since shared ancestor
    - fast-forward `HEAD` to commit at remote name
    - apply local commits in-order
      - may have conflicts you have to manually resolve
      - `git rebase --continue` or `--abort`
    - lots more [here](https://git-scm.com/book/en/v2/Git-Branching-Rebasing)
  - `git pull --rebase` will start this process for you
  - whether you should merge or rebase is a hot topic!
    some good reads:
    - [this](https://www.atlassian.com/git/tutorials/merging-vs-rebasing)
    - [this](http://web.archive.org/web/20210106220723/https://derekgourlay.com/blog/git-when-to-merge-vs-when-to-rebase/)
    - [this](https://stackoverflow.com/questions/804115/when-do-you-use-git-rebase-instead-of-git-merge)

# Further reading

[![XKCD on git](https://imgs.xkcd.com/comics/git.png)](https://xkcd.com/1597/)

- [Learn git branching](https://learngitbranching.js.org/)
- [How to explain git in simple words](https://smusamashah.github.io/blog/2017/10/14/explain-git-in-simple-words)
- [Git from the bottom up](https://jwiegley.github.io/git-from-the-bottom-up/)
- [Git for computer scientists](http://eagain.net/articles/git-for-computer-scientists/)
- [Oh shit, git!](https://ohshitgit.com/)
- [The Pro Git book](https://git-scm.com/book/en/v2)

# Exercises

1. On a repo try modifying an existing file. What happens when you do `git stash`? What do you see when running `git log --all --oneline`? Run `git stash pop` to undo what you did with `git stash`. In what scenario might this be useful?

1. One common mistake when learning git is to commit large files that should not be managed by git or adding sensitive information. Try adding a file to a repository, making some commits and then deleting that file from history (you may want to look at [this](https://help.github.com/articles/removing-sensitive-data-from-a-repository/)). Also if you do want git to manage large files for you, look into [Git-LFS](https://git-lfs.github.com/)

1. Git is really convenient for undoing changes but one has to be familiar even with the most unlikely changes
    1. If a file is mistakenly modified in some commit it can be reverted with `git revert`. However if a commit involves several changes `revert` might not be the best option. How can we use `git checkout` to recover a file version from a specific commit?
    1. Create a branch, make a commit in said branch and then delete it. Can you still recover said commit?
       Try looking into `git reflog`. (Note: Recover dangling things quickly, git will periodically automatically clean up commits that nothing points to.)
    1. If one is too trigger happy with `git reset --hard` instead of `git reset`, changes can be easily lost. However since the changes were staged, we can recover them. (look into `git fsck --lost-found` and `.git/lost-found`)

1. In any git repo look under the folder `.git/hooks`; you will find a bunch of scripts that end with `.sample`. If you rename them without the `.sample` they will run based on their name. For instance `pre-commit` will execute before doing a commit. Experiment with them.

1. Like many command line tools `git` provides a configuration file (or dotfile) called `~/.gitconfig`. Create an alias using `~/.gitconfig` so that when you run `git graph` you get the output of `git log --oneline --decorate --all --graph` (this is a good command to quickly visualize the commit graph)

1. Git also lets you define global ignore patterns in a file such as `~/.gitignore_global` (git only reads it once `core.excludesFile` points at it, e.g. `git config --global core.excludesFile ~/.gitignore_global`). This is useful to prevent common errors like adding RSA keys. Create a `~/.gitignore_global` file and add the pattern `*rsa`, then test that it works in a repo.

1. Once you start to get more familiar with `git`, you will find yourself running into common tasks, such as editing your `.gitignore`. [git extras](https://github.com/tj/git-extras/blob/master/Commands.md) provides a bunch of little utilities that integrate with `git`. For example `git ignore PATTERN` will add the specified pattern to the `.gitignore` file in your repo and `git ignore-io LANGUAGE` will fetch the common ignore patterns for that language from [gitignore.io](https://www.gitignore.io). Install `git extras` and try using some tools like `git alias` or `git ignore`.

1. Git GUI programs can be a great resource sometimes. Try running [gitk](https://git-scm.com/docs/gitk) in a git repo and explore the different parts of the interface. Then run `gitk --all`; what are the differences?

1. Once you get used to command line applications, GUI tools can feel cumbersome/bloated. A nice compromise between the two are ncurses based tools which can be navigated from the command line and still provide an interactive interface. Git has [tig](https://github.com/jonas/tig), try installing it and running it in a repo. You can find some usage examples [here](https://www.atlassian.com/blog/git/git-tig).

{% comment %}

- forced push + `--force-with-lease`
- git merge/rebase --abort
- git blame
- exercise about why rebasing public commits is bad

{% endcomment %}

================================================
FILE: _2019/virtual-machines.md
================================================
---
layout: lecture
title: "Virtual Machines and Containers"
presenter: Anish, Jon
video:
  aspect: 56.25
  id: LJ9ki5zq6Ik
---

# Virtual Machines

Virtual machines are simulated computers. You can configure a guest virtual machine with some operating system and configuration and use it without affecting your host environment.

For this class, you can use VMs to experiment with operating systems, software, and configurations without risk: you won't affect your primary development environment.

In general, VMs have lots of uses. They are commonly used for running software that only runs on a certain operating system (e.g. using a Windows VM on Linux to run Windows-specific software). They are often used for experimenting with potentially malicious software.

## Useful features

- **Isolation**: hypervisors do a pretty good job of isolating the guest from the host, so you can use VMs to run buggy or untrusted software reasonably safely.

- **Snapshots**: you can take "snapshots" of your virtual machine, capturing the entire machine state (disk, memory, etc.), make changes to your machine, and then restore to an earlier state. This is useful for testing out potentially destructive actions, among other things.
## Disadvantages

Virtual machines are generally slower than running on bare metal, so they may be unsuitable for certain applications.

## Setup

- **Resources**: shared with host machine; be aware of this when allocating physical resources.

- **Networking**: many options, default NAT should work fine for most use cases.

- **Guest addons**: many hypervisors can install software in the guest to enable nicer integration with host system. You should use this if you can.

## Resources

- Hypervisors
    - [VirtualBox](https://www.virtualbox.org/) (open-source)
    - [Virt-manager](https://virt-manager.org/) (open-source, manages KVM virtual machines and LXC containers)
    - [VMWare](https://www.vmware.com/) (commercial, available from IS&T [for MIT students](https://ist.mit.edu/vmware-fusion))

If you are already familiar with popular hypervisors/VMs you may want to learn how to manage them from the command line. One option is the [libvirt](https://wiki.libvirt.org/page/UbuntuKVMWalkthrough) toolkit, which allows you to manage multiple different virtualization providers/hypervisors.

## Exercises

1. Download and install a hypervisor.

1. Create a new virtual machine and install a Linux distribution (e.g. [Debian](https://www.debian.org/)).

1. Experiment with snapshots. Try things that you've always wanted to try, like running `sudo rm -rf --no-preserve-root /`, and see if you can recover easily.

1. Read what a [fork-bomb](https://en.wikipedia.org/wiki/Fork_bomb) (`:(){ :|:& };:`) is and run it on the VM to see that the resource isolation (CPU, Memory, &c) works.

1. Install guest addons and experiment with different windowing modes, file sharing, and other features.

# Containers

Virtual Machines are relatively heavy-weight; what if you want to spin up machines in an automated fashion? Enter containers!

- Amazon Firecracker
- Docker
- rkt
- lxc

Containers are _mostly_ just an assembly of various Linux security features, like virtual file system, virtual network interfaces, chroots, virtual memory tricks, and the like, that together give the appearance of virtualization.

Not quite as secure or isolated as a VM, but pretty close and getting better. Usually higher performance, and much faster to start, but not always.

The performance boost comes from the fact that unlike VMs, which run an entire copy of the operating system, containers share the Linux kernel with the host. However note that if you are running Linux containers on Windows/macOS a Linux VM will need to be active as a middle layer between the two.

![Docker vs VM](/2019/files/containers-vs-vms.png)
_Comparison between Docker containers and Virtual Machines. Credit: blog.docker.com_

Containers are handy for when you want to run an automated task in a standardized setup:

- Build systems
- Development environments
- Pre-packaged servers
- Running untrusted programs
    - Grading student submissions
    - (Some) cloud computing
- Continuous integration
    - Travis CI
    - GitHub Actions

Moreover, container software like Docker has also been extensively used as a solution for [dependency hell](https://en.wikipedia.org/wiki/Dependency_hell). If a machine needs to be running many services with conflicting dependencies they can be isolated using containers.

Usually, you write a file that defines how to construct your container. You start with some minimal _base image_ (like Alpine Linux), and then a list of commands to run to set up the environment you want (install packages, copy files, build stuff, write config files, etc.). Normally, there's also a way to specify any external ports that should be available, and an _entrypoint_ that dictates what command should be run when the container is started (like a grading script).
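
As a rough illustration of the paragraph above, here is a minimal sketch of what such a definition file could look like for Docker (a `Dockerfile`), written out from the shell into a scratch directory. The base image tag, the installed package, the port number, and the `grade.sh` script are all hypothetical placeholders chosen for the example, not something the course prescribes.

```bash
#!/bin/sh
# Sketch only: write a hypothetical Dockerfile into a scratch directory.
workdir="$(mktemp -d)"

cat > "$workdir/Dockerfile" <<'EOF'
# minimal base image
FROM alpine:3.19
# commands that set up the environment
RUN apk add --no-cache python3
# copy a (hypothetical) grading script into the image
COPY grade.sh /usr/local/bin/grade.sh
# port that should be reachable from outside the container
EXPOSE 8080
# command that runs when the container starts
ENTRYPOINT ["/usr/local/bin/grade.sh"]
EOF

echo "wrote $workdir/Dockerfile"
# With Docker installed and a grade.sh placed next to the Dockerfile,
# the image could then be built and started with something like:
#   docker build -t grader "$workdir" && docker run --rm grader
```

The four kinds of instruction map onto the pieces named above: `FROM` picks the base image, `RUN` and `COPY` set up the environment, `EXPOSE` declares the external port, and `ENTRYPOINT` is the command run at start-up.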

In a similar fashion to code repository websites (like [GitHub](https://github.com/)) there are some container repository websites (like [DockerHub](https://hub.docker.com/)) where many software services have prebuilt images that one can easily deploy.

## Exercises

1. Choose a container tool (Docker, LXC, …) and install a simple Linux image. Try SSHing into it.

1. Search for and download a prebuilt container image for a popular web server (nginx, apache, …)

================================================
FILE: _2019/web.md
================================================
---
layout: lecture
title: "Web and Browsers"
presenter: Jose
video:
  aspect: 62.5
  id: XpZO3S8odec
---

Apart from the terminal, the web browser is a tool you will find yourself spending significant amounts of time in. Thus it is worth learning how to use it efficiently.

## Shortcuts

Clicking around in your browser is often not the fastest option; getting familiar with common shortcuts can really pay off in the long run.

- `Middle Button Click` on a link opens it in a new tab
- `Ctrl+T` opens a new tab
- `Ctrl+Shift+T` reopens a recently closed tab
- `Ctrl+L` selects the contents of the search bar
- `Ctrl+F` searches within a webpage. If you do this often, you may benefit from an extension that supports regular expressions in searches.

## Search operators

Web search engines like Google or DuckDuckGo provide search operators to enable more elaborate web searches:

- `"bar foo"` enforces an exact match of bar foo
- `foo site:bar.com` searches for foo within bar.com
- `foo -bar` excludes results containing bar from the search
- `foobar filetype:pdf` searches for files of that extension
- `(foo|bar)` searches for matches that have foo OR bar

More thorough lists are available for popular engines like [Google](https://ahrefs.com/blog/google-advanced-search-operators/) and [DuckDuckGo](https://duck.co/help/results/syntax)

## Searchbar

The searchbar is a powerful tool too.
Most browsers can infer search engines from websites and will store them. You can edit the keyword assigned to each stored search engine:

- In Google Chrome they are in [chrome://settings/searchEngines](chrome://settings/searchEngines)
- In Firefox they are in [about:preferences#search](about:preferences#search)

For example you can make it so that typing `y SOME SEARCH TERMS` searches directly in YouTube.

Moreover, if you own a domain you can set up subdomain forwards using your registrar. For instance I have mapped `https://ht.josejg.com` to this course website. That way I can just type `ht.` and the searchbar will autocomplete. Another good feature of this setup is that, unlike bookmarks, these forwards work in every browser.

## Privacy extensions

Nowadays surfing the web can get quite annoying due to ads and invasive due to trackers. Moreover a good adblocker not only blocks most ad content but will also block sketchy and malicious websites, since they are included in the common blacklists. Adblockers also sometimes reduce page load times by reducing the number of requests performed. A couple of recommendations are:

- **uBlock origin** ([Chrome](https://chrome.google.com/webstore/detail/ublock-origin/cjpalhdlnbpafiamejdnhcphjbkeiagm), [Firefox](https://addons.mozilla.org/en-US/firefox/addon/ublock-origin/)): block ads and trackers based on predefined rules. You should also consider taking a look at the enabled blacklists in settings since you can enable more based on your region or browsing habits. You can even install filters from [around the web](https://github.com/gorhill/uBlock/wiki/Filter-lists-from-around-the-web)

- **[Privacy Badger](https://privacybadger.org/)**: detects and blocks trackers automatically. For example when you go from website to website ad companies track which sites you visit and build a profile of you

- **[HTTPS everywhere](https://www.eff.org/https-everywhere)** is a wonderful extension that redirects to the HTTPS version of a website automatically, if available.

You can find out about more add-ons of this kind [here](https://www.privacytools.io/privacy-browser-addons/)

## Style customization

Web browsers are just another piece of software running on _your machine_ and thus you usually have the last say about what they should display or how they should behave. An example of this are custom styles. Browsers determine how to render the style of a webpage using Cascading Style Sheets, often abbreviated as CSS.

You can access the source code of a website by inspecting it, and change its contents and styles temporarily (this is also a reason why you should never trust webpage screenshots).

If you want to permanently tell your browser to override the style settings for a webpage you will need to use an extension. Our recommendation is **[Stylus](https://github.com/openstyles/stylus)** ([Firefox](https://addons.mozilla.org/en-US/firefox/addon/styl-us/), [Chrome](https://chrome.google.com/webstore/detail/stylus/clngdbkpkpeebahjckkjfobafhncgmne?hl=en)).

For example, we can write the following style for the class website

```css
body {
    background-color: #2d2d2d;
    color: #eee;
    font-family: "Fira Code", monospace;
    font-size: 16pt;
}

a:link {
    text-decoration: none;
    color: #0a0;
}
```

Moreover, Stylus can find styles written by other users and published in [userstyles.org](https://userstyles.org/). Most common websites have one or several dark theme stylesheets for instance. FYI, you should not use Stylish since it was shown to leak user data, more [here](https://arstechnica.com/information-technology/2018/07/stylish-extension-with-2m-downloads-banished-for-tracking-every-site-visit/)

## Functionality Customization

In the same way that you can modify the style, you can also modify the behaviour of a website by writing custom javascript and then loading it using a web browser extension such as [Tampermonkey](https://tampermonkey.net/)

For example the following script enables vim-like navigation using the J and K keys.
```js
// ==UserScript==
// @name         VIM HT
// @namespace    http://tampermonkey.net/
// @version      0.1
// @description  Vim JK for our website
// @author       You
// @match        https://hacker-tools.github.io/*
// @grant        none
// ==/UserScript==

(function() {
    'use strict';

    // Scroll the page down/up by 500px when J/K is released
    window.onkeyup = function(e) {
        var key = e.keyCode ? e.keyCode : e.which;

        if (key == 74) { // J is key 74
            window.scrollBy(0, 500);
        } else if (key == 75) { // K is key 75
            window.scrollBy(0, -500);
        }
    }
})();
```

There are also script repositories such as [OpenUserJS](https://openuserjs.org/) and [Greasy Fork](https://greasyfork.org/en). However, be warned, installing user scripts from others can be very dangerous since they can pretty much do anything, such as steal your credit card numbers. Never install a script unless you read the whole thing yourself, understand what it does, and are absolutely sure that you know it isn't doing anything suspicious. Never install a script that contains minified or obfuscated code that you can't read!

## Web APIs

It has become more and more common for web services to offer an application interface, aka web API, so you can interact with the services by making web requests. A more in depth introduction to the topic can be found [here](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Client-side_web_APIs/Introduction).

There are [many public APIs](https://github.com/toddmotto/public-apis). Web APIs can be useful for many reasons:

- **Retrieval**. Web APIs can quite easily provide you information such as maps, weather or what your public IP address is. For instance `curl ipinfo.io` will return a JSON object with some details about your public IP, region, location, &c. With proper parsing these tools can be integrated even with command line tools. The following bash function talks to Google's autocompletion API and returns the first ten matches.
```bash
function c() {
    url='https://www.google.com/complete/search?client=hp&hl=en&xhr=t'
    # NB: user-agent must be specified to get back UTF-8 data!
    # jq picks the suggestion strings; sed strips the <b>...</b> highlight tags
    curl -H 'user-agent: Mozilla/5.0' -sSG --data-urlencode "q=$*" "$url" | jq -r ".[1][][0]" | sed 's,</\?b>,,g'
}
```

- **Interaction**. Web API endpoints can also be used to trigger actions. These usually require some sort of authentication token that you can obtain through the service. For example running `curl -X POST -H 'Content-type: application/json' --data '{"text":"Hello, World!"}' "https://hooks.slack.com/services/$SLACK_TOKEN"` will send a `Hello, World!` message in a channel.

- **Piping**. Since some services with web APIs are rather popular, common web API "gluing" has already been implemented and is offered as a hosted service. This is the case for services like [If This Then That](https://ifttt.com/) and [Zapier](https://zapier.com/)

## Web Automation

Sometimes web APIs are not enough. If only reading is needed you can use an HTML parser like `pup` or use a library, for example Python has BeautifulSoup. However if interactivity or javascript execution is required those solutions fall short, and you need WebDriver software that drives a real browser.

For example, the following script will save the specified URL using the Wayback Machine, simulating the interaction of typing in the website.

```python
from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


def snapshot_wayback(driver, url):
    # Open the Wayback Machine and submit the URL through its save form
    driver.get("https://web.archive.org/")
    # Selenium 4 replaced find_element_by_class_name with find_element(By.CLASS_NAME, ...)
    elem = driver.find_element(By.CLASS_NAME, 'web-save-url-input')
    elem.clear()
    elem.send_keys(url)
    elem.send_keys(Keys.RETURN)
    driver.close()


driver = Firefox()
url = 'https://hacker-tools.github.io'
snapshot_wayback(driver, url)
```

## Exercises

1. Edit a keyword search engine that you use often in your web browser

1. Install the mentioned extensions. Look into how uBlock Origin/Privacy Badger can be disabled for a website. What differences do you see?
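   Try doing it on a website with plenty of ads like YouTube.

1. Install Stylus and write a custom style for the class website using the CSS provided. Here are some common programming characters `= == === >= => ++ /= ~=`. What happens to them when changing the font to Fira Code? If you want to know more, search for programming font ligatures.

1. Find a web API to get the weather in your city/area.

1. Use WebDriver software like [Selenium](https://docs.seleniumhq.org/) to automate some repetitive manual task that you perform often with your browser.

================================================
FILE: _2020/command-line.md
================================================
---
layout: lecture
title: "Command-line Environment"
date: 2020-01-21
ready: true
sync: true
syncdate: 2025-08-16
video:
  aspect: 56.25
  id: e8BO_dYxk5c
solution:
  ready: true
  url: command-line-solution
---

In this lecture we will go through several ways in which you can improve your workflow when using the shell.

We have been working with the shell for a while now, but so far we have mainly focused on executing different commands. We will now see how to run several processes at the same time while keeping track of them, how to stop or pause a specific process, and how to make a process run in the background.

We will also learn about different ways to improve your shell and other tools, mainly by defining aliases and configuring them using dotfiles. Both of these can save you a lot of time. For example, with a few simple commands we can use the same configuration on all of our machines. We will also look at how to work with remote machines using SSH.

# Job Control

In some cases you will need to interrupt a job while it is executing, for instance if a command is taking too long to complete (such as a `find` over a very large directory structure). Most of the time, you can press `Ctrl-C` and the command will stop. But how does this actually work, and why does it sometimes fail to stop the process?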
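## Killing a process

Your shell uses a UNIX communication mechanism called a _signal_ to communicate information to the process. When a process receives a signal it stops its execution, deals with the signal and potentially changes the flow of execution based on the information that the signal delivered. For this reason, signals are _software interrupts_.

In our case, when typing `Ctrl-C` this prompts the shell to deliver a `SIGINT` signal to the process.

Here is a minimal example of a Python program that captures `SIGINT` and ignores it, so it no longer stops. To kill this program we can use the `SIGQUIT` signal instead, by typing `Ctrl-\`.

```python
#!/usr/bin/env python
import signal, time

# Signal handlers receive the signal number and the current stack frame
def handler(signum, frame):
    print("\nI got a SIGINT, but I am not stopping")

signal.signal(signal.SIGINT, handler)
i = 0
while True:
    time.sleep(.1)
    print("\r{}".format(i), end="")
    i += 1
```

Here is what happens if we send `SIGINT` twice to this program, followed by `SIGQUIT`. Note that `^` is how `Ctrl` is displayed when typed in the terminal:

```
$ python sigint.py
24^C
I got a SIGINT, but I am not stopping
26^C
I got a SIGINT, but I am not stopping
30^\[1]    39913 quit       python sigint.py
```

While `SIGINT` and `SIGQUIT` are both usually associated with terminal-related requests to stop a program, a more generic signal for asking a process to exit gracefully is `SIGTERM`. To send this signal we can use the [`kill`](https://www.man7.org/linux/man-pages/man1/kill.1.html) command, with the syntax `kill -TERM <PID>`.

## Pausing and backgrounding processes

Signals can do other things besides killing a process. For instance, `SIGSTOP` pauses a process. In the terminal, typing `Ctrl-Z` will prompt the shell to send a `SIGTSTP` signal, short for Terminal Stop (i.e. the terminal's version of `SIGSTOP`).

We can then continue the paused job in the foreground or in the background using [`fg`](https://www.man7.org/linux/man-pages/man1/fg.1p.html) or [`bg`](http://man7.org/linux/man-pages/man1/bg.1p.html), respectively.

The [`jobs`](http://man7.org/linux/man-pages/man1/jobs.1p.html) command lists the unfinished jobs associated with the current terminal session. You can refer to those jobs using their pid (you can use [`pgrep`](https://www.man7.org/linux/man-pages/man1/pgrep.1.html) to find that out). More intuitively, you can also refer to a job using the percent symbol followed by its job number (displayed by `jobs`). To refer to the last backgrounded job you can use the `$!` special parameter.

One more thing to know is that the `&` suffix in a command will run the command directly in the background, giving you the prompt back, although it will still use the shell's standard output, which can be annoying (use shell redirections in that case).

To background an already running program you can type `Ctrl-Z` followed by `bg`. Note that backgrounded processes are still child processes of your terminal and will die if you close the terminal (this sends yet another signal, `SIGHUP`). To prevent that from happening you can run the program with [`nohup`](https://www.man7.org/linux/man-pages/man1/nohup.1.html) (a wrapper that ignores `SIGHUP`), or use `disown` if the process has already been started. Alternatively, you can use a terminal multiplexer, as we will see in the next section.

Below is a sample session that showcases some of these concepts.

```
$ sleep 1000
^Z
[1]  + 18653 suspended  sleep 1000

$ nohup sleep 2000 &
[2] 18745
appending output to nohup.out

$ jobs
[1]  + suspended  sleep 1000
[2]  - running    nohup sleep 2000

$ bg %1
[1]  - 18653 continued  sleep 1000

$ jobs
[1]  - running    sleep 1000
[2]  + running    nohup sleep 2000

$ kill -STOP %1
[1]  + 18653 suspended (signal)  sleep 1000

$ jobs
[1]  + suspended (signal)  sleep 1000
[2]  - running    nohup sleep 2000

$ kill -SIGHUP %1
[1]  + 18653 hangup     sleep 1000

$ jobs
[2]  + running    nohup sleep 2000

$ kill -SIGHUP %2

$ jobs
[2]  + running    nohup sleep 2000

$ kill %2
[2]  + 18745 terminated  nohup sleep 2000

$ jobs

```

A special signal is `SIGKILL`: it cannot be captured by the process and it will always terminate it immediately. However, it can have bad side effects such as leaving orphaned child processes.

You can learn more about these and other signals [here]() or by typing [`man signal`](https://www.man7.org/linux/man-pages/man7/signal.7.html) or `kill -l`.

# Terminal Multiplexers

When using the command line interface you will often want to run more than one thing at once. For instance, you might want to run your editor and your program side by side. Although this can be achieved by opening new terminal windows, using a terminal multiplexer is a more versatile solution.

Terminal multiplexers like [`tmux`](https://www.man7.org/linux/man-pages/man1/tmux.1.html) allow you to multiplex terminal windows using panes and tabs so you can interact with multiple shell sessions. Moreover, terminal multiplexers let you detach a current terminal session and reattach at some point later in time. This can make your workflow much better when working with remote machines, since it avoids the need to use `nohup` and similar tricks.

The most popular terminal multiplexer these days is [`tmux`](https://www.man7.org/linux/man-pages/man1/tmux.1.html). `tmux` is highly configurable, and by using the associated keybindings you can create multiple tabs and quickly navigate through them.

`tmux` expects you to know its keybindings, and they all have the form `<C-b> x`, where that means (1) press `Ctrl+b`, (2) release it, and then (3) press `x`. `tmux` has the following hierarchy of objects:

- **Sessions** - a session is an independent workspace with one or more windows
    - `tmux` starts a new session
    - `tmux new -s NAME` starts it with that name
    - `tmux ls` lists the current sessions
    - Within `tmux`, typing `<C-b> d` detaches the current session
    - `tmux a` attaches the last session. You can use the `-t` flag to specify which one
- **Windows** - equivalent to tabs in editors or browsers, they are visually separate parts of the same session
    - `<C-b> c` creates a new window. To close it, exit its shell with `<C-d>`
    - `<C-b> N` goes to the _N_-th window. Note that windows are numbered
    - `<C-b> p` goes to the previous window
    - `<C-b> n` goes to the next window
    - `<C-b> ,` renames the current window
    - `<C-b> w` lists the current windows
- **Panes** - like vim splits, panes let you have multiple shells in the same visual display
    - `<C-b> "` splits the current pane horizontally
    - `<C-b> %` splits the current pane vertically
    - `<C-b> <direction>` moves to the pane in the specified _direction_, where direction means the arrow keys
    - `<C-b> z` toggles zoom for the current pane
    - `<C-b> [` starts scrollback. You can then press `<space>` to start a selection and `<enter>` to copy that selection
    - `<C-b> <space>` cycles through pane arrangements

For further reading, [here](https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/) is a quick tutorial on `tmux` and [this](http://linuxcommand.org/lc3_adv_termmux.php) has a more detailed explanation that also covers the `screen` command. You may also want to become familiar with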
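[`screen`](https://www.man7.org/linux/man-pages/man1/screen.1.html), since it comes installed by default on most UNIX systems.

# Aliases

It can become tiresome to type long commands that involve many flags or verbose options. For this reason, most shells support _aliasing_. A shell alias is a short form for another command that your shell will replace automatically for you. For instance, an alias in bash has the following structure:

```bash
alias alias_name="command_to_alias arg1 arg2"
```

Note that there is no space around the equal sign `=`, because [`alias`](https://www.man7.org/linux/man-pages/man1/alias.1p.html) is a shell command that takes a single argument.

Aliases have many convenient features:

```bash
# Make shorthands for common flags
alias ll="ls -lh"

# Save a lot of typing for common commands
alias gs="git status"
alias gc="git commit"
alias v="vim"

# Save you from mistyping
alias sl=ls

# Overwrite existing commands for better defaults
alias mv="mv -i"           # -i prompts before overwrite
alias mkdir="mkdir -p"     # -p make parent dirs as needed
alias df="df -h"           # -h prints human readable format

# Aliases can be composed
alias la="ls -A"
alias lla="la -l"

# To ignore an alias, run it prepended with \
\ls
# Or disable an alias altogether with unalias
unalias la

# To get an alias definition just call it with alias
alias ll
# Will print ll='ls -lh'
```

Note that aliases do not persist across shell sessions by default. To make an alias persistent you need to include it in your shell startup files, like `.bashrc` or `.zshrc`, which we are going to introduce in the next section.

# Dotfiles

Many programs are configured using plain-text files known as _dotfiles_ (because the file names begin with a `.`, e.g. `~/.vimrc`, so that they are hidden in the directory listing `ls` by default).

Shells are one example of programs configured with such files. On startup, your shell will read many files to load its configuration. Depending on the shell, and on whether you are starting a login and/or interactive session, the entire process can be quite complex. [Here](https://blog.flowblok.id.au/2013-02/shell-startup-scripts.html) is an excellent resource on the topic.

For `bash`, editing your `.bashrc` or `.bash_profile` will work on most systems. Here you can include commands that you want to run on startup, like the aliases we just described or your environment variables. In fact, many programs will ask you to include a line like `export PATH="$PATH:/path/to/program/bin"` in your shell configuration file so their binaries can be found.

Some other examples of tools that can be configured through dotfiles are:

- `bash` - `~/.bashrc`, `~/.bash_profile`
- `git` - `~/.gitconfig`
- `vim` - `~/.vimrc` and the `~/.vim` folder
- `ssh` - `~/.ssh/config`
- `tmux` - `~/.tmux.conf`

How should you organize your dotfiles? They should live in their own folder, under version control, and be **symlinked** into place using a script. This has the benefits of:

- **Easy installation**: if you log in to a new machine, applying your customizations will only take a few minutes.
- **Portability**: your tools will work the same way everywhere.
- **Synchronization**: you can update your dotfiles anywhere and keep them all in sync.
- **Change tracking**: you're probably going to be maintaining your dotfiles for your entire programming career, and version history is nice to have for long-lived projects.

What should you put in your dotfiles? You can learn about your tool's settings by reading online documentation or [man pages](https://en.wikipedia.org/wiki/Man_page). Another great way is to search the internet for blog posts about specific programs, where authors will tell you about their preferred customizations. Yet another way to learn about customizations is to look through other people's dotfiles: you can find countless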
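[dotfiles repositories](https://github.com/search?o=desc&q=dotfiles&s=stars&type=Repositories) on GitHub; the most popular one is [here](https://github.com/mathiasbynens/dotfiles) (we advise you not to copy other people's configurations blindly, though). [Here](https://dotfiles.github.io/) is another good resource on the topic.

All of the class instructors have their dotfiles publicly accessible on GitHub: [Anish](https://github.com/anishathalye/dotfiles), [Jon](https://github.com/jonhoo/configs), [Jose](https://github.com/jjgo/dotfiles).

## Portability

A common pain point with dotfiles is that a configuration might not work across several machines, for example when they run different operating systems or shells. Sometimes you also want some configuration to apply only on a given machine.

There are some tricks for making this easier. If the configuration file supports it, use the equivalent of if-statements to apply machine-specific customizations. For example, your shell configuration could contain something like:

```bash
if [[ "$(uname)" == "Linux" ]]; then {do_something}; fi

# Check the current shell before using shell-specific features
# ($SHELL holds a path such as /usr/bin/zsh, so match on the last component)
if [[ "$SHELL" == */zsh ]]; then {do_something}; fi

# You can also make it machine-specific
if [[ "$(hostname)" == "myServer" ]]; then {do_something}; fi
```

If the configuration file supports it, make use of includes. For example, a `~/.gitconfig` can have a setting:

```
[include]
    path = ~/.gitconfig_local
```

Then on each machine, `~/.gitconfig_local` can contain machine-specific settings. You could even track these in a separate repository for machine-specific settings.

This idea is also useful if you want different programs to share some configuration. For instance, if you want both `bash` and `zsh` to share the same set of aliases, you can write them in `.aliases` and have the following block in both shells' configuration:

```bash
# Test if ~/.aliases exists and source it
if [ -f ~/.aliases ]; then
    source ~/.aliases
fi
```

# Remote Machines

It has become more and more common for programmers to use remote servers in their everyday work. If you need remote servers to deploy backend software, or you need a server with more computational power, you will end up using a Secure Shell (SSH). As with most tools covered here, SSH is highly configurable, so it is worth learning about.

To `ssh` into a server you execute a command as follows:

```bash
ssh foo@bar.mit.edu
```

Here we are trying to log in as user `foo` on the server `bar.mit.edu`. The server can be specified by hostname (like `bar.mit.edu`) or by IP address (something like `foobar@192.168.1.42`). Later we will see that by modifying the ssh config file you can log in with something as short as `ssh bar`.

## Executing commands

An often overlooked feature of `ssh` is the ability to run commands directly. `ssh foobar@server ls` will execute `ls` in the home folder of foobar on the server.
It works with pipes too, so `ssh foobar@server ls | grep PATTERN` will grep locally the remote output of `ls`, and `ls | ssh foobar@server grep PATTERN` will grep remotely the local output of `ls`.

## SSH Keys

Key-based authentication uses public-key cryptography to prove to the server that the client owns the secret private key without revealing the key. This way you do not need to reenter your password every time you log in. Nevertheless, the private key (often `~/.ssh/id_rsa` or `~/.ssh/id_ed25519`) is effectively your password, so keep it safe.

### Key generation

To generate a key pair you can run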
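[`ssh-keygen`](http://man7.org/linux/man-pages/man1/ssh-keygen.1.html):

```bash
ssh-keygen -o -a 100 -t ed25519 -f ~/.ssh/id_ed25519
```

You should choose a passphrase for the key, so that someone who gets hold of your private key cannot use it to access your servers. Use [`ssh-agent`](https://www.man7.org/linux/man-pages/man1/ssh-agent.1.html) or [`gpg-agent`](https://linux.die.net/man/1/gpg-agent) so you do not have to type the passphrase every time.

If you have ever configured pushing to GitHub using SSH keys, then you have probably done the steps outlined [here](https://help.github.com/articles/connecting-to-github-with-ssh/) and already have a valid key pair. To check whether your key has a passphrase and to validate it, you can run `ssh-keygen -y -f /path/to/key`.

### Key based authentication

`ssh` will look into `.ssh/authorized_keys` to determine which clients it should let in. To copy a public key over you can use:

```bash
cat .ssh/id_ed25519.pub | ssh foobar@remote 'cat >> ~/.ssh/authorized_keys'
```

A simpler solution is available with `ssh-copy-id` where it is installed:

```bash
ssh-copy-id -i .ssh/id_ed25519.pub foobar@remote
```

## Copying files over SSH

There are many ways to copy files over ssh:

- `ssh+tee`, the simplest is to use `ssh` command execution and standard input by running `cat localfile | ssh remote_server tee serverfile`. Recall that [`tee`](https://www.man7.org/linux/man-pages/man1/tee.1.html) writes its standard input into a file (and also to standard output).
- [`scp`](https://www.man7.org/linux/man-pages/man1/scp.1.html): when copying large numbers of files or directories, the secure copy `scp` command is more convenient since it can easily recurse over paths. The syntax is `scp path/to/local_file remote_host:path/to/remote_file`.
- [`rsync`](https://www.man7.org/linux/man-pages/man1/rsync.1.html) improves upon `scp` by detecting identical files on the local and remote sides and not copying them again. It also provides more fine-grained control over symlinks and permissions, and has extra features like the `--partial` flag, which can resume a previously interrupted copy. `rsync` has a similar syntax to `scp`.

## Port Forwarding

In many scenarios you will run into software that listens on specific ports of a machine. When this happens on your local machine you can use `localhost:PORT` or `127.0.0.1:PORT`, but what do you do with a remote server whose ports are not directly exposed to you over the network?

This is where _port forwarding_ comes in. It comes in two flavors: local port forwarding and remote port forwarding (see the pictures below, taken from [this Stack Exchange post](https://unix.stackexchange.com/questions/115897/whats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot)).

**Local Port Forwarding**
![Local Port Forwarding](https://i.sstatic.net/a28N8.png)

**Remote Port Forwarding**
![Remote Port Forwarding](https://i.sstatic.net/4iK3b.png)

The most common scenario is local port forwarding, where a service on the remote machine listens on a port and you want a port on your local machine to forward to that remote port. For example, suppose we run Jupyter notebook on the remote server and it listens on port `8888`. To forward that to the local port `9999`, we would run `ssh -L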
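9999:localhost:8888 foobar@remote_server` and then navigate to `localhost:9999` on our local machine.

## SSH Configuration

We have covered many arguments that we can pass. A tempting option is to create a shell alias that looks like this:

```bash
# Note: ssh takes the port with -p (it has no --port option)
alias my_server="ssh -i ~/.ssh/id_ed25519 -p 2222 -L 9999:localhost:8888 foobar@remote_server"
```

However, there is a better alternative using `~/.ssh/config`.

```bash
Host vm
    User foobar
    HostName 172.16.174.141
    Port 2222
    IdentityFile ~/.ssh/id_ed25519
    LocalForward 9999 localhost:8888

# Configs can also take wildcards
Host *.mit.edu
    User foobaz
```

An additional advantage of using the `~/.ssh/config` file over aliases is that other programs like `scp`, `rsync`, and `mosh` are able to read it as well and convert the settings into the corresponding flags.

Note that the `~/.ssh/config` file can be considered a dotfile, and in general it is fine to keep it with the rest of your dotfiles. However, if you make it public, think about the information you are potentially giving to strangers on the internet: addresses of your servers, user names, open ports, and so on. This may facilitate some types of attacks, so think twice before sharing your SSH configuration.

Server-side configuration is usually specified in `/etc/ssh/sshd_config`. Here you can make changes like disabling password authentication, changing the ssh port, enabling X11 forwarding, and so on. You can also specify config settings on a per-user basis.

## Miscellaneous

A common pain when connecting to a remote server is disconnection caused by your computer shutting down, going to sleep, or changing networks. A connection with significant lag can also be quite frustrating. [Mosh](https://mosh.org/), the mobile shell, improves upon ssh by allowing roaming connections, tolerating intermittent connectivity, and providing intelligent local echo.

Sometimes it is convenient to mount a remote folder. [sshfs](https://github.com/libfuse/sshfs) can mount a folder on a remote server locally, and then you can use a local editor.

# Shells & Frameworks

During the shell tools and scripting lecture we covered the `bash` shell because it is by far the most ubiquitous shell and most systems have it as the default option. Nevertheless, it is not the only option.

For example, the `zsh` shell is (for the most part) a superset of `bash` and provides many convenient features such as:

- Smarter globbing, `**`
- Inline globbing/wildcard expansion
- Spelling correction
- Better tab completion/selection
- Path expansion (`cd /u/lo/b` will expand to `/usr/local/bin`)

**Frameworks** can improve your shell as well. Some popular general frameworks are [prezto](https://github.com/sorin-ionescu/prezto) or [oh-my-zsh](https://ohmyz.sh/), and there are smaller ones that focus on a specific feature, such as [zsh-syntax-highlighting](https://github.com/zsh-users/zsh-syntax-highlighting) or [zsh-history-substring-search](https://github.com/zsh-users/zsh-history-substring-search). Shells like [fish](https://fishshell.com/) include many of these user-friendly features by default, including:

- Right prompt
- Command syntax highlighting
- History substring search
- Man page based flag completions
- Smarter autocompletion
- Prompt themes

One thing to note when using these frameworks is that they may slow down your shell, especially if the code they run is not properly optimized or there is too much of it. You can always profile it and disable the features you rarely use to balance speed against functionality.

# Terminal Emulators

Along with customizing your shell, it is worth spending some time choosing a **terminal emulator** that suits you and configuring it. There are many terminal emulators out there (here is a [comparison](https://anarc.at/blog/2018-04-12-terminal-emulators-1/)).

Since you will spend a lot of time in your terminal, it pays off to look into its settings. Some of the aspects that you may want to configure include:

- Font choice
- Color scheme
- Keyboard shortcuts
- Tab/pane support
- Scrollback configuration
- Performance (some newer terminals like [Alacritty](https://github.com/jwilm/alacritty)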
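or [kitty](https://sw.kovidgoyal.net/kitty/) offer GPU acceleration).

# Exercises

[Solutions]({{site.url}}/{{site.solution_url}}/{{page.solution.url}})

## Job control

1. We can use commands like `ps aux | grep` to get our jobs' pids and then kill them, but there are better ways to do it. Start a `sleep 10000` job in a terminal, suspend it with `Ctrl-Z`, and continue running it in the background with `bg`. Now use [`pgrep`](https://www.man7.org/linux/man-pages/man1/pgrep.1.html) to find its pid and [`pkill`](https://www.man7.org/linux/man-pages/man1/pgrep.1.html) to kill it without ever typing the pid itself. (Hint: use the `-af` flags.)

2. Say you don't want to start a process until another completes. How would you go about it? In this exercise, our limiting process will always be `sleep 60 &`. One way to achieve this is to use the [`wait`](http://man7.org/linux/man-pages/man1/wait.1p.html) command. Try launching the sleep command and having an `ls` wait until the background process finishes.

   However, this strategy will fail if we start in a different bash session, since `wait` only works for child processes. One feature we did not discuss in the notes is that the `kill` command's exit status is zero on success and nonzero otherwise. `kill -0` does not send a signal but will give a nonzero exit status if the process does not exist. Write a bash function called `pidwait` that takes a pid and waits until the given process completes. You should use `sleep` to avoid wasting CPU unnecessarily.

## Terminal multiplexer

1. Follow this `tmux` [tutorial](https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/) and then learn how to do some basic customizations following [these steps](https://www.hamvocke.com/blog/a-guide-to-customizing-your-tmux-conf/).

## Aliases

1. Create an alias `dc` that resolves to `cd` for when you mistype it.
2. Run `history | awk '{$1="";print substr($0,2)}' | sort | uniq -c | sort -n | tail -n 10` to get your top 10 most used commands and consider writing shorter aliases for them. Note: this works for Bash; if you are using ZSH, use `history 1` instead of just `history`.

## Dotfiles

Let's get you up to speed with dotfiles.

1. Create a folder for your dotfiles and set up version control.
2. Add a configuration for at least one program, e.g. your shell, with some customization (to start off, it can be something as simple as customizing your shell prompt by setting `$PS1`).
3. Set up a method to install your dotfiles quickly (and without manual effort) on a new machine. This can be as simple as a shell script that calls `ln -s` for each file, or you could use a [specialized utility](https://dotfiles.github.io/utilities/).
4. Test your installation script on a fresh virtual machine.
5. Migrate all of your current tool configurations to your dotfiles repository.
6. Publish your dotfiles on GitHub.

## Remote Machines

Install a Linux virtual machine (or use an already existing one) for this exercise. If you are not familiar with virtual machines, check out [this tutorial](https://hibbard.eu/install-ubuntu-virtual-box/) for installing one.

1. Go to `~/.ssh/` and check if you have a pair of SSH keys there. If not, generate them with `ssh-keygen -o -a 100 -t ed25519`. It is recommended that you set a passphrase and use `ssh-agent`; more info [here](https://www.ssh.com/ssh/agent).
2.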
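Edit `.ssh/config` to have an entry as follows:

```bash
Host vm
    User username_goes_here
    HostName ip_goes_here
    IdentityFile ~/.ssh/id_ed25519
    LocalForward 9999 localhost:8888
```

3. Use `ssh-copy-id vm` to copy your ssh key to the server.
4. Start a web server in your VM by executing `python -m http.server 8888`. Access the VM web server by navigating to `http://localhost:9999` on your machine.
5. Edit your SSH server config by running `sudo vim /etc/ssh/sshd_config` and disable password authentication by editing the value of `PasswordAuthentication`. Disable root login by editing the value of `PermitRootLogin`. Restart the `ssh` service with `sudo service sshd restart`. Try sshing in again.
6. (Challenge) Install [`mosh`](https://mosh.org/) in the VM and establish a connection. Then disconnect the network adapter of the server/VM. Can mosh properly recover from it?
7. (Challenge) Look into what the `-N` and `-f` flags do in `ssh` and figure out a command to achieve background port forwarding.

================================================
FILE: _2020/course-shell.md
================================================
---
layout: lecture
title: "Course overview + the shell"
date: 2020-01-13
ready: true
sync: true
syncdate: 2025-08-16
video:
  aspect: 56.25
  id: Z56Jmr9Z34Q
solution:
  ready: true
  url: course-shell-solution
---

# Motivation

As computer scientists, we know that computers are great at aiding in repetitive tasks. However, far too often, we forget that this applies just as much to our _use_ of the computer as it does to the computations we want our programs to perform. We have a vast range of tools available at our fingertips that enable us to be more productive when working on any computer-related problem. Yet many of us use only a small fraction of those tools; we only know enough magical incantations by rote to get by, and blindly copy-paste commands from the internet when we get stuck.

This class is an attempt to address this.

We want to teach you how to make the most of the tools you know, show you new tools to add to your toolbox, and hopefully instill in you some excitement for exploring (and perhaps building) more tools on your own. This is what we believe to be the missing semester from most Computer Science curricula.

# Class structure

The class consists of 11 lectures of about one hour each, each one centering on a [particular topic](/missing-semester/2020/). The lectures are largely independent, though as the course goes on we will presume that you are familiar with the content from the earlier lectures. We have lecture notes online, but a lot of the content covered in class may not be in the notes, so we will also record the lectures and post the recordings online.

We are trying to cover a lot of ground over the course of just 11 one-hour lectures, so the lectures are fairly dense. To allow you some time to get familiar with the content at your own pace, each lecture includes a set of exercises that guide you through the lecture's key points. After each lecture, we host office hours where we will be present to help answer any questions you might have. If you are attending the class online, you can send us questions at [missing-semester@mit.edu](mailto:missing-semester@mit.edu).

Due to the limited time we have, we won't be able to cover all the tools in the same level of detail a full-scale class might. Where possible, we will try to point you towards resources for digging further into a tool or topic, but if something particularly strikes your fancy, don't hesitate to reach out to us.

# Topic 1: The Shell

## What is the shell?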
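Computers these days have a variety of interfaces for giving them commands; fancy graphical user interfaces (GUIs), voice interfaces, and even AR/VR are everywhere. These are great for 80% of use cases, but they are often fundamentally restricted in what they allow you to do: you cannot press a button that isn't there or give a voice command that hasn't been programmed. To take full advantage of the tools your computer provides, we have to go old-school and drop down to a textual interface: the shell.

Nearly all platforms you can get your hands on have a shell in one form or another, and many of them have several shells for you to choose from. While they may vary in the details, at their core they are all roughly the same: they allow you to run programs, give them input, and inspect their output in a semi-structured way.

In this lecture, we will focus on the Bourne Again SHell, or "bash" for short. This is one of the most widely used shells, and its syntax is similar to what you will see in many other shells. To open a shell _prompt_ (where you can type commands), you first need a _terminal_. Your device probably shipped with one installed, or you can install one fairly easily.

## Using the shell

When you launch your terminal, you will see a _prompt_ that often looks a little like this:

```console
missing:~$
```

This is the main textual interface to the shell. It tells you that you are on the machine `missing` and that your "current working directory", or where you currently are, is `~` (short for "home"). The `$` tells you that you are not the root user (more on that later). At this prompt you can type a _command_, which will then be interpreted by the shell. The most basic command is to execute a program:

```console
missing:~$ date
Fri 10 Jan 2020 11:49:31 AM EST
missing:~$
```

Here, we executed the `date` program, which (perhaps unsurprisingly) prints the current date and time. The shell then asks us for another command to execute. We can also execute a command with _arguments_:

```console
missing:~$ echo hello
hello
```

In this case, we told the shell to execute the program `echo` with the argument `hello`. The `echo` program simply prints out its arguments. The shell parses the command by splitting it on whitespace, and then runs the program indicated by the first word, supplying each subsequent word as an argument that the program can access. If you want to provide an argument that contains spaces (e.g., a directory named "My Photos"), you can either quote the argument with `'` or `"` (`"My Photos"`), or escape the space with `\` (`My\ Photos`).

But how does the shell know how to find the `date` or `echo` programs? Well, the shell is a programming environment, just like Python or Ruby, and so it has variables, conditionals, loops, and functions (covered in the next lecture). When you run commands in your shell, you are really writing a small bit of code that your shell interprets. If the shell is asked to execute a command that doesn't match one of its programming keywords, it consults an _environment variable_ called `$PATH` that lists which directories the shell should search for programs when it is given a command:

```console
missing:~$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
missing:~$ which echo
/bin/echo
missing:~$ /bin/echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
```

When we run the `echo` command, the shell sees that it needs to execute the program `echo`, and then searches through the `:`-separated list of directories in `$PATH` for a file by that name. When it finds it, it runs it (assuming the file is _executable_; more on that later). We can find out which file is executed for a given program name using the `which` program. We can also bypass `$PATH` entirely by giving the _path_ to the file we want to execute.

## Navigating in the shell

A path on the shell is a delimited list of directories, separated by `/` on Linux and macOS and `\` on Windows. On Linux and macOS, the path `/` is the "root" of the file system, under which all directories and files lie, whereas on Windows there is one root for each disk partition (e.g., `C:\`). We will generally assume that you are using a Linux filesystem in this class. A path that starts with `/` is called an _absolute_ path. Any other path is a _relative_ path.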
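Relative paths are relative to the current working directory, which we can see with the `pwd` command and change with the `cd` command. In a path, `.` refers to the current directory, and `..` to its parent directory:

```console
missing:~$ pwd
/home/missing
missing:~$ cd /home
missing:/home$ pwd
/home
missing:/home$ cd ..
missing:/$ pwd
/
missing:/$ cd ./home
missing:/home$ pwd
/home
missing:/home$ cd missing
missing:~$ pwd
/home/missing
missing:~$ ../../bin/echo hello
hello
```

Notice that our shell prompt kept us informed about what our current working directory was. You can configure your prompt to show you all sorts of useful information, which we will cover in a later lecture.

In general, when we run a program, it will operate in the current directory unless we tell it otherwise. For example, it will usually search for files there, and create new files there if it needs to.

To see what lives in a given directory, we use the `ls` command:

```console
missing:~$ ls
missing:~$ cd ..
missing:/home$ ls
missing
missing:/home$ cd ..
missing:/$ ls
bin
boot
dev
etc
home
...
```

Unless a directory is given as its first argument, `ls` will print the contents of the current directory. Most commands accept flags and options (flags with values) that start with `-` to modify their behavior. Usually, running a program with the `-h` or `--help` flag will print some help text that tells you what flags and options are available. For example, `ls --help` tells us:

```
  -l                         use a long listing format
```

```console
missing:~$ ls -l /home
drwxr-xr-x 1 missing  users  4096 Jun 15  2019 missing
```

This gives us a bunch more information about each file or directory present. First, the `d` at the beginning of the line tells us that `missing` is a directory. Then follow three groups of three characters (`rwx`). These indicate what permissions the owner of the file (`missing`), the owning group (`users`), and everyone else respectively have on the relevant item. A `-` indicates that the given principal does not have the given permission. Above, only the owner is allowed to modify (`w`) the `missing` directory (i.e., add or remove files in it). To enter a directory, a user must have "search" (represented by "execute": `x`) permissions on that directory (and its parents). To list its contents, a user must have read (`r`) permissions on that directory. For files, the permissions are as you would expect. Notice that nearly all the files in `/bin` have the `x` permission set for the last group, "everyone else", so that anyone can execute those programs.

Some other handy programs to know about at this point are `mv` (to rename or move a file), `cp` (to copy a file), and `mkdir` (to make a new directory).

If you ever want _more_ information about a program's arguments, inputs, outputs, or how it works in general, give the `man` program a try. It takes as an argument the name of a program, and shows you its _manual page_. Press `q` to exit.

```console
missing:~$ man ls
```

## Connecting programs

In the shell, programs have two primary "streams" associated with them: their input stream and their output stream. When the program tries to read input, it reads from the input stream, and when it prints something, it prints to its output stream. Normally, a program's input and output are both your terminal. That is, your keyboard as input and your screen as output. However, we can also rewire those streams!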
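The simplest form of redirection is `< file` and `> file`. These let you rewire the input and output streams of a program to a file respectively:

```console
missing:~$ echo hello > hello.txt
missing:~$ cat hello.txt
hello
missing:~$ cat < hello.txt
hello
missing:~$ cat < hello.txt > hello2.txt
missing:~$ cat hello2.txt
hello
```

You can also use `>>` to append to a file. Where this kind of input/output redirection really shines is in the use of _pipes_. The `|` operator lets you "chain" programs such that the output of one is the input of another:

```console
missing:~$ ls -l / | tail -n1
drwxr-xr-x 1 root  root  4096 Jun 20  2019 var
missing:~$ curl --head --silent google.com | grep --ignore-case content-length | cut --delimiter=' ' -f2
219
```

We will go into a lot more detail about how to take advantage of pipes in the lecture on data wrangling.

## A versatile and powerful tool

On most Unix-like systems, one user is special: the "root" user. You may have seen it in the file listings above. The root user is above (almost) all access restrictions, and can create, read, update, and delete any file in the system. You will not usually log into your system as the root user though, since it is too easy to accidentally break something. Instead, you will use the `sudo` command. As its name implies, it lets you "do" something "as su" (short for "super user", or "root"). When you get permission denied errors, it is usually because you need to do something as root. Still, double-check first that you really want to do it that way.

One thing you need to be root to do is writing to the `sysfs` file system mounted under `/sys`. `sysfs` exposes a number of kernel parameters as files, so that you can easily reconfigure the kernel on the fly without specialized tools. **Note that sysfs does not exist on Windows or macOS.**

For example, the brightness of your laptop's screen is exposed through a file called `brightness` under

```
/sys/class/backlight
```

By writing a value into that file, we can change the screen brightness. Your first instinct might be to do something like:

```console
$ sudo find -L /sys/class/backlight -maxdepth 2 -name '*brightness*'
/sys/class/backlight/thinkpad_screen/brightness
$ cd /sys/class/backlight/thinkpad_screen
$ sudo echo 3 > brightness
An error occurred while redirecting file 'brightness'
open: Permission denied
```

This error may come as a surprise. After all, we ran the command with `sudo`! This is an important thing to know about the shell. Operations like `|`, `>`, and `<` are done _by the shell_, not by the individual program. `echo` and friends do not "know" about `|`. They just read from their input and write to their output, whatever it may be. In the case above, the _shell_ (which is running as your user, not root) tries to open the brightness file for writing before setting it as the output of `sudo echo`, but is prevented from doing so since the shell does not run as root. Using this knowledge, we can work around this:

```console
$ echo 3 | sudo tee brightness
```

Since the `tee` program is the one to open the `/sys` file for writing, and _it_ is running as `root`, the permissions all work out. You can control all sorts of fun and useful things through `/sys`, such as the state of various system LEDs (your path might be different):

```console
$ echo 1 | sudo tee /sys/class/leds/input6::scrolllock/brightness
```

# Next steps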
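At this point you know your way around a shell well enough to accomplish basic tasks. You should be able to navigate around to find files of interest and use the basic functionality of most programs. In the next lecture, we will talk about how to perform and automate more complex tasks using the shell and other tools.

# Exercises

[Solutions]({{site.url}}/{{site.solution_url}}/{{page.solution.url}})

All classes in this course are accompanied by a series of exercises. Some give you a specific task to do, while others are open-ended, like "try using X and Y". We highly encourage you to try them out.

We have not written solutions for the exercises. If you are stuck on anything in particular, feel free to send us an email describing what you have tried so far, and we will try to help you out.

1. For this course, you need to be using a Unix shell like Bash or ZSH. If you are on Linux or macOS, you don't have to do anything special. If you are on Windows, you should not use cmd.exe or PowerShell; you can use [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/) or a Linux virtual machine instead. To make sure you are running an appropriate shell, try the command `echo $SHELL`. If it prints something like `/bin/bash` or `/usr/bin/zsh`, you are running the right program.
2. Create a new directory called `missing` under `/tmp`.
3. Look up the `touch` program using `man`.
4. Use `touch` to create a new file called `semester` in `missing`.
5. Write the following into that file, one line at a time:

   ```
   #!/bin/sh
   curl --head --silent https://missing.csail.mit.edu
   ```

   The first line might be tricky to get working. It helps to know that `#` starts a comment in Bash, and `!` has a special meaning even within double-quoted (`"`) strings. Bash treats single-quoted strings (`'`) differently: they will do the trick in this case. See the Bash [quoting manual page](https://www.gnu.org/software/bash/manual/html_node/Quoting.html) for more information.
6. Try to execute the file, i.e. type the path to the script (`./semester`) into your shell and press enter. If it does not run, use the output of `ls` to understand why.
7. Look up the `chmod` program (e.g. use `man chmod`).
8. Use `chmod` to make it possible to run the command `./semester` rather than having to type `sh semester`. How does your shell know that the file is supposed to be interpreted using `sh`? See this page on the [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix)) line for more information.
9. Use `|` and `>` to write the "last modified" date output by `semester` into a file called `last-modified.txt` in your home directory.
10. Write a command that reads out your laptop battery's power level or your desktop machine's CPU temperature from `/sys`. Note: macOS does not have sysfs, so Mac users can skip this exercise.

================================================
FILE: _2020/data-wrangling.md
================================================
---
layout: lecture
title: "Data Wrangling"
date: 2020-01-16
ready: true
sync: true
syncdate: 2025-08-16
video:
  aspect: 56.25
  id: sz_dsktIjt4
solution:
  ready: true
  url: data-wrangling-solution
---

Have you ever wanted to take data in one format and turn it into a different format? Of course you have!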
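That, in very general terms, is what this lecture is all about. Specifically, we keep massaging data until we end up with exactly the result we want.

We have already seen some basic data wrangling in past lectures. Pretty much any time you use the `|` operator, you are performing some kind of data wrangling. Consider a command like `journalctl | grep -i intel`. It finds all system log entries that mention intel (case insensitive). You may not think of it as wrangling data, but it is going from one format (your entire system log) to a format that is more useful to you (just the intel log entries). Most data wrangling is about knowing what tools you have at your disposal, and how to combine them.

Let's start from the beginning. To wrangle data, we need two things: data to wrangle, and something to do with it. Logs often make for a good use case, because you often want to investigate things about them, and reading the whole thing isn't feasible. Let's figure out who is trying to log into our server by looking at the server's log:

```bash
ssh myserver journalctl
```

That's far too much stuff. Let's limit it to sshd entries:

```bash
ssh myserver journalctl | grep sshd
```

Notice that we are using a pipe to stream a _remote_ file through `grep` on our local computer! `ssh` is magical, and we will talk more about it in the next lecture on the command-line environment. This is still way more stuff than we wanted though, and pretty hard to read. Let's do better:

```bash
ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' | less
```

Why the additional quoting? Well, our logs may be quite large, and it is wasteful to stream it all to our computer and then do the filtering. Instead, we can do the filtering on the remote server, and then transfer only the result. `less` gives us a "pager" that allows us to scroll up and down through the long output. To save some additional traffic, we can even stick the current filtered logs into a file so that we don't have to access the network again:

```console
$ ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' > ssh.log
$ less ssh.log
```

There is still a lot of noise here. There are many ways to get rid of it, but let's look at one of the most powerful tools in your toolkit: `sed`.

`sed` is a "stream editor" that builds on top of the old `ed` editor. In it, you basically give short commands for how to modify the file, rather than manipulating its contents directly (although you can do that too). There are tons of commands, but one of the most common ones is `s`: *substitution*. For example, we can write:

```bash
ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed 's/.*Disconnected from //'
```

What we just wrote was a simple *regular expression*, a powerful construct that lets you match text against patterns. The `s` command is written in the form `s/REGEX/SUBSTITUTION/`, where `REGEX` is the regular expression you want to search for, and `SUBSTITUTION` is the text you want to substitute matching text with.

(You may recognize this syntax from the "Search and replace" section of our Vim [lecture notes](/2020/editors/#advanced-vim)! Indeed, Vim uses a syntax for searching and replacing that is similar to `sed`'s substitution command. Learning one tool often helps you become more proficient with others.)

## Regular expressions

Regular expressions are common and useful enough that it is worthwhile to take some time to understand how they work. Let's start by looking at the one we used above: `/.*Disconnected from /`. Regular expressions are usually (though not always) surrounded by `/`. Most ASCII characters just carry their normal meaning, but some characters have "special" matching behavior. Exactly which characters do what varies somewhat between different implementations of regular expressions, which is a source of great frustration. Very common patterns are:

- `.` means "any single character" except newline
- `*` zero or more of the preceding match
- `+` one or more of the preceding match
- `[abc]` any one character of `a`, `b`, and `c`
- `(RX1|RX2)` either something that matches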
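`RX1` or `RX2`
- `^` the start of the line
- `$` the end of the line

`sed`'s regular expressions are somewhat weird, and will require you to put a `\` before some of these (for example `+`, `(` and `|`) to give them their special meaning. Or you can pass `-E`.

So, looking back at `/.*Disconnected from /`, we see that it matches any text that starts with any number of characters, followed by the literal string "Disconnected from ". Which is what we wanted. But beware, regular expressions are tricky. What if someone tried to log in with the username "Disconnected from"? We would have:

```
Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth]
```

What would we end up with? Well, `*` and `+` are, by default, "greedy". They will match as much text as they can. So, in the above, we would end up with just

```
46.97.239.16 port 55920 [preauth]
```

Which may not be what we wanted. In some regular expression implementations, you can just suffix `*` or `+` with a `?` to make them non-greedy, but sadly `sed` doesn't support that. We _could_ switch to perl's command-line mode though, which _does_ support that construct:

```bash
perl -pe 's/.*?Disconnected from //'
```

We will stick to `sed` for the rest of this, because it is by far the more common tool for these kinds of jobs. `sed` can also do other handy things like print lines following a given match, do multiple substitutions per invocation, search for things, etc. But we won't cover that too much here. `sed` is basically an entire topic in itself, but there are often better tools for a specific job.

Okay, so we also have a suffix we would like to get rid of. How might we do that? It is a little tricky to match just the text that follows the username, especially if the username can have spaces and such! What we need to do is match the _whole_ line:

```bash
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user .* [^ ]+ port [0-9]+( \[preauth\])?$//'
```

Let's look at what is going on with a [regex debugger](https://regex101.com/r/qqbZqh/2). Okay, so the start is still as before. Then, we match either of the "user" variants (there are two prefixes in the logs). Then we match any string of characters where the username is. Then we match any single word (`[^ ]+`; any non-empty sequence of non-space characters). Then the word "port" followed by a sequence of digits. Then possibly the suffix ` [preauth]`, and then the end of the line.

Notice that with this technique, a username of "Disconnected from" won't confuse us any more. Can you see why?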
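
The greedy-matching pitfall and the whole-line pattern described above can be checked locally, without a server. The following is a minimal sketch, assuming `sed` and `perl` are installed; the variable names (`line`, `greedy`, `lazy`, `whole`) are ours, and the log line is the sample from the text.

```bash
# Sample log line from the text: the username itself is "Disconnected from".
line='Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth]'

# Greedy: .* swallows everything up to the LAST "Disconnected from ".
greedy=$(printf '%s\n' "$line" | sed 's/.*Disconnected from //')

# Non-greedy (Perl only): .*? stops at the FIRST "Disconnected from ".
lazy=$(printf '%s\n' "$line" | perl -pe 's/.*?Disconnected from //')

# Whole-line pattern: the entire line matches and is replaced by nothing.
whole=$(printf '%s\n' "$line" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user .* [^ ]+ port [0-9]+( \[preauth\])?$//')

printf 'greedy: %s\n' "$greedy"
printf 'lazy:   %s\n' "$lazy"
printf 'whole:  [%s]\n' "$whole"
```

Here `greedy` loses the username entirely, `lazy` keeps everything after the first "Disconnected from ", and `whole` comes out empty because the pattern consumed the entire line.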
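There is one problem with this though: every log line is replaced by the empty string, so the entire log becomes empty. We want to _keep_ the username after all. For this, we can use "capture groups". Any text matched by a regex surrounded by parentheses is stored in a numbered capture group. These are available in the substitution (and in some engines, even in the pattern itself!) as `\1`, `\2`, `\3`, etc. So:

```bash
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
```

As you can probably imagine, you can come up with _really_ complicated regular expressions. For example, here is an article on how you might match an [e-mail address](https://www.regular-expressions.info/email.html). It is [not easy](https://emailregex.com/). There is [lots of discussion](https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression/1917982). People have [written tests](https://fightingforalostcause.net/content/misc/2006/compare-email-regex.php) and [test matrices](https://mathiasbynens.be/demo/url-regex). You can even write a regex for determining if a given number [is a prime number](https://www.noulakaz.net/2007/03/18/a-regular-expression-to-check-for-prime-numbers/).

Regular expressions are notoriously hard to get right, but they are also very handy to have in your toolbox!

## Back to data wrangling

Okay, so we now have

```bash
ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
```

`sed` can do all sorts of other interesting things, like injecting text (with the `i` command), explicitly printing lines (with the `p` command), selecting lines by index, and lots of other things. Check `man sed`!

What we have now gives us a list of all the usernames that have attempted to log in. But this is not very helpful yet. Let's look for the common ones:

```bash
ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c
```

`sort` will, well, sort its input. `uniq -c` will collapse consecutive lines that are the same into a single line, prefixed with a count of the number of occurrences. We probably want to sort that too and only keep the most common usernames:

```bash
ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c |
  sort -nk1,1 | tail -n10
```

`sort -n` will sort in numeric (instead of lexicographic) order. `-k1,1` means "sort by only the first whitespace-separated column". The `,n` part says "sort until the `n`th field, where the default is the end of the line". In this _particular_ example, sorting by the whole line wouldn't matter, but we are here to learn!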
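
The counting pipeline above can be tried on a small, made-up list of usernames before pointing it at real logs. This is only a sketch: the names in `users` are hypothetical stand-ins for the output of the `sed` step, and `counts` is a variable name of our choosing.

```bash
# Hypothetical usernames, one per line, standing in for the sed output.
users='root
admin
root
guest
admin
root'

# Group identical names, count them, order by the numeric count,
# and keep the two most frequent entries.
counts=$(printf '%s\n' "$users" | sort | uniq -c | sort -nk1,1 | tail -n2)
printf '%s\n' "$counts"
```

The last line printed is the most frequent name (`root`, seen three times), preceded by `admin` with two occurrences. The exact padding in front of each count differs between `uniq` implementations, which is one reason to sort with `-n` rather than relying on column alignment.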
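If we wanted the _least_ common ones, we could use `head` instead of `tail`. There is also `sort -r`, which sorts in reverse order.

Okay, so that's pretty cool, but what if we would like only the usernames, and not one per line?

```bash
ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c |
  sort -nk1,1 | tail -n10 |
  awk '{print $2}' | paste -sd,
```

If you are using macOS: note that the command as shown won't work with the BSD `paste` shipped with macOS. See the exercises of [Course overview + the shell](https://missing-semester-cn.github.io/2020/course-shell/) for more information.

`paste` lets us combine lines (`-s`) using a given delimiter (`-d`). But what is this `awk` business?

## awk -- another editor

`awk` is a programming language that just happens to be really good at processing text. There is _a lot_ to say about `awk`, but as with many other things here, we will just go through the basics.

First, what does `{print $2}` do? Well, `awk` programs take the form of an optional pattern plus a block saying what to do if the pattern matches a given line. The default pattern (which we used above) matches all lines. Inside the block, `$0` is set to the entire line's contents, and `$1` through `$n` are set to the `n`th _field_ of that line, when separated by the `awk` field separator (whitespace by default, change with `-F`). In this case, we are saying that, for every line, print the contents of the second field, which happens to be the username!

Let's see if we can do something fancier. Let's compute the number of single-use usernames that start with `c` and end with `e`:

```bash
 | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l
```

There is a lot to unpack here. First, notice that we now have a pattern (the stuff that goes before `{...}`). The pattern says that the first field of the line should be equal to 1 (that's the count from `uniq -c`), and that the second field should match the given regular expression. The block just says to print the username. We then count the number of lines in the output with `wc -l`.

However, `awk` is a programming language, remember?

```awk
BEGIN { rows = 0 }
$1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
END { print rows }
```

`BEGIN` is a pattern that matches the start of the input (and `END` matches the end). Now, the per-line block just adds the count from the first field (which is always 1 in this case), and then we print the total at the end. In fact, we _could_ get rid of `grep` and `sed` entirely, because `awk` [can do it all](https://backreference.org/2010/02/10/idiomatic-awk), but we will leave that as an exercise to the reader.

## Analyzing data

You can do math too! For example, you can add up the numbers on each line like this:

```bash
 | paste -sd+ | bc -l
```

Or produce more elaborate expressions:

```bash
echo "2*($(data | paste -sd+))" | bc -l
```

You can get stats in a variety of ways. [`st`](https://github.com/nferraz/st) is pretty neat, but if you already have [R](https://www.r-project.org):

```bash
ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c |
  awk '{print $1}' | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
```

R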
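is another programming language, one that is great at data analysis and [plotting](https://ggplot2.tidyverse.org/). We won't go into too much detail, but suffice to say that `summary` prints summary statistics for a vector, and we created a vector containing the input stream of numbers, so R gives us the statistics we wanted!

If you just want some simple plotting, `gnuplot` is your friend:

```bash
ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c |
  sort -nk1,1 | tail -n10 |
  gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'
```

## Wrangling to make arguments

Sometimes you want to do data wrangling to find things to install or remove based on some longer list. The data wrangling we have talked about so far, combined with `xargs`, can do exactly that:

```bash
rustup toolchain list | grep nightly | grep -vE "nightly-x86" | sed 's/-x86.*//' | xargs rustup toolchain uninstall
```

## Wrangling binary data

So far, we have mostly talked about wrangling textual data, but pipes are just as useful for binary data. For example, we can use ffmpeg to capture an image from our camera, convert it to grayscale, compress it, send it to a remote machine over SSH, decompress it there, make a copy, and then display it.

```bash
ffmpeg -loglevel panic -i /dev/video0 -frames 1 -f image2 - |
  convert - -colorspace gray - |
  gzip |
  ssh mymachine 'gzip -d | tee copy.jpg | env DISPLAY=:0 feh -'
```

# Exercises

[Solutions]({{site.url}}/{{site.solution_url}}/{{page.solution.url}})

1. Take this [short interactive regex tutorial](https://regexone.com/).
2. Find the number of words (in `/usr/share/dict/words`) that contain at least three `a`s and don't have a `'s` ending. What are the three most common last two letters of those words? `sed`'s `y` command, or the `tr` program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?
3. To do in-place substitution it is quite tempting to do something like `sed s/REGEX/SUBSTITUTION/ input.txt > input.txt`. However this is a bad idea, why? Is this particular to `sed`? Use `man sed` to find out how to accomplish this.
4. Find your average, median, and max system boot time over the last ten boots. Use `journalctl` on Linux and `log show` on macOS, and look for log timestamps near the beginning and end of each boot. On Linux, they may look something like:

   ```
   Logs begin at ...
   ```

   and

   ```
   systemd[577]: Startup finished in ...
   ```

   On macOS, [look for](https://eclecticlight.co/2018/03/21/macos-unified-log-3-finding-your-way/):

   ```
   === system boot:
   ```

   and

   ```
   Previous shutdown cause: 5
   ```

5.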
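Look for boot messages that are _not_ shared between your past three reboots (see `journalctl`'s `-b` flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use `sed '0,/STRING/d'` to remove all lines previous to one that matches `STRING`. Next, remove any parts of the line that _always_ vary (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (`uniq` is your friend). Finally, eliminate any line whose count is 3 (since it _was_ shared among all the boots).
6. Find an online data set like [this one](https://stats.wikimedia.org/EN/TablesWikipediaZZ.htm), [this one](https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/topic-pages/tables/table-1), or maybe one [from here](https://www.springboard.com/blog/free-public-data-sets-data-science-project/). Fetch it using `curl` and extract just two columns of numerical data. If you are fetching HTML data, [`pup`](https://github.com/EricChiang/pup) might be helpful. For JSON data, try [`jq`](https://stedolan.github.io/jq/). Find the min and max of one column in a single command, and the difference of the sum of each column in another.

================================================
FILE: _2020/debugging-profiling.md
================================================
---
layout: lecture
title: "Debugging and Profiling"
date: 2020-01-23
ready: true
sync: true
syncdate: 2025-08-16
video:
  aspect: 56.25
  id: l812pUnKxME
solution:
  ready: true
  url: debugging-profiling-solution
---

A golden rule in programming is that code does not do what you expect it to do, but what you tell it to do. Bridging that gap can be quite difficult. In this lecture we are going to cover useful techniques for dealing with buggy and resource-hungry code.

# Debugging

## Printf debugging and Logging

"The most effective debugging tool is still careful thought, coupled with judiciously placed print statements" -- Brian Kernighan, _Unix for Beginners_.

A first approach to debugging a program is to add print statements around where you have detected the problem, and keep iterating until you have extracted enough information to understand what is responsible for the issue.

A second approach is to use logging in your program, instead of ad hoc print statements. Logging is better than regular print statements for several reasons:

- You can log to files, sockets or even remote servers instead of standard output.
- Logging supports severity levels (such as INFO, DEBUG, WARN, ERROR, etc.), which allow you to filter the output accordingly.
- For new issues, there is a fair chance that your logs will already contain enough information to detect what is going wrong.

[Here](/static/files/logger.py) is an example program that logs messages:

```bash
$ python logger.py
# Raw output as with just prints
$ python logger.py log
# Log formatted output
$ python logger.py log ERROR
# Print only ERROR levels and above
$ python logger.py color
# Color formatted output
```

There are many tricks for making logs more readable; one of my favorites is to color code them. By now you have probably realized that terminal output is easier to read in color. But how is that done?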
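Programs like `ls` or `grep` use [ANSI escape codes](https://en.wikipedia.org/wiki/ANSI_escape_code), which are special sequences of characters that tell your terminal to change the color of the output. For example, executing `echo -e "\e[38;2;255;0;0mThis is red\e[0m"` prints the message `This is red` in red on your terminal, as long as it supports [true color](https://gist.github.com/XVilka/8346728#terminals--true-color). If your terminal doesn't support this (e.g. macOS's Terminal.app), you can use the more universally supported escape codes for 16 color choices, for example `echo -e "\e[31;1mThis is red\e[0m"`.

The following script shows how to print many RGB colors into your terminal (again, as long as it supports true color).

```bash
#!/usr/bin/env bash
for R in $(seq 0 20 255); do
    for G in $(seq 0 20 255); do
        for B in $(seq 0 20 255); do
            printf "\e[38;2;${R};${G};${B}m█\e[0m";
        done
    done
done
```

## Third party logs

As you start building larger software systems you will most probably run into dependencies, some of which run as separate programs. Web servers, databases or message brokers are common examples of this kind of dependency. When interacting with these systems it is often necessary to read their logs, since client-side error messages might not suffice.

Luckily, most programs write their own logs somewhere on your system. On UNIX systems, it is commonplace for programs to write their logs under `/var/log`. For instance, the [NGINX](https://www.nginx.com/) web server places its logs under `/var/log/nginx`. More recently, systems have started using a **system log**, which is increasingly where all of your log messages go. Most (but not all) Linux systems use `systemd`, a system daemon that controls many things on your system, such as which services are enabled and running. `systemd` places the logs under `/var/log/journal` in a specialized format, and you can use the [`journalctl`](http://man7.org/linux/man-pages/man1/journalctl.1.html) command to display the messages. Similarly, on macOS there is still `/var/log/system.log`, but an increasing number of tools use the system log, which can be displayed with [`log show`](https://www.manpagez.com/man/1/log/). On most UNIX systems you can also use the [`dmesg`](http://man7.org/linux/man-pages/man1/dmesg.1.html) command to access the kernel log.

For logging to the system log you can use the [`logger`](http://man7.org/linux/man-pages/man1/logger.1.html) shell program. Here is an example of using `logger` and of checking that the entry made it to the system log. Moreover, most programming languages have bindings for logging to the system log.

```bash
logger "Hello Logs"
# On macOS
log show --last 1m | grep Hello
# On Linux
journalctl --since "1m ago" | grep Hello
```

As we saw in the data wrangling lecture, logs can be quite verbose and they require some level of processing and filtering to get the information you want. If you find yourself heavily filtering the output of `journalctl` and `log show`, consider using their flags to perform a first pass of filtering. There are also tools like [`lnav`](http://lnav.org/) that provide an improved presentation and navigation for log files.

## Debuggers

When printf debugging is not enough you should use a debugger. Debuggers are programs that let you interact with the execution of a program, allowing the following:

- Halt execution of the program when it reaches a certain line.
- Step through the program one instruction at a time.
- Inspect values of variables after the program crashed.
- Conditionally halt the execution when a given condition is met.
- And many more advanced features.

Many programming languages come with some form of debugger. In Python this is the Python debugger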
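[`pdb`](https://docs.python.org/3/library/pdb.html).

Here is a brief description of some of the commands `pdb` supports:

- **l**(ist) - Displays 11 lines around the current line or continues the previous listing.
- **s**(tep) - Execute the current line, stop at the first possible occasion (this usually means stepping into a function).
- **n**(ext) - Continue execution until the next line in the current function is reached or it returns (this usually means stepping over a function).
- **b**(reak) - Set a breakpoint (depending on the argument provided).
- **p**(rint) - Evaluate the expression in the current context and print its value. There is also **pp** to display using [`pprint`](https://docs.python.org/3/library/pprint.html) instead.
- **r**(eturn) - Continue execution until the current function returns.
- **q**(uit) - Quit the debugger.

Let's go through an example of using `pdb` to fix the following buggy Python code (see the lecture video).

```python
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(n):
            if arr[j] > arr[j+1]:
                arr[j] = arr[j+1]
                arr[j+1] = arr[j]
    return arr

print(bubble_sort([4, 2, 1, 8, 7, 6]))
```

Note that since Python is an interpreted language we can use the `pdb` shell to execute commands and instructions. [`ipdb`](https://pypi.org/project/ipdb/) is an improved `pdb` that uses the [`IPython`](https://ipython.org) REPL, enabling tab completion, syntax highlighting, better tracebacks, and better introspection while retaining the same interface as the `pdb` module.

For more low-level programming you will probably want to look into [`gdb`](https://www.gnu.org/software/gdb/) (and its quality-of-life modification [`pwndbg`](https://github.com/pwndbg/pwndbg)) and [`lldb`](https://lldb.llvm.org/). They are optimized for debugging C-like languages but will let you probe pretty much any process and get its current machine state: registers, stack, program counter, etc.

## Specialized Tools

Even if what you are trying to debug is a black box binary, there are tools that can help you. Whenever programs need to perform actions that only the kernel can, they use [system calls](https://en.wikipedia.org/wiki/System_call). There are commands that let you trace the syscalls your program makes. On Linux there is [`strace`](http://man7.org/linux/man-pages/man1/strace.1.html), and macOS and BSD have [`dtrace`](http://dtrace.org/blogs/about/). `dtrace` can be tricky to use because it uses its own `D` language, but there is a wrapper called [`dtruss`](https://www.manpagez.com/man/1/dtruss/) that provides an interface more similar to `strace` (more details [here](https://8thlight.com/blog/colin-jones/2015/11/06/dtrace-even-better-than-strace-for-osx.html)).

Below are some examples of using `strace` or `dtruss` to show [`stat`](http://man7.org/linux/man-pages/man2/stat.2.html) syscall traces for an execution of `ls`. For a deeper dive into `strace`, [this article](https://blogs.oracle.com/linux/strace-the-sysadmins-microscope-v2) is worth a read.

```bash
# On Linux
sudo strace -e lstat ls -l > /dev/null
# On macOS
sudo dtruss -t lstat64_extended ls -l > /dev/null
```

Under some circumstances, you may need to look at the network packets to figure out the issue in your program. Tools like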
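[`tcpdump`](http://man7.org/linux/man-pages/man1/tcpdump.1.html) and [Wireshark](https://www.wireshark.org/) are network packet analyzers that let you read the contents of network packets and filter them based on different criteria.

For web development, the Chrome/Firefox developer tools are quite handy. They feature a large number of tools, including:

- Source code - Inspect the HTML/CSS/JS source code of any website.
- Live HTML, CSS, JS modification - Change the website content, styles and behavior to test (which also shows that website screenshots are not valid proof).
- Javascript shell - Execute commands in the JS REPL.
- Network - Analyze the requests timeline.
- Storage - Look into the Cookies and local application storage.

## Static Analysis

For some issues you do not need to run any code. For example, just by carefully looking at a piece of code you could realize that your loop variable is shadowing an already existing variable or function name, or that a program reads a variable before defining it. Here is where [static analysis](https://en.wikipedia.org/wiki/Static_program_analysis) tools come into play. Static analysis programs take source code as input and analyze it using coding rules to reason about its correctness.

In the following Python snippet there are several mistakes. First, our loop variable `foo` shadows the previous definition of the function `foo`. We also wrote `baz` instead of `bar` in the last line, so the program will crash after completing the `sleep` call (which takes one minute).

```python
import time

def foo():
    return 42

for foo in range(5):
    print(foo)
bar = 1
bar *= 0.2
time.sleep(60)
print(baz)
```

Static analysis tools can identify these kinds of issues. When we run [`pyflakes`](https://pypi.org/project/pyflakes) on the code we get the errors related to both bugs. [`mypy`](http://mypy-lang.org/) is another tool, one that detects type checking issues. Here, `mypy` will warn us that `bar` is initially an `int` and then becomes a `float`. Note that all these issues were detected without having to run the code.

```bash
$ pyflakes foobar.py
foobar.py:6: redefinition of unused 'foo' from line 3
foobar.py:11: undefined name 'baz'

$ mypy foobar.py
foobar.py:6: error: Incompatible types in assignment (expression has type "int", variable has type "Callable[[], Any]")
foobar.py:9: error: Incompatible types in assignment (expression has type "float", variable has type "int")
foobar.py:11: error: Name 'baz' is not defined
Found 3 errors in 1 file (checked 1 source file)
```

In the shell tools lecture we covered [`shellcheck`](https://www.shellcheck.net/), which is a similar tool for shell scripts.

Most editors and IDEs support displaying the output of these tools within the editor itself, highlighting the locations of warnings and errors. This is often called **code linting**, and it can also be used to display other types of issues such as stylistic violations or insecure constructs.

In vim, the plugins [`ale`](https://vimawesome.com/plugin/ale) or [`syntastic`](https://vimawesome.com/plugin/syntastic) will let you do that. For Python, [`pylint`](https://www.pylint.org) and [`pep8`](https://pypi.org/project/pep8/) are examples of stylistic linters, and [`bandit`](https://pypi.org/project/bandit/)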
则是设计用来发现常见安全漏洞的工具。对于其它语言,人们已经整理了非常详尽的静态分析工具列表,例如 [Awesome Static Analysis](https://github.com/mre/awesome-static-analysis)(您可能想去看看其中的 _Writing_ 一节)。代码检查工具 (linters) 则可以参考 [Awesome Linters](https://github.com/caramelomartins/awesome-linters)。

与风格检查相辅相成的是**代码格式化工具(code formatters)**,例如 Python 的 [`black`](https://github.com/psf/black)、Go 语言的 `gofmt`、Rust 的 `rustfmt` 以及 JavaScript、HTML 和 CSS 的 [`prettier`](https://prettier.io/)。这些工具会自动格式化您的代码,使其符合该编程语言通用的风格规范。虽然您可能不太情愿将代码风格的控制权交给工具,但标准化的代码格式不仅有助于他人阅读您的代码,也能让您更轻松地阅读他人(同样经过格式化)的代码。

# 性能分析 (Profiling)

即使您的代码在功能上完全符合预期,但如果它在运行过程中耗尽了所有的 CPU 或内存资源,那也未必合格。算法课程通常会教授大 O 表示法,却很少教您如何找到程序中的热点 (hot spots)。鉴于[过早优化是万恶之源](http://wiki.c2.com/?PrematureOptimization),您应该了解一下性能分析器 (profilers) 和监控工具。它们会帮助您找到程序中最耗时、最耗资源的部分,从而让您能够集中精力优化这些特定的部分。

## 计时

与调试类似,多数情况下,只需打印代码从一处运行到另一处的时间,即可发现问题。下面是一个使用 Python [`time`](https://docs.python.org/3/library/time.html) 模块的例子:

```python
import time, random
n = random.randint(1, 10) * 100

# 获取当前时间
start = time.time()

# 做些工作
print("Sleeping for {} ms".format(n))
time.sleep(n/1000)

# 比较当前时间和起始时间
print(time.time() - start)

# 输出
# Sleeping for 500 ms
# 0.5713930130004883
```

不过,墙上时间(wall clock time)也可能会有误导性,因为计算机可能同时在运行其他进程,或者在等待某些事件发生。工具通常会区分真实时间、用户时间和系统时间。通常用户时间加系统时间代表了您的进程在 CPU 上实际消耗了多少时间(更详细的解释可以参考 [这篇文章](https://stackoverflow.com/questions/556405/what-do-real-user-and-sys-mean-in-the-output-of-time1))。

- 真实时间 _Real_ - 程序从开始到结束流逝的墙上时间,包括其他进程使用的时间以及阻塞(例如等待 I/O 或网络)的时间
- 用户时间 _User_ - CPU 执行用户态代码所花费的时间
- 系统时间 _Sys_ - CPU 执行内核态代码所花费的时间

例如,试着写一个执行 HTTP 请求的命令,并在命令前加上 [`time`](http://man7.org/linux/man-pages/man1/time.1.html)。网络不好的情况下您可能会看到下面的输出结果。请求花费了 2 秒多才完成,但是进程仅花费了 15 毫秒的 CPU 用户时间和 12 毫秒的 CPU 内核时间。

```bash
$ time curl https://missing.csail.mit.edu &> /dev/null
real    0m2.561s
user    0m0.015s
sys     0m0.012s
```

## 性能分析工具(profilers)

### CPU

大多数情况下,当人们提及性能分析工具的时候,通常指的是 CPU 性能分析工具。
CPU 性能分析工具有两种:追踪分析器(_tracing_)及采样分析器(_sampling_)。
追踪分析器
会记录程序的每一次函数调用,而采样分析器则只会周期性地监测(通常为每毫秒)您的程序并记录程序堆栈。它们使用这些记录来生成统计信息,显示程序在哪些事情上花费了最多的时间。如果您希望了解更多相关信息,可以参考 [这篇](https://jvns.ca/blog/2017/12/17/how-do-ruby---python-profilers-work-) 介绍性的文章。

大多数的编程语言都有一些基于命令行的分析器,我们可以使用它们来分析代码。它们通常可以集成在 IDE 中,但是本节课我们会专注于这些命令行工具本身。

在 Python 中,我们使用 `cProfile` 模块来分析每次函数调用所消耗的时间。在下面的例子中,我们实现了一个基础的 grep 命令:

```python
#!/usr/bin/env python

import sys, re

def grep(pattern, file):
    with open(file, 'r') as f:
        print(file)
        for i, line in enumerate(f.readlines()):
            pattern = re.compile(pattern)
            match = pattern.search(line)
            if match is not None:
                print("{}: {}".format(i, line), end="")

if __name__ == '__main__':
    times = int(sys.argv[1])
    pattern = sys.argv[2]
    for i in range(times):
        for file in sys.argv[3:]:
            grep(pattern, file)
```

我们可以使用下面的命令来对这段代码进行分析。通过它的输出我们可以知道,IO 消耗了大量的时间,编译正则表达式也比较耗费时间。因为正则表达式只需要编译一次,我们可以将其移动到 for 循环外面来改进性能。

```
$ python -m cProfile -s tottime grep.py 1000 '^(import|\s*def)[^,]*$' *.py

[omitted program output]

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     8000    0.266    0.000    0.292    0.000 {built-in method io.open}
     8000    0.153    0.000    0.894    0.000 grep.py:5(grep)
    17000    0.101    0.000    0.101    0.000 {built-in method builtins.print}
     8000    0.100    0.000    0.129    0.000 {method 'readlines' of '_io._IOBase' objects}
    93000    0.097    0.000    0.111    0.000 re.py:286(_compile)
    93000    0.069    0.000    0.069    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
    93000    0.030    0.000    0.141    0.000 re.py:231(compile)
    17000    0.019    0.000    0.029    0.000 codecs.py:318(decode)
        1    0.017    0.017    0.911    0.911 grep.py:3(<module>)

[omitted lines]
```

关于 Python 的 `cProfile` 分析器(以及其他一些类似的分析器),需要注意的是它显示的是每次函数调用的时间。这样的输出很快就会变得难以理解,尤其是当您在代码里使用了第三方函数库的时候,因为库内部的函数调用也会被统计进来。
更加符合直觉的显示分析信息的方式是给出每行代码的执行时间,这也是 *行分析器* 的工作。

例如,下面这段 Python 代码会向本课程的网站发起一个请求,然后解析响应返回的页面中的全部 URL:

```python
#!/usr/bin/env python
import requests
from bs4 import BeautifulSoup

# 这个装饰器会告诉行分析器
# 我们想要分析这个函数
@profile
def get_urls():
    response = requests.get('https://missing.csail.mit.edu')
    s = BeautifulSoup(response.content, 'lxml')
    urls = []
    for url in s.find_all('a'):
        urls.append(url['href'])

if __name__ == '__main__':
    get_urls()
```

如果我们使用 Python 的 `cProfile` 分析器,我们会得到超过 2500 行的输出结果,即使对其进行排序,也很难看出时间到底都花在哪了。如果我们使用 [`line_profiler`](https://github.com/pyutils/line_profiler),它会基于行来显示时间:

```bash
$ kernprof -l -v a.py
Wrote profile results to urls.py.lprof
Timer unit: 1e-06 s

Total time: 0.636188 s
File: a.py
Function: get_urls at line 5

Line #  Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     5                                       @profile
     6                                       def get_urls():
     7     1     613909.0 613909.0     96.5      response = requests.get('https://missing.csail.mit.edu')
     8     1      21559.0  21559.0      3.4      s = BeautifulSoup(response.content, 'lxml')
     9     1          2.0      2.0      0.0      urls = []
    10    25        685.0     27.4      0.1      for url in s.find_all('a'):
    11    24         33.0      1.4      0.0          urls.append(url['href'])
```

### 内存

在 C 或者 C++ 这样的语言中,内存泄漏会导致您的程序永远不释放那些已经不再需要的内存。为了应对内存类的 Bug,我们可以使用类似 [Valgrind](https://valgrind.org/) 这样的工具来检查内存泄漏问题。

对于 Python 这类具有垃圾回收机制的语言,内存分析器也是很有用的,因为对于某个对象来说,只要还有引用指向它,那它就不会被回收。

下面这个例子及其输出,展示了 [memory-profiler](https://pypi.org/project/memory-profiler/) 是如何工作的(注意装饰器和 `line_profiler` 类似)。

```python
@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

if __name__ == '__main__':
    my_func()
```

```bash
$ python -m memory_profiler example.py
Line #    Mem usage  Increment   Line Contents
==============================================
     3                           @profile
     4      5.97 MB    0.00 MB   def my_func():
     5     13.61 MB    7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB  152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB -152.59 MB       del b
     8     13.61 MB    0.00 MB       return a
```

### 事件分析

和调试时使用 `strace` 的情况一样,在性能分析时您可能也希望忽略所运行代码的细节,把它当作黑盒处理。[`perf`](http://man7.org/linux/man-pages/man1/perf.1.html) 命令抽象了不同 CPU 之间的差异,它不会报告时间和内存的消耗,而是报告与您的程序相关的系统事件。

例如,`perf` 可以报告不佳的缓存局部性(poor cache locality)、大量的页错误(page faults)或活锁(livelocks)。下面是关于常见命令的简介:

- `perf list` - 列出可以被 perf 追踪的事件;
- `perf stat COMMAND ARG1 ARG2` - 收集与某个进程或指令相关的事件;
- `perf record COMMAND ARG1 ARG2` - 记录命令执行的采样信息并将统计数据储存在 `perf.data` 中;
- `perf report` - 格式化并打印 `perf.data` 中的数据。

### 可视化

使用分析器来分析真实的程序时,由于软件的复杂性,其输出结果中将包含大量的信息。人类是一种视觉动物,非常不善于阅读大量的文字。因此很多工具都提供了可视化分析器输出结果的功能。

对于采样分析器来说,常见的显示 CPU 分析数据的形式是 [火焰图](http://www.brendangregg.com/flamegraphs.html),火焰图会在 Y 轴显示函数调用关系,并在 X 轴显示其耗时的比例。火焰图同时还是可交互的,您可以深入程序的某一具体部分,并查看其栈追踪(您可以尝试点击下面的图片)。

[![FlameGraph](http://www.brendangregg.com/FlameGraphs/cpu-bash-flamegraph.svg)](http://www.brendangregg.com/FlameGraphs/cpu-bash-flamegraph.svg)

调用图和控制流图可以显示子程序之间的关系,它将函数作为节点并把函数调用作为边。将它们和分析器的信息(例如调用次数、耗时等)放在一起使用时,调用图会变得非常有用,它可以帮助我们分析程序的流程。
在 Python 中您可以使用 [`pycallgraph`](http://pycallgraph.slowchop.com/en/master/) 来生成这些图片。

![Call Graph](https://upload.wikimedia.org/wikipedia/commons/2/2f/A_Call_Graph_generated_by_pycallgraph.png)

## 资源监控

有时候,分析程序性能的第一步是搞清楚它所消耗的资源。程序变慢通常是因为它所需要的资源不够了。例如,没有足够的内存或者网络连接变慢的时候。

有很多很多的工具可以被用来显示不同的系统资源,例如 CPU 占用、内存使用、网络、磁盘使用等。

- **通用监控** - 最流行的工具要数 [`htop`](https://htop.dev/) 了,它是 [`top`](http://man7.org/linux/man-pages/man1/top.1.html) 的改进版。`htop` 可以显示当前运行进程的多种统计信息。`htop` 有很多选项和快捷键,常见的有:`<F6>` 进程排序、`t` 显示树状结构和 `H` 显示或隐藏线程。还可以留意一下 [`glances`](https://nicolargo.github.io/glances/),它的实现类似但是用户界面更好。如果需要对所有进程的资源使用进行汇总测量,[`dstat`](http://dag.wiee.rs/home-made/dstat/) 也是一个非常好用的工具,它可以实时地计算不同子系统资源的度量数据,例如 I/O、网络、CPU 利用率、上下文切换等等;
- **I/O 操作** - [`iotop`](http://man7.org/linux/man-pages/man8/iotop.8.html) 可以显示实时 I/O 占用信息,而且可以非常方便地检查某个进程是否正在执行大量的磁盘读写操作;
- **磁盘使用** - [`df`](http://man7.org/linux/man-pages/man1/df.1.html) 可以显示每个分区的信息,而 [`du`](http://man7.org/linux/man-pages/man1/du.1.html) 则可以显示当前目录下每个文件的磁盘使用情况(**d**isk **u**sage)。`-h` 选项可以使命令以对人类(**h**uman)更加友好的格式显示数据;[`ncdu`](https://dev.yorhel.nl/ncdu) 是一个交互性更好的 `du`,它可以让您在不同目录下导航、删除文件和文件夹;
- **内存使用** - [`free`](http://man7.org/linux/man-pages/man1/free.1.html) 可以显示系统当前空闲的内存。内存也可以使用 `htop` 这样的工具来显示;
- **打开文件** - [`lsof`](http://man7.org/linux/man-pages/man8/lsof.8.html) 可以列出被进程打开的文件信息。当我们需要查看某个文件是被哪个进程打开的时候,这个命令非常有用;
- **网络连接和配置** - [`ss`](http://man7.org/linux/man-pages/man8/ss.8.html)
能帮助我们监控网络包的收发情况以及网络接口的统计信息。`ss` 常见的一个使用场景是找出某个端口正被哪个进程占用。如果要显示路由、网络设备和接口信息,您可以使用 [`ip`](http://man7.org/linux/man-pages/man8/ip.8.html) 命令。注意,`netstat` 和 `ifconfig` 这两个命令已经被前面那些工具所代替了。
- **网络使用** - [`nethogs`](https://github.com/raboof/nethogs) 和 [`iftop`](http://www.ex-parrot.com/pdw/iftop/) 是非常好的用于对网络占用进行监控的交互式命令行工具。

如果您希望测试一下这些工具,您可以使用 [`stress`](https://linux.die.net/man/1/stress) 命令来为系统人为地增加负载。

### 专用工具

有时候,您只需要对黑盒程序进行基准测试,并依此对软件选择进行评估。
类似 [`hyperfine`](https://github.com/sharkdp/hyperfine) 这样的命令行工具可以帮您快速进行基准测试。例如,我们在 shell 工具和脚本那一节课中推荐使用 `fd` 来代替 `find`。我们这里可以用 `hyperfine` 来比较一下它们。

在下面的例子中,我们可以看到 `fd` 比 `find` 要快 20 倍。

```bash
$ hyperfine --warmup 3 'fd -e jpg' 'find . -iname "*.jpg"'
Benchmark #1: fd -e jpg
  Time (mean ± σ):      51.4 ms ±   2.9 ms    [User: 121.0 ms, System: 160.5 ms]
  Range (min … max):    44.2 ms …  60.1 ms    56 runs

Benchmark #2: find . -iname "*.jpg"
  Time (mean ± σ):      1.126 s ±  0.101 s    [User: 141.1 ms, System: 956.1 ms]
  Range (min … max):    0.975 s …  1.287 s    10 runs

Summary
  'fd -e jpg' ran
   21.89 ± 2.33 times faster than 'find . -iname "*.jpg"'
```

和调试一样,浏览器也包含了很多不错的性能分析工具,可以用来分析页面加载,让我们可以搞清楚时间都消耗在什么地方(加载、渲染、脚本等等)。
更多关于 [Firefox](https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Profiling_with_the_Built-in_Profiler) 和 [Chrome](https://developers.google.com/web/tools/chrome-devtools/rendering-tools) 的信息可以点击链接。

# 课后练习

[习题解答]({{site.url}}/{{site.solution_url}}/{{page.solution.url}})

## 调试

1. 使用 Linux 上的 `journalctl` 或 macOS 上的 `log show` 命令来获取最近一天中超级用户的登录信息及其所执行的指令。如果找不到相关信息,您可以执行一些无害的命令,例如 `sudo ls`,然后再次查看。

2. 学习 [这份](https://github.com/spiside/pdb-tutorial) `pdb` 实践教程并熟悉相关的命令。更深入的信息您可以参考 [这份](https://realpython.com/python-debugging-pdb) 教程。

3. 安装 [`shellcheck`](https://www.shellcheck.net/) 并尝试对下面的脚本进行检查。这段代码有什么问题吗?请修复相关问题。在您的编辑器中安装一个 linter 插件,这样它就可以自动地显示相关警告信息。

   ```bash
   #!/bin/sh
   ## Example: a typical script with several problems
   for f in $(ls *.m3u)
   do
     grep -qi hq.*mp3 $f \
       && echo -e 'Playlist $f contains a HQ file in mp3 format'
   done
   ```

4. (进阶题) 请阅读 [可逆调试](https://undo.io/resources/reverse-debugging-whitepaper/) 并尝试创建一个可以工作的例子(使用 [`rr`](https://rr-project.org/) 或 [`RevPDB`](https://morepypy.blogspot.com/2016/07/reverse-debugging-for-python.html))。

## 性能分析

1. [这里](/static/files/sorts.py) 有一些排序算法的实现。请使用 [`cProfile`](https://docs.python.org/3/library/profile.html) 和 [`line_profiler`](https://github.com/pyutils/line_profiler) 来比较插入排序和快速排序的性能。两种算法的瓶颈分别在哪里?然后使用 `memory_profiler` 来检查内存消耗,为什么插入排序更好一些?然后再看看原地排序版本的快排。附加题:使用 `perf` 来查看不同算法的循环次数及缓存命中及丢失情况。

2. 这里有一些用于计算斐波那契数列的 Python 代码,它为计算每个数字都定义了一个函数:

   ```python
   #!/usr/bin/env python
   def fib0(): return 0

   def fib1(): return 1

   s = """def fib{}(): return fib{}() + fib{}()"""

   if __name__ == '__main__':

       for n in range(2, 10):
           exec(s.format(n, n-1, n-2))
       # from functools import lru_cache
       # for n in range(10):
       #     exec("fib{} = lru_cache(1)(fib{})".format(n, n))
       print(eval("fib9()"))
   ```

   将代码拷贝到文件中使其变为一个可执行的程序。首先安装 [`pycallgraph`](http://pycallgraph.slowchop.com/en/master/) 和 [`graphviz`](http://graphviz.org/)(如果您能够执行 `dot`,则说明已经安装了 GraphViz)。然后使用 `pycallgraph graphviz -- ./fib.py` 来执行代码并查看 `pycallgraph.png` 这个文件。`fib0` 被调用了多少次?我们可以通过记忆化(memoization)来对其进行优化。将注释掉的部分放开,然后重新生成图片。这回每个 `fibN` 函数被调用了多少次?

3. 我们经常会遇到的情况是某个我们希望去监听的端口已经被其他进程占用了。让我们学习如何找到占用该端口的进程的 PID。首先执行 `python -m http.server 4444` 启动一个最简单的 web 服务器来监听 `4444` 端口。在另外一个终端中,执行 `lsof | grep LISTEN` 打印出所有监听端口的进程及相应的端口。找到对应的 PID 然后使用 `kill <PID>` 停止该进程。

4. 限制进程资源也是一个非常有用的技术。执行 `stress -c 3` 并使用 `htop` 对 CPU 消耗进行可视化。现在,执行 `taskset --cpu-list 0,2 stress -c 3` 并可视化。`stress` 占用了 3 个 CPU 吗?为什么没有?阅读 [`man taskset`](http://man7.org/linux/man-pages/man1/taskset.1.html) 来寻找答案。附加题:使用 [`cgroups`](http://man7.org/linux/man-pages/man7/cgroups.7.html) 来实现相同的操作,限制 `stress -m` 的内存使用。

5. (进阶题) `curl ipinfo.io` 命令会执行 HTTP 请求并获取关于您 IP 的信息。打开 [Wireshark](https://www.wireshark.org/) 并抓取 `curl` 发起的请求和收到的回复报文。(提示:可以使用 `http` 进行过滤,只显示 HTTP 报文)

================================================
FILE: _2020/editors-notes.txt
================================================
I use these notes as a reference when teaching. If you're a student who ended
up here, you probably want to look at editors.md instead.

writing words (essays) vs programming
- programming: more time spent reading, navigating, editing, than writing in a long stream
- different programs for different purposes

worth mastering an editor
- it will save you hundreds of hours
- worth investing time in this, unlike many other things

how to learn
- start with tutorial
- stick with the editor for all code (and ideally word) editing tasks
  - avoid bad habits
- look things up as you go
  - if it seems like there should be a better way, there probably is
  - programmers care about their editors, so they're super powerful tools
- timeline for learning
  - in a couple hours: will learn basic editor functions (save, quit, ...)
  - in 20 hours: will be as fast as you were with your old editor
  - after that: benefits start
  - you never stop learning (these are very fancy and powerful tools)

which editor to learn?
- people have strong opinions: editor wars
- https://insights.stackoverflow.com/survey/2019/#development-environments-and-tools
  - VS Code is most popular GUI-based tool
  - Vim is most popular CLI-based tool
- we are teaching you Vim
  - all the instructors use Vim
  - originated from Vi editor (1976), still being developed today
  - interesting ideas
  - lots of tools support Vim bindings (e.g.
Vim emulation for VS Code has 1.4m downloads) - has a bit of a learning curve (compared to GUI editor), but worth it - worth learning even if you finally choose to use another editor this lecture - philosophy of vim: the neat ideas of this editor - basics - demos - exercises, resources to learn more - starting with vimtutor - this lecture: focus on ideas, not details modal editing (modal ~ "modes") - designed around idea that a lot of time is spent reading/navigating/making small edits - simplified picture: normal mode <-> insert mode - more complex picture: normal <-> {insert, replace, visual, v-line, v-block, command-line} - mode shown in bottom left - keystrokes have different meanings in different modes: e.g. `x` - remapping caps lock to escape basics - switching between normal mode and insert mode - that's all you need to know to get started. insert mode works as you expect. - buffers vs tabs vs windows - buffers ~ open files - every tab has one or more windows - a buffer can be open in 0 or more windows - unlike e.g. web browser - command line - `:` in normal mode - :q, :w, :wq, :e {name of file}, :ls, :help {topic} (:help :w, :help w) vim normal mode is a programming language (most important idea in vim) - once you learn the primitives, you can combine them in interesting ways - becomes muscle memory - movement - selection - edits - counts - modifiers demo (broken fizzbuzz) - main is never called - `G` end of file - `o` open new line below - type in "if __name__ ..." 
thing - starts at 0 instead of 1 - search for `/range` - `ww` to move forward 2 words - `i` to insert text, "1, " - `ea` to insert after limit, "+1" - newline for "fizzbuzz" - `jj$i` to insert text at end of line - add ", end=''" - `jj.` to repeat for second print - `jjo` to open line below if - add "else: print()" - fizz fizz - `ci'` to change fizz - command-line argument - `ggO` to open above - "import sys" - `/10` - `ci(` to "int(sys.argv[1])" customizing vim - ~/.vimrc - start with our basic config - look online for inspiration plugins - no plugin manager necessary: just put plugins in `~/.vim/pack/vendor/start/` - recommended plugins - fuzzy file finder: ctrlp.vim - code search: ack.vim - directory navigation: nerdtree - magic motions: vim-easymotion - see what your instructors use - find more: https://vimawesome.com/ vim bindings in other tools - shell (set -o vi / bindkey -v) - $EDITOR - readline - Jupyter notebook advanced vim demos homework ================================================ FILE: _2020/editors.md ================================================ --- layout: lecture title: "编辑器 (Vim)" date: 2020-01-15 ready: true sync: true syncdate: 2025-08-16 video: aspect: 56.25 id: a6Q8Na575qc solution: ready: true url: editors-solution --- 写作和写代码其实是两项非常不同的活动。当我们编程的时候,会经常在文件间进行切换、阅读、浏览和修改代码,而不是连续编写一大段的文字。因此代码编辑器和文本编辑器是很不同的两种工具(例如微软的 Word 与 Visual Studio Code)。 作为程序员,我们大部分时间都花在代码编辑上,所以花点时间掌握某个适合自己的编辑器是非常值得的。通常学习使用一个新的编辑器包含以下步骤: - 阅读教程(比如这节课以及我们为您提供的资源) - 坚持使用它来完成你所有的编辑工作(即使一开始这会让你的工作效率降低) - 随时查阅:如果某个操作看起来像是有更方便的实现方法,一般情况下真的会有 如果您能够遵循上述步骤,并且坚持使用新的编辑器完成您所有的文本编辑任务,那么学习一个复杂的代码编辑器的过程一般是这样的:头两个小时,您会学习到编辑器的基本操作,例如打开和编辑文件、保存与退出、浏览缓冲区。当学习时间累计达到 20 个小时之后,您使用新编辑器的效率应该已经和使用老编辑器一样快。在此之后,其益处开始显现:有了足够的知识和肌肉记忆后,使用新编辑器将大大节省你的时间。而现代文本编辑器都是些复杂且强大的工具,永远有新东西可学:学的越多,效率越高。 # 该学哪个编辑器? 
程序员们对自己正在使用的文本编辑器通常有着 [非常强的执念](https://zh.wikipedia.org/wiki/编辑器之战)。

现在最流行的编辑器是什么?[Stack Overflow 的调查](https://insights.stackoverflow.com/survey/2019/#development-environments-and-tools)(这个调查可能并不如我们想象的那样客观,因为 Stack Overflow 的用户并不能代表所有程序员)显示,[Visual Studio Code](https://code.visualstudio.com) 是目前最流行的代码编辑器。而 [Vim](https://www.vim.org) 则是最流行的基于命令行的编辑器。

## Vim

这门课的所有教员都使用 Vim 作为编辑器。Vim 有着悠久历史;它始于 1976 年的 Vi 编辑器,到现在还在不断开发中。Vim 有很多聪明的设计思想,所以很多其他工具也支持 Vim 模式(比如,140 万人安装了 [Vim emulation for VS code](https://github.com/VSCodeVim/Vim))。即使你最后使用其他编辑器,Vim 也值得学习。

由于不可能在 50 分钟内教授 Vim 的所有功能,我们会专注于解释 Vim 的设计哲学,教你基础知识,并展示一部分高级功能,然后给你掌握这个工具所需要的资源。

# Vim 的哲学

在编程的时候,你会把大量时间花在阅读/编辑而不是在写代码上。所以,Vim 是一个 *多模态* 编辑器:它对于插入文字和操纵文字有不同的模式。Vim 是可编程的(可以使用 Vimscript 或者像 Python 一样的其他程序语言),Vim 的接口本身也是一个程序语言:键入操作(以及其助记名)是命令,这些命令也是可组合的。Vim 避免了使用鼠标,因为那样太慢了;Vim 甚至避免用上下左右键,因为那样需要太多的手指移动。

这样的设计哲学使得 Vim 成为了一个能跟上你思维速度的编辑器。

# 编辑模式

Vim 的设计以大多数时间都花在阅读、浏览和进行少量编辑改动为基础,因此它具有多种操作模式:

- **正常模式**:在文件中四处移动光标进行修改
- **插入模式**:插入文本
- **替换模式**:替换文本
- **可视化模式**(一般,行,块):选中文本块
- **命令模式**:用于执行命令

在不同的操作模式下,键盘敲击的含义也不同。比如,`x` 在插入模式会插入字母 `x`,但是在正常模式会删除当前光标所在的字母,在可视化模式下则会删除选中的文本块。

在默认设置下,Vim 会在左下角显示当前的模式。Vim 启动时的默认模式是正常模式。通常你会把大部分时间花在正常模式和插入模式。

你可以按下 `<ESC>`(退出键)从任何其他模式返回正常模式。在正常模式,键入 `i` 进入插入模式,`R` 进入替换模式,`v` 进入可视化(一般)模式,`V` 进入可视化(行)模式,`<C-v>`(Ctrl-V, 有时也写作 `^V`)进入可视化(块)模式,`:` 进入命令模式。

因为你会在使用 Vim 时大量使用 `<ESC>` 键,所以可以考虑把大小写锁定键重定义成 `<ESC>` 键([MacOS 教程](https://vim.fandom.com/wiki/Map_caps_lock_to_escape_in_macOS))或者创建一个[其他的映射](https://vim.fandom.com/wiki/Avoid_the_escape_key#Mappings),通过简单的按键序列来代替 `<ESC>`。

# 基本操作

## 插入文本

在正常模式,键入 `i` 进入插入模式。现在 Vim 跟很多其他的编辑器一样,直到你键入 `<ESC>` 返回正常模式。你只需要掌握这一点和上面介绍的所有基础知识就可以使用 Vim 来编辑文件了(虽然如果你一直停留在插入模式内不一定高效)。

## 缓存, 标签页, 窗口

Vim 会维护一系列打开的文件,称为“缓存”。一个 Vim 会话包含一系列标签页,每个标签页包含一系列窗口(分隔面板)。每个窗口显示一个缓存。跟网页浏览器等其他你熟悉的程序不一样的是,缓存和窗口不是一一对应的关系;窗口只是缓存的视图。一个缓存可以在 *多个* 窗口打开,甚至在同一个标签页内的多个窗口打开。这个功能其实很好用,比如可以查看同一个文件的不同部分。

Vim 默认打开一个标签页,这个标签页包含一个窗口。

## 命令行

在正常模式下键入 `:` 进入命令行模式。在键入 `:` 后,你的光标会立即跳到屏幕下方的命令行。这个模式有很多功能,包括打开,保存,关闭文件,以及 [退出 Vim](https://twitter.com/iamdevloper/status/435555976687923200)。

- `:q` 退出(关闭窗口)
- `:w` 保存(写)
- `:wq` 保存然后退出
- `:e {文件名}` 打开要编辑的文件
- `:ls` 显示打开的缓存
- `:help {标题}` 打开帮助文档
    - `:help :w` 打开 `:w` 命令的帮助文档
    - `:help w` 打开 `w` 移动的帮助文档

# Vim 的接口其实是一种编程语言

Vim 最重要的设计思想是 Vim 的界面本身是一个程序语言。键入操作(以及他们的助记名)本身是命令,这些命令可以组合使用。这使得移动和编辑更加高效,特别是一旦形成肌肉记忆。

## 移动

多数时候你会在正常模式下,使用移动命令在缓存中导航。在 Vim 里面移动也被称为 “名词”,因为它们指向文字块。

- 基本移动: `hjkl` (左, 下, 上, 右)
- 词: `w` (下一个词), `b` (词初), `e` (词尾)
- 行: `0` (行初), `^` (第一个非空格字符), `$` (行尾)
- 屏幕: `H` (屏幕首行), `M` (屏幕中间), `L` (屏幕底部)
- 翻页: `Ctrl-u` (上翻), `Ctrl-d` (下翻)
- 文件: `gg` (文件头), `G` (文件尾)
- 行数: `:{行数}` 或者 `{行数}G` ({行数}为行数)
- 杂项: `%` (找到配对,比如括号或者 /* */ 之类的注释对)
- 查找: `f{字符}`, `t{字符}`, `F{字符}`, `T{字符}`
    - 查找/到 向前/向后 在本行的{字符}
    - `,` / `;` 用于导航匹配
- 搜索: `/{正则表达式}`, `n` / `N` 用于导航匹配

## 选择

可视化模式:

- 可视化:`v`
- 可视化行: `V`
- 可视化块:`Ctrl+v`

可以用移动命令来选中。

## 编辑

所有你需要用鼠标做的事,你现在都可以用键盘:采用编辑命令和移动命令的组合来完成。这就是 Vim 的界面开始看起来像一个程序语言的时候。Vim 的编辑命令也被称为 “动词”,因为动词可以施动于名词。

- `i` 进入插入模式
    - 但是对于操纵/编辑文本,不单想用退格键完成
- `O` / `o` 在之上/之下插入行
- `d{移动命令}` 删除 {移动命令}
    - 例如,`dw` 删除词, `d$` 删除到行尾, `d0` 删除到行头。
- `c{移动命令}` 改变 {移动命令}
    - 例如,`cw` 改变词
    - 比如 `d{移动命令}` 再 `i`
- `x` 删除字符(等同于 `dl`)
- `s` 替换字符(等同于 `xi`)
- 可视化模式 + 操作
    - 选中文字, `d` 删除 或者 `c` 改变
- `u` 撤销, `<C-r>` 重做
- `y` 复制 / "yank" (其他一些命令比如 `d` 也会复制)
- `p` 粘贴
- 更多值得学习的: 比如 `~` 改变字符的大小写

## 计数

你可以用一个计数来结合“名词”和“动词”,这会执行指定操作若干次。

- `3w` 向后移动三个词
- `5j` 向下移动 5 行
- `7dw` 删除 7 个词

## 修饰语

你可以用修饰语改变“名词”的意义。修饰语有 `i`,表示“内部”或者“在内”,和 `a`,表示“周围”。

- `ci(` 改变当前括号内的内容
- `ci[` 改变当前方括号内的内容
- `da'` 删除一个单引号字符串, 包括周围的单引号

# 演示

这里是一个有问题的 [fizz buzz](https://en.wikipedia.org/wiki/Fizz_buzz) 实现:

```python
def fizz_buzz(limit):
    for i in range(limit):
        if i % 3 == 0:
            print('fizz')
        if i % 5 == 0:
            print('fizz')
        if i % 3 and i % 5:
            print(i)

def main():
    fizz_buzz(10)
```

我们会修复以下问题:

- 主函数没有被调用
- 从 0 而不是 1 开始
- 在 15 的整数倍的时候在不同行打印 "fizz" 和 "buzz"
- 在 5 的整数倍的时候打印 "fizz"
- 采用硬编码的参数 10 而不是从命令控制行读取参数

具体的修复步骤如下:

- 主函数没有被调用
    - `G` 文件尾
    - `o` 向下打开一个新行
    - 输入 "if __name__ ..."
- 从 0 而不是 1 开始
    - 搜索 `/range`
    - `ww` 向后移动两个词
    - `i` 插入文字, "1, "
    - `ea` 在 limit 后插入, "+1"
- 在新的一行 "fizzbuzz"
    - `jj$i` 插入文字到行尾
    - 加入 ", end=''"
    - `jj.` 重复第二个打印
    - `jjo` 在 if 下方打开一行
    - 加入 "else: print()"
- fizz fizz
    - `ci'` 修改 fizz
- 命令控制行参数
    - `ggO` 向上打开
    - "import sys"
    - `/10`
    - `ci(` 改为 "int(sys.argv[1])"

展示详情请观看课程视频。比较上面用 Vim 的操作和你可能使用其他程序的操作。值得一提的是 Vim 需要很少的键盘操作,允许你编辑的速度跟上你思维的速度。

# 自定义 Vim

Vim 通过一个位于 `~/.vimrc` 的纯文本配置文件(包含 Vim 脚本命令)进行自定义。你可能会想启用很多基本设置。

我们提供一个文档详细的基本设置,你可以用它当作你的初始设置。我们推荐使用这个设置,因为它修复了 Vim 默认设置中的一些奇怪行为。
**在 [这儿](/2020/files/vimrc) 下载我们的设置,然后将它保存成 `~/.vimrc`.**

Vim 能够被重度自定义,花时间探索自定义选项是值得的。你可以参考其他人在 GitHub 上共享的设置文件,比如,你的授课人的 Vim 设置([Anish](https://github.com/anishathalye/dotfiles/blob/master/vimrc), [Jon](https://github.com/jonhoo/configs/blob/master/editor/.config/nvim/init.vim) (uses [neovim](https://neovim.io/)), [Jose](https://github.com/JJGO/dotfiles/blob/master/vim/.vimrc))。有很多好的博客文章也聊到了这个话题。尽量不要复制粘贴别人的整个设置文件,而是阅读和理解它,然后采用对你有用的部分。

# 扩展 Vim

Vim 有很多扩展插件。跟很多互联网上已经过时的建议相反,你 *不* 需要在 Vim 使用一个插件管理器(从 Vim 8.0 开始)。你可以使用内置的插件管理系统。只需要创建一个 `~/.vim/pack/vendor/start/` 的文件夹,然后把插件放到这里(比如通过 `git clone`)。

以下是一些我们最爱的插件:

- [ctrlp.vim](https://github.com/ctrlpvim/ctrlp.vim): 模糊文件查找
- [ack.vim](https://github.com/mileszs/ack.vim): 代码搜索
- [nerdtree](https://github.com/scrooloose/nerdtree): 文件浏览器
- [vim-easymotion](https://github.com/easymotion/vim-easymotion): 魔术操作

我们尽量避免在这里提供一份冗长的插件列表。你可以查看讲师们的开源的配置文件([Anish](https://github.com/anishathalye/dotfiles), [Jon](https://github.com/jonhoo/configs/blob/master/editor/.config/nvim/init.vim) (使用了 [neovim](https://neovim.io/)), [Jose](https://github.com/JJGO/dotfiles/blob/master/vim/.vimrc))来看看我们使用的其他插件。浏览 [Vim Awesome](https://vimawesome.com/) 来了解一些很棒的插件。这个话题也有很多博客文章:搜索 "best Vim plugins"。

# 其他程序的 Vim 模式

很多工具提供了 Vim 模式。这些 Vim 模式的质量参差不齐;取决于具体工具,有的提供了很多酷炫的 Vim 功能,但是大多数对基本功能支持的很好。

## Shell

如果你是一个 Bash 用户,用 `set -o vi`。如果你用 Zsh:`bindkey -v`。Fish 用 `fish_vi_key_bindings`。另外,不管利用什么 shell,你可以 `export EDITOR=vim`。
这是一个用来决定当一个程序需要启动编辑时启动哪个的环境变量。 例如,`git` 会使用这个编辑器来编辑 commit 信息。 ## Readline 很多程序使用 [GNU Readline](https://tiswww.case.edu/php/chet/readline/rltop.html) 库来作为 它们的命令控制行界面。Readline 也支持基本的 Vim 模式, 可以通过在 `~/.inputrc` 添加如下行开启: ``` set editing-mode vi ``` 比如,在这个设置下,Python REPL 会支持 Vim 快捷键。 ## 其他 甚至有 Vim 的网页浏览快捷键 [browsers](http://vim.wikia.com/wiki/Vim_key_bindings_for_web_browsers), 受欢迎的有 用于 Google Chrome 的 [Vimium](https://chrome.google.com/webstore/detail/vimium/dbepggeogbaibhgnhhndojpepiihcmeb?hl=en) 和用于 Firefox 的 [Tridactyl](https://github.com/tridactyl/tridactyl)。 你甚至可以在 [Jupyter notebooks](https://github.com/lambdalisue/jupyter-vim-binding) 中用 Vim 快捷键。 [这个列表](https://reversed.top/2016-08-13/big-list-of-vim-like-software) 中列举了支持类 vim 键位绑定的软件。 # Vim 进阶 这里我们提供了一些展示这个编辑器能力的例子。我们无法把所有的这样的事情都教给你,但是你 可以在使用中学习。一个好的对策是: 当你在使用你的编辑器的时候感觉 “一定有更好的方法来做这个”, 那么很可能真的有:上网搜寻一下。 ## 搜索和替换 `:s` (替换)命令([文档](http://vim.wikia.com/wiki/Search_and_replace))。 - `%s/foo/bar/g` - 在整个文件中将 foo 全局替换成 bar - `%s/\[.*\](\(.*\))/\1/g` - 将有命名的 Markdown 链接替换成简单 URLs ## 多窗口 - 用 `:sp` / `:vsp` 来分割窗口 - 同一个缓存可以在多个窗口中显示。 ## 宏 - `q{字符}` 来开始在寄存器 `{字符}` 中录制宏 - `q` 停止录制 - `@{字符}` 重放宏 - 宏的执行遇错误会停止 - `{计数}@{字符}` 执行一个宏{计数}次 - 宏可以递归 - 首先用 `q{字符}q` 清除宏 - 录制该宏,用 `@{字符}` 来递归调用该宏 (在录制完成之前不会有任何操作) - 例子:将 xml 转成 json ([file](/2020/files/example-data.xml)) - 一个有 "name" / "email" 键对象的数组 - 用一个 Python 程序? - 用 sed / 正则表达式 - `g/people/d` - `%s//{/g` - `%s/\(.*\)<\/name>/"name": "\1",/g` - ... 
- Vim 命令 / 宏 - `ggdd`, `Gdd` 删除第一行和最后一行 - 格式化单个元素的宏(存放在 `e` 中) - 转到有 `` 的行 - `qe^r"f>s": "fq` - 格式化单个人的宏 - 转到有 `` 的行 - `qpS{j@eA,j@ejS},q` - 格式化单个人然后转到下一个人的宏 - 转到有 `` 的行 - `qq@pjq` - 执行宏到文件尾 - `999@q` - 手动移除最后的 `,` 然后加上 `[` 和 `]` 分隔符 # 扩展资料 - `vimtutor` 是一个 Vim 安装时自带的教程(注:如果你使用的是 vim 9.2 或更高的版本,那么可以在正常模式下使用 `:Tutor` 来进入一个更现代化,互动性更强的教程) - [Vim Adventures](https://vim-adventures.com/) 是一个学习使用 Vim 的游戏 - [Vim Tips Wiki](http://vim.wikia.com/wiki/Vim_Tips_Wiki) - [Vim Advent Calendar](https://vimways.org/2019/) 有很多 Vim 小技巧 - [Vim Golf](http://www.vimgolf.com/) 是用 Vim 的用户界面作为程序语言的 [code golf](https://en.wikipedia.org/wiki/Code_golf) - [Vi/Vim Stack Exchange](https://vi.stackexchange.com/) - [Vim Screencasts](http://vimcasts.org/) - [Practical Vim](https://pragprog.com/titles/dnvim2/)(书籍) # 课后练习 [习题解答]({{site.url}}/{{site.solution_url}}/{{page.solution.url}}) 1. 完成 `vimtutor`。备注:它在一个 [80x24](https://en.wikipedia.org/wiki/VT100)(80 列,24 行) 终端窗口看起来效果最好。 2. 下载我们提供的 [vimrc](/2020/files/vimrc),然后把它保存到 `~/.vimrc`。 通读这个注释详细的文件 (用 Vim!), 然后观察 Vim 在这个新的设置下看起来和使用起来有哪些细微的区别。 3. 安装和配置一个插件: [ctrlp.vim](https://github.com/ctrlpvim/ctrlp.vim). 1. 用 `mkdir -p ~/.vim/pack/vendor/start` 创建插件文件夹 2. 下载这个插件: `cd ~/.vim/pack/vendor/start; git clone https://github.com/ctrlpvim/ctrlp.vim` 3. 阅读这个插件的 [文档](https://github.com/ctrlpvim/ctrlp.vim/blob/master/readme.md)。 尝试用 CtrlP 来在一个工程文件夹里定位一个文件,打开 Vim, 然后用 Vim 命令控制行开始 `:CtrlP`. 4. 自定义 CtrlP:添加 [configuration](https://github.com/ctrlpvim/ctrlp.vim/blob/master/readme.md#basic-options) 到你的 `~/.vimrc` 来用按 Ctrl-P 打开 CtrlP 4. 练习使用 Vim, 在你自己的机器上重做 [演示](#demo)。 5. 下个月用 Vim 完成 *所有的* 文件编辑。每当不够高效的时候,或者你感觉 “一定有一个更好的方式”时, 尝试求助搜索引擎,很有可能有一个更好的方式。如果你遇到难题,可以来我们的答疑时间或者给我们发邮件。 6. 在其他工具中设置 Vim 快捷键 (见上面的操作指南)。 7. 进一步自定义你的 `~/.vimrc` 和安装更多插件。 8. 
(高阶)用 Vim 宏将 XML 转换到 JSON ([例子文件](/2020/files/example-data.xml))。 尝试着先完全自己做,但是在你卡住的时候可以查看上面 [宏](#macros) 章节。 ================================================ FILE: _2020/files/example-data.xml ================================================ Johnny Zhang Jr. amyalvarez@cole.com Edward Cook dsparks@alvarez-dunn.com Stephen Sweeney dlewis@gmail.com Krystal Riley jflores@wright.biz Ashley Robinson robertsmichael@yahoo.com Kimberly Brooks sharoncunningham@larson.com Brent Proctor edward86@stewart.com William Roberts parkertodd@webb.com Amanda Morales lorizavala@hodges.com Bryan Poole Jr. carolyn56@gray-campos.net Dale Hall martinjames@yahoo.com Isabella Reynolds wbowen@wallace.com Ann Rodriguez charles37@taylor-riley.biz Bryan Davis jessica60@hotmail.com Dalton Powell piercenatasha@yahoo.com Scott Turner harold68@yahoo.com Nicholas Castillo dawnstephens@robinson.info Joseph Pierce lukepatterson@hotmail.com Robyn White jenniferrobinson@hotmail.com Justin Rice brandi76@gmail.com Jamie Graham harrisdavid@yahoo.com Phillip Schmidt stephanie33@gmail.com John Baker todd86@hotmail.com Sharon Austin srivera@yahoo.com Erica Avila jenniferreed@bowers-wilson.com Jeremy Bass jdavis@collins.com Joshua Parsons stephaniecoleman@miller-barker.com Emma Mccoy taylorjohn@wagner.net Megan Williams ronnie54@gmail.com Michael Sutton connie58@mendoza.net Nicholas York kennedykevin@collins.com Donald Robles williamsbrandon@gmail.com Melissa Allen pproctor@ramos-patel.com Shannon Jones beckkathleen@johnson.com David White sandra73@thompson.com Jonathan Thomas johnsonjeremy@gmail.com Rachael Floyd amanda78@johnson.info Tina Carter josewells@jones.net Eric Johnson bowersaustin@hernandez-edwards.com William Kramer rhunt@johnson.com Nathan Williams cynthiayoung@hotmail.com Patty Schwartz salinasdavid@sheppard.biz David Collins pcalhoun@yahoo.com James Thomas brianfox@rogers-cruz.com Mark Casey jerry88@graham.com Robert Galloway cherylmcgee@hotmail.com Caitlin Dunn nicholemartin@yahoo.com Nancy 
Allison martha33@molina-bullock.com Marvin Burns wrocha@gmail.com Kimberly Jones anitamunoz@french-christian.com Caitlin Wood thomasrandall@bowers-sullivan.org Sara Burton riosangelica@gmail.com Jessica Roberson theresa11@hotmail.com Nicole Macias kevinhodge@martin.biz Christina Williams shawn35@rice-bailey.org Cody Winters nicholassmith@barron-wu.com Patricia Miller DDS pierceraymond@watkins.org Jennifer Lyons vrivera@gmail.com Jerry Rojas jacobalexander@yahoo.com Matthew Perez jrivas@hotmail.com Patrick Hogan moorelisa@yahoo.com Lisa Howard stephen90@smith.biz Justin Sloan edwardsmichael@hotmail.com Suzanne Morrow shane74@yahoo.com Theresa Lara maryrichardson@clark.com Christopher Powers yfowler@davis-lee.net Teresa Howell amy15@yahoo.com Richard Shelton ksmith@yahoo.com Jeremy Cole bleach@gmail.com Melissa Clark rosejeffrey@yahoo.com Kimberly Mcdaniel ularson@ross-david.com Kelly Dixon gatesstephen@hotmail.com Devin Quinn wjohnson@hotmail.com Kevin Greene lhanson@hotmail.com Jeffery Wiggins amy76@gmail.com Latoya Allen vking@yahoo.com Zachary Walker diazjames@hotmail.com Alyssa Molina elizabeth59@gmail.com Heather Miranda davidturner@cortez-martinez.biz Lori Gardner murphytaylor@yahoo.com Jessica Simpson jamesdean@rosales.com Anna Dickerson abigailmurphy@hotmail.com Molly Oconnor morrisrhonda@yahoo.com Brandi Braun ericksonmatthew@jenkins.org Renee Flowers brownantonio@yang-crosby.org Cassandra Compton progers@yahoo.com David Gilbert vickie78@gmail.com Brenda Davis cynthiajones@thornton.com Nicholas Rivera longalyssa@yahoo.com Dustin Hodges sgolden@lee.com Chad Wong williambernard@mccarty.net Robin Craig xbyrd@austin.com Heather Parker allenjoshua@rodriguez.com Jennifer Roberts manningtravis@gmail.com James Andrews ginaromero@hotmail.com Dorothy Hines dsmith@thomas.com Stephen Garcia hughesbrendan@hotmail.com Alfred Ellis elizabeth41@crawford.info Marilyn White victoriaford@hotmail.com Brian Graves cpatel@gmail.com Elizabeth Wagner newtonwesley@cohen.com 
Michelle Flores shelbygross@duke-thomas.info Larry Russell richard99@meyer.com Terrence Boyd markmartin@flores.com Jessica Carroll eric30@yahoo.com Erin Dean toddmartin@guerra.biz Craig Hernandez joshualang@gonzalez.com Amber Choi doughertynancy@harmon.org Renee Brown terribeard@archer-gibson.info Curtis Turner pjohnson@hotmail.com Benjamin Reed marksmith@austin.net Christina Fernandez richardjoseph@esparza-peters.com Jasmine Campbell thomasmatthew@gmail.com Catherine Bond coreyroberts@gonzalez.com Connie Jones koneal@riley.com Cody Taylor kelsey99@hotmail.com Kendra Gray walkerrussell@hotmail.com Alexander Murray grossrobert@hotmail.com Arthur Jackson travis73@hotmail.com Dr. William Vasquez DDS gonzalezdaniel@hotmail.com April Hampton desireemorris@mcguire.info Gerald Hunter justin91@ross-scott.biz Morgan Bolton erika30@lloyd-smith.biz Angela Barker daniel17@carr.com Angela Montgomery jonathangoodwin@smith-perez.com Yolanda Henry shawnmcguire@gmail.com Susan Hines sarahbailey@wallace.com Michelle Young lewismichele@yahoo.com Glen Hood ljackson@vazquez.com Christopher Wright evansjulie@walton.com Susan Guzman DDS medinaelizabeth@gmail.com Barbara Cortez bchavez@cameron.com Stacey Hammond nancyturner@stewart.com Amanda Stout macdonaldlatoya@hotmail.com Lisa Johnson wnolan@gmail.com Carlos Wyatt iperez@cohen.com Samantha Brewer thomas47@hotmail.com Brett Jackson zpowell@cruz-rivera.com Johnny Guzman tmerritt@yahoo.com Mary Davis collinslisa@hotmail.com Willie Mccoy joshua20@terrell.biz Kelsey Rivera randy72@gmail.com Melissa Maddox christopher13@gmail.com Jason Rodriguez kellypierce@harris.com Donna Walsh wardraymond@martinez.com Monique Patel cynthia75@james.net Dr. Lindsay Farrell PhD brownmaria@gmail.com Ann Ruiz jeremiah94@pennington.org Mary Alexander catherineharper@munoz.org Brittany Russell haileywinters@russell-coffey.net Dominique Rosales matthewpatterson@carr.com Henry Waters karen72@logan.com Jared Weaver karlafletcher@baldwin.org Mr. 
Thomas Atkins gboone@gmail.com Carla Cohen ibarron@gmail.com Tricia Lewis pperez@hotmail.com Mario Gill lisa43@brown.org James Olsen vickie82@hotmail.com Michael Perry rdavis@yahoo.com Matthew Lucas joshuagray@carpenter-stanley.com Christine Torres samanthayoung@smith-aguilar.biz Lindsay Miller randyevans@yahoo.com Margaret Jones kevincantu@alexander-carson.org Cameron Mcdonald deckerjerome@garcia.com Brittany Sanders dennis55@leonard-turner.com Daniel Patterson timothy36@novak.com David Chaney kristen02@hotmail.com Sheri Silva idawson@alvarez.com Holly Ward saraallen@dunn-smith.net Bryan Solis stacey30@lam.biz Diane Carter paulvargas@gmail.com David Brown james98@gmail.com Bridget Fritz beth24@hotmail.com Paul Boyd johngutierrez@hotmail.com Ernest Baker phillipwhite@hotmail.com George Myers frank52@hammond.com Daniel Miller joshua96@gmail.com Jonathan Ayala jerryharris@davis.net Jill Stone pwright@hotmail.com Trevor Richard mreed@thompson.org Jason Thomas josephflowers@hotmail.com Arthur Thomas lnelson@hicks.com Austin Collins ambermann@barnes.com Jason Diaz ericreyes@hotmail.com Darryl Hall faithdixon@barnes-burgess.org Jason Thomas brittany32@yahoo.com John Sanders waltontheresa@hotmail.com Lisa Hayes victor14@hotmail.com Chelsea Wong iwatkins@williams-solomon.com Joseph Fitzgerald mary86@hotmail.com Crystal Schroeder kbarron@wilson-flynn.org Denise Bean noah23@gmail.com Jamie Atkins cwebb@hotmail.com Joshua Kim esmith@ramirez.com Deanna Mooney jason13@turner.com Jasmine Baker torresjacob@braun.com Victoria Williams rwilliams@hotmail.com Sandra Hall williamsonrichard@gmail.com Miranda Mcpherson xrussell@barajas.biz Samantha Walton danielle73@gmail.com Kyle Serrano stonecassandra@mcfarland.info Mr. 
Bruce Maldonado DDS diazmatthew@yahoo.com Amber Fisher jonesdavid@rubio.info Brett Berry millerteresa@gmail.com Cory Bradley umatthews@summers.com Ryan Peters shepherdmonique@gmail.com Laura Lee lfleming@higgins.com Christian Smith johnnymartinez@castro-miller.com Kelly Hanson velazquezsandra@chavez-malone.info Brian King hwood@yahoo.com Cynthia Owens sbrown@hotmail.com Lisa Clark derek74@bell-martinez.com Brenda Ford kevin55@hotmail.com Daniel Brady wbennett@hotmail.com Jake Wilson lorraine60@solis.biz April Cole halltyler@yahoo.com Melissa Callahan cmckenzie@rodriguez.info Taylor Brown davisadam@gmail.com Patrick Guerrero hannah48@delgado.net Brian Gonzalez burchmalik@johnson.com Robert Bailey debbiemoore@hotmail.com Jesus Maynard gene45@gmail.com Linda Greer johnharris@reed-allen.net Travis Thomas bryantrachel@gmail.com Vicki Mitchell edaniels@hotmail.com Paula Espinoza donnameyer@dennis.org James Hoffman haustin@larson-wiggins.biz Ashlee Perkins stevenknapp@miller.com Rebecca Leon smitchell@simpson-johnson.com Jorge Williams shawn36@peters-meadows.com Bob Flores kellercourtney@yahoo.com Lisa Miller johnsoncrystal@gmail.com Brandon Davis bryanpetersen@hotmail.com Joshua Daugherty josehayes@carey.com Justin Wise pamelacosta@simmons-morrow.com Kimberly Johnson combssandra@deleon.com Toni Stone eestrada@charles.com Julie Rivers rwilliams@castillo-nelson.org Kelly Scott danielsmith@hotmail.com Michael Carr clarklisa@newman-barrett.com Jonathan Vaughn dennisrebecca@lawrence-harris.com Erica Lowe wilsonkelly@hotmail.com Kimberly Clark jose15@gmail.com Lindsey Robertson rdickerson@yahoo.com Cindy Anderson gmorton@daniels.com Tami Barber harveykaren@hotmail.com Tiffany Wu jessica90@gmail.com Edward Bowers hallkathy@gmail.com Shawn Collier rhondasmith@hotmail.com Michael Cox usimpson@graham-cunningham.net ================================================ FILE: _2020/files/vimrc ================================================ " Comments in Vimscript start with a `"`. 
" If you open this file in Vim, it'll be syntax highlighted for you. " Vim is based on Vi. Setting `nocompatible` switches from the default " Vi-compatibility mode and enables useful Vim functionality. This " configuration option turns out not to be necessary for the file named " '~/.vimrc', because Vim automatically enters nocompatible mode if that file " is present. But we're including it here just in case this config file is " loaded some other way (e.g. saved as `foo`, and then Vim started with " `vim -u foo`). set nocompatible " Turn on syntax highlighting. syntax on " Disable the default Vim startup message. set shortmess+=I " Show line numbers. set number " This enables relative line numbering mode. With both number and " relativenumber enabled, the current line shows the true line number, while " all other lines (above and below) are numbered relative to the current line. " This is useful because you can tell, at a glance, what count is needed to " jump up or down to a particular line, by {count}k to go up or {count}j to go " down. set relativenumber " Always show the status line at the bottom, even if you only have one window open. set laststatus=2 " The backspace key has slightly unintuitive behavior by default. For example, " by default, you can't backspace before the insertion point set with 'i'. " This configuration makes backspace behave more reasonably, in that you can " backspace over anything. set backspace=indent,eol,start " By default, Vim doesn't let you hide a buffer (i.e. have a buffer that isn't " shown in any window) that has unsaved changes. This is to prevent you from " " forgetting about unsaved changes and then quitting e.g. via `:qa!`. We find " hidden buffers helpful enough to disable this protection. See `:help hidden` " for more information on this. set hidden " This setting makes search case-insensitive when all characters in the string " being searched are lowercase. 
However, the search becomes case-sensitive if " it contains any capital letters. This makes searching more convenient. set ignorecase set smartcase " Enable searching as you type, rather than waiting till you press enter. set incsearch " Unbind some useless/annoying default key bindings. nmap Q <Nop> " 'Q' in normal mode enters Ex mode. You almost never want this. " Disable audible bell because it's annoying. set noerrorbells visualbell t_vb= " Enable mouse support. You should avoid relying on this too much, but it can " sometimes be convenient. set mouse+=a " Try to prevent bad habits like using the arrow keys for movement. This is " not the only possible bad habit. For example, holding down the h/j/k/l keys " for movement, rather than using more efficient movement commands, is also a " bad habit. The former is enforceable through a .vimrc, while we don't know " how to prevent the latter. " Do this in normal mode... nnoremap <Left> :echoe "Use h"<CR> nnoremap <Right> :echoe "Use l"<CR> nnoremap <Up> :echoe "Use k"<CR> nnoremap <Down> :echoe "Use j"<CR> " ...and in insert mode inoremap <Left> <ESC>:echoe "Use h"<CR> inoremap <Right> <ESC>:echoe "Use l"<CR> inoremap <Up> <ESC>:echoe "Use k"<CR> inoremap <Down> <ESC>:echoe "Use j"<CR> ================================================ FILE: _2020/index.html ================================================ --- layout: page title: "2020 Lectures" permalink: /2020/ phony: true excerpt: '' # work around a bug ---
    {% assign lectures = site['2020'] | sort: 'date' %} {% for lecture in lectures %} {% if lecture.phony != true %}
  • {{ lecture.date | date: '%-m/%d' }}: {% if lecture.ready %} {{ lecture.title }} {% elsif lecture.noclass %} {{ lecture.title }} [no class] {% else %} {{ lecture.title }} [coming soon] {% endif %} {% if lecture.details %}
    ({{ lecture.details }}) {% endif %}
  • {% endif %} {% endfor %}
Lecture videos are available on YouTube.

Previous year's lectures

You can also find last year's lecture notes and videos.

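================================================ FILE: _2020/metaprogramming.md ================================================
---
layout: lecture
title: "Metaprogramming"
details: build systems, dependency management, testing, continuous integration
date: 2020-01-27
ready: true
sync: true
syncdate: 2025-08-16
video:
  aspect: 56.25
  id: _Ms1Z4xfqv4
solution:
  ready: true
  url: metaprogramming-solution
---

What do we mean by "metaprogramming"? Well, it was the best collective term we could come up with for the set of things covered here. The topics in this lecture are more about *process* than about writing code or working more efficiently. We will look at build systems, code testing, and dependency management. These may seem of limited importance while you are a student, but once you start an internship or enter the workforce you will work with larger code bases, and you will see them everywhere. We must note that "metaprogramming" can also mean "[programs that operate on programs](https://en.wikipedia.org/wiki/Metaprogramming)", which is entirely different from the concepts covered in this lecture.

# Build systems

If you write a paper in LaTeX, what commands do you need to run to produce the paper? What about the commands that run your benchmarks, plot them, and insert the plots into the paper? Or, how do you compile the code provided with this course and run its tests?

Most projects, whether they contain code or not, have a "build process": a sequence of operations you need to carry out. Often that process has many steps and many branches. Run some commands to generate the plots, others to generate the results, and yet others to produce the final paper. There is a lot to keep track of, and you are not the first to be annoyed by it. Luckily, many tools exist to help.

These tools are usually called "build systems", and there are many of them. Which one to use depends on the task at hand and on the size of the project. At their core they are all very similar. You define a number of *dependencies*, *targets*, and *rules*. You tell the build system which target you want, and its job is to find all the dependencies of that target and then apply the rules to produce intermediate targets until the final target has been built. Ideally, the build system does this without re-running rules for targets whose dependencies have not changed and whose results are already available from a previous build.

`make` is one of the most common build systems, and you will find it installed on almost every UNIX-based computer. It has its warts, but it works well enough for small to medium-sized projects. When you run `make`, it consults a file called `Makefile` in the current directory. All the targets, their dependencies, and the rules are defined in that file. It looks like this:

```make
paper.pdf: paper.tex plot-data.png
	pdflatex paper.tex

plot-%.png: %.dat plot.py
	./plot.py -i $*.dat -o $@
```

Each directive in this file is a rule for how to produce the left-hand side using the right-hand side. Put differently, the things to the left of the colon are targets and the things to the right are their dependencies. The indented block is the sequence of commands that produces the target from those dependencies. In `make`, the first directive also defines the default goal: if you run `make` with no arguments, this is the target it builds. Alternatively, you can build a different target with a command such as `make plot-data.png`.

The `%` in a rule is a pattern that matches the same string on the left and on the right. For example, if the target `plot-foo.png` is requested, `make` will look for the dependencies `foo.dat` and `plot.py`. Now let's look at what happens if we run `make` in an empty source directory.

```console
$ make
make: *** No rule to make target 'paper.tex', needed by 'paper.pdf'. Stop.
```

`make` is telling us that in order to build `paper.pdf` it needs `paper.tex`, and that it has no rule telling it how to make that file. Let's fix that!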
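```console
$ touch paper.tex
$ make
make: *** No rule to make target 'plot-data.png', needed by 'paper.pdf'. Stop.
```

Hmm, interesting: there **is** a rule to make `plot-data.png`, but it is a pattern rule. Since the source file `data.dat` does not exist, `make` simply states that it cannot build `plot-data.png`. Let's create all the files:

```console
$ cat paper.tex
\documentclass{article}
\usepackage{graphicx}
\begin{document}
\includegraphics[scale=0.65]{plot-data.png}
\end{document}
$ cat plot.py
#!/usr/bin/env python
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-i', type=argparse.FileType('r'))
parser.add_argument('-o')
args = parser.parse_args()

data = np.loadtxt(args.i)
plt.plot(data[:, 0], data[:, 1])
plt.savefig(args.o)
$ cat data.dat
1 1
2 2
3 3
4 4
5 8
```

Now what happens if we run `make`?

```console
$ make
./plot.py -i data.dat -o plot-data.png
pdflatex paper.tex
... lots of output ...
```

And look, it made a PDF for us! What if we run `make` again?

```console
$ make
make: 'paper.pdf' is up to date.
```

It didn't do anything! Why not? Because it didn't need to. It checked that all of the previously built targets were still up to date with respect to their listed dependencies. We can test this by modifying `paper.tex` and then re-running `make`:

```console
$ vim paper.tex
$ make
pdflatex paper.tex
...
```

Notice that `make` did **not** re-run `plot.py`, because that was not necessary; none of `plot-data.png`'s dependencies changed.

# Dependency management

Your project is likely to have dependencies that are themselves projects. You might depend on installed programs (like `python`), system packages (like `openssl`), or libraries within your programming language (like `matplotlib`). These days, most dependencies are available through a *repository* that hosts a large number of them in one place and provides a convenient mechanism for installing them. Examples include the Ubuntu package repositories, which you access through the `apt` tool, RubyGems for Ruby libraries, PyPI for Python libraries, and the Arch User Repository for packages contributed by Arch Linux users.

Since the exact mechanisms vary a lot from repository to repository and from tool to tool, we won't go into the details of any specific one in this lecture. What we will cover is some of the common terminology, starting with *versioning*. Most projects that other projects depend on issue a *version number* with every release, usually something like 8.1.3 or 64.1.20192004. Version numbers are often, but not always, numerical. They serve many purposes, and one of the most important is to ensure that software keeps working. Imagine that I release a new version of my library in which I have renamed a function. If someone tries to build software that depends on my library after I release that update, the build might fail because it calls a function that no longer exists. Versioning addresses this by letting a project say that it depends on a particular version, or range of versions, of some other project. That way, even if the underlying library changes, dependent software can keep building against an older version of it.

That isn't ideal either, though. What if I issue a security update that does *not* change the public interface of my library (its API), and that every project depending on the old version should pick up immediately? This is where the different groups of numbers in a version come in. The exact meaning of each one varies between projects, but one relatively common standard is [semantic versioning](https://semver.org/). With semantic versioning, every version number has the form major.minor.patch. The rules are:

- If a new release does not change the API, increase the patch version.
- If you add to your API in a backwards-compatible way, increase the minor version.
- If you change the API in a backwards-incompatible way, increase the major version.

This has some major advantages. If my project depends on your project, it should be safe to use the latest release with the same major version as the one I built against, as long as its minor version is at least what it was when I developed against it. In other words, if I depend on your library at version `1.3.7`, then it should be fine to build with `1.3.8`, `1.6.1`, or even `1.3.0`. Version `2.2.4` would probably not be okay, because the major version was increased. Python's version numbers are an example of semantic versioning. Many of you are probably aware that Python 2 and Python 3 code are not compatible, which is why that was a major version bump. Similarly, code written for Python 3.5 might run fine on Python 3.7, but possibly not on 3.4.

When working with dependency management systems, you may also come across the notion of *lock files*. A lock file lists the exact version you currently depend on for each dependency. Usually you need to explicitly run an update program to upgrade to newer versions. There are many reasons for this, such as avoiding unnecessary recompiles, having reproducible builds, or not automatically updating to the latest version (which may be broken). An extreme form of dependency locking is *vendoring*, where you copy all the code of your dependencies into your own project. That gives you total control over any changes to it and lets you introduce your own changes, but it also means you have to explicitly pull in any updates from the upstream maintainers.

# Continuous integration systems

As you work on larger and larger projects, you'll find that there are often additional tasks to do whenever you make a change. You might have to upload a new version of the documentation, upload a compiled version somewhere, release the code to PyPI, run your test suite, and so on. Maybe every time someone submits code on GitHub, you want their code to be style-checked and some benchmarks to run? When you have these kinds of needs, it's time to take a look at continuous integration.

Continuous integration, or CI, is an umbrella term for "stuff that runs whenever your code changes". Many companies provide various types of CI, many of them free for open-source projects. Some of the big ones are Travis CI, Azure Pipelines, and GitHub Actions. They all work in roughly the same way: you add a file to your repository that describes what should happen when various things happen to that repository. By far the most common rule is "when someone pushes code, run the test suite". When the event triggers, the CI provider spins up one or more virtual machines, runs the commands in your rules, and then usually records the results somewhere. You might set it up so that you are notified if the test suite stops passing, or so that a badge appears on your repository as long as the tests pass.

The class website is a good example: it is built with GitHub Pages. Pages runs the Jekyll blog software on every push to `master` and makes the built website available on a particular GitHub domain. This makes it trivial for us to update the website: we make our changes locally, commit them with git, and push. CI takes care of the rest.

## A brief aside on testing

Most large software projects come with a "test suite". You may already be familiar with the general concept of testing, but we thought we'd quickly mention some approaches to testing and testing terminology that you may encounter:

- Test suite: a collective term for all the tests.
- Unit test: a "micro-test" that tests a specific feature in isolation.
- Integration test: a "macro-test" that runs a larger part of the system to check that different features or components work *together*.
- Regression test: a test that implements a particular pattern that previously caused a bug, to ensure that the bug does not resurface.
- Mocking: replacing a function, module, or type with a fake implementation to avoid testing unrelated functionality. For example, you might "mock the network" or "mock the disk".

# Exercises

[Solutions]({{site.url}}/{{site.solution_url}}/{{page.solution.url}})

1. Most makefiles provide a target called `clean`. This isn't intended to produce a file called `clean`, but instead to clean up any files that can be re-built by make. Think of it as a way to "undo" all of the build steps. Implement a `clean` target for the `paper.pdf` `Makefile` above. You will have to make the target [phony](https://www.gnu.org/software/make/manual/html_node/Phony-Targets.html). You may find the [`git ls-files`](https://git-scm.com/docs/git-ls-files) subcommand useful. A number of other useful make targets are listed [here](https://www.gnu.org/software/make/manual/html_node/Standard-Targets.html#Standard-Targets).
2. There are many ways to specify version requirements for dependencies. Take a look at dependency management in [Rust's build system](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html). Most package repositories support similar syntax. For each kind of requirement (caret, tilde, wildcard, comparison, and multiple), try to come up with a use case in which it makes sense.
3. Git can act as a simple CI system all by itself. In `.git/hooks` inside any git repository, you will find (currently inactive) files that are run as scripts when a particular action happens. Write a [`pre-commit`](https://git-scm.com/docs/githooks#_pre_commit) hook that runs `make paper.pdf` and refuses the commit if the `make` command fails. This should prevent any commit from containing an unbuildable version of the paper.
4. Set up a simple auto-published page using [GitHub Pages](https://pages.github.com/). Add a [GitHub Action](https://github.com/features/actions) to the repository to run `shellcheck` on any shell files in that repository (here is [one way to do it](https://github.com/marketplace/actions/shellcheck)).
5. [Build your own](https://help.github.com/en/actions/automating-your-workflow-with-github-actions/building-actions) GitHub action to run [`proselint`](http://proselint.com/) or [`write-good`](https://github.com/btford/write-good) on all the `.md` files in the repository. Enable it in your repository, and check that it works by committing a file that contains an error.

================================================ FILE: _2020/potpourri.md ================================================
---
layout: lecture
title: "Potpourri"
date: 2020-01-29
ready: true
sync: true
syncdate: 2025-08-16
video:
  aspect: 56.25
  id: JZDt-PRq0uo
---

## Table of contents

- [Keyboard remapping](#keyboard-remapping)
- [Daemons](#daemons)
- [FUSE](#fuse)
- [Backups](#backups)
- [APIs](#apis)
- [Common command-line flags/patterns](#common-command-line-flagspatterns)
- [Window managers](#window-managers)
- [VPNs](#vpns)
- [Markdown](#markdown)
- [Hammerspoon (desktop automation on macOS)](#hammerspoon-desktop-automation-on-macos)
- [Resources](#resources)
- [Booting + Live USBs](#booting--live-usbs)
- [Docker, Vagrant, VMs, Cloud, OpenStack](#docker-vagrant-vms-cloud-openstack)
- [Notebook programming](#notebook-programming)
- [GitHub](#github)

## Keyboard remapping

As a programmer, your keyboard is your main input method. Like pretty much everything else in your computer, it is configurable, and worth configuring.

The most basic change is to remap keys. This usually involves software running on the computer that intercepts keypress events and replaces them with other events. Some examples:

- Remap Caps Lock to Ctrl or Escape. We (the instructors) highly encourage this, since Caps Lock has a very convenient location but is rarely used.
- Remap PrtSc to Play/Pause. Most operating systems support a play/pause key.
- Swap Ctrl and the Meta key (the Windows logo key, or Command on a Mac).

You can also map keys to arbitrary commands of your choosing. The software listens for a specific key combination and runs a script whenever it is detected:

- Open a new terminal or browser window.
- Insert some specific text, e.g. a long email address or your MIT ID.
- Put the computer or the displays to sleep.

Even more complex modifications can be configured:

- Remapping sequences of keys, e.g. pressing Shift five times toggles Caps Lock.
- Remapping on tap vs. on hold, e.g. Caps Lock acts as Escape if you tap it, but as Ctrl if you hold it.
- Keeping remaps that are specific to a keyboard or to a piece of software.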
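
The tap-versus-hold item in the last list can be sketched as a small decision function. This is a hypothetical illustration, not how any particular remapping tool is implemented: the name `resolve_caps_lock` and the 0.2-second threshold `HOLD_THRESHOLD` are assumptions chosen for the example.

```python
# Hypothetical sketch: decide what a Caps Lock press should be remapped to.
HOLD_THRESHOLD = 0.2  # seconds; assumed cutoff between a tap and a hold


def resolve_caps_lock(press_time: float, release_time: float) -> str:
    """Return the key a Caps Lock press stands for.

    A short press (a tap) acts as Escape; a press held for at least
    HOLD_THRESHOLD seconds acts as Ctrl.
    """
    held_for = release_time - press_time
    if held_for >= HOLD_THRESHOLD:
        return "Ctrl"
    return "Escape"
```

Real tools usually also treat Caps Lock as Ctrl as soon as another key is pressed while it is down, so a threshold alone is only a first approximation.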
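
Some software for remapping keys:

- macOS - [karabiner-elements](https://pqrs.org/osx/karabiner/), [skhd](https://github.com/koekeishiya/skhd) or [BetterTouchTool](https://folivora.ai/)
- Linux - [xmodmap](https://wiki.archlinux.org/index.php/Xmodmap) or [Autokey](https://github.com/autokey/autokey)
- Windows - the Control Panel, [AutoHotkey](https://www.autohotkey.com/) or [SharpKeys](https://www.randyrants.com/category/sharpkeys/)
- QMK - If your keyboard supports custom firmware, you can use [QMK](https://docs.qmk.fm/) to configure the remaps in the keyboard hardware itself. Remaps stored in the keyboard do not have to be configured again on other machines.

## Daemons

Even if the word daemon seems unfamiliar, you probably already have a rough idea of the concept. Most computers have a series of processes that keep running in the background without the user launching them or interacting with them. These processes are called daemons. Programs that run as daemons often have names ending in `d`, for example `sshd`, the SSH server, which listens for incoming SSH requests and authenticates users.

On Linux, `systemd` (the system daemon) is the most common way to configure and run daemons. You can run `systemctl status` to list the daemons that are currently running. Many of them may be unfamiliar, but they are responsible for core parts of the system: managing the network, resolving DNS queries, displaying the graphical interface, and so on. You interact with `systemd` through the `systemctl` command in order to `enable`, `disable`, `start`, `stop`, `restart`, or check the `status` of configured daemons and system services.

`systemd` provides a convenient interface for configuring and enabling new daemons or services. The unit file below runs a simple Python program as a daemon. Its content is fairly self-explanatory, so we won't go into detail. See [freedesktop.org](https://www.freedesktop.org/software/systemd/man/systemd.service.html) for the detailed reference on `systemd` unit files.

```ini
# /etc/systemd/system/myapp.service
[Unit]
# Description of the unit
Description=My Custom App
# Start this service after the network is up
After=network.target

[Service]
# User that runs the process
User=foo
# Group that runs the process
Group=foo
# Working directory of the process
WorkingDirectory=/home/foo/projects/mydaemon
# Command that starts the process
ExecStart=/usr/bin/local/python3.7 app.py
# Restart the process on failure
Restart=on-failure

[Install]
# Comparable to start-on-boot in Windows: the service is loaded and run even if the GUI has not started
WantedBy=multi-user.target
# If the service should run only while the GUI is active, use instead:
# WantedBy=graphical.target
# graphical.target runs the GUI-related services on top of multi-user.target
```

If you just want to run some program periodically, you can use [`cron`](https://www.man7.org/linux/man-pages/man8/cron.8.html) directly. It is a daemon built into the system for running scheduled tasks.

## FUSE

Modern software systems are usually built from many modular components. Your operating system can use different filesystems through a common set of operations that they all provide. For instance, when you run `touch` to create a file, `touch` makes a system call to the kernel, and the kernel then calls the filesystem-specific routine that creates the file. The catch is that UNIX filesystems are traditionally implemented as kernel modules, so only the kernel can perform filesystem calls.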
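
As a rough illustration of the path just described, the sketch below creates a file roughly the way `touch` does, using the thin wrappers around system calls in Python's `os` module. The helper name `touch` and the `0o644` file mode are assumptions for the example; the real `touch` utility handles more cases.

```python
import os
import tempfile


def touch(path: str) -> None:
    """Create `path` if it does not exist, then update its timestamps."""
    # os.open wraps the open(2) system call. The kernel dispatches the
    # request to whichever filesystem implementation backs `path`.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.close(fd)
    # Passing None sets the access and modification times to the current time.
    os.utime(path, None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.txt")
        touch(target)
        print(os.path.exists(target))
```

The calling code is the same whether the directory lives on a local disk or on some other mounted filesystem, which is the uniformity the paragraph above relies on.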
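
[FUSE](https://en.wikipedia.org/wiki/Filesystem_in_Userspace) (Filesystem in User Space) lets a program running in user space implement filesystem calls, and bridges those calls to the kernel interfaces. In practice, this means that users can implement arbitrary functionality for filesystem calls.

For example, FUSE can be used to implement a virtual filesystem in which every operation is forwarded over SSH to a remote machine, performed there, and the result returned to the local computer. The files are stored on the remote host, but to local programs they look no different from local files. `sshfs` is a FUSE filesystem that does exactly this.

Some interesting FUSE filesystems are:

- [sshfs](https://github.com/libfuse/sshfs): open files on a remote host locally over an SSH connection.
- [rclone](https://rclone.org/commands/rclone_mount/): mount cloud storage services such as Dropbox, Google Drive, Amazon S3, or Google Cloud Storage as a local filesystem.
- [gocryptfs](https://nuetzlich.net/gocryptfs/): an encrypted overlay filesystem. Files are stored encrypted on disk, but once the filesystem is mounted they appear as plaintext at the mountpoint.
- [kbfs](https://keybase.io/docs/kbfs): a distributed filesystem with end-to-end encryption. It has private, shared, and public folders.
- [borgbackup](https://borgbackup.readthedocs.io/en/stable/usage/mount.html): mount your deduplicated, compressed, and encrypted backups for easy browsing.

## Backups

Any data that you haven't backed up could be gone at any moment, forever. Copying data is easy; backing it up reliably is hard. Here are some backup basics and the pitfalls of some common approaches.

First, a copy of the data on the same disk is not a backup, because the disk is a single point of failure: if it fails, all the data may be lost. An external drive kept at home is a weak backup, since it could be lost in a fire or robbery together with the original. The recommended practice is to keep a backup in a different location.

Synchronization solutions are not backups either. Convenient as Dropbox and Google Drive are, when data is erased or corrupted locally they may propagate that "change" to the cloud. For the same reason, disk mirroring solutions like RAID are not backups. They do not protect against files being accidentally deleted, corrupted, or encrypted by ransomware.

Some core features of a good backup solution are versioning, deduplication, and security. Versioning ensures that you can recover data from any recorded point in its history. Deduplication means that only incremental changes are stored, which reduces storage overhead. Regarding security, you should ask yourself what information or tools someone would need in order to read your data or to delete all your data and its backups. Finally, do not trust backups blindly: verify regularly that you can actually restore data from them.

Backups are not limited to the files on your local computer. With the growth of web applications, large amounts of your data are stored only in the cloud. Your webmail, social media photos, streaming playlists, and online documents are gone if you lose access to those accounts. You should keep an offline copy of this data, and there are projects that help you download and store it.

For more detail, see the 2019 [lecture notes](/2019/backups) on backups.

## APIs

We've talked a lot in this class about using your computer efficiently to accomplish *local* tasks. Many of these lessons also apply to the wider internet. Most online services have APIs (application programming interfaces) that let you access their data programmatically. For example, the US National Weather Service provides an API that you can use to get a weather forecast from your shell.

Most of these APIs have a similar format. They are structured URLs, often rooted at `api.service.com`, where the path selects the operation you want and query parameters narrow down the results. For the US weather data, for instance, to get the forecast for a particular location you send a GET request (with `curl`, for example) to [`https://api.weather.gov/points/42.3604,-71.094`](https://api.weather.gov/points/42.3604,-71.094). The response contains a set of further URLs for specific information (hourly forecasts, weather station information, and so on). Responses are usually formatted as JSON, which you can pipe through a tool like [`jq`](https://stedolan.github.io/jq/) to extract the parts you need.

Some APIs require authentication, usually in the form of a secret token that you include with the request. Read the documentation of the API you want to use to see what it expects, but many APIs use [OAuth](https://www.oauth.com/). OAuth gives you tokens that can only be used for particular functions of a given API. Because the API treats a request carrying a valid OAuth token as a request from you, keep these tokens secret. Anyone who obtains one can do whatever you are allowed to do on that API under your identity.

[IFTTT](https://ifttt.com/) is a website that integrates many APIs and lets an event on one API trigger an action on another. Its full name, If This Then That, describes how it works: for example, it can automatically repost a new tweet to another platform once it is detected. You can combine the APIs it supports in nearly arbitrary ways, so try setting up whatever you need.

## Common command-line flags/patterns

Command-line tools vary a lot, and reading their `man` pages helps you understand each one. That said, they share some common features that are good to be aware of:

- Most tools support `--help` or a similar flag to display brief usage instructions.
- Tools that can cause irrevocable changes often provide a "dry run" flag, so that you can confirm what the tool would do when run for real. They often also have an "interactive" flag that prompts you before each irrevocable action.
- `--version` or `-V` makes a tool print its own version (important when reporting bugs).
- Almost all tools have a `--verbose` or `-v` flag to produce more detailed output. Repeating the flag, as in `-vvv`, usually produces even more detailed output, which is handy for debugging. Similarly, many tools have a `--quiet` flag that suppresses everything except errors.
- In many tools, `-` in place of an input or output file name means that the tool reads from standard input or writes to standard output.
- Possibly destructive tools are generally not recursive by default, but support a "recursive" flag (often `-r`).
- Sometimes you want to pass something that *looks* like a flag as a normal argument, for example:
    - removing a file called `-r` with `rm`;
    - running one program "through" another (`ssh machine foo`) and passing a flag to the inner program (`foo`).

  In these cases you can use the special argument `--`, which makes a program *stop* processing flags and options (things starting with `-`) in whatever follows:
    - `rm -- -r` makes `rm` treat `-r` as a file name;
    - in `ssh machine --for-ssh -- foo --for-foo`, the `--` tells `ssh` that `--for-foo` is not one of its own flags.

## Window managers

Most people are used to the "drag and drop" window managers that come with Windows, macOS, and Ubuntu by default. Windows just sit on the screen, and you can drag them around, resize them, and have them overlap one another. This kind of floating (stacking) window manager is only one type of window manager. There are many others, especially on Linux.

A common alternative is the tiling window manager. As the name suggests, a tiling window manager arranges windows as tiles on the screen so that they never overlap. This is similar to how [tmux](https://github.com/tmux/tmux) manages terminal panes. A tiling window manager displays open windows according to a predefined layout. If only one window is open, it fills the whole screen. When you open another, the original window shrinks to, say, two thirds or one third of the screen to make room. Opening more windows makes the existing ones adjust further.

Just like with tmux, a tiling window manager lets you switch between, resize, and move windows with the keyboard, without touching the mouse at all. They are worth trying!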
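
To make the tiling behaviour described above concrete, here is a minimal sketch of one possible layout rule: a master pane on the left and the remaining windows stacked on the right. The function name `tile`, the `master_ratio` parameter, and the rectangle representation are assumptions for illustration; real tiling window managers offer many layouts and handle details such as gaps and multiple monitors.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def tile(screen_w: int, screen_h: int, n_windows: int,
         master_ratio: float = 2 / 3) -> List[Rect]:
    """Lay out windows without overlap: one master pane plus a stack."""
    if n_windows <= 0:
        return []
    if n_windows == 1:
        # A single window fills the whole screen.
        return [(0, 0, screen_w, screen_h)]

    # The first window keeps `master_ratio` of the width.
    master_w = round(screen_w * master_ratio)
    rects: List[Rect] = [(0, 0, master_w, screen_h)]

    # The remaining windows share the rest of the width, stacked vertically.
    stack_n = n_windows - 1
    stack_w = screen_w - master_w
    for i in range(stack_n):
        top = i * screen_h // stack_n
        bottom = (i + 1) * screen_h // stack_n
        rects.append((master_w, top, stack_w, bottom - top))
    return rects
```

Calling `tile(1920, 1080, 2)` splits the screen into a two-thirds pane and a one-third pane, and each additional window shrinks the panes in the stack, matching the behaviour described in the text.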
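
## VPNs

VPNs are all the rage these days, but it is not clear that this is for [any good reason](https://gist.github.com/joepie91/5a9909939e6ce7d09e29). You should be aware of what a VPN does and does not get you. In the **best case**, using a VPN just changes your internet service provider (ISP) as far as the internet is concerned. All your traffic looks like it comes from the VPN provider's network instead of your "real" location, and the network you are actually connected to only sees encrypted traffic.

While that may sound attractive, keep in mind that when you use a VPN you are only shifting your trust from your ISP to the VPN provider. Whatever your ISP *could* see, the VPN provider can *also* see. If you trust the VPN provider more than your ISP, that is a win. Otherwise, the value of the VPN is unclear. Unencrypted public Wi-Fi at an airport is indeed not trustworthy, but on your home network the trade-off is much less clear.

You should also know that most traffic carrying sensitive information is already encrypted with HTTPS or, more generally, TLS. In that case it matters little whether your network is "safe": the network operator only learns which servers you talk to, not what data you exchange.

All of this assumes the "best case". VPN providers have been known to misconfigure their software so that encryption is weak or disabled entirely. Some providers are malicious, or at least opportunistic, and will log all your traffic and possibly sell information about it to third parties. Choosing a bad VPN provider is often worse than not using a VPN in the first place.

MIT runs its own [VPN](https://ist.mit.edu/vpn) for members who need to access on-campus resources. If you want to set up your own VPN, take a look at [WireGuard](https://www.wireguard.com/) and [Algo](https://github.com/trailofbits/algo).

## Markdown

You will very likely write all sorts of documents over the course of your career. Often you will want to mark up that text to make it more readable: bold or italic text, headers, links, and code fragments.

Instead of reaching for a heavy tool like Word or LaTeX, consider the lightweight markup language [Markdown](https://commonmark.org/help/). You have probably seen Markdown or one of its variants already. Many environments support and use some subset of it.

Markdown is an attempt to standardize the conventions people already use when writing plain text. For example:

- Text surrounded by `*` is emphasized (*italics*), and text surrounded by `**` is strongly emphasized (**bold**).
- Lines starting with `#` are headings, and the number of `#`s gives the heading level, e.g. `## Second-level heading`.
- A line starting with `-` is an item in a bullet list. A line starting with a number followed by `.` (e.g. `1.`) is an item in a numbered list.
- Text surrounded by backticks `` ` `` is shown in `code font`. To show a block of code, indent each line with four spaces or surround the whole block with triple backticks:

    ```
    like this
    ```

- To add a link, put the text *to display* in square brackets, followed immediately by the link target in parentheses: `[display text](link target)`.

Markdown is easy to get started with and is used nearly everywhere. In fact, the lecture notes and other materials for this class are written in Markdown. You can see the raw Markdown of this page [here](https://github.com/missing-semester-cn/missing-semester-cn.github.io/blob/master/_2020/potpourri.md).

## Hammerspoon (desktop automation on macOS)

[Hammerspoon](https://www.hammerspoon.org/) is a desktop automation framework for macOS. It lets you write Lua scripts that hook into operating system functionality and interact with the keyboard, mouse, windows, file system, and more.

Some examples of things you can do with Hammerspoon:

- Bind hotkeys that move windows to specific locations
- Create a menu bar button that automatically arranges windows in a specific layout
- Mute your speakers when you arrive at the lab, by detecting the WiFi network you are connected to
- Show a warning if you have accidentally taken your friend's charger

From the user's point of view, Hammerspoon runs arbitrary Lua code bound to menu bar buttons, key presses, or events. It provides an extensive library for interacting with the system, so there is very little it cannot do. You can write your own configuration from scratch or build on configurations that others have published.

### Resources

- [Getting Started with Hammerspoon](https://www.hammerspoon.org/go/): the official Hammerspoon tutorial
- [Sample configurations](https://github.com/Hammerspoon/hammerspoon/wiki/Sample-Configurations): the official sample configurations
- [Anish's Hammerspoon config](https://github.com/anishathalye/dotfiles-local/tree/mac/hammerspoon): Anish's Hammerspoon configuration

## Booting + Live USBs

When your computer starts, the [BIOS](https://en.wikipedia.org/wiki/BIOS) or [UEFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) initializes the hardware before the operating system is loaded. This is called booting. You can configure it by pressing the key combination the computer shows during startup, for example `Press F9 to configure BIOS. Press F12 to enter boot menu`. In the BIOS menu you can change hardware-related settings. In the boot menu you can choose to load the operating system from a device other than your hard drive, such as a Live USB.

A [Live USB](https://en.wikipedia.org/wiki/Live_USB) is a USB flash drive that contains a full operating system. Live USBs have many uses, including:

- serving as installation media for an operating system;
- running the operating system directly from the flash drive, without installing it to the hard drive;
- repairing the same operating system on the hard drive;
- recovering data from the hard drive.

A Live USB is made by *writing* an operating system image to the flash drive, which is more than just copying an `.iso` file onto it. You can use tools such as [UNetbootin](https://unetbootin.github.io/) or [Rufus](https://github.com/pbatard/rufus) to create one.

## Docker, Vagrant, VMs, Cloud, OpenStack

[Virtual machines](https://en.wikipedia.org/wiki/Virtual_machine) and similar tools such as containers let you emulate a whole computer system, including the operating system. They are useful for creating isolated environments for testing or development, and as sandboxes for security testing.

[Vagrant](https://www.vagrantup.com/) is a tool for building and configuring virtual development environments. You describe the operating system, services, packages to install, and so on in a configuration file, and then launch a virtual machine in one of several environments (VirtualBox, KVM, Hyper-V, etc.) with `vagrant up`. [Docker](https://www.docker.com/) is a similar tool that uses containers instead.

Renting virtual machines in the cloud gives you instant access to:

- A cheap, always-on machine with a public IP address for hosting services such as websites
- A machine with lots of CPU, disk, RAM, and GPU resources
- Many more machines than you physically have access to
- Billing by time used, rather than the fixed cost of physical machines. If you only need a lot of compute for a short time, renting 1000 virtual machines for a few minutes is clearly more economical.

Popular providers include [Amazon AWS](https://aws.amazon.com/), [Google Cloud](https://cloud.google.com/), [Microsoft Azure](https://azure.microsoft.com/), and [DigitalOcean](https://www.digitalocean.com/).

Members of MIT CSAIL can request free virtual machines for research through the [CSAIL OpenStack instance](https://tig.csail.mit.edu/shared-computing/open-stack/).

## Notebook programming

[Notebook programming environments](https://en.wikipedia.org/wiki/Notebook_interface) are handy for interactive, exploratory development in which you work with results as you go. Probably the most popular notebook environment today is [Jupyter](https://jupyter.org/). Its name comes from the three core languages it supports: Julia, Python, and R. [Wolfram Mathematica](https://www.wolfram.com/mathematica/) is another excellent environment, commonly used for scientific computing.

## GitHub

[GitHub](https://github.com/) is one of the most popular platforms for open-source software development. Many of the tools mentioned in this class, from [vim](https://github.com/vim/vim) to [Hammerspoon](https://github.com/Hammerspoon/hammerspoon), are hosted on GitHub. Contributing to the open-source tools you use every day is actually quite easy. Here are the two ways in which people most often contribute:

- Creating an [issue](https://help.github.com/en/github/managing-your-work-on-github/creating-an-issue). Issues are used to report bugs or request new features. Creating one does not require reading or writing code, so it is a lightweight way to contribute. High-quality bug reports are extremely valuable to developers. Commenting on existing issues also helps a project.
- Contributing code through a [pull request](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests). Since it involves reading and writing code, submitting a pull request is generally more involved than creating an issue. A pull request asks someone else to pull (and merge) your code into their repository. Many open-source projects only allow approved maintainers to change the project's code, so you usually need to [fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the upstream repository, which creates an identical copy under your GitHub account that you control. You can then freely create branches in your fork and push code that fixes a bug or implements a feature. Once you are done, go back to the project's GitHub page and [create a pull request](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request).

After you submit the request, the project maintainers will discuss the code with you and give feedback. If everything is fine, your code is merged into the upstream repository. Many large open-source projects provide a contribution guide, issues that are easy for beginners, and even dedicated mentoring programs to help newcomers get familiar with the project.

================================================ FILE: _2020/qa.md ================================================
---
layout: lecture
title: "Q&A"
date: 2020-01-30
ready: true
sync: true
syncdate: 2025-08-16
video:
  aspect: 56.25
  id: Wz50FvGG6xU
---

For the last lecture, we answered questions submitted by students:

- [Recommendations for learning operating systems topics such as processes, virtual memory, interrupts, and memory management](#recommendations-for-learning-operating-systems-topics-such-as-processes-virtual-memory-interrupts-and-memory-management)
- [Which tools would you prioritize learning first?](#which-tools-would-you-prioritize-learning-first)
- [Python vs. Bash scripts vs. other languages?](#python-vs-bash-scripts-vs-other-languages)
- [What is the difference between `source script.sh` and `./script.sh`?](#what-is-the-difference-between-source-scriptsh-and-scriptsh)
- [Where are the various packages and tools stored, and how does referencing them work? What is `/bin` or `/lib`?](#where-are-the-various-packages-and-tools-stored-and-how-does-referencing-them-work-what-is-bin-or-lib)
- [Should I use `apt-get install` or `pip install` to install a package?](#should-i-use-apt-get-install-or-pip-install-to-install-a-package)
- [What are the easiest and best profiling tools for improving code performance?](#what-are-the-easiest-and-best-profiling-tools-for-improving-code-performance)
- [What browser plugins do you use?](#what-browser-plugins-do-you-use)
- [What are other useful data wrangling tools?](#what-are-other-useful-data-wrangling-tools)
- [What is the difference between Docker and a virtual machine?](#what-is-the-difference-between-docker-and-a-virtual-machine)
- [What are the advantages and disadvantages of each OS, and how do we choose between them (e.g. the best Linux distribution for our purposes)?](#what-are-the-advantages-and-disadvantages-of-each-os-and-how-do-we-choose-between-them-eg-the-best-linux-distribution-for-our-purposes)
- [Vim vs. Emacs?](#vim-vs-emacs)
- [Any tips or tricks for machine learning applications?](#any-tips-or-tricks-for-machine-learning-applications)
- [Any more Vim tips?](#any-more-vim-tips)
- [What is 2FA and why should I use it?](#what-is-2fa-and-why-should-i-use-it)
- [Any comments on the differences between various web browsers?](#any-comments-on-the-differences-between-various-web-browsers)

## Recommendations for learning operating systems topics such as processes, virtual memory, interrupts, and memory management

First, it is unclear whether you actually need to know these lower-level topics well. They matter once you start writing lower-level code, such as implementing or modifying a kernel. Otherwise most of these topics will not be relevant, apart from processes and signals, which were briefly covered in other lectures.

Some learning resources:

- [MIT's 6.828 class](https://pdos.csail.mit.edu/6.828/) - a graduate-level operating systems class. The class materials are publicly available.
- *Modern Operating Systems* (4th ed.) by Andrew S. Tanenbaum - a good overview of many of the concepts mentioned above.
- *The Design and Implementation of the FreeBSD Operating System* - a good resource on the FreeBSD OS (note that FreeBSD is not Linux).
- Other guides such as [Writing an OS in Rust](https://os.phil-opp.com/), in which people implement a kernel step by step in various languages, mostly for teaching purposes.

## Which tools would you prioritize learning first?

Some things worth prioritizing:

- Using your keyboard more and your mouse less. You can get there through keyboard shortcuts, changing interfaces, and so on.
- Learning your editor well. As a programmer you spend most of your time editing files, so this skill really pays off.
- Learning how to automate or simplify repetitive tasks in your workflow, because this saves a great deal of time.
- Learning version control tools like Git and how to use them together with GitHub to collaborate on modern software projects.

## Python vs. Bash scripts vs. other languages?

In general, Bash scripts work well for short one-off scripts, when you just want to run a specific series of commands. Bash has a number of oddities, though, that make it hard to use for larger programs or scripts:

- Bash is fine for simple use cases, but it is hard to get right for all possible inputs. For example, spaces in script arguments are a common source of bugs in Bash scripts.
- Bash is not friendly to code reuse, so it is hard to reuse code you have already written. In general, Bash has no concept of software libraries.
- Bash relies on magic strings like `$?` or `$@` to refer to specific values, whereas other languages refer to them explicitly, with names like `exitCode` or `sys.argv`.

Therefore, for larger or more complex scripts we recommend a more mature scripting language such as Python or Ruby. You can find many libraries online, written in these languages, that solve common problems. If you find a library that implements the specific functionality you need in some language, the best approach is usually to just use that language.

## What is the difference between `source script.sh` and `./script.sh`?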
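
In both cases `script.sh` is read and executed in a bash session. The difference is which session runs the commands. With `source`, the commands run in your current bash session, so any changes to the current environment, such as changing directories or defining functions, persist in the current session once `source` finishes. When you run the script standalone as `./script.sh`, your current bash session starts a new instance of bash that runs the commands in `script.sh`. So if `script.sh` changes directories, the new bash instance changes directories, but once it exits and returns control to the parent bash session, the parent is still where it was before. Similarly, if `script.sh` defines a function that you want to use in your terminal, you need to `source` it so that the function is defined in your current bash session. If you run `./script.sh` instead, only the new bash process sees the function definition, not your current shell.

## Where are the various packages and tools stored, and how does referencing them work? What is `/bin` or `/lib`?

The programs you run from the command line are all found in the directories listed in your `PATH` environment variable. You can use the `which` command (or the `type` command) to check where your shell finds a specific program. In general, there are conventions about where specific kinds of files live. Here are some of the ones we talked about; see the [Filesystem Hierarchy Standard](https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard) for a more comprehensive list.

- `/bin` - Essential command binaries
- `/sbin` - Essential system binaries, usually run by root
- `/dev` - Device files, usually interfaces to hardware devices
- `/etc` - Host-specific system configuration files
- `/home` - Home directories of the system's users
- `/lib` - Common libraries for system programs
- `/opt` - Optional application software
- `/sys` - Information about and configuration for the system (covered in the [first lecture](/2020/course-shell/))
- `/tmp` - Temporary files (also `/var/tmp`), usually deleted on reboot
- `/usr/` - Read-only user data
  + `/usr/bin` - Non-essential command binaries
  + `/usr/sbin` - Non-essential system binaries, usually run by root
  + `/usr/local/bin` - Binaries for user-compiled programs
- `/var` - Variable files such as logs or caches

## Should I use `apt-get install` or `pip install` to install a package?

There is no universal answer to this question. It is related to the more general question of whether to install software with your system's package manager or with a language-specific package manager. A few things to take into account:

- Common packages are available through both, but less popular or more recent packages might not be in your system package manager. In that case, the language-specific package manager is the better choice.
- Similarly, language-specific package managers usually have more up-to-date versions of packages than system package managers.
- With the system package manager, libraries are installed system-wide. If you need different versions of a library for development, the system package manager might not be enough. For this scenario, most programming languages provide some kind of isolated or virtual environment, so that you can install different versions of libraries with the language-specific package manager without conflicts. For Python there is virtualenv, and for Ruby there is RVM.
- Depending on the operating system and the hardware architecture, some packages come with binaries while others need to be compiled. For instance, on ARM computers like the Raspberry Pi, the system package manager can be the better option if it ships binaries while the language-specific one requires compilation. This depends heavily on your specific setup.

You should use only one of the two approaches rather than both, since mixing them can lead to conflicts that are hard to resolve. Our recommendation is to use the language-specific package manager whenever possible, together with isolated environments (like Python's virtualenv), to avoid affecting the global environment.

## What are the easiest and best profiling tools for improving code performance?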
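
A really useful and simple tool for profiling is [print timing](/2020/debugging-profiling/#timing). You just manually measure the time taken between different parts of your code. By doing this repeatedly, you can effectively binary search over your code and find the segment that takes the longest.

For more advanced tools, Valgrind's [Callgrind](http://valgrind.org/docs/manual/cl-manual.html) lets you run your program and measure how long everything takes, along with all the call stacks (that is, which function called which other function). It then produces an annotated version of your source code with the time taken per line. However, it slows your program down by an order of magnitude and does not support threads. For other cases, the [`perf`](http://www.brendangregg.com/perf.html) tool and other language-specific sampling profilers can output useful data very quickly. [Flamegraphs](http://www.brendangregg.com/flamegraphs.html) are a tool for visualizing the output of sampling profilers. You can also use tools specific to the programming language or task at hand. For example, for web development, the developer tools built into Chrome and Firefox have excellent profilers.

Sometimes the slowest part of your code is the system waiting for an event such as a disk read or a network packet. In those cases, check that the theoretical speed estimated from the hardware's capabilities does not deviate from the actual measurements. There are also specialized tools for analyzing wait times in system calls, including [eBPF](http://www.brendangregg.com/blog/2019-01-01/learn-ebpf-tracing.html), which performs kernel tracing of user programs. If you need this sort of low-level profiling, [`bpftrace`](https://github.com/iovisor/bpftrace) is worth a try.

## What browser plugins do you use?

Our favorites are mostly related to security and usability:

- [uBlock Origin](https://github.com/gorhill/uBlock) - a [wide-spectrum](https://github.com/gorhill/uBlock/wiki/Blocking-mode) blocker that stops not only ads but all sorts of third-party communication a page may attempt. This also covers inline scripts and other types of resource loading. If you are willing to spend more time on configuration, go to [medium mode](https://github.com/gorhill/uBlock/wiki/Blocking-mode:-medium-mode) or even [hard mode](https://github.com/gorhill/uBlock/wiki/Blocking-mode:-hard-mode). Some sites will stop working until you have adjusted the settings, but these modes significantly improve your online security. Otherwise, [easy mode](https://github.com/gorhill/uBlock/wiki/Blocking-mode:-easy-mode) is already a good default that blocks most ads and tracking. You can also define your own rules for blocking website objects.
- [Stylus](https://github.com/openstyles/stylus/) - a fork of Stylish (don't use Stylish, which was shown to [steal browsing history](https://www.theregister.co.uk/2018/07/05/browsers_pull_stylish_but_invasive_browser_extension/)). It lets you load custom CSS stylesheets into websites, so you can easily customize and modify their appearance: remove a sidebar, change the background color, or change the text size or font. This can make websites you visit frequently more readable. Stylus can also find styles written by other users and published on [userstyles.org](https://userstyles.org/). Most common websites have one or several dark theme stylesheets, for instance.
- Full page screen capture - built into [Firefox](https://screenshots.firefox.com/) and available as a [Chrome extension](https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl?hl=en). It takes a screenshot of a full website, which is often much better than printing.
- [Multi Account Containers](https://addons.mozilla.org/en-US/firefox/addon/multi-account-containers/) - lets you separate cookies into "containers", so that you can browse the web with different identities and/or ensure that websites cannot share information with each other.
- Password manager integration - most password managers have browser extensions that make entering your credentials into websites not only more convenient but also more secure. Compared to simply copy-pasting your username and password, these extensions first check that the website's domain matches the one listed for the entry, which prevents phishing sites that impersonate a website from stealing your credentials.

## What are other useful data wrangling tools?

Some data wrangling tools we did not have time to cover in the data wrangling lecture are `jq` and `pup`, specialized parsers for JSON and HTML data respectively. The Perl programming language is another good tool for more advanced data wrangling pipelines. Another trick is the `column -t` command, which converts whitespace-separated text (not necessarily aligned) into properly aligned columns.

More generally, vim and Python are two less conventional data wrangling tools. For some complex, multi-line transformations, vim macros are very useful. You can record a series of actions and repeat them as many times as you want. For instance, the editors [lecture notes](/2020/editors/#macros) (and last year's [video](/2019/editors/)) include an example of converting an XML-formatted file into JSON using vim macros.

For tabular data, usually presented as CSV, the Python [pandas](https://pandas.pydata.org/) library is a great tool. It makes it easy to define complex operations like group by, join, or filters, and also to plot different properties of your data. It supports exporting to many table formats, including XLS, HTML, and LaTeX. Alternatively, the R programming language (an arguably [bad](http://arrgh.tim-smith.us/) programming language) has lots of functionality for computing statistics over data, which can be very useful as the last step of a pipeline. [ggplot2](https://ggplot2.tidyverse.org/) is a great plotting library in R.

## What is the difference between Docker and a virtual machine?

Docker is based on a more general concept called containers. The main difference between containers and virtual machines is that a virtual machine runs an entire OS stack, including the kernel, even if that kernel is the same as the host's. Containers, in contrast, avoid running another instance of the kernel and share the kernel with the host. On Linux, this is achieved through mechanisms such as LXC, which use a series of isolation mechanisms to start a program that thinks it is running on its own hardware while actually sharing the hardware and kernel with the host. Containers therefore have lower overhead than a full virtual machine.

On the flip side, containers have weaker isolation and only work if the host runs the same kernel. For instance, if you run Docker on macOS, Docker needs to start a Linux virtual machine to get an initial Linux kernel, so the overhead is still significant. Finally, Docker is a specific implementation of containers, tailored for software deployment. Because of this it has some quirks: for example, by default Docker containers do not persist any form of storage between restarts.

## What are the advantages and disadvantages of each OS, and how do we choose between them (e.g. the best Linux distribution for our purposes)?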
关于 Linux 发行版,尽管有相当多的版本,但大部分发行版在大多数使用情况下的表现是相同的。 可以使用任何发行版去学习 Linux 与 UNIX 的特性和其内部工作原理。 发行版之间的根本区别是发行版如何处理软件包更新。 某些版本,例如 Arch Linux 采用滚动更新策略,用了最前沿的软件包(bleeding-edge),但软件可能并不稳定。另外一些发行版(如 Debian,CentOS 或 Ubuntu LTS)其更新策略要保守得多,因此更新的内容会更稳定,但会牺牲一些新功能。我们建议你使用 Debian 或 Ubuntu 来获得简单稳定的台式机和服务器体验。 Mac OS 是介于 Windows 和 Linux 之间的一个操作系统,它有很漂亮的界面。但是,Mac OS 是基于 BSD 而不是 Linux,因此系统的某些部分和命令是不同的。 另一种值得体验的是 FreeBSD。虽然某些程序不能在 FreeBSD 上运行,但与 Linux 相比,BSD 生态系统的碎片化程度要低得多,并且说明文档更加友好。 除了开发 Windows 应用程序或需要使用某些 Windows 系统更好支持的功能(例如对游戏的驱动程序支持)外,我们不建议使用 Windows。 对于双系统,我们认为最有效的是 macOS 的 bootcamp,长期来看,任何其他组合都可能会出现问题,尤其是当你结合了其他功能比如磁盘加密。 ## 使用 Vim 编辑器 VS Emacs 编辑器? 我们三个都使用 vim 作为我们的主要编辑器。但是 Emacs 也是一个不错的选择,你可以两者都尝试,看看那个更适合你。Emacs 不使用 vim 的模式编辑,但是这些功能可以通过 Emacs 插件像 [Evil](https://github.com/emacs-evil/evil) 或 [Doom Emacs](https://github.com/hlissner/doom-emacs) 来实现。 Emacs 的优点是可以用 Lisp 语言进行扩展(Lisp 比 vim 默认的脚本语言 vimscript 要更好用)。 ## 机器学习应用的提示或技巧? 课程的一些经验可以直接用于机器学习程序。 就像许多科学学科一样,在机器学习中,你需要进行一系列实验,并检查哪些数据有效,哪些无效。 你可以使用 Shell 轻松快速地搜索这些实验结果,并且以合理的方式汇总。这意味着需要在限定时间内或使用特定数据集的情况下,检查所有实验结果。通过使用 JSON 文件记录实验的所有相关参数,使用我们在本课程中介绍的工具,这件事情可以变得极其简单。 最后,如果你不使用集群提交你的 GPU 作业,那你应该研究如何使该过程自动化,因为这是一项非常耗时的任务,会消耗你的精力。 ## 还有更多的 Vim 小窍门吗? 
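More tips:

- Plugins - Take the time to explore the plugin landscape. Many good plugins address some of vim's shortcomings or add functionality that composes well with existing vim workflows. Good resources for this are [VimAwesome](https://vimawesome.com/) and other programmers' dotfiles.
- Marks - In vim, you can set a mark with `m<X>` for some letter `X`, and later return to it with `'<X>`. This lets you jump quickly to specific locations within a file or even across files.
- Navigation - `Ctrl+O` and `Ctrl+I` move you backward and forward, respectively, through your recently visited locations.
- Undo tree - Vim has a fairly sophisticated mechanism for tracking changes. Unlike other editors, vim stores a tree of changes, so even if you undo and then make a different change, you can still get back to the original state by navigating the undo tree. Plugins such as [gundo.vim](https://github.com/sjl/gundo.vim) and [undotree](https://github.com/mbbill/undotree) display this tree graphically.
- Undo with time - The `:earlier` and `:later` commands let you navigate a file's history using time references instead of one change at a time.
- [Persistent undo](https://vim.fandom.com/wiki/Using_undo_branches#Persistent_undo) - A built-in vim feature that is disabled by default. It keeps the undo history between vim invocations. Once you set `undofile` and `undodir` in your `.vimrc`, vim stores a per-file history of changes.
- Leader key - The leader key is a special key that is left for users to bind to their own custom commands. The usual pattern is to press and release this key (often the space key) and then press another key to run a particular command. Plugins often use it to add their own functionality; for instance, the UndoTree plugin uses `<Leader> U` to open the undo tree.
- Advanced text objects - Text objects such as searches can also be composed with vim commands. For example, `d/<pattern>` deletes up to the next match of the pattern, and `cgn` changes the next occurrence of the last searched string.

## What is 2FA and why should I use it?

Two-factor authentication (2FA) adds an extra layer of protection to your accounts on top of passwords. To log in, you not only have to know a password, you also have to "prove" in some way that you have access to a hardware device. In the simplest case, this is done by receiving an SMS on your phone (although SMS 2FA has [known issues](https://www.kaspersky.com/blog/2fa-practical-guide/24219/)). We recommend a [U2F](https://en.wikipedia.org/wiki/Universal_2nd_Factor) solution such as a [YubiKey](https://www.yubico.com/).

## Any comments on the differences between web browsers?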
2020 的浏览器现状是,大部分的浏览器都与 Chrome 类似,因为它们都使用同样的引擎(Blink)。Microsoft Edge 同样基于 Blink,而 Safari 则 基于 WebKit(与 Blink 类似的引擎),这些浏览器仅仅是更糟糕的 Chrome 版本。不管是在性能还是可用性上,Chrome 都是一款很不错的浏览器。如果你想要替代品,我们推荐 Firefox。Firefox 与 Chrome 的在各方面不相上下,并且在隐私方面更加出色。 有一款目前还没有完成的叫 Flow 的浏览器,它实现了全新的渲染引擎,有望比现有引擎速度更快。 ================================================ FILE: _2020/security.md ================================================ --- layout: lecture title: "安全和密码学" date: 2020-01-28 ready: true sync: true syncdate: 2025-08-16 video: aspect: 56.25 id: tjwobAmnKTo solution: ready: true url: security-solution --- 去年的 [这节课](/2019/security/) 我们从计算机 _用户_ 的角度探讨了增强隐私保护和安全的方法。 今年我们将关注比如散列函数、密钥生成函数、对称/非对称密码体系这些安全和密码学的概念是如何应用于前几节课所学到的工具(Git 和 SSH)中的。 本课程不能作为计算机系统安全 ([6.858](https://css.csail.mit.edu/6.858/)) 或者 密码学 ([6.857](https://courses.csail.mit.edu/6.857/) 以及 6.875) 的替代。 如果你不是密码学的专家,请不要 [试图创造或者修改加密算法](https://www.schneier.com/blog/archives/2015/05/amateurs_produc.html)。从事和计算机系统安全相关的工作同理。 这节课将对一些基本的概念进行简单(但实用)的说明。 虽然这些说明不足以让你学会如何 _设计_ 安全系统或者加密协议,但我们希望你可以对现在使用的程序和协议有一个大概了解。 # 熵 [熵](https://en.wikipedia.org/wiki/Entropy_(information_theory)) (Entropy) 是不确定性的度量,这很有用,可以用来决定密码的强度。 ![XKCD 936: Password Strength](https://imgs.xkcd.com/comics/password_strength.png) 正如上面的 [XKCD 漫画](https://xkcd.com/936/) 所描述的, "correcthorsebatterystaple" 这个密码比 "Tr0ub4dor&3" 更安全——可是熵是如何量化安全性的呢? 
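One way to make that question concrete is to count what an attacker would have to guess. The sketch below assumes every choice is made uniformly at random, in which case the entropy is log2 of the number of equally likely outcomes. The helper name `entropy_bits` is hypothetical, and the 2048-word list is the size the XKCD comic assumes (11 bits per word), not a figure from the lecture.

```python
import math


def entropy_bits(choices: int, picks: int = 1) -> float:
    """Entropy in bits of `picks` independent, uniformly random selections
    from `choices` equally likely options: picks * log2(choices)."""
    return picks * math.log2(choices)


coin = entropy_bits(2)              # one fair coin toss
die = entropy_bits(6)               # one roll of a six-sided die
passphrase = entropy_bits(2048, 4)  # four words from an assumed 2048-word list

print(f"coin toss: {coin:.2f} bits")
print(f"die roll: {die:.2f} bits")
print(f"four-word passphrase: {passphrase:.2f} bits")
```

Each extra bit doubles the number of guesses a brute-force attacker needs, which is why a few more random words help far more than character substitutions in a single word.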
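Entropy is measured in _bits_. For a value chosen uniformly at random from a discrete set, the entropy is `log_2(n)`, where `n` is the number of possible outcomes. A fair coin toss has 1 bit of entropy. A roll of a (six-sided) die has about 2.58 bits.

In general, assume that the attacker knows the _model_ of the password (minimum length, maximum length, allowed character classes, and so on) but not the randomness used to pick a particular password, such as [dice rolls](https://en.wikipedia.org/wiki/Diceware).

How many bits of entropy are enough depends on your threat model. As the XKCD comic above points out, about 40 bits is enough to resist online guessing, which is limited by network speed and by the application's authentication mechanism. Offline guessing is limited mainly by computing speed, so it calls for a stronger password (for example 80 bits or more).

# Hash functions

A [cryptographic hash function](https://en.wikipedia.org/wiki/Cryptographic_hash_function) maps data of arbitrary size to an output of fixed size, and has some additional properties. A rough specification of a hash function is:

```
hash(value: array<byte>) -> vector<byte, N>  (for some fixed N)
```

[SHA-1](https://en.wikipedia.org/wiki/SHA-1) is a hash function used in Git. It maps inputs of arbitrary size to 160-bit outputs (which can be written as 40 hexadecimal characters). We can try SHA-1 on a few strings with the `sha1sum` command:

```console
$ printf 'hello' | sha1sum
aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
$ printf 'hello' | sha1sum
aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
$ printf 'Hello' | sha1sum
f7ff9e8b7bb2e09b70935a5d785e0cc5d9d0abf0
```

At a high level, a hash function can be thought of as a hard-to-invert, random-looking (but deterministic) function (this is the [ideal model of a hash function](https://en.wikipedia.org/wiki/Random_oracle)). A hash function has the following properties:

- Deterministic: the same input always produces the same output.
- Non-invertible: given a desired output `h`, it is hard to find an input `m` such that `hash(m) = h`.
- Target collision resistant: given an input `m_1`, it is hard to find a different input `m_2` such that `hash(m_1) = hash(m_2)`.
- Collision resistant: it is hard to find any two inputs `m_1` and `m_2` such that `hash(m_1) = hash(m_2)` (this is strictly stronger than target collision resistance).

Note: although SHA-1 still works for certain purposes, it is [no longer considered](https://shattered.io/) a strong cryptographic hash function. The table of [lifetimes of cryptographic hash functions](https://valerieaurora.org/hash.html) shows when various hash functions were found to be weak or broken. Recommending specific hash functions for specific applications is beyond the scope of this lecture. If the choice matters for your work, get formal training in security and cryptography first.

## Applications of cryptographic hash functions

- Content-addressed storage in Git. A [hash function](https://en.wikipedia.org/wiki/Hash_function) is a more general concept (non-cryptographic hash functions exist), so why does Git use a cryptographic one?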
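- A short summary of the contents of a file (message digest). Software such as a Linux ISO can often be downloaded from unofficial (and sometimes less trustworthy) mirrors, so you need a way to confirm that the download matches the official release. The official site usually posts the hash of the installer next to the download link (which points to the mirrors). After downloading from a mirror, you can compare the file's hash with the published one to confirm that the file has not been tampered with.
- [Commitment schemes](https://en.wikipedia.org/wiki/Commitment_scheme). Suppose I want to commit to a value but reveal it only later, for example to toss a fair coin "in my head" without a trusted coin that both parties can see. I can choose a value `r = random()` and share its hash `h = sha256(r)`. You then call heads or tails; we agree that an even `r` means heads and an odd `r` means tails. After you call, I reveal `r` and we see who won. You can check that `sha256(r)` matches the hash `h` I shared earlier to confirm that I did not cheat.

# Key derivation functions

[Key derivation functions](https://en.wikipedia.org/wiki/Key_derivation_function) (KDFs) are a concept related to cryptographic hashes. They are used for several purposes, including producing fixed-length output that can serve as a key in other cryptographic algorithms. KDFs are usually deliberately slow, in order to slow down brute-force attacks.

## Applications of key derivation functions

- Deriving keys from passphrases for use in other cryptographic algorithms, such as symmetric encryption (see below).
- Storing login credentials. Never store plaintext passwords.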
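The right approach is to generate a random [salt](https://en.wikipedia.org/wiki/Salt_(cryptography)) `salt = random()` for each user, then store the salt together with `KDF(password + salt)`, the output of a key derivation function applied to the plaintext password concatenated with the salt.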
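To verify a login attempt, recompute `KDF(input + salt)` from the entered password and the stored salt, and compare the result with the stored value.

# Symmetric cryptography

Hiding the contents of a message is probably the first thing that comes to mind when you think of cryptography. Symmetric cryptography accomplishes this with the following set of functions:

```
keygen() -> key  (this function is randomized)

encrypt(plaintext: array<byte>, key) -> array<byte>  (the ciphertext)
decrypt(ciphertext: array<byte>, key) -> array<byte>  (the plaintext)
```

The encrypt function has the property that, given the output (ciphertext), it is hard to determine the input (plaintext) without the key.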
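To make the three-function interface concrete, here is a toy sketch built on a one-time pad (XOR with a random key as long as the message). That choice is an assumption made purely for illustration, because Python's standard library does not ship AES. Unlike the `keygen()` in the specification, this hypothetical version takes a length argument. A one-time pad is only safe if the key is never reused, so real systems should rely on a vetted cipher from an established library.

```python
import secrets


def keygen(length: int) -> bytes:
    """Return `length` random bytes from the OS CSPRNG (randomized)."""
    return secrets.token_bytes(length)


def encrypt(plaintext: bytes, key: bytes) -> bytes:
    """XOR each plaintext byte with the matching key byte."""
    if len(key) != len(plaintext):
        raise ValueError("a one-time pad needs a key as long as the message")
    return bytes(p ^ k for p, k in zip(plaintext, key))


def decrypt(ciphertext: bytes, key: bytes) -> bytes:
    """XOR is its own inverse, so decryption repeats the same operation."""
    return encrypt(ciphertext, key)


message = b"attack at dawn"
key = keygen(len(message))
ciphertext = encrypt(message, key)

# Round trip: decrypting with the same key recovers the original message.
assert decrypt(ciphertext, key) == message
```

The same shape (`keygen`, `encrypt`, `decrypt`) applies to a real cipher; only the bodies of the functions change.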
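The decrypt function has the obvious correctness property: given a ciphertext and its key, it must return the plaintext, that is, `decrypt(encrypt(m, k), k) = m`.

[AES](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) is a symmetric cryptosystem in wide use today.

## Applications of symmetric cryptography

- Encrypting files for storage in an untrusted cloud service. Combined with a KDF, this lets you encrypt a file with a passphrase: generate `key = KDF(passphrase)`, then store `encrypt(file, key)`.

# Asymmetric cryptography

The term "asymmetric" refers to there being two keys with two different roles. The private key is kept secret, while the public key can be shared publicly without affecting the security of the system (unlike sharing the key in a symmetric cryptosystem).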
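Asymmetric cryptosystems provide the following set of functions, to encrypt/decrypt and to sign/verify:

```
keygen() -> (public key, private key)  (this function is randomized)

encrypt(plaintext: array<byte>, public key) -> array<byte>  (the ciphertext)
decrypt(ciphertext: array<byte>, private key) -> array<byte>  (the plaintext)

sign(message: array<byte>, private key) -> array<byte>  (the signature)
verify(message: array<byte>, signature: array<byte>, public key) -> bool  (whether the signature was produced by the private key matching this public key)
```

The encrypt/decrypt functions have properties similar to their counterparts in symmetric cryptosystems.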
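A message is encrypted with the _public_ key, and given the output (ciphertext), it is hard to determine the input (plaintext) without the _private_ key.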
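The decrypt function has the obvious correctness property: given a ciphertext and the private key, it must return the plaintext, that is, `decrypt(encrypt(m, public key), private key) = m`.

Symmetric and asymmetric encryption can be compared to physical locks. A symmetric cryptosystem is like a door lock: anyone with the key can lock and unlock it. Asymmetric encryption is like a detachable padlock. You can hand the open padlock (the public key) to anyone and keep the only key (the private key). They can then put a message for you in a box and snap the padlock shut, after which only you can open it with the key you kept.

The sign/verify functions have the properties you would hope a handwritten signature has.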
Without the _private_ key, it is hard to produce a signature for which `verify(message, signature, public key)` returns true, regardless of the message being signed.
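For a message signed with the private key, verification with the matching public key always succeeds: `verify(message, sign(message, private key), public key) = true`.

## Applications of asymmetric cryptography

- [PGP email encryption](https://en.wikipedia.org/wiki/Pretty_Good_Privacy). People can post their public keys online, for example on a PGP keyserver or on [Keybase](https://keybase.io/). Anyone can then send them encrypted email.
- Private messaging. Apps such as [Signal](https://signal.org/) and [Keybase](https://keybase.io/) use asymmetric keys to establish private conversations.
- Signing software. Git supports GPG-signed commits and tags. With the developer's published public key, anyone can verify signed software they have downloaded.

## Key distribution

The main challenge for asymmetric cryptography is distributing public keys and mapping them to real-world people or organizations.

Signal's trust model is trust on first use, combined with support for out-of-band, in-person public key exchange (the "safety number" in Signal).

PGP uses a [web of trust](https://en.wikipedia.org/wiki/Web_of_trust). Roughly, to join a web of trust, an existing member must verify my identity out-of-band (for example by checking ID documents) and then sign my public key with their private key. I then become part of the web of trust, and can prove that I am who I claim to be by using the private key that corresponds to the signed public key.

Keybase relies mainly on [social proof](https://keybase.io/blog/chat-apps-softer-than-tofu), along with some other neat ideas.

Each trust model has its merits; we (the instructors) prefer Keybase's model.

# Case studies

## Password managers

Everyone should try to use a password manager, such as [KeePassXC](https://keepassxc.org/), [pass](https://www.passwordstore.org/), or [1Password](https://1password.com). A password manager generates a random, high-entropy password for each site and encrypts them all with a symmetric cipher, using a key derived from your master passphrase with a KDF.

You only need to remember one strong master passphrase, and the password manager takes care of many strong passwords that are never reused. This makes passwords harder to guess and limits the damage to your other accounts when one website is compromised.

## Two-factor authentication

[Two-factor authentication](https://en.wikipedia.org/wiki/Multi-factor_authentication) (2FA) requires both a passphrase ("something you know") and an authenticator ("something you have", such as a [YubiKey](https://www.yubico.com/)) in order to protect against stolen passwords and [phishing](https://en.wikipedia.org/wiki/Phishing) attacks.

## Full disk encryption

Encrypting your laptop's entire disk is a simple and effective way to keep your data from leaking if the device is lost. [cryptsetup + LUKS](https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_a_non-root_file_system) on Linux, [BitLocker](https://fossbytes.com/enable-full-disk-encryption-windows-10/) on Windows, and [FileVault](https://support.apple.com/en-us/HT204837) on macOS all encrypt everything on the disk with a symmetric key that is protected by a passphrase.

## Private messaging

[Signal](https://signal.org/) and [Keybase](https://keybase.io/) use asymmetric encryption to give users end-to-end security.

Obtaining your contacts' public keys is the critical step. For good security, verify Signal or Keybase public keys out-of-band, or rely on the social proofs that Keybase users provide.

## SSH

我们在 [之前的一堂课](/2020/command-line/#remote-machines) 讨论了 SSH 和 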
SSH 密钥的使用。那么我们今天从密码学的角度来分析一下它们。 当你运行 `ssh-keygen` 命令,它会生成一个非对称密钥对:公钥和私钥 `(public_key, private_key)`。 生成过程中使用的随机数由系统提供的熵决定。这些熵可以来源于硬件事件(hardware events)等。 公钥最终会被分发,它可以直接明文存储。 但是为了防止泄露,私钥必须加密存储。`ssh-keygen` 命令会提示用户输入一个密码,并将它输入密钥生成函数 产生一个密钥。最终,`ssh-keygen` 使用对称加密算法和这个密钥加密私钥。 在实际运用中,当服务器已知用户的公钥(存储在 `.ssh/authorized_keys` 文件中,一般在用户 HOME 目录下),尝试连接的客户端可以使用非对称签名来证明用户的身份——这便是 [挑战应答方式](https://en.wikipedia.org/wiki/Challenge%E2%80%93response_authentication)。 简单来说,服务器选择一个随机数字发送给客户端。客户端使用用户私钥对这个数字信息签名后返回服务器。 服务器随后使用 `.ssh/authorized_keys` 文件中存储的用户公钥来验证返回的信息是否由所对应的私钥所签名。这种验证方式可以有效证明试图登录的用户持有所需的私钥。 {% comment %} extra topics, if there's time security concepts, tips - biometrics - HTTPS {% endcomment %} # 资源 - [去年的讲稿](/2019/security/): 更注重于计算机用户可以如何增强隐私保护和安全 - [Cryptographic Right Answers](https://latacora.micro.blog/2018/04/03/cryptographic-right-answers.html): 解答了在一些应用环境下“应该使用什么加密?”的问题 # 课后练习 [习题解答]({{site.url}}/{{site.solution_url}}/{{page.solution.url}}) 1. **熵** 1. 假设一个密码是由四个小写的单词拼接组成,每个单词都是从一个含有 10 万单词的字典中随机选择,且每个单词选中的概率相同。 一个符合这样构造的例子是 `correcthorsebatterystaple`。这个密码有多少比特的熵? 2. 假设另一个密码是用八个随机的大小写字母或数字组成。一个符合这样构造的例子是 `rg8Ql34g`。这个密码又有多少比特的熵? 3. 哪一个密码更强? 4. 假设一个攻击者每秒可以尝试 1 万个密码,这个攻击者需要多久可以分别破解上述两个密码? 2. **密码散列函数** 从 [Debian 镜像站](https://www.debian.org/CD/http-ftp/) 下载一个光盘映像(比如这个来自阿根廷镜像站的 [映像](http://debian.xfree.com.ar/debian-cd/10.2.0/amd64/iso-cd/debian-10.2.0-amd64-netinst.iso))。使用 `sha256sum` 命令对比下载映像的哈希值和官方 Debian 站公布的哈希值。如果你下载了上面的映像,官方公布的哈希值可以参考 [这个文件](https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/SHA256SUMS)。 3. **对称加密** 使用 [OpenSSL](https://www.openssl.org/) 的 AES 模式加密一个文件: `openssl aes-256-cbc -salt -in {源文件名} -out {加密文件名}`。 使用 `cat` 或者 `hexdump` 对比源文件和加密的文件,再用 `openssl aes-256-cbc -d -in {加密文件名} -out {解密文件名}` 命令解密刚刚加密的文件。最后使用` cmp`命令确认源文件和解密后的文件内容相同。 4. **非对称加密** 1. 
在你自己的电脑上使用更安全的 [ED25519 算法](https://wiki.archlinux.org/index.php/SSH_keys#Ed25519) 生成一组[SSH 密钥对](https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys--2)。为了确保私钥不使用时的安全,一定使用密码加密你的私钥。 2. [配置 GPG](https://www.digitalocean.com/community/tutorials/how-to-use-gpg-to-encrypt-and-sign-messages)。 3. 给 Anish 发送一封加密的电子邮件([Anish 的公钥](https://keybase.io/anish))。 4. 使用 `git commit -S` 命令签名一个 Git 提交,并使用 `git show --show-signature` 命令验证这个提交的签名。或者,使用 `git tag -s` 命令签名一个 Git 标签,并使用 `git tag -v` 命令验证标签的签名。 ================================================ FILE: _2020/shell-tools.md ================================================ --- layout: lecture title: "Shell 工具和脚本" date: 2020-01-14 ready: true sync: true syncdate: 2025-08-16 video: aspect: 56.25 id: kgII-YWo3Zw solution: ready: true url: shell-tools-solution --- 在这节课中,我们将会展示 bash 作为脚本语言的一些基础操作,以及几种最常用的 shell 工具。 # Shell 脚本 到目前为止,我们已经学习了如何在 shell 中执行命令,并使用管道将命令组合使用。但是,很多情况下我们需要执行一系列的操作并使用条件或循环这样的控制流。 shell 脚本的复杂性进一步提高。 大多数 shell 都有自己的一套脚本语言,包括变量、控制流和自己的语法。shell 脚本与其他脚本语言不同之处在于,shell 脚本针对 shell 所从事的相关工作进行了优化。因此,创建命令流程(pipelines)、将结果保存到文件、从标准输入中读取输入,这些都是 shell 脚本中的原生操作,这让它比通用的脚本语言更易用。本节中,我们会专注于 bash 脚本,因为它最流行,应用更为广泛。 在 bash 中为变量赋值的语法是 `foo=bar`,访问变量中存储的数值,其语法为 `$foo`。 需要注意的是,`foo = bar` (使用空格隔开)是不能正确工作的,因为解释器会调用程序 `foo` 并将 `=` 和 `bar` 作为参数。 总的来说,在 shell 脚本中使用空格会起到分割参数的作用,有时候可能会造成混淆,请务必多加检查。 Bash 中的字符串通过 `'` 和 `"` 分隔符来定义,但是它们的含义并不相同。以 `'` 定义的字符串为原义字符串,其中的变量不会被转义,而 `"` 定义的字符串会将变量值进行替换。 ```bash foo=bar echo "$foo" # 打印 bar echo '$foo' # 打印 $foo ``` 和其他大多数的编程语言一样,`bash` 也支持 `if`, `case`, `while` 和 `for` 这些控制流关键字。同样地, `bash` 也支持函数,它可以接受参数并基于参数进行操作。下面这个函数是一个例子,它会创建一个文件夹并使用 `cd` 进入该文件夹。 ```bash mcd () { mkdir -p "$1" cd "$1" } ``` 这里 `$1` 是脚本的第一个参数。与其他脚本语言不同的是,bash 使用了很多特殊的变量来表示参数、错误代码和相关变量。下面列举了其中一些变量,更完整的列表可以参考 [这里](https://www.tldp.org/LDP/abs/html/special-chars.html)。 - `$0` - 脚本名 - `$1` 到 `$9` - 脚本的参数。 `$1` 是第一个参数,依此类推。 - `$@` - 所有参数 - `$#` - 参数个数 - `$?` - 前一个命令的返回值 - `$$` - 当前脚本的进程识别码 - `!!` - 
完整的上一条命令,包括参数。常见应用:当你因为权限不足执行命令失败时,可以使用 `sudo !!` 再尝试一次。 - `$_` - 上一条命令的最后一个参数。如果你正在使用的是交互式 shell,你可以通过按下 `Esc` 之后键入 . 来获取这个值。 命令通常使用 `STDOUT` 来返回输出值,使用 `STDERR` 来返回错误及错误码,便于脚本以更加友好的方式报告错误。 返回码或退出状态是脚本/命令之间交流执行状态的方式。返回值 0 表示正常执行,其他所有非 0 的返回值都表示有错误发生。 退出码可以搭配 `&&`(与操作符)和 `||`(或操作符)使用,用来进行条件判断,决定是否执行其他程序。它们都属于 [短路运算符](https://en.wikipedia.org/wiki/Short-circuit_evaluation)(short-circuiting) 同一行的多个命令可以用 `;` 分隔。程序 `true` 的返回码永远是 `0`,`false` 的返回码永远是 `1`。让我们看几个例子 ```bash false || echo "Oops, fail" # Oops, fail true || echo "Will not be printed" # true && echo "Things went well" # Things went well false && echo "Will not be printed" # false ; echo "This will always run" # This will always run ``` 另一个常见的模式是以变量的形式获取一个命令的输出,这可以通过 _命令替换_(_command substitution_)实现。 当您通过 `$( CMD )` 这样的方式来执行 `CMD` 这个命令时,它的输出结果会替换掉 `$( CMD )` 。例如,如果执行 `for file in $(ls)` ,shell 首先将调用 `ls` ,然后遍历得到的这些返回值。还有一个冷门的类似特性是 _进程替换_(_process substitution_), `<( CMD )` 会执行 `CMD` 并将结果输出到一个临时文件中,并将 `<( CMD )` 替换成临时文件名。这在我们希望返回值通过文件而不是 STDIN 传递时很有用。例如, `diff <(ls foo) <(ls bar)` 会显示文件夹 `foo` 和 `bar` 中文件的区别。 说了很多,现在该看例子了,下面这个例子展示了一部分上面提到的特性。这段脚本会遍历我们提供的参数,使用 `grep` 搜索字符串 `foobar`,如果没有找到,则将其作为注释追加到文件中。 ```bash #!/bin/bash echo "Starting program at $(date)" # date会被替换成日期和时间 echo "Running program $0 with $# arguments with pid $$" for file in "$@"; do grep foobar "$file" > /dev/null 2> /dev/null # 如果模式没有找到,则grep退出状态为 1 # 我们将标准输出流和标准错误流重定向到Null,因为我们并不关心这些信息 if [[ $? 
-ne 0 ]]; then echo "File $file does not have any foobar, adding one" echo "# foobar" >> "$file" fi done ``` 在条件语句中,我们比较 `$?` 是否等于 0。 Bash 实现了许多类似的比较操作,您可以查看 [`test 手册`](https://man7.org/linux/man-pages/man1/test.1.html)。 在 bash 中进行比较时,尽量使用双方括号 `[[ ]]` 而不是单方括号 `[ ]`,这样会降低犯错的几率,尽管这样并不能兼容 `sh`。 更详细的说明参见 [这里](http://mywiki.wooledge.org/BashFAQ/031)。 当执行脚本时,我们经常需要提供形式类似的参数。bash 使我们可以轻松的实现这一操作,它可以基于文件扩展名展开表达式。这一技术被称为 shell 的 _通配_(_globbing_) - 通配符 - 当你想要利用通配符进行匹配时,你可以分别使用 `?` 和 `*` 来匹配一个或任意个字符。例如,对于文件 `foo`, `foo1`, `foo2`, `foo10` 和 `bar`, `rm foo?` 这条命令会删除 `foo1` 和 `foo2` ,而 `rm foo*` 则会删除除了 `bar` 之外的所有文件。 - 花括号 `{}` - 当你有一系列的指令,其中包含一段公共子串时,可以用花括号来自动展开这些命令。这在批量移动或转换文件时非常方便。 ```bash convert image.{png,jpg} # 会展开为 convert image.png image.jpg cp /path/to/project/{foo,bar,baz}.sh /newpath # 会展开为 cp /path/to/project/foo.sh /path/to/project/bar.sh /path/to/project/baz.sh /newpath # 也可以结合通配使用 mv *{.py,.sh} folder # 会移动所有 *.py 和 *.sh 文件 mkdir foo bar # 下面命令会创建 foo/a, foo/b, ... foo/h, bar/a, bar/b, ... 
bar/h 这些文件 touch {foo,bar}/{a..h} touch foo/x bar/y # 比较文件夹 foo 和 bar 中包含文件的不同 diff <(ls foo) <(ls bar) # 输出 # < x # --- # > y ``` 编写 `bash` 脚本有时候会很别扭和反直觉。例如 [shellcheck](https://github.com/koalaman/shellcheck) 这样的工具可以帮助你定位 sh/bash 脚本中的错误。 注意,脚本并不一定只有用 bash 写才能在终端里调用。比如说,这是一段 Python 脚本,作用是将输入的参数倒序输出: ```python #!/usr/local/bin/python import sys for arg in reversed(sys.argv[1:]): print(arg) ``` 内核知道去用 python 解释器而不是 shell 命令来运行这段脚本,是因为脚本的开头第一行的 [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix))。 在 `shebang` 行中使用 [`env`](https://man7.org/linux/man-pages/man1/env.1.html) 命令是一种好的实践,它会利用环境变量中的程序来解析该脚本,这样就提高了您的脚本的可移植性。`env` 会利用我们第一节讲座中介绍过的 `PATH` 环境变量来进行定位。 例如,使用了 `env` 的 shebang 看上去是这样的 `#!/usr/bin/env python`。 shell 函数和脚本有如下一些不同点: - 函数只能与 shell 使用相同的语言,脚本可以使用任意语言。因此在脚本中包含 `shebang` 是很重要的。 - 函数仅在定义时被加载,脚本会在每次被执行时加载。这让函数的加载比脚本略快一些,但每次修改函数定义,都要重新加载一次。 - 函数会在当前的 shell 环境中执行,脚本会在单独的进程中执行。因此,函数可以对环境变量进行更改,比如改变当前工作目录,脚本则不行。使用 [`export`](https://man7.org/linux/man-pages/man1/export.1p.html) 导出的环境变量会以传值的方式传递给脚本。 - 与其他程序语言一样,函数可以提高代码模块性、代码复用性并创建清晰性的结构。shell 脚本中往往也会包含它们自己的函数定义。 # Shell 工具 ## 查看命令如何使用 看到这里,您可能会有疑问,我们应该如何为特定的命令找到合适的标记呢?例如 `ls -l`, `mv -i` 和 `mkdir -p`。更普遍的是,给您一个命令行,您应该怎样了解如何使用这个命令行并找出它的不同的选项呢? 
一般来说,您可能会先去网上搜索答案,但是,UNIX 可比 StackOverflow 出现的早,因此我们的系统里其实早就包含了可以获取相关信息的方法。 在上一节中我们介绍过,最常用的方法是为对应的命令行添加 `-h` 或 `--help` 标记。另外一个更详细的方法则是使用 `man` 命令。[`man`](https://man7.org/linux/man-pages/man1/man.1.html) 命令是手册(manual)的缩写,它提供了命令的用户手册。 例如,`man rm` 会输出命令 `rm` 的说明,同时还有其标记列表,包括之前我们介绍过的 `-i`。 事实上,目前我们给出的所有命令的说明链接,都是网页版的 Linux 命令手册。即使是您安装的第三方命令,前提是开发者编写了手册并将其包含在了安装包中。在交互式的、基于字符处理的终端窗口中,一般也可以通过 `:help` 命令或键入 `?` 来获取帮助。 有时候手册内容太过详实,让我们难以在其中查找哪些最常用的标记和语法。 [TLDR pages](https://tldr.sh/) 是一个很不错的替代品,它提供了一些案例,可以帮助您快速找到正确的选项。 例如,自己就常常在 tldr 上搜索 [`tar`](https://tldr.ostera.io/tar) 和 [`ffmpeg`](https://tldr.ostera.io/ffmpeg) 的用法。 ## 查找文件 程序员们面对的最常见的重复任务就是查找文件或目录。所有的类 UNIX 系统都包含一个名为 [`find`](https://man7.org/linux/man-pages/man1/find.1.html) 的工具,它是 shell 上用于查找文件的绝佳工具。`find` 命令会递归地搜索符合条件的文件,例如: ```bash # 查找所有名称为src的文件夹 find . -name src -type d # 查找所有文件夹路径中包含test的python文件 find . -path '*/test/*.py' -type f # 查找前一天修改的所有文件 find . -mtime -1 # 查找所有大小在500k至10M的tar.gz文件 find . -size +500k -size -10M -name '*.tar.gz' ``` 除了列出所寻找的文件之外,find 还能对所有查找到的文件进行操作。这能极大地简化一些单调的任务。 ```bash # 删除全部扩展名为.tmp 的文件 find . -name '*.tmp' -exec rm {} \; # 查找全部的 PNG 文件并将其转换为 JPG find . 
-name '*.png' -exec magick {} {}.jpg \; ``` 尽管 `find` 用途广泛,它的语法却比较难以记忆。例如,为了查找满足模式 `PATTERN` 的文件,您需要执行 `find -name '*PATTERN*'` (如果您希望模式匹配时是不区分大小写,可以使用 `-iname` 选项) 您当然可以使用 alias 设置别名来简化上述操作,但 shell 的哲学之一便是寻找(更好用的)替代方案。 记住,shell 最好的特性就是您只是在调用程序,因此您只要找到合适的替代程序即可(甚至自己编写)。 例如,[`fd`](https://github.com/sharkdp/fd) 就是一个更简单、更快速、更友好的程序,它可以用来作为 `find` 的替代品。它有很多不错的默认设置,例如输出着色、默认支持正则匹配、支持 unicode 并且我认为它的语法更符合直觉。以模式 `PATTERN` 搜索的语法是 `fd PATTERN`。 大多数人都认为 `find` 和 `fd` 已经很好用了,但是有的人可能想知道,我们是不是可以有更高效的方法,例如不要每次都搜索文件而是通过编译索引或建立数据库的方式来实现更加快速地搜索。 这就要靠 [`locate`](https://man7.org/linux/man-pages/man1/locate.1.html) 了。 `locate` 使用一个由 [`updatedb`](https://man7.org/linux/man-pages/man1/updatedb.1.html) 负责更新的数据库,在大多数系统中 `updatedb` 都会通过 [`cron`](https://man7.org/linux/man-pages/man8/cron.8.html) 每日更新。这便需要我们在速度和时效性之间作出权衡。而且,`find` 和类似的工具可以通过别的属性比如文件大小、修改时间或是权限来查找文件,`locate` 则只能通过文件名。 [这里](https://unix.stackexchange.com/questions/60205/locate-vs-find-usage-pros-and-cons-of-each-other) 有一个更详细的对比。 ## 查找代码 查找文件是很有用的技能,但是很多时候您的目标其实是查看文件的内容。一个最常见的场景是您希望查找具有某种模式的全部文件,并找它们的位置。 为了实现这一点,很多类 UNIX 的系统都提供了 [`grep`](https://man7.org/linux/man-pages/man1/grep.1.html) 命令,它是用于对输入文本进行匹配的通用工具。它是一个非常重要的 shell 工具,我们会在后续的数据清理课程中深入的探讨它。 `grep` 有很多选项,这也使它成为一个非常全能的工具。其中我经常使用的有 `-C` :获取查找结果的上下文(Context);`-v` 将对结果进行反选(Invert),也就是输出不匹配的结果。举例来说, `grep -C 5` 会输出匹配结果前后五行。当需要搜索大量文件的时候,使用 `-R` 会递归地进入子目录并搜索所有的文本文件。 但是,我们有很多办法可以对 `grep -R` 进行改进,例如使其忽略 `.git` 文件夹,使用多 CPU 等等。 因此也出现了很多它的替代品,包括 [ack](https://beyondgrep.com/), [ag](https://github.com/ggreer/the_silver_searcher) 和 [rg](https://github.com/BurntSushi/ripgrep)。它们都特别好用,但是功能也都差不多,我比较常用的是 ripgrep (`rg`) ,因为它速度快,而且用法非常符合直觉。例子如下: ```bash # 查找所有使用了 requests 库的文件 rg -t py 'import requests' # 查找所有没有写 shebang 的文件(包含隐藏文件) rg -u --files-without-match "^#\!" 
# 查找所有的foo字符串,并打印其之后的5行 rg foo -A 5 # 打印匹配的统计信息(匹配的行和文件的数量) rg --stats PATTERN ``` 与 `find`/`fd` 一样,重要的是你要知道有些问题使用合适的工具就会迎刃而解,而具体选择哪个工具则不是那么重要。 ## 查找 shell 命令 目前为止,我们已经学习了如何查找文件和代码,但随着你使用 shell 的时间越来越久,您可能想要找到之前输入过的某条命令。首先,按向上的方向键会显示你使用过的上一条命令,继续按上键则会遍历整个历史记录。 `history` 命令允许您以程序员的方式来访问 shell 中输入的历史命令。这个命令会在标准输出中打印 shell 中的历史命令。如果我们要搜索历史记录,则可以利用管道将输出结果传递给 `grep` 进行模式搜索。 `history | grep find` 会打印包含 find 子串的命令。 对于大多数的 shell 来说,您可以使用 `Ctrl+R` 对命令历史记录进行回溯搜索。敲 `Ctrl+R` 后您可以输入子串来进行匹配,查找历史命令行。 反复按下就会在所有搜索结果中循环。在 [zsh](https://github.com/zsh-users/zsh-history-substring-search) 中,使用方向键上或下也可以完成这项工作。 `Ctrl+R` 可以配合 [fzf](https://github.com/junegunn/fzf/wiki/Configuring-shell-key-bindings#ctrl-r) 使用。`fzf` 是一个通用的模糊查找工具,它可以和很多命令一起使用。这里我们可以对历史命令进行模糊查找并将结果以赏心悦目的格式输出。 另外一个和历史命令相关的技巧我喜欢称之为 **基于历史的自动补全**。 这一特性最初是由 [fish](https://fishshell.com/) shell 创建的,它可以根据您最近使用过的开头相同的命令,动态地对当前的 shell 命令进行补全。这一功能在 [zsh](https://github.com/zsh-users/zsh-autosuggestions) 中也可以使用,它可以极大的提高用户体验。 你可以修改 shell history 的行为,例如,如果在命令的开头加上一个空格,它就不会被加进 shell 记录中。当你输入包含密码或是其他敏感信息的命令时会用到这一特性。 为此你需要在 `.bashrc` 中添加 `HISTCONTROL=ignorespace` 或者向 `.zshrc` 添加 `setopt HIST_IGNORE_SPACE`。 如果你不小心忘了在前面加空格,可以通过编辑 `.bash_history` 或 `.zhistory` 来手动地从历史记录中移除那一项。 ## 文件夹导航 之前对所有操作我们都默认一个前提,即您已经位于想要执行命令的目录下,但是如何才能高效地在目录间随意切换呢?有很多简便的方法可以做到,比如设置 alias,使用 [ln -s](https://man7.org/linux/man-pages/man1/ln.1.html) 创建符号连接等。而开发者们已经想到了很多更为精妙的解决方案。 由于本课程的目的是尽可能对你的日常习惯进行优化。因此,我们可以使用 [`fasd`](https://github.com/clvv/fasd) 和 [autojump](https://github.com/wting/autojump) 这两个工具来查找最常用或最近使用的文件和目录。 Fasd 基于 [_frecency_ ](https://developer.mozilla.org/en-US/docs/Mozilla/Tech/Places/Frecency_algorithm) 对文件和文件排序,也就是说它会同时针对频率(_frequency_)和时效(_recency_)进行排序。默认情况下,`fasd` 使用命令 `z` 帮助我们快速切换到最常访问的目录。例如, 如果您经常访问 `/home/user/files/cool_project` 目录,那么可以直接使用 `z cool` 跳转到该目录。对于 autojump,则使用 `j cool` 代替即可。 还有一些更复杂的工具可以用来概览目录结构,例如 [`tree`](https://linux.die.net/man/1/tree), [`broot`](https://github.com/Canop/broot) 或更加完整的文件管理器,例如 
[`nnn`](https://github.com/jarun/nnn) 或 [`ranger`](https://github.com/ranger/ranger)。 # 课后练习 [习题解答]({{site.url}}/{{site.solution_url}}/{{page.solution.url}}) 1. 阅读 [`man ls`](https://man7.org/linux/man-pages/man1/ls.1.html) ,然后使用 `ls` 命令进行如下操作: - 所有文件(包括隐藏文件) - 文件打印以人类可以理解的格式输出 (例如,使用 454M 而不是 454279954) - 文件以最近修改顺序排序 - 以彩色文本显示输出结果 典型输出如下: ``` -rw-r--r-- 1 user group 1.1M Jan 14 09:53 baz drwxr-xr-x 5 user group 160 Jan 14 09:53 . -rw-r--r-- 1 user group 514 Jan 14 06:42 bar -rw-r--r-- 1 user group 106M Jan 13 12:12 foo drwx------+ 47 user group 1.5K Jan 12 18:08 .. ``` 2. 编写两个 bash 函数 `marco` 和 `polo` 执行下面的操作。 每当你执行 `marco` 时,当前的工作目录应当以某种形式保存,当执行 `polo` 时,无论现在处在什么目录下,都应当 `cd` 回到当时执行 `marco` 的目录。 为了方便 debug,你可以把代码写在单独的文件 `marco.sh` 中,并通过 `source marco.sh` 命令,(重新)加载函数。 3. 假设您有一个命令,它很少出错。因此为了在出错时能够对其进行调试,需要花费大量的时间重现错误并捕获输出。 编写一段 bash 脚本,运行如下的脚本直到它出错,将它的标准输出和标准错误流记录到文件,并在最后输出所有内容。 加分项:报告脚本在失败前共运行了多少次。 ```bash #!/usr/bin/env bash n=$(( RANDOM % 100 )) if [[ n -eq 42 ]]; then echo "Something went wrong" >&2 echo "The error was using magic numbers" exit 1 fi echo "Everything went according to plan" ``` 4. 本节课我们讲解的 `find` 命令中的 `-exec` 参数非常强大,它可以对我们查找的文件进行操作。但是,如果我们要对所有文件进行操作呢?例如创建一个 zip 压缩文件?我们已经知道,命令行可以从参数或标准输入接受输入。在用管道连接命令时,我们将标准输出和标准输入连接起来,但是有些命令,例如 `tar` 则需要从参数接受输入。这里我们可以使用 [`xargs`](https://man7.org/linux/man-pages/man1/xargs.1.html) 命令,它可以使用标准输入中的内容作为参数。 例如 `ls | xargs rm` 会删除当前目录中的所有文件。 您的任务是编写一个命令,它可以递归地查找文件夹中所有的 HTML 文件,并将它们压缩成 zip 文件。注意,即使文件名中包含空格,您的命令也应该能够正确执行(提示:查看 `xargs` 的参数 `-d`,译注:MacOS 上的 `xargs` 没有 `-d`,[查看这个 issue](https://github.com/missing-semester/missing-semester/issues/93)) {% comment %} find . 
-type f -name "*.html" | xargs -d '\n' tar -cvzf archive.tar.gz {% endcomment %} 如果您使用的是 MacOS,请注意默认的 BSD `find` 与 [GNU coreutils](https://en.wikipedia.org/wiki/List_of_GNU_Core_Utilities_commands) 中的是不一样的。你可以为 `find` 添加 `-print0` 选项,并为 `xargs` 添加 `-0` 选项。作为 Mac 用户,您需要注意 mac 系统自带的命令行工具和 GNU 中对应的工具是有区别的;如果你想使用 GNU 版本的工具,也可以使用 [brew 来安装](https://formulae.brew.sh/formula/coreutils)。 5. (进阶)编写一个命令或脚本递归的查找文件夹中最近修改的文件。更通用的做法,你可以按照最近的修改时间列出文件吗? ================================================ FILE: _2020/version-control.md ================================================ --- layout: lecture title: "版本控制(Git)" date: 2020-01-22 ready: true sync: true syncdate: 2025-08-16 video: aspect: 56.25 id: 2sjqTHE0zok solution: ready: true url: version-control-solution --- 版本控制系统 (VCSs) 是一类用于追踪源代码(或其他文件、文件夹)改动的工具。顾名思义,这些工具可以帮助我们管理代码的修改历史;不仅如此,它还可以让协作编码变得更方便。VCS 通过一系列的快照将某个文件夹及其内容保存了起来,每个快照都包含了顶级目录中所有的文件或文件夹的完整状态。同时它还维护了快照创建者的信息以及每个快照的相关信息等等。 为什么说版本控制系统非常有用?即使您只是一个人进行编程工作,它也可以帮您创建项目的快照,记录每个改动的目的、基于多分支并行开发等等。和别人协作开发时,它更是一个无价之宝,您可以看到别人对代码进行的修改,同时解决由于并行开发引起的冲突。 现代的版本控制系统可以帮助您轻松地(甚至自动地)回答以下问题: - 当前模块是谁编写的? - 这个文件的这一行是什么时候被编辑的?是谁作出的修改?修改原因是什么呢? - 最近的 1000 个版本中,何时/为什么导致了单元测试失败? 
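Although many version control systems exist, **Git** is the de facto standard. This [XKCD comic](https://xkcd.com/1597/) captures Git's reputation:

![xkcd 1597](https://imgs.xkcd.com/comics/git.png)

Because Git's interface is a leaky abstraction, learning Git top-down (starting from its command-line interface) can be very confusing. It is possible to memorize a handful of commands, treat them as magic incantations, and follow the approach in the comic above whenever something goes wrong.

While Git admittedly has an ugly interface, its underlying design and ideas are elegant. An ugly interface has to be memorized, but an elegant design can be understood. For this reason, we explain Git bottom-up, starting with its data model and covering the command-line interface later. Once you understand the data model, the commands are much easier to understand in terms of how they manipulate it.

# Git's data model

There are many approaches you could take to version control. Git has a well-thought-out model that supports all the features version control needs, such as maintaining history, supporting branches, and enabling collaboration.

## Snapshots

Git models the history of a collection of files and folders within some top-level directory as a series of snapshots. In Git terminology, a file is called a "blob", and it is just a bunch of bytes. A directory is called a "tree", and it maps names to blobs or trees (so directories can contain other directories). A snapshot is the top-level tree that is being tracked. For example, we might have a tree like this:

```
<root> (tree)
|
+- foo (tree)
|  |
|  + bar.txt (blob, contents = "hello world")
|
+- baz.txt (blob, contents = "git is wonderful")
```

The top-level tree contains two elements: a tree "foo" (which itself contains one element, a blob "bar.txt") and a blob "baz.txt".

## Modeling history: relating snapshots

How should a version control system relate snapshots? The simplest model is a linear history, a list of snapshots in time order. For many reasons, Git does not use such a model.

In Git, a history is a directed acyclic graph (DAG) of snapshots. That may sound like a fancy math term, but all it means is that each snapshot in Git refers to a set of "parents", the snapshots that preceded it. It is a set of parents rather than a single parent because a snapshot may descend from several parents, for example after merging two parallel branches.

Git calls these snapshots "commits". A visualization of a commit history might look like this:

```
o <-- o <-- o <-- o
            ^
             \
              --- o <-- o
```

In the ASCII art above, each `o` is a commit (snapshot). The arrows point to the parent of each commit (it is a "comes before" relation, not "comes after"). After the third commit, the history forks into two separate branches. This might correspond, for example, to two independent features being developed in parallel. Later, these branches may be merged to create a new commit that incorporates both features. That produces a new history that looks like this, with the newly created merge commit shown in bold: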

<pre class="highlight">
<code>
o <-- o <-- o <-- o <---- <strong>o</strong>
            ^            /
             \          v
              --- o <-- o
</code>
</pre>
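
The history drawn above can be modelled in a few lines of Python. This is only an illustrative sketch: the `Commit` class, its fields, and the `ancestors` helper are hypothetical names, not how Git stores commits internally. It shows that a commit points at its parents and that a merge commit simply has two of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """A snapshot label plus links to the commits it was built on."""
    name: str
    parents: tuple["Commit", ...] = ()


def ancestors(commit: Commit) -> set[str]:
    """Collect the names of every commit reachable through parent links."""
    seen: set[str] = set()
    stack = list(commit.parents)
    while stack:
        current = stack.pop()
        if current.name not in seen:
            seen.add(current.name)
            stack.extend(current.parents)
    return seen


# Linear history c1 <-- c2 <-- c3, then two branches fork from c3.
c1 = Commit("c1")
c2 = Commit("c2", (c1,))
c3 = Commit("c3", (c2,))
c4 = Commit("c4", (c3,))
f1 = Commit("f1", (c3,))
f2 = Commit("f2", (f1,))

# The merge commit records both branch tips as its parents.
merge = Commit("merge", (c4, f2))

print(sorted(ancestors(merge)))
```

Because every link points backwards in time, walking the parents of the merge commit reaches every earlier commit on both branches.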

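Commits in Git are immutable. This does not mean that mistakes cannot be corrected; it means that "edits" to the commit history actually create entirely new commits, and references (see below) are updated to point to the new ones.

## Data model, as pseudocode

It may be clearer to see Git's data model written down in pseudocode:

```
// a file is a bunch of bytes
type blob = array<byte>

// a directory contains named files and directories
type tree = map<string, tree | blob>

// a commit has parents, metadata, and the top-level tree
type commit = struct {
    parents: array<commit>
    author: string
    message: string
    snapshot: tree
}
```

It is a clean, simple model of history.

## Objects and content-addressing

An "object" is a blob, tree, or commit:

```
type object = blob | tree | commit
```

In Git's data store, all objects are content-addressed by their [SHA-1 hash](https://en.wikipedia.org/wiki/SHA-1).

```
objects = map<string, object>

def store(object):
    id = sha1(object)
    objects[id] = object

def load(id):
    return objects[id]
```

Blobs, trees, and commits are unified in this way: they are all objects. When they reference other objects, they do not actually contain them in their on-disk representation; they store only the hash as a reference.

For example, the tree for the example directory structure [above](#snapshots) (visualized with `git cat-file -p 698281bc680d1995c5f4caaf3359721a5a58d48d`) looks like this:

```
100644 blob 4448adbf7ecd394f42ae135bbeed9676e894af85    baz.txt
040000 tree c68d233a33c5c06e0340e4c224f0afca87c8ce87    foo
```

The tree itself contains pointers to its contents, `baz.txt` (a blob) and `foo` (a tree). If we look at the contents addressed by the hash for baz.txt with `git cat-file -p 4448adbf7ecd394f42ae135bbeed9676e894af85`, we get the following:

```
git is wonderful
```

## References

Now all snapshots can be identified by their SHA-1 hashes. That is inconvenient, because humans are not good at remembering strings of 40 hexadecimal characters.

Git's solution is to give these hashes human-readable names, called "references". References are pointers to commits. Unlike objects, which are immutable, references are mutable (they can be updated to point to a new commit). For example, the `master` reference usually points to the latest commit on the main branch of development.

```
references = map<string, string>

def update_reference(name, id):
    references[name] = id

def read_reference(name):
    return references[name]

def load_reference(name_or_id):
    if name_or_id in references:
        return load(references[name_or_id])
    else:
        return load(name_or_id)
```

With this, Git can use human-readable names like "master" to refer to a particular commit in the history, instead of a long hexadecimal string.

One detail is that we often want a notion of "where we currently are" in the history, so that when we take a new snapshot, we know what it is relative to (how to set its `parents` field). In Git, that "where we currently are" is a special reference called "HEAD".

## Repositories

Finally, we can define (roughly) what a Git _repository_ is: it is the data `objects` and `references`.

On disk, all Git stores are objects and references; that is all there is to Git's data model. Every `git` command maps to some manipulation of the commit DAG by adding objects and adding, updating, or removing references.

Whenever you type a command, think about what manipulation it makes to the underlying graph data structure. Conversely, if you want to make a particular change to the commit DAG, for example "discard uncommitted changes and make the 'master' ref point to commit `5d83f9e`", there is probably a command to do it (in this case, `git checkout master; git reset --hard 5d83f9e`).

# Staging area

Git 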
中还包括一个和数据模型完全不相关的概念,但它确是创建提交的接口的一部分。 就上面介绍的快照系统来说,您也许会期望它的实现里包括一个 “创建快照” 的命令,该命令能够基于当前工作目录的当前状态创建一个全新的快照。有些版本控制系统确实是这样工作的,但 Git 不是。我们希望简洁的快照,而且每次从当前状态创建快照可能效果并不理想。例如,考虑如下场景,您开发了两个独立的特性,然后您希望创建两个独立的提交,其中第一个提交仅包含第一个特性,而第二个提交仅包含第二个特性。或者,假设您在调试代码时添加了很多打印语句,然后您仅仅希望提交和修复 bug 相关的代码而丢弃所有的打印语句。 Git 处理这些场景的方法是使用一种叫做 “暂存区(staging area)”的机制,它允许您指定下次快照中要包括那些改动。 # Git 的命令行接口 为了避免重复信息,我们将不会详细解释以下命令行。强烈推荐您阅读 [Pro Git 中文版](https://git-scm.com/book/zh/v2) 或可以观看本讲座的视频来学习。 ## 基础 {% comment %} The `git init` command initializes a new Git repository, with repository metadata being stored in the `.git` directory: ```console $ mkdir myproject $ cd myproject $ git init Initialized empty Git repository in /home/missing-semester/myproject/.git/ $ git status On branch master No commits yet nothing to commit (create/copy files and use "git add" to track) ``` How do we interpret this output? "No commits yet" basically means our version history is empty. Let's fix that. ```console $ echo "hello, git" > hello.txt $ git add hello.txt $ git status On branch master No commits yet Changes to be committed: (use "git rm --cached ..." to unstage) new file: hello.txt $ git commit -m 'Initial commit' [master (root-commit) 4515d17] Initial commit 1 file changed, 1 insertion(+) create mode 100644 hello.txt ``` With this, we've `git add` ed a file to the staging area, and then `git commit`ed that change, adding a simple commit message " Initial commit ". If we didn't specify a `-m` option, Git would open our text editor to allow us type a commit message. Now that we have a non-empty version history, we can visualize the history. Visualizing the history as a DAG can be especially helpful in understanding the current status of the repo and connecting it with your understanding of the Git data model. The `git log` command visualizes history. By default, it shows a flattened version, which hides the graph structure. 
If you use a command like `git log --all --graph --decorate`, it will show you the full version history of the repository, visualized in graph form. ```console $ git log --all --graph --decorate * commit 4515d17a167bdef0a91ee7d50d75b12c9c2652aa (HEAD -> master) Author: Missing Semester Date: Tue Jan 21 22:18:36 2020 -0500 Initial commit ``` This doesn't look all that graph-like, because it only contains a single node. Let's make some more changes, author a new commit, and visualize the history once more. ```console $ echo "another line" >> hello.txt $ git status On branch master Changes not staged for commit: (use "git add ..." to update what will be committed) (use "git checkout -- ..." to discard changes in working directory) modified: hello.txt no changes added to commit (use "git add" and/or "git commit -a") $ git add hello.txt $ git status On branch master Changes to be committed: (use "git reset HEAD ..." to unstage) modified: hello.txt $ git commit -m 'Add a line' [master 35f60a8] Add a line 1 file changed, 1 insertion(+) ``` Now, if we visualize the history again, we'll see some of the graph structure: ``` * commit 35f60a825be0106036dd2fbc7657598eb7b04c67 (HEAD -> master) | Author: Missing Semester | Date: Tue Jan 21 22:26:20 2020 -0500 | | Add a line | * commit 4515d17a167bdef0a91ee7d50d75b12c9c2652aa Author: Anish Athalye Date: Tue Jan 21 22:18:36 2020 -0500 Initial commit ``` Also, note that it shows the current HEAD, along with the current branch (master). We can look at old versions using the `git checkout` command. ```console $ git checkout 4515d17 # previous commit hash; yours will be different Note: checking out '4515d17'. You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by performing another checkout. 
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 4515d17 Initial commit
$ cat hello.txt
hello, git
$ git checkout master
Previous HEAD position was 4515d17 Initial commit
Switched to branch 'master'
$ cat hello.txt
hello, git
another line
```

Git can show you how files have evolved (differences, or diffs) using the `git diff` command:

```console
$ git diff 4515d17 hello.txt
diff --git c/hello.txt w/hello.txt
index 94bab17..f0013b2 100644
--- c/hello.txt
+++ w/hello.txt
@@ -1 +1,2 @@
 hello, git
+another line
```

{% endcomment %}

- `git help <command>`: get help for a git command
- `git init`: creates a new git repository, with its data stored in a directory named `.git`
- `git status`: shows the current state of the repository
- `git add <filename>`: adds files to the staging area
- `git commit`: creates a new commit
    - How to write [good commit messages](https://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html)!
    - Why you should [write good commit messages](https://chris.beams.io/posts/git-commit/)
- `git log`: shows the history log
- `git log --all --graph --decorate`: visualizes history as a DAG (directed acyclic graph)
- `git diff <filename>`: shows changes relative to the staging area
- `git diff <revision> <filename>`: shows differences in a file between two versions
- `git checkout <revision>`: updates HEAD (and the current branch, if checking out a branch)

## Branching and merging

{% comment %}

Branching allows you to "fork" version history. It can be helpful for working on independent features or bug fixes in parallel. The `git branch` command can be used to create new branches; `git checkout -b <name>` creates a branch and checks it out.

Merging is the opposite of branching: it allows you to combine forked version histories, e.g. merging a feature branch back into master. The `git merge` command is used for merging.
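{% endcomment %}

- `git branch`: shows branches
- `git branch <name>`: creates a branch
- `git checkout -b <name>`: creates a branch and switches to it
    - same as `git branch <name>; git checkout <name>`
- `git merge <revision>`: merges into the current branch
- `git mergetool`: uses a tool to help resolve merge conflicts
- `git rebase`: rebases a set of patches onto a new base

## Remotes

- `git remote`: lists remotes
- `git remote add <name> <url>`: adds a remote
- `git push <remote> <local branch>:<remote branch>`: sends objects to the remote and updates the remote reference
- `git branch --set-upstream-to=<remote>/<remote branch>`: sets up the correspondence between a local and a remote branch
- `git fetch`: retrieves objects/references from a remote
- `git pull`: same as `git fetch; git merge`
- `git clone`: downloads a repository from a remote

## Undo

- `git commit --amend`: edits a commit's contents or message
- `git reset HEAD <file>`: unstages a file
- `git checkout -- <file>`: discards changes
- `git restore`: available since Git 2.23; replaces `git reset` for many undo operations

# Advanced Git

- `git config`: Git is [highly customizable](https://git-scm.com/docs/git-config)
- `git clone --depth=1`: shallow clone, without the entire version history
- `git add -p`: interactive staging
- `git rebase -i`: interactive rebasing
- `git blame`: shows who last edited each line
- `git stash`: temporarily removes modifications to the working directory
- `git bisect`: binary searches through history
- `.gitignore`: [specifies](https://git-scm.com/docs/gitignore) intentionally untracked files to ignore

# Miscellaneous

- **GUIs**: there are many [GUI clients](https://git-scm.com/downloads/guis) for Git. We personally don't use them and use the command-line interface instead.
- **Shell integration**: it is very handy to have Git status as part of your shell prompt ([zsh](https://github.com/olivierverdier/zsh-git-prompt), [bash](https://github.com/magicmonty/bash-git-prompt)). Frameworks like [Oh My Zsh](https://github.com/ohmyzsh/ohmyzsh) usually include this already.
- **Editor integration**: similarly, integrating Git into your editor has many benefits. [fugitive.vim](https://github.com/tpope/vim-fugitive) is a commonly used plugin for Vim.
- **Workflows**: we covered the data model and some basic commands, but we did not discuss the conventions to follow when working on large projects (there are [many](https://nvie.com/posts/a-successful-git-branching-model/) [different](https://www.endoflineblog.com/gitflow-considered-harmful) [approaches](https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow)).
- **GitHub**: Git is not GitHub. On GitHub, you contribute code to other projects through a mechanism called [pull requests](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests).
- **Other Git providers**: GitHub is not the only one. There are also platforms like [GitLab](https://about.gitlab.com/) and [BitBucket](https://bitbucket.org/).

# Resources

- [Pro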
Git](https://git-scm.com/book/en/v2) is **highly recommended reading**! Going through the first five chapters will teach you most of what you need to use Git proficiently, now that you understand the data model. The later chapters cover interesting, advanced material. ([Pro Git, Chinese edition](https://git-scm.com/book/zh/v2))
- [Oh Shit, Git!?!](https://ohshitgit.com/) is a short guide on how to recover from Git mistakes.
- [Git for Computer Scientists](https://eagain.net/articles/git-for-computer-scientists/) is a short explanation of Git's data model, with less pseudocode and more diagrams than these lecture notes.
- [Git from the Bottom Up](https://jwiegley.github.io/git-from-the-bottom-up/) is a detailed explanation of Git's implementation details beyond just the data model, for the curious.
- [How to explain git in simple words](https://smusamashah.github.io/blog/2017/10/14/explain-git-in-simple-words)
- [Learn Git Branching](https://learngitbranching.js.org/) is a browser-based game that teaches you Git.

# Exercises

[Solutions]({{site.url}}/{{site.solution_url}}/{{page.solution.url}})

1. If you have never used Git before, try reading the first few chapters of [Pro Git](https://git-scm.com/book/en/v2) or go through a tutorial like [Learn Git Branching](https://learngitbranching.js.org/). As you work through it, relate the Git commands to the data model.
2. Clone the [repository for the class website](https://github.com/missing-semester-cn/missing-semester-cn.github.io.git).
    1. Explore the version history by visualizing it as a graph.
    2. Who was the last person to modify `README.md`? (Hint: use `git log` with a suitable argument)
    3. What was the commit message of the last modification to the `collections:` line of `_config.yml`? (Hint: use `git blame` and `git show`)
3. A common mistake when using Git is to commit large files that should not be managed by Git, or to commit files containing sensitive information. Try adding a file to a repository and committing it, then deleting that file from history ([this article may help](https://help.github.com/articles/removing-sensitive-data-from-a-repository/)).
4. Clone some repository from GitHub and modify a few files. What happens when you run `git stash`? What do you see when running `git log --all --oneline`? Run `git stash pop` to undo what you did with `git stash`. In what scenario might this be useful?
5. Like many command-line tools, Git provides a configuration file (or dotfile) called `~/.gitconfig`. Create an alias in `~/.gitconfig` so that when you run `git graph`, you get the output of `git log --all --graph --decorate --oneline`.
6. You can set the location of a global ignore file by running `git config --global core.excludesfile ~/.gitignore_global`. This tells Git to use that file, but you still need to create `~/.gitignore_global` at that path yourself. Set up your global gitignore file to ignore OS-specific or editor-specific temporary files, like `.DS_Store`.
7. Fork the [repository for the class website](https://github.com/missing-semester-cn/missing-semester-cn.github.io.git), find a typo or some other improvement you can make, and submit a pull request on GitHub.

================================================
FILE: _config.yml
================================================
# Setup
title: 'the missing semester of your cs education'
url: https://missing-semester-cn.github.io
solution_url: missing-notes-and-solutions/2020/solutions/

# Settings
markdown: kramdown
kramdown:
  input: GFM
  hard_wrap: false
highlighter: rouge
permalink: /:title/
future: true
# safe: true # breaks local rendering if enabled
timezone: America/New_York

analytics:
  tracking_id: UA-53167467-11

collections:
  '2019':
    output: true
  '2020':
    output: true

# Excludes
exclude:
- README.md
- Gemfile
- Gemfile.lock

================================================
FILE: _includes/head.html
================================================
{% if page.description %} {% assign description = page.description | strip_newlines %} {% endif %} {% if page.short_title %} {% assign title = page.short_title %} {% elsif page.title %} {% assign title = page.title %} {% else %} {% assign title = site.title %} {% endif %} {% if page.thumbnail %} {% endif %} {% if page.title %} {{ page.title }} · {{ site.title }} {% else %} {{ site.title }} {% endif %}

================================================
FILE: _includes/nav.html
================================================
{% comment %} {% endcomment %}

================================================
FILE: _includes/scaled_image.html
================================================
{{ include.alt }}

================================================
FILE: _includes/scaled_video.html
================================================

================================================
FILE: _includes/video.html
================================================

================================================
FILE: _layouts/default.html
================================================
{% include
head.html %} {% include nav.html %}
{{ content }}
================================================ FILE: _layouts/lecture.html ================================================ --- layout: default ---

{{ page.title }}{% if page.subtitle %} {{ page.subtitle }}{% endif %}

{% if page.video.id %}
{% elsif page.video %}

Lecture video coming soon!

{% endif %} {{ content }}

Edit this page.

Licensed under CC BY-NC-SA.

================================================ FILE: _layouts/page.html ================================================ --- layout: default ---

{{ page.title }}{% if page.subtitle %} {{ page.subtitle }}{% endif %}

{{ content }} ================================================ FILE: _layouts/redirect.html ================================================ --- layout: null --- {{ site.title }} -- {{ page.title }}

Redirecting you to {{ page.redirect }}

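================================================
FILE: about.md
================================================
---
layout: lecture
title: "The motivation behind this class"
---

In a traditional computer science curriculum, there is no shortage of advanced courses and topics, from operating systems and programming languages to machine learning. Yet one essential topic is rarely taught explicitly and is instead left for students to discover on their own: mastering your tools.

Over the years, we have helped teach several classes at MIT, and time and again we have seen that many students know very little about the tools available to them. Computers were built to automate tasks, yet students often get stuck doing repetitive work by hand, or fail to take full advantage of powerful tools such as version control and text editors. Inefficiency and wasted time are the lesser problem; worse, this can lead to data loss or an inability to complete certain tasks.

These topics are not part of the university curriculum: students are never shown how to use these tools, or at least not how to use them efficiently, and so they waste time and effort on tasks that should be simple. The standard computer science curriculum is missing this key class, one that could make computing much easier.

# The missing semester of your CS education

To address this, we are running a class that covers the topics we consider crucial to being an effective computer scientist or programmer. The class is pragmatic and hands-on, and it introduces tools and techniques that you can immediately apply to a wide variety of problems. It runs in January 2020 during MIT's "Independent Activities Period", a one-month term of short, student-run classes. Although the class is aimed at MIT students, we make all lecture recordings and course materials available to the public.

If this sounds like it might be for you, here are some concrete examples of what the class covers:

## Command line and shell tools

How to automate common, repetitive tasks with aliases, scripts, and build systems. No more copy-pasting commands from a document. No more "run these 15 commands one after the other". No more "you forgot to run this command" or "you forgot to pass that argument".

For example, searching through your history quickly can be a huge time saver. In the example below, we show several tricks for navigating your shell history for `convert` commands.

## Version control

How to use version control **properly**, and take advantage of it to save you from disaster, collaborate with others, and quickly locate the commit that introduced a problem.

No more commenting out large blocks of code. No more scouring all of your code to track down a bug. No more "oh no, did I just delete working code?!". We will teach you how to contribute to other people's projects with pull requests.

In the example below, we use `git bisect` to find which commit broke a unit test, and then fix it with `git revert`.

## Text editing

How to edit files efficiently from the command line, both locally and remotely, and take full advantage of your editor's features. No more copying files back and forth. No more repetitive editing.

Vim macros are one of its best features. In the example below, we use a nested Vim macro to quickly convert an HTML table to CSV format.

## Remote machines

How to keep your connection alive when working on remote machines with SSH keys, and how to multiplex your terminal. No more keeping many terminals open just to run a couple of commands. No more typing your password every time you connect. No more losing all your context because your network dropped or you had to reboot your laptop.

In the example below, we use `tmux` to keep sessions alive on a remote server and `mosh` to support network roaming and disconnection.

## Finding files

How to quickly find the files you are looking for. No more clicking through the files in your project until you find the code you want.

In the example below, we quickly look for files with `fd` and for code snippets with `rg`. We also use `fasd` to quickly `cd` to, and `vim`, recent and frequently used files and folders.

## Data wrangling

How to quickly and easily modify, view, parse, plot, and compute over data and files directly from the command line. No more copy-pasting from log files. No more manually computing statistics. No more plotting in spreadsheets.

## Virtual machines

How to use virtual machines to try out new operating systems, isolate unrelated projects, and keep your host machine clean. No more accidentally corrupting your computer while doing a security lab. No more piles of randomly installed packages with differing versions.

## Security

How to use the Internet without leaking your private information. No more racking your brain for passwords that match some insane criteria. No more connecting to unsecured, open WiFi networks. No more sending unencrypted messages.

# Conclusion

These 12 lectures cover the topics above and more, and each comes with hands-on exercises to help you get familiar with the tools. If you can't wait for January, you can also take a look at the lectures from [Hacker Tools](https://hacker-tools.github.io/lectures/), which we ran last year. It is the precursor to this class and covers many of the same topics.

We hope to see you there, whether in person or online.

Happy hacking,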
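Anish, Jose, and Jon

================================================
FILE: index.md
================================================
---
layout: page
title: The Missing Semester of Your CS Education
---

# The Missing Semester of Your CS Education (Chinese edition)

University computer science classes usually focus on academic topics, from operating systems to machine learning, while mastery of tools is left for students to figure out on their own. In this series of lectures, we teach the command line, how to use a powerful text editor, the many features of version control systems, and much more. Students live with these tools throughout their education (and even more so throughout their careers), so it is well worth taking the time to become fluent with them.

Mastering these tools not only lets you finish tasks faster, it also lets you solve problems that previously seemed impossibly complex.

Read about the [motivation behind this class](/about/).

{% comment %}
# Registration

Sign up for the IAP 2020 class by filling out this [registration form](https://forms.gle/TD1KnwCSV52qexVt9).
{% endcomment %}

# Schedule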
    {% assign lectures = site['2020'] | sort: 'date' %} {% for lecture in lectures %} {% if lecture.phony != true and lecture.solution !=true %}
  • {{ lecture.date | date: '%-m/%d' }}: {% if lecture.ready%} {{ lecture.title }} {% else %} {{ lecture.title }} {% if lecture.noclass %}[no class]{% endif %} {% endif %} {% if lecture.sync %} {% else %} {% endif %} {% if lecture.solution.ready%} {% else %} {% endif %}
  • {% endif %} {% endfor %}
Video recordings of the lectures are available [on YouTube](https://www.youtube.com/playlist?list=PLyzOVJj3bHQuloKGG59rS43e29ro7I57J).

# About the class

**Staff**: This class is co-taught by [Anish](https://www.anishathalye.com/), [Jon](https://thesquareplanet.com/), and [Jose](http://josejg.com/).

**Questions**: Email us at [missing-semester@mit.edu](mailto:missing-semester@mit.edu).

# Beyond MIT

We have also shared this class beyond MIT in the hope that others may benefit from these resources. You can find posts and discussion in the following places.

- [Hacker News](https://news.ycombinator.com/item?id=22226380)
- [Lobsters](https://lobste.rs/s/ti1k98/missing_semester_your_cs_education_mit)
- [/r/learnprogramming](https://www.reddit.com/r/learnprogramming/comments/eyagda/the_missing_semester_of_your_cs_education_mit/)
- [/r/programming](https://www.reddit.com/r/programming/comments/eyagcd/the_missing_semester_of_your_cs_education_mit/)
- [Twitter](https://twitter.com/jonhoo/status/1224383452591509507)
- [YouTube](https://www.youtube.com/playlist?list=PLyzOVJj3bHQuloKGG59rS43e29ro7I57J)

# Translations

- [Chinese (Traditional)](https://missing-semester-zh-hant.github.io/)
- [Japanese](https://missing-semester-jp.github.io/)
- [Korean](https://missing-semester-kr.github.io/)
- [Portuguese](https://missing-semester-pt.github.io/)
- [Russian](https://missing-semester-rus.github.io/)
- [Serbian](https://netboxify.com/missing-semester/)
- [Spanish](https://missing-semester-esp.github.io/)
- [Turkish](https://missing-semester-tr.github.io/)
- [Vietnamese](https://missing-semester-vn.github.io/)

Note: the links above are community translations. We have not vetted their content.

## Acknowledgements

We thank Elaine Mello, Jim Cain, and [MIT Open Learning](https://openlearning.mit.edu/) for helping us record the lecture videos. We thank Anthony Zolnik and [MIT AeroAstro](https://aeroastro.mit.edu/) for providing A/V equipment. We thank Brandi Adams and [MIT EECS](https://www.eecs.mit.edu/) for supporting this class.

---

Source code.

Licensed under CC BY-NC-SA.

See here for contribution & translation guidelines.

================================================ FILE: lectures.html ================================================ --- layout: redirect redirect: /2020/ title: Lectures --- ================================================ FILE: license.md ================================================ --- layout: default title: "License" permalink: /license --- # License All the content in this course, including the website source code, lecture notes, exercises, and lecture videos is licensed under Attribution-NonCommercial-ShareAlike 4.0 International [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). This means that you are free to: - **Share** — copy and redistribute the material in any medium or format - **Adapt** — remix, transform, and build upon the material Under the following terms: - **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. - **NonCommercial** — You may not use the material for commercial purposes. - **ShareAlike** — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. This is a human-readable summary of (and not a substitute for) the [license](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode). ## Contribution guidelines You can submit corrections and suggestions to the course material by submitting issues and pull requests on our GitHub [repo](https://github.com/missing-semester/missing-semester). This includes the captions for the video lectures which are also in the repo (see [here](https://github.com/missing-semester/missing-semester/tree/master/static/files/subtitles/2020)). ## Translation guidelines You are free to translate the lecture notes and exercises as long as you follow the license terms. 
If your translation mirrors the course structure, please contact us so we can link your translated version from our page. For translating the video captions, please submit your translations as community contributions in YouTube. ================================================ FILE: robots.txt ================================================ User-agent: * Disallow: ================================================ FILE: static/css/main.css ================================================ /* Copyright (c) 2017 Anish Athalye */ @import url(https://fonts.googleapis.com/css?family=Source+Sans+Pro); @import url(https://fonts.googleapis.com/css?family=Source+Code+Pro); /* Basic styling */ * { box-sizing: border-box; margin: 0; padding: 0; text-rendering: geometricPrecision; } html { /* font-size: 14px; font-family: "Source Sans Pro", "Helvetica Neue", Helvetica, Arial, sans-serif;*/ font-family: "Source Sans Pro", sans-serif; font-size: 14pt; line-height: 1.5; } @media(min-width: 480px) { html { /*font-size: 16px;*/ font-size: 14pt; } } body { margin: 0; color: #000; background-color: #fff; overflow-y: scroll; } h1, h2, h3, h4, h5, h6 { margin-bottom: 1rem; font-weight: bold; /*text-decoration: underline;*/ line-height: 1.25; font-size: 1rem; } h1 { margin-top: 1.25rem; font-size: 1.5rem; } h2 { margin-top: 1.25rem; font-size: 1.1rem; } p { margin-top: 0; margin-bottom: 1rem; } strong { font-weight: bold; } em { font-style: italic; } ul { list-style-position: inside; padding-left: 1rem; } ol { margin-left: 1rem; } li > ul { padding-left: 2rem; } ul li { list-style-type: none; } ul, ol { margin-bottom: 1rem; } ul ul, ol ul, ul ol, ol ol { margin-bottom: inherit; } ul li:before { content: "\2013 "; /* note: extra space needed because first is consumed by css parser */ position: absolute; margin-left: -1rem; } ul.double-spaced li { margin-top: 1rem; } pre, code { font-family: "Source Code Pro", "Menlo", "DejaVu Sans Mono", "Lucida Console", monospace; } code { 
background-color: rgba(27,31,35,.05); border-radius: 3px; padding: 0 0.2rem; /*font-size: 0.9em;*/ font-size: 12pt; } pre { color: #000; margin: 1rem; padding: 0.5rem 0.7rem; border: 1px dashed #444; /*font-size: .8rem;*/ font-size: 11pt; overflow-x: auto; } pre code { color: inherit; background: none; font-size: 100%; padding: 0; } a { color: #54008c; text-decoration: underline; } a:hover { color: #fff; background-color: #54008c; } img, video { display: block; margin-left: auto; margin-right: auto; border-radius: 5px; max-width: 100%; max-height: 80vh; } video { margin-bottom: 1rem; } summary { outline: none; user-select: none; } hr { position: relative; margin: 1.5rem 0; border: 0; border-top: 1px solid #eee; border-bottom: 1px solid #fff; } /* Classes */ .title { font-size: 2rem; } .subtitle { font-size: 1.5rem; margin-left: 1rem; } .small { font-size: 0.75rem; } .small p { margin-bottom: 0; } .center { text-align: center; } .gap { margin-top: 4rem; margin-bottom: 4rem; } .accent { color: #8c0038; } .youtube-wrapper { position: relative; height: 0; margin-bottom: 1rem; } .youtube-wrapper iframe { position: absolute; top: 0; left: 0; width: 100%; height: 100%; } /* Elements */ #content { max-width: 35rem; margin: auto; margin-bottom: 2rem; padding: 1rem 1rem 0 1rem; } .demo { margin-top: 2em; margin-bottom: 2em; } #nav-bg { margin: 0; padding: 0.25rem 1rem; font-family: "Source Code Pro", "Menlo", "DejaVu Sans Mono", "Lucida Console", monospace; background: #54008c; color: #fff; } #top-nav { max-width: 75rem; /* padding-left:8rem; */ margin: auto; text-align: center; } #top-nav a { color: #fff; text-decoration: none; } #top-nav a:hover { color: #000; background-color: #fff; } a#logo { color: #f2deff; } a:hover#logo { color: #000; } #menu-icon { display: none; } .trigger { display: none; } input[type=checkbox]:checked ~ .trigger { display:block; margin: auto; } .menu-label { font-family: "Source Code Pro", "Menlo", "DejaVu Sans Mono", "Lucida Console", monospace; 
} input[type=checkbox] ~ .menu-label:after { content: "(+)"; } input[type=checkbox]:checked ~ .menu-label:after { content: "(-)"; } .nav-link { display: block; } .trigger-child { display: inline-block; text-align: initial; } .nav-link:before { content: "- "; } /* in terms of our fixed-width layout; if smaller than this, we want to * collapse the menu */ @media (min-width: 40rem) { .menu-label { display: none; } .trigger { display: inline; padding-top: inherit; } .trigger-child { display: inline; text-align: initial; } .nav-link { display: initial; } .nav-link:before { content: "| "; } } @media (prefers-color-scheme: dark) { body { background-color: #303030; color: #ddd } a { color: #66D9EF; /*color: #A6E22E;*/ text-decoration: none; } a:hover { color: #000; background-color: #66D9EF; text-decoration: none; } h1, h2, h3, h4, h5, h6 { color: #eee; } #nav-bg, a#logo, #top-nav a { background-color: #A6E22E; color: #202020; } a:hover > code { background-color: #66D9EF; } .accent { color: #F92672; } } @media print { #nav-bg, #logo, #top-nav { display: none; } h1.title ~ p.center.gap.accent { display: none; } .youtube-wrapper { display: none; } html { font-size: 1em; font-family: sans-serif; } body { background: none; } #content { max-width: none; } h1.title { text-align: center; } h1, h2, h3, h4, h5, h6 { break-after: avoid-page; page-break-after: avoid; } #content hr:last-of-type { display: none; } #content pre { break-inside: avoid-page; page-break-inside: avoid; } #content div.small:last-of-type { display: none; } } .ribbon { background-color: #8cbcea; overflow: hidden; white-space: nowrap; /* top left corner */ position: absolute; right: -50px; top: 40px; /* 45 deg ccw rotation */ -webkit-transform: rotate(45deg); -moz-transform: rotate(45deg); -ms-transform: rotate(45deg); -o-transform: rotate(45deg); transform: rotate(45deg); /* shadow */ -webkit-box-shadow: 0 0 10px #888; -moz-box-shadow: 0 0 10px #888; /* box-shadow: 0 0 10px #888; */ box-shadow: 0px -1px 20px 
0px #562c8c6b; } .ribbon a { border: 1px solid #000; color: #000; display: block; font: bold 81.25% "Helvetica Neue", Helvetica, Arial, sans-serif; margin: 1px 0; padding: 10px 50px; text-align: center; text-decoration: none; /* shadow */ /* text-shadow: 0 0 5px #444; */ } ================================================ FILE: static/css/syntax.css ================================================ pre.highlight { background-color: #f9f9f9; background-clip: border-box } .highlight .c { color: #999988; font-style: italic } /* Comment */ .highlight .err { color: #a61717; background-color: #e3d2d2 } /* Error */ .highlight .k { color: #000000; font-weight: bold } /* Keyword */ .highlight .o { color: #000000; font-weight: bold } /* Operator */ .highlight .cm { color: #999988; font-style: italic } /* Comment.Multiline */ .highlight .cp { color: #999999; font-weight: bold; font-style: italic } /* Comment.Preproc */ .highlight .c1 { color: #999988; font-style: italic } /* Comment.Single */ .highlight .cs { color: #999999; font-weight: bold; font-style: italic } /* Comment.Special */ .highlight .gd { color: #000000; background-color: #ffdddd } /* Generic.Deleted */ .highlight .ge { color: #000000; font-style: italic } /* Generic.Emph */ .highlight .gr { color: #aa0000 } /* Generic.Error */ .highlight .gh { color: #999999 } /* Generic.Heading */ .highlight .gi { color: #000000; background-color: #ddffdd } /* Generic.Inserted */ .highlight .go { color: #888888 } /* Generic.Output */ .highlight .gp { color: #995c8b } /* Generic.Prompt */ .highlight .gs { font-weight: bold } /* Generic.Strong */ .highlight .gu { color: #aaaaaa } /* Generic.Subheading */ .highlight .gt { color: #aa0000 } /* Generic.Traceback */ .highlight .kc { color: #000000; font-weight: bold } /* Keyword.Constant */ .highlight .kd { color: #000000; font-weight: bold } /* Keyword.Declaration */ .highlight .kn { color: #000000; font-weight: bold } /* Keyword.Namespace */ .highlight .kp { color: #000000; 
font-weight: bold } /* Keyword.Pseudo */ .highlight .kr { color: #000000; font-weight: bold } /* Keyword.Reserved */ .highlight .kt { color: #445588; font-weight: bold } /* Keyword.Type */ .highlight .m { color: #009999 } /* Literal.Number */ .highlight .s { color: #d01040 } /* Literal.String */ .highlight .na { color: #008080 } /* Name.Attribute */ .highlight .nb { color: #0086b3 } /* Name.Builtin */ .highlight .nc { color: #445588; font-weight: bold } /* Name.Class */ .highlight .no { color: #008080 } /* Name.Constant */ .highlight .nd { color: #3c5d5d; font-weight: bold } /* Name.Decorator */ .highlight .ni { color: #800080 } /* Name.Entity */ .highlight .ne { color: #990000; font-weight: bold } /* Name.Exception */ .highlight .nf { color: #990000; font-weight: bold } /* Name.Function */ .highlight .nl { color: #990000; font-weight: bold } /* Name.Label */ .highlight .nn { color: #555555 } /* Name.Namespace */ .highlight .nt { color: #000080 } /* Name.Tag */ .highlight .nv { color: #008080 } /* Name.Variable */ .highlight .ow { color: #000000; font-weight: bold } /* Operator.Word */ .highlight .w { color: #bbbbbb } /* Text.Whitespace */ .highlight .mf { color: #009999 } /* Literal.Number.Float */ .highlight .mh { color: #009999 } /* Literal.Number.Hex */ .highlight .mi { color: #009999 } /* Literal.Number.Integer */ .highlight .mo { color: #009999 } /* Literal.Number.Oct */ .highlight .sb { color: #d01040 } /* Literal.String.Backtick */ .highlight .sc { color: #d01040 } /* Literal.String.Char */ .highlight .sd { color: #d01040 } /* Literal.String.Doc */ .highlight .s2 { color: #d01040 } /* Literal.String.Double */ .highlight .se { color: #d01040 } /* Literal.String.Escape */ .highlight .sh { color: #d01040 } /* Literal.String.Heredoc */ .highlight .si { color: #d01040 } /* Literal.String.Interpol */ .highlight .sx { color: #d01040 } /* Literal.String.Other */ .highlight .sr { color: #009926 } /* Literal.String.Regex */ .highlight .s1 { color: #d01040 } /* 
Literal.String.Single */ .highlight .ss { color: #990073 } /* Literal.String.Symbol */ .highlight .bp { color: #999999 } /* Name.Builtin.Pseudo */ .highlight .vc { color: #008080 } /* Name.Variable.Class */ .highlight .vg { color: #008080 } /* Name.Variable.Global */ .highlight .vi { color: #008080 } /* Name.Variable.Instance */ .highlight .il { color: #009999 } /* Literal.Number.Integer.Long */ @media (prefers-color-scheme: dark) { code { background-color: #232323; } pre code { color: #ddd; } pre.highlight { background-color: #232323; } .highlight .hll { background-color: #232323; } .highlight .c { color: #75715e } /* Comment */ .highlight .err { color: #960050; background-color: #1e0010 } /* Error */ .highlight .k { color: #66d9ef } /* Keyword */ .highlight .l { color: #ae81ff } /* Literal */ .highlight .n { color: #f8f8f2 } /* Name */ .highlight .o { color: #f92672 } /* Operator */ .highlight .p { color: #f8f8f2 } /* Punctuation */ .highlight .cm { color: #75715e } /* Comment.Multiline */ .highlight .cp { color: #75715e } /* Comment.Preproc */ .highlight .c1 { color: #75715e } /* Comment.Single */ .highlight .cs { color: #75715e } /* Comment.Special */ .highlight .ge { font-style: italic } /* Generic.Emph */ .highlight .gs { font-weight: bold } /* Generic.Strong */ .highlight .kc { color: #66d9ef } /* Keyword.Constant */ .highlight .kd { color: #66d9ef } /* Keyword.Declaration */ .highlight .kn { color: #f92672 } /* Keyword.Namespace */ .highlight .kp { color: #66d9ef } /* Keyword.Pseudo */ .highlight .kr { color: #66d9ef } /* Keyword.Reserved */ .highlight .kt { color: #66d9ef } /* Keyword.Type */ .highlight .ld { color: #e6db74 } /* Literal.Date */ .highlight .m { color: #ae81ff } /* Literal.Number */ .highlight .s { color: #e6db74 } /* Literal.String */ .highlight .na { color: #a6e22e } /* Name.Attribute */ .highlight .nb { color: #f8f8f2 } /* Name.Builtin */ .highlight .nc { color: #a6e22e } /* Name.Class */ .highlight .no { color: #66d9ef } /* Name.Constant 
*/ .highlight .nd { color: #a6e22e } /* Name.Decorator */ .highlight .ni { color: #f8f8f2 } /* Name.Entity */ .highlight .ne { color: #a6e22e } /* Name.Exception */ .highlight .nf { color: #a6e22e } /* Name.Function */ .highlight .nl { color: #f8f8f2 } /* Name.Label */ .highlight .nn { color: #f8f8f2 } /* Name.Namespace */ .highlight .nx { color: #a6e22e } /* Name.Other */ .highlight .py { color: #f8f8f2 } /* Name.Property */ .highlight .nt { color: #f92672 } /* Name.Tag */ .highlight .nv { color: #f8f8f2 } /* Name.Variable */ .highlight .ow { color: #f92672 } /* Operator.Word */ .highlight .w { color: #f8f8f2 } /* Text.Whitespace */ .highlight .mf { color: #ae81ff } /* Literal.Number.Float */ .highlight .mh { color: #ae81ff } /* Literal.Number.Hex */ .highlight .mi { color: #ae81ff } /* Literal.Number.Integer */ .highlight .mo { color: #ae81ff } /* Literal.Number.Oct */ .highlight .sb { color: #e6db74 } /* Literal.String.Backtick */ .highlight .sc { color: #e6db74 } /* Literal.String.Char */ .highlight .sd { color: #e6db74 } /* Literal.String.Doc */ .highlight .s2 { color: #e6db74 } /* Literal.String.Double */ .highlight .se { color: #ae81ff } /* Literal.String.Escape */ .highlight .sh { color: #e6db74 } /* Literal.String.Heredoc */ .highlight .si { color: #e6db74 } /* Literal.String.Interpol */ .highlight .sx { color: #e6db74 } /* Literal.String.Other */ .highlight .sr { color: #e6db74 } /* Literal.String.Regex */ .highlight .s1 { color: #e6db74 } /* Literal.String.Single */ .highlight .ss { color: #e6db74 } /* Literal.String.Symbol */ .highlight .bp { color: #f8f8f2 } /* Name.Builtin.Pseudo */ .highlight .vc { color: #f8f8f2 } /* Name.Variable.Class */ .highlight .vg { color: #f8f8f2 } /* Name.Variable.Global */ .highlight .vi { color: #f8f8f2 } /* Name.Variable.Instance */ .highlight .il { color: #ae81ff } /* Literal.Number.Integer.Long */ .highlight .gh { } /* Generic Heading & Diff Header */ .highlight .gu { color: #75715e; } /* Generic.Subheading & Diff 
Unified/Comment? */ .highlight .gd { color: #f92672; } /* Generic.Deleted & Diff Deleted */ .highlight .gi { color: #a6e22e; } /* Generic.Inserted & Diff Inserted */ } ================================================ FILE: static/files/logger.py ================================================ import logging import sys class CustomFormatter(logging.Formatter): """Logging Formatter to add colors and count warning / errors""" grey = "\x1b[38;21m" yellow = "\x1b[33;21m" red = "\x1b[31;21m" bold_red = "\x1b[31;1m" reset = "\x1b[0m" format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s (%(filename)s:%(lineno)d)" FORMATS = { logging.DEBUG: grey + format + reset, logging.INFO: grey + format + reset, logging.WARNING: yellow + format + reset, logging.ERROR: red + format + reset, logging.CRITICAL: bold_red + format + reset } def format(self, record): log_fmt = self.FORMATS.get(record.levelno) formatter = logging.Formatter(log_fmt) return formatter.format(record) # create logger with 'spam_application' logger = logging.getLogger("Sample") # create console handler with a higher log level ch = logging.StreamHandler() ch.setLevel(logging.DEBUG) if len(sys.argv)> 1: if sys.argv[1] == 'log': ch.setFormatter(logging.Formatter('%(asctime)s : %(levelname)s : %(name)s : %(message)s')) elif sys.argv[1] == 'color': ch.setFormatter(CustomFormatter()) if len(sys.argv) > 2: logger.setLevel(logging.__getattribute__(sys.argv[2])) else: logger.setLevel(logging.DEBUG) logger.addHandler(ch) # logger.debug("debug message") # logger.info("info message") # logger.warning("warning message") # logger.error("error message") # logger.critical("critical message") import random import time for _ in range(100): i = random.randint(0, 10) if i <= 4: logger.info("Value is {} - Everything is fine".format(i)) elif i <= 6: logger.warning("Value is {} - System is getting hot".format(i)) elif i <= 8: logger.error("Value is {} - Dangerous region".format(i)) else: logger.critical("Maximum value reached") 
time.sleep(0.3) ================================================ FILE: static/files/sorts.py ================================================ import random def test_sorted(fn, iters=1000): for i in range(iters): l = [random.randint(0, 100) for i in range(0, random.randint(0, 50))] assert fn(l) == sorted(l) # print(fn.__name__, fn(l)) def insertionsort(array): for i in range(len(array)): j = i-1 v = array[i] while j >= 0 and v < array[j]: array[j+1] = array[j] j -= 1 array[j+1] = v return array def quicksort(array): if len(array) <= 1: return array pivot = array[0] left = [i for i in array[1:] if i < pivot] right = [i for i in array[1:] if i >= pivot] return quicksort(left) + [pivot] + quicksort(right) def quicksort_inplace(array, low=0, high=None): if len(array) <= 1: return array if high is None: high = len(array)-1 if low >= high: return array pivot = array[high] j = low-1 for i in range(low, high): if array[i] <= pivot: j += 1 array[i], array[j] = array[j], array[i] array[high], array[j+1] = array[j+1], array[high] quicksort_inplace(array, low, j) quicksort_inplace(array, j+2, high) return array if __name__ == '__main__': for fn in [quicksort, quicksort_inplace, insertionsort]: test_sorted(fn) ================================================ FILE: static/files/subtitles/2020/command-line.sbv ================================================ 0:00:00.480,0:00:02.480 Okay, can everyone hear me okay? 0:00:03.720,0:00:06.160 Okay, so welcome back. 0:00:06.160,0:00:10.320 I'm gonna address a couple of items in kind of the administratrivia. 0:00:10.640,0:00:13.080 With the end of the first week, 0:00:13.179,0:00:16.349 we sent an email, noticing you that 0:00:16.600,0:00:20.219 we have uploaded the videos for the first week, so you can now find them online. 0:00:20.470,0:00:26.670 They have all the screen recordings for the things that we were doing, so you can go back to them. 
0:00:26.830,0:00:31.439 Look if you were confused about something we did quickly and, again, 0:00:31.440,0:00:37.560 feel free to ask us any questions if anything in the lecture notes is not clear. We also sent you a 0:00:37.880,0:00:42.360 survey so you can give us feedback about what was not clear, 0:00:42.360,0:00:46.280 what items you would want a more thorough explanation of, or 0:00:47.110,0:00:51.749 just any other item, if you're finding the exercises too hard, too easy, 0:00:52.239,0:00:55.288 go into that URL and we'll really 0:00:55.960,0:01:00.040 appreciate getting that feedback, because that will make the course better 0:01:00.480,0:01:03.800 for the remaining lectures and for future iterations of the course. 0:01:05.080,0:01:07.080 With that out of the way 0:01:07.080,0:01:10.840 Oh, and we're gonna try to upload the videos in a more timely manner. 0:01:11.200,0:01:16.040 We don't want to kind of wait until the end of the week for that. So stay tuned for that. 0:01:18.760,0:01:19.840 That out of the way, 0:01:19.920,0:01:20.800 now I'm gonna 0:01:21.120,0:01:24.960 This lecture's called command-line environment and we're 0:01:25.160,0:01:28.440 going to cover a few different topics. So the 0:01:28.990,0:01:30.990 main topics we're gonna 0:01:32.040,0:01:34.520 cover, so you can keep track, 0:01:34.680,0:01:36.400 it's probably better here, 0:01:36.400,0:01:37.720 keep track of what I'm talking about. 0:01:37.920,0:01:41.560 The first is gonna be job control. 0:01:42.040,0:01:44.280 The second one is going to be 0:01:44.600,0:01:46.600 terminal multiplexers. 0:01:51.720,0:01:57.360 Then I'm going to explain what dotfiles are and how to configure your shell. 0:01:57.360,0:02:03.240 And lastly, how to efficiently work with remote machines. So if things are not 0:02:05.110,0:02:07.589 fully clear, kind of keep the structure in mind.
0:02:08.200,0:02:12.320 They all kind of interact in some way, of how you use your terminal, 0:02:12.880,0:02:17.280 but they are somewhat separate topics, so keep that in mind. 0:02:17.600,0:02:23.800 So let's go with job control. So far we have been using the shell in a very, kind of 0:02:24.800,0:02:27.720 mono-command way. Like, you execute a command and then 0:02:27.840,0:02:31.800 the command executes, then you get some output, and that's all about what you can do. 0:02:32.200,0:02:36.520 And if you want to run several things, it's not clear 0:02:36.540,0:02:41.099 how you will do it. Or if you want to stop the execution of a program, it's again, 0:02:41.099,0:02:43.768 like how do I know how to stop a program? 0:02:44.650,0:02:47.940 Let's showcase this with a command called sleep. 0:02:48.160,0:02:50.320 Sleep is a command that takes an argument, 0:02:50.320,0:02:54.360 and that argument is going to be an integer number, and it will sleep. 0:02:54.360,0:02:58.440 It will just kind of be there, on the background, for that many seconds. 0:02:58.440,0:03:03.539 So if we do something like sleep 20, this process is gonna be sleeping for 20 seconds. 0:03:03.539,0:03:07.720 But we don't want to wait 20 seconds for the command to complete. 0:03:08.040,0:03:10.800 So what we can do is type "Ctrl+C". 0:03:10.840,0:03:12.580 By typing "Ctrl+C" 0:03:12.580,0:03:17.840 We can see that, here, the terminal let us know, 0:03:18.880,0:03:22.840 and it's part of the syntax that we covered in the editors / Vim lecture, 0:03:23.000,0:03:27.200 that we typed "Ctrl+C" and it stopped the execution of the process. 0:03:27.640,0:03:29.640 What is actually going on here 0:03:29.880,0:03:34.840 is that this is using a UNIX communication mechanism called signals. 
0:03:35.120,0:03:37.360 When we type "Ctrl+C", 0:03:37.800,0:03:42.080 what the terminal did for us, or the shell did for us, 0:03:42.160,0:03:45.960 is send a signal called SIGINT, 0:03:45.960,0:03:51.320 that stands for SIGnal INTerrupt, that tells the program to stop itself. 0:03:51.680,0:03:57.520 And there are many, many, many signals of this kind. If you do man signal, 0:03:58.880,0:04:05.060 and just go down a little bit, here you have a list of them. 0:04:05.060,0:04:07.040 They all have number identifiers, 0:04:07.520,0:04:10.640 they have kind of a short name and you can find a description. 0:04:10.960,0:04:16.400 So for example, the one I have just described is here, number 2, SIGINT. 0:04:16.520,0:04:22.200 This is the signal that a terminal will send to a program when it wants to interrupt its execution. 0:04:22.520,0:04:25.840 A few more to be familiar with 0:04:26.460,0:04:28.530 is SIGQUIT, this is 0:04:29.229,0:04:34.409 again, if you work from a terminal and you want to quit the execution of a program. 0:04:34.409,0:04:37.720 For most programs it will do the same thing, 0:04:37.720,0:04:41.120 but we're gonna showcase now a program which will be different, 0:04:41.440,0:04:43.760 and this is the signal that will be sent. 0:04:44.680,0:04:49.229 It can be confusing sometimes. Looking at these signals, for example, the SIGTERM is 0:04:50.080,0:04:54.100 for most cases equivalent to SIGINT and SIGQUIT 0:04:54.480,0:04:58.380 but it's just when it's not sent through a terminal. 0:04:59.680,0:05:01.680 A few more that we're gonna 0:05:01.900,0:05:06.209 cover is SIGHUP, it's when there's like a hang-up in the terminal. 
0:05:06.210,0:05:10.199 So for example, when you are in your terminal, if you close your terminal 0:05:10.199,0:05:13.348 and there are still things running in the terminal, 0:05:13.480,0:05:17.000 that's the signal that the program is gonna send 0:05:17.000,0:05:19.960 to all the processes to tell that they should close, 0:05:19.960,0:05:25.080 like there was a hang-up in the command line communication 0:05:25.080,0:05:26.800 and they should close now. 0:05:28.400,0:05:34.260 Signals can do more things than just stopping, interrupting programs and asking them to finish. 0:05:34.260,0:05:36.840 You can for example use the 0:05:37.520,0:05:43.840 SIGSTOP to pause the execution of the program, and then you can use the 0:05:44.480,0:05:50.160 SIGCONT command for continuing, to continue the execution of the program at a point later in time. 0:05:51.160,0:05:55.440 Since all of this might be slightly too abstract, let's see a few examples. 0:05:58.040,0:06:00.560 First, let's showcase a 0:06:01.960,0:06:06.240 Python program. I'm going to very quickly go through the program. 0:06:06.440,0:06:08.360 This is a Python program, 0:06:08.720,0:06:10.760 that like most python programs, 0:06:11.520,0:06:13.960 is importing this signal library and 0:06:14.960,0:06:20.400 is defining this handler here. And this handler is writing, 0:06:20.440,0:06:23.040 "Oh, I got a SIGINT, but I'm not gonna stop here". 0:06:23.480,0:06:24.960 And after that, 0:06:24.960,0:06:30.720 we tell Python that we want this program, when it gets a SIGINT, to stop. 0:06:31.120,0:06:34.880 The rest of the program is a very silly program that is just going to be printing numbers. 0:06:35.060,0:06:37.540 So let's see this in action. 0:06:37.560,0:06:39.560 We do Python SIGINT. 0:06:39.880,0:06:44.970 And it's counting. We try doing "Ctrl+C", this sends a SIGINT, 0:06:44.970,0:06:50.000 but the program didn't actually stop. 
This is because we have a way in the program of 0:06:50.400,0:06:54.600 dealing with this exception, and we didn't want to exit. 0:06:54.760,0:06:57.600 If we send a SIGQUIT, which is done through 0:06:57.800,0:07:03.680 "Ctrl+\", here, we can see that since the program doesn't have a way of dealing with SIGQUIT, 0:07:03.730,0:07:06.269 it does the default operation, which is 0:07:06.820,0:07:08.800 terminate the program. 0:07:09.080,0:07:11.460 And you could use this, for example, 0:07:11.880,0:07:15.880 if someone Ctrl+C's your program, and your program is supposed to do something, 0:07:16.040,0:07:19.320 like you maybe want to save the intermediate state of your program 0:07:19.320,0:07:21.520 to a file, so you can recover it for later. 0:07:21.600,0:07:25.640 This is how you could write a handler like this. 0:07:29.520,0:07:30.720 Can you repeat the question? 0:07:30.880,0:07:32.280 What did you type right now, when it stopped? 0:07:32.480,0:07:34.480 So I... 0:07:34.630,0:07:38.880 So what I typed is, I type "Ctrl+C" to try to stop it 0:07:38.880,0:07:42.869 but it didn't, because SIGINT is captured by the program. Then I type 0:07:43.120,0:07:48.040 "Ctrl+\", which sends a SIGQUIT, which is a different signal, 0:07:49.000,0:07:51.720 and this signal is not captured by the program. 0:07:52.090,0:07:54.869 It's also worth mentioning that there is a couple of 0:07:54.970,0:07:59.970 signals that cannot be captured by software. There is a couple of signals 0:08:00.820,0:08:02.820 like SIGKILL 0:08:03.940,0:08:06.600 that cannot be captured. Like that, it will 0:08:06.660,0:08:09.300 terminate the execution of the process, no matter what. 0:08:09.300,0:08:12.000 And it can be sometimes harmful. You do not want to be using it by 0:08:12.000,0:08:16.460 default, because this can leave for example an orphan child, orphaned children processes. 
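The SIGINT-catching program described above could look roughly like the following. This is a minimal sketch, not the script shown in the lecture: the names `handler` and `received` and the printed message are assumptions made for illustration. It registers a handler for SIGINT so that "Ctrl+C" no longer stops the process, while SIGQUIT ("Ctrl+\") is left at its default action.

```python
import os
import signal

received = []  # hypothetical log of every signal number the handler has seen


def handler(signum, frame):
    # Runs instead of Python's default SIGINT behavior (raising KeyboardInterrupt).
    received.append(signum)
    print("\nI got a SIGINT, but I am not stopping")


# From now on SIGINT (Ctrl+C) calls handler(); SIGQUIT keeps its default action.
signal.signal(signal.SIGINT, handler)

if __name__ == "__main__":
    # Simulate pressing Ctrl+C by sending SIGINT to our own process.
    os.kill(os.getpid(), signal.SIGINT)
    print("still running after SIGINT")
```

A handler like this is also the natural place to save intermediate state before exiting. SIGKILL cannot be registered this way, which is why it always terminates the process.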
0:08:16.470,0:08:20.940 Like if a process has other small children processes that it started, and you 0:08:21.400,0:08:25.470 SIGKILL it, all of those will keep running in there, 0:08:25.760,0:08:30.800 but they won't have a parent, and you can maybe have a really weird behavior going on. 0:08:32.040,0:08:35.680 What signal is given to the program if we log off? 0:08:35.800,0:08:37.440 If you log off? 0:08:37.920,0:08:41.920 That would be... so for example, if you're in an SSH connection and you close the connection, 0:08:41.920,0:08:45.600 that is the hang-up signal, 0:08:45.600,0:08:51.200 SIGHUP, which I'm gonna cover in an example. So this is what would be sent up. 0:08:51.560,0:08:56.360 And you could write for example, if you want the process to keep working even if you close 0:08:56.960,0:09:02.560 that, you can write a wrapper around that to ignore that signal. 0:09:04.720,0:09:09.760 Let's display what we could do with the stop and continue. 0:09:09.980,0:09:16.389 So, for example, we can start a really long process. Let's sleep a thousand, we're gonna take forever. 0:09:16.960,0:09:18.920 We can control-c, 0:09:18.920,0:09:20.360 "Ctrl+Z", sorry, 0:09:20.360,0:09:25.280 and if we do "Ctrl+Z" we can see that the terminal is saying "it's suspended". 0:09:25.400,0:09:31.520 What this actually meant is that this process was sent a SIGSTOP signal and now is 0:09:31.900,0:09:36.900 still there, you could continue its execution, but right now it's completely stopped and in the background 0:09:38.580,0:09:41.720 and we can launch a different program. 0:09:41.720,0:09:43.680 When we try to run this program, 0:09:43.680,0:09:46.620 please notice that I have included an "&" at the end. 0:09:46.820,0:09:52.380 This tells bash that I want this program to start running in the background. 0:09:52.560,0:09:55.660 This is kind of related to all these 0:09:55.660,0:09:59.720 concepts of running programs in the shell, but backgrounded. 
0:10:00.350,0:10:04.359 And what is gonna happen is the program is gonna start 0:10:04.720,0:10:07.580 but it's not gonna take over my prompt. 0:10:07.580,0:10:11.540 If I just ran this command without this, I could not do anything. 0:10:11.540,0:10:15.820 I would have no access to the prompt until the command either finished 0:10:16.060,0:10:19.380 or I ended it abruptly. But if I do this, 0:10:19.520,0:10:23.080 it's saying "there's a new process which is this". 0:10:23.080,0:10:25.180 This is the process identifying number, 0:10:25.180,0:10:26.940 we can ignore this for now. 0:10:27.800,0:10:32.919 If I type the command "jobs", I get the output that I have a suspended job 0:10:32.920,0:10:35.800 that is the "sleep 1000" job. 0:10:36.040,0:10:38.100 And then I have another running job, 0:10:38.120,0:10:42.200 which is this "nohup sleep 2000". 0:10:42.640,0:10:45.660 Say I want to continue the first job. 0:10:45.660,0:10:48.520 The first job is suspended, it's not executing anymore. 0:10:48.640,0:10:52.600 I can continue that by doing "bg %1" 0:10:53.870,0:10:58.359 That "%" indicates that I want to refer to this specific 0:11:00.280,0:11:04.280 job. And now, if I do that and I look at the jobs, 0:11:04.300,0:11:06.460 now this job is running again. Now 0:11:06.460,0:11:08.940 both of them are running. 0:11:09.300,0:11:13.820 If I wanted to stop them all, I can use the kill command. 0:11:14.040,0:11:16.060 The kill command 0:11:16.220,0:11:18.620 is for killing jobs, 0:11:19.180,0:11:22.080 which is just stopping them, intuitively, 0:11:22.120,0:11:23.760 but actually it's more useful than that. 0:11:23.860,0:11:28.200 The kill command just allows you to send any sort of Unix signal. 0:11:28.360,0:11:32.220 So here for example, instead of killing it completely, 0:11:32.220,0:11:34.640 we just send a stop signal. 0:11:34.640,0:11:39.160 Here I'm gonna send a stop signal, which is gonna pause the process again.
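The pause and continue signals can also be sent from a program rather than from the shell. The sketch below is an illustration under stated assumptions, not code from the lecture: `pause_and_resume` is a hypothetical helper, and a small Python child stands in for `sleep 1000`. It uses SIGSTOP, as `kill -STOP` would; note that "Ctrl+Z" in a terminal sends the closely related SIGTSTP, which a program may catch, whereas SIGSTOP cannot be caught.

```python
import os
import signal
import subprocess
import sys


def pause_and_resume():
    """Start a long-running child, pause it, resume it, then kill it."""
    # Stand-in for `sleep 1000 &`: a child process that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(1000)"]
    )
    events = []

    os.kill(child.pid, signal.SIGSTOP)  # comparable to `kill -STOP %1`
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    events.append("stopped" if os.WIFSTOPPED(status) else "other")

    os.kill(child.pid, signal.SIGCONT)  # comparable to `bg %1` or `kill -CONT %1`
    _, status = os.waitpid(child.pid, os.WCONTINUED)
    events.append("continued" if os.WIFCONTINUED(status) else "other")

    child.kill()  # SIGKILL, comparable to `kill -KILL %1`
    events.append(child.wait())  # a negative value means "ended by that signal"
    return events


if __name__ == "__main__":
    print(pause_and_resume())
```

On a Linux system the returned list should record a stop, a continue, and then a negative exit code equal to minus the SIGKILL number.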
0:11:39.160,0:11:41.280 I still have to include the identifier, 0:11:41.600,0:11:46.480 because without the identifier the shell wouldn't know whether to stop the first one or the second one. 0:11:47.480,0:11:52.480 Now it says this has been suspended, because there was a signal sent. 0:11:52.620,0:11:57.360 If I do "jobs", again, we can see that the second one is running 0:11:57.460,0:12:00.740 and the first one has been stopped. 0:12:01.420,0:12:04.300 Going back to one of the questions, 0:12:04.300,0:12:06.980 what happens when you close the shell, for example, 0:12:06.980,0:12:12.860 and why sometimes people will say that you should use this nohup command 0:12:12.860,0:12:15.960 before you run jobs in a remote session. 0:12:16.220,0:12:23.120 This is because if we try to send a hang-up signal to the first job 0:12:23.560,0:12:27.820 it's gonna, in a similar fashion to the other signals, 0:12:27.820,0:12:32.280 it's gonna hang it up and that's gonna terminate the job. 0:12:32.800,0:12:35.960 And the first job isn't there anymore 0:12:36.320,0:12:39.140 whereas we still have the second job running. 0:12:39.400,0:12:42.920 However, if we try to send the signal to the second job, 0:12:42.920,0:12:46.060 which is what will happen if we close our terminal right now, 0:12:47.040,0:12:48.660 it's still running. 0:12:48.660,0:12:52.480 Like nohup, what it's doing is kind of encapsulating 0:12:52.480,0:12:54.480 whatever command you're executing and 0:12:54.740,0:12:58.720 whenever you get a hang-up signal, 0:12:58.900,0:13:03.680 just ignoring that so it can keep running. 0:13:05.060,0:13:08.500 And if we send the "kill" signal to the second job, 0:13:08.500,0:13:12.820 that one can't be ignored and that will kill the job, no matter what. 0:13:13.280,0:13:15.780 And we don't have any jobs anymore. 0:13:17.000,0:13:22.540 That kind of completes the section on job control. 0:13:22.740,0:13:27.100 Any questions so far? Anything that wasn't fully clear?
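The difference between a plain background job and one started under nohup can be reproduced in a few lines. This is a rough sketch of the idea only, not how nohup itself is implemented: `sighup_action` and `hangup_demo` are hypothetical names, and the only assumption relied on is that an ignored signal stays ignored in the child after exec. One child keeps the default SIGHUP behavior, the other ignores SIGHUP, and both are sent the hang-up signal that closing the terminal would deliver.

```python
import os
import signal
import subprocess
import sys
import time

# Stand-in for `sleep 30`: a child process that just sleeps.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]


def sighup_action(action):
    """Return a function that sets the child's SIGHUP disposition before exec."""
    return lambda: signal.signal(signal.SIGHUP, action)


def hangup_demo():
    # Comparable to `sleep 30 &`: SIGHUP keeps its default (terminate).
    plain = subprocess.Popen(SLEEPER, preexec_fn=sighup_action(signal.SIG_DFL))
    # Comparable to `nohup sleep 30 &`: SIGHUP is ignored.
    protected = subprocess.Popen(SLEEPER, preexec_fn=sighup_action(signal.SIG_IGN))

    for proc in (plain, protected):
        os.kill(proc.pid, signal.SIGHUP)  # what a closing terminal would send

    plain_code = plain.wait()            # terminated by SIGHUP
    time.sleep(0.2)                      # give the second child time to (not) die
    survived = protected.poll() is None  # still running: SIGHUP was ignored

    protected.kill()                     # SIGKILL cannot be ignored
    return plain_code, survived, protected.wait()


if __name__ == "__main__":
    print(hangup_demo())
```

The first child should end with a negative exit code matching SIGHUP, while the second survives until it receives SIGKILL, mirroring what the two jobs did in the demo.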
0:13:29.040,0:13:30.400 What does bg do? 0:13:30.960,0:13:31.800 So bg... 0:13:31.800,0:13:36.860 There are like two commands. Whenever you have a command that has been backgrounded 0:13:37.200,0:13:41.820 and is stopped you can use bg (short for background) 0:13:41.820,0:13:44.180 to continue that process running in the background. 0:13:44.440,0:13:47.400 That's equivalent to just kind of sending it 0:13:47.680,0:13:50.820 a continue signal, so it keeps running. 0:13:50.820,0:13:54.820 And then there's another one which is called fg, if you want to 0:13:54.860,0:13:59.580 recover it to the foreground and you want to reattach your standard output. 0:14:04.760,0:14:06.760 Okay, good. 0:14:07.120,0:14:11.420 Jobs are useful and in general, I think knowing about signals can be 0:14:11.420,0:14:14.360 really beneficial when dealing with some part of Unix 0:14:14.360,0:14:19.420 but most of the time what you actually want to do is something along the lines of 0:14:19.670,0:14:24.099 having your editor on one side and then the program on another, and maybe 0:14:24.720,0:14:28.280 monitoring what the resource consumption is in another tab. 0:14:28.680,0:14:33.640 We could achieve this using probably what you have seen a lot of the time, 0:14:33.640,0:14:35.200 which is just opening more windows. 0:14:35.200,0:14:37.200 We can keep opening terminal windows. 0:14:37.320,0:14:41.280 But the fact is there are kind of more convenient solutions to this and 0:14:41.280,0:14:43.800 this is what a terminal multiplexer does. 0:14:44.080,0:14:48.520 A terminal multiplexer like tmux 0:14:48.840,0:14:52.160 will let you create different workspaces that you can work in, 0:14:52.640,0:14:54.280 and quickly kind of, 0:14:54.280,0:14:56.960 this has a huge variety of functionality, 0:14:57.320,0:15:02.760 It will let you rearrange the environment and it will let you have different sessions. 0:15:03.400,0:15:05.400 There's another more...
0:15:05.600,0:15:07.640 older command, which is called "screen", 0:15:07.640,0:15:09.360 that might be more readily available. 0:15:09.360,0:15:12.200 But I think the concept kind of extrapolates to both. 0:15:12.600,0:15:15.400 We recommend tmux, that you go and learn it. 0:15:15.400,0:15:17.480 And in fact, we have exercises on it. 0:15:17.480,0:15:20.240 I'm gonna showcase a different scenario right now. 0:15:20.320,0:15:22.000 So whenever I talked... 0:15:22.320,0:15:24.880 Oh, let me make a quick note. 0:15:25.200,0:15:28.800 There are kind of three core concepts in tmux, that I'm gonna go through and 0:15:30.110,0:15:33.130 the main idea is that there are what is called 0:15:35.180,0:15:37.180 "sessions". 0:15:37.760,0:15:40.510 Sessions have "windows" and 0:15:42.019,0:15:44.019 windows have "panes". 0:15:45.709,0:15:49.539 It's gonna be kind of useful to keep this hierarchy in mind. 0:15:50.760,0:15:57.280 You can pretty much equate "windows" to what "tabs" are in other editors and others, 0:15:57.280,0:16:00.720 like for example your web browser. 0:16:01.280,0:16:06.440 I'm gonna go through the features, mainly what you can do at the different levels. 0:16:07.000,0:16:10.480 So first, when we do tmux, that starts a session. 0:16:11.360,0:16:14.960 And here right now it seems like nothing changed 0:16:14.960,0:16:20.360 but what's happening right now is we're within a shell that is different from the one we started before. 0:16:20.640,0:16:24.840 So in our shell we started a process, that is tmux 0:16:24.840,0:16:28.840 and that tmux started a different process, which is the shell we're currently in. 0:16:28.980,0:16:30.400 And the nice thing about this is that 0:16:30.580,0:16:34.740 that tmux process is separate from the original shell process. 0:16:34.860,0:16:36.860 So 0:16:40.580,0:16:44.460 here, we can do things. 0:16:44.480,0:16:48.600 We can do "ls -la", for example, to tell us what is going on in here. 
0:16:48.920,0:16:53.960 And then we can start running our program, and it will start running in there 0:16:54.160,0:16:57.880 and we can do "Ctrl+A d", for example, to detach 0:17:12.760,0:17:15.960 to detach from the session. 0:17:16.140,0:17:19.120 And if we do "tmux a" 0:17:19.160,0:17:21.560 that's gonna reattach us to the session. 0:17:21.560,0:17:22.300 So the process, 0:17:22.300,0:17:25.180 we abandon the process counting numbers. 0:17:25.180,0:17:28.300 This really silly Python program that was just counting numbers, 0:17:28.340,0:17:30.160 we left it running there. 0:17:30.200,0:17:31.720 And if we tmux... 0:17:31.720,0:17:33.760 Hey, the process is still running there. 0:17:33.780,0:17:37.820 And we could close this entire terminal and open a new one and 0:17:37.880,0:17:41.860 we could still reattach because this tmux session is still running. 0:17:43.340,0:17:45.340 Again, we can... 0:17:46.640,0:17:48.640 Before I go any further. 0:17:48.920,0:17:53.740 Pretty much... Unlike Vim, where you have this notion of modes, 0:17:53.960,0:17:58.180 tmux will work in a more emacsy way, which is 0:17:58.180,0:18:04.140 every command, pretty much every command in tmux, 0:18:04.220,0:18:06.020 you could enter it through the... 0:18:06.020,0:18:08.160 it has a command line, that we could use. 0:18:08.240,0:18:11.320 But I recommend you to get familiar with the key bindings. 0:18:11.880,0:18:15.080 It can be somehow non intuitive at first, 0:18:15.300,0:18:17.880 but once you get used to them... 0:18:22.140,0:18:23.020 "Ctrl+C", yeah 0:18:24.440,0:18:30.760 When you get familiar with them, you will be much faster just using the key bindings than using the commands. 0:18:31.280,0:18:35.980 One note about the key bindings: all the key bindings have a form that is like 0:18:36.140,0:18:39.840 you type a prefix and then some key. 0:18:40.060,0:18:44.000 So for example, to detach we do "Ctrl+A" and then "D". 
0:18:44.160,0:18:50.140 This means you press "Ctrl+A" first, you release that, and then press "D" to detach. 0:18:50.380,0:18:54.200 On default tmux, the prefix is "Ctrl+B", 0:18:54.200,0:18:58.780 but you will find that most people will have this remapped to "Ctrl+A" 0:18:58.780,0:19:02.680 because it's a much more ergonomic type on the keyboard. 0:19:02.700,0:19:06.420 You can find more about how to do these things in one of the exercises, 0:19:06.960,0:19:12.780 where we link you to the basics and how to do some kind of quality of life modifications to tmux. 0:19:13.380,0:19:16.720 Going back to the concept of sessions, 0:19:16.960,0:19:22.120 we can create a new session just doing something like tmux new 0:19:22.320,0:19:24.540 and we can give sessions names. 0:19:24.760,0:19:27.220 So we can do like "tmux new -t foobar" 0:19:27.220,0:19:30.900 and this is a completely different session, that we have started. 0:19:32.240,0:19:36.360 We can work here, we can detach from it. 0:19:36.360,0:19:40.000 "tmux ls" will tell us that we have two different sessions: 0:19:40.000,0:19:43.460 the first one is named "0", because I didn't give it a name, 0:19:43.500,0:19:45.820 and the second one is called "foobar". 0:19:46.580,0:19:51.020 I can attach the foobar session 0:19:51.020,0:19:53.700 and I can end it. 0:19:54.680,0:19:56.340 And it's really nice because 0:19:56.340,0:20:00.139 having this you can kind of work in completely different projects. 0:20:00.140,0:20:04.340 For example, having two different tmux sessions and different 0:20:04.480,0:20:08.440 editor sessions, different processes running... 0:20:10.160,0:20:15.100 When you are within a session, we start with the concept of windows. 0:20:15.100,0:20:21.160 Here we have a single window, but we can use "Ctrl+A c" (for "create") 0:20:21.160,0:20:23.720 to open a new window. 0:20:24.000,0:20:26.340 And here nothing is executing. 
0:20:26.380,0:20:29.420 What it's doing is, tmux has opened a new shell for us 0:20:30.360,0:20:34.840 and we can start running another one of these programs here. 0:20:35.460,0:20:42.460 And to quickly jump between the tabs, we can do "Ctrl+A" and "previous", 0:20:42.460,0:20:44.520 "p" for "previous", 0:20:45.220,0:20:48.020 and that will go up to the previous window. 0:20:48.020,0:20:50.920 "Ctrl+A" "next", to go to the next window. 0:20:51.260,0:20:56.060 You can also use the numbers. So if we start opening a lot of these tabs, 0:20:56.200,0:21:00.160 we could use "Ctrl+A 1", to specifically jump to the 0:21:00.240,0:21:04.400 to the window that is number "1". 0:21:04.780,0:21:08.620 And, lastly, it's also pretty useful to know sometimes 0:21:08.660,0:21:10.400 that you can rename them. 0:21:10.400,0:21:13.380 For example here I'm executing this Python process, 0:21:13.580,0:21:16.800 but that might not be really informative and I want... 0:21:16.880,0:21:21.160 I maybe want to have something like execution or something like that and 0:21:21.740,0:21:26.840 that will rename the name of that window so you can have this really neatly organized. 0:21:27.080,0:21:33.500 This still doesn't solve the need when you want to have two things at the same time in your terminal, 0:21:33.680,0:21:35.740 like in the same display. 0:21:35.740,0:21:38.320 This is what panes are for. Right now, here 0:21:38.420,0:21:40.420 we have a window with a single pane 0:21:40.420,0:21:43.540 (all the windows that we have opened so far have a single pane). 0:21:43.640,0:21:50.800 But if we do 'Ctrl+A "' 0:21:51.040,0:21:56.540 this will split the current display into two different panes. 0:21:56.540,0:22:01.400 So, you see, the one we open below is a different shell from the one we have above, 0:22:01.640,0:22:05.440 and we can run any process that we want here. 0:22:05.620,0:22:09.900 We can keep splitting this, if we do "Ctrl+A %" 0:22:10.080,0:22:15.000 that will split vertically. 
And you can kind of 0:22:15.000,0:22:18.220 rearrange these tabs using a lot of different commands. 0:22:18.220,0:22:22.620 One that I find very useful, when you are starting and it's kind of frustrating, 0:22:23.540,0:22:26.000 rearranging them. 0:22:26.160,0:22:30.160 Before I explain that, to move through these panes, which is 0:22:30.300,0:22:32.280 something you want to be doing all the time 0:22:32.460,0:22:37.060 You just do "Ctrl+A" and the arrow keys, and that will let you quickly 0:22:37.460,0:22:43.960 navigate through the different windows, and execute again... 0:22:44.340,0:22:46.300 I'm doing a lot of "ls -a" 0:22:47.340,0:22:52.780 I can do "HTOP", that we'll explain in the debugging and profiling lecture. 0:22:53.540,0:22:55.920 And we can just navigate through them, again 0:22:55.920,0:22:59.040 like to rearrange there's another slew of commands, 0:22:59.080,0:23:01.080 you will go through some in the Exercises 0:23:02.400,0:23:07.160 "Ctrl+A" space is pretty neat, because it will kind of equispace the current ones 0:23:07.160,0:23:10.260 and let you through different layouts. 0:23:11.480,0:23:14.260 Some of them are too small for my current 0:23:14.840,0:23:19.220 terminal config, but that covers, I think, most of it. 0:23:19.440,0:23:21.440 Oh, there's also, 0:23:22.660,0:23:29.200 here, for example, this Vim execution that we have started, 0:23:29.200,0:23:33.380 is too small for what the current tmux pane is. 0:23:33.720,0:23:38.240 So one of the things that really is much more convenient to do in tmux, 0:23:39.180,0:23:42.500 in contrast to having multiple terminal windows, is that 0:23:42.560,0:23:48.400 you can zoom into this, you can ask by doing "Ctrl+A z", for "zoom". 0:23:48.400,0:23:52.960 It will expand the pane to take over all the space, 0:23:52.960,0:23:56.660 and then "Ctrl+A z", again will go back to it. 0:24:02.760,0:24:08.080 Any questions for terminal multiplexers, or like, tmux concretely? 
0:24:14.140,0:24:16.780 Is it running all the same thing? 0:24:18.680,0:24:22.700 Like, is there any difference in execution between running it in different windows? 0:24:24.880,0:24:28.640 Is it really just doing it all the same, so that you can see it? 0:24:28.800,0:24:34.900 Yeah, it wouldn't be any different from having two terminal windows open in your computer. 0:24:34.920,0:24:39.220 Like both of them are gonna be running. Of course, when it gets to the CPU, 0:24:39.220,0:24:41.400 this is gonna be multiplexed again. 0:24:41.460,0:24:44.400 Like there's like a timesharing mechanism going there 0:24:44.480,0:24:45.920 but there's no difference. 0:24:46.040,0:24:52.260 tmux is just making this much more convenient to use by giving you this visual layout 0:24:52.560,0:24:55.020 that you can quickly manipulate through. 0:24:55.020,0:24:59.860 And one of the main advantages will come when we reach the remote machines 0:24:59.860,0:25:05.300 because you can leave one of these, we can detach from one of these tmux systems, 0:25:05.300,0:25:09.120 close the connection and even if we close the connection and 0:25:09.120,0:25:11.640 and the terminal is gonna send a hang-up signal, 0:25:11.680,0:25:15.420 that's not gonna close all the tmux's that have been started. 0:25:17.110,0:25:19.110 Any other questions? 0:25:23.620,0:25:27.980 Let me disable the key-caster. 0:25:33.580,0:25:38.040 So now we're gonna move into the topic of dotfiles and, in general, 0:25:38.040,0:25:42.460 how to kind of configure your shell to do the things you want to do 0:25:42.460,0:25:45.580 and mainly how to do them quicker and in a more convenient way. 0:25:46.360,0:25:49.260 I'm gonna motivate this using aliases first. 
0:25:49.380,0:25:51.060 So what an alias is, 0:25:51.060,0:25:54.260 is that by now, you might be starting to do something like 0:25:54.920,0:26:01.680 a lot of the time, I just want to "ls" a directory and I want to display all the contents in a list format 0:26:02.180,0:26:05.040 and in a human-readable form. 0:26:05.260,0:26:07.400 And it's fine. Like it's not that long of a command. 0:26:07.400,0:26:10.300 But as you start building longer and longer commands, 0:26:10.320,0:26:14.440 it can become kind of bothersome having to retype them again and again. 0:26:14.440,0:26:17.540 This is one of the reasons why aliases are useful. 0:26:17.540,0:26:21.740 Alias is a command that will be a built-in in your shell, 0:26:21.960,0:26:23.680 and what it will do is 0:26:23.680,0:26:27.540 it will remap a short sequence of characters to a longer sequence. 0:26:27.780,0:26:31.500 So if I do, for example, here 0:26:31.500,0:26:36.840 alias ll="ls -lah" 0:26:37.440,0:26:42.520 If I execute this command, this is gonna call the "alias" command with this argument 0:26:42.520,0:26:44.320 and alias is going to update 0:26:44.540,0:26:49.040 the environment in my shell to be aware of this mapping. 0:26:49.320,0:26:52.920 So if I now do "ll", 0:26:52.920,0:26:57.520 it's executing that command without me having to type the entire command. 0:26:57.720,0:27:01.180 It can be really handy for many, many reasons. 0:27:01.180,0:27:04.740 One thing to note before I go any further is that 0:27:05.000,0:27:09.960 here, alias is not anything special compared to other commands, 0:27:09.960,0:27:11.400 it's just taking a single argument. 0:27:11.680,0:27:15.600 And there is no space around this equals sign and that's 0:27:16.020,0:27:18.720 because alias takes a single argument 0:27:18.720,0:27:21.640 and if you try doing 0:27:21.960,0:27:25.120 something like this, that's giving it more than one argument 0:27:25.120,0:27:28.360 and that's not gonna work because that's not the format it expects.
0:27:29.520,0:27:33.680 So other use cases that work for aliases, 0:27:34.720,0:27:36.549 as I was saying, 0:27:36.549,0:27:39.920 for some things it may be much more convenient, 0:27:40.040,0:27:41.020 like 0:27:41.020,0:27:43.200 one of my favorites is git status. 0:27:43.200,0:27:47.500 It's extremely long, and I don't like typing that long of a command every so often, 0:27:47.560,0:27:48.960 because you end up taking a lot of time. 0:27:49.120,0:27:53.000 So "gs" will stand in for "git status". 0:27:53.820,0:27:58.620 You can also use them to alias things that you mistype often, 0:27:58.620,0:28:01.160 so you can do "sl=ls", 0:28:01.160,0:28:02.540 that will work. 0:28:05.800,0:28:10.620 Other useful mappings are, 0:28:10.680,0:28:15.460 you might want to alias a command to itself 0:28:15.740,0:28:17.520 but with a default flag. 0:28:17.520,0:28:21.100 So here what is going on is I'm creating an alias 0:28:21.100,0:28:23.100 which is an alias for the move command, 0:28:23.300,0:28:29.780 which is "mv" and I'm aliasing it to the same command but adding the "-i" flag. 0:28:29.980,0:28:34.460 And this "-i" flag, if you go through the man page and look at it, it stands for "interactive". 0:28:34.780,0:28:39.880 And what it will do is it will prompt me before I do an overwrite. 0:28:39.880,0:28:44.420 So once I have executed this, I can do something like 0:28:44.700,0:28:47.360 I want to move "aliases" into "case". 0:28:47.700,0:28:53.140 By default "mv" won't ask, and if "case" already exists, it will just be overwritten: 0:28:53.160,0:28:55.780 "That's fine, I'm going to overwrite whatever is there." 0:28:56.020,0:28:58.580 But here it's now expanded, 0:28:58.580,0:29:01.660 "mv" has been expanded into this "mv -i" 0:29:01.660,0:29:03.540 and it's using that to ask me 0:29:03.540,0:29:07.400 "Oh, are you sure you want to overwrite this?" 0:29:07.700,0:29:11.780 And I can say no, I don't want to lose that file.
0:29:12.180,0:29:15.820 Lastly, you can use "alias move" 0:29:15.820,0:29:18.520 to ask for what this alias stands for. 0:29:19.100,0:29:22.060 So it will tell you so you can quickly make sure 0:29:22.080,0:29:25.400 what the command that you are executing actually is. 0:29:27.040,0:29:31.400 One inconvenient part about, for example, having aliases is how will you go about 0:29:31.760,0:29:35.340 persisting them into your current environment? 0:29:35.500,0:29:38.120 Like, if I were to close this terminal now, 0:29:38.280,0:29:40.160 all these aliases will go away. 0:29:40.160,0:29:43.020 And you don't want to be kind of retyping these commands 0:29:43.020,0:29:46.760 and more generally, if you start configuring your shell more and more, 0:29:46.860,0:29:50.880 you want some way of bootstrapping all this configuration. 0:29:51.380,0:29:56.780 You will find that most shell command programs 0:29:56.880,0:30:01.440 will use some sort of text based configuration file. 0:30:01.440,0:30:06.740 And this is what we usually call "dotfiles", because they start with a dot for historical reasons. 0:30:07.060,0:30:13.160 So for bash in our case, which is a shell, 0:30:13.160,0:30:15.560 we can look at the bashrc. 0:30:16.180,0:30:19.840 For demonstration purposes, here I have been using ZSH, 0:30:19.900,0:30:24.460 which is a different shell, and I'm going to be configuring bash, and starting bash. 0:30:24.640,0:30:29.640 So if I create an entry here and I say 0:30:29.940,0:30:31.960 SL maps to LS 0:30:32.600,0:30:36.020 And I have modified that, and now I start bash. 0:30:36.540,0:30:40.660 Bash is kind of completely unconfigured, but now if I do SL... 0:30:41.360,0:30:44.040 Hm, that's unexpected. 0:30:46.280,0:30:48.000 Oh, good. Good getting that. 0:30:48.300,0:30:52.200 So it matters where you config file is, 0:30:52.200,0:30:55.260 your config file needs to be in your home folder. 
0:30:55.640,0:31:00.940 So your configuration file for bash will live in that "~", 0:31:00.940,0:31:03.940 which will expand to your home directory, 0:31:03.940,0:31:05.560 and then bashrc. 0:31:06.160,0:31:08.840 And here we can create the alias 0:31:12.040,0:31:15.840 and now we start a bash session and we do SL. 0:31:15.840,0:31:21.500 Now it has been loaded, and this is loaded at the beginning when this 0:31:22.300,0:31:24.300 bash program is started. 0:31:24.700,0:31:31.200 All this configuration is loaded and you can, not only use aliases, they can have a lot of parts of configuration. 0:31:31.390,0:31:35.729 So for example here, I have a prompt which is fairly useless. 0:31:35.730,0:31:38.429 It has just given me the name of the shell, which is bash, 0:31:38.640,0:31:43.820 and the version, which is 5.0. I don't want this to be displayed and 0:31:44.360,0:31:48.540 as with many things in your shell, this is just an environment variable. 0:31:48.600,0:31:53.120 So the "PS1" is just the prompt string 0:31:53.710,0:31:55.480 for your prompt and 0:31:55.480,0:32:02.520 we can actually modify this to just be a "> " symbol. 0:32:02.520,0:32:08.280 and now that has been modified, and we have that. But if we exit and call bash again, 0:32:08.620,0:32:15.059 that was lost. However, if we add this entry and say, oh we want "PS1" 0:32:15.760,0:32:17.230 to be 0:32:17.230,0:32:19.179 this and 0:32:19.179,0:32:24.689 we call bash again, this has been persisted. And we can keep modifying this configuration. 0:32:25.090,0:32:27.209 So maybe we want to include 0:32:27.880,0:32:29.880 where the 0:32:30.370,0:32:32.939 working directory that we are is in, and 0:32:34.140,0:32:37.380 that's telling us the same information that we had in the other shell. 
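A minimal sketch of what such a configuration file might contain. To keep it safe to run, it writes to a temporary file (a stand-in for ~/.bashrc, which is an assumption of this example, as is the variable name rc_file) and then loads that file into the current shell.

```shell
# Stand-in for ~/.bashrc so the example does not touch real configuration.
rc_file="$(mktemp)"

cat > "$rc_file" <<'EOF'
# Settings placed here are picked up when the shell reads this file.
alias sl=ls
# \w is bash's prompt escape for the current working directory.
PS1='\w > '
EOF

# Load the file into the current shell, much as an interactive bash
# does with ~/.bashrc when it starts.
. "$rc_file"
printf 'prompt string is now: %s\n' "$PS1"
```

Single quotes around the prompt string keep the backslash escape intact so that bash, rather than the quoting step, interprets it.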
0:32:37.380,0:32:40.480 And there are many, many options, 0:32:40.780,0:32:45.060 shells are highly, highly configurable, and 0:32:45.700,0:32:49.920 it's not only shells that are configured through these files, 0:32:50.590,0:32:55.740 there are many other programs. As we saw for example in the editors lecture, Vim is also 0:32:55.840,0:33:02.900 configured this way. We gave you this vimrc file and told you to put it under your 0:33:03.460,0:33:06.380 home/.vimrc 0:33:06.380,0:33:11.800 and this is the same concept, but just for Vim. It's just giving it a set of 0:33:12.160,0:33:18.340 instructions that it should load when it's started, so you can keep a configuration that you want. 0:33:19.140,0:33:21.240 And even non... 0:33:21.580,0:33:27.140 kind of a lot of programs will support this. For instance, my terminal emulator, which is another concept, 0:33:27.260,0:33:30.159 which is the program that is 0:33:30.159,0:33:35.459 running the shell, in a way, and displaying this into the screen in my computer. 0:33:35.950,0:33:38.610 It can also be configured this way, so 0:33:39.940,0:33:43.620 if I modify this I can 0:33:46.510,0:33:53.279 change the size of the font. Like right now, for example, I have increased the font size a lot 0:33:53.279,0:33:55.768 for demonstration purposes, but 0:33:56.440,0:34:00.360 if I change this entry and make it for example 0:34:01.320,0:34:06.820 28 and write this value, you see that the size of the font has changed, 0:34:06.820,0:34:12.920 because I edited this text file that specifies how my terminal emulator should work. 0:34:19.480,0:34:20.900 Any questions so far? 0:34:20.900,0:34:22.280 With dotfiles. 0:34:28.040,0:34:35.940 Okay, it can be a bit daunting knowing that there is like this endless wall of configurations, 0:34:35.940,0:34:40.600 and how do you go about learning about what can be configured?
0:34:42.020,0:34:44.300 The good news is that 0:34:44.640,0:34:48.900 we have linked you to really good resources in the lecture notes. 0:34:48.960,0:34:56.440 But the main idea is that a lot of people really like just configuring these tools and have uploaded 0:34:56.640,0:35:01.140 their configuration files to GitHub, and other kinds of repositories online. 0:35:01.140,0:35:03.300 So for example, here we are on GitHub, 0:35:03.300,0:35:06.640 we search for dotfiles, and can see that there are like 0:35:06.780,0:35:12.540 thousands of repositories of people sharing their configuration files. We have also... 0:35:12.540,0:35:15.460 Like, the class instructors have linked our dotfiles. 0:35:15.460,0:35:19.420 So if you really want to know how any part of our setup is working 0:35:19.420,0:35:22.220 you can go through it and try to figure it out. 0:35:22.220,0:35:24.220 You can also feel free to ask us. 0:35:24.380,0:35:27.060 If we go for example to this repository here 0:35:27.210,0:35:30.649 we can see that there's many, many files that you can configure. 0:35:30.650,0:35:37.520 For example, there is one for bash, the first couple of ones are for git, that will probably be covered in the 0:35:38.610,0:35:40.819 version control lecture tomorrow. 0:35:41.400,0:35:48.500 If we go for example to the bash profile, which is a different form of what we saw in the bashrc, 0:35:49.400,0:35:52.900 it can be really useful because you can learn through 0:35:53.940,0:35:58.320 just looking at the manual page, but the manual page is, a lot of the time 0:35:58.480,0:36:03.520 just kind of like a descriptive explanation of all the different options 0:36:03.520,0:36:04.880 and sometimes it's more helpful 0:36:04.880,0:36:09.600 going through examples of what people have done and trying to understand why they did it 0:36:09.600,0:36:12.200 and how it's helping their workflow. 0:36:12.960,0:36:17.300 We can see here that this person has done case-insensitive globbing.
0:36:17.320,0:36:21.220 We covered globbing as this kind of filename expansion 0:36:22.100,0:36:25.760 trick in the shell scripting and tools. 0:36:25.900,0:36:28.800 And here you say no, I don't want this to matter, 0:36:28.800,0:36:30.760 whether using uppercase and lowercase, 0:36:30.760,0:36:32.760 and just setting this option in the shell for these things to work this way 0:36:35.360,0:36:38.140 Similarly, there is for example aliases. 0:36:38.140,0:36:42.220 Here you can see a lot of aliases that this person is doing. For example, "d" for 0:36:44.200,0:36:47.400 "d" for "Dropbox", sorry, because that's just much shorter. 0:36:47.400,0:36:49.200 "g" for "git"... 0:36:49.740,0:36:54.560 Say we go, for example, with vimrc. It can be actually very, very informative, 0:36:54.560,0:36:58.860 going through this and trying to extract useful information. 0:36:59.000,0:37:06.420 We do not recommend just kind of getting one huge blob of this and copying this into your config files, 0:37:07.110,0:37:12.439 because maybe things are prettier, but you might not really understand what is going on. 0:37:15.150,0:37:19.579 Lastly one thing I want to mention about dotfiles is that 0:37:20.460,0:37:23.390 people not only try to push these 0:37:24.660,0:37:28.849 files into GitHub just so other people can read it, that's 0:37:29.400,0:37:33.319 one reason. They also make really sure they can 0:37:34.140,0:37:39.440 reproduce their setup. And to do that they use a slew of different tools. 0:37:39.440,0:37:41.280 Oops, went a little too far. 0:37:41.280,0:37:44.840 So GNU Stow is, for example, one of them 0:37:45.720,0:37:49.060 and the trick that they are doing is 0:37:50.280,0:37:54.520 they are kind of putting all their dotfiles in a folder and they are 0:37:55.200,0:37:59.520 faking to the system, using a tool called symlinks, 0:37:59.520,0:38:02.440 that they are actually what they're not. I'm gonna 0:38:03.150,0:38:05.150 draw really quick what I mean by that. 
0:38:05.790,0:38:10.939 So a common folder structure might look like you have your home folder and 0:38:11.670,0:38:14.300 in this home folder you might have your 0:38:16.050,0:38:21.380 bashrc, that contains your bash configuration, you might have your vimrc and 0:38:22.500,0:38:25.760 it would be really great if you could keep this under version control. 0:38:26.580,0:38:29.300 But the thing is, you might not want to have a git repository, 0:38:29.300,0:38:31.300 which will be covered tomorrow, 0:38:31.300,0:38:32.300 in your home folder. 0:38:32.300,0:38:37.360 So what people usually do is they create a dotfiles repository, 0:38:38.280,0:38:42.160 and then they have entries here for their 0:38:43.050,0:38:47.239 bashrc and their vimrc. And this is where actually 0:38:47.820,0:38:49.820 the files are 0:38:50.100,0:38:52.400 and what they are doing is they're just 0:38:53.460,0:38:56.510 telling the OS to forward, whenever anyone 0:38:56.760,0:39:01.849 wants to read this file or write to this file, just forward this to this other file. 0:39:03.000,0:39:05.719 This is a concept called symlinks 0:39:06.690,0:39:08.630 and it's useful in this scenario, 0:39:08.630,0:39:12.600 but in general it's a really useful tool in UNIX 0:39:12.700,0:39:14.700 that we haven't covered so far in the lectures 0:39:14.960,0:39:16.740 but you might be... 0:39:16.740,0:39:18.740 that you should be familiar with. 0:39:19.100,0:39:22.840 And in general, the syntax will be "ln -s" 0:39:22.840,0:39:29.980 for specifying a symbolic link and then you will put the path to the file 0:39:30.570,0:39:33.049 that you want to link to and then the 0:39:33.780,0:39:35.780 symlink that you want to create.
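The folder layout described here can be sketched in a scratch directory. The directory names home and dotfiles, the variable demo, and the file contents are assumptions for illustration; a real setup would link from the actual home folder into a version-controlled repository.

```shell
# Build a throwaway layout: a fake home folder and a dotfiles folder.
demo="$(mktemp -d)"
mkdir -p "$demo/home" "$demo/dotfiles"

# The real file lives in the dotfiles folder.
echo 'alias sl=ls' > "$demo/dotfiles/bashrc"

# ln -s TARGET LINK_NAME: the existing file comes first,
# then the path of the symlink to create.
ln -s "$demo/dotfiles/bashrc" "$demo/home/.bashrc"

# Reading through the symlink yields the contents of the real file.
cat "$demo/home/.bashrc"
ls -l "$demo/home/.bashrc"
```

Programs that look for the file in its default location follow the link without needing to know where the real file is kept.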
0:39:39.390,0:39:41.390 And 0:39:41.880,0:39:45.619 All these all these kind of fancy tools that we're seeing here listed, 0:39:45.810,0:39:52.159 they all amount to doing some sort of this trick, so that you can have all your dotfiles neat and tidy 0:39:52.680,0:39:57.829 into a folder, and then they can be version-controlled, and they can be 0:39:58.349,0:40:02.689 symlinked so the rest of the programs can find them in their default locations. 0:40:06.720,0:40:09.020 Any questions regarding dotfiles? 0:40:13.200,0:40:20.200 Do you need to have the dotfiles in your home folder, and then also dotfiles in the version control folder? 0:40:20.780,0:40:24.640 So what you will have is, pretty much every program, 0:40:24.640,0:40:26.180 for example bash, 0:40:26.180,0:40:29.560 will always look for "home/.bashrc". 0:40:29.560,0:40:33.480 That's where the program is going to look for. 0:40:33.820,0:40:40.200 What you do when you do a symlink is, you place your "home/.bashrc" 0:40:40.200,0:40:44.900 it's just a file that is kind of a special file in UNIX, 0:40:45.150,0:40:49.609 that says oh, whenever you want to read this file go to this other file. 0:40:51.500,0:40:53.440 There's no content, like there is no... 0:40:53.600,0:40:58.099 your aliases are not part of this dotfile. That file is just kind of like a pointer, saying now you should 0:40:58.100,0:40:59.400 go that other way. 0:40:59.400,0:41:02.600 And by doing that you can have your other file 0:41:02.600,0:41:04.400 in that other folder. 0:41:04.560,0:41:06.360 If version controlling is not useful, think about 0:41:06.360,0:41:10.740 what if you want to have them in your Dropbox folder, so they're synced to the cloud, 0:41:10.759,0:41:15.019 for example. That's kind of another use case where like symlinks could be really useful 0:41:16.240,0:41:21.040 So you don't need the folder dotfiles to be in the home directory, right? 0:41:21.040,0:41:23.820 Because you can just use the symlink, that points somewhere else. 
0:41:23.960,0:41:29.760 As long as you have a way for the default path to resolve wherever you have it, yeah. 0:41:35.100,0:41:38.000 Last thing I want to cover in the lecture... 0:41:38.000,0:41:40.380 Oh, sorry, any other questions about dotfiles? 0:41:49.200,0:41:52.580 Last thing I want to cover in the lecture is working with remote machines, 0:41:52.580,0:41:55.549 which is a thing that you will run into, 0:41:55.559,0:41:56.900 sooner or later. 0:41:56.900,0:42:02.238 And there are a few things that will make your life much easier when dealing with remote machines 0:42:03.180,0:42:05.180 if you know about them. 0:42:05.220,0:42:08.380 Right now maybe because you are using the Athena cluster, 0:42:08.380,0:42:10.740 but later on, during your programming career, 0:42:10.740,0:42:11.960 it's pretty sure that 0:42:11.960,0:42:15.400 there is a fairly ubiquitous concept of having your 0:42:15.400,0:42:20.380 local working environment and then having some production server that is actually running the 0:42:20.970,0:42:23.239 code, so it is really good to get familiar 0:42:24.480,0:42:26.749 about how to work in/with remote machines. 0:42:27.420,0:42:35.180 So the main command for working with remote machines is SSH. 0:42:37.760,0:42:43.900 SSH is just like a secure shell, it's just gonna take the responsibility for 0:42:43.900,0:42:46.540 reaching wherever we want or tell it to go 0:42:47.560,0:42:50.700 and trying to open a session there. 0:42:50.700,0:42:52.400 So here the syntax is: 0:42:53.130,0:42:56.660 "JJGO" is the user that I want to use in the remote machine, 0:42:56.660,0:42:58.430 and this is because the user is 0:42:58.529,0:43:03.460 different from the one I have my local machine, which will be the case a lot of the time, 0:43:03.460,0:43:07.400 then the "@" is telling the terminal that this separates 0:43:07.400,0:43:12.540 what the user is from what the address is. 
0:43:12.540,0:43:16.540 And here I'm using an IP address because what I'm actually doing is 0:43:16.540,0:43:20.500 I have a virtual machine in my computer, 0:43:20.500,0:43:23.240 that is the one that is remote right now. 0:43:23.240,0:43:26.400 And I'm gonna be SSH'ing into it. This is the 0:43:26.580,0:43:27.880 URL that I'm using, 0:43:27.880,0:43:29.860 sorry, the IP that I'm using, 0:43:29.860,0:43:32.280 but you might also see things like 0:43:32.360,0:43:36.820 oh I want to SSH as "JJGO" 0:43:36.820,0:43:39.840 at "foobar.mit.edu" 0:43:39.840,0:43:42.960 That's probably something more common, if you are using some 0:43:42.960,0:43:47.260 remote server that has a DNS name. 0:43:48.180,0:43:51.860 So going back to a regular command, 0:43:53.220,0:43:56.580 we try to SSH, it asks us for a password, 0:43:56.580,0:43:58.180 really common thing. 0:43:58.190,0:43:59.480 And now we're there. We have... 0:43:59.480,0:44:02.629 we're still in our same terminal emulator 0:44:02.630,0:44:09.529 but right now SSH is kind of forwarding the entire virtual display to display what the 0:44:09.869,0:44:14.358 remote shell is displaying. And we can execute commands here and 0:44:15.630,0:44:17.630 we'll see the remote files 0:44:18.390,0:44:22.819 A couple of handy things to know about SSH, that were briefly covered in the 0:44:23.220,0:44:27.080 data wrangling lecture, is that SSH is not only good for just 0:44:28.280,0:44:33.760 opening connections. It will also let you just execute commands remotely. 0:44:33.770,0:44:36.979 So for example, if I do that, it's gonna ask me 0:44:37.710,0:44:39.020 what is my password?, again. 0:44:39.020,0:44:41.059 And it's executing this command 0:44:41.279,0:44:43.420 then coming back to my terminal 0:44:43.420,0:44:47.420 and piping the output of what that command was, in the remote machine, 0:44:47.420,0:44:50.480 through the standard output in my current shell. 0:44:50.480,0:44:53.940 And I could have this in...
0:44:58.100,0:45:00.480 I could have this in a pipe, and 0:45:00.980,0:45:03.580 this will work and we'll just 0:45:03.600,0:45:06.100 drop all this output and then have a local pipe 0:45:06.100,0:45:07.879 where I can keep working. 0:45:08.640,0:45:12.140 So far, it has been kind of inconvenient, having to type our password. 0:45:12.630,0:45:14.820 There's one really good trick for this. 0:45:14.820,0:45:16.880 It's we can use something called "SSH keys". 0:45:17.140,0:45:20.660 SSH keys just use public key encryption 0:45:20.660,0:45:24.980 to create a pair of SSH keys, a public key and a private key, and then 0:45:25.170,0:45:29.320 you can give the server the public part of the key. 0:45:29.320,0:45:32.810 So you copy the public key and then whenever you try to 0:45:33.390,0:45:37.129 authenticate instead of using your password, it's gonna use the private key to 0:45:37.820,0:45:40.800 prove to the server that you are actually who you say you are. 0:45:43.860,0:45:48.020 We can quickly showcase how you will go 0:45:48.020,0:45:49.400 about doing this. 0:45:49.400,0:45:53.180 Right now I don't have any SSH keys, so I'm gonna create a couple of them. 0:45:53.940,0:45:58.250 First thing, it's just gonna ask me where I want this key to live. 0:45:58.980,0:46:00.640 Unsurprisingly, it's doing this. 0:46:00.640,0:46:04.820 This is my home folder and then it's using this ".ssh" path, 0:46:05.460,0:46:08.750 which refers back to the same concept that we covered earlier about having 0:46:08.850,0:46:12.439 dotfiles. Like ".ssh" is a folder that contains a lot of the 0:46:13.320,0:46:16.540 configuration files for how you want SSH to behave. 0:46:17.060,0:46:19.420 So it will ask us a passphrase. 
0:46:19.680,0:46:23.120 The passphrase is to encrypt the private part of the key 0:46:23.120,0:46:27.160 because if someone gets your private key, if you don't have a password protected 0:46:27.920,0:46:29.580 private key, if they get that key 0:46:29.580,0:46:32.240 they can use that key to impersonate you in any server. 0:46:32.310,0:46:34.360 Whereas if you add a passphrase, 0:46:34.360,0:46:37.640 they will have to know what the passphrase is to actually use the key. 0:46:40.800,0:46:51.740 It has created a key pair. We can check that these two files are now under ssh. 0:46:51.740,0:46:53.920 And we can see... 0:46:57.720,0:47:02.960 We have these two files: we have the 25519 and the public key. 0:47:03.320,0:47:06.300 And if we "cat" through the output, 0:47:06.300,0:47:09.760 that key is actually not like any fancy binary file, it's 0:47:15.430,0:47:20.760 just a text file that has the contents of the public key and some 0:47:23.050,0:47:26.729 alias name for it, so we can know what this public key is. 0:47:26.950,0:47:32.220 The way we can tell the server that we're authorized to SSH there 0:47:32.260,0:47:38.400 is by just actually copying this file, like copying this string into a file, 0:47:38.400,0:47:41.540 that is ".ssh/authorized_keys". 0:47:42.100,0:47:46.160 So here what I'm doing is I'm 0:47:46.960,0:47:49.770 catting the output of this file 0:47:49.800,0:47:53.920 which is just this line of text that we want to copy 0:47:53.920,0:47:57.440 and I'm piping that into SSH and then remotely 0:47:57.960,0:48:02.080 I'm asking "tee" to dump the contents of the standard input 0:48:02.080,0:48:05.220 into ".ssh/authorized_keys". 0:48:05.440,0:48:10.360 And if we do that, obviously it's gonna ask us for a password.
0:48:14.800,0:48:18.740 It was copied, and now we can check that if we try 0:48:19.690,0:48:21.690 to SSH again, 0:48:21.960,0:48:24.840 It's going to first ask us for a passphrase 0:48:24.840,0:48:29.100 but you can arrange that so that it's saved in the session 0:48:29.460,0:48:34.840 and we didn't actually have to type the password for the server. 0:48:34.840,0:48:36.840 And I can kind of show that again. 0:48:45.820,0:48:47.540 More things that are useful. 0:48:47.540,0:48:49.040 Oh, we can do... 0:48:49.220,0:48:51.880 If that command seemed a little bit janky, 0:48:51.980,0:48:55.000 you can actually use this command that is built for this, 0:48:55.000,0:49:00.640 so you don't have to kind of craft this "ssh tee" command. 0:49:00.640,0:49:03.800 That is just called "ssh-copy-id". 0:49:05.000,0:49:08.080 And we can do the same 0:49:08.080,0:49:09.660 and it's gonna copy the key. 0:49:09.660,0:49:14.280 And now, if we try to SSH, 0:49:14.500,0:49:18.320 we can SSH without actually typing any key at all, 0:49:18.860,0:49:20.320 or any password. 0:49:20.660,0:49:21.520 More things. 0:49:21.520,0:49:23.520 We will probably want to copy files. 0:49:23.740,0:49:25.310 You cannot use "CP" 0:49:25.310,0:49:29.720 but you can use "SCP", for "secure copy". 0:49:29.720,0:49:34.500 And here we can specify that we want to copy this local file called notes 0:49:34.500,0:49:36.880 and the syntax is kind of similar. 0:49:36.880,0:49:39.760 We want to copy to this remote and 0:49:39.920,0:49:44.020 then we have a colon to separate what the path is going to be. 0:49:44.020,0:49:45.040 And then we have 0:49:45.040,0:49:46.620 oh, we want to copy this as notes 0:49:46.620,0:49:51.000 but we could also copy this as foobar. 0:49:51.740,0:49:55.600 And if we do that, it has been executed 0:49:55.780,0:49:59.280 and it's telling us that all the contents have been copied there.
0:49:59.540,0:50:02.200 If you're gonna be copying a lot of files, 0:50:02.200,0:50:05.100 there is a better command that you should be using 0:50:05.100,0:50:07.740 that is called "RSYNC". For example, here 0:50:07.900,0:50:10.780 just by specifying these three flags, 0:50:10.820,0:50:15.960 I'm telling RSYNC to kind of preserve all the permissions whenever possible 0:50:16.240,0:50:19.740 to try to check if the file has already been copied. 0:50:19.740,0:50:24.100 For example, SCP will try to copy files that are already there. 0:50:24.200,0:50:26.440 This will happen for example if you are trying to copy 0:50:26.440,0:50:29.060 and the connection interrupts in the middle of it. 0:50:29.120,0:50:32.060 SCP will start from the very beginning, trying to copy every file, 0:50:32.080,0:50:36.600 whereas RSYNC will continue from where it stopped. 0:50:37.240,0:50:38.440 And here, 0:50:39.060,0:50:42.760 we ask it to copy the entire folder and 0:50:43.780,0:50:46.560 it's just really quickly copied the entire folder. 0:50:48.080,0:50:54.100 One of the other things to know about SSH is that 0:50:54.320,0:50:59.860 the equivalent of the dot file for SSH is the "SSH config". 0:50:59.860,0:51:06.340 So if we edit the SSH config to be 0:51:13.120,0:51:17.940 If I edit the SSH config to look something like this, 0:51:17.940,0:51:22.900 instead of having to, every time, type "ssh jjgo", 0:51:23.040,0:51:27.760 having this really long string so I can like refer to this specific remote, 0:51:27.760,0:51:30.140 I want to refer, with the specific user name, 0:51:30.140,0:51:32.760 I can have something here that says 0:51:33.160,0:51:35.680 this is the username, this is the host name, that this 0:51:36.860,0:51:40.540 host is referring to and you should use this identity file. 0:51:41.460,0:51:43.960 And if I copy this, 0:51:43.960,0:51:46.100 this is right now in my local folder, 0:51:46.100,0:51:49.000 I can copy this into ssh. 
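A sketch of what such an SSH config entry might look like, written to a scratch file rather than the real ~/.ssh/config. The user name jjgo comes from the lecture; the host alias vm, the address 192.0.2.10 (a documentation-only placeholder, not the address used in the lecture), the variable name sample_config, and the key path (the default file name for an ed25519 key) are assumptions of this example.

```shell
# Write a sample SSH client config to a scratch file for inspection;
# the real location would be ~/.ssh/config.
sample_config="$(mktemp)"

cat > "$sample_config" <<'EOF'
# "vm" is the short name to type instead of user@address.
# The HostName below is a placeholder address.
Host vm
    User jjgo
    HostName 192.0.2.10
    IdentityFile ~/.ssh/id_ed25519
EOF

cat "$sample_config"
```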
0:51:49.600,0:51:53.520 Now, instead of having to do this really long command, I can just say 0:51:53.520,0:51:57.100 I just want to SSH into the host called VM. 0:51:58.260,0:52:03.220 And by doing that, it's grabbing all that configuration from the SSH config 0:52:03.220,0:52:05.220 and applying it here. 0:52:05.240,0:52:10.060 This solution is much better than something like creating an alias for SSH, 0:52:10.360,0:52:13.360 because other programs like SCP and RSYNC 0:52:13.360,0:52:19.440 also know about the dotfiles for SSH and will use them whenever they are there. 0:52:22.820,0:52:30.400 Last thing I want to cover about remote machines is that here, for example, we'll have tmux and we can, 0:52:31.760,0:52:35.780 like I was saying before, we can start editing some file 0:52:39.160,0:52:44.500 and we can start running some job. 0:52:54.200,0:52:56.180 For example, something like HTOP. 0:52:56.180,0:52:58.720 And this is running here, we can 0:52:59.320,0:53:01.320 detach from it, 0:53:01.430,0:53:03.430 close the connection and 0:53:03.740,0:53:07.780 then SSH back. And then, if you do "tmux a", 0:53:07.780,0:53:11.340 everything is as you left it, like nothing has really changed. 0:53:11.340,0:53:15.220 And if you have things executing there in the background, they will keep executing. 0:53:17.500,0:53:23.300 I think that, pretty much, ends all I have to say for this tool. 0:53:23.300,0:53:26.420 Any questions related to remote machines? 0:53:32.860,0:53:36.780 That's a really good question. So what I do for that, 0:53:38.700,0:53:39.460 Oh, yes, sorry. 0:53:39.460,0:53:44.880 So the question is, how do you deal with trying to use tmux in your local machine, 0:53:44.880,0:53:47.640 and also trying to use tmux in the remote machine? 0:53:48.400,0:53:50.760 There are a couple of tricks for dealing with that. 0:53:50.760,0:53:53.220 The first one is changing the prefix. 
0:53:53.360,0:53:55.340 So what I do, for example, is 0:53:55.340,0:54:00.020 in my local machine the prefix I have changed from "Ctrl+B" to "Ctrl+A" and 0:54:00.220,0:54:02.580 then in remote machines this is still "Ctrl+B". 0:54:02.800,0:54:05.580 So I can kind of swap between, 0:54:05.580,0:54:09.840 if I want to do things to the local tmux I will do "Ctrl+A" 0:54:09.840,0:54:13.460 and if I want to do things to the remote tmux I would do "Ctrl+B". 0:54:15.080,0:54:19.900 Another thing is that you can have separate configs, 0:54:20.080,0:54:24.100 so I can do something like this, and then... 0:54:27.260,0:54:31.040 Ah, because I don't have my own ssh config, yeah. 0:54:32.240,0:54:33.000 But if you... 0:54:33.000,0:54:34.420 Um, I can SSH "VM". 0:54:36.820,0:54:38.900 Here, what you see, 0:54:38.900,0:54:41.000 the difference between these two bars, for example, 0:54:41.000,0:54:43.680 is because the tmux config is different. 0:54:44.380,0:54:48.500 As you will see in the exercises, the tmux configuration is in 0:54:50.320,0:54:53.780 the tmux.conf 0:54:56.720,0:54:58.140 And in tmux.conf, 0:54:58.140,0:55:02.020 here you can do a lot of things like changing the color depending on the host you are on 0:55:02.210,0:55:06.879 so you can get like quick visual feedback about where you are, or 0:55:06.880,0:55:10.240 if you have a nested session. Also, tmux will, 0:55:10.520,0:55:15.280 if you're in the same host and you try to tmux within a tmux session, 0:55:15.290,0:55:18.759 it will kind of prevent you from doing it so you don't run into issues. 0:55:21.700,0:55:25.400 Any other questions related, to kind of all the topics we have covered. 0:55:29.100,0:55:32.720 Another answer to that question is also, if you type the prefix twice, 0:55:32.880,0:55:35.760 it sends it once to the underlying shell.
0:55:35.920,0:55:40.100 So the local binding is "Ctrl+A" and the remote binding is "Ctrl+A", 0:55:40.100,0:55:45.260 You could type "Ctrl+A", "Ctrl+A" and then "D", for example, detaches from the remote, basically. 0:55:52.480,0:55:59.660 I think that ends the class for today, there's a bunch of exercises related to all these main topics and 0:56:00.380,0:56:05.410 we're gonna be holding office hours today, too. So feel free to come and ask us any questions. ================================================ FILE: static/files/subtitles/2020/debugging-profiling.sbv ================================================ 0:00:00.000,0:00:04.200 So welcome back. Today we are gonna cover debugging and profiling. 0:00:04.720,0:00:09.340 Before I get into it we're gonna make another reminder to fill in the survey. 0:00:09.520,0:00:14.580 Just one of the main things we want to get from you is questions, because the last day 0:00:14.820,0:00:18.080 is gonna be questions from you guys: about things that 0:00:18.080,0:00:22.020 we haven't covered, or like you want us to kind of talk more in depth. 0:00:23.350,0:00:26.969 The more questions we get, the more interesting we can make that section, 0:00:26.970,0:00:28.900 so please go on and fill in the survey. 0:00:28.900,0:00:35.660 So today's lecture is gonna be a lot of topics. All the topics revolve around the concept of 0:00:35.820,0:00:39.920 what do you do when you have a program that has some bugs. 0:00:39.920,0:00:42.520 Which is most of the time, like when you are programming, you're kind of thinking 0:00:42.720,0:00:47.400 about how you implement something and there's like a half life of fixing all the issues that 0:00:47.620,0:00:52.140 that program has. And even if your program behaves like you want, it might be that it's 0:00:52.390,0:00:55.680 really slow, or it's taking a lot of resources in the process. 0:00:55.680,0:01:00.569 So today we're gonna see a lot of different approaches of dealing with these problems. 
0:01:01.300,0:01:05.099 So first, the first section is on debugging. 0:01:06.159,0:01:08.279 Debugging can be done in many different ways, 0:01:08.380,0:01:10.119 there are all kinds of... 0:01:10.120,0:01:13.640 The most simple approach that, pretty much, all 0:01:13.640,0:01:17.140 CS students will go through, will be just: you have some code, and it's not behaving 0:01:17.160,0:01:20.280 like you want, so you probe the code by adding 0:01:20.280,0:01:23.420 print statements. This is called "printf debugging" and 0:01:23.440,0:01:24.450 it works pretty well. 0:01:24.450,0:01:26.680 Like, I have to be honest, 0:01:26.820,0:01:33.120 I use it a lot of the time because of how simple it is to set up and how quick the feedback can be. 0:01:34.360,0:01:39.320 One of the issues with printf debugging is that you can get a lot of output 0:01:39.320,0:01:40.740 and maybe you don't want 0:01:40.800,0:01:43.240 to get as much output as you're getting. 0:01:43.780,0:01:49.349 There has... people have thought of slightly more complex ways of doing printf debugging and 0:01:53.920,0:01:58.320 one of these ways is what is usually referred to as "logging". 0:01:58.420,0:02:04.530 So the advantage of doing logging versus doing printf debugging is that, when you're creating logs, 0:02:05.080,0:02:09.780 you're not necessarily creating the logs because there's a specific issue you want to fix; 0:02:09.780,0:02:12.460 it's mostly because you have built a 0:02:12.480,0:02:16.840 more complex software system and you want to log when some events happen. 0:02:17.360,0:02:21.560 One of the core advantages of using a logging library is that 0:02:22.180,0:02:27.040 you can define severity levels, and you can filter based on those. 0:02:27.400,0:02:31.620 Let's see an example of how we can do something like that. 0:02:32.320,0:02:35.840 Yeah, everything fits here.
This is a really silly example: 0:02:36.340,0:02:37.520 We're just gonna 0:02:37.520,0:02:40.980 sample random numbers and, depending on the value of the number, 0:02:41.120,0:02:44.720 that we can interpret as a kind of "how wrong things are going". 0:02:44.740,0:02:48.760 We're going to log the value of the number and then 0:02:49.340,0:02:51.640 we can see what is going on. 0:02:52.580,0:02:59.280 I need to disable these formatters... 0:02:59.620,0:03:03.720 And if we were just to execute the code as it is, 0:03:04.160,0:03:07.420 we just get the output and we just keep getting more and more output. 0:03:07.420,0:03:13.599 But you have to kind of stare at it and make sense of what is going on, and we don't know 0:03:13.600,0:03:19.629 what is the relative timing between printfs, we don't really know whether this is just an information message 0:03:19.630,0:03:22.960 or a message of whether something went wrong. 0:03:23.810,0:03:25.810 If we just go in, 0:03:27.320,0:03:29.780 and undo, not that one... 0:03:34.220,0:03:37.140 That one, we can set that formatter. 0:03:38.620,0:03:41.600 Now the output looks something more like this 0:03:41.620,0:03:44.840 So for example, if you have several different modules that you are programming with, 0:03:44.840,0:03:46.940 you can identify them with like different levels. 0:03:46.940,0:03:49.800 Here, we have, we have debug levels, 0:03:50.330,0:03:51.890 we have critical 0:03:51.890,0:03:57.540 info, different levels. And it might be handy because here we might only care about the error messages. 0:03:57.740,0:04:00.640 Like those are like, the... We have been 0:04:00.700,0:04:03.960 working on our code, so far so good, and suddenly we get some error. 0:04:03.960,0:04:06.540 We can log that to identify where it's happening. 0:04:06.580,0:04:11.640 But maybe there's a lot of information messages, but we can deal with that 0:04:12.709,0:04:16.809 by just changing the level to error level. 
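The severity-level filtering described above can be sketched with Python's standard `logging` module. This is only a minimal sketch, not the script shown on screen: the logger name, the thresholds and the message texts are assumptions made for illustration.

```python
import io
import logging
import random


def make_logger(level, stream):
    """Build a logger that writes formatted records to `stream`."""
    logger = logging.getLogger("sketch")  # hypothetical logger name
    logger.handlers.clear()               # avoid duplicate handlers on repeated calls
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s : %(levelname)s : %(name)s : %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)                # records below this level are dropped
    return logger


def sample_and_log(logger, rng, count):
    """Sample random numbers and log them with a severity based on their value."""
    for _ in range(count):
        value = rng.random()
        if value < 0.5:                   # thresholds are arbitrary assumptions
            logger.debug("Value is %.3f, everything is fine", value)
        elif value < 0.8:
            logger.info("Value is %.3f, just an observation", value)
        elif value < 0.9:
            logger.warning("Value is %.3f, worth a look", value)
        elif value < 0.97:
            logger.error("Value is %.3f, something went wrong", value)
        else:
            logger.critical("Value is %.3f, maximum error", value)


if __name__ == "__main__":
    output = io.StringIO()
    sample_and_log(make_logger(logging.ERROR, output), random.Random(0), 50)
    print(output.getvalue(), end="")
```

Passing `logging.DEBUG` instead of `logging.ERROR` to `make_logger` brings the lower-severity messages back without touching any of the logging calls.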
0:04:17.400,0:04:17.900 And 0:04:18.890,0:04:22.960 now if we were to run this again, we are only going to get those 0:04:23.620,0:04:28.160 errors in the output, and we can just look through those to make sense of what is going on. 0:04:28.920,0:04:33.320 Another really useful tool when you're dealing with logs is 0:04:34.130,0:04:36.670 As you kind of look at this, 0:04:36.670,0:04:42.580 it has become easier because now we have this critical and error levels that we can quickly identify. 0:04:43.310,0:04:46.750 But since humans are fairly visual creatures, 0:04:48.680,0:04:53.109 one thing that you can do is use colors from your terminal to 0:04:53.630,0:04:57.369 identify these things. So now, changing the formatter, 0:04:57.369,0:05:03.320 what I've done is slightly change how the output is formatted. 0:05:03.580,0:05:09.340 When I do that, now whenever I get a warning message, it's color coded by yellow; 0:05:09.340,0:05:10.880 whenever I get like an error, 0:05:10.960,0:05:16.140 faded red; and when it's critical, I have a bold red indicating something went wrong. 0:05:16.280,0:05:22.620 And here it's a really short output, but when you start having thousands and thousands of lines of log, 0:05:22.620,0:05:26.380 which is not unrealistic and happens every single day in a lot of apps, 0:05:27.140,0:05:32.500 quickly browsing through them and identifying where the error or the red patches are 0:05:32.600,0:05:35.320 can be really useful. 0:05:35.600,0:05:41.400 A quick aside is, you might be curious about how the terminal is displaying these colors. 0:05:41.580,0:05:45.320 At the end of the day, the terminal is only outputting characters. 0:05:47.160,0:05:49.480 Like, how is this program or how are other programs, like LS, 0:05:50.060,0:05:56.050 that has all these fancy colors. How are they telling the terminal that it should use these different colors? 
0:05:56.360,0:05:58.779 This is nothing extremely fancy, 0:05:59.440,0:06:03.440 what these tools are doing, is something along these lines. 0:06:03.740,0:06:04.540 Here we have... 0:06:05.420,0:06:08.340 I can clear the rest of the output, so we can focus on this. 0:06:08.660,0:06:14.000 There's some special characters, some escape characters here, 0:06:14.260,0:06:19.740 then we have some text and then we have some other special characters. And if we execute this line 0:06:19.940,0:06:22.360 we get a red "This is red". 0:06:22.480,0:06:26.640 And you might have picked up on the fact that we have a "255;0;0" here, 0:06:26.720,0:06:31.400 this is just telling the RGB values of the color we want in the terminal. 0:06:31.400,0:06:38.100 And you pretty much can do this in any piece of code that you have, and like that you can color code the output. 0:06:38.100,0:06:42.540 Your terminal is fairly fancy and supports a lot of different colors in the output. 0:06:42.550,0:06:45.400 This is not even all of them, this is like a sixteenth of them. 0:06:46.100,0:06:49.119 I think it can be fairly useful to know about that. 0:06:52.100,0:06:55.960 Another thing is maybe you don't enjoy or you don't think 0:06:56.200,0:06:58.620 logs are really fit for you. 0:06:58.620,0:07:02.480 The thing is a lot of other systems that you might start using will use logs. 0:07:02.840,0:07:05.360 As you start building larger and larger systems, 0:07:05.360,0:07:10.140 you might rely on other dependencies. Common dependencies might be web servers or 0:07:10.220,0:07:12.320 databases, it's a really common one. 0:07:12.440,0:07:17.740 And those will be logging their errors or exceptions in their own logs. 0:07:17.740,0:07:20.540 Of course, you will get some client-side error, 0:07:20.620,0:07:25.140 but those sometimes are not informative enough for you to figure out what is going on. 
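The escape-character trick described above for the red "This is red" text can be reproduced from Python. This is a minimal sketch: the helper name `colorize` is hypothetical, and it assumes a terminal that understands 24-bit ("truecolor") ANSI escape sequences.

```python
def colorize(text, red, green, blue):
    """Wrap `text` in ANSI escape sequences that set a 24-bit foreground color."""
    start = f"\x1b[38;2;{red};{green};{blue}m"  # ESC [ 38;2;R;G;B m selects an RGB foreground
    reset = "\x1b[0m"                            # ESC [ 0 m resets colors and attributes
    return f"{start}{text}{reset}"


if __name__ == "__main__":
    # "255;0;0" is the RGB triple for pure red
    print(colorize("This is red", 255, 0, 0))
```

The program only emits ordinary characters; it is the terminal that interprets the escape sequence and changes the color.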
0:07:25.900,0:07:33.940 In most UNIX systems, the logs are usually placed under a folder called "/var/log" 0:07:33.940,0:07:37.980 and if we list it, we can see there's a bunch of logs in here. 0:07:42.680,0:07:48.040 So we have like the shutdown monitor log, or some weekly logs. 0:07:49.669,0:07:56.199 Things related to the Wi-Fi, for example. And if we output the 0:07:57.560,0:08:00.840 System log, which contains a lot of information about the system, 0:08:00.840,0:08:03.940 we can get information about what's going on. 0:08:04.120,0:08:06.780 Similarly, there are tools that will let you 0:08:07.460,0:08:13.090 more sanely go through this output. But here, looking at the system log, 0:08:13.090,0:08:15.520 I can look at this, and say: 0:08:15.760,0:08:20.040 oh there's some service that is exiting with some abnormal code 0:08:20.420,0:08:25.460 and based on that information, I can go and try to figure out what's going on, 0:08:25.510,0:08:27.500 like what's going wrong. 0:08:29.020,0:08:32.000 One thing to know when you're working with logs is that 0:08:32.000,0:08:35.900 more traditionally, every software had their own 0:08:35.920,0:08:42.540 log, but it has been increasingly more popular to have a unified system log where everything is placed. 0:08:43.010,0:08:49.299 Pretty much any application can log into the system log, but instead of being in a plain text format, 0:08:49.300,0:08:52.380 it will be compressed in some special format. 0:08:52.380,0:08:56.460 An example of this, it was what we covered in the data wrangling lecture. 0:08:56.520,0:08:59.900 In the data wrangling lecture we were using the "journalctl", 0:09:00.200,0:09:04.280 which is accessing the log and outputting all that output. 0:09:04.340,0:09:07.380 Here in Mac, now the command is "log show", 0:09:07.380,0:09:10.020 which will display a lot of information. 
0:09:10.100,0:09:15.760 I'm gonna just display the last ten seconds, because logs are really, really verbose and 0:09:17.060,0:09:23.720 just displaying the last 10 seconds is still gonna output a fairly large amount of lines. 0:09:23.900,0:09:28.240 So if we go back through what's going on, 0:09:28.240,0:09:33.460 we here see that a lot of Apple things are going on, since this is a macbook. 0:09:33.500,0:09:38.460 Maybe we could find errors about like some system issue here. 0:09:39.280,0:09:46.920 Again they're fairly verbose, so you might want to practice your data wrangling techniques here, 0:09:46.920,0:09:50.440 like 10 seconds equal to like 500 lines of logs, so you can kind of 0:09:50.960,0:09:54.960 get an idea of how many lines per second you're getting. 0:09:56.360,0:10:01.060 They're not only useful for figuring out some other programs' output, 0:10:01.060,0:10:05.619 they're also useful for you, if you want to log there instead of into your own file. 0:10:05.779,0:10:11.319 So using the "logger" command, in both linux and mac, 0:10:11.839,0:10:13.480 You can say okay 0:10:13.480,0:10:18.880 I'm gonna log this "Hello Logs" into this system log. 0:10:18.880,0:10:21.939 We execute the command and then 0:10:22.760,0:10:27.640 we can check by going through the last minute of logs, 0:10:27.640,0:10:31.760 since it's gonna be fairly recent, and grepping for that "Hello" 0:10:31.760,0:10:38.260 we find our entry. Fairly recent entry, that we just created that said "Hello Logs". 0:10:39.220,0:10:46.840 As you become more and more familiar with these tools, you will find yourself using 0:10:48.800,0:10:51.279 the logs more and more often, since 0:10:51.529,0:10:56.349 even if you have some bug that you haven't detected, and the program has been running for a while, 0:10:56.349,0:11:02.240 maybe the information is already in the log and can tell you enough to figure out what is going on. 0:11:02.800,0:11:08.260 However, printf debugging is not everything. 
So now I'm going to be covering debuggers. 0:11:08.260,0:11:10.380 But first any questions on logs so far? 0:11:11.720,0:11:15.040 So what kind of things can you figure out from the logs? 0:11:15.040,0:11:18.800 like this Hello Logs says that you did something with Hello at that time? 0:11:18.940,0:11:25.040 Yeah, like say, for example, I can write a bash script that detects... 0:11:25.060,0:11:29.480 Well, that checks every time what Wi-Fi network I'm connected to. 0:11:29.480,0:11:34.150 And every time it detects that it has changed, it makes an entry in the logs and says 0:11:34.150,0:11:37.440 Oh now it looks like we have changed Wi-Fi networks. 0:11:37.440,0:11:41.400 and then you might go back and parse through the logs and take like, okay 0:11:41.510,0:11:47.559 When did my computer change from one Wi-Fi network to another. And this is just kind of a simple example 0:11:47.560,0:11:50.260 But there are many, many ways, 0:11:50.660,0:11:54.020 many types of information that you could be logging here. 0:11:54.020,0:11:59.040 More commonly, you will probably want to check if your computer, for example, is 0:11:59.100,0:12:02.540 entering sleep, for example, for some unknown reason. 0:12:02.680,0:12:04.660 Like it's on hibernation mode. 0:12:04.820,0:12:09.100 There's probably some information in the logs about who asked that to happen, 0:12:09.100,0:12:10.240 or why it's that happening. 0:12:11.720,0:12:14.880 Any other questions? Okay. 0:12:14.880,0:12:17.380 So when printf debugging is not enough, 0:12:18.320,0:12:22.360 the best alternative after that is using... 0:12:23.360,0:12:25.360 [Exit that] 0:12:28.480,0:12:30.260 So, it's using a debugger. 0:12:30.580,0:12:37.620 So a debugger is a tool that will wrap around your code and will let you run your code, 0:12:38.120,0:12:40.480 but it will kind of keep control over it. 0:12:40.480,0:12:42.500 So it will let you step 0:12:42.500,0:12:47.080 through the code and execute it and set breakpoints. 
0:12:47.080,0:12:50.020 You probably have seen debuggers in some way, if you have 0:12:50.020,0:12:55.800 ever used something like an IDE, because IDEs have this kind of fancy: set a breakpoint here, execute, ... 0:12:56.080,0:12:59.040 But at the end of the day what these tools are using is just 0:12:59.040,0:13:04.740 these command line debuggers and they're just presenting them in a really fancy format. 0:13:04.850,0:13:09.969 Here we have a completely broken bubble sort, a simple sorting algorithm. 0:13:10.000,0:13:11.560 Don't worry about the details, 0:13:11.560,0:13:14.980 but we just want to sort this array that we have here. 0:13:17.360,0:13:19.460 We can try doing that by just doing 0:13:21.340,0:13:23.340 python bubble.py 0:13:23.500,0:13:28.360 And when we do that... Oh there's some index error, list index out of range. 0:13:28.480,0:13:31.200 We could start adding prints 0:13:31.200,0:13:33.740 but if we have a really long string, we can get a lot of information. 0:13:33.820,0:13:37.820 So how about we go up to the moment that we crashed? 0:13:37.900,0:13:41.020 We can go to that moment and examine what the 0:13:41.020,0:13:43.360 current state of the program was. 0:13:43.520,0:13:49.080 So for doing that I'm gonna run the program using the Python debugger. 0:13:49.080,0:13:53.820 Here I'm using technically the IPython debugger, just because it has nice syntax coloring 0:13:54.060,0:13:59.140 so it's probably easier for both of us to understand 0:13:59.300,0:14:01.300 what's going on in the output. 0:14:01.310,0:14:04.929 But they're pretty much identical anyway. 0:14:05.140,0:14:09.400 So we execute this, and now we are given a prompt 0:14:09.400,0:14:13.080 where we're being told that we are here, at the very first line of our program. 0:14:13.100,0:14:15.440 And we can...
0:14:15.980,0:14:20.380 "L" stands for "List", so as with many of these tools 0:14:21.140,0:14:24.400 there's kind of like a language of operations that you can do, 0:14:24.400,0:14:28.220 and they are often mnemonic, as it was the case with VIM or TMUX. 0:14:28.860,0:14:32.940 So here, "L" is for "Listing" the code, and we can see the entire code. 0:14:34.540,0:14:38.880 "S" is for "Step" and will let us kind of one 0:14:38.880,0:14:42.180 line at a time, go through the execution. 0:14:42.300,0:14:47.360 The thing is we're only triggering the error some time later. 0:14:47.360,0:14:48.710 So 0:14:48.710,0:14:55.150 we can restart the program and instead of trying to step until we get to the issue, 0:14:55.150,0:15:00.820 we can just ask for the program to continue which is the "C" command and 0:15:01.480,0:15:04.160 hey, we reached the issue. 0:15:04.640,0:15:08.080 We got to this line where everything crashed, 0:15:08.080,0:15:11.020 we're getting this list index out of range. 0:15:11.020,0:15:13.560 And now that we are here we can say, huh? 0:15:14.120,0:15:17.520 Okay, first, let's print the value of the array. 0:15:18.080,0:15:21.520 This is the value of the current array 0:15:23.120,0:15:26.840 So we have six items. Okay. What is the value of "J" here? 0:15:27.200,0:15:31.929 So we look at the value of "J". "J" is 5 here, which will be the last element, but 0:15:32.480,0:15:37.119 "J" plus 1 is going to be 6, so that's triggering the out of bounds error. 0:15:37.970,0:15:40.389 So what we have to do is 0:15:40.660,0:15:47.660 this "N", instead of "N" has to be "N minus one". We have identified that the error lies there. 0:15:47.660,0:15:50.800 So we can quit, which is "Q". 0:15:52.010,0:15:54.729 Again, because it's a post-mortem debugger. 0:15:56.090,0:16:00.219 We go back to the code and say okay, 0:16:02.860,0:16:06.180 we need to append this "N minus one". 
0:16:06.760,0:16:11.140 That will prevent the list index out of range and 0:16:11.480,0:16:14.260 if we run this again without the debugger, 0:16:15.020,0:16:18.729 okay, no errors now. But this is not our sorted list. 0:16:18.729,0:16:21.200 This is sorted, but it's not our list. 0:16:21.300,0:16:23.000 We are missing entries from our list, 0:16:23.160,0:16:27.420 so there is some behavioral issue that we're reaching here. 0:16:27.920,0:16:32.409 Again, we could start using printf debugging but kind of a hunch now 0:16:32.409,0:16:37.940 is that probably the way we're swapping entries in the bubble sort program is wrong. 0:16:38.480,0:16:45.920 We can use the debugger for this. We can go through them to the moment we're doing a swap and 0:16:46.120,0:16:48.320 check how the swap is being performed. 0:16:48.540,0:16:50.600 So a quick overview, 0:16:50.600,0:16:56.590 we have two for loops and in the most nested loop, 0:16:56.720,0:17:03.220 we are checking if the array is larger than the other array. The thing is if we just try to execute until this line, 0:17:03.589,0:17:06.609 it's only going to trigger whenever we make a swap. 0:17:06.700,0:17:11.640 So what we can do is we can set a breakpoint in the sixth line. 0:17:11.820,0:17:15.520 We can create a breakpoint in this line and then 0:17:15.580,0:17:20.820 the program will execute and the moment we try to swap variables is when the program is going to stop. 0:17:21.080,0:17:22.940 So we create a breakpoint there 0:17:22.940,0:17:27.000 and then we continue the execution of the program. The program halts 0:17:27.000,0:17:30.520 and says hey, I have executed and I have reached this line. 0:17:30.820,0:17:31.860 Now 0:17:31.920,0:17:39.120 I can use "locals()", which is a Python function that returns a dictionary with all the values 0:17:39.120,0:17:41.220 to quickly see the entire context. 
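The hunch about the swap can also be checked in isolation. The following is a minimal sketch, not the file from the lecture: the function names and the six-element input are assumptions. It contrasts a swap done one assignment at a time with a simultaneous tuple swap, and uses the "N minus one" loop bound mentioned above.

```python
def swap_one_at_a_time(arr, j):
    """Buggy swap: the first assignment overwrites arr[j] before it is saved."""
    arr[j] = arr[j + 1]
    arr[j + 1] = arr[j]  # copies the new arr[j] back, so one value is lost


def swap_simultaneously(arr, j):
    """Correct swap: the right-hand side is evaluated before any assignment."""
    arr[j], arr[j + 1] = arr[j + 1], arr[j]


def bubble_sort(arr):
    """Sort `arr` in place and return it."""
    n = len(arr)
    for _ in range(n):
        for j in range(n - 1):  # n - 1 keeps arr[j + 1] inside the list
            if arr[j] > arr[j + 1]:
                swap_simultaneously(arr, j)
    return arr


if __name__ == "__main__":
    pair = [4, 2]
    swap_one_at_a_time(pair, 0)
    print(pair)                             # the 4 is gone
    print(bubble_sort([4, 2, 1, 8, 7, 6]))  # hypothetical six-item input
```

With the one-at-a-time swap, every swap duplicates one entry and drops another, which would explain output that is sorted but missing items.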
0:17:43.100,0:17:48.140 The string, the array is fine and is six, again, just the beginning and 0:17:48.680,0:17:51.100 I step, go to the next line. 0:17:51.780,0:17:52.620 Oh, 0:17:52.620,0:17:57.000 and I identify the issue: I'm swapping one item at a time, instead of simultaneously, 0:17:57.020,0:18:01.840 so that's what's triggering the fact that we're losing variables as we go through. 0:18:03.200,0:18:06.729 That's kind of a very simple example, but 0:18:07.490,0:18:09.050 debuggers are really powerful. 0:18:09.050,0:18:13.320 Most programming languages will give you some sort of debugger, 0:18:13.540,0:18:19.920 and when you go to more low level debugging you might run into tools like... 0:18:19.920,0:18:21.920 You might want to use something like 0:18:25.340,0:18:27.340 GDB. 0:18:31.580,0:18:34.360 And GDB has one nice property: 0:18:34.460,0:18:37.740 GDB works really well with C/C++ and all these C-like languages. 0:18:37.780,0:18:42.720 But GDB actually lets you work with pretty much any binary that you can execute. 0:18:42.720,0:18:47.800 So for example here we have sleep, which is just a program that's going to sleep for 20 seconds. 0:18:48.520,0:18:55.340 It's loaded and then we can do run, and then we can interrupt this sending an interrupt signal. 0:18:55.340,0:19:02.020 And GDB is displaying for us, here, very low-level information about what's going on in the program. 0:19:02.030,0:19:06.820 So we're getting the stack trace, we're seeing we are in this nanosleep function, 0:19:07.060,0:19:11.660 we can see the values of all the hardware registers in your machine. So 0:19:12.300,0:19:17.160 you can get a lot of low-level detail using these tools. 0:19:18.560,0:19:22.520 I think that's all I want to cover for debuggers. 0:19:22.520,0:19:25.540 Any questions related to that? 0:19:33.520,0:19:39.040 Another interesting tool when you're trying to debug is that sometimes you want to debug as if 0:19:39.480,0:19:42.220 your program is a black box. 
0:19:42.220,0:19:46.059 So you, maybe, don't know the internals of the program, but at the same time 0:19:46.430,0:19:52.119 your computer knows whenever your program is trying to do some operations. 0:19:52.280,0:19:54.729 So in UNIX systems, 0:19:54.760,0:19:58.060 there's this notion of like user level code and kernel level code. 0:19:58.060,0:20:03.180 And when you try to do some operations like reading a file or like reading the network connection 0:20:03.340,0:20:06.020 you will have to do something called system calls. 0:20:06.180,0:20:12.560 You can get a program and go through those operations and ask 0:20:14.000,0:20:18.300 what operations did this software do? 0:20:18.300,0:20:20.920 So for example, if you have like a Python function 0:20:20.980,0:20:26.660 that is only supposed to do a mathematical operation and you run it through this program, 0:20:26.660,0:20:28.460 and it's actually reading files, 0:20:28.460,0:20:31.940 Why is it reading files? It shouldn't be reading files. So, let's see. 0:20:34.520,0:20:37.200 This is "strace". 0:20:37.200,0:20:38.740 So for example, we can do something like this. 0:20:38.740,0:20:41.260 So here we're gonna run the "ls -l" 0:20:42.220,0:20:47.900 And then we're ignoring the output of LS, but we are not ignoring the output of STRACE. 0:20:47.900,0:20:49.740 So if we execute that... 0:20:52.300,0:20:54.720 We're gonna get a lot of output. 0:20:54.920,0:20:58.740 This is all the different system calls 0:21:00.520,0:21:02.080 That this 0:21:02.090,0:21:07.510 LS has executed. You will see a bunch of OPEN, you will see FSTAT. 0:21:08.150,0:21:14.170 And for example, since it has to list all the properties of the files that are in this folder, we can 0:21:15.110,0:21:20.410 check for the LSTAT call.
So the LSTAT call will check for the properties of the files and 0:21:21.020,0:21:27.420 we can see that, effectively, all the files and folders that are in this directory 0:21:27.700,0:21:31.540 have been accessed through a system call, through LS. 0:21:34.120,0:21:43.400 Interestingly, sometimes you actually don't need to run your code to 0:21:44.360,0:21:47.000 figure out that there is something wrong with your code. 0:21:47.960,0:21:52.449 So far we have seen enough ways of identifying issues by running the code, 0:21:52.450,0:21:54.410 but what if you... 0:21:54.410,0:21:58.980 you can look at a piece of code like this, like the one I have shown right now in this screen, 0:21:58.980,0:22:00.560 and identify an issue. 0:22:00.560,0:22:02.030 So for example here, 0:22:02.030,0:22:06.670 we have some really silly piece of code. It defines a function, prints a few variables, 0:22:07.720,0:22:11.780 multiplies some variables, it sleeps for a while and then we try to print BAZ. 0:22:12.020,0:22:14.840 And you could try to look at this and say, hey, BAZ has 0:22:15.500,0:22:20.650 never been defined anywhere. This is a new variable. You probably meant to say BAR 0:22:20.650,0:22:22.540 but you just mistyped it. 0:22:22.540,0:22:26.480 Thing is, if we try to run this program, 0:22:28.820,0:22:36.820 it's gonna take 60 seconds, because like we have to wait until this time.sleep function finishes. Here, sleep is just for 0:22:37.790,0:22:42.070 motivating the example but in general you may be loading a data set that takes really long 0:22:42.140,0:22:44.740 because you have to copy everything into memory. 0:22:44.740,0:22:48.780 And the thing is, there are programs that will take source code as input, 0:22:49.340,0:22:54.940 will process it and will say, oh probably this is wrong about this piece of code. So in Python, 0:22:55.760,0:23:00.600 or in general, these are called static analysis tools. 0:23:00.780,0:23:02.860 In Python we have for example pyflakes. 
0:23:02.860,0:23:06.640 If we get this piece of code and run it through pyflakes, 0:23:06.860,0:23:09.820 pyflakes is gonna give us a couple of issues. 0:23:10.040,0:23:15.700 First one is the one.... The second one is the one we identified: here's an undefined name called BAZ. 0:23:15.700,0:23:17.760 You probably should be doing something about that. 0:23:17.760,0:23:22.720 And the other one is like oh, you're redefining 0:23:23.060,0:23:27.240 the FOO variable name in that line. 0:23:27.540,0:23:31.400 So here we have a FOO function and then we are kind of 0:23:31.400,0:23:34.620 shadowing that function by using a loop variable here. 0:23:34.760,0:23:38.460 So now that FOO function that we defined is not accessible anymore 0:23:38.470,0:23:41.650 and then if we try to call it afterwards, we will get into errors. 0:23:43.520,0:23:45.520 There are other types of 0:23:46.250,0:23:53.170 static analysis tools. MYPY is a different one. MYPY is gonna report the same two errors, but it's also 0:23:53.840,0:24:00.160 going to complain about type checking. So it's gonna say, oh here you're multiplying an int by a float and 0:24:00.680,0:24:06.320 if you care about the type checking of your code, you should not be mixing those up. 0:24:07.490,0:24:12.219 It can be kind of inconvenient, having to run this, look at the line, going back to your 0:24:12.800,0:24:17.409 VIM or like your editor, and figuring out what the error matches to. 0:24:18.380,0:24:21.190 There are already solutions for that. One 0:24:22.340,0:24:27.069 way is that you can integrate most editors with these tools and here... 0:24:28.279,0:24:34.059 You can see there is like some red highlighting on the baz, and it will read the last line here. 0:24:34.059,0:24:36.059 So, undefined name 'baz'. 0:24:36.160,0:24:39.080 So as I'm editing this piece of Python code, 0:24:39.080,0:24:43.360 my editor is gonna give me feedback about what's going wrong with this.
0:24:43.560,0:24:48.480 Or like here have another one saying the redefinition of unused foo. 0:24:49.849,0:24:51.849 And 0:24:53.080,0:24:56.060 even, there are some stylistic complaints. 0:24:56.060,0:24:58.060 So, oh, I will expect two empty lines. 0:24:58.120,0:25:03.660 So like in Python, you should be having two empty lines between a function definition. 0:25:05.779,0:25:07.009 There are... 0:25:07.009,0:25:09.280 there is a resource on the lecture notes 0:25:09.280,0:25:13.160 about pretty much static analyzers for a lot of different programming languages. 0:25:13.700,0:25:18.460 There are even static analyzers for English. 0:25:18.840,0:25:24.260 So I have my notes 0:25:24.580,0:25:30.280 for the class here, and if I run it through this static analyzer for English, that is "writegood". 0:25:30.409,0:25:33.008 It's going to complain about some stylistic properties. 0:25:33.009,0:25:33.489 So like, oh, 0:25:33.489,0:25:37.460 I'm using "very", which is a weasel word and I shouldn't be using it. 0:25:37.480,0:25:43.080 Or "quickly" can weaken meaning, and you can have this for spell checking, or for a lot of different 0:25:43.600,0:25:48.000 types of stylistic analysis. 0:25:48.760,0:25:52.020 Any questions so far? 0:25:57.500,0:25:59.490 Oh, 0:25:59.490,0:26:01.490 I forgot to mention... 0:26:01.640,0:26:07.320 Depending on the task that you're performing, there will be different types of debuggers. 0:26:07.320,0:26:09.740 For example, if you're doing web development, 0:26:09.860,0:26:13.520 both Firefox and Chrome 0:26:13.740,0:26:20.600 have a really really good set of tools for doing debugging for websites. 0:26:20.600,0:26:23.880 So here we go and say inspect element, 0:26:23.880,0:26:25.880 we can get the... do you know? how to make this larger... 0:26:27.660,0:26:29.220 We're getting 0:26:29.220,0:26:33.380 the entire source code for the web page for the class. 0:26:35.549,0:26:37.549 Oh, yeah, here we go. 0:26:38.640,0:26:40.640 Is that better? 
0:26:40.799,0:26:47.149 And we can actually go and change properties about the course. So we can say... we can edit the title. 0:26:47.400,0:26:51.280 Say, this is not a class on debugging and profiling. 0:26:51.620,0:26:53.940 And now the code for the website has changed. 0:26:54.120,0:26:56.000 This is one of the reasons why you should never trust 0:26:56.200,0:27:00.560 any screenshots of websites, because they can be completely modified. 0:27:01.320,0:27:05.030 And you can also modify this style. Like, here I have things 0:27:06.120,0:27:07.559 using the 0:27:07.560,0:27:09.500 the dark mode preference, 0:27:09.680,0:27:11.900 but we can alter that. 0:27:11.900,0:27:16.560 Because at the end of the day, the browser is rendering this for us. 0:27:17.840,0:27:21.780 We can check the cookies, but there's like a lot of different operations. 0:27:21.799,0:27:27.619 There's also a built-in debugger for JavaScript, so you can step through JavaScript code. 0:27:27.620,0:27:34.020 So kind of the takeaway is, depending on what you are doing, you will probably want to search for what tools 0:27:34.320,0:27:36.820 programmers have built for them. 0:27:44.880,0:27:47.630 Now I'm gonna switch gears and 0:27:48.200,0:27:51.800 stop talking about debugging, which is kind of finding issues with the code, right? 0:27:51.800,0:27:54.200 kind of more about the behavior, and then start talking 0:27:54.200,0:27:56.860 about like how you can use profiling. 0:27:56.860,0:27:59.240 And profiling is how to optimize the code. 0:28:01.100,0:28:05.940 It might be because you want to optimize the CPU, the memory, the network, ... 0:28:06.330,0:28:09.889 There are many different reasons that you want to be optimizing it. 0:28:10.440,0:28:14.000 As it was the case with debugging, the kind of first-order approach 0:28:14.000,0:28:16.680 that a lot of people have experience with already is 0:28:16.880,0:28:21.880 oh, let's use just printf profiling, so to say, like we can just take... 
0:28:22.770,0:28:25.610 Let me make this larger. We can 0:28:26.130,0:28:28.110 take the current time here, 0:28:28.110,0:28:34.610 then we can check, we can do some execution and then we can take the time again and 0:28:35.060,0:28:37.320 subtract it from the original time. 0:28:37.320,0:28:39.320 And by doing this you can kind of narrow down 0:28:39.540,0:28:46.040 and fence some different parts of your code and try to figure out what is the time taken between those two parts. 0:28:47.040,0:28:52.639 And that's good. But sometimes it can be interesting, the results. So here, we're sleeping for 0:28:53.730,0:28:59.809 0.5 seconds and the output is saying, oh it's 0.5 plus some extra time, 0:28:59.810,0:29:05.929 which is kind of interesting. And if we keep running it, we see there's like some small error and the thing is 0:29:06.240,0:29:11.680 here, what we're actually measuring is what is usually referred to as the "real time". 0:29:12.060,0:29:14.340 Real time is as if you get 0:29:14.340,0:29:15.930 like a 0:29:15.930,0:29:19.249 clock, and you start it when your program starts, and you stop it when your program ends. 0:29:19.500,0:29:23.060 But the thing is, in your computer it is not only your program that is running. 0:29:23.060,0:29:27.460 There are many other programs running at the same time and those might 0:29:27.760,0:29:34.640 be the ones that are taking the CPU. So, to try to make sense of that, 0:29:35.790,0:29:39.259 A lot of... you'll see a lot of programs 0:29:40.620,0:29:43.250 using the terminology that is 0:29:44.100,0:29:46.760 real time, user time and system time. 0:29:46.760,0:29:51.460 Real time is what I explained, which is kind of the entire length of time from start to finish. 0:29:51.840,0:29:59.780 Then there is the user time, which is the amount of time your program spent on the CPU doing user level cycles. 0:29:59.780,0:30:06.100 So as I was mentioning, in UNIX, you can be running user level code or kernel level code. 
0:30:06.920,0:30:12.940 System is kind of the opposite, it's the amount of CPU, like the amount of time that your program spent on the CPU 0:30:13.500,0:30:18.480 executing kernel mode instructions. So let's show this with an example. 0:30:18.620,0:30:22.180 Here I'm going to use "time", which is a command, 0:30:22.460,0:30:27.840 a shell command that's gonna get these three metrics for the following command, and then I'm just 0:30:28.100,0:30:30.560 grabbing a URL from 0:30:31.160,0:30:36.760 a website that is hosted in Spain. So that's gonna take some extra time to go over there and then go back. 0:30:37.410,0:30:39.499 If we see, here, if we were to just... 0:30:39.780,0:30:43.670 We have two prints, between the beginning and the end of the program. 0:30:43.670,0:30:49.039 We could think that this program is taking like 600 milliseconds to execute, but actually 0:30:49.500,0:30:56.930 most of that time was spent just waiting for the response on the other side of the network and 0:30:57.330,0:31:04.880 we actually only spent 16 milliseconds at the user level and like 9 milliseconds, in total 25 milliseconds, actually 0:31:05.280,0:31:08.149 executing CURL code. Everything else was just waiting. 0:31:12.090,0:31:14.480 Any questions related to timing? 0:31:19.860,0:31:21.860 Ok, so 0:31:21.990,0:31:23.580 timing 0:31:23.580,0:31:29.480 can become tricky, it's also kind of a black box solution. Or if you start adding print statements, 0:31:29.660,0:31:35.860 it's kind of hard to add print statements, with time everywhere. So programmers have figured out better tools. 0:31:36.140,0:31:38.700 These are usually referred to as "profilers". 0:31:39.980,0:31:44.260 One quick note that I'm gonna make, is that 0:31:44.720,0:31:46.720 profilers, like usually when people 0:31:46.800,0:31:48.800 refer to profilers they usually talk about 0:31:49.050,0:31:55.190 CPU profilers because they are the most common, at identifying where like time is being spent on the CPU.
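The gap between real time and CPU time from the timing discussion above can be sketched in Python. This is a minimal sketch: the helper name `measure` is hypothetical, and `time.process_time()` reports user plus system CPU time combined rather than the two separately.

```python
import time


def measure(func):
    """Run `func` and return (wall_clock_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()  # wall-clock ("real") time
    cpu_start = time.process_time()   # user + system CPU time of this process
    func()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = measure(lambda: time.sleep(0.5))
    print(f"real: {wall:.3f}s  cpu (user+sys): {cpu:.3f}s")
```

Sleeping, like waiting on the network, takes real time but almost no CPU time, which is why the two numbers can differ so much.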
0:31:56.790,0:31:59.180 Profilers usually come in kind of two flavors: 0:31:59.180,0:32:02.140 there's tracing profilers and sampling profilers. 0:32:02.140,0:32:06.380 And it's kind of good to know the difference because the output might be different. 0:32:07.640,0:32:10.300 Tracing profilers kind of instrument your code. 0:32:10.680,0:32:15.799 So they kind of execute with your code and every time your code enters a function call, 0:32:15.800,0:32:20.479 they kind of take a note of it. It's like, oh we're entering this function call at this moment in time and 0:32:21.860,0:32:24.860 they keep going and, once they finish, they can report 0:32:24.860,0:32:28.300 oh, you spent this much time executing in this function and 0:32:28.580,0:32:33.760 this much time in this other function. So on, so forth, which is the example that we're gonna see now. 0:32:34.590,0:32:38.329 Another type of tool is tracing, sorry, sampling profilers. 0:32:38.430,0:32:44.840 The issue with tracing profilers is they add a lot of overhead. Like you might be running your code and having these kind of 0:32:46.280,0:32:49.400 profiling next to you making all these counts, 0:32:49.400,0:32:54.340 will hinder the performance of your program, so you might get counts that are slightly off. 0:32:55.380,0:32:59.450 A sampling profiler, what it's gonna do is gonna execute your program and every 0:32:59.940,0:33:05.239 100 milliseconds, 10 milliseconds, like some defined period, it's gonna stop your program. It's gonna halt it, 0:33:05.580,0:33:12.379 it's gonna look at the stack trace and say, oh, you're right now in this point in the hierarchy, and 0:33:12.630,0:33:15.530 identify which function is gonna be executing at that point. 0:33:16.260,0:33:19.760 The idea is that as long as you execute this for long enough, 0:33:19.760,0:33:24.290 you're gonna get enough statistics to know where most of the time is being spent. 0:33:25.800,0:33:28.800 So, let's see an example of a tracing profiler.
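As a rough illustration of what a tracing profiler reports, here is a minimal sketch that drives Python's standard `cProfile` and `pstats` modules from code. It is not the program shown on screen: the `grep` and `profile_grep` helpers and the sample lines are hypothetical, and sorting by `tottime` is just one reasonable choice.

```python
import cProfile
import io
import pstats
import re

def grep(pattern, lines):
    """Return the lines that match the regular expression `pattern`."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

def profile_grep(pattern, lines, repeats=1000):
    """Run grep() repeatedly under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()              # start recording every function call
    for _ in range(repeats):
        grep(pattern, lines)
    profiler.disable()             # stop recording

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by time spent inside each function and show the top 5 entries.
    stats.sort_stats("tottime").print_stats(5)
    return out.getvalue()

if __name__ == "__main__":
    sample = ["import os", "x = 1", "def main():", "import re"]
    print(profile_grep(r"^import", sample))
```

The report lists one row per function, including library and built-in functions, with the number of calls and the time attributed to each, which is why the output gets harder to read as programs grow.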
0:33:28.800,0:33:32.340 So here we have a piece of code that is just like a 0:33:33.480,0:33:35.540 really simple re-implementation of grep 0:33:36.330,0:33:38.330 done in Python. 0:33:38.400,0:33:44.030 What we want to check is what is the bottleneck of this program? Like we're just opening a bunch of files, 0:33:44.900,0:33:49.620 trying to match this pattern, and then printing whenever we find a match. 0:33:49.620,0:33:52.340 And maybe it's the regex, maybe it's the print... 0:33:52.460,0:33:53.940 We don't really know. 0:33:53.940,0:33:59.040 So to do this in Python, we have "cProfile". 0:33:59.040,0:34:00.080 And 0:34:00.990,0:34:06.620 here I'm just calling this module and saying I want to sort this by the total amount of time, that 0:34:06.780,0:34:13.429 we're gonna see briefly. I'm calling the program we just saw in the editor. 0:34:13.429,0:34:18.679 I'm gonna execute this a thousand times and then (the grep 0:34:18.960,0:34:21.770 arguments here) I want to match this regex 0:34:22.919,0:34:27.469 against all the Python files in here. And this is gonna output some... 0:34:30.780,0:34:34.369 This is gonna produce some output, then we're gonna look at it. First 0:34:34.369,0:34:38.539 is all the output from the greps, but at the very end, we're getting 0:34:39.119,0:34:42.979 output from the profiler itself. If we go up 0:34:44.129,0:34:46.939 we can see that, hey, 0:34:47.730,0:34:55.250 by sorting we can see the total number of calls. So we did 8000 calls, because we executed this 1000 times and 0:34:57.360,0:35:03.440 this is the total amount of time we spent in this function (cumulative time). And here we can start to identify 0:35:03.920,0:35:06.040 where the bottleneck is. 0:35:06.050,0:35:11.449 So here, this built-in method IO open, is saying that we're spending a lot of the time just waiting for 0:35:12.080,0:35:14.340 reading from the disk or...
0:35:14.340,0:35:15.680 There, we can check, hey, 0:35:15.680,0:35:19.840 a lot of time is also being spent trying to match the regex. 0:35:19.840,0:35:22.640 Which is something that you would expect. 0:35:22.640,0:35:26.220 One of the caveats of using this 0:35:26.480,0:35:29.540 tracing profiler is that, as you can see, here 0:35:29.540,0:35:35.239 we're seeing our function but we're also seeing a lot of functions that correspond to built-ins. 0:35:35.240,0:35:35.910 So like, 0:35:35.910,0:35:41.899 functions that are third party functions from the libraries. And as you start building more and more complex code, 0:35:41.900,0:35:43.560 this is gonna be much harder. 0:35:44.200,0:35:44.760 So 0:35:46.080,0:35:49.720 here is another piece of Python code that, 0:35:51.540,0:35:53.779 don't read through it, what it's doing is just 0:35:54.420,0:35:57.589 grabbing the course website and then it's printing all the... 0:35:58.440,0:36:01.960 It's parsing it, and then it's printing all the hyperlinks that it has found. 0:36:01.960,0:36:03.520 So there are like these two operations: 0:36:03.520,0:36:07.800 going there, grabbing a website, and then parsing it, printing the links. 0:36:07.800,0:36:09.740 And we might want to get a sense of 0:36:09.740,0:36:16.180 how those two operations compare to each other. If we just try to execute 0:36:16.680,0:36:18.680 cProfile here and 0:36:19.260,0:36:24.949 we're gonna do the same, this is not gonna print anything. I'm using a tool we haven't seen so far, 0:36:24.950,0:36:25.700 but I think it's pretty nice. 0:36:25.700,0:36:32.810 It's "tac", which is the opposite of "cat", and it is going to reverse the output so I don't have to go up and look. 0:36:33.430,0:36:35.430 So we do this and... 0:36:36.250,0:36:39.179 Hey, we get some interesting output.
0:36:39.880,0:36:46.200 we're spending a bunch of time in this built-in method socket_getaddr_info and like in _imp_create_dynamic and 0:36:46.510,0:36:48.540 method_connect and posix_stat... 0:36:49.210,0:36:55.740 nothing in my code is directly calling these functions so I don't really know what is the split between the operation of 0:36:56.349,0:37:03.929 making a web request and parsing the output of that web request. So, for that, we can use 0:37:04.900,0:37:07.920 a different type of profiler which is 0:37:09.819,0:37:14.309 a line profiler. And the line profiler is just going to present the same results 0:37:14.310,0:37:20.879 but in a more human-readable way, which is just, for this line of code, this is the amount of time things took. 0:37:24.819,0:37:31.079 So it knows it has to do that, we have to add a decorator to the Python function, we do that. 0:37:34.869,0:37:36.869 And as we do that, 0:37:37.119,0:37:39.749 we now get slightly cropped output, 0:37:39.750,0:37:46.169 but the main idea, we can look at the percentage of time and we can see that making this request, get operation, took 0:37:46.450,0:37:52.829 88% of the time, whereas parsing the response took only 10.9% of the time. 0:37:54.069,0:38:00.869 This can be really informative and a lot of different programming languages will support this type of a line profiling. 0:38:04.569,0:38:07.439 Sometimes, you might not care about CPU. 0:38:07.440,0:38:15.000 Maybe you care about the memory or like some other resource. Similarly, there are memory profilers: in Python 0:38:15.000,0:38:21.599 there is "memory_profiler", for C you will have "Valgrind". So here is a fairly simple example, 0:38:21.760,0:38:28.530 we just create this list with a million elements. That's going to consume like megabytes of space and 0:38:29.200,0:38:33.920 we do the same, creating another one with 20 million elements. 0:38:34.860,0:38:38.180 To check, what was the memory allocation? 
0:38:38.980,0:38:44.369 How it's gonna happen, what's the consumption? We can go through one memory profiler and 0:38:44.950,0:38:46.619 we execute it, 0:38:46.620,0:38:51.380 and it's telling us the total memory usage and the increments. 0:38:51.380,0:38:57.980 And we can see that we have some overhead, because this is an interpreted language and when we create 0:38:58.450,0:39:00.599 this million, 0:39:03.520,0:39:07.340 this list with a million entries, we're gonna need this many megabytes of information. 0:39:07.660,0:39:15.299 Then we were getting another 150 megabytes. Then, we're freeing this entry and that's decreasing the total amount. 0:39:15.299,0:39:19.169 We are not getting a negative increment because of a bug, probably in the profiler. 0:39:19.509,0:39:26.549 But if you know that your program is taking a huge amount of memory and you don't know why, maybe because you're copying 0:39:26.920,0:39:30.269 objects where you should be doing things in place, then 0:39:31.140,0:39:33.320 using a memory profiler can be really useful. 0:39:33.320,0:39:37.780 And in fact there's an exercise that will kind of walk you through that, comparing 0:39:37.980,0:39:39.980 an in-place version of quicksort with like a 0:39:40.059,0:39:44.008 non-in-place one, that keeps making new and new copies. And if you use the memory profiler 0:39:44.009,0:39:47.909 you can get a really good comparison between the two of them. 0:39:51.069,0:39:53.459 Any questions so far, with profiling? 0:39:53.460,0:39:57.940 Is the memory profiler running the program in order to get that? 0:39:58.140,0:40:03.180 Yeah... you might be able to figure out like just looking at the code.
0:40:03.180,0:40:05.759 (For this code, at least.) 0:40:06.009,0:40:10.738 But as you get more and more complex programs, what this is doing is running through the program 0:40:10.739,0:40:16.739 and for every line, at the very beginning, it's looking at the heap and saying 0:40:16.739,0:40:19.319 "What are the objects that I have allocated now?" 0:40:19.319,0:40:22.979 "I have seven megabytes of objects", and then goes to the next line, 0:40:23.190,0:40:27.869 looks again, "Oh now I have 50, so I have now added 43 there". 0:40:28.839,0:40:34.709 Again, you could do this yourself by asking for those operations in your code, every single line. 0:40:34.920,0:40:39.899 But that's not how you should be doing things since people have already written these tools for you to use. 0:40:43.089,0:40:46.078 As it was the case with... 0:40:51.480,0:40:58.220 So as in the case with strace, you can do something similar in profiling. 0:40:58.340,0:41:03.380 You might not care about the specific lines of code that you have, 0:41:03.440,0:41:08.200 but maybe you want to check for outside events. Like, you maybe want to check how many 0:41:09.410,0:41:14.469 CPU cycles your computer program is using, or how many page faults it's creating. 0:41:14.469,0:41:19.239 Maybe you have like bad cache locality and that's being manifested somehow. 0:41:19.340,0:41:22.960 So for that, there is the "perf" command. 0:41:22.960,0:41:27.220 The perf command is gonna do this, where it is gonna run your program and it's gonna 0:41:28.720,0:41:33.360 keep track of all these statistics and report them back to you. And this can be really helpful if you are 0:41:33.680,0:41:36.060 working at a lower level. So 0:41:37.300,0:41:42.840 we execute this command, I'm gonna explain briefly what it's doing.
0:41:48.650,0:41:51.639 And this stress program is just 0:41:52.219,0:41:54.698 running in the CPU, and it's just a program to just 0:41:54.829,0:41:59.528 hog one CPU and like test that you can hog the CPU. And now if we Ctrl-C, 0:42:00.619,0:42:02.708 we can go back and 0:42:03.410,0:42:08.559 we get some information about the number of page faults that we have or the number of 0:42:09.769,0:42:11.769 CPU cycles that we utilize, and other 0:42:12.469,0:42:14.329 useful 0:42:14.329,0:42:18.968 metrics from our code. For some programs you can 0:42:21.469,0:42:25.089 look at what the functions that were being used were. 0:42:26.120,0:42:30.140 So we can record what this program is doing, 0:42:30.940,0:42:34.920 which we don't know about because it's a program someone else has written. 0:42:35.240,0:42:37.240 And 0:42:38.180,0:42:42.279 we can report what it was doing by looking at the stack trace and we can say oh, 0:42:42.279,0:42:44.279 It's spending a bunch of time in this 0:42:44.660,0:42:46.640 __random_r 0:42:46.640,0:42:53.229 standard library function. And it's mainly because the way of hogging a CPU is by just creating more and more pseudo-random numbers. 0:42:53.779,0:42:55.779 There are some other 0:42:55.819,0:42:58.149 functions that have not been mapped, because they 0:42:58.369,0:43:01.448 belong to the program, but if you know about your program 0:43:01.448,0:43:05.140 you can display this information using more flags, about perf. 0:43:05.140,0:43:10.220 There are really good tutorials online about how to use this tool. 
0:43:12.010,0:43:14.010 Oh 0:43:14.119,0:43:17.349 One more thing regarding profilers is, so far, 0:43:17.350,0:43:20.109 we have seen that these profilers are really good at 0:43:20.510,0:43:25.419 aggregating all this information and giving you a lot of these numbers so you can 0:43:25.790,0:43:29.739 optimize your code or you can reason about what is happening, but 0:43:30.560,0:43:31.550 the thing is 0:43:31.550,0:43:35.949 humans are not really good at making sense of lots of numbers and since 0:43:36.080,0:43:39.249 humans are more visual creatures, it's much 0:43:39.920,0:43:42.980 easier to kind of have some sort of visualization. 0:43:42.980,0:43:48.700 Again, programmers have already thought about this and have come up with solutions. 0:43:49.480,0:43:56.160 One of the popular ones is a FlameGraph. A FlameGraph is a 0:43:56.780,0:44:00.160 sampling profiler. So this is just running your code and taking samples. 0:44:00.160,0:44:03.280 And then on the y-axis here 0:44:03.280,0:44:10.980 we have the depth of the stack so we know that the bash function called this other function, and this called this other function, 0:44:11.260,0:44:14.480 so on, so forth. And on the x-axis it's 0:44:14.630,0:44:17.500 not time, it's not the timestamps. 0:44:17.500,0:44:23.290 Like it's not that this function ran before, but it's just time taken. Because, again, this is a sampling profiler: 0:44:23.290,0:44:28.540 we're just getting small glimpses of what was going on in the program. But we know that, for example, 0:44:29.119,0:44:32.949 this main program took the most time because the 0:44:33.530,0:44:35.530 x-axis is proportional to that. 0:44:36.020,0:44:43.090 They are interactive and they can be really useful to identify the hot spots in your program. 0:44:44.720,0:44:50.540 Another way of displaying information, and there is also an exercise on how to do this, is using a call graph.
0:44:50.720,0:44:58.320 So a call graph is going to be displaying information, and it's gonna create a graph of which function called which other function. 0:44:58.620,0:45:00.940 And then you get information about, like, 0:45:00.940,0:45:05.770 oh, we know that "__main__" called this "Person" function ten times and 0:45:06.050,0:45:08.919 it took this much time. And as you have 0:45:09.080,0:45:13.029 larger and larger programs, looking at one of these call graphs can be useful to identify 0:45:14.270,0:45:19.689 what piece of your code is calling this really expensive IO operation, for example. 0:45:24.560,0:45:30.360 With that I'm gonna cover the last part of the lecture, which is that 0:45:30.360,0:45:36.600 sometimes, you might not even know what exact resource is constrained in your program. 0:45:36.619,0:45:39.019 Like how do I know how much CPU 0:45:39.380,0:45:44.060 my program is using, and I can quickly look in there, or how much memory. 0:45:44.060,0:45:46.680 So there are a bunch of really 0:45:46.700,0:45:49.760 nifty tools for doing that one of them is 0:45:50.400,0:45:53.270 HTOP. so HTOP is an 0:45:54.000,0:45:59.810 interactive command-line tool and here it's displaying all the CPUs this machine has, 0:46:00.160,0:46:07.740 which is 12. It's displaying the amount of memory, it says I'm consuming almost a gigabyte of the 32 gigabytes my machine has. 0:46:07.740,0:46:11.660 And then I'm getting all the different processes. 0:46:11.730,0:46:13.290 So for example we have 0:46:13.290,0:46:20.300 zsh, mysql and other processes that are running in this machine, and I can sort through the amount of CPU 0:46:20.300,0:46:24.379 they're consuming or through the priority they're running at. 0:46:25.980,0:46:28.129 We can check this, for example. Here 0:46:28.130,0:46:30.230 we have the stress command again 0:46:30.230,0:46:31.470 and we're going to 0:46:31.470,0:46:37.040 run it to take over four CPUs and check that we can see that in HTOP. 
0:46:37.040,0:46:42.880 So we did spot those four CPU jobs, and now I have seen that 0:46:43.710,0:46:46.429 besides the ones we had before, now I have this... 0:46:50.310,0:46:56.119 Like this "stress -c" command running and taking a bunch of our CPU. 0:46:56.849,0:47:03.169 Even though you could use a profiler to get similar information to this, the way HTOP displays this kind of in a live interactive 0:47:03.329,0:47:07.099 fashion can be much quicker and much easier to parse. 0:47:07.890,0:47:09.890 In the notes, there's a 0:47:10.160,0:47:15.180 really long list of different tools for evaluating different parts of your system. 0:47:15.180,0:47:17.180 So that might be tools for analyzing the 0:47:17.180,0:47:19.720 network performance, for looking at the 0:47:20.430,0:47:24.530 number of IO operations, so you know whether you're saturating 0:47:26.040,0:47:28.040 the reads from your disks, 0:47:28.829,0:47:31.429 you can also look at what is the space usage. 0:47:32.069,0:47:34.369 Which, I think, here... 0:47:38.690,0:47:44.829 So NCDU... There's a tool called "du" which stands for "disk usage" and 0:47:45.440,0:47:49.480 we have the "-h" flag for "human readable output". 0:47:51.740,0:47:58.959 We can do videos and we can get output about the size of all the files in this folder. 0:48:08.059,0:48:10.059 Yeah, there we go. 0:48:10.400,0:48:15.040 There are also interactive versions, like HTOP was an interactive version. 0:48:15.280,0:48:21.200 So NCDU is an interactive version that will let me navigate through the folders and I can see quickly that 0:48:21.200,0:48:25.740 oh, we have... This is one of the folders for the video lectures, 0:48:26.329,0:48:29.049 and we can see there are these four files 0:48:29.690,0:48:36.579 that have like almost 9 GB each and I could quickly delete them through this interface. 0:48:37.760,0:48:43.839 Another neat tool is "LSOF" which stands for "LIST OPEN FILES".
0:48:44.240,0:48:47.500 Another pattern that you may encounter is you know 0:48:47.780,0:48:54.609 some process is using a file, but you don't know exactly which process is using that file. Or, similarly, some process is listening in 0:48:55.400,0:48:59.020 a port, but again, how do you find out which one it is? 0:48:59.020,0:49:00.820 So to set an example. 0:49:00.820,0:49:04.280 We just run a Python HTTP server on port 0:49:05.210,0:49:06.559 444 0:49:06.559,0:49:10.899 Running there. Maybe we don't know that that's running, but then we can 0:49:13.130,0:49:15.130 use... 0:49:17.089,0:49:19.089 we can use LSOF. 0:49:22.660,0:49:29.200 Yeah, we can use LSOF, and the thing is LSOF is gonna print a lot of information. 0:49:30.440,0:49:32.740 You need SUDO permissions because 0:49:34.069,0:49:39.219 this is gonna ask for who has all these items. 0:49:39.829,0:49:43.929 Since we only care about the one who is listening in this 444 port 0:49:44.630,0:49:46.369 we can ask 0:49:46.369,0:49:47.960 grep for that. 0:49:47.960,0:49:55.750 And we can see, oh, there's like this Python process, with this identifier, that is using the port and then we can 0:49:56.660,0:49:58.009 kill it, 0:49:58.009,0:50:00.969 and that terminates that process. 0:50:02.299,0:50:06.669 Again, there's a lot of different tools. There's even tools for 0:50:08.450,0:50:10.569 doing what is called benchmarking. 0:50:11.660,0:50:18.789 So in the shell tools and scripting lecture, I said like for some tasks "fd" is much faster than "find" 0:50:18.950,0:50:21.519 But like how will you check that? 0:50:22.059,0:50:30.038 I can test that with "hyperfine" and I have here two commands: one with "fd" that is just 0:50:30.500,0:50:34.029 searching for JPEG files and the same one with "find". 0:50:34.579,0:50:41.079 If I execute them, it's gonna benchmark these scripts and give me some output about 0:50:41.869,0:50:44.108 how much faster "fd" is 0:50:45.380,0:50:47.380 compared to "find". 
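The same idea of running something many times and comparing timing statistics can be tried inside Python with the standard `timeit` module. This is only a sketch of the benchmarking approach, not how hyperfine itself works: the `benchmark` helper and the two functions being compared are hypothetical examples, and the run counts are arbitrary.

```python
import statistics
import timeit

def benchmark(fn, runs=5, number=1000):
    """Time `number` calls of fn, repeated `runs` times.

    Returns (mean, stdev) of the per-run totals, in seconds.
    """
    samples = timeit.repeat(fn, repeat=runs, number=number)
    return statistics.mean(samples), statistics.stdev(samples)

def join_concat():
    # Build a string with str.join.
    return "".join(str(i) for i in range(100))

def plus_concat():
    # Build the same string with repeated += concatenation.
    s = ""
    for i in range(100):
        s += str(i)
    return s

if __name__ == "__main__":
    for fn in (join_concat, plus_concat):
        mean, stdev = benchmark(fn)
        print(f"{fn.__name__}: {mean:.6f}s ± {stdev:.6f}s per 1000 calls")
```

Repeating the measurement and reporting the spread, rather than timing a single run, is what makes the comparison trustworthy when other programs are also competing for the CPU.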
0:50:47.660,0:50:52.269 So I think that kind of concludes... yeah, like 23 times for this task. 0:50:52.940,0:50:55.990 So that kind of concludes the whole overview. 0:50:56.539,0:51:00.309 I know that there's like a lot of different topics and there's like a lot of 0:51:00.650,0:51:04.539 perspectives on doing these things, but again I want to reinforce the idea 0:51:04.539,0:51:08.499 that you don't need to be a master of all these topics but more... 0:51:08.750,0:51:11.229 To be aware that all these things exist. 0:51:11.230,0:51:17.559 So if you run into these issues you don't reinvent the wheel, and you reuse all that other programmers have done. 0:51:18.280,0:51:23.700 Given that, I'm happy to take any questions related to this last section or anything in the lecture. 0:51:25.900,0:51:30.060 Is there any way to sort of think about how long a program should take? 0:51:30.060,0:51:33.160 You know, if it's taking a while to run 0:51:33.160,0:51:42.840 you know, should you be worried? Or depending on your process, let me wait another ten minutes before I start looking at why it's taking so long. 0:51:43.220,0:51:45.220 Okay, so the... 0:51:46.070,0:51:49.089 The task of knowing how long a program 0:51:49.090,0:51:53.920 should run is pretty infeasible to figure out. It will depend on the type of program. 0:51:54.290,0:52:01.899 It depends on whether you're making HTTP requests or you're reading data... one thing that you can do is if you have 0:52:02.390,0:52:02.980 for example, 0:52:02.980,0:52:10.689 if you know you have to read two gigabytes from memory, like from disk, and load that into memory, you can make 0:52:11.510,0:52:16.719 back-of-the-envelope calculation. So like that shouldn't take longer than like X seconds because this is 0:52:16.940,0:52:20.050 how things are set up. 
Or if you are 0:52:20.840,0:52:27.460 reading some files from the network and you know kind of what the network link is and they are taking say five times longer than 0:52:27.460,0:52:29.460 what you would expect then you could 0:52:29.990,0:52:31.190 try to do that. 0:52:31.190,0:52:37.839 Otherwise, if you don't really know. Say you're trying to do some mathematical operation in your code and you're not really sure 0:52:37.840,0:52:44.050 about how long that will take you can use something like logging and try to kind of print intermediate 0:52:44.570,0:52:50.469 stages to get a sense of like, oh I need to do a thousand operations of this and 0:52:51.800,0:52:53.600 three iterations 0:52:53.600,0:53:00.700 took ten seconds. Then this is gonna take much longer than I can handle in my case. 0:53:00.920,0:53:04.599 So I think there are ways, it will again like depend on the task, 0:53:04.600,0:53:08.800 but definitely, given all the tools we've seen really, we probably have like 0:53:09.620,0:53:13.150 a couple of really good ways to start tackling that. 0:53:14.750,0:53:16.750 Any other questions? 0:53:16.750,0:53:18.750 You can also do things like 0:53:18.750,0:53:21.060 run HTOP and see if anything is running. 0:53:22.380,0:53:25.500 Like if your CPU is at 0%, something is probably wrong. 0:53:31.140,0:53:32.579 Okay. 0:53:32.579,0:53:38.268 There's a lot of exercises for all the topics that we have covered in today's class, 0:53:38.269,0:53:41.419 so feel free to do the ones that are more interesting. 0:53:42.180,0:53:44.539 We're gonna be holding office hours again today. 0:53:45.059,0:53:48.979 Just a reminder, office hours. You can come and ask questions about any lecture. 0:53:48.980,0:53:53.510 Like we're not gonna expect you to kind of do the exercises in a couple of minutes.
0:53:53.510,0:53:57.979 They take a really long while to get through them, but we're gonna be there 0:53:58.529,0:54:04.339 to answer any questions from previous classes, or even not related to exercises. Like if you want to know more about how you 0:54:04.619,0:54:09.889 would use TMUX in a way to kind of quickly switch between panes, anything that comes to your mind. ================================================ FILE: static/files/subtitles/2020/qa.sbv ================================================ 0:00:00.000,0:00:06.540 I guess we should do an intro to this as well, 0:00:06.540,0:00:09.580 so this is just sort of a 0:00:09.581,0:00:14.740 free-form Q&A lecture where you, as in the two people sitting here, but also 0:00:14.740,0:00:19.841 everyone at home who did not come here in person get to ask questions and we 0:00:19.841,0:00:22.961 have a bunch of questions people asked in advance but you can also ask 0:00:22.961,0:00:27.371 additional questions during, for the two of you who are here, you can do it either 0:00:27.371,0:00:33.611 by raising your hand or you can submit it on the forum and be anonymous, it's up to you 0:00:33.611,0:00:35.671 regardless though, what we're gonna do is just go through some of the 0:00:35.681,0:00:40.241 questions that have been asked and try to give as helpful answers as we can 0:00:40.241,0:00:43.691 although they are unprepared on our side and 0:00:43.791,0:00:45.611 yeah that's the plan I guess we go 0:00:45.611,0:00:48.911 from most popular to least popular 0:00:48.911,0:00:49.991 fire away 0:00:49.991,0:00:52.091 all right so for our first question any 0:00:52.091,0:00:55.961 recommendations on learning operating system related topics like processes, 0:00:55.961,0:00:59.861 virtual memory, interrupts, memory management, etc 0:00:59.861,0:01:01.811 so I think this 0:01:01.811,0:01:07.181 is an interesting question because these are really low level concepts that often 0:01:07.181,0:01:11.391 do not matter, unless you
have to deal with this in some capacity, 0:01:11.391,0:01:12.771 right so 0:01:12.891,0:01:17.671 one instance where this matters is you're writing really low level code like 0:01:17.681,0:01:20.500 you're implementing a kernel or something like that, or you want to 0:01:20.500,0:01:22.811 just hack on the Linux kernel. 0:01:22.811,0:01:24.751 It's rare otherwise that you need to work with 0:01:24.751,0:01:27.711 especially like virtual memory and interrupts and stuff yourself 0:01:27.851,0:01:32.071 processes, I think are a more general concept that we've talked a little bit about in 0:01:32.071,0:01:36.611 this class as well and tools like htop, pgrep, kill, and signals and 0:01:36.761,0:01:37.711 that sort of stuff 0:01:37.711,0:01:39.311 in terms of learning it 0:01:39.311,0:01:45.371 maybe one of the best ways, is to try to take either an introductory class on the 0:01:45.371,0:01:51.401 topic, so for example MIT has a class called 6.828, which is where 0:01:51.401,0:01:55.091 you essentially build and develop your own operating system based on some code 0:01:55.091,0:01:58.631 that you're given, and all of those labs are publicly available and all the 0:01:58.631,0:02:01.601 resources for the class are publicly available, and so that is a good way to 0:02:01.601,0:02:04.001 really learn them is by doing them yourself. 0:02:04.001,0:02:05.201 There are also various 0:02:05.201,0:02:11.201 tutorials online that basically guide you through how do you write a kernel 0:02:11.201,0:02:15.431 from scratch. Not necessarily a very elaborate one, not one you would want 0:02:15.431,0:02:20.561 to run any real software on, but just to teach you the basics and so that would 0:02:20.561,0:02:21.930 be another thing to look up. 0:02:21.930,0:02:24.131 Like how do I write a kernel in and then your 0:02:24.131,0:02:27.611 language of choice. 
You will probably not find one that lets you do it in Python 0:02:27.611,0:02:33.612 but in like C, C++, Rust, there are a bunch of topics like this 0:02:33.612,0:02:36.951 one other note on operating systems 0:02:36.951,0:02:39.931 so like Jon mentioned MIT has a 6.828 class but 0:02:39.941,0:02:43.391 if you're looking for a more high-level overview, not necessarily programming 0:02:43.391,0:02:46.001 an operating system, but just learning about the concepts another good resource 0:02:46.001,0:02:51.331 is a book called "Modern Operating Systems" by Andy Tanenbaum 0:02:51.331,0:02:58.371 there's also actually a book called "The FreeBSD Operating System" which is really good, 0:02:58.371,0:03:03.031 It doesn't go through Linux, but it goes through FreeBSD and the BSD kernel is 0:03:03.031,0:03:07.181 arguably better organized than the Linux one and better documented and so it 0:03:07.181,0:03:11.591 might be a gentler introduction to some of those topics than trying to understand Linux 0:03:11.591,0:03:14.951 You want to check it as answered? 0:03:14.951,0:03:16.511 - Yes + Nice 0:03:16.511,0:03:17.451 Answered 0:03:17.451,0:03:19.371 For our next question 0:03:19.371,0:03:23.951 What are some of the tools you'd prioritize learning first? 0:03:23.951,0:03:29.551 - Maybe we can all go through and give our opinion on this? + Yeah 0:03:29.551,0:03:31.713 Tools to prioritize learning first? 0:03:31.713,0:03:36.451 I think learning your editor well, just serves you in all capacities 0:03:36.511,0:03:40.511 like being efficient at editing files, is just like a majority of 0:03:40.511,0:03:45.041 what you're going to spend your time doing. And in general, just using your 0:03:45.041,0:03:49.211 keyboard more and your mouse less.
It means that you get to spend more of your 0:03:49.311,0:03:53.751 time doing useful things and less of your time moving 0:03:53.751,0:03:56.251 I think that would be my top priority, 0:04:04.511,0:04:06.751 so I would say that for what 0:04:06.760,0:04:09.671 tool to prioritize will depend on what exactly you're doing 0:04:09.671,0:04:16.150 I think the core idea is you should try to find the types of tasks that you are 0:04:16.151,0:04:18.371 doing repetitively and so 0:04:18.371,0:04:23.791 if you are doing some sort of like machine learning workload and 0:04:24.011,0:04:27.130 you find yourself using jupyter notebooks, like the one we presented 0:04:27.130,0:04:32.560 yesterday, a lot. Then again, using a mouse for that might not be 0:04:32.560,0:04:35.830 the best idea and you want to familiarize yourself with the keyboard shortcuts 0:04:35.830,0:04:40.750 and pretty much with anything you will end up figuring out that there are some 0:04:40.751,0:04:45.611 repetitive tasks, and you're running a computer, and just trying to figure out 0:04:45.611,0:04:48.311 oh there's probably a better way to do this 0:04:48.431,0:04:50.871 be it a terminal, be it an editor 0:04:51.111,0:04:55.891 And it might be really interesting to learn to use some of the topics that 0:04:55.900,0:05:01.121 we have covered, but if they're not extremely useful on an everyday 0:05:01.121,0:05:05.431 basis then it might not be worth prioritizing them 0:05:06.591,0:05:07.451 out of the topics 0:05:07.531,0:05:11.611 covered in this class in my opinion two of the most useful things are version 0:05:11.621,0:05:15.220 control and text editors, and I think they're a little bit different from each 0:05:15.220,0:05:18.880 other, in the sense that text editors I think are really useful to learn well 0:05:18.880,0:05:21.970 but it was probably the case that before we started using vim and all its fancy 0:05:21.970,0:05:25.390 keyboard shortcuts you had some other text editor you were using before and
0:05:25.390,0:05:29.890 you could edit text just fine maybe a little bit inefficiently whereas I think 0:05:29.890,0:05:33.100 version control is another really useful skill and that's one where if you don't 0:05:33.100,0:05:36.580 really know the tool properly, it can actually lead to some problems like loss 0:05:36.580,0:05:39.490 of data or just inability to collaborate properly with people so I 0:05:39.490,0:05:42.730 think version control is one of the first things that's worth learning well 0:05:42.730,0:05:46.871 yeah, I agree with that, I think learning a tool like Git is just 0:05:46.871,0:05:49.691 gonna save you so much heartache down the line 0:05:49.691,0:05:51.431 it also, to add on to that 0:05:51.571,0:05:57.310 It really helps you collaborate with others and Anish touched a little bit on GitHub 0:05:57.310,0:06:01.300 in the last lecture, and just learning to use that tool well in order 0:06:01.300,0:06:05.321 to work on larger software projects that other people are working on is 0:06:05.321,0:06:06.431 an invaluable skill 0:06:10.071,0:06:11.391 For our next question 0:06:11.391,0:06:12.871 when do I use Python versus a 0:06:12.881,0:06:16.051 bash script, versus some other language 0:06:16.051,0:06:19.661 This is tough, because I think this comes 0:06:19.661,0:06:21.631 down to what Jose was saying earlier too 0:06:21.771,0:06:23.731 that it really depends on what you're trying to do 0:06:23.731,0:06:27.155 For me, I think for bash scripts in particular 0:06:27.155,0:06:28.791 bash scripts are for 0:06:28.891,0:06:33.430 automating running a bunch of commands, you don't want to write any 0:06:33.430,0:06:35.411 other like business logic in bash 0:06:35.411,0:06:39.011 it is just for I want to run these 0:06:39.011,0:06:44.110 commands, in this order. 
Maybe with arguments, but like even that 0:06:44.110,0:06:47.581 it's unclear do you want to bash script once you start taking arguments 0:06:47.581,0:06:52.691 Similarly, once you start doing any kind of like text processing or 0:06:52.691,0:06:55.131 configuration, all that 0:06:55.131,0:06:59.111 reach for a language that is a more serious 0:06:59.111,0:07:01.031 programming language than bash is 0:07:01.091,0:07:03.451 bash is really for sort of short one-off 0:07:03.461,0:07:10.211 scripts or ones that have a very well-defined use case on the terminal in 0:07:10.211,0:07:12.851 the shell, probably 0:07:12.851,0:07:15.941 For a slightly more concrete guideline, you might say write a 0:07:15.941,0:07:19.211 bash script if it's less than a hundred lines of code or so, but once it gets 0:07:19.211,0:07:21.611 beyond that point bash is kind of unwieldy and it's probably worth 0:07:21.611,0:07:25.091 switching to a more serious programming language like Python 0:07:25.091,0:07:26.511 and to add to that 0:07:26.511,0:07:32.211 I would say that I found myself writing sometimes scripts in Python because 0:07:32.211,0:07:36.911 If I have already solved some subproblem that covers part of the problem in Python 0:07:36.911,0:07:40.631 I find it much easier to compose the previous solution that I found out in 0:07:40.631,0:07:45.731 Python and just try to reuse bash code, that I don't find as reusable as Python 0:07:45.731,0:07:49.600 And in the same way it's kind of nice that a lot of people have written something 0:07:49.600,0:07:52.631 like Python libraries or like Ruby libraries to do a lot of these things 0:07:52.631,0:07:58.451 whereas in bash is kind of hard to have like code reuse 0:07:58.451,0:08:01.720 And in fact, 0:08:01.720,0:08:07.631 I think to add to that. 
Usually, if you find a library in some language that 0:08:07.631,0:08:12.091 helps with the task you're trying to do, use that language for the job 0:08:12.091,0:08:15.671 And in bash there are no libraries, there are only the programs on your computer 0:08:15.771,0:08:18.931 So you probably don't want to use it unless like there's a program 0:08:18.941,0:08:23.741 you can just invoke I do think another thing worth remembering about bash 0:08:23.741,0:08:26.451 bash is really hard to get right. 0:08:26.451,0:08:30.531 It's very easy to get it right for the particular use case you're trying to solve right now 0:08:30.531,0:08:32.471 but things like 0:08:32.471,0:08:35.891 What if one of the filenames has a space in it? 0:08:35.891,0:08:38.891 It has caused so many bugs and so 0:08:38.891,0:08:43.151 many problems in bash scripts and if you use a real programming language then 0:08:43.151,0:08:46.642 those problems just go away 0:08:46.651,0:08:50.491 Checked it 0:08:50.571,0:08:51.571 For our next question 0:08:51.571,0:08:56.211 What is the difference between sourcing a script and executing that script ? 0:08:57.071,0:09:02.711 So this actually, we got in office hours a while back as well which is 0:09:02.871,0:09:06.991 Aren't they the same? like aren't they both just running the bash script? 
0:09:06.991,0:09:08.051 and it is true 0:09:08.051,0:09:12.191 both of these will end up executing the lines of code that are in the script 0:09:12.191,0:09:16.571 the ways in which they differ is that sourcing a script is telling your 0:09:16.571,0:09:22.991 current bash script, your current bash session to execute that program 0:09:23.131,0:09:28.911 whereas the other one is, start up a new instance of bash and run the program there instead 0:09:29.291,0:09:34.931 And this matters for things like imagine that "script.sh" tries to change directories 0:09:34.931,0:09:37.841 If you are running the script as in the second invocation 0:09:37.841,0:09:42.761 "./script.sh", then the new process is going to change 0:09:42.761,0:09:46.891 directories but by the time that script exits and returns to your shell 0:09:46.891,0:09:51.831 your shell still remains in the same place. However, if you do CD in a script and you source it 0:09:51.831,0:09:55.241 Your current instance of bash is the one that ends up running it and 0:09:55.241,0:09:57.951 so it ends up CDing where you are 0:09:57.951,0:10:01.171 This is also why if you define functions 0:10:01.171,0:10:04.751 For example, that you may want to execute in your shell session 0:10:04.751,0:10:07.011 You need to source the script, not run it 0:10:07.011,0:10:10.261 Because if you run it, that function will be defined in the 0:10:10.261,0:10:11.931 instance of bash 0:10:11.931,0:10:16.831 In the bash process that gets launched but it will not be defined in your current shell 0:10:16.831,0:10:22.871 I think those are two of the biggest differences between the two 0:10:29.211,0:10:29.711 Next question, 0:10:29.873,0:10:35.131 What are the places where various packages and tools are stored and how does referencing them work? 0:10:35.131,0:10:39.171 What even is /bin or /lib? 
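(Editor's sketch for the sourcing-versus-executing answer above. It is an analogy in Python rather than bash, and the helper names `run_in_child` and `run_in_current` are made up for illustration. The child process plays the role of "./script.sh", and changing directory in the current process plays the role of "source script.sh".)

```python
import os
import subprocess
import sys
import tempfile


def run_in_child(target_dir):
    """Analogous to ./script.sh: a new process changes directory, then exits."""
    code = "import os, sys; os.chdir(sys.argv[1]); print(os.getcwd())"
    result = subprocess.run(
        [sys.executable, "-c", code, target_dir],
        capture_output=True,
        text=True,
        check=True,
    )
    # The child reports where it ended up; the parent's directory is untouched.
    return result.stdout.strip()


def run_in_current(target_dir):
    """Analogous to `source script.sh`: the current process changes directory."""
    os.chdir(target_dir)
    return os.getcwd()


if __name__ == "__main__":
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.realpath(tmp)
        print("child ended up in:         ", run_in_child(target))
        print("parent after child exits:  ", os.getcwd())
        print("parent after in-process cd:", run_in_current(target))
        os.chdir(start)  # leave the temp dir before it is removed
```

The same reasoning applies to functions and variables: anything defined in the child disappears when the child exits, which is why shell functions have to be sourced.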
0:10:39.171,0:10:45.091 So as we covered in the first lecture, there is this PATH environment variable 0:10:45.091,0:10:49.551 which is a colon-separated string of all the places 0:10:49.551,0:10:55.111 where your shell is gonna look for binaries and if you just do something 0:10:55.111,0:10:58.171 like "echo $PATH", you're gonna get this list 0:10:58.171,0:11:02.251 and all these places are gonna be consulted in order. 0:11:02.251,0:11:03.601 It's gonna go through all of them and in fact 0:11:03.601,0:11:07.011 - There is already... Did we cover which? + Yeah 0:11:07.211,0:11:10.011 So if you run "which" and a specific command 0:11:10.021,0:11:14.071 the shell is actually gonna tell you where it's finding this 0:11:14.071,0:11:15.391 Beyond that, 0:11:15.391,0:11:20.431 there are some conventions where a lot of programs will install their binaries 0:11:20.431,0:11:24.071 in places like /usr/bin (or at least they will include symlinks) 0:11:24.071,0:11:26.051 in /usr/bin so you can find them 0:11:26.191,0:11:28.211 There's also a /usr/local/bin 0:11:28.211,0:11:33.951 There are special directories.
For example, /usr/sbin it's only for sudo user and 0:11:33.951,0:11:38.491 some of these conventions are slightly different between different distros so 0:11:38.491,0:11:47.571 I know like some distros for example install the user libraries under /opt for example 0:11:51.191,0:11:55.491 Yeah I think one thing just to talk a little bit of more 0:11:55.651,0:12:00.631 about /bin and then Anish maybe you can do the other folders so when it comes to 0:12:00.631,0:12:02.791 /bin the convention 0:12:02.791,0:12:10.051 There are conventions, and the conventions are usually /bin are for essential system utilities 0:12:10.051,0:12:12.531 /usr/bin are for user programs and 0:12:12.531,0:12:17.431 /usr/local/bin are for user compiled programs, sort of 0:12:17.431,0:12:21.691 so things that you installed that you intend the user to run, are in /usr/bin 0:12:21.691,0:12:26.711 things that a user has compiled themselves and stuck on your system, probably goes in /usr/local/bin 0:12:26.711,0:12:29.991 but again, this varies a lot from machine to machine, and distro to distro 0:12:29.991,0:12:33.971 On Arch Linux, for example, /bin is a symlink to /usr/bin 0:12:33.971,0:12:40.261 They're the same and as Jose mentioned, there's also /sbin which is for programs that are 0:12:40.261,0:12:43.801 intended to only be run as root, that also varies from distro to distro 0:12:43.801,0:12:47.251 whether you even have that directory, and on many systems like /usr/local/bin 0:12:47.251,0:12:51.151 might not even be in your PATH, or might not even exist on your system 0:12:51.151,0:12:55.831 On BSD on the other hand /usr/local/bin is often used a lot more heavily 0:12:56.731,0:12:57.231 yeah so 0:12:57.231,0:13:01.111 What we were talking about so far, these are all ways that files and folders are 0:13:01.111,0:13:05.071 organized on Linux things or Linux or BSD things vary a little bit between 0:13:05.071,0:13:07.151 that and macOS or other platforms 0:13:07.151,0:13:09.301 I think for the 
specific locations, 0:13:09.301,0:13:11.471 if you want to know exactly what it's used for, you can look it up 0:13:11.471,0:13:17.291 But some general patterns to keep in mind are: anything with /bin in it has binary executable programs in it, 0:13:17.291,0:13:19.891 anything with /lib in it has libraries in it, so things that 0:13:19.891,0:13:25.081 programs can link against, and then some other things that are useful to know are 0:13:25.081,0:13:29.431 there's a /etc on many systems, which has configuration files in it and 0:13:29.431,0:13:34.311 then there's /home, which underneath that directory contains each user's home directory 0:13:34.311,0:13:38.521 so like on a Linux box my username, if it's anish, will 0:13:38.651,0:13:41.351 correspond to a home directory /home/anish 0:13:42.071,0:13:43.351 Yeah I guess there are 0:13:43.351,0:13:47.671 a couple of others like /tmp is usually a temporary directory that gets 0:13:47.671,0:13:51.351 erased when you reboot, not always but sometimes, you should check on your system 0:13:51.731,0:13:59.211 There's a /var which often holds like files that change over time so 0:13:59.211,0:14:06.151 these are usually going to be things like lock files for package managers 0:14:06.151,0:14:12.431 they're gonna be things like log files, files to keep track of process IDs 0:14:12.431,0:14:16.471 then there's /dev which shows devices so 0:14:16.471,0:14:20.551 usually these are special files that correspond to devices on your system, we 0:14:20.551,0:14:27.391 talked about /sys, Anish mentioned /etc 0:14:29.051,0:14:36.031 /opt is a common one for just like third-party software, basically it's usually for 0:14:36.031,0:14:40.951 companies that ported their software to Linux but they don't actually understand what 0:14:40.951,0:14:45.391 running software on Linux is like, and so they just have a directory with all 0:14:45.391,0:14:51.411 their stuff in it and when those get installed they usually get installed into /opt
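(Editor's sketch of the PATH lookup described earlier in this answer: split the variable on its separator and return the first matching executable, in order. `find_in_path` is a hypothetical helper written for illustration, and it skips empty PATH entries for simplicity, which real shells treat as the current directory. In practice Python's standard `shutil.which` does this job.)

```python
import os
import sys


def find_in_path(command, path=None):
    """Return the first executable called `command` found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # os.pathsep is ":" on Linux, BSD and macOS.
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # simplification: ignore empty entries
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate  # first match wins, later directories are ignored
    return None


if __name__ == "__main__":
    # Look up the running interpreter's own name, roughly like `which python3`.
    name = os.path.basename(sys.executable)
    print(name, "->", find_in_path(name))
    print("no-such-tool ->", find_in_path("no-such-tool"))
```

Because the first match wins, the order of directories in PATH decides which copy of a program runs when several are installed.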
0:14:51.411,0:14:55.651 I think those are the ones off the top of my head 0:14:55.651,0:14:57.771 yeah 0:14:57.771,0:15:02.271 And we will list these in our lecture notes which will produce after this lecture 0:15:02.271,0:15:04.431 Next question 0:15:04.431,0:15:07.080 Should I apt-get install a Python whatever 0:15:07.080,0:15:10.691 package or pip install that package 0:15:10.691,0:15:13.890 so this is a good question that I think at 0:15:13.890,0:15:17.310 a higher level this question is asking should I use my systems package manager 0:15:17.310,0:15:20.850 to install things or should I use some other package manager. Like in this case 0:15:20.850,0:15:25.021 one that's more specific to a particular language. And the answer here is also 0:15:25.021,0:15:28.590 kind of it depends, sometimes it's nice to manage things using a system package 0:15:28.590,0:15:31.950 manager so everything can be installed and upgraded in a single place but 0:15:31.950,0:15:35.160 I think oftentimes whatever is available in the system repositories the things 0:15:35.160,0:15:37.800 you can get via a tool like apt-get or something similar 0:15:37.800,0:15:41.040 might be slightly out of date compared to the more language specific repository 0:15:41.040,0:15:45.060 so for example a lot of the Python packages I use I really want the most 0:15:45.060,0:15:47.771 up-to-date version and so I use pip to install them 0:15:48.551,0:15:51.091 Then, to extend on that is 0:15:51.091,0:15:57.751 sometimes the case the system packages might require some other 0:15:57.751,0:16:02.461 dependencies that you might not have realized about, and it's also might be 0:16:02.461,0:16:07.201 the case or like for some systems, at least for like alpine Linux they 0:16:07.201,0:16:11.221 don't have wheels for like a lot of the Python packages so it will just take 0:16:11.221,0:16:15.331 longer to compile them, it will take more space because they have to compile them 0:16:15.331,0:16:20.761 from scratch. 
Whereas if you just go to pip, pip has binaries for a lot of 0:16:20.761,0:16:23.471 different platforms and that will probably work 0:16:23.471,0:16:29.191 You also should be aware that pip might not do the exact same thing in different computers 0:16:29.191,0:16:33.601 So, for example, if you are in a kind of laptop or like a desktop that is running like 0:16:33.601,0:16:38.971 a x86 or x86_64 you probably have binaries, but if you're running something 0:16:38.971,0:16:43.471 like Raspberry Pi or some other kind of embedded device. These are running on a 0:16:43.471,0:16:47.611 different kind of hardware architecture and you might not have binaries 0:16:47.611,0:16:51.841 I think that's also good to take into account, in that case in might be worthwhile to 0:16:51.841,0:16:58.551 use the system packages just because they will take much shorter to get them 0:16:58.551,0:17:01.691 than to just to compile from scratch the entire Python installation 0:17:01.691,0:17:06.741 Apart from that, I don't think I can think of any exceptions where I would actually use the system packages 0:17:06.741,0:17:09.251 instead of the Python provided ones 0:17:19.011,0:17:20.851 So, one other thing to keep in mind is that 0:17:20.861,0:17:26.180 sometimes you will have more than one program on your computer and you might 0:17:26.180,0:17:29.961 be developing more than one program on your computer and for some reason not 0:17:29.961,0:17:33.861 all programs are always built with the latest version of things, sometimes they 0:17:33.861,0:17:39.351 are a little bit behind, and when you install something system-wide you can 0:17:39.351,0:17:44.691 only... 
depends on your exact system, but often you just have one version 0:17:44.691,0:17:49.711 what pip lets you do, especially combined with something like python's virtualenv, 0:17:49.711,0:17:54.531 and similar concepts exist for other languages, where you can sort of say 0:17:54.531,0:17:59.660 I want to (NPM does the same thing as well with its node modules, for example) where 0:17:59.660,0:18:05.991 I'm gonna compile the dependencies of this package in sort of a subdirectory 0:18:05.991,0:18:10.431 of its own, and all of the versions that it requires are going to be built in there 0:18:10.431,0:18:13.910 and you can do this separately for separate projects so there they have 0:18:13.910,0:18:16.910 different dependencies or the same dependencies with different versions 0:18:16.910,0:18:20.930 they still sort of kept separate. And that is one thing that's hard to achieve 0:18:20.931,0:18:22.651 with system packages 0:18:27.131,0:18:27.851 Next question 0:18:27.911,0:18:32.771 What's the easiest and best profiling tools to use to improve performance of my code? 0:18:34.351,0:18:39.231 This is a topic we could talk about for a very long time 0:18:39.231,0:18:42.881 The easiest and best is to print stuff using time 0:18:42.881,0:18:48.431 Like, I'm not joking, very often the easiest thing is in your code 0:18:48.971,0:18:53.751 At the top you figure out what the current time is, and then you do sort of 0:18:53.751,0:18:57.920 a binary search over your program of add a print statement that prints how much 0:18:57.920,0:19:02.511 time has elapsed since the start of your program and then you do that until you 0:19:02.511,0:19:06.320 find the segment of code that took the longest. And then you go into that 0:19:06.320,0:19:09.531 function and then you do the same thing again and you keep doing this until you 0:19:09.531,0:19:14.031 find roughly where the time was spent. 
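(Editor's sketch of the print-based timing approach just described. `checkpoint`, `load_data` and `process` are hypothetical names standing in for your own code. The idea is to record the start time once and print elapsed time at a few points, then move the prints inward toward the slow part.)

```python
import time


def checkpoint(label, start):
    """Print and return the seconds elapsed since `start`."""
    elapsed = time.perf_counter() - start
    print(f"[{elapsed:8.3f}s] {label}")
    return elapsed


def load_data():
    """Stand-in for a slow step, for example reading a large file."""
    time.sleep(0.05)
    return list(range(100_000))


def process(data):
    """Stand-in for the computation being investigated."""
    return sum(x * x for x in data)


if __name__ == "__main__":
    start = time.perf_counter()
    data = load_data()
    checkpoint("after load_data", start)
    total = process(data)
    checkpoint("after process", start)
```

Whichever gap between two checkpoints is largest is where to add the next round of prints.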
It's not foolproof, but it is really easy 0:19:14.031,0:19:16.721 and it gives you good information quickly 0:19:16.721,0:19:25.361 if you do need more advanced information Valgrind has a tool called cache-grind? 0:19:25.361,0:19:29.431 call grind? Cache grind? One of the two. 0:19:29.431,0:19:33.310 and this tool lets you run your program and 0:19:33.310,0:19:38.741 measure how long everything takes and all of the call stacks, like which 0:19:38.741,0:19:42.521 function called which function, and what you end up with is a really neat 0:19:42.521,0:19:47.081 annotation of your entire program source with the heat of every line basically 0:19:47.081,0:19:51.761 how much time was spent there. It does slow down your program by like an order 0:19:51.761,0:19:56.021 of magnitude or more, and it doesn't really support threads but it is really 0:19:56.021,0:20:01.121 useful if you can use it. If you can't, then tools like perf or similar tools 0:20:01.121,0:20:05.201 for other languages that do usually some kind of sampling profiling like we 0:20:05.201,0:20:09.811 talked about in the profiler lecture, can give you pretty useful data quickly, 0:20:09.811,0:20:15.160 but it's a lot of data around this, but they're a little bit 0:20:15.160,0:20:18.971 biased and what kind of things they usually highlight as a problem and it 0:20:18.971,0:20:22.961 can sometimes be hard to extract meaningful information about what should 0:20:22.961,0:20:27.701 I change in response to them. 
Whereas the sort of print approach very quickly 0:20:27.701,0:20:32.171 gives you like this section of code is bad or slow 0:20:32.171,0:20:34.871 I think would be my answer 0:20:34.871,0:20:40.431 Flamegraphs are great, they're a good way to visualize some of this information 0:20:41.491,0:20:45.550 Yeah I just have one thing to add, oftentimes programming languages 0:20:45.550,0:20:48.910 have language specific tools for profiling so to figure out what's the 0:20:48.910,0:20:52.191 right tool to use for your language like if you're doing JavaScript in the web browser 0:20:52.191,0:20:55.411 the web browser has a really nice tool for doing profiling you should just use that 0:20:55.411,0:21:00.471 or if you are using go, for example, go has a built-in profiler is really good you should just use that 0:21:01.711,0:21:04.251 A last thing to add to that 0:21:04.251,0:21:09.951 Sometimes you might find that doing this binary search over time that you're kind of 0:21:09.961,0:21:14.351 finding where the time is going, but this time is sometimes happening because 0:21:14.351,0:21:18.461 you're waiting on the network, or you're waiting for some file, and in that case 0:21:18.461,0:21:23.440 you want to make sure that the time that is, if I want to write 0:21:23.440,0:21:27.310 like 1 gigabyte file or like read 1 gigabyte file and put it into memory 0:21:27.310,0:21:32.260 you want to check that the actual time there, is the minimum amount of time 0:21:32.260,0:21:36.221 you actually have to wait. If it's ten times longer, you should try to use some 0:21:36.221,0:21:39.371 other tools that we covered in the debugging and profiling section to see 0:21:39.371,0:21:45.671 why you're not utilizing all your resources because that might... 
0:21:50.511,0:21:56.071 Because that might be a lot of what's happening, like for example, in my research 0:21:56.081,0:21:59.410 in machine learning workloads, a lot of time is loading data and you have to 0:21:59.410,0:22:02.981 make sure that the time it takes to load data is actually the minimum amount 0:22:02.981,0:22:07.500 of time you want to have that happening 0:22:08.040,0:22:13.481 And to build on that, there are actually specialized tools for doing things like 0:22:13.481,0:22:17.351 analyzing wait times. Very often when you're waiting for something what's 0:22:17.351,0:22:20.591 really happening is you're issuing a system call, and that system call takes 0:22:20.591,0:22:24.191 some amount of time to respond. Like you do a really large write, or a really large read 0:22:24.191,0:22:28.361 or you do many of them, and one thing that can be really handy here is 0:22:28.361,0:22:31.841 to try to get information out of the kernel about where your program is 0:22:31.841,0:22:37.000 spending its time. And so there's (it's not new), but there's a relatively 0:22:37.000,0:22:42.820 newly available thing called BPF or eBPF. Which is essentially kernel tracing 0:22:42.820,0:22:48.531 and you can do some really cool things with it, and that includes tracing user programs. 0:22:48.531,0:22:51.760 It can be a little bit awkward to get started with, there's a tool 0:22:51.760,0:22:56.201 called bpftrace that I would recommend you look into, if you need to do like 0:22:56.201,0:23:00.040 this kind of low-level performance debugging. But it is really good for this 0:23:00.040,0:23:04.601 kind of stuff. You can get things like histograms over how much time was spent 0:23:04.601,0:23:06.671 in particular system calls 0:23:06.671,0:23:09.721 It's a great tool 0:23:12.251,0:23:15.351 What browser plugins do you use?
0:23:16.731,0:23:19.731 I try to use as few as I can get away with using 0:23:19.731,0:23:25.991 because I don't like things being in my browser, but there are a couple of 0:23:25.991,0:23:30.311 ones that are sort of staples. The first one is uBlock Origin. 0:23:30.311,0:23:36.611 So uBlock Origin is one of many ad blockers but it's a little bit more than an ad blocker. 0:23:36.611,0:23:42.530 It is (a what do they call it?) a network filtering tool so it lets 0:23:42.530,0:23:47.331 you do more things than just block ads. It also lets you like block connections 0:23:47.331,0:23:51.351 to certain domains, block connections for certain types of resources 0:23:51.351,0:23:56.031 So I have mine set up in what they call the Advanced Mode, where basically 0:23:56.031,0:24:02.451 you can disable basically all network requests. But it's not just Network requests, 0:24:02.451,0:24:07.430 It's also like I have disabled all inline scripts on every page and all 0:24:07.430,0:24:11.540 third-party images and resources, and then you can sort of create a whitelist 0:24:11.540,0:24:16.351 for every page so it gives you really low-level tools around how to 0:24:16.351,0:24:20.331 how to improve the security of your browsing. But you can also set it in not the 0:24:20.331,0:24:23.991 advanced mode, and then it does much of the same as a regular ad blocker would 0:24:23.991,0:24:28.101 do, although in a fairly efficient way if you're looking at an ad blocker it's 0:24:28.101,0:24:31.510 probably the one to use and it works on like every browser 0:24:31.511,0:24:34.451 That would be my top pick I think, 0:24:39.111,0:24:44.391 I think probably the one I use like the most actively 0:24:44.391,0:24:50.391 is one called Stylus. It lets you modify the CSS or like the stylesheets 0:24:50.391,0:24:54.560 that webpages have. 
And it's pretty neat, because sometimes you're 0:24:54.560,0:24:58.550 looking at a website and you want to hide some part of the website 0:24:58.550,0:25:04.211 you don't care about. Like maybe a ad, maybe some sidebar you're not finding useful 0:25:04.211,0:25:06.290 The thing is, at the end of the day these things are 0:25:06.290,0:25:09.591 displaying in your browser, and you have control of what code is 0:25:09.591,0:25:13.131 executing and similar to what Jon was saying, like you can customize this 0:25:13.131,0:25:18.491 to no end, and what I have for a lot of web pages like hide this this part, or 0:25:18.491,0:25:23.390 also trying to make like dark modes for them like you can change pretty much the 0:25:23.390,0:25:26.810 color for every single website. And what is actually pretty neat is that there's 0:25:26.810,0:25:31.461 like a repository online of people that have contributed this is stylesheets 0:25:31.461,0:25:35.031 for the websites. So someone probably has (done) one for GitHub 0:25:35.031,0:25:38.780 Like I want dark GitHub and someone has already contributed one that makes 0:25:38.780,0:25:44.631 that much more pleasing to browse. Apart from that, one that it's not really 0:25:44.631,0:25:49.491 fancy, but I have found incredibly helpful is one that just takes a screenshot an 0:25:49.491,0:25:53.121 entire website. And It will scroll for you and make 0:25:53.121,0:25:57.711 compound image of the entire website and that's really great for when you're trying to 0:25:57.711,0:26:00.111 print a website and is just terrible. 
0:26:00.111,0:26:00.611 (It's built into Firefox) 0:26:00.611,0:26:02.671 oh interesting 0:26:02.671,0:26:05.751 oh now that you mention builtin to Firefox, another one that I really like about 0:26:05.751,0:26:09.071 Firefox is the multi account containers 0:26:09.071,0:26:10.831 (Oh yeah, it's fantastic) 0:26:10.831,0:26:12.291 Which kind of lets you 0:26:12.291,0:26:16.670 By default a lot of web browsers, like for example Chrome, have this 0:26:16.670,0:26:20.601 notion of like there's session that you have, where you have all your cookies 0:26:20.601,0:26:24.560 and they are kind of all shared from the different websites in the sense of 0:26:24.560,0:26:30.811 you keep opening new tabs and unless you go into incognito you kind of have the same profile 0:26:30.811,0:26:34.190 And that profile is the same for all websites, there is this 0:26:34.191,0:26:35.851 Is it an extension or is it built in? 0:26:35.851,0:26:40.571 (it's a mix, it's complicated) 0:26:41.091,0:26:46.211 So I think you actually have to say you want to install it or enable it and again 0:26:46.221,0:26:49.881 the name is Multi Account Containers and these let you tell Firefox to have 0:26:49.881,0:26:53.961 separate isolated sessions. 
So for example, you want to say 0:26:53.961,0:26:58.851 I have a separate sessions for whenever I visit to Google or whenever I visit Amazon 0:26:58.851,0:27:01.791 and that can be pretty neat, because then you can 0:27:01.791,0:27:08.171 At a browser level it's ensuring that no information sharing is happening between the two of them 0:27:08.171,0:27:11.961 And it's much more convenient than having to open a incognito window 0:27:11.961,0:27:14.471 where it's gonna clean all the time the stuff 0:27:14.471,0:27:17.311 (One thing to mention is Stylus vs Stylish) 0:27:17.531,0:27:19.651 Oh yeah, I forgot about that 0:27:19.651,0:27:24.931 One important thing is the browser extension for side loading CSS Stylesheets 0:27:24.931,0:27:31.851 it's called a Stylus and that's different from the older one that was 0:27:31.851,0:27:37.400 called Stylish, because that one got bought at some point by some shady 0:27:37.400,0:27:40.711 company, that started abusing it not only to have 0:27:40.711,0:27:45.780 that functionality, but also to read your entire browser history and send that 0:27:45.780,0:27:48.491 back to their servers so they could data mine it. 0:27:48.491,0:27:53.731 So, then people just built this open-source alternative that is called Stylus, and that's the one 0:27:53.731,0:27:58.951 we recommend. Said that, I think the repository for styles is the same for the 0:27:58.951,0:28:03.611 two of them, but I would have to double check that. 0:28:03.611,0:28:05.951 Do you have any browser plugins Anish? 0:28:06.071,0:28:09.311 Yes, so I also have some recommendations for browser plugins 0:28:09.311,0:28:13.991 I also use uBlock Origin and I also use Stylus, 0:28:13.991,0:28:18.511 but one other one that I'd recommend is integration with a password manager 0:28:18.511,0:28:21.631 So this is a topic that we have in the lecture notes for the security 0:28:21.631,0:28:24.841 lecture, but we didn't really get to talk about in detail. 
But basically password 0:28:24.841,0:28:27.810 managers do a really good job of increasing your security when working 0:28:27.810,0:28:31.831 with online accounts, and having browser integration with your password manager 0:28:31.831,0:28:34.410 can save you a lot of time like you can open up a website then it can 0:28:34.410,0:28:37.381 autofill your login information for you, so you don't have to go and copy and paste it 0:28:37.381,0:28:40.320 back and forth between a separate program if it's not integrated with your 0:28:40.320,0:28:43.410 web browser, and it can also, this integration, can save you from certain 0:28:43.410,0:28:47.651 attacks that would otherwise be possible if you were doing this manual copy pasting. 0:28:47.651,0:28:50.790 For example, phishing attacks. So you find a website that looks very 0:28:50.790,0:28:54.211 similar to Facebook and you go to log in with your Facebook login credentials and 0:28:54.211,0:28:56.851 you go to your password manager and copy paste the correct credentials into this 0:28:56.851,0:29:00.060 funny web site and now all of a sudden it has your password but if you have 0:29:00.060,0:29:03.091 browser integration then the extension can automatically check 0:29:03.091,0:29:06.951 like, am I on F A C E B O O K.com, or is it some other domain 0:29:06.951,0:29:10.671 that maybe looks similar and it will not enter the login information if it's the wrong domain 0:29:10.671,0:29:15.791 so browser extension for password managing is good 0:29:15.791,0:29:17.930 Yeah I agree 0:29:19.491,0:29:20.711 Next question 0:29:20.711,0:29:23.991 What are other useful data wrangling tools? 0:29:23.991,0:29:32.421 So in yesterday's lecture, I mentioned curl, so curl is a fantastic tool for just making web 0:29:32.421,0:29:35.811 requests and dumping them to your terminal. You can also use it for things 0:29:35.811,0:29:41.191 like uploading files which is really handy.
0:29:41.191,0:29:48.431 In the exercises of that lecture we also talked about JQ and pup which are command line tools that let you 0:29:48.431,0:29:52.991 basically write queries over JSON and HTML documents respectively 0:29:52.991,0:30:00.391 that can be really handy. Other data wrangling tools? 0:30:00.391,0:30:03.821 Ah Perl, the Perl programming language is 0:30:03.821,0:30:08.061 often referred to as a write only programming language because it's 0:30:08.061,0:30:13.431 impossible to read even if you wrote it. But it is fantastic at doing just like 0:30:13.431,0:30:21.561 straight up text processing, like nothing beats it there, so maybe worth learning 0:30:21.561,0:30:24.331 some very rudimentary Perl just to write some of those scripts 0:30:24.331,0:30:29.371 It's easier often than writing some like hacked-up combination of grep and awk and sed, 0:30:29.371,0:30:36.311 and it will be much faster to just tack something up than writing it up in Python, for example 0:30:36.311,0:30:44.031 but apart from that, other data wrangling 0:30:44.031,0:30:47.071 No, not off the top of my head really 0:30:47.071,0:30:53.661 column -t, if you pipe any white space separated 0:30:53.661,0:30:58.821 input into column -t it will align all the white space of the columns so that 0:30:58.821,0:31:05.771 you get nicely aligned columns that's, and head and tail but we talked about those 0:31:09.011,0:31:13.791 I think a couple of additions to that, that I find myself using commonly 0:31:13.791,0:31:19.881 one is vim. Vim can be pretty useful for like data wrangling on itself 0:31:19.881,0:31:22.461 Sometimes you might find that the operation that you're trying to do is 0:31:22.461,0:31:27.711 hard to put down in terms of piping different operators but if you 0:31:27.711,0:31:32.531 can just open the file and just record 0:31:32.531,0:31:37.301 a couple of quick vim macros to do what you want it to do, it might be like much, 0:31:37.301,0:31:42.311 much easier. 
That's one, and then the other one, if you're dealing with tabular 0:31:42.311,0:31:46.091 data and you want to do more complex operations like sorting by one column, 0:31:46.091,0:31:51.161 then grouping and then computing some sort of statistic, I think for a lot of that 0:31:51.161,0:31:55.951 workload I end up just using Python and pandas because it's built for that 0:31:55.951,0:32:00.190 And one of the pretty neat features that I find myself also using is that it 0:32:00.190,0:32:03.931 will export to many different formats. So this intermediate state 0:32:03.931,0:32:09.221 has its own kind of pandas dataframe object but it can 0:32:09.221,0:32:14.171 export to HTML, LaTeX, a lot of different like table formats so if your end 0:32:14.171,0:32:19.531 product is some sort of summary table, then pandas, I think, is a fantastic choice for that 0:32:21.111,0:32:24.791 I would second the vim and also Python I think those are 0:32:24.791,0:32:29.051 two of my most used data wrangling tools.
For the vim one, last year we had a demo 0:32:29.051,0:32:31.841 in the series in the lecture notes, but we didn't cover it in class we had a 0:32:31.841,0:32:38.051 demo of turning an XML file into a JSON version of that same data using only vim macros 0:32:38.051,0:32:40.331 And I think that's actually the way I would do it in practice 0:32:40.331,0:32:43.241 I don't want to go find a tool that does this conversion it is actually simple 0:32:43.241,0:32:45.431 to encode as a vim macro, then I just do it that way 0:32:45.431,0:32:48.991 And then also Python especially in an interactive tool like a Jupyter notebook 0:32:48.991,0:32:51.171 is a really great way of doing data wrangling 0:32:51.171,0:32:52.951 A third tool I'd mention which I don't remember if we 0:32:52.961,0:32:55.361 covered in the data wrangling lecture or elsewhere 0:32:55.361,0:32:58.751 is a tool called pandoc which can do transformations between different text 0:32:58.751,0:33:02.981 document formats so you can convert from plaintext to HTML or HTML to markdown 0:33:02.981,0:33:07.361 or LaTeX to HTML or many other formats it actually it supports a large 0:33:07.361,0:33:10.471 list of input formats and a large list of output formats 0:33:10.471,0:33:16.361 I think there's one last one which I mentioned briefly in the lecture on data wrangling which is 0:33:16.361,0:33:20.441 the R programming language, it's an awful (I think it's an awful) 0:33:20.441,0:33:25.120 language to program in. 
And I would never use it in the middle of a data wrangling 0:33:25.120,0:33:30.951 pipeline, but at the end, in order to like produce pretty plots and statistics R is great 0:33:30.951,0:33:35.581 Because R is built for doing statistics and plotting 0:33:35.581,0:33:40.591 there's a library for R called ggplot which is just amazing 0:33:40.591,0:33:46.551 ggplot2 I guess technically. It's great, it produces very 0:33:46.551,0:33:51.431 nice visualizations and it lets you do, it does very easily do things like 0:33:51.431,0:33:57.561 If you have a data set that has like multiple facets like it's not just X and Y 0:33:57.561,0:34:03.111 it's like X Y Z and some other variable, and then you want to plot like the 0:34:03.111,0:34:07.581 throughput grouped by all of those parameters at the same time and produce 0:34:07.581,0:34:11.991 a visualization. R very easily lets you do this and I haven't seen anywhere 0:34:11.991,0:34:14.891 that lets you do that as easily 0:34:16.971,0:34:17.951 Next question, 0:34:17.951,0:34:20.511 What's the difference between Docker and a virtual machine? 0:34:23.271,0:34:27.731 What's the easiest way to explain this? So Docker 0:34:27.741,0:34:31.221 starts something called containers and Docker is not the only program that 0:34:31.221,0:34:36.561 starts containers. There are many others and usually they rely on some feature of 0:34:36.561,0:34:40.401 the underlying kernel in the case of Docker they use something called LXC 0:34:40.401,0:34:47.571 which are Linux containers and the basic premise there is if you want to start 0:34:47.571,0:34:53.181 what looks like a virtual machine that is running roughly the same operating 0:34:53.181,0:34:57.411 system as you are already running on your computer then you don't really need 0:34:57.411,0:35:04.701 to run another instance of the kernel really that other virtual machine can 0:35:04.701,0:35:09.951 share a kernel.
And you can just use the kernel's built-in isolation mechanisms to 0:35:09.951,0:35:13.791 spin up a program that thinks it's running on its own hardware but in 0:35:13.791,0:35:18.501 reality it's sharing the kernel and so this means that containers can often run 0:35:18.501,0:35:22.611 with much lower overhead than a full virtual machine will do but you should 0:35:22.611,0:35:26.391 keep in mind that it also has somewhat weaker isolation because you are sharing 0:35:26.391,0:35:30.831 a kernel between the two if you spin up a virtual machine the only thing that's 0:35:30.831,0:35:35.931 shared is sort of the hardware and to some extent the hypervisor, whereas 0:35:35.931,0:35:40.791 with a Docker container you're sharing the full kernel and that is a 0:35:40.791,0:35:44.921 different threat model that you might have to keep in mind 0:35:47.341,0:35:52.361 One other small note there, as Jon pointed out, to use containers, something 0:35:52.361,0:35:55.631 like Docker, you need the underlying operating system to be roughly the same 0:35:55.631,0:36:00.071 as whatever the program that's running on top of the container expects and so 0:36:00.071,0:36:03.791 if you're using macOS for example, the way you use Docker is you run Linux 0:36:03.791,0:36:08.261 inside a virtual machine and then you can run Docker on top of Linux so maybe 0:36:08.261,0:36:11.741 if you're going for containers in order to get better performance, you're trading 0:36:11.741,0:36:15.131 isolation for performance, if you're running on macOS that may not work out 0:36:15.131,0:36:17.451 exactly as expected 0:36:17.451,0:36:21.221 And one last note is that there is a slight difference, so 0:36:21.221,0:36:25.721 with Docker and containers, one of the gotchas you have 0:36:25.721,0:36:29.411 to be familiar with is that containers are more similar to virtual 0:36:29.411,0:36:33.071 machines in the sense that they will persist all the storage that you 0:36:33.071,0:36:35.971 have where Docker by
default won't have that. 0:36:35.971,0:36:37.791 Like Docker is supposed to be running 0:36:37.791,0:36:41.771 So the main idea is like I want to run some software and 0:36:41.771,0:36:45.671 I get the image and it runs and if you want to have any kind of persistent 0:36:45.671,0:36:50.081 storage that links to the host system you have to kind of manually specify 0:36:50.081,0:36:56.051 that, whereas a virtual machine is using some virtual disk that is being provided 0:36:56.051,0:37:02.671 Next question 0:37:02.671,0:37:05.111 What are the advantages of each operating system 0:37:05.111,0:37:08.531 and how can we choose between them? For example, choosing the best Linux 0:37:08.531,0:37:10.551 distribution for our purposes 0:37:14.251,0:37:16.811 I will say that for many, many tasks the 0:37:16.811,0:37:20.171 specific Linux distribution that you're running is not that important 0:37:20.171,0:37:23.731 the thing is, it's just what kind of 0:37:23.731,0:37:27.651 knowing that there are different types or like groups of distributions, 0:37:27.651,0:37:32.251 So for example, there are some distributions that have really frequent updates 0:37:32.251,0:37:38.971 but they kind of break more easily. So for example Arch Linux has a rolling update 0:37:38.971,0:37:43.511 way of pushing updates, where things might break but they're fine with the things 0:37:43.511,0:37:47.891 being that way. Where maybe where you have some really important web server 0:37:47.891,0:37:51.401 that is hosting all your business analytics you want that thing 0:37:51.401,0:37:55.961 to have like a much more steady way of updates. 
So that's for example why you 0:37:55.961,0:37:58.121 will see distributions like Debian being 0:37:58.121,0:38:02.951 much more conservative about what they push, or even for example Ubuntu makes a difference 0:38:02.951,0:38:07.001 between the Long Term Releases that they only update every 0:38:07.001,0:38:12.281 two years and the more periodic releases, 0:38:12.281,0:38:16.661 it's like two a year that they make. So, kind of knowing that there's the 0:38:16.661,0:38:21.341 difference. Apart from that some distributions have different ways 0:38:21.341,0:38:27.191 of providing the binaries to you and the way they 0:38:27.191,0:38:33.791 have the repositories so I think a lot of Red Hat Linux don't want non free drivers in 0:38:33.791,0:38:37.361 their official repositories where I think Ubuntu is fine with some of 0:38:37.361,0:38:42.491 them, apart from that I think like just a lot of what is core to most Linux 0:38:42.491,0:38:47.411 distros is kind of shared between them and there's a lot of learning in the 0:38:47.411,0:38:51.431 common ground. So you don't have to worry about the specifics 0:38:52.391,0:38:56.351 Keeping with the theme of this class being somewhat opinionated, I'm gonna go ahead and say 0:38:56.351,0:39:00.041 that if you're using Linux especially for the first time choose something like 0:39:00.041,0:39:03.851 Ubuntu or Debian. So Ubuntu is a Debian based distribution but maybe is a 0:39:03.851,0:39:07.421 little bit more friendly, Debian is a little bit more minimalist. I use Debian 0:39:07.421,0:39:10.451 on all my servers, for example.
And I use Debian desktop on my desktop computers 0:39:10.451,0:39:15.431 that run Linux. If you're going for maybe trying to learn more things and you want 0:39:15.431,0:39:19.391 a distribution that trades stability for having more up-to-date software maybe 0:39:19.391,0:39:21.911 at the expense of you having to fix a broken distribution every once in a 0:39:21.911,0:39:26.911 while then maybe you can consider something like Arch Linux or Gentoo 0:39:26.911,0:39:32.681 or Slackware. Oh man, I'd say that like if you're installing Linux and just like 0:39:32.681,0:39:34.891 want to get work done Debian is a great choice 0:39:35.911,0:39:38.271 Yeah I think I agree with that. 0:39:38.271,0:39:40.971 The other observation is like you could install BSD 0:39:40.971,0:39:46.691 BSD has gotten, has come a long way from where it was. There's still a bunch of 0:39:46.691,0:39:50.921 software you can't really get for BSD but it gives you a very well-documented 0:39:50.921,0:39:55.841 experience and one thing that's different about BSD compared to Linux is 0:39:55.841,0:40:02.531 that in BSD, when you install BSD you get a full operating system, mostly 0:40:02.651,0:40:07.531 So many of the programs are maintained by the same team that maintains the kernel 0:40:07.541,0:40:11.351 and everything is sort of upgraded together, which is a little different 0:40:11.351,0:40:13.271 than how things work in the Linux world. It does 0:40:13.271,0:40:16.751 mean that things often move a little bit slower. I would not use it for things 0:40:16.751,0:40:21.791 like gaming either, because driver support is meh. But it is an interesting 0:40:21.791,0:40:30.661 environment to look at.
And then for things like Mac OS and Windows I think 0:40:30.661,0:40:36.041 If you are a programmer, I don't know why you are using Windows unless you are 0:40:36.041,0:40:42.401 building things for Windows; or you want to be able to do gaming and stuff 0:40:42.401,0:40:46.891 but in that case, maybe try dual booting, even though that's a pain too 0:40:46.891,0:40:52.031 Mac OS is a is a good sort of middle point between the two where you get a system 0:40:52.031,0:40:57.851 that is like relatively nicely polished for you. But you still have access to 0:40:57.851,0:41:01.191 some of the lower-level bits at least to a certain extent. 0:41:01.191,0:41:07.451 it's also really easy to dual boot Mac OS and Windows it is not quite the case with like Mac OS and 0:41:07.451,0:41:09.651 Linux or Linux and Windows 0:41:13.911,0:41:15.751 Alright, for the rest of the questions so these are 0:41:15.761,0:41:18.761 all 0 upvote questions so maybe we can go through them quickly in the last five 0:41:18.761,0:41:23.471 or so minutes of class. So the next one is Vim versus Emacs? Vim! 0:41:23.471,0:41:30.911 Easy answer, but a more serious answer is like I think all three of us use vim as our primary editor 0:41:30.911,0:41:34.931 I use Emacs for some research specific stuff which requires Emacs but 0:41:34.931,0:41:38.681 at a higher level both editors have interesting ideas behind them and if you 0:41:38.681,0:41:43.061 have the time is worth exploring both to see which fits you better and also 0:41:43.061,0:41:46.811 you can use Emacs and run it in a vim emulation mode. I actually know a 0:41:46.811,0:41:49.091 good number of people who do that so they get access to some of the cool 0:41:49.091,0:41:52.631 Emacs functionality and some of the cool philosophy behind that like Emacs is 0:41:52.631,0:41:55.391 programmable through Lisp which is kind of cool. 
0:41:55.391,0:41:59.411 Much better than vimscript, but people like vim's modal editing, so there's an 0:41:59.411,0:42:04.481 emacs plugin called evil mode which gives you vim modal editing within Emacs so 0:42:04.481,0:42:08.081 it's not necessarily a binary choice you can kind of combine both tools if you 0:42:08.081,0:42:11.151 want to. And it's worth exploring both if you have the time. 0:42:11.151,0:42:12.731 Next question 0:42:12.731,0:42:15.671 Any tips or tricks for machine learning applications? 0:42:19.271,0:42:22.351 I think, like knowing how 0:42:22.361,0:42:24.791 a lot of these tools, mainly the data wrangling 0:42:24.791,0:42:30.041 a lot of the shell tools, it's really important because it seems a lot 0:42:30.041,0:42:33.851 of what you're doing as machine learning researcher is trying different things 0:42:33.851,0:42:39.491 but I think one core aspect of doing that, and like a lot of scientific work is being 0:42:39.491,0:42:44.501 able to have reproducible results and logging them in a sensible way 0:42:44.501,0:42:47.711 So for example, instead of trying to come up with really hacky solutions of how 0:42:47.711,0:42:51.151 you name your folders to make sense of the experiments 0:42:51.151,0:42:53.251 Maybe it's just worth having for example 0:42:53.251,0:42:55.931 what I do is have like a JSON file that describes the 0:42:55.931,0:43:00.371 entire experiment I know like all the parameters that are within and then I can 0:43:00.371,0:43:05.111 really quickly, using the tools that we have covered, query for all the 0:43:05.111,0:43:09.701 experiments that have some specific purpose or use some data set 0:43:09.701,0:43:15.071 Things like that. 
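The "JSON file per experiment" idea above can be sketched roughly as follows. The directory layout, the file name `config.json`, and the keys `dataset` and `lr` are all hypothetical, chosen only for illustration. Plain `grep` is used here to keep the sketch dependency-free; a JSON-aware tool such as `jq` would be more robust against formatting differences.

```shell
# Hypothetical layout: one directory per experiment, each with a config.json.
root=$(mktemp -d)
mkdir -p "$root/exp1" "$root/exp2"
printf '{"dataset": "mnist", "lr": 0.01}\n'  > "$root/exp1/config.json"
printf '{"dataset": "cifar10", "lr": 0.1}\n' > "$root/exp2/config.json"

# List the config files of all experiments that used a given data set.
matches=$(grep -l '"dataset": "mnist"' "$root"/*/config.json)
printf '%s\n' "$matches"
```

Because every parameter lives in one machine-readable file, the folder names no longer have to encode the experiment settings.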
Apart from that, the other side of this is, if you are running 0:43:15.071,0:43:19.871 kind of things for training machine learning applications and you 0:43:19.871,0:43:23.981 are not already using some sort of cluster, like university or your 0:43:23.981,0:43:28.301 company is providing and you're just kind of manually sshing, like a lot of 0:43:28.301,0:43:31.231 labs do, because that's kind of the easy way 0:43:31.231,0:43:36.671 It's worth automating a lot of that job because it might not seem like it but 0:43:36.671,0:43:40.601 manually doing a lot of these operations takes away a lot of your time and also 0:43:40.601,0:43:45.031 kind of your mental energy for running these things 0:43:48.551,0:43:51.691 Anymore vim tips? 0:43:51.691,0:43:56.771 I have one. So in the vim lecture we tried not to link you to too many different 0:43:56.771,0:44:00.131 vim plugins because we didn't want that lecture to be overwhelming but I think 0:44:00.131,0:44:02.921 it's actually worth exploring vim plugins because there are lots and lots 0:44:02.921,0:44:07.091 of really cool ones out there. One resource you can use is the 0:44:07.091,0:44:10.571 different instructors dotfiles like a lot of us, I think I use like two dozen 0:44:10.571,0:44:14.321 vim plugins and I find a lot of them quite helpful and I use them every day 0:44:14.321,0:44:18.311 we all use slightly different subsets of them. So go look at what we use or look 0:44:18.311,0:44:22.131 at some of the other resources we've linked to and you might find some stuff useful 0:44:22.791,0:44:26.951 A thing to add to that is, I don't think we went into a lot detail in the 0:44:27.041,0:44:31.571 lecture, correct me if I'm wrong. 
It's getting familiar with the leader key 0:44:31.571,0:44:35.021 Which is kind of a special key that a lot of programs will 0:44:35.021,0:44:39.081 especially plugins, that will link to and for a lot of the common operations 0:44:39.081,0:44:44.661 vim has short ways of doing it, but you can just figure out like quicker 0:44:44.661,0:44:50.031 versions for doing them. So for example, like I know that you can do like colon WQ 0:44:50.031,0:44:55.521 to save and exit or that you can do like capital ZZ but I 0:44:55.521,0:44:59.241 just actually just do leader (which for me is the space) and then W. And I have 0:44:59.241,0:45:04.131 done that for a lot of kind of common operations that I keep doing all 0:45:04.131,0:45:08.091 the time. Because just saving one keystroke for an extremely common operation 0:45:08.091,0:45:11.371 is just saving thousands a month 0:45:11.371,0:45:12.951 Yeah just to expand a little bit 0:45:12.951,0:45:17.031 on what the leader key is so in vim you can bind some keys I can do like ctrl J 0:45:17.031,0:45:20.481 does something like holding one key and then pressing another I can bind that to 0:45:20.481,0:45:23.781 something or I can bind a single keystroke to something. What the leader 0:45:23.781,0:45:26.031 key lets you do, is bind 0:45:26.031,0:45:28.311 So you can assign any key to be the leader key and 0:45:28.311,0:45:32.841 then you can assign leader followed by some other key to some action so for 0:45:32.841,0:45:36.831 example like Jose's leader key is space and they can bind space and then 0:45:36.831,0:45:41.601 releasing space followed by some other key to an arbitrary vim command so it 0:45:41.601,0:45:45.631 just gives you yet another way of binding like a whole set of key combinations.
0:45:45.631,0:45:49.751 Leader key plus kind of any key on the keyboard to some functionality 0:45:49.751,0:45:53.751 I think I've I forget whether we covered macros in the vim 0:45:53.751,0:45:58.581 uh sure but like vim macros are worth learning they're not that complicated 0:45:58.581,0:46:03.141 but knowing that they're there and knowing how to use them is going to save 0:46:03.141,0:46:09.501 you so much time. The other one is something called marks. So in vim you can 0:46:09.501,0:46:13.491 press m and then any letter on your keyboard to make a mark in that file and 0:46:13.491,0:46:18.021 then you can press apostrophe on the same letter to jump back to the same 0:46:18.021,0:46:21.801 place. This is really useful if you're like moving back and forth 0:46:21.801,0:46:25.491 between two different parts of your code for example. You can mark one as A and 0:46:25.491,0:46:29.611 one as B and you can then jump between them with tick A and tick B. 0:46:29.611,0:46:34.851 There's also Ctrl+O which jumps to the previous place you were in the file no matter 0:46:34.851,0:46:40.611 what caused you to move. So for example if I am in a some line and then I jump 0:46:40.611,0:46:45.201 to B and then I jump to A, Ctrl+O will take me back to B and then back to the 0:46:45.201,0:46:48.831 place I originally was. This can also be handy for things like if you're doing a 0:46:48.831,0:46:52.671 search then the place that you started the search is a part of 0:46:52.671,0:46:56.211 that stack. 
So I can do a search I can then like step through the results 0:46:56.211,0:47:00.801 and like change them and then Ctrl+O all the way back up to the search 0:47:00.801,0:47:06.201 Ctrl+O also lets you move across files so if I go from one file to somewhere else in 0:47:06.201,0:47:09.681 a different file and somewhere else in the first file Ctrl+O will move me back 0:47:09.681,0:47:15.261 through that stack and then there's Ctrl+I to move forward in that 0:47:15.261,0:47:20.841 stack and so it's not as though you pop it and it goes away forever 0:47:20.841,0:47:26.541 The command colon earlier is really handy. So, colon earlier gives you an earlier 0:47:26.541,0:47:32.870 version of the same file and it does this based on time not based on actions 0:47:32.870,0:47:36.651 so for example if you press a bunch of like undo and redo and make some changes 0:47:36.651,0:47:42.561 and stuff, earlier will take a literally earlier as in time version of your file 0:47:42.561,0:47:46.971 and restore it to your buffer. This can sometimes be good if you like undid and 0:47:46.971,0:47:50.841 then rewrote something and then realize you actually wanted the version that was 0:47:50.841,0:47:55.100 there before you started undoing, earlier lets you do this. And there's a plug-in 0:47:55.100,0:48:01.971 called undo tree or something like that There are several of these, 0:48:01.971,0:48:05.781 that let you actually explore the full tree of undo history that vim keeps 0:48:05.781,0:48:09.201 because it doesn't just keep a linear history it actually keeps the full tree 0:48:09.201,0:48:12.771 and letting you explore that might in some cases save you from having to 0:48:12.771,0:48:16.461 re-type stuff you typed in the past or stuff you just forgot exactly what you 0:48:16.461,0:48:21.081 had there that used to work and no longer works.
And this is one final one I 0:48:21.081,0:48:26.751 want to mention which is, we mentioned how in vim you have verbs and nouns 0:48:26.751,0:48:33.201 right to your verbs like delete or yank and then you have nouns like next of 0:48:33.201,0:48:37.401 this character or percent to swap brackets and that sort of stuff the 0:48:37.401,0:48:44.571 search command is a noun so you can do things like D slash and then a string 0:48:44.571,0:48:50.261 and it will delete up to the next match of that pattern this is extremely useful 0:48:50.261,0:48:54.251 and I use it all the time 0:48:58.500,0:49:03.520 One another neat addition on the undo stuff that I find incredibly valuable in 0:49:03.520,0:49:08.201 an everyday basis is that like one of the built-in functionalities of vim 0:49:08.201,0:49:13.510 is that you can specify an undo directory and if you have a specified an 0:49:13.510,0:49:17.620 undo directory by default vim, if you don't have this enabled, whenever you 0:49:17.620,0:49:23.091 enter a file your undo history is clean, there's nothing in there 0:49:23.091,0:49:26.371 and as you make changes and then undo them you kind of create this 0:49:26.380,0:49:32.800 history but as soon as you exit the file that's lost. Sorry, as soon 0:49:32.800,0:49:37.181 as you exit vim, that's lost. However if you have an undodir, vim is 0:49:37.181,0:49:41.651 gonna persist all those changes into this directory so no matter how many 0:49:41.651,0:49:45.580 times you enter and leave that history is persisted and it's incredibly 0:49:45.580,0:49:48.191 helpful because even like 0:49:48.191,0:49:50.290 it can be very helpful for some files that you modify 0:49:50.290,0:49:54.760 often because then you can kind of keep the flow. But it's also sometimes really 0:49:54.760,0:50:00.010 helpful if you modify your bashrc see and something broke like five days later and 0:50:00.010,0:50:03.070 then you've vim again. 
Like what actually did I change, if you don't 0:50:03.070,0:50:06.760 have say like version control, then you can just check the undos and 0:50:06.760,0:50:10.661 that's actually what happened. And the last one, it's also really 0:50:10.661,0:50:14.891 worth familiarizing yourself with registers and what different special 0:50:14.891,0:50:20.380 registers vim uses. So for example if you want to copy/paste really that's 0:50:20.380,0:50:26.201 going into a specific register and if you want to for example use 0:50:26.201,0:50:30.040 like the OS clipboard, you should be copying or yanking 0:50:30.040,0:50:36.250 copying and pasting from a different register and there's a lot of them and yeah 0:50:36.251,0:50:41.310 I think that you should explore, there's a lot of things to know about registers 0:50:42.271,0:50:45.070 The next question is asking about two-factor authentication and I'll just give 0:50:45.070,0:50:48.490 a very quick answer to this one in the interest of time. So it's worth using two 0:50:48.490,0:50:52.480 factor auth for anything security sensitive so I use it for my GitHub 0:50:52.480,0:50:56.710 account and for my email and stuff like that. And there's a bunch of different 0:50:56.710,0:51:01.360 types of two-factor auth. From SMS based two-factor auth where you get special 0:51:01.360,0:51:04.630 like a number texted to you when you try to log in you have to type that number 0:51:04.630,0:51:08.710 and to other tools like Universal 2nd Factor (U2F), this is like those Yubikeys 0:51:08.710,0:51:11.350 that you plug into your you have to tap it every time you login 0:51:11.350,0:51:18.130 so not all, (yeah Jon is holding a Yubikey), not all two-factor auth is 0:51:18.130,0:51:22.240 created equal and you really want to be using something like U2F rather than SMS 0:51:22.240,0:51:25.300 based two-factor auth.
There something based on one-time pass codes that you 0:51:25.300,0:51:28.810 have to type in we don't have time to get into the details of why some methods 0:51:28.810,0:51:32.020 are better than others but at a high level use U2F and the Internet has 0:51:32.020,0:51:37.560 plenty of explanations for why other methods are not a great idea 0:51:37.711,0:51:41.851 Last question, any comments on differences between web browsers? 0:51:48.171,0:51:50.171 Yes 0:51:54.711,0:52:00.451 Differences between web browsers, there are fewer and fewer differences between 0:52:00.461,0:52:06.000 web browsers these day. At this point almost all web browsers are chrome 0:52:06.000,0:52:09.580 Either because you're using Chrome or because you're using a browser that's 0:52:09.580,0:52:15.550 using the same browser engine as Chrome. It's a little bit sad, one might say, but 0:52:15.550,0:52:20.511 I think these days whether you choose 0:52:20.511,0:52:24.451 Chrome is a great browser for security reasons 0:52:24.451,0:52:28.471 if you want to have something that's more customizable or 0:52:28.471,0:52:39.490 you don't want to be tied to Google then use Firefox, don't use Safari it's a 0:52:39.490,0:52:45.701 worse version of Chrome. The new Internet Explorer edge is pretty decent and also 0:52:45.701,0:52:50.820 uses the same browser engine as Chrome and that's probably fine 0:52:50.820,0:52:54.641 although avoid it if you can because it has some like legacy modes you don't 0:52:54.641,0:52:58.064 want to deal with. 
I think that's 0:52:58.064,0:53:03.091 Oh, there's a cool new browser called flow 0:53:03.091,0:53:05.500 that you can't use for anything useful yet but they're actually writing 0:53:05.500,0:53:08.693 their own browser engine and that's really neat 0:53:08.693,0:53:14.951 Firefox also has this project called servo which is they're really implementing their browser engine 0:53:14.951,0:53:19.570 in Rust in order to write it to be like super concurrent and what they've done 0:53:19.570,0:53:24.961 is they've started to take modules from that version and port them 0:53:24.961,0:53:29.041 over to gecko or integrate them with gecko which is the main browser engine 0:53:29.041,0:53:32.221 for Firefox just to get those speed ups there as well 0:53:32.221,0:53:37.031 and that's a neat neat thing you can be watching out for 0:53:39.231,0:53:41.851 That is all the questions, hey we did it. Nice 0:53:41.851,0:53:50.751 I guess thanks for taking the missing semester class and let's do it again next year ================================================ FILE: static/files/subtitles/2020/shell-tools.sbv ================================================ 0:00:00.400,0:00:02.860 Okay, welcome back. 0:00:02.860,0:00:05.920 Today we're gonna cover a couple separate 0:00:05.920,0:00:07.620 two main topics related to the shell. 0:00:07.620,0:00:11.240 First, we're gonna do some kind of shell scripting, mainly related to bash, 0:00:11.240,0:00:14.160 which is the shell that most of you will start 0:00:14.160,0:00:18.520 in Mac, or like in most Linux systems, that's the default shell. 0:00:18.520,0:00:22.720 And it's also kind of backward compatible through other shells like zsh, it's pretty nice. 0:00:22.740,0:00:25.940 And then we're gonna cover some other shell tools that are really convenient, 0:00:26.060,0:00:29.320 so you avoid doing really repetitive tasks, 0:00:29.320,0:00:31.580 like looking for some piece of code 0:00:31.580,0:00:33.420 or for some elusive file. 
0:00:33.420,0:00:36.160 And there are already really nice built-in commands 0:00:36.160,0:00:40.960 that will really help you to do those things. 0:00:40.960,0:00:43.260 So yesterday we already kind of introduced 0:00:43.260,0:00:46.160 you to the shell and some of its quirks, 0:00:46.160,0:00:48.720 and like how you start executing commands, 0:00:48.720,0:00:50.600 redirecting them. 0:00:50.600,0:00:52.400 Today, we're going to kind of cover more about 0:00:52.460,0:00:56.120 the syntax of the variables, the control flow, 0:00:56.120,0:00:57.720 functions of the shell. 0:00:57.720,0:01:02.700 So for example, once you drop into a shell, say you want to 0:01:02.760,0:01:06.360 define a variable, which is one of the first things you 0:01:06.360,0:01:09.340 learn to do in a programming language. 0:01:09.340,0:01:12.740 Here you could do something like foo equals bar (foo=bar, with no spaces). 0:01:12.860,0:01:18.400 And now we can access the value of foo by doing "$foo". 0:01:18.460,0:01:21.400 And that's bar, perfect. 0:01:21.400,0:01:24.480 One quirk that you need to be aware of is that 0:01:24.480,0:01:27.900 spaces are really critical when you're dealing with bash. 0:01:27.900,0:01:33.380 Mainly because spaces are reserved, and that will be for separating arguments. 0:01:33.380,0:01:36.700 So, for example, something like foo equals bar with spaces around the equals sign (foo = bar) 0:01:36.700,0:01:42.000 won't work, and the shell is gonna tell you why it's not working. 0:01:42.000,0:01:46.280 It's because the foo command is not working, like foo is non-existent. 0:01:46.280,0:01:47.780 And here what is actually happening, we're not assigning bar to foo, 0:01:47.780,0:01:52.260 what is happening is we're calling the foo program 0:01:52.260,0:01:57.520 with the first argument "=" and the second argument "bar". 0:01:57.520,0:02:03.880 And in general, whenever you are having some issues, like some files with spaces 0:02:03.880,0:02:06.160 you will need to be careful about that.
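As a minimal sketch of the assignment rule and the spaces quirk just described (the file name `my file.txt` and the helper variables `unquoted` and `quoted` are made up for illustration):

```shell
foo=bar        # no spaces around "=": assigns the string bar to foo
echo "$foo"

# With spaces, the shell would instead try to run a program called "foo"
# with the two arguments "=" and "bar":
#   foo = bar

# Spaces separate arguments, so a value containing a space must be quoted.
name="my file.txt"
set -- $name;   unquoted=$#   # unquoted: split into two words
set -- "$name"; quoted=$#     # quoted: stays a single word
echo "unquoted=$unquoted quoted=$quoted"
```

Here `set --` is only used to count how many arguments the shell produces in each case.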
0:02:06.160,0:02:10.620 You need to be careful about quoting strings. 0:02:10.640,0:02:16.480 So, going into that, how you do strings in bash. There are two ways that you can define a string: 0:02:16.540,0:02:24.720 You can define strings using double quotes and you can define strings using single, 0:02:24.720,0:02:26.540 sorry, 0:02:26.540,0:02:28.880 using single quotes. 0:02:29.140,0:02:32.760 For literal strings they are equivalent, 0:02:32.760,0:02:35.460 but for the rest they are not equivalent. 0:02:35.460,0:02:42.980 So, for example, if we do "Value is $foo" in double quotes, 0:02:43.440,0:02:48.480 the $foo has been expanded in the string, substituted with the 0:02:48.480,0:02:50.820 value of the foo variable in the shell. 0:02:50.960,0:02:58.940 Whereas if we do this with a single quote, we are just getting the $foo as it is 0:02:58.940,0:03:02.280 and single quotes won't do the replacement. Again, 0:03:02.280,0:03:07.290 it's really easy to write a script, assume that this is kind of like Python, that you might be 0:03:07.290,0:03:10.860 more familiar with, and not realize all that. 0:03:10.860,0:03:14.180 And this is the way you will assign variables. 0:03:14.180,0:03:17.849 Then bash also has control flow techniques that we'll see later, 0:03:17.849,0:03:24.440 like for loops, while loops, and one main thing is you can define functions. 0:03:24.440,0:03:27.820 We can access a function I have defined here. 0:03:28.220,0:03:34.220 Here we have the MCD function, that has been defined, and the thing is 0:03:34.220,0:03:38.400 so far, we have just kind of seen how to execute several commands by piping 0:03:38.400,0:03:40.720 into them, kind of saw that briefly yesterday. 0:03:40.940,0:03:44.980 But a lot of times you want to do first one thing and then another thing. 0:03:44.980,0:03:47.580 And that's kind of like the 0:03:47.740,0:03:50.880 sequential execution that we get here. 0:03:50.880,0:03:54.260 Here, for example, we're calling the MCD function.
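Going back to the two kinds of quotes described above, a short sketch of the difference (the variable names `double` and `single` are only for illustration):

```shell
foo=bar
double="Value is $foo"   # double quotes: $foo is expanded to its value
single='Value is $foo'   # single quotes: the text is kept literally

echo "$double"
echo "$single"
```

The first `echo` shows the substituted value, while the second shows the characters `$foo` unchanged.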
0:03:56.860,0:03:57.800 We, first, 0:03:57.800,0:04:02.960 are calling the mkdir command, which is creating this directory. 0:04:02.960,0:04:05.600 Here, $1 is like a special variable. 0:04:05.600,0:04:07.440 This is the way that bash works, 0:04:07.440,0:04:12.160 whereas in other scripting languages there will be like argv, 0:04:12.160,0:04:16.620 the first item of the array argv will contain the argument. 0:04:16.620,0:04:19.160 In bash it's $1. And in general, a lot 0:04:19.160,0:04:21.640 of things in bash will be dollar something 0:04:21.640,0:04:26.680 and will be reserved, we will be seeing more examples later. 0:04:26.680,0:04:30.290 And once we have created the folder, we CD into that folder, 0:04:30.290,0:04:34.687 which is kind of a fairly common pattern that you will see. 0:04:34.687,0:04:39.060 We could actually type this directly into our shell, and it will work and 0:04:39.120,0:04:45.260 it will define this function. But sometimes it's nicer to write things in a file. 0:04:45.260,0:04:50.040 What we can do is we can source this. And that will 0:04:50.080,0:04:53.960 execute this script in our shell and load it. 0:04:53.960,0:04:59.340 So now it looks like nothing happened, but now the MCD function has 0:04:59.340,0:05:03.460 been defined in our shell. So we can now for example do 0:05:03.463,0:05:09.150 MCD test, and now we move from the tools directory to the test 0:05:09.160,0:05:14.200 directory. We both created the folder and we moved into it. 0:05:15.760,0:05:18.820 What else. So a result is... 0:05:18.820,0:05:22.160 We can access the first argument with $1. 0:05:22.160,0:05:26.100 There are a lot more reserved variables, 0:05:26.100,0:05:30.020 for example $0 will be the name of the script, 0:05:30.020,0:05:35.260 $2 through $9 will be the second through the ninth arguments 0:05:35.260,0:05:38.070 that the bash script takes.
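A function along the lines of the `mcd` described above might look like the sketch below. The exact definition shown on screen is not part of this transcript, so the `-p` flag and the quoting are assumptions; the scratch directory from `mktemp -d` is only there so the example leaves no clutter behind.

```shell
# mcd: create a directory (and any missing parents), then change into it.
mcd () {
    mkdir -p "$1"    # $1 is the first argument passed to the function
    cd "$1"
}

workdir=$(mktemp -d)   # scratch directory for the demonstration
cd "$workdir"
mcd test
pwd                    # the path now ends in /test
```

Saving the function definition in a file and running `source` on that file loads it into the current shell, as described above.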
Some of these reserved 0:05:38.070,0:05:43.080 keywords can be directly used in the shell, so for example 0:05:43.420,0:05:50.300 $? will get you the error code from the previous command, 0:05:50.300,0:05:53.580 which I'll also explain briefly. 0:05:53.580,0:05:58.320 But for example, $_ will get you the last argument of the 0:05:58.320,0:06:03.460 previous command. So another way we could have done this is 0:06:03.460,0:06:07.380 we could have said like "mkdir test" 0:06:07.380,0:06:12.020 and instead of rewriting test, we can access that last argument 0:06:12.020,0:06:18.400 as part of the (previous command), using $_ 0:06:18.400,0:06:23.160 like, that will be replaced with test and now we go into test. 0:06:25.040,0:06:27.480 There are a lot of them, you should familiarize with them. 0:06:27.480,0:06:32.900 Another one I often use is called "bang bang" ("!!"), you will run into this 0:06:32.910,0:06:37.300 whenever you, for example, are trying to create something and you don't have 0:06:37.320,0:06:41.000 enough permissions. Then, you can do "sudo !!" 0:06:41.010,0:06:43.400 and then that will replace the command in 0:06:43.470,0:06:46.400 there and now you can just try doing 0:06:46.440,0:06:48.380 that. And now it will prompt you for a password, 0:06:48.380,0:06:50.080 because you have sudo permissions. 0:06:53.800,0:06:57.180 Before, I mentioned the, kind of the error command. 0:06:57.180,0:06:59.400 Yesterday we saw that, in general, there are 0:06:59.400,0:07:02.400 different ways a process can communicate 0:07:02.400,0:07:05.091 with other processes or commands. 0:07:05.100,0:07:08.420 We mentioned the standard input, which also was like 0:07:09.160,0:07:11.380 getting stuff through the standard input, 0:07:11.640,0:07:13.840 putting stuff into the standard output. 
0:07:13.840,0:07:16.830 There are a couple more interesting things, there's also like a 0:07:16.830,0:07:19.837 standard error, a stream where you write errors 0:07:19.837,0:07:23.900 that happen with your program and you don't want to pollute the standard output. 0:07:23.900,0:07:27.420 There's also the error code, which is like a general 0:07:27.420,0:07:29.520 thing in a lot of programming languages, 0:07:29.520,0:07:34.460 some way of reporting how the entire run of something went. 0:07:34.460,0:07:36.060 So if we do 0:07:36.060,0:07:41.020 something like echo hello and we 0:07:41.580,0:07:43.920 query for the value, it's zero. And it's zero 0:07:43.920,0:07:45.840 because everything went okay and there 0:07:45.840,0:07:49.170 weren't any issues. And a zero exit code is 0:07:49.170,0:07:50.940 the same as you will get in a language 0:07:50.940,0:07:54.980 like C, like 0 means everything went fine, there were no errors. 0:07:54.980,0:07:57.600 However, sometimes things won't work. 0:07:57.600,0:08:04.600 Sometimes, like if we try to grep for foobar in our MCD script, 0:08:04.600,0:08:08.130 and now we check for that value, it's 1. And that's 0:08:08.130,0:08:10.770 because we tried to search for the foobar 0:08:10.770,0:08:13.620 string in the MCD script and it wasn't there. 0:08:13.620,0:08:17.190 So grep doesn't print anything, but 0:08:17.190,0:08:19.950 let us know that things didn't work by 0:08:19.950,0:08:22.260 giving us a 1 error code. 0:08:22.260,0:08:24.420 There are some interesting commands like 0:08:24.420,0:08:29.160 "true", for example, will always have a zero 0:08:29.160,0:08:35.060 error code, and false will always have a one error code. 0:08:35.060,0:08:37.919 Then there are like 0:08:37.919,0:08:40.080 these logical operators that you can use 0:08:40.080,0:08:43.808 to do some sort of conditionals. For example, one way... 
0:08:43.808,0:08:47.160 you also have IF's and ELSE's, that we will see later, but you can do 0:08:47.160,0:08:51.920 something like "false", and echo "Oops fail". 0:08:51.920,0:08:56.300 So here we have two commands connected by this OR operator. 0:08:56.300,0:09:00.250 What bash is gonna do here, it's gonna execute the first one 0:09:00.250,0:09:04.450 and if the first one didn't work, then it's 0:09:04.450,0:09:07.380 gonna execute the second one. So here we get it, 0:09:07.380,0:09:12.000 because it's gonna try to do a logical OR. If the first one didn't have 0:09:12.000,0:09:15.960 a zero error code, it's gonna try to do the second one. Similarly, if we 0:09:15.960,0:09:19.580 instead of use "false", we use something like "true", 0:09:19.580,0:09:22.180 since true will have a zero error code, then the 0:09:22.180,0:09:24.700 second one will be short-circuited and 0:09:24.700,0:09:27.500 it won't be printed. 0:09:32.560,0:09:36.970 Similarly, we have an AND operator which will only 0:09:36.970,0:09:39.430 execute the second part if the first one 0:09:39.430,0:09:41.440 ran without errors. 0:09:41.440,0:09:44.820 And the same thing will happen. 0:09:44.820,0:09:50.340 If the first one fails, then the second part of this thing won't be executed. 0:09:50.340,0:09:57.280 Kind of not exactly related to that, but another thing that you will see is 0:10:00.020,0:10:04.120 that no matter what you execute, then you can concatenate 0:10:04.120,0:10:07.120 commands using a semicolon in the same line, 0:10:07.120,0:10:10.300 and that will always print. 0:10:10.300,0:10:13.630 Beyond that, what we haven't seen, for example, is how 0:10:13.630,0:10:19.460 you go about getting the output of a command into a variable. 0:10:19.630,0:10:24.120 And the way we can do that is doing something like this. 
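The exit codes and the OR, AND and semicolon operators covered a moment ago can be tried directly. This is a sketch, not the exact commands typed in the lecture; searching /dev/null is an assumption used to get a grep that is guaranteed to find nothing.

```shell
echo "hello"
echo "exit code: $?"                  # 0, echo succeeded

grep foobar /dev/null                 # an empty input never matches
grep_status=$?
echo "grep exit code: $grep_status"   # 1, nothing was found

false || echo "Oops, fail"            # runs: the first command failed
true  || echo "Will not be printed"   # skipped: the first command succeeded
true  && echo "Things went well"      # runs: the first command succeeded
false && echo "Will not be printed"   # skipped: the first command failed
false ;  echo "This will always run"  # ";" ignores the exit code
```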
0:10:24.120,0:10:29.480 What we're doing here is we're getting the output of the PWD command, 0:10:29.480,0:10:32.720 which is just printing the present working directory 0:10:32.720,0:10:33.740 where we are right now. 0:10:33.740,0:10:37.220 And then we're storing that into the foo variable. 0:10:37.220,0:10:42.279 So we do that and then we ask for foo, we view our string. 0:10:42.280,0:10:48.460 More generally, we can do this thing called command substitution 0:10:50.110,0:10:51.500 by putting it into any string. 0:10:51.500,0:10:55.162 And since we're using double quotes instead of single quotes 0:10:55.162,0:10:57.440 that thing will be expanded and 0:10:57.440,0:11:02.740 it will tell us that we are in this working folder. 0:11:02.740,0:11:09.240 Another interesting thing is, right now, what this is expanding to is a string 0:11:09.400,0:11:10.300 instead of 0:11:11.920,0:11:13.320 It's just expanding as a string. 0:11:13.460,0:11:17.640 Another nifty and lesser known tool is called process substitution, 0:11:17.640,0:11:20.540 which is kind of similar. What it will do... 0:11:24.360,0:11:30.041 it will, here for example, the "<(", some command and another parenthesis, 0:11:30.041,0:11:34.840 what that will do is: that will execute, that will get the output to 0:11:34.840,0:11:39.120 kind of like a temporary file and it will give the file handle to the command. 0:11:39.120,0:11:42.020 So here what we're doing is we're getting... 0:11:42.020,0:11:45.760 we're LS'ing the directory, putting it into a temporary file, 0:11:45.760,0:11:48.040 doing the same thing for the parent folder and then 0:11:48.040,0:11:51.310 we're concatenating both files. And this 0:11:51.310,0:11:55.520 will, may be really handy, because some commands instead of expecting 0:11:55.520,0:11:59.500 the input coming from the stdin, they are expecting things to 0:11:59.500,0:12:03.560 come from some file that is giving some of the arguments. 
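Both substitutions can be sketched in a few lines. Note the assumption that the shell is bash or zsh: the "<( )" form is not available in a plain POSIX sh.

```shell
# Command substitution: $(CMD) is replaced by the output of CMD.
foo=$(pwd)
echo "$foo"
echo "We are in $(pwd)"   # expanded because of the double quotes

# Process substitution: <(CMD) runs CMD and gives the outer command
# a file-like path from which CMD's output can be read.
cat <(ls) <(ls ..)        # listing of this folder, then of the parent
```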
0:12:04.700,0:12:07.620 So we get both things concatenated. 0:12:12.880,0:12:17.040 I think so far there's been a lot of information, let's see a simple, 0:12:17.040,0:12:22.920 an example script where we see a few of these things. 0:12:23.200,0:12:27.220 So for example here we have a string and we 0:12:27.220,0:12:30.327 have this $date. So $date is a program. 0:12:30.327,0:12:34.540 Again there's a lot of programs in UNIX you will kind of slowly 0:12:34.540,0:12:36.120 familiarize with a lot of them. 0:12:36.120,0:12:42.820 Date just prints what the current date is and you can specify different formats. 0:12:43.800,0:12:48.700 Then, we have these $0 here. $0 is the name 0:12:48.700,0:12:50.540 of the script that we're running. 0:12:50.550,0:12:56.590 Then we have $#, that's the number of arguments that we are giving 0:12:56.590,0:13:01.920 to the command, and then $$ is the process ID of this command that is running. 0:13:01.920,0:13:06.160 Again, there's a lot of these dollar things, they're not intuitive 0:13:06.160,0:13:07.690 because they don't have like a mnemonic 0:13:07.690,0:13:10.450 way of remembering, maybe, $#. But 0:13:10.450,0:13:12.880 it can be... you will just be 0:13:12.880,0:13:14.660 seeing them and getting familiar with them. 0:13:14.660,0:13:19.200 Here we have this $@, and that will expand to all the arguments. 0:13:19.200,0:13:21.480 So, instead of having to assume that, 0:13:21.490,0:13:25.840 maybe say, we have three arguments and writing $1, $2, $3, 0:13:25.840,0:13:29.760 if we don't know how many arguments we can put all those arguments there. 0:13:29.760,0:13:33.670 And that has been given to a for loop. And the for loop 0:13:33.670,0:13:39.020 will, in time, get the file variable 0:13:39.020,0:13:43.880 and it will be giving each one of the arguments. 0:13:43.880,0:13:47.529 So what we're doing is, for every one of the arguments we're giving. 
0:13:47.529,0:13:51.699 Then, in the next line we're running the 0:13:51.699,0:13:56.920 grep command, which just searches for a substring in some file, and we're 0:13:56.920,0:14:01.380 searching for the string foobar in the file. 0:14:01.380,0:14:06.490 Here, we have put the variable that the file took, to expand. 0:14:06.490,0:14:11.559 And yesterday we saw that if we care about the output of a program, we can 0:14:11.560,0:14:15.680 redirect it to somewhere, to save it or to connect it to some other file. 0:14:15.680,0:14:18.939 But sometimes you want the opposite. 0:14:18.939,0:14:21.260 Sometimes, here for example, we care... 0:14:21.260,0:14:25.119 we're gonna care about the error code. About this script, we're gonna care whether the 0:14:25.120,0:14:28.440 grep ran successfully or it didn't. 0:14:28.440,0:14:33.220 So we can actually discard entirely what the output... 0:14:33.220,0:14:37.480 like both the standard output and the standard error of the grep command. 0:14:37.480,0:14:39.970 And what we're doing is we're 0:14:39.970,0:14:43.029 redirecting the output to /dev/null which 0:14:43.029,0:14:46.540 is kind of like a special device in UNIX 0:14:46.540,0:14:49.119 systems where you can like write and 0:14:49.119,0:14:51.129 it will be discarded. Like you can 0:14:51.129,0:14:52.869 write as much as you want 0:14:52.869,0:14:57.730 there, and it will be discarded. And here's the ">" symbol 0:14:57.730,0:15:02.199 that we saw yesterday for redirecting output. Here you have a "2>" 0:15:02.199,0:15:04.689 and, as some of you might have 0:15:04.689,0:15:06.519 guessed by now, this is for redirecting the 0:15:06.519,0:15:08.589 standard error, because those two 0:15:08.589,0:15:11.709 streams are separate, and you kind of have to 0:15:11.709,0:15:14.639 tell bash what to do with each one of them.
0:15:14.639,0:15:17.529 So here, we run, we check if the file has 0:15:17.529,0:15:20.649 foobar, and if the file has foobar then it's 0:15:20.649,0:15:22.959 going to have a zero code. If it 0:15:22.959,0:15:24.369 doesn't have foobar, it's gonna have a 0:15:24.369,0:15:26.980 nonzero error code. So that's exactly what we 0:15:26.980,0:15:31.120 check. In this if part of the command we 0:15:31.120,0:15:34.840 say "get me the error code". Again, this $? 0:15:34.840,0:15:37.240 And then we have a comparison operator 0:15:37.240,0:15:41.590 which is "-ne", for "not equal". And some 0:15:41.590,0:15:47.650 other programming languages will have "==", "!=", these 0:15:47.650,0:15:51.070 symbols. In bash there's 0:15:51.070,0:15:53.650 like a reserved set of comparisons and 0:15:53.650,0:15:54.970 it's mainly because there's a lot of 0:15:54.970,0:15:57.520 things you might want to test for when 0:15:57.520,0:15:59.080 you're in the shell. Here for example 0:15:59.080,0:16:03.970 we're just checking for two values, two integer values, being the same. Or for 0:16:03.970,0:16:08.380 example here, the "-f" check will let 0:16:08.380,0:16:10.420 us know if a file exists, which is 0:16:10.420,0:16:12.220 something that you will run into very, 0:16:12.220,0:16:17.530 very commonly. I'm going back to the 0:16:17.530,0:16:23.020 example. Then, what happens is, 0:16:24.400,0:16:28.600 if the file did not have foobar, like there was a 0:16:28.600,0:16:31.990 nonzero error code, then we print 0:16:31.990,0:16:33.400 "this file doesn't have any foobar, 0:16:33.400,0:16:36.400 we're going to add one". And what we do is 0:16:36.400,0:16:40.750 we echo this "# foobar", hoping this 0:16:40.750,0:16:43.200 is a comment to the file and then we're 0:16:43.200,0:16:47.620 using the operator ">>" to append at the end of 0:16:47.620,0:16:50.800 the file.
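Putting the pieces described so far together, the script might look roughly like the sketch below. It is reconstructed from the spoken description rather than copied from the screen, so the exact wording of the messages is an assumption. The block at the top, which supplies a scratch file when no arguments are given, is a hypothetical addition so that the sketch can be run on its own.

```shell
# Demo only (hypothetical): with no arguments, work on an empty scratch file.
if [ $# -eq 0 ]; then
    demo_file=$(mktemp)
    set -- "$demo_file"
fi

echo "Starting program at $(date)"   # $(date) is replaced by the current date
echo "Running program $0 with $# arguments with pid $$"

for file in "$@"; do
    # We only care about the exit code, so both standard output and
    # standard error are discarded.
    grep foobar "$file" > /dev/null 2> /dev/null
    # grep returns a nonzero code when the pattern is not found.
    if [ $? -ne 0 ]; then
        echo "File $file does not have any foobar, adding one"
        echo "# foobar" >> "$file"   # ">>" appends instead of overwriting
    fi
done
```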
Here since the file has 0:16:50.800,0:16:54.490 been fed through the script, and we don't know it beforehand, we have to substitute 0:16:54.490,0:17:03.430 the variable of the filename. We can actually run this. We already have 0:17:03.430,0:17:05.260 correct permissions in this script and 0:17:05.260,0:17:10.540 we can give a few examples. We have a few files in this folder, "mcd" is the 0:17:10.540,0:17:12.760 one we saw at the beginning for the MCD 0:17:12.760,0:17:15.040 function, some other "script" function and 0:17:15.040,0:17:21.700 we can even feed the own script to itself to check if it has foobar in it. 0:17:21.700,0:17:26.680 And we run it and first we can see that there's different 0:17:26.680,0:17:29.460 variables that we saw, that have been 0:17:29.460,0:17:33.400 successfully expanded. We have the date, that has 0:17:33.400,0:17:36.700 been replaced to the current time, then 0:17:36.700,0:17:39.100 we're running this program, with three 0:17:39.100,0:17:44.560 arguments, this randomized PID, and then 0:17:44.560,0:17:46.510 it's telling us MCD doesn't have any 0:17:46.510,0:17:48.169 foobar, so we are adding a new one, 0:17:48.169,0:17:50.450 and this script file doesn't 0:17:50.450,0:17:52.970 have one. So now for example let's look at MCD 0:17:52.970,0:17:55.820 and it has the comment that we were looking for. 0:17:59.000,0:18:05.619 One other thing to know when you're executing scripts is that 0:18:05.619,0:18:07.759 here we have like three completely 0:18:07.759,0:18:10.279 different arguments but very commonly 0:18:10.279,0:18:12.889 you will be giving arguments that 0:18:12.889,0:18:16.100 can be more succinctly given in some way. 0:18:16.100,0:18:20.179 So for example here if we wanted to 0:18:20.179,0:18:25.429 refer to all the ".sh" scripts we 0:18:25.429,0:18:31.120 could just do something like "ls *.sh" 0:18:31.120,0:18:36.120 and this is a way of filename expansion that most shells have 0:18:36.120,0:18:38.450 that's called "globbing". 
Here, as you 0:18:38.450,0:18:39.919 might expect, this is gonna say 0:18:39.919,0:18:42.559 anything that has any kind of sort of 0:18:42.559,0:18:45.940 characters and ends up with "sh". 0:18:45.940,0:18:52.159 Unsurprisingly, we get "example.sh" and "mcd.sh". We also have these 0:18:52.159,0:18:54.769 "project1" and "project2", and if there 0:18:54.769,0:19:00.100 were like a... we can do a "project42", for example 0:19:00.620,0:19:04.220 And now if we just want to refer to the projects that have 0:19:04.220,0:19:07.279 a single character, but not two characters 0:19:07.279,0:19:08.720 afterwards, like any other characters, 0:19:08.720,0:19:13.879 we can use the question mark. So "?" will expand to only a single one. 0:19:13.880,0:19:17.360 And we get, LS'ing, first 0:19:17.360,0:19:21.049 "project1" and then "project2". 0:19:21.049,0:19:27.580 In general, globbing can be very powerful. You can also combine it. 0:19:31.880,0:19:35.480 A common pattern is to use what is called curly braces. 0:19:35.480,0:19:39.320 So let's say we have an image, that we have in this folder 0:19:39.320,0:19:43.620 and we want to convert this image from PNG to JPG 0:19:43.620,0:19:46.320 or we could maybe copy it, or... 0:19:46.320,0:19:49.609 it's a really common pattern, to have two or more arguments that are 0:19:49.609,0:19:55.240 fairly similar and you want to do something with them as arguments to some command. 0:19:55.240,0:20:01.290 You could do it this way, or more succinctly, you can just do 0:20:01.290,0:20:08.880 "image.{png,jpg}" 0:20:09.410,0:20:13.590 And here, I'm getting some color feedback, but what this will do, is 0:20:13.590,0:20:17.610 it'll expand into the line above. 0:20:17.610,0:20:23.990 Actually, I can ask zsh to do that for me. And that what's happening here. 0:20:23.990,0:20:26.550 This is really powerful. So for example 0:20:26.550,0:20:29.220 you can do something like... we could do... 
0:20:29.220,0:20:34.220 "touch" on a bunch of foo's, and all of this will be expanded. 0:20:35.520,0:20:41.880 You can also do it at several levels and you will do the Cartesian... 0:20:41.880,0:20:49.980 if we have something like this, we have one group here, "{1,2}" 0:20:49.980,0:20:53.310 and then here there's "{1,2,3}", and this is going to do 0:20:53.310,0:20:54.990 the Cartesian product of these 0:20:54.990,0:20:59.920 two expansions and it will expand into all these things, 0:20:59.960,0:21:03.540 that we can quickly "touch". 0:21:03.540,0:21:10.520 You can also combine the asterisk glob with the curly braces glob. 0:21:10.520,0:21:16.840 You can even use kind of ranges. Like, we can do "mkdir" 0:21:16.840,0:21:21.420 and we create the "foo" and the "bar" directories, and then we 0:21:21.420,0:21:25.680 can do something along these lines. This 0:21:25.680,0:21:28.890 is going to expand to "fooa", "foob"... 0:21:28.890,0:21:31.430 like all these combinations, through "j", and 0:21:31.430,0:21:35.250 then the same for "bar". I haven't 0:21:35.250,0:21:38.610 really tested it... but yeah, we're getting all these combinations that we 0:21:38.610,0:21:41.850 can "touch". And now, if we touch something 0:21:41.850,0:21:47.970 that is different between these two [directories], we 0:21:47.970,0:21:55.890 can again showcase the process substitution that we saw 0:21:55.890,0:21:59.610 earlier. Say we want to check what files are different between these 0:21:59.610,0:22:03.400 two folders. For us it's obvious, we just saw it, it's X and Y, 0:22:03.400,0:22:07.410 but we can ask the shell to do this "diff" for us between the 0:22:07.410,0:22:10.200 output of one LS and the other LS. 0:22:10.200,0:22:12.810 Unsurprisingly we're getting: X is 0:22:12.810,0:22:14.700 only in the first folder and Y is 0:22:14.700,0:22:20.970 only in the second folder. What is more 0:22:20.970,0:22:26.519 is, right now, we have only seen bash scripts. 
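The globbing and curly-brace expansions shown in this part can be reproduced in a scratch directory. This sketch assumes bash or zsh (the "{a..j}" range and "<( )" are not plain POSIX sh) and uses cp in place of an image conversion so that no extra tool is needed. The file names follow the ones mentioned in the lecture.

```shell
workdir=$(mktemp -d)
cd "$workdir"

touch example.sh mcd.sh project1 project2 project42 image.png

ls *.sh              # example.sh and mcd.sh
ls project?          # project1 and project2, but not project42

cp image.{png,jpg}   # expands to: cp image.png image.jpg

# Two groups expand to every combination (the Cartesian product).
echo project{1,2}/test{1,2,3}.py

mkdir foo bar
touch {foo,bar}/{a..j}     # foo/a ... foo/j and bar/a ... bar/j
touch foo/x bar/y          # one file that differs in each folder
diff <(ls foo) <(ls bar)   # reports x on one side and y on the other
```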
If you like other 0:22:26.520,0:22:30.260 scripts, like for some tasks bash is probably not the best, 0:22:30.260,0:22:33.119 it can be tricky. You can actually write scripts that 0:22:33.119,0:22:35.700 interact with the shell implemented in a lot 0:22:35.700,0:22:39.710 of different languages. So for example, let's see here a 0:22:39.710,0:22:43.139 Python script that has a magic line at the 0:22:43.139,0:22:45.539 beginning that I'm not explaining for now. 0:22:45.540,0:22:48.330 Then we have "import sys", 0:22:48.330,0:22:53.629 it's kind of like... Python is not, by default, trying to interact 0:22:53.629,0:22:56.999 with the shell so you will have to import 0:22:56.999,0:22:58.799 some library. And then we're doing a 0:22:58.799,0:23:01.529 really silly thing of just iterating 0:23:01.529,0:23:06.440 over "sys.argv[1:]". 0:23:06.440,0:23:12.809 "sys.argv" is kind of similar to what in bash we're getting as $0, $1, &c. 0:23:12.809,0:23:16.649 Like the vector of the arguments, we're printing it in the reversed order. 0:23:16.649,0:23:21.179 And the magic line at the beginning is 0:23:21.179,0:23:23.999 called a shebang and is the way that the 0:23:23.999,0:23:26.159 shell will know how to run this program. 0:23:26.159,0:23:30.509 You can always do something like 0:23:30.509,0:23:34.379 "python script.py", and then "a b c" and that 0:23:34.379,0:23:36.659 will work, always, like that. But 0:23:36.659,0:23:39.119 what if we want to make this to be 0:23:39.119,0:23:41.309 executable from the shell? The way the 0:23:41.309,0:23:44.190 shell knows that it has to use python as the 0:23:44.190,0:23:48.450 interpreter to run this file is using 0:23:48.450,0:23:52.440 that first line. And that first line is 0:23:52.440,0:23:56.620 giving it the path to where that thing lives. 0:23:58.500,0:23:59.600 However, you might not know. 
0:23:59.609,0:24:01.830 Like, different machines will have probably 0:24:01.830,0:24:04.049 different places where they put python 0:24:04.049,0:24:06.090 and you might not want to assume where 0:24:06.090,0:24:08.789 python is installed, or any other interpreter. 0:24:08.789,0:24:16.379 So one thing that you can do is use the 0:24:16.380,0:24:17.720 "env" command. 0:24:18.280,0:24:21.560 You can also give arguments in the shebang, so 0:24:21.570,0:24:23.940 what we're doing here is specifying 0:24:23.940,0:24:29.720 run the "env" command, which, for pretty much every system, there are some exceptions, but like for 0:24:29.720,0:24:31.550 pretty much every system, is in 0:24:31.550,0:24:33.620 "/usr/bin", where a lot of binaries live, 0:24:33.620,0:24:36.200 and then we're calling it with the 0:24:36.200,0:24:38.570 argument "python". And then that will make 0:24:38.570,0:24:42.020 use of the PATH environment variable 0:24:42.020,0:24:43.580 that we saw in the first lecture. It's 0:24:43.580,0:24:45.680 gonna search in that path for the Python 0:24:45.680,0:24:48.620 binary and then it's gonna use that to 0:24:48.620,0:24:50.480 interpret this file. And that will make 0:24:50.480,0:24:52.490 this more portable so it can be run on 0:24:52.490,0:24:57.520 my machine, and your machine and some other machine. 0:25:08.020,0:25:12.140 Another thing is that bash is not 0:25:12.140,0:25:14.300 really modern, it was 0:25:14.300,0:25:16.340 developed a while ago. And sometimes 0:25:16.340,0:25:18.890 it can be tricky to debug. The 0:25:18.890,0:25:21.980 ways it will fail 0:25:21.980,0:25:24.020 are sometimes intuitive, like the way we 0:25:24.020,0:25:26.180 saw before of the foo command not 0:25:26.180,0:25:28.610 existing; sometimes they're not.
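The shebang mechanism can be sketched without leaving the shell. The lecture's example is a Python script whose first line asks env for python; the sketch below adapts the same idea to a bash script so that it only depends on bash being on the PATH. The file name reverse.sh is hypothetical.

```shell
workdir=$(mktemp -d)

# Write a small script whose first line is the shebang.
cat > "$workdir/reverse.sh" <<'EOF'
#!/usr/bin/env bash
# Print the arguments in reverse order, one per line.
for (( i = $#; i > 0; i-- )); do
    echo "${!i}"
done
EOF

chmod +x "$workdir/reverse.sh"   # make it executable
"$workdir/reverse.sh" a b c      # the kernel reads the shebang and runs env,
                                 # which finds bash by searching the PATH
```

Running the file directly works because of the first line; without it, the interpreter would have to be named explicitly, as in "bash reverse.sh a b c".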
So there's 0:25:28.610,0:25:31.280 like a really nifty tool that we have 0:25:31.280,0:25:34.310 linked in the lecture notes, which is called 0:25:34.310,0:25:37.580 "shellcheck", that will kind of give you 0:25:37.580,0:25:40.010 both warnings and syntactic errors 0:25:40.010,0:25:43.250 and other things that you might not have quoted properly, 0:25:43.250,0:25:46.040 or you might have misplaced spaces in 0:25:46.040,0:25:50.060 your files. So for example for the extremely simple "mcd.sh" 0:25:50.060,0:25:51.980 file we're getting a couple 0:25:51.980,0:25:54.800 of errors saying hey, surprisingly, 0:25:54.800,0:25:56.090 we're missing a shebang, like this 0:25:56.090,0:25:59.060 might not be interpreted correctly if you're running 0:25:59.060,0:26:02.000 it on a different system. Also, this 0:26:02.000,0:26:05.620 CD is taking an argument and it might not 0:26:05.620,0:26:08.960 expand properly, so instead of using CD 0:26:08.960,0:26:11.300 you might want to use something like CD 0:26:11.300,0:26:14.540 and then an OR and then an "exit". We go 0:26:14.540,0:26:16.490 back to what we explained earlier, what 0:26:16.490,0:26:18.920 this will do is like if the 0:26:18.920,0:26:21.860 CD doesn't end correctly, you cannot CD 0:26:21.860,0:26:23.720 into the folder because either you 0:26:23.720,0:26:25.250 don't have permissions, it doesn't exist... 0:26:25.250,0:26:28.780 That will give a nonzero error 0:26:28.780,0:26:32.420 code, so you will execute exit 0:26:32.420,0:26:33.920 and that will stop the script 0:26:33.920,0:26:35.810 instead of continuing to execute as if 0:26:35.810,0:26:37.240 you were in a place that you are 0:26:37.240,0:26:42.900 actually not in.
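The "CD, then an OR, then an exit" pattern can be sketched as follows. The directory name is a made-up path assumed not to exist, and the parentheses run the commands in a subshell so that exit ends only the demonstration and not the shell it is typed into.

```shell
(
    # If cd fails, the OR runs exit and nothing below is executed.
    cd /nonexistent/directory 2> /dev/null || exit 1
    echo "This line is never reached"
)
status=$?
echo "subshell exit code: $status"   # 1, because cd failed
```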
And actually I haven't tested, but I 0:26:42.920,0:26:47.179 think we can check for "example.sh" 0:26:47.179,0:26:50.809 and here we're getting that we should be 0:26:50.809,0:26:55.070 checking the exit code in a different way, because it's 0:26:55.070,0:26:57.710 probably not the best way, doing it this 0:26:57.710,0:27:01.580 way. One last remark I want to make 0:27:01.580,0:27:05.090 is that when you're writing bash scripts 0:27:05.090,0:27:07.159 or functions for that matter, 0:27:07.159,0:27:09.080 there's kind of a difference between 0:27:09.080,0:27:12.590 writing bash scripts in isolation like a 0:27:12.590,0:27:14.149 thing that you're gonna run, and a thing 0:27:14.149,0:27:16.100 that you're gonna load into your shell. 0:27:16.100,0:27:19.850 We will see some of this in the command 0:27:19.850,0:27:23.090 line environment lecture, where we will kind of 0:27:23.090,0:27:29.059 be tooling with the bashrc and the sshrc. But in general, if you make 0:27:29.059,0:27:31.370 changes to for example where you are, 0:27:31.370,0:27:34.009 like if you CD inside a bash script and you 0:27:34.009,0:27:36.919 just execute that bash script, it won't CD 0:27:36.919,0:27:39.980 in the shell you are in right now. But if you 0:27:39.980,0:27:42.980 have loaded the code directly into 0:27:42.980,0:27:45.559 your shell, for example you load... 0:27:45.559,0:27:48.440 you source the function and then you execute 0:27:48.440,0:27:50.269 the function then you will get those 0:27:50.269,0:27:52.000 side effects. And the same goes for 0:27:52.000,0:27:57.220 defining variables in the shell. 0:27:57.220,0:28:03.950 Now I'm going to talk about some tools that I think are nifty when 0:28:03.950,0:28:07.580 working with the shell. The first was 0:28:07.580,0:28:09.799 also briefly introduced yesterday. 0:28:09.799,0:28:13.309 How do you know what flags, or like 0:28:13.309,0:28:15.320 what exact commands there are.
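The difference between executing a script and loading it into the shell can be seen with a one-line script. The file name go.sh and the variable names are hypothetical, chosen only for this sketch.

```shell
workdir=$(mktemp -d)
cd "$workdir"
echo 'cd /' > go.sh    # a script whose only effect is to change directory

bash go.sh             # executed: runs in a separate process
after_run=$PWD         # unchanged, still the scratch directory

source ./go.sh         # sourced: runs inside the current shell
after_source=$PWD      # now /

echo "after running:  $after_run"
echo "after sourcing: $after_source"
```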
Like how am I 0:28:15.320,0:28:21.889 supposed to know that "ls -l" will list the files in a list format, or that 0:28:21.889,0:28:25.789 if I do "mv -i", it's gonna like prompt me 0:28:25.789,0:28:28.639 for stuff. For that what you have is the "man" 0:28:28.639,0:28:30.730 command. And the man command will kind of 0:28:30.730,0:28:33.590 have like a lot of information of how 0:28:33.590,0:28:35.809 will you go about... so for example here it 0:28:35.809,0:28:40.340 will explain for the "-i" flag, there are 0:28:40.340,0:28:43.970 all these options you can do. That's 0:28:43.970,0:28:45.620 actually pretty useful and it will work 0:28:45.620,0:28:51.540 not only for really simple commands that come packaged with your OS 0:28:51.540,0:28:55.809 but will also work with some tools that you install from the internet 0:28:55.809,0:28:58.240 for example, if the person that did the 0:28:58.240,0:29:01.390 installation made it so that the man 0:29:01.390,0:29:03.399 pages were also installed. So for example 0:29:03.399,0:29:06.490 a tool that we're gonna cover in a bit 0:29:06.490,0:29:12.370 which is called "ripgrep" and is called with "rg", this didn't 0:29:12.370,0:29:14.980 come with my system but it has installed 0:29:14.980,0:29:17.230 its own man page and I have it here and 0:29:17.230,0:29:21.700 I can access it. For some commands the 0:29:21.700,0:29:25.029 man page is useful but sometimes it can be 0:29:25.029,0:29:28.270 tricky to decipher because it's more 0:29:28.270,0:29:30.399 kind of a documentation and a 0:29:30.399,0:29:32.679 description of all the things the tool 0:29:32.679,0:29:35.860 can do.
Sometimes it will have 0:29:35.860,0:29:37.720 examples but sometimes not, and sometimes 0:29:37.720,0:29:41.620 the tool can do a lot of things so a 0:29:41.620,0:29:45.250 couple of good tools that I use commonly 0:29:45.250,0:29:50.289 are "convert" or "ffmpeg", which deal with images and video respectively and 0:29:50.289,0:29:52.419 the man pages are like enormous. So there's 0:29:52.419,0:29:54.850 one neat tool called "tldr" that 0:29:54.850,0:29:58.240 you can install and you will have like 0:29:58.240,0:30:02.710 some nice kind of explanatory examples 0:30:02.710,0:30:05.470 of how you want to use this command. And you 0:30:05.470,0:30:07.840 can always Google for this, but I find 0:30:07.840,0:30:10.120 myself saving going into the 0:30:10.120,0:30:12.640 browser, looking about some examples and 0:30:12.640,0:30:14.919 coming back, whereas "tldr" are 0:30:14.919,0:30:16.870 community contributed and 0:30:16.870,0:30:19.210 they're fairly useful. Then, 0:30:19.210,0:30:23.020 the one for "ffmpeg" has a lot of 0:30:23.020,0:30:24.940 useful examples that are more nicely 0:30:24.940,0:30:26.799 formatted (if you don't have a huge 0:30:26.799,0:30:30.820 font size for recording). Or even 0:30:30.820,0:30:33.250 simple commands like "tar", that have a lot 0:30:33.250,0:30:35.470 of options that you are combining. So for 0:30:35.470,0:30:37.840 example, here you can be combining 2, 3... 0:30:37.840,0:30:41.710 different flags and it can not be 0:30:41.710,0:30:43.419 obvious, when you want to combine 0:30:43.419,0:30:48.429 different ones. That's how you 0:30:48.429,0:30:54.850 would go about finding more about these tools. On the topic of finding, let's try 0:30:54.850,0:30:58.690 learning how to find files. You can 0:30:58.690,0:31:03.100 always go "ls", and like you can go like 0:31:03.100,0:31:05.950 "ls project1", and 0:31:05.950,0:31:08.559 keep LS'ing all the way through. 
But 0:31:08.559,0:31:11.740 maybe, if we already know that we want 0:31:11.740,0:31:15.450 to look for all the folders called 0:31:15.450,0:31:19.000 "src", then there's probably a better command 0:31:19.000,0:31:21.400 for doing that. And that's "find". 0:31:21.460,0:31:26.679 Find is the tool that, pretty much comes with every UNIX system. And find, 0:31:26.679,0:31:35.230 we're gonna give it... here we're saying we want to call find in the 0:31:35.230,0:31:37.510 current folder, remember that "." stands 0:31:37.510,0:31:40.149 for the current folder, and we want the 0:31:40.149,0:31:46.539 name to be "src" and we want the type to be a directory. And by typing that it's 0:31:46.539,0:31:49.870 gonna recursively go through the current 0:31:49.870,0:31:52.330 directory and look for all these files, 0:31:52.330,0:31:58.659 or folders in this case, that match this pattern. Find has a lot of useful 0:31:58.659,0:32:01.840 flags. So for example, you can even test 0:32:01.840,0:32:05.440 for the path to be in a way. Here we're 0:32:05.440,0:32:08.230 saying we want some number of folders, 0:32:08.230,0:32:09.909 we don't really care how many folders, 0:32:09.909,0:32:13.179 and then we care about all the Python 0:32:13.179,0:32:17.830 scripts, all the things with the extension ".py", that are within a 0:32:17.830,0:32:19.899 test folder. And we're also checking, just in 0:32:19.899,0:32:21.519 cases really but we're checking just 0:32:21.519,0:32:24.460 that it's also a type F, which stands for 0:32:24.460,0:32:28.710 file. We're getting all these files. 0:32:28.710,0:32:32.169 You can also use different flags for things 0:32:32.169,0:32:34.000 that are not the path or the name. 0:32:34.000,0:32:38.160 You could check things that have been 0:32:38.160,0:32:42.060 modified ("-mtime" is for the modification time), things that have been 0:32:42.070,0:32:44.540 modified in the last day, which is gonna 0:32:44.559,0:32:46.659 be pretty much everything. 
So this is gonna print 0:32:46.659,0:32:49.029 a lot of the files we created and files 0:32:49.029,0:32:51.850 that were already there. You can even 0:32:51.850,0:32:54.960 use other things like size, the owner, 0:32:54.960,0:32:59.080 permissions, you name it. What is even more 0:32:59.080,0:33:01.870 powerful is, "find" can find stuff 0:33:01.870,0:33:04.269 but it also can do stuff when you 0:33:04.269,0:33:10.690 find those files. So we could look for all 0:33:10.690,0:33:14.080 the files that have a TMP 0:33:14.080,0:33:18.160 extension, which is a temporary extension, and 0:33:18.160,0:33:22.720 then, we can tell "find" that for every one of those files, 0:33:22.720,0:33:26.350 just execute the "rm" command for them. And 0:33:26.350,0:33:29.050 that will just be calling "rm" with all 0:33:29.050,0:33:32.350 these files. So let's first execute it 0:33:32.350,0:33:35.760 without, and then we execute it with it. 0:33:35.760,0:33:38.950 Again, as with the command line 0:33:38.950,0:33:41.470 philosophy, it looks like nothing 0:33:41.470,0:33:48.070 happened. But since we have a zero error code, something 0:33:48.070,0:33:49.540 happened - just that everything went 0:33:49.540,0:33:51.490 correct and everything is fine. And now, 0:33:51.490,0:33:57.810 if we look for these files, they aren't there anymore. 0:33:57.810,0:34:02.950 Another nice thing about the shell in general is that there are 0:34:02.950,0:34:05.890 these tools, but people will keep 0:34:05.890,0:34:08.230 finding new ways, so alternative 0:34:08.230,0:34:12.220 ways of writing these tools. It's nice to know about it. So, for 0:34:12.220,0:34:20.020 example find if you just want to match the things that end in "tmp" 0:34:20.020,0:34:24.190 it can be sometimes weird to do this thing, it has a long command. 
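The find invocations described in this part can be tried on a small made-up tree. The directory layout and file names below are assumptions created just for the sketch.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p project1/src/test project2/src/test
touch project1/src/test/test1.py project2/src/test/test2.py
touch notes.tmp project1/old.tmp

find . -name src -type d             # all directories named src
find . -path '*/test/*.py' -type f   # .py files inside any test folder
find . -mtime -1                     # anything modified in the last day

find . -name '*.tmp'                 # first without -exec: just list them
find . -name '*.tmp' -exec rm {} \;  # then run rm on every match
find . -name '*.tmp'                 # prints nothing, the files are gone
```

The patterns are quoted so that the shell passes them to find unchanged instead of expanding them itself.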
0:34:24.190,0:34:27.760 There's things like "fd", 0:34:27.760,0:34:32.320 for example, that is a shorter command that by default will use regex 0:34:32.320,0:34:34.899 and will ignore your gitfiles, so you 0:34:34.899,0:34:38.020 don't even search for them. It 0:34:38.020,0:34:42.879 will color-code, it will have better Unicode support... It's nice to 0:34:42.879,0:34:45.040 know about some of these tools. But, again, 0:34:45.040,0:34:52.149 the main idea is that if you are aware that these tools exist, you can 0:34:52.149,0:34:53.740 save yourself a lot of time from doing 0:34:53.740,0:34:57.660 kind of menial and repetitive tasks. 0:34:57.660,0:35:00.010 Another command to bear in mind is like 0:35:00.010,0:35:01.990 "find". Some of you may be 0:35:01.990,0:35:04.300 wondering, "find" is probably just 0:35:04.300,0:35:06.520 actually going through a directory 0:35:06.520,0:35:09.580 structure and looking for things but 0:35:09.580,0:35:11.260 what if I'm doing a lot of "finds" a day? 0:35:11.260,0:35:12.850 Wouldn't it be better, doing kind of 0:35:12.850,0:35:18.790 a database approach and build an index first, and then use that index 0:35:18.790,0:35:21.520 and update it in some way. Well, actually 0:35:21.520,0:35:23.380 most Unix systems already do it and 0:35:23.380,0:35:28.170 this is through the "locate" command and 0:35:28.170,0:35:31.690 the way that the locate will 0:35:31.690,0:35:35.470 be used... it will just look for paths in 0:35:35.470,0:35:38.680 your file system that have the substring 0:35:38.680,0:35:44.710 that you want. I actually don't know if it will work... Okay, it worked. Let me try to 0:35:44.710,0:35:49.840 do something like "missing-semester". 0:35:51.840,0:35:53.950 You're gonna take a while but 0:35:53.950,0:35:56.109 it found all these files that are somewhere 0:35:56.109,0:35:57.730 in my file system and since it has 0:35:57.730,0:36:01.750 built an index already on them, it's much 0:36:01.750,0:36:05.680 faster. 
And then, to keep it updated, 0:36:05.680,0:36:11.980 using the "updatedb" command that is running through cron, 0:36:13.840,0:36:18.490 to update this database. Finding files, again, is 0:36:18.490,0:36:23.230 really useful. Sometimes you're actually concerned about, not the files themselves, 0:36:23.230,0:36:26.740 but the content of the files. For that 0:36:26.740,0:36:31.420 you can use the grep command that we 0:36:31.420,0:36:33.880 have seen so far. So you could do 0:36:33.880,0:36:37.740 something like grep foobar in MCD, it's there. 0:36:37.740,0:36:43.690 What if you want to, again, recursively search through the current 0:36:43.690,0:36:45.760 structure and look for more files, right? 0:36:45.760,0:36:48.700 We don't want to do this manually. 0:36:48.700,0:36:51.220 We could use "find", and the "-exec", but 0:36:51.220,0:36:58.920 actually "grep" has the "-R" flag that will go through the entire 0:36:58.920,0:37:03.609 directory, here. And it's telling us 0:37:03.609,0:37:06.579 that oh we have the foobar line in example.sh 0:37:06.579,0:37:09.279 at these three places and in 0:37:09.279,0:37:14.589 this other two places in foobar. This can be 0:37:14.589,0:37:16.900 really convenient. Mainly, the 0:37:16.900,0:37:18.940 use case for this is you know you have 0:37:18.940,0:37:21.910 written some code in some programming 0:37:21.910,0:37:23.859 language, and you know it's somewhere in 0:37:23.859,0:37:26.200 your file system but you actually don't 0:37:26.200,0:37:28.599 know. But you can actually quickly search. 0:37:28.600,0:37:32.980 So for example, I can quickly search 0:37:35.660,0:37:40.320 for all the Python files that I have in my 0:37:40.329,0:37:45.460 scratch folder where I used the request library. 0:37:45.460,0:37:47.589 And if I run this, it's giving me 0:37:47.589,0:37:50.890 through all these files, exactly in 0:37:50.890,0:37:53.650 what line it has been found. 
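The recursive search just described can be sketched as below. The files and their contents are made up for this example; "-R" makes grep walk the whole directory, and "-n" (an addition for this sketch) prints the line number of each match.

```shell
# Scratch directory with a few files, two of which contain "foobar"
# (all names and contents here are hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/scripts"
printf 'echo start\necho foobar\n' > "$demo/example.sh"
printf 'nothing here\n' > "$demo/scripts/other.sh"
printf 'foobar again\n' > "$demo/scripts/notes.txt"

# -R recurses into subdirectories; -n prefixes each match with its line number.
matches=$(grep -R -n foobar "$demo" | sort)
echo "$matches"
```

Each output line has the form path:line-number:matching-line, which is what lets you jump straight to the place where the string occurs.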
And here 0:37:53.650,0:37:56.260 instead of using grep, which is fine, 0:37:56.260,0:37:58.930 you could also do this, I'm using "ripgrep", 0:37:58.930,0:38:05.260 which is kind of the same idea but again trying to bring some more 0:38:05.260,0:38:09.730 niceties like color coding or file 0:38:09.730,0:38:16.480 processing and other things. I think it also has Unicode support. It's also pretty 0:38:16.480,0:38:22.829 fast so you are not paying like a trade-off on this being slower and 0:38:22.829,0:38:25.420 there's a lot of useful flags. You 0:38:25.420,0:38:27.670 can say, oh, I actually want to get some 0:38:27.670,0:38:30.460 context around those results. 0:38:33.040,0:38:36.400 So I want to get like five lines of context around 0:38:36.400,0:38:42.819 that, so you can see where that import lives and see code around it. 0:38:42.819,0:38:44.170 Here in the import it's not really useful 0:38:44.170,0:38:45.819 but like if you're looking for where you 0:38:45.819,0:38:49.720 use the function, for example, it will 0:38:49.720,0:38:54.010 be very handy. We can also do things like 0:38:54.010,0:38:59.170 we can search, for example here. 0:38:59.170,0:39:04.839 A more advanced use, we can say, 0:39:04.840,0:39:11.580 "-u" is for don't ignore hidden files, sometimes 0:39:12.520,0:39:16.359 you want to be ignoring hidden files, except if you want to 0:39:16.359,0:39:23.500 search config files, that are by default hidden. Then, instead of printing 0:39:23.500,0:39:28.400 the matches, we're asking to do something that would be kind of hard, I think, 0:39:28.400,0:39:31.380 to do with grep, off the top of my head, which is 0:39:31.390,0:39:34.569 "I want you to print all the files that 0:39:34.569,0:39:37.750 don't match the pattern I'm giving you", which 0:39:37.750,0:39:40.030 may be a weird thing to ask here but 0:39:40.030,0:39:42.940 then we keep going...
And this pattern here 0:39:42.940,0:39:45.790 is a small regex which is saying 0:39:45.790,0:39:48.099 at the beginning of the line I have a 0:39:48.099,0:39:51.190 "#" and a "!", and that's a shebang. 0:39:51.190,0:39:53.470 Like that, we're searching here for all 0:39:53.470,0:39:56.650 the files that don't have a shebang 0:39:56.650,0:39:59.369 and then we're giving it, here, 0:39:59.369,0:40:02.470 a "-t sh" to only look for "sh" 0:40:02.470,0:40:07.660 files, because maybe all your Python or text files are fine 0:40:07.660,0:40:10.000 without a shebang. And here it's telling us 0:40:10.000,0:40:13.020 "oh, MCD is obviously missing a shebang". 0:40:14.760,0:40:16.660 We can even... It has like some 0:40:16.660,0:40:19.119 nice flags, so for example if we 0:40:19.120,0:40:21.360 include the "stats" flag 0:40:28.700,0:40:34.119 it will get all these results but it will also tell us information about all 0:40:34.119,0:40:35.410 the things that it searched. For example, 0:40:35.410,0:40:40.390 the number of matches that it found, the lines, the files searched, 0:40:40.390,0:40:44.040 the bytes that it printed, etc. 0:40:44.040,0:40:47.160 As with "fd", sometimes it's not as important 0:40:48.400,0:40:50.619 to use one specific tool or another and 0:40:50.620,0:40:55.780 in fact, like ripgrep, there are several other tools. "ack" 0:40:55.780,0:40:57.700 is the original grep alternative that was 0:40:57.700,0:41:00.670 written. Then the silver searcher, 0:41:00.670,0:41:04.089 "ag", was another one... and they're all 0:41:04.089,0:41:05.589 pretty much interchangeable so 0:41:05.589,0:41:07.630 maybe you're at a system that has one and 0:41:07.630,0:41:09.670 not the other, just knowing that you can 0:41:09.670,0:41:12.040 use these things with these tools can be 0:41:12.040,0:41:15.549 fairly useful.
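On a system where ripgrep is not installed, a rough approximation of the "shell files without a shebang" search can be sketched with plain grep. This is not the command from the lecture: it assumes a grep that supports "-L" (list files with no match) and "--include" (as GNU grep does), and the file names are hypothetical.

```shell
# Scratch directory: one script with a shebang, one without, and a Python
# file that should not be considered at all (all names are hypothetical).
demo=$(mktemp -d)
printf '#!/bin/sh\necho hi\n' > "$demo/good.sh"
printf 'echo no shebang\n' > "$demo/mcd.sh"
printf 'print("python")\n' > "$demo/script.py"

# -R: recurse; --include='*.sh': only look at shell scripts;
# -L: print the names of files in which NO line matches the pattern.
# The pattern '^#!' matches any line that starts with "#!".
missing=$(grep -R -L --include='*.sh' '^#!' "$demo")
echo "$missing"
```

Note that this matches "#!" at the start of any line rather than only the first one, so it is a looser check than inspecting the first line of each file.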
Lastly, I want to cover 0:41:15.549,0:41:19.780 how you go about, not finding files or code, but how you go about 0:41:19.780,0:41:22.540 finding commands that you already 0:41:22.540,0:41:30.160 figured out at some point. The first, obvious way is just using the up arrow, 0:41:30.160,0:41:34.540 and slowly going through all your history, looking for these matches. 0:41:34.540,0:41:36.490 This is actually not very efficient, as 0:41:36.490,0:41:42.579 you probably guessed. So bash has ways to do this more easily. 0:41:42.579,0:41:44.619 There is the "history" command, that will 0:41:44.619,0:41:49.180 print your history. Here I'm in zsh and it only prints some of my history, but 0:41:49.180,0:41:54.069 if I say, I want you to print everything from the beginning of time, it will print 0:41:54.069,0:41:58.220 everything from the beginning of whatever this history is. 0:41:58.220,0:42:00.700 And since this is a lot of results, 0:42:00.700,0:42:02.589 maybe we care about the ones where we 0:42:02.589,0:42:08.490 use the "convert" command to go from some type of file to some other type of file. 0:42:08.490,0:42:12.940 Some image, sorry. Then, we're getting all 0:42:12.940,0:42:15.849 these results here, about all the ones 0:42:15.849,0:42:18.120 that match this substring. 0:42:21.280,0:42:24.609 Even more, pretty much all shells by default will 0:42:24.609,0:42:27.130 bind "Ctrl+R", the keybinding, 0:42:27.130,0:42:29.680 to do backward search. Here we 0:42:29.680,0:42:31.569 have backward search, where we can 0:42:31.569,0:42:34.750 type "convert" and it's finding the 0:42:34.750,0:42:36.609 command that we just typed. And if we just 0:42:36.609,0:42:38.619 keep hitting "Ctrl+R", it will 0:42:38.619,0:42:41.740 kind of go through these matches and 0:42:41.740,0:42:44.260 it will let you re-execute it 0:42:44.260,0:42:49.240 in place.
Another thing that you can do, 0:42:49.240,0:42:51.069 related to that, is you can use this 0:42:51.069,0:42:53.829 really nifty tool called "fzf", which is 0:42:53.829,0:42:56.280 like a fuzzy finder, like it will... 0:42:57.100,0:42:58.480 It will let you do kind of 0:42:58.480,0:43:02.200 like an interactive grep. We could do 0:43:02.200,0:43:06.369 for example this, where we can cat our 0:43:06.369,0:43:10.030 example.sh file, that will print 0:43:10.030,0:43:11.680 to the standard output, and then we 0:43:11.680,0:43:14.290 can pipe it through fzf. It's just getting 0:43:14.290,0:43:18.490 all the lines and then we can interactively look for the 0:43:18.490,0:43:21.849 string that we care about. And the nice 0:43:21.849,0:43:26.349 thing about fzf is that, if you enable the default bindings, it will bind to 0:43:26.349,0:43:33.670 your "Ctrl+R" shell execution and now 0:43:33.670,0:43:36.490 you can quickly and dynamically like 0:43:36.490,0:43:41.700 look for all the times you tried to convert a favicon in your history. 0:43:42.020,0:43:46.375 And it's also like fuzzy matching, whereas like by default in grep 0:43:46.375,0:43:49.420 or these things you have to write a regex or some 0:43:49.420,0:43:52.360 expression that will match within here. 0:43:52.360,0:43:54.609 Here I'm just typing "convert" and "favicon" and 0:43:54.609,0:43:57.369 it's just trying to do the best it can, 0:43:57.369,0:44:01.349 doing the match in the lines it has.
0:44:01.349,0:44:06.190 Lastly, a tool that probably you have already seen, that I've been using 0:44:06.190,0:44:08.410 for not retyping these extremely long 0:44:08.410,0:44:13.080 commands is this "history substring search", where 0:44:13.940,0:44:15.660 as I type in my shell, 0:44:15.670,0:44:19.630 (I failed to mention it, but both fish, 0:44:19.630,0:44:22.760 which I think originally introduced this concept, and 0:44:22.760,0:44:25.760 zsh have really nice implementations) 0:44:25.760,0:44:26.800 what it'll let you do is 0:44:26.800,0:44:31.300 as you type the command, it will dynamically search back in your 0:44:31.300,0:44:34.420 history to the same command that has a common prefix, 0:44:34.980,0:44:36.900 and then, if you... 0:44:39.100,0:44:42.100 it will change as the match list stops 0:44:42.100,0:44:44.110 working and then as you do the 0:44:44.120,0:44:49.760 right arrow you can select that command and then re-execute it. 0:45:05.800,0:45:09.920 We've seen a bunch of stuff... I think I have 0:45:09.940,0:45:16.180 a few minutes left so I'm going to cover a couple of tools to do 0:45:16.180,0:45:20.060 really quick directory listing and directory navigation. 0:45:20.060,0:45:30.020 So you can always use "ls -R" to recursively list some directory structure, 0:45:30.020,0:45:35.160 but that can be suboptimal, I cannot really make sense of this easily. 0:45:36.340,0:45:44.460 There's a tool called "tree" that will be a much more friendly form of 0:45:44.460,0:45:47.500 printing all the stuff, it will also color code based on... 0:45:47.500,0:45:50.680 here for example "foo" is blue because it's a directory and 0:45:50.680,0:45:55.100 this is red because it has execute permissions. 0:45:55.100,0:46:00.220 But we can go even further than that.
There's really nice tools 0:46:00.220,0:46:04.580 like a recent one called "broot" that will do the same thing but here 0:46:04.580,0:46:07.300 for example instead of doing this thing of listing 0:46:07.300,0:46:09.160 every single file, for example in bar 0:46:09.160,0:46:11.400 we have these "a" through "j" files, 0:46:11.400,0:46:14.260 it will say "oh there are more, unlisted here". 0:46:15.080,0:46:18.200 I can actually start typing and it will again 0:46:18.200,0:46:21.540 fuzzily match to the files that are there 0:46:21.540,0:46:24.800 and I can quickly select them and navigate through them. 0:46:24.800,0:46:28.380 So, again, it's good to know that 0:46:28.380,0:46:33.340 these things exist so you don't lose a large amount of time 0:46:34.240,0:46:36.180 looking for these files. 0:46:37.880,0:46:40.500 There is also, I think I have it installed, 0:46:40.500,0:46:44.829 something more similar to what you would expect your OS to have, 0:46:44.829,0:46:49.960 like Nautilus or one of the Mac finders that have like an 0:46:49.960,0:46:59.260 interactive input where you can just use your navigation arrows and quickly explore. 0:46:59.260,0:47:03.849 It might be overkill but you'll be surprised how quickly you can 0:47:03.849,0:47:07.839 make sense of some directory structure by just navigating through it. 0:47:07.840,0:47:12.780 And pretty much all of these tools will let you edit, copy files... 0:47:12.780,0:47:16.880 if you just look for the options for them. 0:47:17.600,0:47:20.100 The last addendum is kind of going places. 0:47:20.100,0:47:24.480 We have "cd", and "cd" is nice, it will get you 0:47:26.120,0:47:30.060 to a lot of places. But it's pretty handy if 0:47:30.069,0:47:33.190 you can like quickly go places, 0:47:33.190,0:47:36.730 either places you have been to recently or that 0:47:36.730,0:47:40.599 you go to frequently. And you can do this in 0:47:40.599,0:47:42.520 many ways, there's probably...
you can start 0:47:42.520,0:47:44.319 thinking, oh I can make bookmarks, I can 0:47:44.319,0:47:46.660 make... I can make aliases in the shell, 0:47:46.660,0:47:49.020 that we will cover at some point, 0:47:49.020,0:47:53.020 symlinks... But at this point, 0:47:53.020,0:47:54.910 programmers have like built all these 0:47:54.910,0:47:56.799 tools, so programmers have already figured 0:47:56.799,0:47:59.520 out a really nice way of doing this. 0:47:59.520,0:48:01.930 One way of doing this is using what is 0:48:01.930,0:48:05.760 called "autojump", which I think is not loaded here... 0:48:14.140,0:48:20.100 Okay, don't worry. I will cover it in the command line environment lecture. 0:48:21.960,0:48:25.579 I think it's because I disabled the "Ctrl+R" and that also 0:48:25.579,0:48:31.309 affected other parts of the script. I think at this point if anyone has 0:48:31.309,0:48:35.480 any questions that are related to this, I'll be more than happy to answer 0:48:35.480,0:48:37.509 them, if anything was left unclear. 0:48:37.509,0:48:42.859 Otherwise, there's a bunch of exercises that we wrote, kind of 0:48:42.859,0:48:46.549 touching on these topics and we encourage you to try them and 0:48:46.549,0:48:48.559 come to office hours, where we can help 0:48:48.559,0:48:54.569 you figure out how to do them, or some bash quirks that are not clear.