I have been deep into Latex for the last little while, pounding out my thesis. It is a long process that involves a lot of editing, rearranging, and tweaking. Every time I revise another paragraph, I give thanks that I am using fast and powerful BBEdit, and not something slow and cumbersome like Word. But something has been missing.
A couple of weeks ago, I read a post by Brandon Rhodes titled “One sentence per line”. He quotes the following advice from “UNIX for beginners” [PDF] by Brian Kernighan. Kernighan is the K in AWK, so pay attention.
Start each sentence on a new line. Make lines short, and break lines at natural places, such as after commas and semicolons, rather than randomly. Since most people change documents by rewriting phrases and adding, deleting and rearranging sentences, these precautions simplify any editing you have to do later.
One common feature of Latex, HTML, Markdown, and many other markup languages is that they ignore single line feeds in the document source. Only my collaborators and I will be reading the source, so I can format it however I want. I might as well make life easy on myself.
Return policies
If line feeds do not affect the final document, when should I end a line?
On the one hand, I could press return only when necessary, for example, to end a paragraph. My document will contain long lines, but my text editor can wrap them for me. On the downside, the long lines could cause some trouble when I work from the command line, which generally won’t wrap lines intelligently. Also, programs that compare two versions of a file will usually be less helpful, since they report differences on a line-by-line basis. (Using a word-by-word diff program will improve this, but the results are still inferior to a line-based diff of a file with short lines.)
On the other hand, I could wrap text using hard returns. Most editors will do this for me, so I’m not actually pressing return at the end of each line. This will result in shorter lines, which will make the text easier to read from the command line. In Latex I will get more precise error location, since Latex reports errors by the line on which they occur.
One disadvantage of hard wrapping is that it makes editing awkward. If I add one word to one line, I will end up pushing a single word down to the next line. Then my obsessiveness requires me to “reflow” the paragraph, removing line breaks and inserting new ones so that the lines are the same length. Even though most text editors automate the reflow process, making changes becomes quite tedious—edit, reflow, edit, reflow, edit, reflow, etc. Furthermore, each edit results in many lines changing, so comparing two documents is not much easier than when I soft wrap. Even worse, reflowing paragraphs often removes line breaks that I put in on purpose.
For example, if I make a change to the beginning of a paragraph that ends with the following text:
which can be found using the formula
\begin{equation}
a^2 + b^2 = c^2 .
\end{equation}
This is called the Pythagorean theorem.
Reflowing is often overaggressive, resulting in
which can be found using the formula \begin{equation} a^2 +
b^2 = c^2 . \end{equation} This is called the Pythagorean
theorem.
The reflowed text is still correct, and will produce the same output, but is much harder to read. This kind of thing is especially a problem in a Latex math document, where I am often switching back and forth between prose and structured mathematical formulas.
A better way
Kernighan’s suggestion to end lines at natural breaks solves many of these issues. Documents will have short lines, which is helpful for when I need to work in a terminal or send parts of a document by email. Edits will generally change a small number of lines, which keeps line-based diff tools useful.
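To make the diff claim concrete, here is a small sketch in Python. It is only an illustration: the 40-column width, the sample edit, and the helper names are my own assumptions, and `difflib` stands in for whatever diff tool you actually use. It applies the same small edit to a sentence-per-line source and to a hard-wrapped source, then counts how many lines a line-based diff flags.

```python
import difflib
import textwrap

SENTENCES = [
    "Start each sentence on a new line.",
    "Make lines short, and break lines at natural places, such as after "
    "commas and semicolons, rather than randomly.",
    "Since most people change documents by rewriting phrases and adding, "
    "deleting and rearranging sentences, these precautions simplify any "
    "editing you have to do later.",
]
# The same text after a small edit to the first sentence only.
EDITED = ["Start each and every single sentence on a brand new line."] + SENTENCES[1:]


def changed_lines(before, after):
    """Count the lines a line-based diff marks as removed or added."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return sum(1 for line in diff if line.startswith(("- ", "+ ")))


def one_sentence_per_line(sentences):
    return "\n".join(sentences)


def hard_wrapped(sentences, width=40):
    # Reflow the whole paragraph, as an editor's "hard wrap" command would.
    return textwrap.fill(" ".join(sentences), width=width)


per_sentence = changed_lines(one_sentence_per_line(SENTENCES),
                             one_sentence_per_line(EDITED))
wrapped = changed_lines(hard_wrapped(SENTENCES), hard_wrapped(EDITED))
print("one sentence per line:", per_sentence)
print("hard wrapped at 40 columns:", wrapped)
```

With one sentence per line, the diff touches only the edited sentence (one line removed, one added). In the hard-wrapped version the edit pushes words across line boundaries, so the diff flags several lines that did not change in any meaningful way.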
I get the added benefit of being able to rearrange sentences and phrases much more easily. For complicated technical statements, I can use newlines and even indentation to make the source easier to parse.
The disadvantage is that I now have to push the return key at the end of every line. For a simple document that won’t go through many revisions, this may not be worth it. But Kernighan also points out that most documents require more revisions than we initially expect. For a dissertation, and probably any academic paper, it is worth the extra effort.
Old habits die hard
Even after a few days of ending my lines manually at natural breaks, I often find myself getting close to the “right margin”. Usually, there is a perfectly placed comma a few words back where I should have ended the line.
I made an AppleScript for BBEdit called “New line after punctuation” that looks for the last punctuation mark on the line and inserts a hard return immediately after it. I have it assigned to the keyboard shortcut Control+Return.
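The script itself is AppleScript and talks to BBEdit, but the core idea fits in a few lines. Here is a rough Python sketch of the text manipulation only. The set of punctuation marks and the function name are my assumptions for illustration, not what the actual script uses.

```python
import re

# Punctuation marks treated as "natural" break points. This set is an
# assumption; the real script may use a different one.
BREAK_AFTER = re.compile(r"[,;:.!?](?=\s)")


def break_after_last_punctuation(line):
    """Insert a newline after the last punctuation mark followed by a space.

    The line is returned unchanged if it contains no such mark.
    """
    matches = list(BREAK_AFTER.finditer(line))
    if not matches:
        return line
    cut = matches[-1].end()
    # Drop the whitespace that followed the punctuation mark.
    return line[:cut] + "\n" + line[cut:].lstrip()


print(break_after_last_punctuation(
    "Usually, there is a perfectly placed comma a few words back"))
```

Requiring whitespace after the mark means a period at the very end of the line is left alone, which is what you want when the line is already complete.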
I know I’m extremely late to the party, but I am finally using FastScripts.
Until now, most of the scripts I used were either shell scripts that I activated from inside Terminal, or scripts that I used inside an application that supports them, like BBEdit or BibDesk.
I had tried before to find a way to run scripts from arbitrary applications, but I was never happy with what I found.
Then I found FastScripts, which of course has been around forever. It does everything I want it to. My favorite part is the ability to create application specific scripts and keyboard shortcuts, which are only shown or run when a certain application is running. That way the menu (and more importantly, the keyboard shortcut space) isn’t cluttered with irrelevant scripts.
While I’m on the subject of clutter, I recommend removing system scripts from the FastScripts menu.
Here are some scripts I collected to get my FastScripts library started.
Global
Finder
Safari
Terminal
AppleScript is a great tool. It is awesome to be able to get the selected text from an application, grab the current URL from Safari, ask the user to choose a file, or show a dialog box requesting text. But writing AppleScript scripts is usually painful.
For anything mildly complicated, I would much rather write something in Python. So a lot of my AppleScripts look like this:
-- Get information from the user or currently open application
do shell script some_python_or_bash_script
-- Do something with the result
Yet again, I recently found myself making an AppleScript where part 3 of the process involved composing an email to someone. It is difficult to take the result of the shell script (which is just a single, structureless string) and parse out multiple fields (body, subject, recipient) to pass to a complicated make new message command.
So instead, I made a Python wrapper around the make new message AppleScript command. Yes, that means I am using AppleScript to call a shell script which runs an AppleScript, but I’m okay with that. (Others have done the same thing, but not with the full set of options that I wanted.)
Why use Mail.app?
There are already command line mail programs. Why not just use one of them? Two reasons.
First, getting mail to transfer properly is always a pain. Comcast won’t let you use their SMTP, and if they did, your message would probably be marked as spam. So you have to figure out how to hook authenticated SMTP up to Google, and then it breaks, and you just get sick of it. Currently, my best solution to this has been to pipe a message over SSH to my work computer, which has a fully functional transfer agent, just to send an email to myself!
Second, and more important, often you want to see the message and maybe edit it a little before you send it. This also minimizes the chance that a script will screw up and either not send the mail or send duplicates.
Create the AppleScript
AppleScript to create a mail message looks about like this:
tell application "Mail"
    make new outgoing message with properties {visible:true,¬
        subject:"Happy Birthday!",content:"The big 60!"}
    tell result
        make new to recipient with properties {address:"mom@gmail.com"}
        make new attachment with properties {file name:"cake.jpg"}
    end tell
end tell
The first half of the Python script does nothing more than create an AppleScript and feed it to the osascript command.
#!/usr/bin/python
import sys
import argparse
import os.path
from subprocess import Popen,PIPE

def escape(s):
    """Escape backslashes and quotes to appease AppleScript"""
    s = s.replace("\\","\\\\")
    s = s.replace('"','\\"')
    return s

def make_message(content,subject=None,to_addr=None,from_addr=None,
                 send=False,cc_addr=None,bcc_addr=None,attach=None):
    """Use applescript to create a mail message"""
    # message properties (only shown on screen if it is not sent right away)
    if send:
        properties = ["visible:false"]
    else:
        properties = ["visible:true"]
    if subject:
        properties.append('subject:"%s"' % escape(subject))
    if from_addr:
        properties.append('sender:"%s"' % escape(from_addr))
    if len(content) > 0:
        properties.append('content:"%s"' % escape(content))
    properties_string = ",".join(properties)

    # one "make new ..." line per recipient and attachment
    template = 'make new %s with properties {%s:"%s"}'
    make_new = []
    if to_addr:
        make_new.extend([template % ("to recipient","address",
            escape(addr)) for addr in to_addr])
    if cc_addr:
        make_new.extend([template % ("cc recipient","address",
            escape(addr)) for addr in cc_addr])
    if bcc_addr:
        make_new.extend([template % ("bcc recipient","address",
            escape(addr)) for addr in bcc_addr])
    if attach:
        make_new.extend([template % ("attachment","file name",
            escape(os.path.abspath(f))) for f in attach])
    if send:
        make_new.append('send')
    if len(make_new) > 0:
        make_new_string = "tell result\n" + "\n".join(make_new) + \
            "\nend tell\n"
    else:
        make_new_string = ""

    script = """tell application "Mail"
make new outgoing message with properties {%s}
%s end tell
""" % (properties_string, make_new_string)

    # run applescript (universal_newlines lets communicate take a str on Python 3)
    p = Popen('/usr/bin/osascript',stdin=PIPE,stdout=PIPE,universal_newlines=True)
    p.communicate(script) # send script to stdin
    return p.returncode
Dr. Drang recently complained about how inconvenient it is to send data to a subprocess in Python. I feel his pain, because I have spent plenty of time and trial and error to figure out how Popen and communicate work. The official documentation is no help, either.
In the end, though, there is nothing terribly ugly about the three lines that run the AppleScript. If you want to send anything to the subprocess’s stdin, you need the argument stdin=PIPE (or =subprocess.PIPE, depending on your import statement). Running communicate returns a tuple with the subprocess’s stdout and stderr, but only if you use the arguments stdout=PIPE and stderr=PIPE. So in my script, communicate returns only the stdout (which I discard).
When you don’t specify stderr=PIPE, the error output is just passed along to the main process’s stderr (and so also with stdout). If you run my script from the command line, any errors from the osascript command will just be printed on your screen (unless, of course, you do something like 2>foo).
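Here is a minimal sketch of the same pattern with both output streams captured. Since osascript only exists on a Mac, the child process here is a one-line Python program that I made up as a stand-in: it upper-cases its stdin and writes a note to stderr. The helper name `run_child` is also just for illustration.

```python
import subprocess
import sys

# Stand-in for osascript (an assumption so the example runs anywhere):
# upper-case stdin onto stdout and write a short note to stderr.
CHILD = ("import sys; "
         "sys.stdout.write(sys.stdin.read().upper()); "
         "sys.stderr.write('done')")


def run_child(text):
    """Send text to the child's stdin and collect stdout, stderr and exit code."""
    p = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,    # required to send anything to the child
        stdout=subprocess.PIPE,   # without this, out would be None
        stderr=subprocess.PIPE,   # without this, err would be None
        universal_newlines=True,  # exchange str rather than bytes
    )
    out, err = p.communicate(text)
    return p.returncode, out, err


code, out, err = run_child("hello")
print(code, repr(out), repr(err))
```

Dropping the `stderr=subprocess.PIPE` line makes `err` come back as `None`, and the child's error output goes straight to your terminal instead.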
Use argparse
My newest rule to myself is “Never parse your own command line arguments.” Especially when I make something that I only ever plan to call from other scripts, and nobody but me is ever going to see, it is very tempting to do something stupid like require 8 positional arguments in a specific order.
Then you change some script somewhere and everything breaks. Or you want to use the script again and there is no --help. So you have to jump into source that you wrote a year ago just to figure out what to do. Not good.
The argparse library is new and replaces the short-lived and now deprecated optparse. It has lots of useful bells and whistles. For example, with the type=argparse.FileType() option, you can add an argument that expects a filename and automatically opens the file for you. It also creates a --help option automatically.
Here is the second half of the script.
def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Create a new mail message using Mail.app")
    parser.add_argument('recipient',metavar="to-addr",nargs="*",
        help="message recipient(s)")
    parser.add_argument('-s',metavar="subject",help="message subject")
    parser.add_argument('-c',metavar="addr",nargs="+",
        help="carbon copy recipient(s)")
    parser.add_argument('-b',metavar="addr",nargs="+",
        help="blind carbon copy recipient(s)")
    parser.add_argument('-r',metavar="addr",help="from address")
    parser.add_argument('-a',metavar="file",nargs="+",
        help="attachment(s)")
    parser.add_argument('--input',metavar="file",help="Input file",
        type=argparse.FileType('r'),default=sys.stdin)
    parser.add_argument('--send',action="store_true",
        help="Send the message")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_arguments()
    content = args.input.read()
    code = make_message(content,
        subject = args.s,
        to_addr = args.recipient,
        from_addr = args.r,
        send = args.send,
        cc_addr = args.c,
        bcc_addr = args.b,
        attach = args.a)
    sys.exit(code)
When you run parse_args, it returns a special Namespace object, which has the parsed arguments as attributes. (Why didn’t they use a dictionary?) In my script, “recipient”, which is a positional argument because it lacks a leading hyphen, is stored in args.recipient. The subject is stored in args.s. If I wanted to, I could pass both "-s" and "--subject" to add_argument, and then the subject would be stored in args.subject, but could be specified on the command line as either -s subject or --subject subject. With the action="store_true" argument, args.send will be True if the user gives the --send option, and False otherwise.
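Here is a cut-down sketch of that long-option variant. It is not part of mailapp: the parser has only three arguments, and I pass the argument list to parse_args directly so the example runs without a real command line.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Demo of argparse option naming")
    # No leading hyphen, so this is positional; nargs="*" collects a list.
    parser.add_argument("recipient", nargs="*", help="message recipient(s)")
    # Short and long spellings; the value is stored under the long name.
    parser.add_argument("-s", "--subject", help="message subject")
    # A flag: True if given, False otherwise.
    parser.add_argument("--send", action="store_true", help="send the message")
    return parser


args = build_parser().parse_args(["-s", "Hello", "mom@example.com"])
print(args.subject)    # Hello
print(args.recipient)  # ['mom@example.com']
print(args.send)       # False
print(vars(args))      # the same values as a plain dictionary
```

If you really do want a dictionary, calling `vars(args)` on the Namespace gives you one.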
I call the script mailapp. Just run
$ ls | mailapp -s "Here's how my home directory looks"
I have been using Jekyll to generate both this blog and my academic website for the past year, and I can confidently say that it has solved more problems for me than it has created. (This may sound like faint praise, but I assure you that it is not.)
Recently I have been annoyed at how long it takes to deploy updates to my website due to the way that Jekyll mangles timestamps, which rsync depends heavily on. I finally broke down and spent some time improving the process by tweaking rsync to work better with my Jekyll setup.
The Jekyll timestamp problem
It has always bothered me that Jekyll mangles timestamps. When you run jekyll to regenerate your site, all timestamps are updated to the current time. (This is because all pages are regenerated—a separate and also annoying issue.) So to anything that uses timestamps to determine when a page has changed, it appears that every page changes whenever a single page changes.
There is no solution to this problem within the Jekyll framework. Each output file is created from several input files, so you could imagine setting the timestamp of each output file to be the maximum timestamp from all of the input files. But the input files often live on several computers and/or in a git repository, which makes the timestamp of the input files both ambiguous and worthless. In these circumstances, the timestamp of a file is not the same as the last modified time of the actual data. The only way to preserve the latter is through some external database, the avoidance of which is essentially Jekyll’s raison d’être.
Rsync complications
I can overlook the fact that the file metadata on my web server is meaningless, but I have a harder time ignoring the slow deployment this causes. My academic website currently has 43 megabytes in 434 files. All but 400 kilobytes is archival stuff that never changes, and usually I am only changing a few files at a time. Nevertheless, rsync usually takes 15 seconds, even if I am transferring within the campus network.
I have two sets of files. I want to take all the differences from my local set and send them to the server set. For each pair of files, rsync checks that the sizes and modification times match, and if not, it copies the local file to the server. It has an efficient copy mechanism, so if the files are identical despite having different modification times, very little data is sent. If a large file has only changed in a few places, only the changed chunks are sent.
If you use Jekyll, the modification times never match, so all files are always copied, albeit in an efficient manner. Despite the efficient transfer mechanism, this is slow.
The correct way to use rsync with Jekyll
What you want is for rsync to compute and compare checksums for each pair of files, and only transfer files which have different checksums. You can do this by using the --checksum (or -c) option. Despite a warning from the rsync manual that “this option can be quite slow”, it reduced my transfer time from 15 seconds to 2 seconds.
Here is the command I recommend to deploy a Jekyll site:
rsync --compress --recursive --checksum --delete _site/ user@host.tld:public_html/
Or, if you prefer the short version:
rsync -crz --delete _site/ user@host.tld:public_html/
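To see why the checksum comparison makes the difference, here is a toy Python sketch of the two decision rules. It is not how rsync is implemented: the function names are made up, SHA-256 is just a convenient hash for the illustration, and both "local" and "server" files live in one temporary directory.

```python
import hashlib
import os
import tempfile


def quick_check_differs(a, b):
    """Rough model of the default rule: compare size and modification time."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size != sb.st_size or int(sa.st_mtime) != int(sb.st_mtime)


def checksum_differs(a, b):
    """Rough model of --checksum: compare a hash of the file contents."""
    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    return digest(a) != digest(b)


def demo():
    with tempfile.TemporaryDirectory() as d:
        local = os.path.join(d, "local.html")
        server = os.path.join(d, "server.html")
        for path in (local, server):
            with open(path, "w") as f:
                f.write("<p>a page that Jekyll regenerated unchanged</p>\n")
        # Pretend the server copy was written long ago.
        os.utime(server, (0, 0))
        return quick_check_differs(local, server), checksum_differs(local, server)


print(demo())  # (True, False)
```

The timestamp rule says the file needs attention even though nothing changed, while the checksum rule correctly says it can be skipped.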
More meaningful timestamps on the server
A side benefit of this tweak is that server timestamps have meaning again. If the local and server files have the same checksum, nothing is copied. The timestamp of the file on the server is now the time the file was last copied to the server.
If you use the --times (or -t) option, the server timestamps are manipulated to match the (meaningless) local file timestamps. This is not what you want.
If you use the --archive (or -a) option, which is recommended by almost every rsync tutorial out there, you are implicitly using the --times option, as -a is equivalent to -rlptgoD. This is also not what you want. For a Jekyll site, the only part of -a that you care about is the -r. So don’t use -a.
Miscellaneous notes on rsync options
- The --itemize-changes (-i) option is a useful way of seeing what is transferred.
- The --ignore-times (-I) option ignores timestamps, but not in the way you want. It simply copies all files no matter what (but still using the efficient transfer mechanism).
- If you leave off the --times option and don’t use --checksum, then all files which have matching timestamps are skipped, and all other files are transferred, which changes their timestamp on the server to the current time. If you continue this over time, more and more files have different timestamps even though they are the same, which means they are copied every time.
- There is a --size-only option which skips files if they have the same size on the local computer and the server, even if they have different modification times. You are tempting fate if you use this option.