Photos Library backup using python and launchd

The problem at hand

We use a large, external drive to store the household Photos library (Photos as in Apple Photos.app). The external drive is already backed up using one of the popular cloud backup services. I want to have a more convenient backup in case of catastrophic damage to the drive, which would reduce the time it takes to recover all our photos.

Assess the options and devise a plan

Given that the drive is already backed up to the cloud, we have some freedom. We could just connect another drive and do some kind of backup; we could set up Time Machine to do something (maybe); we could try to move the library to the computer's internal drive and then use the external drive as a backup; or we could occasionally copy the photos to another drive or computer. I happen to have a Synology NAS ready for this type of thing, so I will use that. Instead of relying on my memory, or even an automated reminder, to sit in front of my computer and do periodic backups, I will automate the backup.

Here's the plan: write a script that will do the backup of the Photos library to the NAS, and then schedule that script to run regularly. Easy.


1. Enable SSH using keys on Synology

My Synology NAS is clearly underutilized, since I had never bothered to set up SSH keys before. It turns out to be easy, but only because others have already documented the process.

Follow the instructions posted here:

Note: at first I did all the steps except setting the home directory permissions. It turns out the post is right that this step is necessary. You need to change the permissions on the home directory, or else SSH will still prompt for the user password even though the keys are present.

Once this step is done, you can ssh into the NAS without entering a password.
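The gist, as a sketch (the user name, host name, and the Synology home path are placeholders for my setup; the linked post has the authoritative steps):

```shell
# On the Mac: generate a key pair if you don't already have one
ssh-keygen -t ed25519

# Copy the public key to the NAS (placeholder user/host)
ssh-copy-id my_name@nas.local

# On the NAS: tighten the home directory permissions, or key auth
# silently falls back to password prompts
chmod 755 /var/services/homes/my_name
chmod 700 /var/services/homes/my_name/.ssh
chmod 600 /var/services/homes/my_name/.ssh/authorized_keys
```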

2. Initial backup of the Photos library

Again, this road has been taken. For example:


In that example they use rsync, and I don't see any reason it isn't a great fit for my purposes. I'm taking most of that rsync command, but I'm removing the "--delete" flag just in case (I have room, so no need to worry about it).

On the NAS, make a location for the backup:
cd /volume1/some_place
mkdir photos_library_backup

Just as a note for those who haven't looked at this before, it is important to realize that the Photos library is called something like
/Volumes/external_drive/Photos\ Library.photoslibrary

I don't know any details about this, but I do know that it isn't a "file", but more like a directory. You can even just cd into it and look around.

The command that I'm using will be something like this:

rsync -Phca --stats --include="/Photos Library.photoslibrary/" --include="/Photos Library.photoslibrary/***" --exclude="*" -e "ssh" "/Volumes/external_drive/" my_name@nas.local:/volume1/some_place/photos_library_backup/

3. Set up automation with launchd

The post from Kevin Goedecke uses a shell (sh) script and crontab, but this is not the Apple way. We should use launchd.

I'm no expert at launchd, so I looked at a bunch of examples. Here are a few:

A pretty nice quick overview:

See also:


The bottom line is that you have to choose whether you want the job to run only when you are logged in, or allow it to run as root. I want it to run as root, since the Photos library might be updated by other users when I am not around. To do that, we just need to put a proper "plist" file into /Library/LaunchDaemons. A plist file is just an XML file, but we have to follow some conventions, which all the above links describe. Since this is a super simple job, my plist file is 20 lines.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
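<!-- The rest of the file, as a sketch; the label and paths here are
     placeholders for my actual values -->
<dict>
    <key>Label</key>
    <string>com.example.photo_backup</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/python3</string>
        <string>/Users/my_name/scripts/photo_backup.py</string>
    </array>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>3</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
</dict>
</plist>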
Essentially, the file says to run a python script at regular intervals. Arbitrarily, I'll just do it every morning at 3am, but it could have been less frequent.

4. Write a script to do the backup

The plist just schedules a job. That job is actually to run a python script. There are so many things that you could do here, but I'm just interested in doing one thing without incident: run rsync to back up my whole Photos library.

There's not going to be anything fancy, but I think this does the job correctly. I will use subprocess.run() to invoke rsync, exactly as I did for my initial backup in step 2. That could have been a shell script, or a python script of just a few lines.

I was slightly worried about what would happen if the external drive were unmounted/removed when the script was run. This could definitely happen; I think all the external drives get unmounted when no one is logged in to the system. So I put in a check, just to see if the path to the external drive seems valid:

from pathlib import Path

local_loc = Path("/Volumes/external_drive/")
if not (local_loc.exists() and local_loc.is_dir()):
    raise IOError("Something is wrong")
I did notice that in python 3.7 there is a Path.is_mount() method, but the machine running the script is still on 3.6. After a minute of thinking about it, I decided that it does not really matter if the location is a mount; what matters is whether it is there, so I went with this version.

The script also defines where to point the rsync command, which is just a string. So the work is basically accomplished by

prc = subprocess.run(["rsync", "-Phca", "--stats",
                      "--include=/Photos Library.photoslibrary/",
                      "--include=/Photos Library.photoslibrary/***",
                      "--exclude=*", "-e", "ssh",  # no shell, so no extra quoting
                      str(local_loc), remot_plc],
                     stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
where local_loc is the path to the library on the external drive and remot_plc is the path on the NAS. The stdout and stderr keyword arguments put the output of rsync into one attribute of prc.

I used that stdout/stderr in order to write a useful log file. I am just using the standard library logging module, and setting a log file for each daily run using a simple timestamp:

import logging
from datetime import datetime
from pathlib import Path

# construct the log file name using today's date
logloc = Path("/HOME/logs/")
lognam = datetime.now().strftime('photo_backup_log_%Y%m%d_%H%M.log')
logfil = logloc / lognam
logging.basicConfig(level=logging.INFO, filename=logfil)
The subprocess.run() function uses stdout=subprocess.PIPE to put the stdout of rsync into an attribute of prc called stdout as a bytes object. I have a little function that parses that bytes object into a list of strings split on the newline character:

def log_subprocess_output(p):
    lines = p.stdout.decode("utf-8").split("\n")
    for line in lines:
        logging.info(line)
This is called right after the subprocess.run(); it just puts all the output from rsync into the log file.

That's it. We now have a script that will run rsync to back up any changes to the Photos library to a designated place on the NAS. If the external drive isn't available, it raises an exception and exits. It logs all the steps to a timestamped file. The script is run by a root process at 3am every day. Not bad for 50 lines of python with no dependencies and a 20-line plist file.
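Putting the pieces together, the whole script is roughly the following sketch (the paths, the remote location, and the log directory are placeholders for my actual values):

```python
import logging
import subprocess
from datetime import datetime
from pathlib import Path

# Placeholders for my setup; adjust to taste.
LOCAL_LOC = Path("/Volumes/external_drive/")
REMOTE_LOC = "my_name@nas.local:/volume1/some_place/photos_library_backup/"
LOG_DIR = Path("/HOME/logs/")


def build_rsync_command(src, dest):
    # Assemble the rsync argument list; no shell is involved,
    # so the filter patterns need no extra quoting.
    return ["rsync", "-Phca", "--stats",
            "--include=/Photos Library.photoslibrary/",
            "--include=/Photos Library.photoslibrary/***",
            "--exclude=*", "-e", "ssh", str(src), dest]


def log_subprocess_output(p):
    # rsync output arrives as one bytes object; log it line by line
    for line in p.stdout.decode("utf-8").split("\n"):
        logging.info(line)


def main():
    lognam = datetime.now().strftime("photo_backup_log_%Y%m%d_%H%M.log")
    logging.basicConfig(level=logging.INFO, filename=str(LOG_DIR / lognam))
    if not (LOCAL_LOC.exists() and LOCAL_LOC.is_dir()):
        raise IOError("Something is wrong")  # drive not mounted
    logging.info("Starting rsync")
    prc = subprocess.run(build_rsync_command(LOCAL_LOC, REMOTE_LOC),
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    log_subprocess_output(prc)
    logging.info("rsync finished with return code %s", prc.returncode)


if __name__ == "__main__":
    main()
```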


Trimming time series with xarray

There are many times when a data set starts and/or ends at an inconvenient time. My most common experience with this is obtaining an observational data set that begins and ends at months in the middle of the year, but I want to either look at annual means or derive anomalies. These dangling months can be cumbersome, and most of the time I find that the easiest way to deal with them is to trim them off.

Just to be explicit, say I have data in which the time coordinate is something like:
time = ["1920-03-15", "1920-04-15", "1920-05-15", ... , "1987-09-15", "1987-10-15"]
It's monthly data, and I know that there are no gaps, but it starts in March and ends in October.

To deal with this easily, I use the argmax function in xarray. This function returns the index of the maximum of the argument, and in case of multiple equal maxima, the index of the first occurrence. It is worth noting that argmax is directly using numpy's argmax function. So, I just construct a test for what I want for the beginning and ending and look for the first true value (=1). Here's an example in which I have some xarray Dataset called ds that has a time coordinate; note that I've already made sure that the time coordinate can be decoded so we can use the 'dt' accessor.

time = ds['time']
months = time.dt.month
first_january = int((months == 1).argmax())
# argmax returns the index of the first True (= 1) value
last_december = len(time) - 1 - int((months == 12)[::-1].argmax())
ds_trim = ds.isel(time=slice(first_january, last_december + 1))
# +1 b/c slice is exclusive on the end
print(f"Index of first January is {first_january} and last December is {last_december}")

The same strategy can be used for other kinds of trimming of time series. If you know the times that you want a priori, this is overkill because you could directly slice that data out of your data set, but when you just want to get a whole number of years or similar, this is the easiest method I've found.
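As a sanity check, the same trimming logic can be run on a plain numpy stand-in for the month coordinate (synthetic data, assuming monthly samples from March 1920 through October 1987 as in the example above):

```python
import numpy as np

# Synthetic monthly axis: March 1920 through October 1987, no gaps,
# represented here just by the month numbers (1..12).
n_months = 812  # (1987 - 1920) * 12 + (10 - 3) + 1
months = (np.arange(n_months) + 2) % 12 + 1  # starts at 3 (March)

first_jan = int((months == 1).argmax())  # index of the first True
last_dec = len(months) - 1 - int((months == 12)[::-1].argmax())  # last True
trimmed = months[first_jan:last_dec + 1]  # a whole number of years
```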


Best Python resources for 2019

After listening to Talk Python to Me 194, I started thinking about what I would recommend to people who are getting started with python. It's a complicated question, actually, because it depends on what this hypothetical neophyte intends to do, and more importantly, whether she/he is interested in getting nerdy about python or just wants to know enough to get it done. 
In any case, it doesn't really matter, because lists like this aren't really for this hypothetical person. This is a list of stuff I like, and that I recommend. So let's not get caught up in some conceit, and just get to a list of good python resources.
First, I find a lot of utility in the posts at RealPython.com. Each post is based on a python package or a feature and gives a fairly detailed walk through of that topic. They aren't always relevant for my own uses, but I even read some of those to get a better idea of how the other half live. Other good sites that are worth visiting include FullStackPython.com, Python.org, and SciPy.org. The latter two lead to the official documentation, which is sometimes (not always) useful for learning about the details. I've also got a news feed set up for python-related content (using Feedly), and much of the content actually comes through PlanetPython.org; it's more of a mixed bag but I find and read enough content through it that it is worth mentioning.
I listen to two podcasts focused on python, both hosted by Michael Kennedy: Talk Python To Me and Python Bytes. The first is an interview show, usually with a developer or someone who works in some corner of the python world. The second is a weekly python news show, and they really do a great job of finding the breaking python news and reporting on new and interesting projects. I recommend both, but Python Bytes seems nearly essential at this point. There are other python podcasts, but I haven't gotten into regularly listening to any of them. I think that's because the ones I've heard get a little too into details that I don't care about, especially about web development (a topic that I know nothing about and I get lost very quickly).
On the video front, there are a couple of content providers worth subscribing to. First is Dan Bader's videos; he now runs RealPython.com, too, and his videos are well produced, reasonable-length tutorials that are usually on some feature of standard python. I think his publishing frequency has decreased since moving to RealPython.com, but some of his videos are worth revisiting. I also highly recommend the videos by Corey Schafer. He provides more step-by-step tutorials, and does a great job of stepping through and building up lessons. I also keep an eye on Coding Tech, which often has videos on programming that are interesting. I also watch a lot of the recorded presentations from python conferences: PyCon, PyData, SciPy especially, but others as well (see PyVideo.org). I also think the Socratica python videos are good, if a bit hokey. Finally, Enthought publishes a lot of useful videos, and seems to be responsible for the SciPy conference talks.
Finally, tools for leveling up your python skills. There are tons of resources out there, and I have not tried all of them. I have experimented, and I've returned to a few of them. First is ProjectEuler.net. It isn't a python site; it really isn't even a coding site, so much as an algorithm and math site. It's just a list of problems; you solve the problem and input the answer, which unlocks a discussion board about that problem where you can see other solutions and sometimes detailed "official" solutions with explanation. I haven't done very many of the problems yet, but I find them to be more fun to do than other similar sites. A more sophisticated version of the same concept is CodeSignal, which used to be called CodeFights. The problems are much more computer science focused, and geared toward developers and want-to-be developers getting ready for interviews. The fun thing is that the site provides a full programming environment, and your whole code is supplied and run to get the solutions. It is well designed and many of the problems that I've done have been fun. You can choose from many languages, so it is not a python-specific resource. CodeSignal seems to be nearly the same idea as CodeWars, which I have not tried. A new one to keep an eye on is Coder.com; this is VSCode in the browser with access to computational resources. I'm not sure where this is going, but it seems well-made and interesting.
Okay, finally, finally is the omnipresent internet resource for coding things: stackoverflow. I don't really like SO in many ways. The main ones are (a) I don't think people do a good job of asking their questions, and (b) people get pedantic (or just jerky) about answering questions. That said, a lot of the answers are very useful, and the best ones provide great guidance and insight about how things work.


In-place upgrade from python 3.6 to 3.7

Based on reports on Python Bytes and from [here] and [there], it seems like 3.7 is generally faster than 3.6. So, I decided to try it. On one machine, I set up a fresh conda environment with 3.7 and installed all the packages I typically use. The first time I did that, which was months ago, not everything was working, and I put this upgrade plan on hold. Later, I re-tested, and all my packages seemed to be playing nicely with 3.7. I worked in that environment for a while with no problems.

During down time today, I thought it might be good to move another machine to 3.7. This time I decided to take the leap and move my base environment to python 3.7 from 3.6.6. Why not?

There is one step:
$ conda install python=3.7

This takes some time while conda "solves" the environment. I'm not sure exactly what that involves, but since it checks dependencies, it is no wonder it takes a while: essentially every installed package will need to be removed and reinstalled.

One potential gotcha with this approach is that anything that was pip installed will need to be reinstalled. I think there are a couple of these, but I don't know how to tell which are which. Oh well, I guess I'll find out when something breaks. 

Eventually the environment does get solved, and a plan is constructed. Answer 'y' and conda dutifully downloads and extracts many packages.

Conda does all the work:
- The "Preparing transaction" step happens with the spinning slash, and finishes.
- The "Verifying transaction" step happens with the spinning slash, and finishes.
- Some deprecated stuff (jupyter js widgets) is removed, then notebook extensions are enabled and validated. Give the OK.
- "done" is printed, and the prompt returns.

Did it work?

$ which python

$ python --version
Python 3.7.1

Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
[Clang 9.0.0 (clang-900.0.37)] :: Anaconda custom (64-bit) on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print(2.**8)
256.0
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib as mpl
>>> import xarray as xr
>>> xr.DataArray(np.random.randn(2, 3))
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[-0.355778,  0.836539,  0.210377],
       [ 0.480935,  0.469618, -0.101545]])
Dimensions without coordinates: dim_0, dim_1
>>> data = xr.DataArray(np.random.randn(2, 3), coords={'x': ['a', 'b']}, dims=('x', 'y'))
>>> xr.DataArray(pd.Series(range(3), index=list('abc'), name='foo'))
<xarray.DataArray 'foo' (dim_0: 3)>
array([0, 1, 2])
Coordinates:
  * dim_0    (dim_0) object 'a' 'b' 'c'

Okay, this seems to be working. Repeated similar interactive test with ipython. So far, so good.

Lesson: conda is kind of amazing.


On Microsoft acquiring GitHub

In the news today: Microsoft is acquiring GitHub (The Verge). That's a big deal for a lot of big (and small) open source projects. It's definitely going to rub a lot of open-source developers the wrong way, as many are motivated to contribute to open source projects as a direct response to decades of difficulties with Microsoft.

I am not a Microsoft user, nor a fan. But I do acknowledge that there have been some apparent improvements over the past few years under CEO Satya Nadella. They have some notable open projects, especially VS Code, which is becoming very popular. From what I hear, Windows is still a monstrosity that should not be used, but it is clear that Windows 10 is improved over the last few versions; note, the last time I really used Windows was Windows 98, I think.

My suspicion is that Microsoft wants GitHub because they want to use it internally for very large projects. Recently Microsoft has been working to make git more useful for humongous projects, specifically with the virtual file system (techcrunch). By controlling GitHub, Microsoft becomes the biggest player in git as well, which I'm sure will displease some.

It's also worth noting that GitHub developed the text editor Atom, which is pretty similar to Microsoft's aforementioned VS Code. Atom is an open source project, but it will be interesting to see what happens to Atom development going forward.

Finally, I'll also mention that GitHub has adopted Markdown as its documentation markup language of choice. There's already a GitHub flavor of Markdown, which I think is probably the dominant version. Now that Microsoft owns GitHub, I wonder whether it will impact the use of Markdown, and especially whether GitHub-flavored Markdown will evolve further away from the original.


What is aerosol radiative forcing?

There is a lot of research about the climate impact of aerosols. One of the fundamental measures of the climate impact is the "radiative forcing" associated with aerosols. It's not obvious what exactly aerosol radiative forcing is, however, so here we begin our examination of this question.

The IPCC AR4 provides a nearly useless description: [LINK]

We can discern two important facets of aerosol radiative forcing from that description:
  1. It is measured based on top-of-atmosphere (TOA) radiative fluxes. 
  2. It includes the impact of aerosol on clouds. 
It is also useful to look at other parts of AR4, where better text describes aerosol effects. I started at the link above because when I search for "aerosol radiative forcing" that is one of the top hits I get. That's an unfortunate hit because the text surrounding that small section is much more informative.  

The first thing that can be clarified is that #1 above is part of the definition of radiative forcing. As far as IPCC reports go, radiative forcing is the impact that a forcing agent has on the net TOA fluxes. The concept is useful because it is derived from the basic physics of conservation of energy and thermodynamics. In equilibrium the net TOA flux is zero (averaged over a year, or many years). When a forcing agent, such as anthropogenic aerosol, is applied to the system, the energetic consequence may be a change in that TOA balance (i.e., a radiative forcing), and a TOA imbalance causes the system to respond. We deduce that if the forcing is negative the system will cool to achieve a new balance, but if the forcing is positive (i.e., more energy is entering the system than leaving) the system will warm to achieve a new balance. Aerosols typically fall into the negative forcing category, and so cause a cooling, but the story is not really so simple.
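In symbols (this notation, including the feedback parameter, is mine, not the IPCC's): let N be the net downward TOA flux.

```latex
% Equilibrium: the long-term mean net TOA flux vanishes
\langle N \rangle = 0
% A forcing agent perturbs the balance by the radiative forcing F
N = F
% The system warms or cools by \Delta T until balance is restored,
% with \lambda < 0 the net feedback parameter
F + \lambda \, \Delta T = 0
\quad\Rightarrow\quad
\Delta T = -\frac{F}{\lambda}
```

With a negative feedback parameter, a negative F gives a negative temperature change, which is just the cooling argument above.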

In particular, it is helpful to split aerosol effects into two pieces:
  1. direct effects of aerosol particles on radiative transfer through the atmosphere (scattering and absorption) (aka, aerosol-radiation interaction, ari)
  2. indirect effects of aerosol that change the radiative properties of clouds, or change the lifetime of clouds (aka, aerosol-cloud interaction, aci)
The IPCC AR5 [LINK] includes a lot of treatment of aerosol radiative forcing. Since it's newer, perhaps we should focus there for some clarity on this issue. An important distinction is drawn in AR5 between radiative forcing (RF) and effective radiative forcing (ERF). RF is just what we were describing, namely the change in the TOA net flux (allowing adjustment of the stratosphere), while ERF also allows the troposphere to adjust to the forcing agent. ERF is tricky to establish because it does not allow the global average surface temperature to adjust; the idea is that ERF includes tropospheric "rapid adjustments", while RF only allows the rapid stratospheric adjustment. Confused yet?

We will return to this distinction in another post. For now, we need to consider that both direct and indirect effects have an RF but also an ERF. This further muddies the water with respect to how we describe how aerosols affect the climate system. Mostly AR5 seems to deal with RF for direct aerosol effects and ERF for indirect effects. For now, though, let's return to our basic question of what aerosol radiative forcing is.

Based on IPCC AR4 and AR5, along with a lot of literature reviewed therein, and also my own literature review that spans from the 1980s to today, the easiest way to express the meaning of aerosol radiative forcing is:
Aerosol radiative forcing is the change in TOA radiative fluxes between the preindustrial period and the present day. The aerosol radiative forcing can be divided into direct effects, in which aerosol affects radiative transfer, and indirect effects, in which aerosol interacts with clouds.
Estimates of the total direct aerosol radiative forcing are around -0.35 (-0.85 to +0.15) W m-2. Including indirect effects means switching to the ERF concept, which we will examine in another post, but the AR5 bottom line is that the total aerosol effect is a negative forcing of about -1 W m-2, give or take about 1 W m-2.

What I want to point out before closing is that I described RF in the beginning as fundamental, but the definition that I've just provided seems far from fundamental. When we use this definition of aerosol radiative forcing, we need to define what pre-industrial means and what present day means. We know intuitively what both are supposed to mean, but quantitatively this is ambiguous. Particularly troublesome is that we do not have adequate observations from pre-industrial times to really know what the aerosol concentrations or emissions were. This provides an irreducible uncertainty for aerosol radiative forcing using this definition.  We will revisit some of these concepts in future posts, and we will return to the difficulties associated with this definition of aerosol radiative forcing.


Joseph Romm raves about Reagan, balks at Barack: Figures of speech make and break communication

I have recently read Joseph Romm's new book, Language Intelligence,
which is really a brief review of rhetoric. It introduces modern readers to
the age-old topic of eloquent language intended to persuade
audiences. Romm uses just a few prime examples for each of the several
topics covered, from the ancient Greek greats to medieval masters who
wrote the King James Bible to modern practitioners such as Lady
Gaga. The point is to expose the principles of rhetorical discourse,
such as the various forms of repetition, irony, metaphor, and
seduction, and provide readers with some of the tools necessary to
build an effective argument as well as to erect a wall to defend
against the constant bombardment by advertisers, politicians, and
other persuaders.

The lessons are clear and well illustrated by examples. Especially
useful are the examples from recent political figures such as both
George Bushes, Bill Clinton, Barack Obama, and Mitt Romney. Several
Republican strategists are pointed out for their cunning use of
rhetorical devices (Luntz and Rove, especially). Scientists (climate
scientists, especially) are singled out for their clumsy attempts to
communicate, usually avoiding rhetorical figures of speech. The
use of the figures being discussed occasionally becomes too blatant,
often in the final paragraphs of sections, but it is pleasing as a
reader to see such employment as sections close because it reinforces the
lesson. I am convinced that this brief introduction should be standard
reading for college students across disciplines, and those in the
sciences should pay careful attention to the lessons and employ more
intelligent language when describing their own work. Older readers
might pick up some new tricks, too, if they choose to read the book.