Photos Library backup using python and launchd

The problem at hand

We use a large, external drive to store the household Photos library (Photos as in Apple Photos.app). The external drive is already backed up using one of the popular cloud backup services. I want to have a more convenient backup in case of catastrophic damage to the drive, which would reduce the time it takes to recover all our photos.

Assess the options and devise a plan

Given that the drive is already backed up to the cloud, we have some freedom. We could connect another drive and do some kind of backup; we could set up Time Machine to do something (maybe); we could move the library to the computer's drive and then use the external drive as a backup; we could occasionally copy the photos to another drive or computer. I happen to have a Synology NAS ready for this type of thing, so I will use that. And instead of relying on my memory, or even an automated reminder, to go sit in front of my computer and do periodic backups, I will automate the backup.

Here's the plan: write a script that will do the backup of the Photos library to the NAS, and then schedule that script to run regularly. Easy.


1. Enable SSH using keys on Synology

My Synology NAS is obviously underutilized, since I had never bothered to set up SSH keys before. It turns out to be easy, but only because others have already documented the process.

Follow the instructions posted here:

Note: at first I did all the steps except setting the home directory permissions. It turns out that post is right that this step is necessary. You need to change the permissions on the home directory, or SSH will still prompt for the user password even though the keys are present.

Once this step is done, you can ssh into the NAS without entering a password.
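The whole key setup boils down to a few commands. Here's a sketch; the Synology username, hostname, and home-directory path are placeholders, so check them against your own DSM setup:

```shell
# make a scratch directory so this demo doesn't clobber any real keys
KEYDIR="$(mktemp -d)"
# generate an ed25519 key pair with no passphrase (use one if you prefer)
ssh-keygen -q -t ed25519 -N "" -f "$KEYDIR/id_nas"
# copy the public key over to the NAS (user and host are placeholders):
#   ssh-copy-id -i "$KEYDIR/id_nas.pub" my_name@diskstation.local
# then, on the NAS, fix the permissions, or key auth silently falls
# back to password prompts:
#   chmod 755 /var/services/homes/my_name
#   chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
```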

2. Initial backup of the Photos library

Again, this road has been taken. For example:


In that example they use rsync, and I don't see any reason it isn't a great way to go for my purposes. I'm taking most of that rsync command, but I'm removing the "--delete" flag just in case (I have room, so no need to worry about it).

On the NAS, make a location for the backup:
cd /volume1/some_place
mkdir photos_library_backup

Just as a note for those who haven't looked at this before, it is important to realize that the Photos library is called something like
/Volumes/external_drive/Photos\ Library.photoslibrary

I don't know any details about this, but I do know that it isn't a "file", but more like a directory. You can even just cd into it and look around.

The command that I'm using will be something like this:

rsync -Phca --stats --include="/Photos Library.photoslibrary/" --include="/Photos Library.photoslibrary/***" --exclude="*" -e "ssh" "/Volumes/external_drive/" my_name@external_drive.local:/photos_library_backup/

3. Set up automation with launchd

The post from Kevin Goedecke uses a shell (sh) script and crontab, but this is not the Apple way. We should use launchd.

I'm no expert at launchd, so I looked at a bunch of examples. Here are a few:

A pretty nice quick overview:

See also:


The bottom line is that you have to choose whether you want the job to run only when you are logged in, or allow it to run as root. I want it to run as root, since the Photos library might be updated by other users when I am not around. To do that, we just need to put a proper "plist" file into /Library/LaunchDaemons. A plist file is just an xml file, but we have to follow some conventions, which all the above links describe. Since this is a super simple job, my plist file is 20 lines.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
Essentially, the file says to run a python script at regular intervals. Arbitrarily, I'll just do it every morning at 3am, but it could have been less frequent.
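A minimal sketch of what such a plist can look like, continuing from the header above; the label, python interpreter, and script path here are placeholders, not my actual file:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.photo_backup</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/python3</string>
        <string>/Users/my_name/scripts/photo_backup.py</string>
    </array>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>3</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
</dict>
</plist>
```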

4. Write a script to do the backup

The plist just schedules a job. That job is actually to run a python script. There are so many things that you could do here, but I'm just interested in doing one thing without incident: run rsync to back up my whole Photos library.

There's not going to be anything fancy, but I think this does the job correctly. I will use subprocess.run() to invoke rsync, exactly as I did for my initial backup in step 2. That could have been put into a shell script, or even a python script, using very few lines.

I was slightly worried about what would happen if the external drive were unmounted/removed when the script was run. This could definitely happen; I think all the external drives get unmounted when no one is logged in to the system. So I put in a check, just to see if the path to the external drive seems valid:

from pathlib import Path

local_loc = Path("/Volumes/external_drive/")
if not (local_loc.exists() and local_loc.is_dir()):
    raise IOError("Something is wrong")
I did notice that python 3.7 added a Path.is_mount() method, but the machine running the script is still on 3.6. After a minute of thinking about it, I decided that it does not really matter whether the location is a mount; what matters is whether it is there. So I went with this version.

The script also defines where to point the rsync command, which is just a string. So the work is basically accomplished by

prc = subprocess.run(["rsync", "-Phca", "--stats",
                      "--include=/Photos Library.photoslibrary/",
                      "--include=/Photos Library.photoslibrary/***",
                      "--exclude=*", "-e", "ssh",
                      local_loc, remot_plc],
                     stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
where local_loc is the path to the library on the external drive and remot_plc is the path on the NAS. Note that because the arguments are passed as a list, no shell does quote removal, so the include/exclude patterns must not carry embedded quotes of their own. The stdout and stderr keyword arguments put the output of rsync into one attribute of prc.

I used that stdout/stderr in order to write a useful log file. I am just using the standard library logging module, and setting a log file for each daily run using a simple timestamp:

import logging
from datetime import datetime
from pathlib import Path

# construct the log file name using today's date
logloc = Path("/HOME/logs/")
lognam = datetime.now().strftime('photo_backup_log_%Y%m%d_%H%M.log')
logfil = logloc/lognam
logging.basicConfig(level=logging.INFO, filename=logfil)
The subprocess.run() function uses stdout=subprocess.PIPE to put the stdout of rsync into an attribute of prc called stdout as a bytes object. I have a little function that parses that bytes object into a list of strings split on the newline character:

def log_subprocess_output(p):
    lines = p.stdout.decode("utf-8").split("\n")
    for line in lines:
        logging.info(line)
This is called right after the subprocess.run(); it just puts all the output from rsync into the log file.

That's it. We now have a script that will run rsync to back up any changes to the Photos library to a designated place on the NAS. If the external drive isn't available, it raises an exception and exits. It logs all the steps to a timestamped file. The script is run by a root process at 3am every day. Not bad for 50 lines of python with no dependencies and a 20-line plist file.


Trimming time series with xarray

There are many times when a data set starts and/or ends at an inconvenient time. My most common experience with this is obtaining an observational data set that begins and ends at months in the middle of the year, but I want to either look at annual means or derive anomalies. These dangling months can be cumbersome, and most of the time I find that the easiest way to deal with them is to trim them off.

Just to be explicit, say I have data in which the time coordinate is something like:
time = ["1920-03-15", "1920-04-15", "1920-05-15", ... , "1987-09-15", "1987-10-15"]
It's monthly data, and I know that there are no gaps, but it starts in March and ends in October.

To deal with this easily, I use the argmax function in xarray. This function returns the index of the maximum of the argument, and in case of multiple equal maxima, the index of the first occurrence. It is worth noting that argmax is directly using numpy's argmax function. So, I just construct a test for what I want at the beginning and end, and look for the first true value (= 1). Here's an example in which I have some xarray Dataset called ds_smpl that has a time coordinate; note that I've already made sure that the time coordinate can be decoded so we can use the 'dt' accessor.

time = ds_smpl['time']
months = time.dt.month
first_january = int((months == 1).argmax())
# argmax will return the index of the first True (= 1) value
last_december = -1 - int((months == 12)[::-1].argmax())
# careful: if the series already ends in December, last_december is -1
# and last_december + 1 is 0, so that case needs separate handling
ds_trim = ds_smpl.isel(time=slice(first_january, last_december + 1))
# +1 b/c slice is exclusive on the end
print(f"Index of first January is {first_january} and last December is {last_december}")
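To sanity-check the index arithmetic without a real data set, the same trick works on a plain numpy array of month numbers (xarray's argmax defers to numpy's). The series here is made up to mimic the March 1920 to October 1987 example:

```python
import numpy as np

# fake monthly axis: 1920-03 through 1987-10, as month numbers 1..12
months = np.array([(m % 12) + 1 for m in range(2, 2 + 812)])

first_january = int((months == 1).argmax())  # first True wins
# index of the last December, counted from the front to avoid the
# negative-index edge case
last_december = months.size - 1 - int((months == 12)[::-1].argmax())
trimmed = months[first_january:last_december + 1]  # +1: slice end is exclusive

print(first_january, last_december, trimmed.size // 12)  # → 10 801 66
```

The trimmed series starts in January 1921, ends in December 1986, and covers a whole number of years.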

The same strategy can be used for other kinds of trimming of time series. If you know the times that you want a priori, this is overkill because you could directly slice that data out of your data set, but when you just want to get a whole number of years or similar, this is the easiest method I've found.


Best Python resources for 2019

After listening to Talk Python to Me 194, I started thinking about what I would recommend to people who are getting started with python. It's a complicated question, actually, because it depends on what this hypothetical neophyte intends to do, and more importantly, whether she/he is interested in getting nerdy about python or just wants to know enough to get it done. 
In any case, it doesn't really matter, because lists like this aren't really for this hypothetical person. This is a list of stuff I like, and that I recommend. So let's not get caught up in some conceit, and just get to a list of good python resources.
First, I find a lot of utility in the posts at RealPython.com. Each post is based on a python package or a feature and gives a fairly detailed walk through of that topic. They aren't always relevant for my own uses, but I even read some of those to get a better idea of how the other half live. Other good sites that are worth visiting include FullStackPython.com, Python.org, and SciPy.org. The latter two lead to the official documentation, which is sometimes (not always) useful for learning about the details. I've also got a news feed set up for python-related content (using Feedly), and much of the content actually comes through PlanetPython.org; it's more of a mixed bag but I find and read enough content through it that it is worth mentioning.
I listen to two podcasts focused on python, both hosted by Michael Kennedy: Talk Python To Me and Python Bytes. The first is an interview show, usually with a developer or someone who works in some corner of the python world. The second is a weekly python news show, and they really do a great job of finding the breaking python news and reporting on new and interesting projects. I recommend both, but Python Bytes seems nearly essential at this point. There are other python podcasts, but I haven't gotten into regularly listening to any of them. I think that's because the ones I've heard get a little too into details that I don't care about, especially about web development (a topic that I know nothing about and I get lost very quickly).
On the video front, there are a couple of content providers worth subscribing to. First is Dan Bader's videos; he now runs RealPython.com, too, and his videos are well produced, reasonable-length tutorials that are usually on some feature of standard python. I think his publishing frequency has decreased since moving to RealPython.com, but some of his videos are worth revisiting. I also highly recommend the videos by Corey Schafer. He provides more step-by-step tutorials, and does a great job of stepping through and building up lessons. I also keep an eye on Coding Tech, which often has videos on programming that are interesting. I also watch a lot of the recorded presentations from python conferences: PyCon, PyData, SciPy especially, but others as well (PyVideo.org indexes many of them). I also think the Socratica python videos are good, if a bit hokey. Finally, Enthought publishes a lot of useful videos, and seems to be responsible for the SciPy conference talks.
Finally, tools for leveling up your python skills. There are tons of resources out there, and I have not tried all of them. I have experimented, and I've returned to a few of them. First is ProjectEuler.net. It isn't a python site; it really isn't even a coding site, so much as an algorithm and math site. It's just a list of problems; you solve the problem and input the answer, and that unlocks a discussion board about that problem where you can see other solutions and sometimes detailed "official" solutions with explanation. I still haven't done very many of the problems, but I find them to be more fun than other similar sites. A more sophisticated version of the same concept is CodeSignal, which used to be called CodeFights. The problems are much more computer science focused, and geared toward developers and want-to-be developers getting ready for interviews. The fun thing is that the site provides a full programming environment, and your code is run in place to check the solutions. It is well designed and many of the problems that I've done have been fun. You can choose from many languages, so it is not a python-specific resource. CodeSignal seems to be nearly the same idea as CodeWars, which I have not tried. A new one to keep an eye on is Coder.com; this is VSCode in the browser with access to computational resources. I'm not sure where this is going, but it seems well-made and interesting.
Okay, finally, finally is the omnipresent internet resource for coding things: stackoverflow. I don't really like SO in many ways. The main ones are (a) I don't think people do a good job of asking their questions, and (b) people get pedantic (or just jerky) about answering questions. That said, a lot of the answers are very useful, and the best ones provide great guidance and insight about how things work.


In-place upgrade from python 3.6 to 3.7

Based on reports on Python Bytes and from [here] and [there], it seems like 3.7 is generally faster than 3.6. So, I decided to try it. On one machine, I set up a fresh conda environment with 3.7 and installed all the packages I typically use. The first time I did that, which was months ago, not everything was working, and I put this upgrade plan on hold. Later, I re-tested, and all my packages seemed to be playing nicely with 3.7. I worked in that environment for a while with no problems.

During down time today, I thought it might be good to move another machine to 3.7. This time I decided to take the leap and move my base environment to python 3.7 from 3.6.6. Why not?

There is one step:
$ conda install python=3.7

This takes some time while conda "solves" the environment. I'm not sure exactly what the solver does, but since it checks the dependencies of every installed package, and essentially every one of them will need to be removed and reinstalled, it is no wonder that it takes a while.

One potential gotcha with this approach is that anything that was pip installed will need to be reinstalled. I think there are a couple of these, but I don't know how to tell which are which. Oh well, I guess I'll find out when something breaks. 

Eventually the environment does get solved, and a plan is constructed. Answer 'y' and conda dutifully downloads and extracts many packages.

Conda does all the work:
Preparing transaction step happens with the spinning slash, and finishes.
Verification step happens with the spinning slash, and finishes.
Removing some deprecated stuff (jupyter js widgets) ...  And then enabling notebook extensions and validating. Give the OK.
Prints done, returns the prompt.

Did it work?

$ which python

$ python --version
Python 3.7.1

$ python
Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
[Clang 9.0.0 (clang-900.0.37)] :: Anaconda custom (64-bit) on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print(2.**8)
256.0
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib as mpl
>>> import xarray as xr
>>> xr.DataArray(np.random.randn(2, 3))
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[-0.355778,  0.836539,  0.210377],
       [ 0.480935,  0.469618, -0.101545]])
Dimensions without coordinates: dim_0, dim_1
>>> data = xr.DataArray(np.random.randn(2, 3), coords={'x': ['a', 'b']}, dims=('x', 'y'))
>>> xr.DataArray(pd.Series(range(3), index=list('abc'), name='foo'))
<xarray.DataArray 'foo' (dim_0: 3)>
array([0, 1, 2])
Coordinates:
  * dim_0    (dim_0) object 'a' 'b' 'c'

Okay, this seems to be working. Repeated similar interactive test with ipython. So far, so good.

Lesson: conda is kind of amazing.