box.matto.nl

Awk script to parse Kindle clippings

Kindle Highlights and Notes

The Amazon Kindle e-readers allow you to highlight text and to take notes.

Every time you highlight a piece of text, the text is copied into a file called "My Clippings.txt". Judging by the space in the filename, this was probably built by a coder who lives in the DOS or Windows world.

When you take notes, the note is copied to the "My Clippings.txt" file.

Every highlight and every note has a reference to the book and the location in it, where it was created.

I own a Kindle Paperwhite 2015 and use highlighting a lot.

I don't use the feature to take notes a lot, as the e-ink screen is terribly slow.

Highlighting is not much of a fuss, although getting the highlight boundaries right (first and last word) can sometimes be awkward.

awkiawki as a personal wiki

Awkiawki is a wiki written in awk that runs as a CGI script. It is very fast and even performs well on a Raspberry Pi.

I have been using this wiki for two years now as a personal wiki and I love it. See also my other awkiawki pages.

Awkiawki is a very simple wiki, that uses CamelCase to create links from one page to the other.

Before I started to use awkiawki, I had been using Vimwiki for many years. Vimwiki uses its own format, which is not pure Markdown.

To ease the migration from Vimwiki to awkiawki I tweaked awkiawki to use the Vimwiki format and I still use this format today.

The combined information from my previous Vimwiki and the new entries made in my awkiawki has become an awesome personal knowledge base and my poor man's Zettelkasten implementation, which becomes more valuable every day.

Convert Kindle Highlights to wiki pages

In order to make the Kindle Highlights useful, these highlights have to be taken out of the Kindle e-reader and made available and searchable. Otherwise, highlighting and taking notes would not be that useful.

For me, the only logical way to go, is to convert the highlights to wiki pages. This way the highlights end up among all my other notes in my personal knowledge base.

The Kindle stores the "My Clippings.txt" file in the documents directory, so the only thing you have to do to get this file on your machine is to connect the Kindle to your laptop with a USB cable and copy the file to your home directory.

Plan

The conversion I want to do is the following:

  • In my awkiawki I have a pointer to a file called "KindleHighlights". Remember that awkiawki uses CamelCase to generate links to other files.
  • The conversion script creates this file and adds to it a link to the page of each book.
  • The conversion script creates a separate file for every book, with all the highlights and notes from that book, ordered by location.

The file "KindleHighlights" functions as an index page to the pages per book. To link the index page to these pages, the filenames of the pages have to be in CamelCase. Unfortunately, awkiawki only accepts alphabetical characters in CamelCase filenames, not numerical characters. So Catch22 is not a valid filename.

The file "My Clippings.txt" starts each highlight and each note with a line containing the title of the book and the name of the author. The name of the author is between round brackets. This is an example:

As a Man Thinketh (James Allen)

This is the starting line for every note and highlight from the book with the title "As a Man Thinketh", written by James Allen. We can use this line to create a CamelCase name for the wiki file that will contain the notes and highlights. For this, we drop all spaces and all non-alphabetical characters from this line. To make sure we end up with CamelCase, every word is first capitalised, which is to say that each word starts with an uppercase character, followed only by lowercase characters. This makes sure that book titles in all caps also end up as CamelCase.

Book titles can be very long, so we truncate the CamelCase filename to a maximum length of 21 characters. In the end, the wiki-page for the book above will get the following as filename:

AsAManThinkethjamesAl
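
The steps above can be sketched as a small shell function around awk. This is only an illustration, and the name camel_name is made up for it. It follows the order used in the script further down: words are capitalised before the non-letters are dropped, so "(James" becomes "(james" and the j stays lowercase in the filename above.

```sh
#!/bin/sh
# camel_name is a hypothetical helper, not part of the original script.
# It prints the wiki page name for one title line of "My Clippings.txt".
camel_name() {
    printf '%s\n' "$1" | awk '{
        name = ""
        # Capitalise each word: first character upper case, rest lower case.
        for (j = 1; j <= NF; j++)
            name = name toupper(substr($j, 1, 1)) tolower(substr($j, 2))
        name = substr(name, 1, 30)     # intermediate cut, as in the script
        gsub(/[^A-Za-z]/, "", name)    # keep letters only
        # Final page name: at most 21 letters, first one upper case.
        print toupper(substr(name, 1, 1)) substr(name, 2, 20)
    }'
}

camel_name "As a Man Thinketh (James Allen)"    # prints AsAManThinkethjamesAl
```

A title in all caps, such as "AS A MAN THINKETH", comes out as AsAManThinketh.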

Every page per book ends with a back-reference to the "KindleHighlights" index page, so that I can easily jump between the individual book pages and the index of all the books with highlights or notes.

awk for fun and profit

Although Perl is what comes to mind when one wants to create such a script, I decided to give awk a try, just because that seemed like fun.

I wrote an awk script that can do the conversion. One problem that remained, was that I wanted the pages per book to be sorted by location. As I didn't know how to do this with awk, I wrote a small work-around. See the script below.

awk script to create markdown pages from Kindle Highlights and Notes

The script uses the following directory structure:

`-- import
    `-- tmp

  • The root of the starting directory holds the file "My Clippings.txt".
  • The sub-directory import will be populated with all the files that have to be uploaded to the awkiawki files directory.
  • The sub-sub-directory tmp is meant for intermediate files (see below).

Here comes the awk-script:

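# parse_kindle.awk: split "My Clippings.txt" into one intermediate file per book.
# Every output line gets a "<number>:" prefix so that sort -n can order it later.
BEGIN {
    line = 0           # line number within the current clipping (0 = title line)
    tmpheading = ""    # capitalised title words, glued together
    saveheading = ""   # heading of the previous clipping
    heading = ""
    pheading = ""      # path of the intermediate file of the current book
    lloc = 0           # 100 * start location of the current clipping
    maxlloc = 0        # highest lloc seen so far
    lnr = 1            # line counter within one location
    savelloc = 0
    path = "import/tmp/"
    print "= Kindle Highlights and Notes \n\n----\n\n" > ("import/KindleHighlights")
}
{
    if (line < 1) {
        # Title line: capitalise every word and glue the words together.
        for (j = 1; j <= NF; j++) {
            tmpheading = tmpheading toupper(substr($j, 1, 1))
            tmpheading = tmpheading tolower(substr($j, 2))
        }
        heading = substr(tmpheading, 1, 30)
        if (heading != saveheading) {
            # A new book starts: close the previous page with a back-reference.
            saveheading = heading
            if (pheading != "") {
                print "\n" maxlloc+99 ":KindleHighlights\n" > (pheading)
                pheading = ""
            }
            print $0 "\n\n" > ("import/KindleHighlights")
            gsub(/[^A-Za-z]/, "", heading)    # keep letters only
            pheading = toupper(substr(heading, 1, 1)) substr(heading, 2, 20)
            print pheading "\n" > ("import/KindleHighlights")
            print "----\n" > ("import/KindleHighlights")
            pheading = path pheading
            print "0:== " $0 "\n1:\n" > (pheading)
        }
    }
    if (line > 0) {
        if ($0 ~ /^- /) {
            # Metadata line: fields 3, 5 and 6 hold the kind of clipping,
            # the word "Location" and the location range.
            locatie = $3 " on " $5 " " $6
            split($6, a, "-")
            lloc = 100 * a[1]
            if (lloc > maxlloc) { maxlloc = lloc }
        } else {
            # Text of the highlight or note.
            if ($0 !~ /^==========/) { print lloc+lnr++ ":" $0 > (pheading) }
        }
    }
    line++
    if (lloc != savelloc) { lnr = 1; savelloc = lloc }
    if ($0 ~ /^==========/) {
        # End of clipping: append the location and a horizontal rule, then reset.
        print lloc+lnr++ ":\n" lloc+lnr++ ":" locatie "\n" lloc+lnr++ ":\n" lloc+lnr++ ":----\n" lloc+lnr++ ":\n" > (pheading)
        line = 0; heading = ""; tmpheading = ""
    }
}
END {
    if (pheading != "") { print "\n" maxlloc+99 ":KindleHighlights\n" > (pheading) }
    print "\nBoekenPagina\n" > ("import/KindleHighlights")
}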

This last part prints the CamelCase word "BoekenPagina" at the end of the KindleHighlights page. This is a back-reference to my awkiawki page called "BoekenPagina", which functions as an index page to all the wiki pages related to books. So the KindleHighlights page is a child of this page, just as the per-book pages created by the script are child pages of the KindleHighlights page.

Lines with four dashes ("----") will be converted by awkiawki to horizontal rules in the final html.

In order to be able to sort the per-book pages on location, the awk script creates intermediate files. These are placed in the import/tmp subdirectory. Every line starts with a number derived from the location of the highlight or note (100 times the start location, plus a line counter), followed by a colon (:). This makes it easy to order these lines in the second step. That step is done by the shell script (see below), which sorts the lines and then strips the number and colon from each line.
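
As a rough sketch of how such a prefix comes about, the function below computes the number given to the first line written after a metadata line. The name sort_key and the sample line are made up for illustration; the only assumption taken from the script is that the location range is the sixth field of a line starting with "- ".

```sh
#!/bin/sh
# sort_key is a hypothetical helper, not part of the original script.
# It prints the prefix of the first line written after a metadata line.
sort_key() {
    printf '%s\n' "$1" | awk '/^- / {
        split($6, a, "-")       # "164-165" -> a[1] = "164"
        print 100 * a[1] + 1    # the line counter starts at 1 for a new location
    }'
}

# Made-up sample line that follows the field positions the script relies on.
sort_key "- Your Highlight on Location 164-165 | Added on Monday"    # prints 16401
```

Multiplying by 100 leaves room for up to 99 lines per location before the numbers of two neighbouring locations would collide.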

shell script to complete the process

#!/bin/csh

mkdir -p import/tmp
awk -f parse_kindle.awk My\ Clippings.txt

foreach file (`ls import/tmp`)
    echo $file
    /usr/bin/sort -n import/tmp/$file | sed "s/^[0-9]*://" > import/$file
end

This script is run by the user (me that is :) to start the process.

This will do the following:

  • Create the directory structure.
  • Run awk with the awk script on the file "My Clippings.txt".
  • Sort the intermediate files in the order of the locations, strip the location numbers at the beginning of the lines, and put the result in the import directory.
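
The sort-and-strip step can be tried on its own with a few hand-made lines. This sketch uses POSIX sh rather than csh, and the function name order_page and the sample lines are invented for the example; the sort and sed commands are the same as in the script.

```sh
#!/bin/sh
# order_page is a hypothetical helper: it reads prefixed lines on stdin,
# sorts them numerically and strips the "<number>:" prefix.
order_page() {
    sort -n | sed 's/^[0-9]*://'
}

# Three made-up intermediate lines, deliberately out of order.
printf '%s\n' '16402:second highlight' '0:== Title' '501:first highlight' | order_page
```

The title line comes out first, followed by the two highlights in order of their location.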

After the script is finished, the files in import/ can be scp-ed to the wiki server. When this is done, the whole import subdirectory can be removed with rm -rf, because it can be re-created at any time by just running the shell script.

The shell script uses csh as shell. This makes it possible to run this script on an OpenBSD laptop (bash is not installed by default in OpenBSD and I can live with that).

End result

The end result is an index file called KindleHighlights with the following format:

Header:

= Kindle Highlights and Notes 

----

When this page is requested from awkiawki, this is converted to an h1 header followed by a horizontal rule.

For each book a block with the following format:

Letters from a Stoic: All Three Volumes (Seneca)


LettersFromAStoicAllT

----

This is a line with the title of the book and its author, followed by a CamelCase word that will be converted by awkiawki to a link to the book-page in the wiki, followed by a horizontal rule.

For every book there will be a book-page, with the following format:

Header:

== Letters from a Stoic: All Three Volumes (Seneca)

This is an h2 header with the title of the book and its author.

For each highlight in this book a block with the following format:

It is not the man who has too little, but the man who craves more, that is poor.

Highlight on Location 164-165

----

This is the contents of the highlight, followed by the indication "Highlight" (to distinguish highlights from notes) and the location in the book.

After this follows a horizontal rule (the line with the four dashes).

Valuable

At the moment I have the highlights (and some notes) of over 30 books in "My Clippings.txt". The total processing time to convert these to over 30 wiki pages and the wiki index page is less than a second.

The conversion of the Kindle highlights and notes to wiki pages in my personal wiki has proved to be very valuable. The result is that while reading I am very aware of this and use highlighting a lot.

Often, just one glance at the wiki page of a book brings back the ideas and knowledge covered in that particular book. This is very helpful to absorb the contents of the book. And that is just awesome.