Extracting Highlights from eBooks

This past year I read 40 books. Few of them were hard copies; most were Kindle eBooks or eBooks in the Adobe Digital Edition format. Almost everyone knows about the Amazon Kindle, but the Adobe Digital Edition format is a permission/access format that many academic libraries use.

When I read these eBooks, I often highlight important content so that I can return later and review these notes. To help with this, I wrote two R programs that extract the highlighted text and format it for PDF using Sweave (although RMarkdown could just as easily be used).

Kindle Extract

One thing to be aware of is that the notes and highlighted text for all of your eBooks are stored in a single file. You will therefore need to know the name of the eBook you want to extract.

Plug your Kindle into your computer's USB port (I am using a Kindle 6″ Paperwhite) and copy the clippings file to your local computer. The Kindle notes and highlighted text can be found in:

documents\My Clippings.txt
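
Because every book shares this one file, it can help to list the titles it contains before picking one. The following is a minimal sketch in base R, not part of the original programs. It assumes each entry starts with its title line and ends with a line of equals signs (the separator the extraction function looks for); the name listClippingTitles is hypothetical.

```r
# Hypothetical helper: list the distinct book titles in a clippings file.
# Assumes each entry starts with its title line and ends with a
# "==========" separator line.
listClippingTitles <- function(lines) {
  # Drop a leading byte-order mark, or the "?" an ANSI conversion may leave
  lines <- sub("^(\ufeff|\\?)", "", lines)
  isSeparator <- grepl("^==========", lines)
  # A title is the first line of the file or the line after a separator
  titleIndex <- c(1, which(isSeparator) + 1)
  titleIndex <- titleIndex[titleIndex <= length(lines)]
  unique(trimws(lines[titleIndex]))
}

# Usage (the path is illustrative):
# listClippingTitles(readLines("My Clippings.txt", warn = FALSE))
```

Note that the extraction function replaces ";" with "," and ":" with "-" in each title before comparing it, so the name you pass to it has to be written the same way.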

What I wanted was to output the page number and the Kindle location for each highlight or note, so that I could find the context of the highlight later. For example:

p5[68] - “What important truth do very few people agree with you on?”
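
The page and location come from the metadata line that precedes each highlight in the clippings file. As a rough sketch of that step in base R, assume a metadata line of the form "- Your Highlight on page 5 | Location 68-69 | Added on ...". That wording is inferred from the patterns the extraction function searches for and may differ between Kindle models; formatHighlight is a hypothetical name.

```r
# Hypothetical helper: format one highlight as "p<page>[<location>] - <text>".
# The metadata wording is an assumption; adjust the patterns to your file.
formatHighlight <- function(metaLine, text) {
  # Capture the digits that follow "page" and the first digits after "Location"
  page <- sub(".*Your Highlight on page (\\d+).*", "\\1", metaLine)
  location <- sub(".*Location (\\d+).*", "\\1", metaLine)
  paste0("p", page, "[", location, "] - ", text)
}

meta <- "- Your Highlight on page 5 | Location 68-69 | Added on Sunday, June 30, 2019"
formatHighlight(meta, "What important truth do very few people agree with you on?")
# "p5[68] - What important truth do very few people agree with you on?"
```

Only the start of a location range is kept, which is enough to jump back to the passage on the device.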

I then wanted all of these notes and highlights to be output as a PDF, so I created a LaTeX Sweave document.
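
As a minimal sketch of what such a Sweave document might contain, the helper below writes a small .Rnw template that prints the three global variables filled in by the extraction function (sweave.title, sweave.highlights and sweave.notes). The layout and the name writeSweaveTemplate are assumptions for illustration, not the original document.

```r
# Hypothetical helper: write a minimal Sweave (.Rnw) template that prints the
# global variables filled in by kindleANSIToSweave(). The layout is
# illustrative only.
writeSweaveTemplate <- function(path) {
  template <- c(
    "\\documentclass{article}",
    "\\begin{document}",
    "\\section*{\\Sexpr{sweave.title}}",
    "<<echo=FALSE, results=tex>>=",
    "cat(sweave.highlights)",
    "cat(sweave.notes)",
    "@",
    "\\end{document}"
  )
  writeLines(template, path)
  invisible(path)
}

# Usage (file names are illustrative):
# writeSweaveTemplate("highlights.Rnw")
# Sweave("highlights.Rnw")            # produces highlights.tex
# tools::texi2pdf("highlights.tex")   # requires a LaTeX installation
```

Because the chunk uses results=tex, the highlight text is passed to LaTeX as is, so characters that are special in LaTeX have to be dealt with before the document is built.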

Enhancements for the future:

  • TBD

The following is the R code to simply read in the annotation text file and output the notes and highlighted text:

# filename <- "c:\\users\\computer\\Desktop\\Kindle-2019-06-30-ANSI.txt"
# ebook <- "Zero to One- Notes on Startups, or How to Build the Future (Peter Thiel,Blake Masters)"

# The function kindleANSIToSweave depends on the following Global Variables:
sweave.title <- ""
sweave.highlights <- ""
sweave.notes <- ""

kindleANSIToSweave <- function(filepath, ebook) {
  library(stringr)  
  
  title <- ""
  lastOutputString <- ""
  highlights <- ""
  notes <- "\n\n"
  sweave.title <<- ""
  sweave.highlights <<- ""
  sweave.notes <<- ""
  
  con = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
   
    if ( length(line) == 0 ) {
      break
    }
    
    if (title=="") {
      title <- str_replace_all(line,"\\?","")  # strip "?" here too, as is done for every later title line
      titleOutput <- title
      titleOutput <- str_replace_all(titleOutput,";",",")     
      titleOutput <- str_replace_all(titleOutput,":","-")
      print(titleOutput)
    }

    if (str_detect(line, regex("=========="))) {
      line = readLines(con, n = 1)
      if ( length(line) == 0 ) {
        break
      }
      
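      # The ANSI conversion presumably leaves a "?" where the Kindle wrote a
      # byte-order mark before the title, so remove question marks from the
      # title line. This also strips any "?" that belongs to the title itself.
      line <- str_replace_all(line,"\\?","")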
            
      if (title != line) {
        title <- line
        titleOutput <- title
        titleOutput <- str_replace_all(titleOutput,";",",")     
        titleOutput <- str_replace_all(titleOutput,":","-")
        print("===============================================")
        print(titleOutput)
      }
    }

    if (titleOutput == ebook) {
      sweave.title <<- titleOutput
      # Highlights
      if (str_detect(line, regex("(?:Your Highlight on page).(\\d+)"))) {
        highlightPageString <- str_extract(line, regex("(?:Your Highlight on page).(\\d+)"))
        highlightPage <- str_extract(highlightPageString, regex("(\\d+)"))
        
        locationString <- str_extract(line, regex("(?:Location).(\\d+)"))
        locationNumber <- str_extract(locationString, regex("(\\d+)"))
        
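        # The highlight text follows the metadata line after one blank line
        line <- readLines(con, n = 1)
        line <- readLines(con, n = 1)
        # Strip a stray byte-order mark. The character in the original pattern
        # did not survive publication, so "\ufeff" is an assumption; an empty
        # pattern is rejected by str_replace_all().
        line <- str_replace_all(line,"\ufeff","")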
        line <- str_replace_all(line,"’","'")    
        line <- str_replace_all(line,"“","")          
        line <- str_replace_all(line,"â€\u009d"," ")     
        line <- str_replace_all(line,"—"," ")           

        outputString <- paste0("p",highlightPage, "[",locationNumber,"] - ",line)
        
        if (outputString != lastOutputString) {
          lastOutputString <- outputString          
          print(outputString)
          outputString <- str_replace_all(outputString,"[$]","X")
          outputString <- str_replace_all(outputString,"[&]","X")                            
          highlights <- paste0(highlights, "\n\n", outputString)
        }      
      }
    
      # Notes
      if (str_detect(line, regex("(?:Your Note on page).(\\d+)"))) {
        highlightPageString <- str_extract(line, regex("(?:Your Note on page).(\\d+)"))
        highlightPage <- str_extract(highlightPageString, regex("(\\d+)"))
        
        locationString <- str_extract(line, regex("(?:Location).(\\d+)"))
        locationNumber <- str_extract(locationString, regex("(\\d+)"))
        
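        # The note text follows the metadata line after one blank line
        line <- readLines(con, n = 1)
        line <- readLines(con, n = 1)
        # Strip a stray byte-order mark. The character in the original pattern
        # did not survive publication, so "\ufeff" is an assumption; an empty
        # pattern is rejected by str_replace_all().
        line <- str_replace_all(line,"\ufeff","")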
        
        outputString <- paste0("Note: p",highlightPage, "[",locationNumber,"] - ",line)
        
        # Replace both $ and &, as is done for highlights, since both are special in LaTeX
        outputString <- str_replace_all(outputString,"[$&]","X")
        notes <- paste0(notes, "\n\n", outputString)
        
      }
    }
  }

  sweave.highlights <<- highlights
  sweave.notes <<- notes
  
  close(con)
}
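
The function above replaces "$" and "&" with "X" because those characters are special in LaTeX. An alternative, offered here as a sketch and not as the author's method, is to escape them with a backslash so that the original characters survive into the PDF. The helper name escapeLatex is hypothetical, and it covers only five common characters.

```r
# Hypothetical helper: escape characters that are special in LaTeX.
# Covers only $ & % # _ ; braces, backslashes, ~ and ^ are not handled.
escapeLatex <- function(text) {
  # Prefix each matched character with a backslash
  gsub("([$&%#_])", "\\\\\\1", text)
}

cat(escapeLatex("R&D costs $5"), "\n")
```

Such a helper could be applied to outputString in place of the two "X" substitutions.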

Adobe Digital Edition Extract

For reading eBooks with Adobe Digital Edition, I use an Amazon Fire HD 10 (10.1″) tablet. This allows me to highlight text easily with just my finger.

The file(s) on my tablet are located under:

Main Storage / Digital Editions / Annotations / sdcard / Digital Editions

Every eBook you check out creates a .annot file. The filename appears to be the title of the eBook with the .annot extension. The annotation file uses an XML format to store highlighted text and notes.

I should also note that if you take out the same book again, it will create a new .annot file and add a sequential number to the title.

Enhancements for the future:

  • Understand the Adobe position numbers so that the highlights can be sorted by page number.

The following is the R code to simply read in the annotation text file and output the notes and highlighted text:

# extractAdobeDigitalHighlights("Healthcare Analytics.annot")

extractAdobeDigitalHighlights <- function (filename) {
  library(XML)
  library(stringr)

  # Drop only the final extension, so titles that contain "." stay intact
  title <- tools::file_path_sans_ext(filename)

  xml_data <- xmlTreeParse(filename)  
  rootnode <- xmlRoot(xml_data)
  rootsize <- xmlSize(rootnode)

  # Title followed by an underline of dashes of the same length
  separatorTitle <- strrep("-", str_length(title))
  datafile <- paste0(title,"\n\n",separatorTitle,"\n\n")
  
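  # seq_len() avoids iterating when the file has no child nodes
  for (i in seq_len(rootsize)) {
    highlight <- xmlValue(rootnode[[i]][["target"]][["fragment"]][["text"]])
    # Skip child nodes that carry no highlighted text
    if (length(highlight) != 1 || is.na(highlight)) next
    print(highlight)
    datafile <- paste(datafile, highlight, sep="\n\n")
  }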
  
  outputFilename <- paste0(title,".txt")
  fileConn <- file(outputFilename)
  writeLines(datafile, fileConn)
  close(fileConn)

}